Lucene or Nutch

Klaus Schaefers
Hello,

My name is Klaus and I'm a new member of this mailing list. I'm currently
working on my master's thesis. One of my tasks is to add full-text search
to an existing information system. Browsing the web, I found Lucene and
Nutch. Unfortunately I'm not sure which of these tools fits my project
best. Let me briefly outline the requirements:

1) Integration into an existing information system

2) Full-text search on all objects of the information system

3) Full-text search on PDF and Word documents attached to the objects of
   the information system

From my point of view, I would prefer Lucene, because I don't need the UI
etc. On the other hand, I would like to use the Word and PDF parsers and the
LanguageIdentifier. Do you see any problems using these classes with
Lucene?

Thanks

Klaus


Re: Lucene or Nutch

Erik Hatcher
Yes, Lucene is the best fit for what you're after.  Nutch is built on  
Lucene, and adds web crawling on top.  You don't need a web crawler,  
so using Lucene directly is the best fit - of course you'll have to  
write code to integrate Lucene.

     Erik


On 9 Nov 2005, at 08:48, Klaus wrote:

> Hello,
>
> My name is Klaus and I'm a new member of this mailing list. I'm currently
> working on my master's thesis. One of my tasks is to add full-text search
> to an existing information system. Browsing the web, I found Lucene and
> Nutch. Unfortunately I'm not sure which of these tools fits my project
> best. Let me briefly outline the requirements:
>
> 1) Integration into an existing information system
>
> 2) Full-text search on all objects of the information system
>
> 3) Full-text search on PDF and Word documents attached to the objects of
>    the information system
>
> From my point of view, I would prefer Lucene, because I don't need the UI
> etc. On the other hand, I would like to use the Word and PDF parsers and
> the LanguageIdentifier. Do you see any problems using these classes with
> Lucene?
>
> Thanks
>
> Klaus


Re: Lucene or Nutch

Jérôme Charron
> Yes, Lucene is the best fit for what you're after. Nutch is built on
> Lucene, and adds web crawling on top. You don't need a web crawler,
> so using Lucene directly is the best fit - of course you'll have to
> write code to integrate Lucene.

Erik,

I have been thinking about this for a while but never took the time to
bring it up. This mail is a good opportunity...
In fact, I think it could be a good idea to move the core code of the Nutch
language identifier to a standalone library or into the Lucene code base.
Does that make sense? What do you think? Which is the better solution
(standalone vs. Lucene)?
Doug?

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

RE: Lucene or Nutch

Rajan, Renuka
In reply to this post by Klaus Schaefers
Hello All

My question is related to the email below. I am exploring the option of full-text indexing a fairly large database that is 40 GB in size (data alone, not counting indices etc.). The data resides in Oracle, which has its own full-text indexing engine. Does anyone have a recommendation between the Oracle FT indexing engine and Lucene? It seems that reading the data out of Oracle and building Lucene indexes is an expensive operation, and that Oracle's own FTI may be more efficient, especially because the data store and the FT indices would reside in the same repository (so to speak).

Of course I am just speculating at this point. Does anyone have any metrics or recommendations? I am a huge fan of Lucene and would love to use it, but I have to justify it to management, especially because they have spent a load of money on Oracle. So I am keeping my fingers crossed, hoping that someone has used Lucene to index data from Oracle and has found it far superior!

As always thanks in advance
Renuka

-----Original Message-----
From: Jérôme Charron [mailto:[hidden email]]
Sent: Wednesday, November 09, 2005 9:24 AM
To: [hidden email]
Subject: Re: Lucene or Nutch

> Yes, Lucene is the best fit for what you're after. Nutch is built on
> Lucene, and adds web crawling on top. You don't need a web crawler,
> so using Lucene directly is the best fit - of course you'll have to
> write code to integrate Lucene.

Erik,

I have been thinking about this for a while but never took the time to
bring it up. This mail is a good opportunity...
In fact, I think it could be a good idea to move the core code of the Nutch
language identifier to a standalone library or into the Lucene code base.
Does that make sense? What do you think? Which is the better solution
(standalone vs. Lucene)?
Doug?

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/



Re: Lucene or Nutch

Doug Cutting-2
In reply to this post by Jérôme Charron
Jérôme Charron wrote:
> In fact, I think it could be a good idea to move the nutch language
> identifier core code to a standalone library or to lucene code.
> Does it make sense? What do you think about it? What is the best solution
> (standalone vs lucene)?

One could put it in the lucene contrib directory.

Doug

Re: Lucene or Nutch

Andrzej Białecki-2
Doug Cutting wrote:

> Jérôme Charron wrote:
>
>> In fact, I think it could be a good idea to move the nutch language
>> identifier core code
>> to a standalone library or to lucene code.
>> Does it make sense? What do you think about it? What is the best
>> solution
>> (standalone vs lucene)?
>
>
> One could put it in the lucene contrib directory.


I would be disappointed by this move. The language identifier is an
important component in Nutch, and the mere fact that it is bundled with
Nutch encourages its proper maintenance. If there is enough drive in
terms of willingness and long-term commitment, it would make sense to
move it to a separate project of its own (or maybe make it part of Jakarta
Commons), but moving it into a catch-all, purely optional category like
Lucene contrib would increase the risk that it slides into oblivion...

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Lucene or Nutch

Jérôme Charron
> I would be disappointed by this move - language identifier is an
> important component in Nutch. Now the mere fact that it's bundled with
> Nutch encourages its proper maintenance. If there is enough drive in
> terms of willingness and long-term commitment it would make sense to
> move it to a separate project on its own (or maybe as a part of Jakarta
> Commons), but moving it into a catch-all purely optional category like
> Lucene contrib would increase risks that it slides into oblivion...

OK, Andrzej, I understand your point.
But more and more people are contacting me directly because they want to
use the language-identifier, not as a Nutch plugin but simply as a
standalone library. They get confused when I explain to them that they need
the Nutch jar in order to use the language-identifier. That's why I would
like to make it a standalone jar. A short-term solution could be to move the
core classes (which have no dependencies on Nutch) to a new lib-plugin
(lib-lang, for instance) and add a dependency on this plugin to the
language-identifier, so that this code could be used as a standalone lib.

Are you OK with such changes?

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Re: Lucene or Nutch

Andrzej Białecki-2
Jérôme Charron wrote:

> > I would be disappointed by this move - language identifier is an
> > important component in Nutch. Now the mere fact that it's bundled
> > with Nutch encourages its proper maintenance. If there is enough
> > drive in terms of willingness and long-term commitment it would
> > make sense to move it to a separate project on its own (or maybe as
> > a part of Jakarta Commons), but moving it into a catch-all purely
> > optional category like Lucene contrib would increase risks that it
> > slides into oblivion...
>
>
>  OK, Andrzej, I understand your point. But more and more people are
>  contacting me directly because they want to use the
>  language-identifier, not as a Nutch plugin but simply as a
>  standalone library. They get confused when I explain to them that
>  they need the Nutch jar in order to use the language-identifier.
>  That's why I would like to make it a standalone jar. A short-term
>  solution could be to move the core classes (which have no
>  dependencies on Nutch) to a new lib-plugin (lib-lang, for instance)
>  and add a dependency on this plugin to the language-identifier, so
>  that this code could be used as a standalone lib.
>
>  Are you OK with such changes?
>

Yes, certainly, it's a good intermediate step before moving it to a
separate project.

There are some other things that Doug mentioned that he would like to
separate from Nutch, like the IO and mapred frameworks. A similar
approach could be taken with these parts - this would encourage good
separation in design, and also prepare these parts to be separated into
their own projects.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Lucene or Nutch

Sami Siren
In reply to this post by Jérôme Charron
Jérôme Charron wrote:
> jar. A short-term solution could be to move the core classes (which have
> no dependencies on Nutch) to a new lib-plugin (lib-lang, for instance)
> and add a dependency on this plugin to the language-identifier, so that
> this code could be used as a standalone lib.
>
> Are you OK with such changes?

Perhaps you could isolate the n-gram-specific code into its own plugin and
the lang-id into another.

The other option (which I implemented some time ago) would be something
like this, since an n-gram categorizer can also be used for other
interesting things:

A new package in core Nutch containing classes like:

NGramProfile - pretty much as is
Categorizer - generic configurable n-gram categorizer (configure
profiles, n-gram sizes etc.)
CategorizerFactory - to get hold of different categorizers

In the LangId plugin you just get a correctly configured categorizer from
the factory (configured to use the language n-gram profiles and predefined
settings for n-gram sizes etc.) and tell it to do its job when needed.
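
A minimal Java sketch of how the three classes listed above might fit together. Only the class names NGramProfile, Categorizer and CategorizerFactory come from this message; every method name, the constructor parameters, the training strings and the rank-difference distance are assumptions made for illustration, and none of it is the actual Nutch code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Tiny demo: the caller only talks to the factory and the categorizer. */
final class NGramSketch {
    public static void main(String[] args) {
        // Toy training text; a real profile would be built from far more data.
        Map<String, String> training = new LinkedHashMap<>();
        training.put("en", "the quick brown fox jumps over the lazy dog and the cat sat on the mat");
        training.put("de", "der schnelle braune fuchs springt auf den faulen hund und die katze sitzt auf der matte");
        Categorizer langId = CategorizerFactory.fromTrainingText(training, 1, 3, 400);
        System.out.println(langId.categorize("the dog and the cat"));
    }
}

/** Hypothetical profile: maps each frequent character n-gram to its frequency rank. */
final class NGramProfile {
    private final String name;
    private final Map<String, Integer> ranks = new HashMap<>();

    NGramProfile(String name, String text, int minSize, int maxSize, int maxEntries) {
        this.name = name;
        // Lower-case the text and turn every run of non-letters into a boundary marker.
        String s = "_" + text.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}]+", "_") + "_";
        Map<String, Integer> counts = new HashMap<>();
        for (int n = minSize; n <= maxSize; n++) {
            for (int i = 0; i + n <= s.length(); i++) {
                counts.merge(s.substring(i, i + n), 1, Integer::sum);
            }
        }
        // Most frequent first; ties broken alphabetically so the ranking is deterministic.
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(counts.entrySet());
        sorted.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        for (int rank = 0; rank < sorted.size() && rank < maxEntries; rank++) {
            ranks.put(sorted.get(rank).getKey(), rank);
        }
    }

    String name() {
        return name;
    }

    /** Sum of rank differences; n-grams missing from the reference cost a fixed penalty. */
    int distanceTo(NGramProfile reference) {
        int penalty = Math.max(ranks.size(), reference.ranks.size());
        int total = 0;
        for (Map.Entry<String, Integer> e : ranks.entrySet()) {
            Integer other = reference.ranks.get(e.getKey());
            total += (other == null) ? penalty : Math.abs(other - e.getValue());
        }
        return total;
    }
}

/** Generic categorizer: picks the reference profile closest to the input text. */
final class Categorizer {
    private final List<NGramProfile> profiles;
    private final int minSize;
    private final int maxSize;
    private final int maxEntries;

    Categorizer(List<NGramProfile> profiles, int minSize, int maxSize, int maxEntries) {
        this.profiles = profiles;
        this.minSize = minSize;
        this.maxSize = maxSize;
        this.maxEntries = maxEntries;
    }

    /** Returns the name of the closest profile, or null if no profiles are configured. */
    String categorize(String text) {
        NGramProfile sample = new NGramProfile("sample", text, minSize, maxSize, maxEntries);
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (NGramProfile profile : profiles) {
            int distance = sample.distanceTo(profile);
            if (distance < bestDistance) {
                bestDistance = distance;
                best = profile.name();
            }
        }
        return best;
    }
}

/** Factory that hides profile construction and n-gram settings from the caller. */
final class CategorizerFactory {
    private CategorizerFactory() {
    }

    static Categorizer fromTrainingText(Map<String, String> textByCategory,
                                        int minSize, int maxSize, int maxEntries) {
        List<NGramProfile> profiles = new ArrayList<>();
        for (Map.Entry<String, String> e : textByCategory.entrySet()) {
            profiles.add(new NGramProfile(e.getKey(), e.getValue(), minSize, maxSize, maxEntries));
        }
        return new Categorizer(profiles, minSize, maxSize, maxEntries);
    }
}
```

With this split, a language-identification plugin would only need the factory call and `categorize`, while the same Categorizer could be pointed at non-language profiles (topics, for instance) by passing different training text.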

--
  Sami Siren

Re: Lucene or Nutch

Doug Cutting-2
In reply to this post by Andrzej Białecki-2
Andrzej Bialecki wrote:
> I would be disappointed by this move - language identifier is an
> important component in Nutch. Now the mere fact that it's bundled with
> Nutch encourages its proper maintenance. If there is enough drive in
> terms of willingness and long-term commitment it would make sense to
> move it to a separate project on its own (or maybe as a part of Jakarta
> Commons), but moving it into a catch-all purely optional category like
> Lucene contrib would increase risks that it slides into oblivion...

In 1.9 and beyond the plan is to build and distribute the contrib with
Lucene.  So 'ant test' in Lucene should test contrib too, etc.  The
point is to make sure that this stuff is maintained, but not to merge it
into the core.  So stuff in contrib should not slide into oblivion.

Doug