can you incrementally build an index?


can you incrementally build an index?

Jesse Hires
Does "bin/nutch merge" only create a whole new index out of several smaller
indexes, or can it be used to incrementally update a single large index with
newly fetched and indexed smaller segments?



Jesse

int GetRandomNumber()
{
   return 4; // Chosen by fair roll of dice
                // Guaranteed to be random
} // xkcd.com

Re: can you incrementally build an index?

Andrzej Białecki-2
Jesse Hires wrote:
> Does "bin/nutch merge" only create a whole new index out of several smaller
> indexes, or can it be used to incrementally update a single large index with
> newly fetched and indexed smaller segments?

It can do either. The tool merges indexes as-is, without de-duplicating
them, so if you have more recent versions of the same page you will
get multiple documents with the same URL. You then need to run
de-duplication as a separate step.
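
To make the duplicate problem concrete, here is a minimal sketch in Python. It is a toy model, not Nutch or Lucene code: an "index" is just a list of dicts, and the names merge_as_is, dedup_by_url, "url" and "fetched" are made up for illustration. It assumes de-duplication means keeping the newest document per URL; the real tool may apply further criteria.

```python
from typing import Dict, List

# Hypothetical stand-in for an indexed document: a URL plus a fetch time.
Doc = Dict[str, object]


def merge_as_is(*indexes: List[Doc]) -> List[Doc]:
    """Concatenate indexes without removing anything (models a plain merge)."""
    merged: List[Doc] = []
    for index in indexes:
        merged.extend(index)
    return merged


def dedup_by_url(index: List[Doc]) -> List[Doc]:
    """Keep only the most recently fetched document for each URL."""
    newest: Dict[str, Doc] = {}
    for doc in index:
        current = newest.get(doc["url"])
        if current is None or doc["fetched"] > current["fetched"]:
            newest[doc["url"]] = doc
    return list(newest.values())


old_index = [
    {"url": "http://example.com/a", "fetched": 1},
    {"url": "http://example.com/b", "fetched": 1},
]
new_segment_index = [{"url": "http://example.com/a", "fetched": 2}]

merged = merge_as_is(old_index, new_segment_index)  # page "a" appears twice
deduped = dedup_by_url(merged)                      # one document per URL
```

After the plain merge, page "a" is present in both its old and new versions; only the separate de-duplication pass removes the stale one.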

The best workflow I'm aware of is to keep per-segment indexes, throw
away the master index each time you want to refresh, and rebuild it
from all the existing per-segment indexes plus the most recent one, and
then deduplicate. If this sounds wasteful, please keep in mind that when
Lucene merges indexes it needs to rewrite the main index anyway, so in
terms of disk I/O it should be nearly the same.
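
The workflow above can be sketched roughly as follows. This is again a toy Python model rather than actual Nutch code: SegmentIndexStore, add_segment and refresh are hypothetical names, per-segment indexes are plain lists, and de-duplication is assumed to mean "newest document per URL wins".

```python
from typing import Dict, List

Doc = Dict[str, object]  # hypothetical document: {"url": ..., "fetched": ...}


class SegmentIndexStore:
    """Keeps every per-segment index and rebuilds the master from scratch."""

    def __init__(self) -> None:
        self.segment_indexes: List[List[Doc]] = []  # kept permanently
        self.master: List[Doc] = []                 # disposable

    def add_segment(self, segment_index: List[Doc]) -> None:
        """Store the index built for a newly fetched segment."""
        self.segment_indexes.append(list(segment_index))

    def refresh(self) -> List[Doc]:
        """Discard the master, merge all segment indexes, then deduplicate."""
        self.master = []  # throw away the old master index
        merged = [doc for seg in self.segment_indexes for doc in seg]
        newest: Dict[str, Doc] = {}
        for doc in merged:
            current = newest.get(doc["url"])
            if current is None or doc["fetched"] > current["fetched"]:
                newest[doc["url"]] = doc
        # Sort only to make the result order predictable in this sketch.
        self.master = sorted(newest.values(), key=lambda d: d["url"])
        return self.master


store = SegmentIndexStore()
store.add_segment([{"url": "http://example.com/a", "fetched": 1},
                   {"url": "http://example.com/b", "fetched": 1}])
first_master = store.refresh()

store.add_segment([{"url": "http://example.com/a", "fetched": 2},
                   {"url": "http://example.com/c", "fetched": 2}])
second_master = store.refresh()
```

The point of the design is that the master index is never updated in place: the per-segment indexes are the durable data, and each refresh derives a fresh master from them.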

--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com