incremental index task

Derek Young-2
Hello - I asked this on nutch-user but didn't get a response.  I am
using nutch-0.8.  I would like to fetch a few segments each night,
then update one large index.  Is it safe to run index on a group of
segments, then run index again on a different group of segments, then
merge?  I haven't found where this procedure is documented.  I would
like to do something like this:

Assume I have four segments. I'll label them s0, s1, s2 and s3 instead
of using their timestamp names.

The first night I would index s0, s1 and rename the index to "A":
  nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/s0 crawl/segments/s1
  mv crawl/indexes/part-00000 crawl/indexes/A

Then on the second night I would index s2, s3 and rename the index to "B":
  nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/s2 crawl/segments/s3
  mv crawl/indexes/part-00000 crawl/indexes/B

Finally I would merge the two:
  nutch merge crawl/index crawl/indexes
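
Put together as a script, the procedure would look roughly like the
sketch below. It only restates the commands above and does not answer
whether the approach is safe. The names NUTCH, CRAWL, DRY_RUN, run,
index_batch and merge_all are made up for illustration; with DRY_RUN=1
(the default here) the commands are only printed, not executed.

```shell
#!/bin/sh
# Hypothetical wrapper around the commands from this post (nutch-0.8 forms).
NUTCH=${NUTCH:-nutch}      # assumed to be on PATH
CRAWL=${CRAWL:-crawl}      # crawl directory used in the examples above
DRY_RUN=${DRY_RUN:-1}      # 1 = print commands only, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# index_batch NAME SEGMENT...
# Index the given segments, then rename part-00000 to NAME.
index_batch() {
    name=$1
    shift
    # Rewrite each positional argument into its full segment path.
    for s in "$@"; do
        set -- "$@" "$CRAWL/segments/$s"
        shift
    done
    run "$NUTCH" index "$CRAWL/indexes" "$CRAWL/crawldb" "$CRAWL/linkdb" "$@" &&
    run mv "$CRAWL/indexes/part-00000" "$CRAWL/indexes/$name"
}

# Merge everything under crawl/indexes into crawl/index.
merge_all() {
    run "$NUTCH" merge "$CRAWL/index" "$CRAWL/indexes"
}
```

The first night would then be "index_batch A s0 s1", the second night
"index_batch B s2 s3", followed by "merge_all".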

Is this safe to do?  Is this how you're supposed to crawl nightly?
Any docs I'm missing on this?  Again, this is all for nutch-0.8, so
some of the docs from 0.7 no longer apply.

Thank you

-- Derek Young