MergeCrawl

Boris Lau-2
Hi all,

I'm wondering whether anybody else has been having problems with the script at:

http://wiki.apache.org/nutch/MergeCrawl

with nutch-0.9?

I am doing a simple crawl like this:

bin/nutch crawl url1 -dir crawl1 -depth 2
bin/nutch crawl url2 -dir crawl2 -depth 2
# cwd is /nutch/search; mergecrawl requires absolute paths
bin/mergecrawl /nutch/search/merged /nutch/search/crawl1 /nutch/search/crawl2

The individual crawl results were fine, but the merged result was not.

I suspect the problem is in the final stage, where the indexes are
merged, since if I manually reindex with:

bin/nutch index merged/indexes merged/crawldb merged/linkdb merged/segments/<the_merged_segmentid>

then it works perfectly fine (i.e. the result is searchable via the Nutch searcher).
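
For anyone who wants to repeat that manual step, here is a minimal sketch of it as a shell helper. It assumes the merged directory has the layout shown above (crawldb, linkdb, segments) and that the segment to index is the last entry under segments. The function names latest_segment and reindex_command are made up for illustration and are not part of Nutch or of the mergecrawl script.

```shell
#!/bin/sh
# Sketch only: rebuild the manual reindex command for a merged crawl directory.
# Assumption: the segment to index is the last entry under <merged>/segments.

# latest_segment (hypothetical helper): print the name of the
# lexicographically last entry under <merged>/segments.
latest_segment() {
    ls "$1/segments" | sort | tail -n 1
}

# reindex_command (hypothetical helper): print the same "bin/nutch index"
# command used above, with the segment id filled in. It only prints the
# command so it can be reviewed before being run.
reindex_command() {
    merged=$1
    segment=$(latest_segment "$merged")
    echo "bin/nutch index $merged/indexes $merged/crawldb $merged/linkdb $merged/segments/$segment"
}
```

After sourcing the file from the Nutch directory, `reindex_command merged` prints the command for inspection, and it can then be pasted into the shell and run. Printing rather than executing is deliberate, so that nothing gets overwritten by accident.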

How would one go about debugging this?  Is there any way to read the
index, similar to how readdb reads the crawldb?

Many thanks in advance
boris