Nutch script feedback


Max Lynch
Now that I'm getting the hang of Nutch, I've started building a simple script
that satisfies my crawling needs.  However, how Nutch handles repeated crawls
and deals with duplicates is still not clear to me.

Basically, I've got a set of seed URLs already injected, so that's not part
of this script, but I would like to constantly hit the same domains to
update my index when new documents are found.  I have been successful in
restricting crawls to those domains, so that's not a problem.  Here is my
script:

export JAVA_HOME=/usr/lib/jvm/java-6-sun
echo "Using $1 as the crawl folder"

set -e

# Bail out if another run is in progress; clean the lock up on exit.
# (Lock path is just a convention I picked.)
lockfile=/tmp/nutch-crawl.lock
if [ -e "$lockfile" ]; then
    echo "Already running!"
    exit 1
fi
trap 'rm -f "$lockfile"' EXIT
touch "$lockfile"

# Go to a depth of 5
for i in 1 2 3 4 5
do
    bin/nutch generate $1/crawldb $1/segments -topN 50000
    # pick up the segment that generate just created
    s1=`ls -d $1/segments/2* | tail -1`
    echo $s1
    bin/nutch fetch $s1 -noParsing -threads 100
    bin/nutch parse $s1
    bin/nutch updatedb $1/crawldb $s1 -filter -normalize
done

bin/nutch invertlinks $1/linkdb -dir $1/segments
# solrindex also wants the Solr URL and the segments; URL here is a placeholder
bin/nutch solrindex http://localhost:8983/solr/ $1/crawldb $1/linkdb $1/segments/*
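The `ls -d ... | tail -1` line is there to pick up the newest segment: Nutch names segment directories with a timestamp, so plain lexicographic sort puts the most recent one last. A standalone sketch of that assumption (the directory names are made up for illustration):

```shell
#!/bin/sh
# Segment dirs are named like 20100612120000 (YYYYMMDDhhmmss),
# so lexicographic order matches creation order.
dir=$(mktemp -d)
mkdir "$dir/20100610120000" "$dir/20100611120000" "$dir/20100612120000"
newest=`ls -d "$dir"/2* | tail -1`
echo "$newest"
```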

What happens with my segments and fetches after this script runs?  If I run
it again, will new segments be created that possibly contain duplicate
documents or links that other segments already had?  Do I need to run
mergesegs?  Again, my sole goal is to constantly hit a set of domains and
find new content when it is available.  As such I'm not really concerned
with search depth, just that I'm getting most of the pages.
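(For reference, if merging and deduplication do turn out to be needed, I assume the commands I'm asking about would look roughly like this — the Solr URL is again a placeholder:)

```shell
# Merge all segments into a single output segment directory.
bin/nutch mergesegs $1/MERGEDsegments -dir $1/segments
# Remove duplicate documents from the Solr index.
bin/nutch solrdedup http://localhost:8983/solr/
```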

I would greatly appreciate any feedback or help.  Thanks.