Recrawl, New URLS and Nutch on multiple machines !
I wanted to try out Nutch and understand how to set up whole-Internet
crawling. It was very easy to follow the tutorial for Whole-web
Crawling, but I have some questions:
1. I have read that by default Nutch will recrawl URLs every 30 days.
I say "Nutch", but I really don't know what triggers the recrawl. The
fetcher stops as soon as all fetcher threads are done. The tutorial
advises performing separate steps for "Whole-web Crawling": generate,
inject, fetch, index.
Which command (component) creates a thread that stays alive and
triggers the recrawl?
2. How are newly discovered URLs crawled?
3. How can I run the Nutch crawler on multiple machines?
Hopefully with the new stuff Doug is working on perhaps "fetch/spider"
boxes can have a rule they apply against the DB for constant
fetching/updates without this much manual intervention.
From: "Daniel D." <[hidden email]>
To: [hidden email] Date: Sun, 5 Jun 2005 17:32:53 -0400
Subject: Recrawl, New URLS and Nutch on multiple machines !
> I wanted to try out Nutch and understand how to set up whole-Internet
> crawling. It was very easy to follow the tutorial for Whole-web
> Crawling, but I have some questions:
> 1. I have read that by default Nutch will recrawl URLs every 30 days.
> I say "Nutch", but I really don't know what triggers the recrawl. The
> fetcher stops as soon as all fetcher threads are done. The tutorial
> advises performing separate steps for "Whole-web Crawling": generate,
> inject, fetch, index.
> Which command (component) creates a thread that stays alive and
> triggers the recrawl?
> 2. How are newly discovered URLs crawled?
> 3. How can I run the Nutch crawler on multiple machines?
> Will appreciate your help!!
Thanks a lot for your clarifications. I have spent some time
understanding the Fetcher code and now need to understand how I can
crawl an initial set of URLs and then keep re-fetching:
· URLs that are due to be fetched (based on
db.default.fetch.interval), i.e. maintenance.
· URLs newly discovered in the last fetch/re-fetch.
Unfortunately I couldn't find documentation that explains all the
options I can use. Searching the forums also didn't help me much, as
I have seen people asking similar questions and not getting clear
answers. In some cases messages have presented controversial
I will start running tests and looking in the code, but I assume it will
be difficult to track URLs being fetched, re-fetched and added after
a couple of rounds of re-fetching.
I will post my questions here in the hope that good "Nutch" people
will help me understand some elements of the software before I
spend late-night hours looking through the code.
1. The tutorial for whole-web crawling suggests running generate,
fetch and updatedb a couple of times. I think this is referred to as
depth. Is there any document that explains the benefit of using
different depths? I hope I'm using the right terminology.
2. Where can I find descriptions of the bin/nutch generate options?
3. What does the term "top pages" mean? Where can I find a description of
the "Scoring" algorithm?
4. If I do not specify the -topN parameter for bin/nutch generate
-refetchonly, will I re-fetch all my URLs that are due?
5. I know that there is another discussion (subject: Intranet crawl
and re-fetch), but it still seems that we don't have a clear answer.
Would bin/nutch generate -refetchonly include new URLs (not fetched
yet) in the fetchlist?
6. If I initially have only one segment and keep running the
re-fetch (assuming that I re-fetch existing URLs and add some new
ones), I will eventually create a bigger and bigger fetchlist. Is there
a known (or configurable) maximum size for the segment fetchlist, after
which bin/nutch generate will create an additional segment?
If not, I assume I can play with the -numFetchers parameter if I
know the number of URLs in the WebDB.
7. Is there a suggested size for the segment fetchlist for better
performance? Is there a known maximum size beyond which performance will
degrade?
8. Where can I find figures for memory (disk) usage of the WebDB and CPU
usage of bin/nutch updatedb? I'm looking for something like: for 1,000,000
documents the WebDB will take approximately XX GB, and running bin/nutch
updatedb on 1,000,000 documents will use up to XX MB of RAM.
It's a lot of questions for busy people to answer but I hope somebody
will drop a word.
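
To make the repeated generate / fetch / updatedb rounds from question 1
concrete, here is a minimal dry-run sketch. It only prints the commands
for a given number of rounds and does not run Nutch. The command forms
and the "ls -d segments/2* | tail -1" idiom are taken from the whole-web
tutorial as I remember it, so treat them as assumptions to check against
your Nutch version; the function name print_rounds and the round count
are made up for illustration. The idea is that each updatedb adds the
links found in the last fetch to the WebDB, so the next generate can
pick them up.

```shell
#!/bin/sh
# Dry-run sketch: print the commands for N crawl rounds.
# print_rounds is a hypothetical helper, not part of Nutch.
# "db" and "segments" are the directory names used in the tutorial.

print_rounds() {
    rounds=$1      # how many generate/fetch/updatedb rounds to print
    db=$2          # WebDB directory
    segments=$3    # parent directory for generated segments
    i=1
    while [ "$i" -le "$rounds" ]; do
        echo "# round $i"
        echo "bin/nutch generate $db $segments"
        # generate writes a new timestamped segment; select the newest one
        echo "s=\`ls -d $segments/2* | tail -1\`"
        echo "bin/nutch fetch \$s"
        # updatedb feeds the fetch results back into the WebDB
        echo "bin/nutch updatedb $db \$s"
        i=$((i + 1))
    done
}

print_rounds 3 db segments
```

The printed lines form a plain shell script, so once they look right for
your installation they can be saved to a file and run from the Nutch
directory.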
> Nutch doesn't do anything by itself, you have to initiate the refetch
> process by running something like:
> bin/nutch generate -refetchonly db segments -numFetchers 30 -topN 30000000
> Something like that would do your refetch of the top 30 million documents
> and give you roughly 30 segments of 1 million +/- URLs in each segment.
> You could then move these segments (or NFS-mount them) onto your spider
> boxes and fetch them concurrently (one segment per box or something):
> machine 1: bin/nutch fetch segments/200505012345-0
> machine 2: bin/nutch fetch segments/200505012345-1
> .... so on and so forth....
> Hopefully with the new stuff Doug is working on perhaps "fetch/spider"
> boxes can have a rule they apply against the DB for constant
> fetching/updates without this much manual intervention.
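
The per-machine split described in the reply can be sketched as a small
dry-run script. It prints one fetch command per segment plus the rough
segment size (topN divided by numFetchers) and does not run Nutch. The
segment prefix and the "-<index>" suffix are copied from the example
above; the function name plan_fetch and the "machine N" labels are
invented for illustration, and the actual segment names on your system
will differ.

```shell
#!/bin/sh
# Dry-run sketch: print which fetch command each spider box would run.
# plan_fetch is a hypothetical helper, not part of Nutch.

plan_fetch() {
    prefix=$1        # e.g. segments/200505012345 (from the example above)
    num_fetchers=$2  # the value given to -numFetchers
    top_n=$3         # the value given to -topN
    # integer division: approximate number of URLs per segment
    echo "# about $((top_n / num_fetchers)) URLs per segment"
    i=0
    while [ "$i" -lt "$num_fetchers" ]; do
        # segment index starts at 0, machine label starts at 1
        echo "machine $((i + 1)): bin/nutch fetch ${prefix}-$i"
        i=$((i + 1))
    done
}

plan_fetch segments/200505012345 3 3000000
```

Calling it with 30 fetchers and a topN of 30000000 would list thirty
commands at about one million URLs each, matching the numbers in the
reply.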