Recrawl, new URLs, and Nutch on multiple machines
I wanted to try out Nutch and understand how to set up whole-web
crawling. The tutorial for Whole-web Crawling was very easy to
follow, but I have some questions:
1. I have read that by default Nutch will recrawl URLs every 30 days.
I say "Nutch", but I really don't know which component triggers the
recrawl. The fetcher exits as soon as all of its fetcher threads are
done. The tutorial advises performing separate steps to do the
"Whole-web Crawling": inject, generate, fetch, index.
Which command (or component) creates a thread that stays alive and
triggers the recrawl?
2. How do newly discovered URLs get crawled?
3. How can I run the Nutch crawler on multiple machines?