continuous crawling?

Daniel Naber-10

How do people use Nutch to crawl continuously? Do you use the "recrawl"
script from the Wiki and start it via a cron job? I'd prefer a process that
runs forever and makes sure the index is always mostly up to date.

In my case, I'm not trying to index the complete web, only interesting
sites. Whether a site is interesting will be decided during crawling by a
plugin I'm planning to write. Does anybody have experience with this kind
of use case? As I understand it, I'll need to modify the Generator class
so that it completely ignores a page (and its links) if my plugin
considers the page irrelevant.
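
To make the filtering idea concrete, here is a minimal sketch of the intended behaviour. It is not Nutch code and does not touch the Generator class: FocusedCrawlSketch, crawl, the in-memory link graph, and the isRelevant predicate are all hypothetical names standing in for the crawl frontier and the planned plugin. For simplicity the relevance check here looks only at the URL, whereas the real plugin would presumably inspect page content.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

/** Toy crawl frontier; nothing here is Nutch API. */
class FocusedCrawlSketch {

    /**
     * Walks a link graph breadth-first from the seeds. A page rejected by
     * the (hypothetical) relevance check is not kept, and its outlinks are
     * never queued, so pages reachable only through it are never visited.
     */
    static List<String> crawl(List<String> seeds,
                              Map<String, List<String>> outlinks,
                              Predicate<String> isRelevant) {
        Deque<String> frontier = new ArrayDeque<>(seeds);
        Set<String> seen = new HashSet<>(seeds);
        List<String> kept = new ArrayList<>();
        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            if (!isRelevant.test(url)) {
                continue; // drop the page and do not follow its links
            }
            kept.add(url);
            for (String link : outlinks.getOrDefault(url, List.of())) {
                if (seen.add(link)) { // queue each URL at most once
                    frontier.add(link);
                }
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Assumed toy data: "c" is reachable only through the irrelevant page.
        Map<String, List<String>> graph = Map.of(
            "a", List.of("b", "spam"),
            "spam", List.of("c"),
            "b", List.of("d"));
        List<String> kept = crawl(List.of("a"), graph, url -> !url.equals("spam"));
        System.out.println(kept);
    }
}
```

In this toy graph, "c" is linked only from the rejected page, so it is never queued. That is the effect I would want from the modified generate step.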