Incremental Crawling / Revisiting Pages



There is a wonderful discussion on the Heritrix mailing list. I cannot help
forwarding some of it here, and I hope it helps for Nutch.

Dennis Hotson wrote:

> I'm just wondering whether anyone has written a filter or module to do
> incremental crawling.

Have you seen the AdaptiveRevisitingFrontier? It's described in
outline here,
and in detail, here:

> What I mean is something that will do a HEAD request on pages and then
> only fetch the actual content if the page has been updated (newer last-
> modified date or similar). This technique saves a lot of bandwidth and
> can speed up crawling for sites that aren't updated very often.
> I've written a proof of concept filter class that does this (well
> actually, it's not quite working yet).
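
For readers who want something concrete, the HEAD-then-fetch idea described above could be sketched roughly as follows. This is only an illustration in Python, not Dennis's filter and not Heritrix or Nutch code: the names `head_headers` and `needs_refetch` are hypothetical, and it assumes the crawler stores a timezone-aware UTC timestamp for each page it has already fetched.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional
import urllib.request


def head_headers(url: str, timeout: float = 10.0) -> Mapping[str, str]:
    """Send a HEAD request and return the response headers (hypothetical helper)."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return dict(response.headers.items())


def needs_refetch(stored: Optional[datetime], headers: Mapping[str, str]) -> bool:
    """Decide whether the full page should be downloaded again.

    `stored` is the Last-Modified time recorded at the previous fetch
    (assumed timezone-aware), or None if the page was never fetched.
    """
    # Header names are case-insensitive, so look the value up accordingly.
    raw = next((v for k, v in headers.items() if k.lower() == "last-modified"), None)
    if stored is None or raw is None:
        return True  # nothing to compare against, so fetch to be safe
    try:
        remote = parsedate_to_datetime(raw)
    except (TypeError, ValueError):
        return True  # unparseable date, so fetch to be safe
    if remote.tzinfo is None:
        remote = remote.replace(tzinfo=timezone.utc)  # assumption: treat as UTC
    return remote > stored
```

A crawler would call `needs_refetch(stored, head_headers(url))` and issue the GET only when it returns True. Servers that omit Last-Modified always trigger a full fetch in this sketch, so the bandwidth saving depends on how many sites send the header.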

How does your filter work?


> If somebody else has already solved this problem it would save me a lot
> of effort. Thanks! :D
> Cheers,
> Dennis


Keep Discovering ... ...