Incremental Crawling / Revisiting Pages

Jack.Tang
Hi,

There is a wonderful discussion on the Heritrix mailing list, and I
cannot help forwarding some of it here. I hope it is useful for Nutch.

---------------------------------------------------------------------------------------------------------
Dennis Hotson wrote:

> I'm just wondering whether anyone has written a filter or module to do
> incremental crawling.

Have you seen the AdaptiveRevisitingFrontier?  It's described in
outline here, http://crawler.archive.org/articles/user_manual.html#arf,
and in detail here: http://vefsofnun.bok.hi.is/thesis/ar.pdf.

> What I mean is something that will do a HEAD request on pages and then
> only fetch the actual content if the page has been updated (newer last-
> modified date or similar). This technique saves a lot of bandwidth and
> can speed up crawling for sites that aren't updated very often.
>
> I've written a proof of concept filter class that does this (well
> actually, it's not quite working yet).
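
For illustration, the HEAD-then-fetch idea described above might look
roughly like the sketch below. This is not Dennis's filter, and it uses
no Heritrix or Nutch API. The function names needs_refetch and
head_last_modified are hypothetical, and the sketch assumes the server
returns a trustworthy Last-Modified header and that the crawler stores a
timezone-aware timestamp of its previous fetch.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional
import urllib.request


def needs_refetch(last_modified: Optional[str], last_fetched: datetime) -> bool:
    """Decide whether a page should be downloaded again.

    last_modified is the raw Last-Modified header value (or None);
    last_fetched is assumed to be a timezone-aware datetime.
    """
    if not last_modified:
        # No header: we cannot tell whether the page changed, so refetch.
        return True
    try:
        modified_at = parsedate_to_datetime(last_modified)
    except (TypeError, ValueError):
        # Unparseable header: treat as unknown and refetch.
        return True
    if modified_at.tzinfo is None:
        # HTTP dates are in GMT; make a naive result comparable.
        modified_at = modified_at.replace(tzinfo=timezone.utc)
    return modified_at > last_fetched


def head_last_modified(url: str, timeout: float = 10.0) -> Optional[str]:
    """Send a HEAD request and return the Last-Modified header, if any."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get("Last-Modified")
```

A crawler would call head_last_modified(url) first and download the body
only when needs_refetch(...) returns True. Keeping the decision in a
separate function makes it testable without network access.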

How does your filter work?

St.Ack

>
> If somebody else has already solved this problem it would save me a lot
> of effort. Thanks! :D
>
> Cheers,
> Dennis

Regards
/Jack


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars