updatedb in nutch-2.0 increases fetch time of all pages

alxsss
Hello,

updatedb in nutch-2.0 increases the fetch time of all pages, regardless of whether they have already been fetched.
For example, if updatedb is run at depth 1 and page A is fetched with its fetchTime set to 30 days from now, then running updatedb at depth 2 moves page A's fetchTime to 60 days from now, and so on.
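
To make the arithmetic concrete, here is a minimal Python sketch of the behaviour described above. It is not Nutch code: the function names, the 30-day interval constant, and the in-memory dict standing in for the datastore are all assumptions for illustration only. It contrasts the reported behaviour (every page is pushed forward on each updatedb) with what one would expect (only pages fetched in the current round are rescheduled).

```python
from datetime import datetime, timedelta

# Assumed re-fetch interval, taken from the 30-day example in the report.
INTERVAL = timedelta(days=30)


def updatedb_reported(pages):
    """Behaviour as reported: every page's fetchTime grows by one interval."""
    return {url: fetch_time + INTERVAL for url, fetch_time in pages.items()}


def updatedb_expected(pages, now, fetched_this_round):
    """Expected behaviour: only pages fetched in this round are rescheduled."""
    return {
        url: (now + INTERVAL if url in fetched_this_round else fetch_time)
        for url, fetch_time in pages.items()
    }


now = datetime(2012, 9, 17)

# State after depth 1: A was fetched and is due in 30 days; B was only discovered.
pages = {"A": now + INTERVAL, "B": now}

# Depth 2: only B is fetched in this round.
reported = updatedb_reported(pages)
expected = updatedb_expected(pages, now, fetched_this_round={"B"})

print("reported A:", (reported["A"] - now).days, "days from now")
print("expected A:", (expected["A"] - now).days, "days from now")
```

In this sketch, page A ends up 60 days out under the reported behaviour and stays at 30 days under the expected one, which matches the example given in the report.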

Also, is it possible to use updatedb to remove pages that do not pass the filters from the HBase datastore?

Thanks.
Alex.

Re: updatedb in nutch-2.0 increases fetch time of all pages

Ferdy Galema
Hi,

The increasing fetch time is indeed a bug. There is already an issue for it:
https://issues.apache.org/jira/browse/NUTCH-1457

About removing URLs, I'm not sure what the best solution is. It is
difficult to handle changes to normalizing/filtering rules over time. For
now it is best not to change the rules in an existing crawl; otherwise you
have to run a custom delete tool or something like that.
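
As a rough idea of what such a custom delete tool would have to do, here is a minimal Python sketch. It is purely illustrative and does not use any Nutch or HBase API: the dict stands in for the datastore, and `RULES`, `passes` and `purge` are hypothetical names. The rule list loosely mimics the `+`/`-` first-match-wins style of a regex URL filter, with unmatched URLs rejected.

```python
import re

# Hypothetical filter rules: the first matching pattern decides; "+" keeps, "-" drops.
RULES = [
    ("-", re.compile(r"\.(gif|jpg|png)$")),
    ("+", re.compile(r"^https?://example\.org/")),
]


def passes(url, rules=RULES):
    """Return True if the URL is accepted by the first rule that matches it."""
    for sign, pattern in rules:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched: treat the URL as rejected


def purge(store, rules=RULES):
    """Delete every entry whose URL no longer passes the rules; return what was removed."""
    doomed = [url for url in store if not passes(url, rules)]
    for url in doomed:
        del store[url]
    return doomed


# In-memory stand-in for the datastore, keyed by URL.
store = {
    "http://example.org/a": "page data",
    "http://example.org/logo.png": "page data",
    "http://other.net/": "page data",
}

removed = purge(store)
print("removed:", sorted(removed))
print("kept:", list(store))
```

A real tool would scan the actual table and issue deletes against it, and would also need to apply the same normalizers as the crawl before filtering.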

Ferdy.

On Mon, Sep 17, 2012 at 8:57 PM, <[hidden email]> wrote:

> Hello,
>
> updatedb in nutch-2.0 increases fetch time of all pages independent of if
> they have already been fetched or not.
> For example if updatedb is applied in depth 1 and page A is fetched and
> its fetchTime is 30 days from now, then as a result of running updatedb in
> depth 2 fetch time of page A will be 60 days from now and so on.
>
> Also, I wondered if it is possible to remove pages that do not pass
> filters from hbase datastore by using updatedb?
>
> Thanks.
> Alex.
>