RE: refetching interval

RE: refetching interval

Ledio Ago
Hi Michael! Did you get an answer on this one?  It seems like the refetch interval
is hardcoded, no matter what you set in the config file, since FETCH_GENERATION_DELAY_MS takes effect after the first fetch.

Anybody out there, is this correct, or are we reading this wrong?  If this is correct,
then the refetching feature doesn't work.

Thanks,
Ledio

-----Original Message-----
From: Michael Ji [mailto:[hidden email]]
Sent: Friday, April 21, 2006 1:26 PM
To: [hidden email]
Subject: refetching interval


Hi,

I am using Nutch 0.7 and found the following code in
FetchListTool.java:

private static final long FETCH_GENERATION_DELAY_MS =
7 * 24 * 60 * 60 * 1000;

That seems to mean the next refetch time is always 7 days
later, no matter what fetch interval is set in
nutch-site.xml.

I am puzzled. Could anyone give me a hint?

thanks,

Michael,



Re: refetching interval

Andrzej Białecki-2
Ledio Ago wrote:
> Hi Michael! Did you get an answer on this one?  It seems like the refetch interval
> is hardcoded, no matter what you set in the config file, since FETCH_GENERATION_DELAY_MS takes effect after the first fetch.
>
> Anybody out there, is this correct, or are we reading this wrong?  If this is correct,
> then the refetching feature doesn't work.
>  

This is not the case (i.e. you are reading this wrong :) ). The
FETCH_GENERATION_DELAY_MS constant specifies how much time needs to pass
before Pages already selected for inclusion in a fetchlist will be
reconsidered for selection, UNLESS they have been updated with
updatedb (after fetching).

This prevents the same pages from being selected if you run FetchListTool
twice in rapid succession. At the same time, if you lose or
discard that fetchlist, you do not have to wait indefinitely. 7 days was
considered a good compromise (some large fetch jobs may run for days,
so it could be a couple of days before you have a chance to run updatedb
with the results of fetching).
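
The behaviour described above can be sketched roughly as follows. This is only an illustration, not the actual FetchListTool or updatedb code: the class RefetchSketch, the Page stand-in, the methods isDue, markSelected and markFetched, and the 30-day interval are all hypothetical names and values chosen for the example. Only the FETCH_GENERATION_DELAY_MS constant comes from the thread.

```java
import java.util.concurrent.TimeUnit;

/** Illustrative sketch only; not the actual Nutch FetchListTool code. */
final class RefetchSketch {
    // Same value as the constant quoted in the thread: 7 days in milliseconds.
    static final long FETCH_GENERATION_DELAY_MS = 7L * 24 * 60 * 60 * 1000;

    /** Hypothetical stand-in for a page record in the web database. */
    static final class Page {
        long nextFetchMs; // earliest time the page may be selected again

        Page(long nextFetchMs) {
            this.nextFetchMs = nextFetchMs;
        }
    }

    /** A page is a fetchlist candidate once its next-fetch time has passed. */
    static boolean isDue(Page p, long nowMs) {
        return p.nextFetchMs <= nowMs;
    }

    /** Fetchlist generation: hold the page back so a second run skips it. */
    static void markSelected(Page p, long nowMs) {
        p.nextFetchMs = nowMs + FETCH_GENERATION_DELAY_MS;
    }

    /** updatedb after a fetch: the configured interval replaces the hold-back. */
    static void markFetched(Page p, long fetchedAtMs, long configuredIntervalMs) {
        p.nextFetchMs = fetchedAtMs + configuredIntervalMs;
    }

    public static void main(String[] args) {
        long day = TimeUnit.DAYS.toMillis(1);
        long interval = 30 * day; // assumed configured fetch interval

        Page page = new Page(0L);
        markSelected(page, 0L);                      // page goes into a fetchlist
        System.out.println(isDue(page, 1 * day));    // false: skipped by a second run
        System.out.println(isDue(page, 8 * day));    // true: fetchlist lost, page recovers

        markFetched(page, 2 * day, interval);        // updatedb ran after fetching
        System.out.println(isDue(page, 8 * day));    // false: configured interval applies
        System.out.println(isDue(page, 32 * day));   // true: due for a regular refetch
    }
}
```

In this sketch the 7-day delay only matters while a page sits in a fetchlist that has not yet been fed back through updatedb; once the update happens, the configured interval decides the next refetch.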

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



RE: refetching interval

Ledio Ago
Andrzej, it makes sense now.  Thank you for your reply, this really helps.

-Ledio


-----Original Message-----
From: Andrzej Bialecki [mailto:[hidden email]]
Sent: Tuesday, May 16, 2006 2:18 PM
To: [hidden email]
Subject: Re: refetching interval



Re: refetching interval

luti
It is an interesting 'webgraph project':
http://webgraph.dsi.unimi.it/

I don't know whether it is usable for the Nutch link database.