db.fetch.schedule.adaptive.min_interval not respected by Nutch 1.13


Zoltán Zvara
Dear Community,

db.fetch.schedule.adaptive.min_interval is not respected by Nutch 1.13. It is set to "86400" (one day, in seconds), but a specific index of a site is still fetched every 1-2 hours. What could be the problem?

Other configurations are:
db.fetch.schedule.class = "org.apache.nutch.crawl.AdaptiveFetchSchedule"
db.fetch.schedule.adaptive.min_interval = "86400"
db.fetch.schedule.adaptive.inc_rate = "0.4"
db.fetch.schedule.adaptive.dec_rate = "0.2"
db.fetch.schedule.adaptive.sync_delta = "true"
db.fetch.schedule.adaptive.sync_delta_rate = "0.3"

For the generate step, top is 50000, number-of-lists is 50, and number-of-segments is 1.

Thanks,
Zoltán

Re: db.fetch.schedule.adaptive.min_interval not respected by Nutch 1.13

Sebastian Nagel
Hi Zoltán,

It's probably a bug (NUTCH-1564). Try setting db.fetch.schedule.adaptive.sync_delta to false.

Best,
Sebastian

On 11/10/2017 04:12 PM, Zoltán Zvara wrote:

> Dear Community,
>
> db.fetch.schedule.adaptive.min_interval is not respected by Nutch 1.13. It is set to "86400", but a specific index of a site is fetched every 1-2 hours. What could be the problem?
>
> Other configurations are:
> db.fetch.schedule.class = "org.apache.nutch.crawl.AdaptiveFetchSchedule"
> db.fetch.schedule.adaptive.min_interval = "86400"
> db.fetch.schedule.adaptive.inc_rate = "0.4"
> db.fetch.schedule.adaptive.dec_rate = "0.2"
> db.fetch.schedule.adaptive.sync_delta = "true"
> db.fetch.schedule.adaptive.sync_delta_rate = "0.3"
>
> On generate the top is: 50000, number-of-lists: 50, number-of-segments: 1
>
> Thanks,
> Zoltán
>


Re: db.fetch.schedule.adaptive.min_interval not respected by Nutch 1.13

Zoltán Zvara
Hi Sebastian,

We tried that, but sites are still fetched every 1-2 hours, which is roughly one crawl iteration.

Any other ideas? Maybe on how to debug it?

Thanks,
Zoltán
On 2017-11-12 15:34:45, Sebastian Nagel <[hidden email]> wrote:
Hi Zoltán,

it's probably a bug (NUTCH-1564), try to set sync_delta to false.

Best,
Sebastian



Re: db.fetch.schedule.adaptive.min_interval not respected by Nutch 1.13

Zoltán Zvara
We found the problem. Looking into the code of `AdaptiveFetchSchedule`, a `defaultInterval` is used the first time each record is scheduled, and it is read from the configuration parameter "db.fetch.interval.default". That parameter was not set in our configuration, and the `AbstractFetchSchedule` implementation falls back to 0, which forced a re-fetch in every consecutive fetch phase. Sneaky. :-)

To avoid banal issues like this, the default values in code should be the same as the defaults in "nutch-site.xml".
Otherwise you never know what will happen.
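
Here is a minimal sketch of the effect, in case it helps someone else. It is a simplified stand-in and not Nutch's actual code: the class `FetchIntervalSketch`, its helper methods, and the plain `Map` used in place of the real configuration object are all hypothetical. Only the property names and the in-code fallback of 0 come from this thread, and the value "86400" for "db.fetch.interval.default" is just an illustrative choice.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for the behaviour described in this thread.
// This is NOT Nutch's code; all names here are hypothetical.
final class FetchIntervalSketch {

    // Hypothetical config lookup with an in-code fallback value.
    static int getInt(Map<String, String> conf, String key, int fallback) {
        String raw = conf.get(key);
        return raw == null ? fallback : Integer.parseInt(raw.trim());
    }

    // Interval (in seconds) given to a record that is scheduled for the first time.
    // The in-code fallback of 0 mirrors what was reported above.
    static int initialIntervalSeconds(Map<String, String> conf) {
        return getInt(conf, "db.fetch.interval.default", 0);
    }

    // Next fetch time in milliseconds: last fetch time plus the interval.
    static long nextFetchTimeMillis(long fetchTimeMillis, int intervalSeconds) {
        return fetchTimeMillis + intervalSeconds * 1000L;
    }

    // A record is due again once its next fetch time has been reached.
    static boolean isDue(long nextFetchTimeMillis, long nowMillis) {
        return nextFetchTimeMillis <= nowMillis;
    }

    public static void main(String[] args) {
        long fetchedAt = 0L;               // time of the first fetch
        long oneHourLater = 3_600_000L;    // roughly one crawl iteration later

        // Case 1: only the adaptive minimum is configured, the default interval is missing.
        Map<String, String> conf = new HashMap<>();
        conf.put("db.fetch.schedule.adaptive.min_interval", "86400");
        int unset = initialIntervalSeconds(conf);
        System.out.println("default unset: interval=" + unset + "s, due after 1h="
                + isDue(nextFetchTimeMillis(fetchedAt, unset), oneHourLater));

        // Case 2: the default interval is set explicitly (illustrative value).
        conf.put("db.fetch.interval.default", "86400");
        int set = initialIntervalSeconds(conf);
        System.out.println("default set:   interval=" + set + "s, due after 1h="
                + isDue(nextFetchTimeMillis(fetchedAt, set), oneHourLater));
    }
}
```

With the key missing, the sketch yields an interval of 0, so the record is already due at the next iteration. With the key set, it is not due until the interval has passed. In our case the practical fix was simply to set "db.fetch.interval.default" explicitly in the configuration.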

Cheers,
Zoltán

On 2017-11-18 15:48:06, Zoltán Zvara <[hidden email]> wrote:
Hi Sebastian,

We tried it but sites still get fetched every 1-2 hours, which is roughly one iteration.

Any other ideas? Maybe on how to debug it?

Thanks,
Zoltán