how to deal with large/slow sites

AJ Chen-2
In vertical crawling, there are always some large sites that have tens
of thousands of pages. Fetching a page from these large sites very often
returns "retry later" because http.max.delays is exceeded.  Setting
appropriate values for http.max.delays and fetcher.server.delay can
minimize this kind of url dropping. However, in my application I still
see 20-50% of urls dropped from a few large sites, even with a fairly
long delay setting of http.max.delays=20 and fetcher.server.delay=5.0,
effectively 100 seconds per host.
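
(For reference, a minimal sketch of what those settings look like as
<property> entries in conf/nutch-site.xml; the property names are the
stock Nutch ones, the values are the ones quoted above:)

  <property>
    <name>http.max.delays</name>
    <value>20</value>
    <!-- how many times a fetcher thread will wait for a busy host
         before giving up on the url with "retry later" -->
  </property>
  <property>
    <name>fetcher.server.delay</name>
    <value>5.0</value>
    <!-- seconds to wait between successive requests to the same host -->
  </property>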

Two questions:
(1) Is there a better approach to deep-crawl large sites?  Should we
treat large sites differently from smaller sites?  I notice Doug and
Andrzej have discussed potential solutions to this problem, but does
anybody have a good short-term solution?

(2) Will the dropped urls be picked up again in subsequent cycles of
fetchlist/segment/fetch/updatedb?  If this is true, running more cycles
should eventually fetch the dropped urls.  Does
db.default.fetch.interval (default is 30 days) influence when the
dropped urls will be fetched again?

Appreciate your advice.
AJ

Re: how to deal with large/slow sites

Doug Cutting-2
AJ Chen wrote:
> Two questions:
> (1) Is there a better approach to deep-crawl large sites?

If a site has N pages that require T seconds each on average to fetch,
then fetching the entire site will require N*T seconds.  If that's
longer than you're willing to wait, then you won't be able to fetch
the entire site.  If you are willing to wait, then set http.max.delays
to Integer.MAX_VALUE and wait.  In this case there's no shortcut.
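
(Illustrative numbers, not figures from this thread: with N = 50,000
pages and T = 5 seconds per fetch,

  N * T = 50,000 * 5 s = 250,000 s, roughly 69 hours, or about 3 days

for a single host fetched politely. If you do decide to wait,
Integer.MAX_VALUE is 2147483647, so the nutch-site.xml entry would look
something like:

  <property>
    <name>http.max.delays</name>
    <value>2147483647</value>
    <!-- effectively: never give up waiting on a busy host -->
  </property>
)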

> (2) Will the dropped urls be picked up again in subsequent cycles of
> fetchlist/segment/fetch/updatedb?

They will be retried in the next cycle, up to db.fetch.retry.max.
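
(For reference, a minimal sketch of the corresponding nutch-site.xml
entry; 3 is, as far as I recall, the stock default:)

  <property>
    <name>db.fetch.retry.max</name>
    <value>3</value>
    <!-- maximum number of times a url that failed with "retry later"
         will be put back on a fetchlist before it is given up on -->
  </property>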

Doug

Re: how to deal with large/slow sites

em-13
Doug Cutting wrote:

> > (2) Will the dropped urls be picked up again in subsequent cycles of
> > fetchlist/segment/fetch/updatedb?
>
> They will be retried in the next cycle, up to db.fetch.retry.max.
>
After the next bin/nutch generate... or are you using 'cycle' for
something else?

EM

Re: how to deal with large/slow sites

Doug Cutting-2
EM wrote:
> Doug Cutting wrote:
>
>> > (2) Will the dropped urls be picked up again in subsequent cycles of
>> > fetchlist/segment/fetch/updatedb?
>>
>> They will be retried in the next cycle, up to db.fetch.retry.max.
>>
> After the next bin/nutch generate... or are you using 'cycle' for
> something else?

updatedb + generate

Doug
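
(For readers following along: one cycle here means one
generate/fetch/updatedb round. A rough sketch using the Nutch 0.7-era
command-line tools, assuming a web db in db/ and segments under
segments/; exact arguments vary by version:)

  bin/nutch generate db segments      # write a new fetchlist into a fresh segment
  s=`ls -d segments/2* | tail -1`     # pick up the segment just created
  bin/nutch fetch $s                  # fetch it
  bin/nutch updatedb db $s            # fold the fetch results back into the web db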