large number of urls from Generator are not fetched?

large number of urls from Generator are not fetched?

AJ Chen-2
Any idea why nutch (0.9-dev) does not try to fetch every url generated? For
example, if Generator generates 200,000 urls, maybe fewer than 100,000 urls are
actually fetched, whether successfully or not. This is a big difference, and it is
obvious from the number of urls in the log or from running readseg -list. What
causes such a large number of urls to get thrown out by the Fetcher?

Thanks,
--
AJ Chen, PhD
http://web2express.org

Re: large number of urls from Generator are not fetched?

Sami Siren-2
Are you saying that the generator actually generates 200k urls but the fetcher
fetches only around 100k, or are you saying that you generate with -topN 200000
and the fetcher fetches only around 100k?

If the latter, and you are running with LocalJobRunner, you need to generate
with -numFetchers 1.

--
  Sami Siren
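
For reference, a rough sketch of such a single-machine round with that option
(directory layout and the segment lookup are illustrative, not taken from this
thread):

  # generate one fetch list sized by -topN, forced into a single partition
  bin/nutch generate crawl/crawldb crawl/segments -topN 200000 -numFetchers 1

  # pick up the newly created segment and fetch it
  SEGMENT=`ls -d crawl/segments/2* | tail -1`
  bin/nutch fetch $SEGMENT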


Re: large number of urls from Generator are not fetched?

Andrzej Białecki-2
In reply to this post by AJ Chen-2

Are you running with the "local" jobtracker?

Please run 'nutch readseg -list <segmentName>' on this segment, especially when
it is freshly generated, to check the number of entries in crawl_generate (you
can simulate this by creating a new dir and copying only crawl_generate there).

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
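
For reference, a rough sketch of that check (the segment name and scratch
directory are illustrative):

  # list entry counts for a segment; right after generation only crawl_generate exists
  SEGMENT=crawl/segments/20061031120000      # illustrative segment name
  bin/nutch readseg -list $SEGMENT

  # to look at crawl_generate alone in an already-fetched segment,
  # copy just that part into a scratch directory and list it there
  mkdir -p /tmp/seg-check/20061031120000
  cp -r $SEGMENT/crawl_generate /tmp/seg-check/20061031120000/
  bin/nutch readseg -list /tmp/seg-check/20061031120000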



RE: large number of urls from Generator are not fetched?

Ledio Ago
In reply to this post by AJ Chen-2
Could the problem be that "http.max.delays" is set too low?

-Ledio
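
For reference, a quick way to see what the crawl is actually using for that
property, assuming the standard Nutch conf/ layout; an override would go into
conf/nutch-site.xml:

  # show the shipped default and its description
  grep -A 4 'http.max.delays' conf/nutch-default.xml

  # check whether it has been overridden locally
  grep -A 4 'http.max.delays' conf/nutch-site.xml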


Re: large number of urls from Generator are not fetched?

Andrzej Białecki-2
In reply to this post by AJ Chen-2

Please see rev. 469660 (trunk) and rev. 469667 (branch-0.8) for a
possible fix.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: large number of urls from Generator are not fetched?

AJ Chen-2
In reply to this post by Sami Siren-2
It was numFetchers=-1 that caused the unexpected result. Setting it to 1 for the
Generator solves the problem. Thanks.
If one runs the nutch crawler with the default configuration on a single machine,
the unexpected difference between topN and the number of urls fetched will cause
confusion. It may be a good idea to use numFetchers=1 as the default.

AJ


--
AJ Chen, PhD
http://web2express.org

Re: large number of urls from Generator are not fetched?

Dennis Kubes
In reply to this post by AJ Chen-2
For anyone searching this thread in the future: one possible cause of this is
when the hadoop nodes are not time synchronized with ntp or something similar.

For example, suppose one or more of the slave nodes is a few minutes ahead of
the others and an inject job runs on one of those nodes (where a job is placed
is pretty much random and up to the system, so this wouldn't happen every time
if only some of the nodes are out of sync). If a generate job then runs on any
node that is behind the out-of-sync nodes (again random), some of the urls may
not get fetched, because their starting fetch time in the crawl db is later than
the current time on the machine that is doing the generate task.

Being out of sync also seems to affect other things, such as tasks stalling for
a couple of minutes, but I don't have specific information on that. The fix is
to set up the nodes to access a time server in your network, or a public time
server, and in either case to make sure the nodes stay synchronized by having
ntp run on startup.

Dennis
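
For reference, a rough sketch of that setup on a Red Hat-style Linux node
(package, service, and server names will differ on other systems):

  # one-off sync against a public time server
  ntpdate pool.ntp.org

  # keep ntpd running and make it start on boot
  service ntpd start
  chkconfig ntpd on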
