some urls in fetch list is not being fetched

some urls in fetch list is not being fetched

Feng Ji
Hi there,

I am running Nutch 0.8.

Something odd is happening: some URLs are generated into the fetch list (I
added a debug print of the URL in map() of Generator.java and also checked
the dumped text from /crawl_generate), so these URLs are in the fetch list.

But I cannot find them in log/hadoop for the fetcher segment. I expected to
see them, because Fetcher.java contains
"if (LOG.isInfoEnabled()) { LOG.info("fetching " + url); }"
and I do see other URLs being fetched in log/hadoop.

It seems some URLs are in the fetch list but are not being fetched. Did I
miss something important in the setup?
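
One way to make the comparison described above repeatable is to diff the dumped fetch list against the "fetching " lines in the log. The following is only a minimal sketch, not part of Nutch: the class name FetchListCheck, its methods, and the sample data are hypothetical, and it assumes the fetch list has already been reduced to one URL per entry and that each log line ends with the URL right after the "fetching " text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not a Nutch class.
public class FetchListCheck {
    // Text written by LOG.info("fetching " + url) in front of each URL.
    private static final String MARKER = "fetching ";

    /** Collects the URLs that follow "fetching " in the given log lines. */
    public static Set<String> fetchedUrls(List<String> logLines) {
        Set<String> fetched = new HashSet<>();
        for (String line : logLines) {
            int idx = line.indexOf(MARKER);
            if (idx >= 0) {
                fetched.add(line.substring(idx + MARKER.length()).trim());
            }
        }
        return fetched;
    }

    /** Returns the fetch-list URLs that have no "fetching" log line, in list order. */
    public static List<String> missing(List<String> fetchList, List<String> logLines) {
        Set<String> fetched = fetchedUrls(logLines);
        List<String> result = new ArrayList<>();
        for (String url : fetchList) {
            String trimmed = url.trim();
            if (!fetched.contains(trimmed)) {
                result.add(trimmed);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample data; real input would come from the dump and the log file.
        List<String> fetchList = Arrays.asList("http://a.example/", "http://b.example/");
        List<String> log = Arrays.asList(
            "2006-01-01 00:00:00,000 INFO fetcher.Fetcher - fetching http://a.example/");
        System.out.println(missing(fetchList, log));
    }
}
```

If the list of missing URLs is not empty, the next thing to look for is an error in the same log around the point where the "fetching" lines stop.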

thanks,

Michael,
Re: some urls in fetch list is not being fetched

Feng Ji
Sorry,

I missed an important error in the log:

2006-08-30 09:07:17,343 FATAL fetcher.Fetcher - at
org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:116)
That is the line "if (!input.next(key, datum))" in Fetcher.java.

It seems an I/O error happened when the fetcher tried to read data from
/crawl_generate.
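
If the read really does fail at that call, it would explain the missing URLs: a loop that stops on the first read error never reaches the entries after it. The following self-contained sketch only simulates that behaviour. It is an assumption about the cause, not the actual Fetcher code, and UrlSource, drain and failingSource are hypothetical names that do not exist in Nutch or Hadoop.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical simulation, not the Nutch Fetcher.
public class ReadLoopSketch {
    /** Stand-in for the fetch-list reader. */
    public interface UrlSource {
        /** Returns the next URL, or null at end of input. */
        String next() throws IOException;
    }

    /** Reads until end of input or the first read error; returns the URLs seen so far. */
    public static List<String> drain(UrlSource input) {
        List<String> logged = new ArrayList<>();
        try {
            String url;
            while ((url = input.next()) != null) {
                logged.add(url); // stands in for LOG.info("fetching " + url)
            }
        } catch (IOException e) {
            // A failed read ends the loop, so later entries are never reached.
            System.err.println("read failed: " + e.getMessage());
        }
        return logged;
    }

    /** Builds a source that throws once failAt entries have been returned. */
    public static UrlSource failingSource(List<String> urls, int failAt) {
        Iterator<String> it = urls.iterator();
        int[] count = {0};
        return () -> {
            if (count[0] == failAt) {
                throw new IOException("simulated read error");
            }
            if (!it.hasNext()) {
                return null;
            }
            count[0]++;
            return it.next();
        };
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList("http://a.example/", "http://b.example/", "http://c.example/");
        // Fails after one entry, so only the first URL is reported.
        System.out.println(drain(failingSource(urls, 1)));
    }
}
```

In this simulation every URL after the failing read is silently skipped, which matches the symptom of entries that are in the fetch list but never show up as "fetching" in the log.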

Any hint you could provide?

thanks,

Michael,

