Error at end of MapReduce run with indexing

Error at end of MapReduce run with indexing

kkrugler
Hello fellow Nutchers,

I followed the steps described here by Doug:
  <http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200509.mbox/%3c4330C061.4030706@...%3e>

...to start a test run of the new (0.8, as of 1/12/2006) version of Nutch.

It ran for quite a while on my three machines - started at 111226,
and died at 150937, so almost four hours.

The error occurred during the Indexer phase:

060114 150937 Indexer: starting
060114 150937 Indexer: linkdb: crawl-20060114111226/linkdb
060114 150937 Indexer: adding segment:
/user/crawler/crawl-20060114111226/segments/20060114111918
060114 150937 parsing file:/home/crawler/nutch/conf/nutch-default.xml
060114 150937 parsing file:/home/crawler/nutch/conf/crawl-tool.xml
060114 150937 parsing file:/home/crawler/nutch/conf/mapred-default.xml
060114 150937 parsing file:/home/crawler/nutch/conf/mapred-default.xml
060114 150937 parsing file:/home/crawler/nutch/conf/nutch-site.xml
060114 150937 Indexer: adding segment:
/user/crawler/crawl-20060114111226/segments/20060114122751
060114 150937 Indexer: adding segment:
/user/crawler/crawl-20060114111226/segments/20060114133620
Exception in thread "main" java.io.IOException: timed out waiting for response
         at org.apache.nutch.ipc.Client.call(Client.java:296)
         at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
         at $Proxy1.submitJob(Unknown Source)
         at org.apache.nutch.mapred.JobClient.submitJob(JobClient.java:259)
         at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:288)
         at org.apache.nutch.indexer.Indexer.index(Indexer.java:259)
         at org.apache.nutch.crawl.Crawl.main(Crawl.java:121)

1. Any ideas what might have caused it to time out just now, when it
had successfully run many jobs up to that point?

2. What cruft might I need to get rid of because it died? For
example, I see a reference to
/home/crawler/tmp/local/jobTracker/job_18cunz.xml now when I try to
execute some Nutch commands.

3. What's the best way to find out how many pages were actually
crawled, how many links are in the DB, and so on? The 0.7-era commands
(readdb, segread, etc.) don't seem to be working with the new NDFS
setup.

4. Any idea whether 4 hours is a reasonable amount of time for this
test? It seemed long to me, given that I was starting with a single
URL as the seed.

Thanks,

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-470-9200

Re: Error at end of MapReduce run with indexing

Florent Gluck
Ken Krugler wrote:

> Hello fellow Nutchers,
>
> I followed the steps described here by Doug:
>  <http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200509.mbox/%3c4330C061.4030706@...%3e>
>
>
> ...to start a test run of the new (0.8, as of 1/12/2006) version of
> Nutch.
>
> It ran for quite a while on my three machines - started at 111226, and
> died at 150937, so almost four hours.
>
> The error occurred during the Indexer phase:
>
> 060114 150937 Indexer: starting
> 060114 150937 Indexer: linkdb: crawl-20060114111226/linkdb
> 060114 150937 Indexer: adding segment:
> /user/crawler/crawl-20060114111226/segments/20060114111918
> 060114 150937 parsing file:/home/crawler/nutch/conf/nutch-default.xml
> 060114 150937 parsing file:/home/crawler/nutch/conf/crawl-tool.xml
> 060114 150937 parsing file:/home/crawler/nutch/conf/mapred-default.xml
> 060114 150937 parsing file:/home/crawler/nutch/conf/mapred-default.xml
> 060114 150937 parsing file:/home/crawler/nutch/conf/nutch-site.xml
> 060114 150937 Indexer: adding segment:
> /user/crawler/crawl-20060114111226/segments/20060114122751
> 060114 150937 Indexer: adding segment:
> /user/crawler/crawl-20060114111226/segments/20060114133620
> Exception in thread "main" java.io.IOException: timed out waiting for
> response
>         at org.apache.nutch.ipc.Client.call(Client.java:296)
>         at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
>         at $Proxy1.submitJob(Unknown Source)
>         at
> org.apache.nutch.mapred.JobClient.submitJob(JobClient.java:259)
>         at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:288)
>         at org.apache.nutch.indexer.Indexer.index(Indexer.java:259)
>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:121)
>
> 1. Any ideas what might have caused it to time out just now, when it
> had successfully run many jobs up to that point?
>
> 2. What cruft might I need to get rid of because it died? For example,
> I see a reference to /home/crawler/tmp/local/jobTracker/job_18cunz.xml
> now when I try to execute some Nutch commands.

I've had the same problem during the invertlinks step when dealing with a
large number of URLs. Increasing the ipc.client.timeout value from
60000 to 100000 (cf. nutch-default.xml) did the trick.
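
For what it's worth, you don't have to edit nutch-default.xml itself: the
same property can be overridden from conf/nutch-site.xml, which is parsed
after nutch-default.xml (you can see both files in your log above). A
rough sketch of the override; the 100000 value is just what happened to
work for me, not a tuned recommendation:

<property>
  <name>ipc.client.timeout</name>
  <value>100000</value>
  <description>Milliseconds to wait for a response before an IPC call
  times out (the default is 60000).</description>
</property>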

>
> 3. What's the best way to find out how many pages were actually
> crawled, how many links are in the DB, etc? The 0.7-era commands
> (readdb, segread, etc) don't seem to be working with the new NDFS setup.

The following gives you some stats about the crawl db (number of URLs
fetched, unfetched, and "dead" ones):
nutch readdb crawldb -stats

>
> 4. Any idea whether 4 hours is a reasonable amount of time for this
> test? It seemed long to me, given that I was starting with a single
> URL as the seed.
>
How many crawl passes did you do?

--Flo

Re: Error at end of MapReduce run with indexing

kkrugler
Hi Florent,

[snip]

>> 1. Any ideas what might have caused it to time out just now, when it
>> had successfully run many jobs up to that point?
>>
>> 2. What cruft might I need to get rid of because it died? For example,
>> I see a reference to /home/crawler/tmp/local/jobTracker/job_18cunz.xml
>> now when I try to execute some Nutch commands.
>
> I've had the same problem during the invertlinks step when dealing with a
> large number of URLs. Increasing the ipc.client.timeout value from
> 60000 to 100000 (cf. nutch-default.xml) did the trick.

Thanks for the idea - we'll give it a try now.

[snip]

>  > 4. Any idea whether 4 hours is a reasonable amount of time for this
>>  test? It seemed long to me, given that I was starting with a single
>  > URL as the seed.
>  >
>How many crawl passes did you do ?

Three deep, as in: bin/nutch crawl seeds -depth 3

This was the same as Doug described in his post here:

http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200509.mbox/%3c4330C061.4030706@...%3e

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-470-9200

Re: Error at end of MapReduce run with indexing

Florent Gluck
Hi Ken,

>>> 4. Any idea whether 4 hours is a reasonable amount of time for this
>>> test? It seemed long to me, given that I was starting with a single
>>> URL as the seed.
>>
>> How many crawl passes did you do?
>
> Three deep, as in: bin/nutch crawl seeds -depth 3
>
> This was the same as Doug described in his post here:
>
> http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200509.mbox/%3c4330C061.4030706@...%3e

I assume the time it takes depends on your hardware, bandwidth, how many
URLs are being fetched, and your MapReduce settings.
Four hours seems a bit long when starting from a single URL, though.
Are you using 2 or 3 slave machines?
What values are you using for "fetcher.threads.fetch",
"mapred.map.tasks" and "mapred.reduce.tasks"?
When doing a "nutch readdb crawldb -stats", how many DB_unfetched and
DB_fetched do you have?
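
All three of those are ordinary properties from nutch-default.xml and
mapred-default.xml, so they should be overridable from conf/nutch-site.xml
just like ipc.client.timeout. The values below are only placeholders to
show the shape of the override, not suggestions for your cluster:

<property>
  <name>fetcher.threads.fetch</name>
  <value>10</value>
</property>
<property>
  <name>mapred.map.tasks</name>
  <value>4</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>2</value>
</property>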

--Flo

Re: Error at end of MapReduce run with indexing

Doug Cutting-2
In reply to this post by kkrugler
Ken Krugler wrote:

> 060114 150937 Indexer: adding segment:
> /user/crawler/crawl-20060114111226/segments/20060114122751
> 060114 150937 Indexer: adding segment:
> /user/crawler/crawl-20060114111226/segments/20060114133620
> Exception in thread "main" java.io.IOException: timed out waiting for
> response
>         at org.apache.nutch.ipc.Client.call(Client.java:296)
>         at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
>         at $Proxy1.submitJob(Unknown Source)
>         at org.apache.nutch.mapred.JobClient.submitJob(JobClient.java:259)
>         at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:288)
>         at org.apache.nutch.indexer.Indexer.index(Indexer.java:259)
>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:121)
>
> 1. Any ideas what might have caused it to time out just now, when it had
> successfully run many jobs up to that point?

I too have seen this, and found that increasing the ipc timeout fixes
it.  The underlying problem is that the JobTracker computes the input
splits under the submitJob() RPC call.  For sufficiently big jobs, this
can cause an RPC timeout.  The JobTracker should instead return from
submitJob() immediately, and then compute the input splits in a separate
thread.

> 2. What cruft might I need to get rid of because it died? For example, I
> see a reference to /home/crawler/tmp/local/jobTracker/job_18cunz.xml now
> when I try to execute some Nutch commands.

This should get cleaned up the next time the jobtracker is restarted.

Doug

Re: Error at end of MapReduce run with indexing

Matt Zytaruk
I am having this same problem during the reduce phase of fetching, and
am now seeing:
 060119 132458 Task task_r_obwceh timed out.  Killing.

Will the jobtracker restart this job? If so, if I change the ipc timeout
in the config, will the tasktracker read in the new value when the job
restarts?
This was a very large crawl and I would be loath to have to re-fetch it
all over again.

thanks for any info.

-Matt Zytaruk

Doug Cutting wrote:

> Ken Krugler wrote:
>
>> 060114 150937 Indexer: adding segment:
>> /user/crawler/crawl-20060114111226/segments/20060114122751
>> 060114 150937 Indexer: adding segment:
>> /user/crawler/crawl-20060114111226/segments/20060114133620
>> Exception in thread "main" java.io.IOException: timed out waiting for
>> response
>>         at org.apache.nutch.ipc.Client.call(Client.java:296)
>>         at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
>>         at $Proxy1.submitJob(Unknown Source)
>>         at
>> org.apache.nutch.mapred.JobClient.submitJob(JobClient.java:259)
>>         at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:288)
>>         at org.apache.nutch.indexer.Indexer.index(Indexer.java:259)
>>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:121)
>>
>> 1. Any ideas what might have caused it to time out just now, when it
>> had successfully run many jobs up to that point?
>
>
> I too have seen this, and found that increasing the ipc timeout fixes
> it.  The underlying problem is that the JobTracker computes the input
> splits under the submitJob() RPC call.  For sufficiently big jobs,
> this can cause an RPC timeout.  The JobTracker should instead return
> from submitJob() immediately, and then compute the input splits in a
> separate thread.
>
>> 2. What cruft might I need to get rid of because it died? For
>> example, I see a reference to
>> /home/crawler/tmp/local/jobTracker/job_18cunz.xml now when I try to
>> execute some Nutch commands.
>
>
> This should get cleaned up the next time the jobtracker is restarted.
>
> Doug
>
>


Re: Error at end of MapReduce run with indexing

Doug Cutting-2
Matt Zytaruk wrote:
> I am having this same problem during the reduce phase of fetching, and
> am now seeing:
> 060119 132458 Task task_r_obwceh timed out.  Killing.

That is a different problem: a different timeout. This happens when a
task does not report its status for too long; it is then assumed to be
hung.
> Will the jobtracker restart this job?

It will retry that task up to three times.

> If so, if I change the ipc timeout
> in the config, will the tasktracker read in the new value when the job
> restarts?

The ipc timeout is not the relevant timeout.  The task timeout is what's
involved here.  And, no, at present I think the tasktracker only reads
this when it is started, not per job.
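
If memory serves, the knob for that one is mapred.task.timeout in
mapred-default.xml (worth double-checking the exact name in your copy);
something along these lines, where 600000 is the ten-minute default I
remember, so treat it as an assumption:

<property>
  <name>mapred.task.timeout</name>
  <value>600000</value>
  <description>Milliseconds a task may go without reading input, writing
  output, or reporting status before it is considered hung and
  killed.</description>
</property>

And since the tasktracker picks it up at startup, the tasktrackers would
need to be restarted for a new value to take effect.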

Doug

Re: Error at end of MapReduce run with indexing

kkrugler
In reply to this post by Florent Gluck
> Hi Ken,
>
>>>> 4. Any idea whether 4 hours is a reasonable amount of time for this
>>>> test? It seemed long to me, given that I was starting with a single
>>>> URL as the seed.
>>>
>>> How many crawl passes did you do?
>>
>> Three deep, as in: bin/nutch crawl seeds -depth 3
>>
>> This was the same as Doug described in his post here:
>>
>> http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200509.mbox/%3c4330C061.4030706@...%3e
>
> I assume the time it takes depends on your hardware, bandwidth, how many
> URLs are being fetched, and your MapReduce settings.
> Four hours seems a bit long when starting from a single URL, though.
> Are you using 2 or 3 slave machines?
> What values are you using for "fetcher.threads.fetch",
> "mapred.map.tasks" and "mapred.reduce.tasks"?
> When doing a "nutch readdb crawldb -stats", how many DB_unfetched and
> DB_fetched do you have?

Sorry for the late reply.

The problem was caused by the default IPC timeout value being too low.

We were getting lots of timeout errors, which was killing our performance.

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-470-9200