Recovering from Socket closed

Recovering from Socket closed

Chris Schneider-2
Dear Nutch Users,

We've been using Nutch 0.8 (MapReduce) to perform some internet
crawling. Things seemed to be going well on our 11 machines (1 master
with JobTracker/NameNode, 10 slaves with TaskTrackers/DataNodes)
until...

060129 222409 Lost tracker 'tracker_56288'
060129 222409 Task 'task_m_10gs5f' has been lost.
060129 222409 Task 'task_m_10qhzr' has been lost.
    ........
    ........
060129 222409 Task 'task_r_zggbwu' has been lost.
060129 222409 Task 'task_r_zh8dao' has been lost.
060129 222455 Server handler 8 on 8010 caught:
java.net.SocketException: Socket closed
java.net.SocketException: Socket closed
         at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:99)
         at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
         at
java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
         at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
         at java.io.DataOutputStream.flush(DataOutputStream.java:106)
         at org.apache.nutch.ipc.Server$Handler.run(Server.java:216)
060129 222455 Adding task 'task_m_cia5po' to set for tracker 'tracker_56288'
060129 223711 Adding task 'task_m_ffv59i' to set for tracker 'tracker_25647'

I'm hoping that someone could explain why task_m_cia5po got added to
tracker_56288 after this tracker was lost.

The TaskTracker now running on that slave is called tracker_25647,
and there's nothing in its log referring to task_m_cia5po. Here's a
snippet from that timeframe:

060129 215343 task_r_a61kvl 0.9997992% reduce > reduce
060129 215349 task_r_a61kvl  Recovered from failed datanode connection
060129 215349 task_r_a61kvl 1.0% reduce > reduce
060129 215349 Task task_r_a61kvl is done.
060129 215350 Server connection on port 50050 from 127.0.0.1: exiting
060129 221629 Lost connection to JobTracker
[m1.krugle.net/192.168.1.11:8010]. ex=jav$
060129 222134 Lost connection to JobTracker
[m1.krugle.net/192.168.1.11:8010]. ex=jav$
060129 222455 task_m_1cyb7c done; removing files.
    ........
    ........
060129 222535 task_m_zgpg9c done; removing files.
060129 222711 Stopping server on 50050
060129 222735 Server listener on port 50050: exiting
060129 222844 Server handler 0 on 50050: exiting
060129 222845 Server handler 6 on 50050: exiting
060129 222846 Server handler 1 on 50050: exiting
060129 222847 Server handler 3 on 50050: exiting
060129 222848 Server handler 2 on 50050: exiting
060129 222849 Server handler 4 on 50050: exiting
060129 222849 Server handler 5 on 50050: exiting
060129 222849 Server handler 7 on 50050: exiting
060129 223211 Stopping server on 50040
060129 223226 Client connection to 192.168.1.8:50040: closing
060129 223229 Client connection to 192.168.1.1:50040: closing
060129 223230 Client connection to 127.0.0.1:50040: closing
060129 223230 Server connection on port 50040 from 127.0.0.1: exiting
060129 223230 Client connection to 127.0.0.1:50040: closing
060129 223230 Server connection on port 50040 from 127.0.0.1: exiting
060129 223250 Server connection on port 50040 from 192.168.1.7: exiting
060129 223408 Server connection on port 50040 from 192.168.1.8: exiting
060129 223505 Server listener on port 50040: exiting
060129 223531 Server connection on port 50040 from 192.168.1.1: exiting
060129 223532 Server connection on port 50040 from 192.168.1.3: exiting
060129 223536 Server connection on port 50040 from 192.168.1.6: exiting
060129 223548 Server handler 0 on 50040: exiting
060129 223550 Server handler 6 on 50040: exiting
060129 223554 Server handler 3 on 50040: exiting
060129 223554 Server handler 2 on 50040: exiting
060129 223556 Server connection on port 50040 from 192.168.1.4: exiting
060129 223556 Server handler 7 on 50040: exiting
060129 223559 Server handler 1 on 50040: exiting
060129 223601 Server connection on port 50040 from 192.168.1.2: exiting
060129 223601 Server handler 5 on 50040: exiting
060129 223602 Server connection on port 50040 from 192.168.1.5: exiting
060129 223604 Server handler 4 on 50040: exiting
060129 223648 Client connection to 192.168.1.2:50040: closing
060129 223704 Server connection on port 50040 from 192.168.1.9: exiting
060129 223707 Client connection to 192.168.1.5:50040: closing
060129 223707 Client connection to 192.168.1.4:50040: closing
060129 223710 Client connection to 192.168.1.9:50040: closing
060129 223711 Reinitializing local state
060129 223711 Server listener on port 50050: starting
060129 223711 Server handler 0 on 50050: starting
060129 223711 Server handler 1 on 50050: starting
060129 223711 Server handler 2 on 50050: starting
060129 223711 Server handler 3 on 50050: starting
060129 223711 Server handler 4 on 50050: starting
060129 223711 Server handler 5 on 50050: starting
060129 223711 Server handler 6 on 50050: starting
060129 223711 Server handler 7 on 50050: starting
060129 223711 Server listener on port 50040: starting
060129 223711 Server handler 0 on 50040: starting
060129 223711 Server handler 1 on 50040: starting
060129 223711 Server handler 2 on 50040: starting
060129 223711 Server handler 3 on 50040: starting
060129 223711 Server handler 4 on 50040: starting
060129 223711 Server handler 5 on 50040: starting
060129 223711 Server handler 6 on 50040: starting
060129 223711 Server handler 7 on 50040: starting
060129 223711 parsing file:/home/crawler/nutch/conf/nutch-default.xml

So it does look like this TaskTracker relaunched itself and is now
communicating with the JobTracker.

Unfortunately, the JobTracker is still hung up waiting for
task_m_cia5po (the last of 4960 map tasks in the indexing job,
job_7nflgy) to complete, but since the task is assigned to a
TaskTracker that is no longer active, no progress is being made.

Also, since we've been running this crawl for quite some time, we'd
like to preserve the segment data if at all possible. Could someone
please recommend a way to recover as gracefully as possible from this
condition? The Crawl.main process died with the following output:

060129 221129 Indexer: adding segment:
/user/crawler/crawl-20060129091444/segments/20060129200246
Exception in thread "main" java.io.IOException: timed out waiting for response
        at org.apache.nutch.ipc.Client.call(Client.java:296)
        at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
        at $Proxy1.submitJob(Unknown Source)
        at org.apache.nutch.mapred.JobClient.submitJob(JobClient.java:259)
        at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:288)
        at org.apache.nutch.indexer.Indexer.index(Indexer.java:263)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:127)

However, it definitely seems as if the JobTracker is still waiting
for the job to finish (no failed jobs).

My thought is to first try stopping the TaskTracker process on the
slave. If that doesn't change the JobTracker's state, I guess I'd
just do a stop-all.sh. Afterward, I'd start-all.sh, then run the
following command to toss the incompletely indexed segment:

bin/nutch ndfs -rm crawl-20060129091444/segments/20060129200246
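
Concretely, I'm picturing something like this (just a sketch; the grep
pattern assumes the TaskTracker JVM shows its class name in ps output,
which it should if it was started by the usual scripts):

# on the wedged slave: stop just the TaskTracker JVM
# (the [T] keeps grep from matching its own process)
kill $(ps ax | grep '[T]askTracker' | awk '{print $1}')

# if the JobTracker still won't give up on task_m_cia5po, bounce the
# whole cluster from the master, then toss the segment as above
bin/stop-all.sh
bin/start-all.sh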

Other Thoughts?

- Chris
--
------------------------
Chris Schneider
TransPac Software, Inc.
[hidden email]
------------------------

Re: Recovering from Socket closed

Doug Cutting-2
Chris Schneider wrote:

> Also, since we've been running this crawl for quite some time, we'd like
> to preserve the segment data if at all possible. Could someone please
> recommend a way to recover as gracefully as possible from this
> condition? The Crawl.main process died with the following output:
>
> 060129 221129 Indexer: adding segment:
> /user/crawler/crawl-20060129091444/segments/20060129200246
> Exception in thread "main" java.io.IOException: timed out waiting for
> response
>     at org.apache.nutch.ipc.Client.call(Client.java:296)
>     at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
>     at $Proxy1.submitJob(Unknown Source)
>     at org.apache.nutch.mapred.JobClient.submitJob(JobClient.java:259)
>     at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:288)
>     at org.apache.nutch.indexer.Indexer.index(Indexer.java:263)
>     at org.apache.nutch.crawl.Crawl.main(Crawl.java:127)
>
> However, it definitely seems as if the JobTracker is still waiting for
> the job to finish (no failed jobs).

Have you looked at the web ui?  It will show if things are still
running.  This is on the jobtracker host at port 50030 by default.

The bug here is that the submitJob RPC call times out while the splits
for the map tasks are being computed.  The fix is that the job tracker should not compute
splits until after it has returned from the submitJob RPC.  Please
submit a bug in Jira to help remind us to fix this.

To recover, first determine if the indexing has completed.  If it has
not, then use the 'index' command to index things, followed by 'dedup'
and 'merge'.  Look at the source for Crawl.java:

http://svn.apache.org/viewcvs.cgi/lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Crawl.java?view=markup

All you need to do to complete the crawl is to complete the last few
steps manually.
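
From the command line that's roughly the following; I'm writing these
from memory, so check each command's usage message and the tail of
Crawl.java for the exact arguments.  This assumes the layout the crawl
command creates (crawldb, linkdb, segments, indexes and index under
the crawl directory):

# index the segments (list every directory under segments/ here)
bin/nutch index crawl-20060129091444/indexes \
    crawl-20060129091444/crawldb crawl-20060129091444/linkdb \
    crawl-20060129091444/segments/20060129200246

# remove duplicate documents from the new indexes
bin/nutch dedup crawl-20060129091444/indexes

# merge the per-segment indexes into one index
bin/nutch merge crawl-20060129091444/index crawl-20060129091444/indexes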

Cheers,

Doug

Re: Recovering from Socket closed

Chris Schneider-2
Doug, et al.,

>Chris Schneider wrote:
>>Also, since we've been running this crawl for quite some time, we'd like to preserve the segment data if at all possible. Could someone please recommend a way to recover as gracefully as possible from this condition? The Crawl.main process died with the following output:
>>
>>060129 221129 Indexer: adding segment: /user/crawler/crawl-20060129091444/segments/20060129200246
>>Exception in thread "main" java.io.IOException: timed out waiting for response
>>    at org.apache.nutch.ipc.Client.call(Client.java:296)
>>    at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
>>    at $Proxy1.submitJob(Unknown Source)
>>    at org.apache.nutch.mapred.JobClient.submitJob(JobClient.java:259)
>>    at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:288)
>>    at org.apache.nutch.indexer.Indexer.index(Indexer.java:263)
>>    at org.apache.nutch.crawl.Crawl.main(Crawl.java:127)
>>
>>However, it definitely seems as if the JobTracker is still waiting for the job to finish (no failed jobs).
>
>Have you looked at the web ui?  It will show if things are still running.  This is on the jobtracker host at port 50030 by default.

Yes, this is how I know the JobTracker is still waiting for task_m_cia5po to complete.

>The bug here is that the submitJob RPC call times out while the splits for the map tasks are being computed.  The fix is that the job tracker should not compute splits until after it has returned from the submitJob RPC.  Please submit a bug in Jira to help remind us to fix this.

I'll be happy to log a bug for this.

Is there a work-around? Based on some other postings, I've increased ipc.client.timeout to 300000 (5 minutes). Does this property also control the timeout for the RPC call you describe above? If so, should I increase this timeout further? Is there a better way for us to avoid getting caught by the RPC timeout you describe? This crawl was only a medium-sized test. We hope to execute a much larger crawl over the next few days.

>To recover, first determine if the indexing has completed.  If it has not, then use the 'index' command to index things, followed by 'dedup' and 'merge'.  Look at the source for Crawl.java:
>
>http://svn.apache.org/viewcvs.cgi/lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Crawl.java?view=markup
>
>All you need to do to complete the crawl is to complete the last few steps manually.

We've done these steps manually before, so I'll get on that now. I was just worried about whether to trust these segments, how best to restart the processes, etc.

Thanks,

- Chris
--
------------------------
Chris Schneider
TransPac Software, Inc.
[hidden email]
------------------------

Re: Recovering from Socket closed

Chris Schneider-2
In reply to this post by Chris Schneider-2
Doug, et al.,

At 2:48 PM -0800 1/31/06, Chris Schneider wrote:
>Is there a work-around? Based on some other postings, I've increased ipc.client.timeout to 300000 (5 minutes). Does this property also control the timeout for the RPC call you describe above? If so, should I increase this timeout further? Is there a better way for us to avoid getting caught by the RPC timeout you describe? This crawl was only a medium-sized test. We hope to execute a much larger crawl over the next few days.

I was unable to even get the indexing phase started; I would get a timeout right at the beginning. I tried increasing the ipc.client.timeout from 5 minutes to 10 minutes, but that didn't help. In desperation, I increased it to 30 minutes and went to walk the dogs. As it turned out, it apparently took 14 minutes for it to "compute the splits". The job is still running (34% complete). Thus, it does seem like Doug was right about this being the problem.

This is actually the third time we've increased ipc.client.timeout to solve crawl problems. I wonder whether we should just set it to an hour or two until the problem in the code is fixed. I guess I could also look into fixing the problem myself :)
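
For anyone else who needs to bump it: ipc.client.timeout is just a
property (in milliseconds) in conf/nutch-site.xml, which overrides
nutch-default.xml.  A quick way to see what a machine will actually
pick up:

grep -A 2 ipc.client.timeout conf/nutch-default.xml conf/nutch-site.xml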

Thanks for the help, Doug!

- Chris
--
------------------------
Chris Schneider
TransPac Software, Inc.
[hidden email]
------------------------

Re: Recovering from Socket closed

Doug Cutting
Chris Schneider wrote:
> I was unable to even get the indexing phase started; I would get a timeout right at the beginning. I tried increasing the ipc.client.timeout from 5 minutes to 10 minutes, but that didn't help. In desperation, I increased it to 30 minutes and went to walk the dogs. As it turned out, it apparently took 14 minutes for it to "compute the splits". The job is still running (34% complete). Thus, it does seem like Doug was right about this being the problem.

I have no idea why this takes so long.  We should profile this operation
to figure out what's going on, because it shouldn't take anywhere near that
long.  It should be easy to write a simple program that constructs a
JobConf and InputFormat like those used in this job, and calls
getSplits().  Then profile this as a standalone program to see where the
time is going.  Probably you don't really want to profile something that
takes 14 minutes, so perhaps profile it on a subset of the input.
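
If you don't have a profiler handy, the JDK's built-in hprof agent
should be enough for a first look.  Something like the following,
where SplitTimer stands for whatever little driver class you write and
the classpath is whatever bin/nutch normally builds:

java -Xrunhprof:cpu=samples,depth=10,file=splits.hprof.txt \
    -classpath "<nutch classes and lib jars>" SplitTimer

The CPU SAMPLES section at the bottom of splits.hprof.txt ranks the
methods the samples landed in.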

Doug

Re: Recovering from Socket closed

Andrzej Białecki-2
Doug Cutting wrote:

> Chris Schneider wrote:
>> I was unable to even get the indexing phase started; I would get a
>> timeout right at the beginning. I tried increasing the
>> ipc.client.timeout from 5 minutes to 10 minutes, but that didn't
>> help. In desperation, I increased it to 30 minutes and went to walk
>> the dogs. As it turned out, it apparently took 14 minutes for it to
>> "compute the splits". The job is still running (34% complete). Thus,
>> it does seem like Doug was right about this being the problem.
>
> I have no idea why this takes so long.  We should profile this
> operation to figure out what's going on, because it shouldn't take anywhere
> near that long.  It should be easy to write a simple program that
> constructs a JobConf and InputFormat like those used in this job, and
> calls getSplits().  Then profile this as a standalone program to see
> where the time is going.  Probably you don't really want to profile
> something that takes 14 minutes, so perhaps profile it on a subset of
> the input.

I have an idea - remember the old issue of a MapFile's "index" being
corrupted if the Fetcher was interrupted? Random accesses to such
MapFiles would take ages in that case. Does calculating splits involve
random access to the segments' MapFiles?

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Recovering from Socket closed

Doug Cutting
Andrzej Bialecki wrote:
> I have an idea - remember the old issue of a MapFile's "index" being
> corrupted if the Fetcher was interrupted? Random accesses to such
> MapFiles would take ages in that case. Does calculating splits involve
> random access to the segments' MapFiles?

No, calculating splits just lists directories and then gets the size of
each file.  So this could point to an NDFS name node performance
problem, or an RPC performance problem.  Each file size request is an
RPC, and there could be hundreds or even thousands of input files, but
even thousands of RPCs shouldn't take 14 minutes.

Doug

RE: Recovering from Socket closed

Fuad Efendi
In reply to this post by Chris Schneider-2
Nutch-Dev,
What do you use for JobTracker now?
sun.net.www.http.KeepAliveCache will keep HTTP 1.1 persistent connections
open for reuse. It launches a thread that will close the connection after a
timeout. This thread has a strong reference to your classloader (through a
ProtectionDomain). The thread will eventually die, but if you are doing
rapid redeployment, it could still be a problem. The only known solution is
to add "Connection: close" to your HTTP responses.

>However, it definitely seems as if the JobTracker is still waiting
>for the job to finish (no failed jobs).

>My thought would be to first try stopping the TaskTracker process on
>the slave.


Re: Recovering from Socket closed

Chris Schneider-2
In reply to this post by Doug Cutting
Gang,

>Chris Schneider wrote:
>>I was unable to even get the indexing phase started; I would get a
>>timeout right at the beginning. I tried increasing the
>>ipc.client.timeout from 5 minutes to 10 minutes, but that didn't
>>help. In desperation, I increased it to 30 minutes and went to walk
>>the dogs. As it turned out, it apparently took 14 minutes for it to
>>"compute the splits". The job is still running (34% complete).
>>Thus, it does seem like Doug was right about this being the problem.
>
>I have no idea why this takes so long.  We should profile this
>operation to figure out what's going on, because it shouldn't take
>anywhere near that long.  It should be easy to write a simple
>program that constructs a JobConf and InputFormat like those used in
>this job, and calls getSplits().  Then profile this as a standalone
>program to see where the time is going.  Probably you don't really
>want to profile something that takes 14 minutes, so perhaps profile
>it on a subset of the input.

Everyone should probably be made aware of the strange behavior we see
during indexing, at least for a relatively large number of large
segments (topN=500K, depth=20) with a relatively large crawldb (50M
URLs). Note that this was all performed with ipc.client.timeout set
to 30 minutes.

After launching the indexing job, the web UI shows all of the
TaskTrackers, but the numbers in the "Secs since heartbeat" column
just keep increasing. This goes on for about 10 minutes until the
JobTracker finally loses all of them (and the tasks they were working
on), as is shown in its log:

060210 224115 parsing file:/home/crawler/nutch/conf/nutch-site.xml
060210 225151 Lost tracker 'tracker_37064'
060210 225151 Task 'task_m_4ftk58' has been lost.
060210 225151 Task 'task_m_6ww2ri' has been lost.

...(snip)...

060210 225151 Task 'task_r_y6d190' has been lost.
060210 225151 Lost tracker 'tracker_92921'
060210 225151 Task 'task_m_9p24at' has been lost.

...(etc)...

At this point, the web UI is still up, the job shows 0% complete, and
the TaskTrackers table is empty. It goes on for an hour or so like
this, during which any rational person would probably want to kill
the job and start over.

Don't do this! Keep the faith!!!

About an hour later, the JobTracker magically reestablishes its
connection to the TaskTrackers (which now have new names), as is
shown in its log:

060210 225151 Task 'task_r_yj3y3o' has been lost.
060210 235403 Adding task 'task_m_k9u9a8' to set for tracker 'tracker_85874'
060210 235404 Adding task 'task_m_pijt4q' to set for tracker 'tracker_61888'

...(etc)...

The web UI also shows that the TaskTrackers are back (with their new names).

There's nothing in the TaskTracker logs during the initial 10
minutes; then come a bunch of exiting and closing messages, until the
TaskTrackers finally start "Reinitializing local state":

060210 225403 Stopping server on 50050
060210 230102 Server handler 4 on 50050: exiting

...(snip)...

060210 230105 Server handler 7 on 50050: exiting
060210 232024 Server listener on port 50050: exiting
060210 232403 Stopping server on 50040
060210 234902 Server listener on port 50040: exiting
060210 234925 Server connection on port 50040 from 192.168.1.5: exiting

...(snip)...

060210 235009 Server connection on port 50040 from 192.168.1.10: exiting
060210 235013 Client connection to 192.168.1.4:50040: closing
060210 235014 Client connection to 192.168.1.7:50040: closing
060210 235015 Server connection on port 50040 from 192.168.1.7: exiting
060210 235016 Server handler 0 on 50040: exiting

...(snip)...

060210 235024 Server handler 2 on 50040: exiting
060210 235403 Reinitializing local state
060210 235403 Server listener on port 50050: starting
060210 235403 Server handler 0 on 50050: starting

...(etc)...

During the time that the TaskTrackers are lost, neither the master
nor the slave machines seem to be using much of the CPU or RAM, and
the DataNode logs are quiet. I suppose that it's probably I/O bound
on the master machine, but even that seems mysterious to me. It would
seem particularly inappropriate for the JobTracker to punt the
TaskTrackers because the master was too busy to listen for their
heartbeats.

At any rate, once the TaskTrackers go through the "Reinitializing
local state" thing, the indexing job seems to proceed normally, and
it eventually completes with no errors.

FYI,

- Chris
--
------------------------
Chris Schneider
TransPac Software, Inc.
[hidden email]
------------------------