Failed Task

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

Failed Task

Ross Boucher
I'm having my job fail on me just before finishing, and after  
processing for a few hours.  All the relevant log info I can find is  
below, and the only thing I can see as potentially being a problem is  
that the available space on one of the devices is pretty low.  
Additionally, 2 map jobs failed because of malformed input, but I  
would think that hadoop would just ignore those two jobs when trying  
to complete the reduce phase.  Does anyone recognize anything in  
these logs that would help me identify the problem?

I had a task fail on me with the following message:

07/09/24 15:12:15 INFO mapred.JobClient: Task Id :  
task_0001_m_004154_1, Status : FAILED
07/09/24 15:12:20 INFO mapred.JobClient: Task Id :  
task_0001_m_010939_1, Status : FAILED
07/09/24 15:12:24 INFO mapred.JobClient: Task Id :  
task_0001_m_004154_2, Status : FAILED
07/09/24 15:12:31 INFO mapred.JobClient: Task Id :  
task_0001_m_010939_2, Status : FAILED
07/09/24 15:12:35 INFO mapred.JobClient:  map 100% reduce 100%
java.io.IOException: Job failed!
         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:
604)
         at SearchCount.main(SearchCount.java:168)
         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
         at sun.reflect.NativeMethodAccessorImpl.invoke
(NativeMethodAccessorImpl.java:39)
         at sun.reflect.DelegatingMethodAccessorImpl.invoke
(DelegatingMethodAccessorImpl.java:25)
         at java.lang.reflect.Method.invoke(Method.java:585)
         at org.apache.hadoop.util.ProgramDriver
$ProgramDescription.invoke(ProgramDriver.java:69)
         at org.apache.hadoop.util.ProgramDriver.driver
(ProgramDriver.java:140)
         at SearchDriver.main(SearchDriver.java:34)
         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
         at sun.reflect.NativeMethodAccessorImpl.invoke
(NativeMethodAccessorImpl.java:39)
         at sun.reflect.DelegatingMethodAccessorImpl.invoke
(DelegatingMethodAccessorImpl.java:25)
         at java.lang.reflect.Method.invoke(Method.java:585)
         at org.apache.hadoop.util.RunJar.main(RunJar.java:155)

The logs for all of the tasktrackers end in

2007-09-24 15:13:34,205 WARN org.apache.hadoop.ipc.Server: IPC Server  
handler 0 on 50050, call getMapCompletionEvents(job_0001, 17180, 50)  
from 17.231.8.30:63506: output error
java.nio.channels.ClosedChannelException
         at sun.nio.ch.SocketChannelImpl.ensureWriteOpen
(SocketChannelImpl.java:125)
         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:
294)
         at  
org.apache.hadoop.ipc.SocketChannelOutputStream.flushBuffer
(SocketChannelOutputStream.java:108)
         at org.apache.hadoop.ipc.SocketChannelOutputStream.write
(SocketChannelOutputStream.java:89)
         at java.io.BufferedOutputStream.flushBuffer
(BufferedOutputStream.java:65)
         at java.io.BufferedOutputStream.flush
(BufferedOutputStream.java:123)
         at java.io.DataOutputStream.flush(DataOutputStream.java:106)
         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:592)

with only the Server handler number differing.  Finally, the logs  
from the jobtracker app for the failed reduce jobs show:

2007-09-24 15:12:36,401 WARN org.apache.hadoop.mapred.ReduceTask:  
task_0001_r_000000_0 copy failed: task_0001_m_017073_0 from 17.231.8.33
2007-09-24 15:12:36,403 WARN org.apache.hadoop.mapred.ReduceTask:  
java.io.IOException: Server returned HTTP response code: 500 for URL:  
http://17.231.8.33:50060/mapOutput?map=task_0001_m_017073_0&reduce=0
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream
(HttpURLConnection.java:1152)
        at org.apache.hadoop.mapred.MapOutputLocation.getFile
(MapOutputLocation.java:206)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier
$MapOutputCopier.copyOutput(ReduceTask.java:680)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier
$MapOutputCopier.run(ReduceTask.java:641)

Though they actually continue a few lines after that:

2007-09-24 15:12:37,393 INFO org.apache.hadoop.mapred.ReduceTask:  
task_0001_r_000000_0 Got 0 new map outputs from tasktracker and 0 map  
outputs from previous failures
2007-09-24 15:12:37,393 INFO org.apache.hadoop.mapred.ReduceTask:  
task_0001_r_000000_0 Got 88 known map output location(s); scheduling...
2007-09-24 15:12:37,394 INFO org.apache.hadoop.mapred.ReduceTask:  
task_0001_r_000000_0 Scheduled 2 of 88 known outputs (0 slow hosts  
and 86 dup hosts)

Thanks.

Ross Boucher
[hidden email]

Reply | Threaded
Open this post in threaded view
|

RE: Failed Task

Devaraj Das
> Additionally, 2 map jobs failed because of malformed input,
> but I would think that hadoop would just ignore those two
> jobs when trying to complete the reduce phase.

That's not true. Under the default settings, a job will fail even if a
single map/reduce task fails. The number of attempts for a task is set to 4
by default (mapred.{map,reduce}.max.attempts), and the job fails if there
happens to be 4 unsuccessful attempts for any task.

You can, however, tweak the mapred.max.{map,reduce}.failures.percent values
to allow for cases where you expect some percent of maps/reduces to fail.
Note that this config item doesn't appear in hadoop-default.xml (although it
should have).

> -----Original Message-----
> From: Ross Boucher [mailto:[hidden email]]
> Sent: Tuesday, September 25, 2007 4:05 AM
> To: [hidden email]
> Subject: Failed Task
>
> I'm having my job fail on me just before finishing, and after
> processing for a few hours.  All the relevant log info I can
> find is below, and the only thing I can see as potentially
> being a problem is  
> that the available space on one of the devices is pretty low.  
> Additionally, 2 map jobs failed because of malformed input,
> but I would think that hadoop would just ignore those two
> jobs when trying to complete the reduce phase.  Does anyone
> recognize anything in these logs that would help me identify
> the problem?
>
> I had a task fail on me with the following message:
>
> 07/09/24 15:12:15 INFO mapred.JobClient: Task Id :  
> task_0001_m_004154_1, Status : FAILED
> 07/09/24 15:12:20 INFO mapred.JobClient: Task Id :  
> task_0001_m_010939_1, Status : FAILED
> 07/09/24 15:12:24 INFO mapred.JobClient: Task Id :  
> task_0001_m_004154_2, Status : FAILED
> 07/09/24 15:12:31 INFO mapred.JobClient: Task Id :  
> task_0001_m_010939_2, Status : FAILED
> 07/09/24 15:12:35 INFO mapred.JobClient:  map 100% reduce 100%
> java.io.IOException: Job failed!
>          at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:
> 604)
>          at SearchCount.main(SearchCount.java:168)
>          at
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>          at sun.reflect.NativeMethodAccessorImpl.invoke
> (NativeMethodAccessorImpl.java:39)
>          at sun.reflect.DelegatingMethodAccessorImpl.invoke
> (DelegatingMethodAccessorImpl.java:25)
>          at java.lang.reflect.Method.invoke(Method.java:585)
>          at org.apache.hadoop.util.ProgramDriver
> $ProgramDescription.invoke(ProgramDriver.java:69)
>          at org.apache.hadoop.util.ProgramDriver.driver
> (ProgramDriver.java:140)
>          at SearchDriver.main(SearchDriver.java:34)
>          at
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>          at sun.reflect.NativeMethodAccessorImpl.invoke
> (NativeMethodAccessorImpl.java:39)
>          at sun.reflect.DelegatingMethodAccessorImpl.invoke
> (DelegatingMethodAccessorImpl.java:25)
>          at java.lang.reflect.Method.invoke(Method.java:585)
>          at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
>
> The logs for all of the tasktrackers end in
>
> 2007-09-24 15:13:34,205 WARN org.apache.hadoop.ipc.Server:
> IPC Server handler 0 on 50050, call
> getMapCompletionEvents(job_0001, 17180, 50) from
> 17.231.8.30:63506: output error
> java.nio.channels.ClosedChannelException
>          at sun.nio.ch.SocketChannelImpl.ensureWriteOpen
> (SocketChannelImpl.java:125)
>          at
> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:
> 294)
>          at
> org.apache.hadoop.ipc.SocketChannelOutputStream.flushBuffer
> (SocketChannelOutputStream.java:108)
>          at org.apache.hadoop.ipc.SocketChannelOutputStream.write
> (SocketChannelOutputStream.java:89)
>          at java.io.BufferedOutputStream.flushBuffer
> (BufferedOutputStream.java:65)
>          at java.io.BufferedOutputStream.flush
> (BufferedOutputStream.java:123)
>          at java.io.DataOutputStream.flush(DataOutputStream.java:106)
>          at org.apache.hadoop.ipc.Server$Handler.run(Server.java:592)
>
> with only the Server handler number differing.  Finally, the
> logs from the jobtracker app for the failed reduce jobs show:
>
> 2007-09-24 15:12:36,401 WARN org.apache.hadoop.mapred.ReduceTask:  
> task_0001_r_000000_0 copy failed: task_0001_m_017073_0 from
> 17.231.8.33
> 2007-09-24 15:12:36,403 WARN org.apache.hadoop.mapred.ReduceTask:  
> java.io.IOException: Server returned HTTP response code: 500
> for URL:  
> http://17.231.8.33:50060/mapOutput?map=task_0001_m_017073_0&reduce=0
> at sun.net.www.protocol.http.HttpURLConnection.getInputStream
> (HttpURLConnection.java:1152)
> at org.apache.hadoop.mapred.MapOutputLocation.getFile
> (MapOutputLocation.java:206)
> at org.apache.hadoop.mapred.ReduceTask$ReduceCopier
> $MapOutputCopier.copyOutput(ReduceTask.java:680)
> at org.apache.hadoop.mapred.ReduceTask$ReduceCopier
> $MapOutputCopier.run(ReduceTask.java:641)
>
> Though they actually continue a few lines after that:
>
> 2007-09-24 15:12:37,393 INFO org.apache.hadoop.mapred.ReduceTask:  
> task_0001_r_000000_0 Got 0 new map outputs from tasktracker
> and 0 map outputs from previous failures
> 2007-09-24 15:12:37,393 INFO org.apache.hadoop.mapred.ReduceTask:  
> task_0001_r_000000_0 Got 88 known map output location(s);
> scheduling...
> 2007-09-24 15:12:37,394 INFO org.apache.hadoop.mapred.ReduceTask:  
> task_0001_r_000000_0 Scheduled 2 of 88 known outputs (0 slow
> hosts and 86 dup hosts)
>
> Thanks.
>
> Ross Boucher
> [hidden email]
>
>