Reduce jobs being killed

Reduce jobs being killed

alakshman
Hi

My Reduce jobs do not write any data to disk but fire off a network call to
an RPC server with the data. However, all reduce jobs are getting killed with
the following error message:

Task failed to report status for 606 seconds. Killing.
Task failed to report status for 602 seconds. Killing.
Task failed to report status for 600 seconds. Killing.

What might be causing this? How do I start addressing it?

Thanks
A

Re: Reduce jobs being killed

Arun C Murthy
On Thu, Aug 16, 2007 at 07:36:48AM -0700, Phantom wrote:

> Task failed to report status for 606 seconds. Killing.
>
> What might be causing this? How do I start addressing it?

I'd bet your RPC calls are taking too long; hence the reduce task isn't reporting any progress, and after the default 10-minute timeout the TaskTracker kills your reduce task.

A couple of options:
a) Set 'mapred.task.timeout' to a higher value (or zero to disable the timeout entirely).
b) Periodically call Reporter.setStatus(String) or Progressable.progress() from your reducer to tell the TaskTracker that you are alive and kicking; see the sketch below:
   http://lucene.apache.org/hadoop/api/org/apache/hadoop/mapred/Reporter.html#setStatus(java.lang.String)
   http://lucene.apache.org/hadoop/api/org/apache/hadoop/util/Progressable.html#progress()

I'd do (b).
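
For illustration, here's a minimal sketch of (b) against the pre-generics org.apache.hadoop.mapred Reducer interface of that era. RpcPushReducer and sendOverRpc() are hypothetical placeholders for your own class and RPC client; only the Reporter calls are the point:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class RpcPushReducer extends MapReduceBase implements Reducer {

  public void reduce(WritableComparable key, Iterator values,
                     OutputCollector output, Reporter reporter)
      throws IOException {
    long sent = 0;
    while (values.hasNext()) {
      sendOverRpc(key, (Writable) values.next()); // your slow RPC call
      sent++;
      // Either call below resets the TaskTracker's progress timer,
      // so the task is no longer considered hung.
      reporter.progress();
      reporter.setStatus("pushed " + sent + " records for key " + key);
    }
  }

  // Hypothetical stand-in for the poster's RPC client; not part of Hadoop.
  private void sendOverRpc(WritableComparable key, Writable value)
      throws IOException {
    // ...
  }
}

If you go with (a) instead, note that the value is in milliseconds (the default of 600000 is the ~10 minutes you're hitting), so something like conf.setLong("mapred.task.timeout", 0) on your JobConf turns the timeout off.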

Arun
