[jira] Created: (NUTCH-183) MapReduce has a series of problems concerning task-allocation to worker nodes

Steve Loughran (Jira)
MapReduce has a series of problems concerning task-allocation to worker nodes
-----------------------------------------------------------------------------

         Key: NUTCH-183
         URL: http://issues.apache.org/jira/browse/NUTCH-183
     Project: Nutch
        Type: Improvement
 Environment: All
    Reporter: Mike Cafarella
 Attachments: jobtracker.patch

The MapReduce JobTracker is not great at allocating tasks to TaskTracker worker nodes.

Here are the problems:
1) There is no speculative execution of tasks
2) Reduce tasks must wait until all map tasks are completed before doing any work
3) TaskTrackers don't distinguish between Map and Reduce tasks.  Also, the number of
tasks at a single node is limited to some constant.  That means you can get weird deadlock
problems upon machine failure.  The reduces take up all the available execution slots, but they
don't do productive work, because they're waiting for a map task to complete.  Of course, that
map task won't even be started until the reduce tasks finish, so you can see the problem...
4) The JobTracker is so complicated that it's hard to fix any of these.


The right solution is a rewrite of the JobTracker to be a lot more flexible in task handling.
It has to be a lot simpler.  One way to make it simpler is to add an abstraction I'll call
"TaskInProgress".  Jobs are broken into chunks called TasksInProgress.  All the TaskInProgress
objects must be complete, somehow, before the Job is complete.

A single TaskInProgress can be executed by one or more Tasks.  TaskTrackers are assigned Tasks.
If a Task fails, we report it back to the JobTracker, where the TaskInProgress lives.  The TIP can then
decide whether to launch additional Tasks or not.

Speculative execution is handled within the TIP.  It simply launches multiple Tasks in parallel.  The
TaskTrackers have no idea that these Tasks are actually doing the same chunk of work.  The TIP
is complete when any one of its Tasks is complete.
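The TIP idea described above can be sketched roughly as follows.  This is purely an illustration of the abstraction, not the patch's actual API; all class and method names (TaskInProgress, launchAttempt, attemptFailed, attemptSucceeded) are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

// One TaskInProgress ("TIP") per chunk of a Job, executed by one or more
// Task attempts.  The TaskTrackers never know that two attempts are
// duplicates of the same work; only the TIP does.
class TaskInProgress {
    private final String splitId;
    private final List<String> attempts = new ArrayList<>();
    private int nextAttempt = 0;
    private boolean complete = false;

    TaskInProgress(String splitId) { this.splitId = splitId; }

    // Speculative execution: launch another Task for the same chunk of work.
    String launchAttempt() {
        String taskId = splitId + "_attempt" + (nextAttempt++);
        attempts.add(taskId);
        return taskId;
    }

    // A failed Task is reported back to the JobTracker, where the TIP
    // lives; the TIP then decides whether to launch a replacement.
    String attemptFailed(String taskId) {
        attempts.remove(taskId);
        return complete ? null : launchAttempt();  // simple policy: retry until done
    }

    // The TIP is complete as soon as ANY one of its attempts finishes.
    void attemptSucceeded(String taskId) {
        if (attempts.contains(taskId)) complete = true;
    }

    boolean isComplete() { return complete; }
}
```

The point of the sketch is that retry and speculation policy live in one place, so the JobTracker's scheduling loop no longer has to reason about them.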



--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


[jira] Updated: (NUTCH-183) MapReduce has a series of problems concerning task-allocation to worker nodes

Steve Loughran (Jira)
     [ http://issues.apache.org/jira/browse/NUTCH-183?page=all ]

Mike Cafarella updated NUTCH-183:
---------------------------------

    Attachment: jobtracker.patch


  Say!  Here's a patch that implements the JobTracker rewrite.  Please take a look and let me know what you think....



[jira] Commented: (NUTCH-183) MapReduce has a series of problems concerning task-allocation to worker nodes

Steve Loughran (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-183?page=comments#action_12363477 ]

byron miller commented on NUTCH-183:
------------------------------------

As Mr. Burns would say, "eggcelent."  I'll give this a try.  BTW, is it possible to implement functionality that restarts lagging jobs on nodes that have already completed their tasks, like Google does?  For example, if you're 90% done and the last 10 jobs are hung because of bad hardware, slow response, or failure, you could redo those long-running jobs in parallel on alternate nodes and keep whichever copy finishes first.  That way, if you have a huge crawl and certain nodes slow down or fail, their jobs can be rerun on idle nodes to wrap things up, and any dead jobs can be terminated when done.

Hope that makes sense.



[jira] Commented: (NUTCH-183) MapReduce has a series of problems concerning task-allocation to worker nodes

Steve Loughran (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-183?page=comments#action_12363554 ]

Doug Cutting commented on NUTCH-183:
------------------------------------

Byron, that's exactly what Mike means by "speculative execution".



[jira] Commented: (NUTCH-183) MapReduce has a series of problems concerning task-allocation to worker nodes

Steve Loughran (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-183?page=comments#action_12363557 ]

Chris A. Mattmann commented on NUTCH-183:
-----------------------------------------

Guys,

Greg Barish and the folks who worked on the Theseus planning system for information agents at USC did a lot of work on this subject.  Theseus is implemented in Java and, as I recall, includes the capability to perform speculative execution...

http://www.isi.edu/~barish/



[jira] Commented: (NUTCH-183) MapReduce has a series of problems concerning task-allocation to worker nodes

Steve Loughran (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-183?page=comments#action_12363563 ]

byron miller commented on NUTCH-183:
------------------------------------

So that's what they call it :)  Thanks.



[jira] Commented: (NUTCH-183) MapReduce has a series of problems concerning task-allocation to worker nodes

Steve Loughran (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-183?page=comments#action_12363616 ]

Mike Cafarella commented on NUTCH-183:
--------------------------------------


  One more thing:

  You'll see in this patch that ReduceTasks contain a 2D array of map tasks.  They used to
contain a 1D array, one map task for each map-split of the data.

  This 2D business is less than perfect, so let me explain...

  Each Reduce task needs to know when its map task predecessors have completed.
In the old days, this was easy.  There was one map id for each split of the data, so for
k splits, there were k map ids to know about.

  But in a world with speculative execution and early-start of reduce tasks, there could
be multiple possible map tasks that work on the same split of data.  So for each of the
k splits, there could be up to M tasks working on it.  Thus, the reduce task knows about
k * M map ids.  When any one of the M for each split has completed, the reduce task knows
it can move on.

   But this is a little silly.  The JobTracker has a "TaskInProgress" abstraction that represents
the idea of a "split's worth of work".  A single TIP contains M map task ids.  Instead of the
reduce task looking all over for map task ids, it should just deal with TIP ids.  That way,
we're back to a 1D array, and the reduce task code is easier to understand.

  Anyway, it will still work as is.  I'll improve the code in a future patch.  For the moment,
I'll let this patch stand.  I just wanted to let people know...
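The bookkeeping described above can be sketched like this (purely illustrative; the class and method names are hypothetical, not the patch's).  With speculative execution, each of the k splits can have up to M candidate map task ids, so the reduce consults a k-by-M grid; a split is satisfied as soon as any one of its candidates completes.  Tracking one TIP id per split would collapse the same check back to a 1D lookup:

```java
import java.util.Set;

// A reduce task's view of its map predecessors under speculative
// execution: mapIds[split][attempt] holds the candidate map task ids for
// each split (null entries mean no attempt in that slot), and `completed`
// is the set of map task ids known to have finished.
class ReducePredecessors {
    // A split is ready when ANY one of its (up to M) map attempts completed.
    static boolean splitReady(String[][] mapIds, Set<String> completed, int split) {
        for (String id : mapIds[split]) {
            if (id != null && completed.contains(id)) return true;
        }
        return false;
    }

    // The reduce can proceed only once every one of the k splits is ready.
    static boolean allSplitsReady(String[][] mapIds, Set<String> completed) {
        for (int s = 0; s < mapIds.length; s++) {
            if (!splitReady(mapIds, completed, s)) return false;
        }
        return true;
    }
}
```

With TIP ids, the inner loop over attempts disappears: the reduce would just check a 1D array of k TIP ids against the set of completed TIPs.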




[jira] Commented: (NUTCH-183) MapReduce has a series of problems concerning task-allocation to worker nodes

Steve Loughran (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-183?page=comments#action_12363888 ]

Dominik Friedrich commented on NUTCH-183:
-----------------------------------------

I tested this patch, and jobs seem to run into deadlocks when one node crashes while others are loading map output data from it.  Here are some lines from the tasktracker log on node B, which is trying to copy map data from node A after node A crashed:

060124 181752 task_r_27r56x 0.2212838% reduce > copy > task_m_1ujusx@nodeA:50040
060124 181753 task_r_7jjqag 0.17820947% reduce > copy >
060124 181753 task_r_27r56x 0.2212838% reduce > copy > task_m_1ujusx@nodeA:50040
060124 181754 task_r_7jjqag 0.17820947% reduce > copy >
060124 181754 task_r_27r56x 0.2212838% reduce > copy > task_m_1ujusx@nodeA:50040
060124 181755 task_r_7jjqag 0.17820947% reduce > copy >
060124 181755 task_r_27r56x 0.2212838% reduce > copy > task_m_1ujusx@nodeA:50040
060124 181756 task_r_7jjqag 0.17820947% reduce > copy >
[...]
060124 223510 task_r_27r56x 0.2212838% reduce > copy > task_m_1ujusx@nodeA:50040
060124 223511 task_r_7jjqag 0.17820947% reduce > copy >
060124 223511 task_r_27r56x 0.2212838% reduce > copy > task_m_1ujusx@nodeA:50040
060124 223512 task_r_7jjqag 0.17820947% reduce > copy >
060124 223512 task_r_27r56x 0.2212838% reduce > copy > task_m_1ujusx@nodeA:50040
060124 223513 task_r_7jjqag 0.17820947% reduce > copy >
060124 223513 task_r_27r56x 0.2212838% reduce > copy > task_m_1ujusx@nodeA:50040

Node A was removed from the jobtracker's node list, but it seems that not all of the tasks depending on that node were killed.
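One possible guard against the loop in the log above would be to bound consecutive fetch failures per map output and then declare the output lost, so its TIP can be re-executed elsewhere.  This is only a sketch of the idea, not code from the patch; the class, method names, and threshold are all hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a bounded-retry guard for the reduce copy phase.  After
// MAX_FAILURES consecutive failed fetches of one map output from one
// host, the reduce stops retrying and reports the output lost, so the
// JobTracker can re-run the map's TIP on another node instead of the
// reduce spinning forever against a dead host.
class CopyRetryGuard {
    static final int MAX_FAILURES = 10;
    private final Map<String, Integer> failures = new HashMap<>();

    // Record a failed fetch of mapId from host; returns true once the
    // map output should be declared lost (trigger TIP re-execution).
    boolean fetchFailed(String host, String mapId) {
        String key = host + "/" + mapId;
        int n = failures.merge(key, 1, Integer::sum);
        return n >= MAX_FAILURES;
    }

    // A successful fetch resets the failure count for that map output.
    void fetchSucceeded(String host, String mapId) {
        failures.remove(host + "/" + mapId);
    }
}
```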
