[jira] Created: (HADOOP-1485) Metrics should be there for reporting shuffle failures/successes


[jira] Created: (HADOOP-1485) Metrics should be there for reporting shuffle failures/successes

Hudson (Jira)
Metrics should be there for reporting shuffle failures/successes
----------------------------------------------------------------

                 Key: HADOOP-1485
                 URL: https://issues.apache.org/jira/browse/HADOOP-1485
             Project: Hadoop
          Issue Type: Improvement
          Components: mapred
            Reporter: Devaraj Das
            Assignee: Devaraj Das
             Fix For: 0.14.0


It would be nice to have metrics for the shuffle phase that report the failures and successes of the fetches. This would aid in performance tests and in debugging the shuffle.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-1485) Metrics should be there for reporting shuffle failures/successes

Hudson (Jira)

     [ https://issues.apache.org/jira/browse/HADOOP-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HADOOP-1485:
--------------------------------

    Attachment: shuffle-metrics.patch

Attached is a patch. It adds two new metrics, fetch_successes and fetch_failures. All reporting is done using the Updater interface (implemented in ReduceTask.java as part of this patch).


[jira] Updated: (HADOOP-1485) Metrics should be there for reporting shuffle failures/successes

Hudson (Jira)

     [ https://issues.apache.org/jira/browse/HADOOP-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HADOOP-1485:
--------------------------------

    Attachment: 1485.1.patch

Attached is a new patch with more detailed metrics reporting. I am still testing it, but I would appreciate reviews, especially on which metrics would help us monitor the shuffle better.


[jira] Commented: (HADOOP-1485) Metrics should be there for reporting shuffle failures/successes

Hudson (Jira)

    [ https://issues.apache.org/jira/browse/HADOOP-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507489 ]

David Bowen commented on HADOOP-1485:
-------------------------------------


Reviewing 1485.1.patch.

Having two classes named ShuffleMetrics is confusing.  Please rename at least one of them, and add per-class comments explaining their purposes.

TaskTracker.ShuffleMetrics:

   * shuffle_handler_busy_percent seems to be an absolute value, i.e. it should be using setMetric rather than incrMetric.  Also, shuffle_failed_outputs and shuffle_success_outputs seem to be relative values, and so should be using incrMetric rather than setMetric.
   * It may be an unnecessary optimization, but it couldn't hurt to move the shuffleMetricsRecord.update call out of the synchronized block.  update() has to do a little bit of work, and there's no need to be holding the lock.
   * MapOutputServlet is missing indentation under the first "try {".
   * The final finally may need to call shuffleMetrics.update.

ReduceTask.ReduceCopier.ShuffleMetrics:

   * I think incrMetric should be used for shuffle_failed_fetches and shuffle_success_fetches.
   * Same comment about moving the shuffleMetrics.update() call out of the synchronized block.
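
The difference between absolute and relative values in the points above can be shown with a small self-contained sketch. SketchRecord is a hypothetical stand-in written for this illustration, not Hadoop's MetricsRecord; it assumes only that setMetric overwrites the stored value while incrMetric adds a delta to it.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a metrics record; not the Hadoop class.
class SketchRecord {
  private final Map<String, Integer> values = new HashMap<>();

  // Absolute value: overwrite whatever was stored before.
  void setMetric(String name, int value) {
    values.put(name, value);
  }

  // Relative value: add a delta to whatever was stored before.
  void incrMetric(String name, int delta) {
    values.merge(name, delta, Integer::sum);
  }

  int get(String name) {
    return values.getOrDefault(name, 0);
  }
}

class SetVersusIncrDemo {
  public static void main(String[] args) {
    SketchRecord record = new SketchRecord();

    // A busy percentage is a snapshot of the current state, so it is set.
    record.setMetric("shuffle_handler_busy_percent", 40);
    record.setMetric("shuffle_handler_busy_percent", 25);

    // Failed sends arrive as events, so each one is added as a delta.
    record.incrMetric("shuffle_failed_outputs", 1);
    record.incrMetric("shuffle_failed_outputs", 2);

    System.out.println(record.get("shuffle_handler_busy_percent")); // prints 25
    System.out.println(record.get("shuffle_failed_outputs"));       // prints 3
  }
}
```

Using incrMetric on the percentage would have accumulated 65, and using setMetric on the counter would have kept only the last delta, which is the mix-up the review describes.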





[jira] Updated: (HADOOP-1485) Metrics should be there for reporting shuffle failures/successes

Hudson (Jira)

     [ https://issues.apache.org/jira/browse/HADOOP-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HADOOP-1485:
--------------------------------

    Attachment: 1485.1.patch

Thanks, David, for the review. It looks like I made a couple of careless copy/paste errors in my previous patch. This patch fixes those and the other issues pointed out, and is also up to date with the trunk.
Last time I forgot to list the metrics that I added for the shuffle phase.
The shuffle metrics are reported by the TaskTracker and the ReduceTask.
The TaskTracker side is handled by a class called ShuffleServerMetrics and it reports the following metrics:
   (a) shuffle_handler_busy_percent  [this tells us how busy the servlet handler is]
   (b) shuffle_output_bytes [the number of map output bytes read from map output files]
   (c) shuffle_failed_outputs [the number of map output sends that failed]
   (d) shuffle_success_outputs [the number of map output sends that succeeded from the server's point of view]
   These metrics are tagged with the "sessionId" (there is little to gain by tagging them with something like "user", since the tasktracker can serve outputs for maps belonging to different jobs and different users concurrently).

The ReduceTask side is handled by a class called ShuffleClientMetrics and it reports the following metrics:
   (a) shuffle_fetchers_busy_percent [this tells us how busy the map output copier subsystem is]
   (b) shuffle_input_bytes [the number of map output bytes read off the wire]
   (c) shuffle_failed_fetches [the number of failed fetches]
   (d) shuffle_success_fetches [the number of successful fetches]
   These metrics are tagged with "user", "jobName", "jobId", "taskId", "sessionId".
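
As a rough illustration of how the reduce-side metrics listed above could be collected and pushed out periodically, here is a self-contained sketch. SketchUpdater and SketchShuffleClientMetrics are hypothetical names invented for this example, the Map sink stands in for a real metrics record, and tagging is not modeled; this is not the code in the patch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical callback that a metrics context would invoke once per period.
interface SketchUpdater {
  void doUpdates(Map<String, Long> sink);
}

// Hypothetical reduce-side collector; only the metric names come from the list above.
class SketchShuffleClientMetrics implements SketchUpdater {
  private final int numCopiers;
  private long inputBytes = 0;
  private long failedFetches = 0;
  private long successFetches = 0;
  private int threadsBusy = 0;

  SketchShuffleClientMetrics(int numCopiers) {
    this.numCopiers = numCopiers;
  }

  // Called by copier threads, hence synchronized.
  synchronized void addInputBytes(long n) { inputBytes += n; }
  synchronized void failedFetch() { failedFetches++; }
  synchronized void successFetch() { successFetches++; }
  synchronized void threadBusy() { threadsBusy++; }
  synchronized void threadFree() { threadsBusy--; }

  @Override
  public synchronized void doUpdates(Map<String, Long> sink) {
    // Counters are relative: add the deltas seen since the last period.
    sink.merge("shuffle_input_bytes", inputBytes, Long::sum);
    sink.merge("shuffle_failed_fetches", failedFetches, Long::sum);
    sink.merge("shuffle_success_fetches", successFetches, Long::sum);
    // The busy percentage is absolute: overwrite with the current snapshot.
    long busyPercent = numCopiers == 0 ? 0L : 100L * threadsBusy / numCopiers;
    sink.put("shuffle_fetchers_busy_percent", busyPercent);
    // Reset the deltas so the next period starts from zero.
    inputBytes = 0;
    failedFetches = 0;
    successFetches = 0;
  }
}

class ShuffleClientMetricsDemo {
  public static void main(String[] args) {
    SketchShuffleClientMetrics metrics = new SketchShuffleClientMetrics(4);
    metrics.threadBusy();
    metrics.successFetch();
    metrics.successFetch();
    metrics.failedFetch();
    metrics.addInputBytes(1000L);

    Map<String, Long> sink = new LinkedHashMap<>();
    metrics.doUpdates(sink);
    System.out.println(sink);
  }
}
```

A server-side collector for the TaskTracker metrics would follow the same shape, with the handler count in place of the copier count.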


[jira] Commented: (HADOOP-1485) Metrics should be there for reporting shuffle failures/successes

Hudson (Jira)

    [ https://issues.apache.org/jira/browse/HADOOP-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507903 ]

David Bowen commented on HADOOP-1485:
-------------------------------------


Please scratch my comment above about moving the update calls outside of the sync blocks.  That pattern works in other places, where we use a timer callback to do the update (so only the timer thread calls update).  But in this case, multiple threads are calling update, so it is necessary to keep the updates inside the sync blocks.  My apologies for giving bad advice.
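
The point about keeping update inside the sync block can be sketched as follows. StagedRecord is a hypothetical record invented for this example and is assumed not to be thread-safe on its own; the sketch shows several threads each incrementing a metric and calling update under one shared lock.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical record that is NOT thread-safe by itself: incrMetric stages a
// delta and update() flushes the staged deltas into the published totals.
class StagedRecord {
  private final Map<String, Integer> staged = new HashMap<>();
  private final Map<String, Integer> published = new HashMap<>();

  void incrMetric(String name, int delta) {
    staged.merge(name, delta, Integer::sum);
  }

  void update() {
    for (Map.Entry<String, Integer> e : staged.entrySet()) {
      published.merge(e.getKey(), e.getValue(), Integer::sum);
    }
    staged.clear();
  }

  int total(String name) {
    return published.getOrDefault(name, 0);
  }
}

class SyncUpdateDemo {
  // Runs the given number of threads; each records perThread successful fetches.
  static int run(int threads, int perThread) {
    StagedRecord record = new StagedRecord();
    Object lock = new Object();
    Thread[] workers = new Thread[threads];
    for (int i = 0; i < threads; i++) {
      workers[i] = new Thread(() -> {
        for (int j = 0; j < perThread; j++) {
          // Both calls stay under the lock because every worker calls update().
          synchronized (lock) {
            record.incrMetric("shuffle_success_fetches", 1);
            record.update();
          }
        }
      });
      workers[i].start();
    }
    try {
      for (Thread t : workers) {
        t.join();
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for workers", e);
    }
    return record.total("shuffle_success_fetches");
  }

  public static void main(String[] args) {
    System.out.println(run(8, 1000)); // prints 8000
  }
}
```

If update() were moved outside the synchronized block here, two workers could run it at the same time on the unsynchronized maps, which is the situation the comment above warns about.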





[jira] Updated: (HADOOP-1485) Metrics should be there for reporting shuffle failures/successes

Hudson (Jira)

     [ https://issues.apache.org/jira/browse/HADOOP-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HADOOP-1485:
--------------------------------

    Attachment:     (was: 1485.1.patch)


[jira] Updated: (HADOOP-1485) Metrics should be there for reporting shuffle failures/successes

Hudson (Jira)

     [ https://issues.apache.org/jira/browse/HADOOP-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HADOOP-1485:
--------------------------------

    Attachment: 1485.1.patch

Ok, here is a new patch with some changes to do with implementing the Updater interface.


[jira] Updated: (HADOOP-1485) Metrics should be there for reporting shuffle failures/successes

Hudson (Jira)

     [ https://issues.apache.org/jira/browse/HADOOP-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HADOOP-1485:
--------------------------------

    Status: Patch Available  (was: Open)


[jira] Commented: (HADOOP-1485) Metrics should be there for reporting shuffle failures/successes

Hudson (Jira)

    [ https://issues.apache.org/jira/browse/HADOOP-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508108 ]

Hadoop QA commented on HADOOP-1485:
-----------------------------------

+1

http://issues.apache.org/jira/secure/attachment/12360547/1485.1.patch applied and successfully tested against trunk revision r550635.

Test results:   http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/332/testReport/
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/332/console


[jira] Commented: (HADOOP-1485) Metrics should be there for reporting shuffle failures/successes

Hudson (Jira)

    [ https://issues.apache.org/jira/browse/HADOOP-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508295 ]

David Bowen commented on HADOOP-1485:
-------------------------------------


+1.  Code reviewed.

A small point: in the TaskTracker constructor, it would be preferable to initialize the workerThreads field before the shuffleServerMetrics field, since the callback to ShuffleServerMetrics.doUpdates could otherwise occur before workerThreads is initialized.  I don't think that this would do any harm as the code stands, but generally speaking it would be better not to register the callback until the TaskTracker instance is otherwise initialized, to reduce the likelihood of future bugs.
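
The general hazard described here (not a claim about the actual TaskTracker code, which the comment says is harmless as it stands) can be shown with a small sketch. EagerRegistry and SketchTracker are hypothetical; the registry invokes the callback immediately on registration to simulate a timer that fires at the worst possible moment.

```java
// Hypothetical callback interface, standing in for a periodic metrics updater.
interface PeriodicCallback {
  void doUpdates();
}

// Hypothetical registry that fires the callback as soon as it is registered,
// which simulates a timer thread running before the constructor has finished.
class EagerRegistry {
  void register(PeriodicCallback callback) {
    callback.doUpdates();
  }
}

class SketchTracker implements PeriodicCallback {
  private int workerThreads;              // holds 0 until the constructor assigns it
  private int observedWorkerThreads = -1; // what the callback saw when it ran

  SketchTracker(EagerRegistry registry, int workerThreads, boolean registerFirst) {
    if (registerFirst) {
      // Risky order: the callback can run while workerThreads is still 0.
      registry.register(this);
      this.workerThreads = workerThreads;
    } else {
      // Preferred order: finish initializing, then register the callback.
      this.workerThreads = workerThreads;
      registry.register(this);
    }
  }

  @Override
  public void doUpdates() {
    observedWorkerThreads = workerThreads;
  }

  int observed() {
    return observedWorkerThreads;
  }
}

class InitOrderDemo {
  public static void main(String[] args) {
    EagerRegistry registry = new EagerRegistry();
    System.out.println(new SketchTracker(registry, 40, true).observed());  // prints 0
    System.out.println(new SketchTracker(registry, 40, false).observed()); // prints 40
  }
}
```

A callback that divides by workerThreads to compute a percentage would be the kind of future bug that registering last avoids.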




[jira] Updated: (HADOOP-1485) Metrics should be there for reporting shuffle failures/successes

Hudson (Jira)

     [ https://issues.apache.org/jira/browse/HADOOP-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HADOOP-1485:
--------------------------------

    Attachment: 1485.2.patch

bq.  A small point: in the TaskTracker constructor, it would be preferable to initialize the workerThreads field before the shuffleServerMetrics field, since the callback to ShuffleServerMetrics.doUpdates could otherwise occur before workerThreads is initialized.  I don't think that this would do any harm as the code stands, but generally speaking it would be better not to register the callback until the TaskTracker instance is otherwise initialized, to reduce the likelihood of future bugs.

Thanks for pointing this out. Although it is harmless to leave the code as is, what you pointed out is surely a better way of writing the same code. The attached patch has this change.


[jira] Updated: (HADOOP-1485) Metrics should be there for reporting shuffle failures/successes

Hudson (Jira)

     [ https://issues.apache.org/jira/browse/HADOOP-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HADOOP-1485:
--------------------------------

    Status: Open  (was: Patch Available)


[jira] Updated: (HADOOP-1485) Metrics should be there for reporting shuffle failures/successes

Hudson (Jira)

     [ https://issues.apache.org/jira/browse/HADOOP-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HADOOP-1485:
--------------------------------

    Status: Patch Available  (was: Open)

Re-submitting to get it reviewed by Hudson.


[jira] Commented: (HADOOP-1485) Metrics should be there for reporting shuffle failures/successes

Hudson (Jira)

    [ https://issues.apache.org/jira/browse/HADOOP-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508423 ]

Hadoop QA commented on HADOOP-1485:
-----------------------------------

+1

http://issues.apache.org/jira/secure/attachment/12360641/1485.2.patch applied and successfully tested against trunk revision r550952.

Test results:   http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/337/testReport/
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/337/console


[jira] Updated: (HADOOP-1485) Metrics should be there for reporting shuffle failures/successes

Hudson (Jira)

     [ https://issues.apache.org/jira/browse/HADOOP-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated HADOOP-1485:
---------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

I just committed this.  Thanks, Devaraj!


[jira] Commented: (HADOOP-1485) Metrics should be there for reporting shuffle failures/successes

Hudson (Jira)

    [ https://issues.apache.org/jira/browse/HADOOP-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508796 ]

Hudson commented on HADOOP-1485:
--------------------------------

Integrated in Hadoop-Nightly #138 (See [http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/138/])
