[jira] Updated: (HADOOP-485) allow a different comparator for grouping keys in calls to reduce


Nick Burch (Jira)

     [ https://issues.apache.org/jira/browse/HADOOP-485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tahir Hashmi updated HADOOP-485:
--------------------------------

    Attachment: 485.patch

Added javadoc documentation for get and set methods in JobConf.

> allow a different comparator for grouping keys in calls to reduce
> -----------------------------------------------------------------
>
>                 Key: HADOOP-485
>                 URL: https://issues.apache.org/jira/browse/HADOOP-485
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.5.0
>            Reporter: Owen O'Malley
>         Assigned To: Tahir Hashmi
>         Attachments: 485.patch, 485.patch, 485.patch, Hadoop-485-pre.patch, TestUserValueGrouping.java.patch
>
>
> Some algorithms require that the values passed to reduce be sorted in a particular order, but extending the key with the additional fields causes them to be handled by different calls to reduce. (The user then has to collect the values until they detect a "real" key change and only then process them.)
> It would be much easier if the framework let you define a second comparator that did the grouping of values for reduces. So if your reduce inputs look like:
> A1, V1
> A2, V2
> A3, V3
> B1, V4
> B2, V5
> instead of getting calls to reduce that look like:
> reduce(A1, {V1}); reduce(A2, {V2}); reduce(A3, {V3}); reduce(B1, {V4}); reduce(B2, {V5});
> you could define the grouping comparator to just compare the letters and end up with:
> reduce(A1, {V1,V2,V3}); reduce(B1, {V4,V5});
> which is the desired outcome. Note that this assumes that the "extra" part of the key is just for sorting because the reduce will only see the first representative of each equivalence class.
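
The grouping described above can be sketched in plain Java without any Hadoop classes. This is only an illustration under stated assumptions: `GroupingSketch`, `reduceCalls`, `FULL_KEY` and `LETTER_ONLY` are hypothetical names, not part of the attached patch. Records are sorted with one comparator, and a new reduce call starts only when a second comparator says the key differs from the first key of the current group.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for reduce-side grouping; not Hadoop code. */
class GroupingSketch {

    /** Sort comparator: compares the whole key, e.g. "A1" < "A2" < "B1". */
    static final Comparator<String> FULL_KEY = Comparator.naturalOrder();

    /** Grouping comparator: compares only the leading letter of the key. */
    static final Comparator<String> LETTER_ONLY =
            (a, b) -> Character.compare(a.charAt(0), b.charAt(0));

    /**
     * Sorts the records by key with sortCmp, then emits one "reduce call"
     * per run of keys that groupCmp considers equal. Each call is labelled
     * with the first key of its run, as in the issue description.
     */
    static List<String> reduceCalls(List<Map.Entry<String, String>> input,
                                    Comparator<String> sortCmp,
                                    Comparator<String> groupCmp) {
        List<Map.Entry<String, String>> sorted = new ArrayList<>(input);
        sorted.sort(Map.Entry.comparingByKey(sortCmp));

        List<String> calls = new ArrayList<>();
        int i = 0;
        while (i < sorted.size()) {
            String firstKey = sorted.get(i).getKey();
            List<String> values = new ArrayList<>();
            // Consume every record whose key groups with the first key.
            while (i < sorted.size()
                    && groupCmp.compare(firstKey, sorted.get(i).getKey()) == 0) {
                values.add(sorted.get(i).getValue());
                i++;
            }
            calls.add("reduce(" + firstKey + ", {" + String.join(",", values) + "})");
        }
        return calls;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> input = List.of(
                Map.entry("A1", "V1"), Map.entry("A2", "V2"), Map.entry("A3", "V3"),
                Map.entry("B1", "V4"), Map.entry("B2", "V5"));

        // Grouping by the full key: one call per record.
        System.out.println(reduceCalls(input, FULL_KEY, FULL_KEY));
        // prints [reduce(A1, {V1}), reduce(A2, {V2}), reduce(A3, {V3}), reduce(B1, {V4}), reduce(B2, {V5})]

        // Grouping by letter only: values stay sorted by the full key.
        System.out.println(reduceCalls(input, FULL_KEY, LETTER_ONLY));
        // prints [reduce(A1, {V1,V2,V3}), reduce(B1, {V4,V5})]
    }
}
```

The sketch also shows the caveat from the description: each call is labelled with the first key of its group, so the A2 and A3 keys are never seen by the reduce.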

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Re: [jira] Updated: (HADOOP-485) allow a different comparator for grouping keys in calls to reduce

Nigel Daley

On Apr 30, 2007, at 7:11 AM, Tahir Hashmi (JIRA) wrote:

>
>      [ https://issues.apache.org/jira/browse/HADOOP-485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Tahir Hashmi updated HADOOP-485:
> --------------------------------
>
>     Attachment: 485.patch
>
> Added javadoc documentation for get and set methods in JobConf.

It would be good to document what happens if null is passed into the
setter method and what the getter method returns when no comparator
has been set. Also, if "mapred.output.value.groupfn.class" is intended
to be a public property, then I think the javadoc should mention that
these methods control its value.
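
As a rough illustration of the kind of javadoc being asked for, here is a self-contained sketch. Only the property name comes from this thread. The class `GroupingConfSketch`, its method names, and both behaviours shown (rejecting null in the setter, falling back to the sort comparator in the getter) are assumptions made for the example, not what the patch necessarily does.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical stand-in for JobConf; not the class touched by the patch. */
class GroupingConfSketch {
    /** Property name taken from the thread. */
    static final String GROUP_FN_KEY = "mapred.output.value.groupfn.class";

    private final Map<String, String> props = new HashMap<>();

    /**
     * Sets the comparator class used to group keys into a single reduce call.
     * Stores the class name under "mapred.output.value.groupfn.class".
     *
     * @param cls the comparator class; must not be null (assumed contract)
     * @throws NullPointerException if cls is null
     */
    void setGroupingComparatorClass(Class<? extends Comparator<String>> cls) {
        Objects.requireNonNull(cls, "grouping comparator class must not be null");
        props.put(GROUP_FN_KEY, cls.getName());
    }

    /**
     * Returns a new instance of the class named by
     * "mapred.output.value.groupfn.class".
     *
     * @param sortComparator returned as-is when the property is unset
     *                       (assumed fallback)
     * @throws IllegalStateException if the class cannot be instantiated
     */
    @SuppressWarnings("unchecked")
    Comparator<String> getGroupingComparator(Comparator<String> sortComparator) {
        String name = props.get(GROUP_FN_KEY);
        if (name == null) {
            return sortComparator;
        }
        try {
            return (Comparator<String>) Class.forName(name)
                    .getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot instantiate " + name, e);
        }
    }

    /** Example grouping comparator: compares only the first character. */
    static class FirstLetter implements Comparator<String> {
        @Override
        public int compare(String a, String b) {
            return Character.compare(a.charAt(0), b.charAt(0));
        }
    }

    public static void main(String[] args) {
        GroupingConfSketch conf = new GroupingConfSketch();
        Comparator<String> sort = Comparator.naturalOrder();

        // Unset property: the getter hands back the sort comparator.
        System.out.println(conf.getGroupingComparator(sort) == sort); // prints true

        // After the setter: "A1" and "A2" fall into the same group.
        conf.setGroupingComparatorClass(FirstLetter.class);
        System.out.println(conf.getGroupingComparator(sort).compare("A1", "A2")); // prints 0
    }
}
```

Whatever the real methods do in these two cases, stating it in the javadoc next to the property name lets users who set the property directly know what to expect.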
