Sorting values of a key in reduce phase

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

Sorting values of a key in reduce phase

Pallavi Palleti
Hi,
   In reduce phase, with outputValueGroupingComparator, we can sort all keys and then group values of a particular key together and send it to reduce() method. Is there a way to sort values of a particular key efficiently before it reaches to reduce method?

Thanks
Reply | Threaded
Open this post in threaded view
|

Re: Sorting values of a key in reduce phase

Ted Dunning-3


Why would that be necessary?



On 8/6/07 10:12 PM, "novice user" <[hidden email]> wrote:

>
> Hi,
>    In reduce phase, with outputValueGroupingComparator, we can sort all keys
> and then group values of a particular key together and send it to reduce()
> method. Is there a way to sort values of a particular key efficiently before
> it reaches to reduce method?
>
> Thanks

Reply | Threaded
Open this post in threaded view
|

Re: Sorting values of a key in reduce phase

Toby DiPasquale
In reply to this post by Pallavi Palleti
On 8/7/07, novice user <[hidden email]> wrote:
>
> Hi,
>    In reduce phase, with outputValueGroupingComparator, we can sort all keys
> and then group values of a particular key together and send it to reduce()
> method. Is there a way to sort values of a particular key efficiently before
> it reaches to reduce method?

I'm not sure if this is what you want, but Google's MapReduce
framework has the concept of an optional second key parameter for
subsorting of records (saw it in some slides by Jeff Dean). Perhaps
you could integrate this into Hadoop as a patch and submit it?

--
Toby DiPasquale
Reply | Threaded
Open this post in threaded view
|

Re: Sorting values of a key in reduce phase

Owen O'Malley-4
In reply to this post by Pallavi Palleti

On Aug 6, 2007, at 10:12 PM, novice user wrote:

>    In reduce phase, with outputValueGroupingComparator, we can sort  
> all keys
> and then group values of a particular key together and send it to  
> reduce()
> method. Is there a way to sort values of a particular key  
> efficiently before
> it reaches to reduce method?

There are two comparators that are used for sorting for precisely  
this purpose. In particular:

JobConf.getOutputKeyComparator()
JobConf.getOutputValueGroupingComparator()

The first controls the sort and the second is used to control which  
keys are a single call to reduce.

Therefore, if your data has primary key K1 and secondary K2:

class MyKey implements WritableComparable {
   K1 primary;
   K2 secondary;
   ...
}

you make the map output key MyKey and the OutputKeyComparator uses  
both primary and secondary to pick the order. The  
OuputValueGroupingComparator would just compare the primary keys for  
equality. So if your data looked like:

K1(1), K2(1), V1
K1(1), K2(2), V2
K1(2), K2(1), V3
K1(2), K2(2), V4

the records would be sorted as above, but the reduce would see two  
calls once with K1(1) with values V1 and V2 and once with K1(2) with  
values V3 and V4.

-- Owen

PS. The OutputValueGroupingComparator is a bad name. It should be  
OutputKeyGroupingComparator or something.