[jira] Commented: (HADOOP-475) The value iterator to reduce function should be clonable

Radim Rehurek (Jira)

    [ https://issues.apache.org/jira/browse/HADOOP-475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508519 ]

Vivek Ratan commented on HADOOP-475:

Making the iterator cloneable seems clear enough.

For the optimization, are you suggesting that users be able to define any number of iterators over the values for a given key, with each iterator based on a particular comparator function that compares two values? If you look at HADOOP-485, we allow users to provide two comparator functions. One works on composite keys and is used when sorting and merging in the Map phase, and when merging Map outputs in the Reduce phase. The other works on the basic key. Together, these two let users dictate the order in which values within a given key reach their reduce function. It seems that what you're asking for is more general: rather than allowing just one ordering of values within a key, users could order the values in many other ways. Is that right?

If so, this can probably be done just as well in user code: collect all the values for a given key, then sort them in multiple ways using different comparators.
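
As a rough illustration of that user-code approach, here is a minimal sketch in plain Java. The class and method names (MultiOrderValues, buffer, sortedCopy) are hypothetical and not part of Hadoop, and String values stand in for whatever value type the reduce function receives. It assumes all values for one key fit in memory; if the framework reuses value objects between calls to next(), each value would also need to be copied before it is buffered.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper, not a Hadoop class.
class MultiOrderValues {
    /** Copies every value from a one-pass iterator into a list that can be re-read. */
    public static <T> List<T> buffer(Iterator<T> values) {
        List<T> out = new ArrayList<>();
        while (values.hasNext()) {
            out.add(values.next());
        }
        return out;
    }

    /** Returns a sorted copy, leaving the buffered list untouched. */
    public static <T> List<T> sortedCopy(List<T> values, Comparator<? super T> order) {
        List<T> copy = new ArrayList<>(values);
        copy.sort(order);
        return copy;
    }

    public static void main(String[] args) {
        // Stand-in for the value iterator passed to reduce().
        Iterator<String> values = List.of("pear", "fig", "banana").iterator();

        List<String> buffered = buffer(values);
        // Two independent orderings of the same values, one per comparator.
        List<String> byAlphabet = sortedCopy(buffered, Comparator.naturalOrder());
        List<String> byLength = sortedCopy(buffered, Comparator.comparingInt(String::length));

        System.out.println(byAlphabet);
        System.out.println(byLength);
    }
}
```

Each sorted copy can be iterated as often as needed, which also covers the nested-iteration case without any change to the framework.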

> The value iterator to reduce function should be clonable
> --------------------------------------------------------
>                 Key: HADOOP-475
>                 URL: https://issues.apache.org/jira/browse/HADOOP-475
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Runping Qi
>            Assignee: Owen O'Malley
> In the current framework, when the user implements the reduce method of Reducer class,
> the user can only iterate through the value iterator once.
> This makes it hard for the user to perform join-like operations within the reduce method.
> To address this problem, one approach is to make the input value iterator cloneable. Then the user can iterate the values in different ways.
> If the iterator can be reset, then the user can perform nested iterations over the data, and thus
> carry out join-like operations.
> The user code in reduce method would be something like:
>                  iterator1 = values.clone();
>                  iterator2 = values.clone();
>                  while (iterator1.hasNext()) {
>                       val1 = iterator1.next();
>                       iterator2.reset();
>                       while (iterator2.hasNext()) {
>                            val2 = iterator2.next();
>                            do something based on val1 and val2
>                            .......................
>                       }
>                  }
> One possible optimization is that if the values are sorted based on a secondary key,
> the reset function can take a secondary key as an argument and reset the iterator to the beginning
> position of the secondary key. It will be very helpful if there is a utility that returns a list of iterators,
> one per secondary key value, from the given iterator:
>                           TreeMap getIteratorsBasedOnSecondaryKey(iterator);
> Each entry in the returned map object is a pair of <secondary key, iterator for the values with the same secondary key>.
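
For comparison, the per-secondary-key grouping described in the quoted issue could be approximated in user code today. The sketch below is only an illustration under stated assumptions: SecondaryKeyGrouping and groupBySecondaryKey are hypothetical names, not an existing Hadoop API; values are modeled as "secondaryKey:payload" strings; and all values for the key are assumed to fit in memory. It returns a list per secondary key rather than an iterator, since a list can hand out fresh iterators on demand.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical helper, not a Hadoop class.
class SecondaryKeyGrouping {
    /** Buffers the values and groups them by the secondary key extracted from each one. */
    public static <K extends Comparable<K>, V> TreeMap<K, List<V>> groupBySecondaryKey(
            Iterator<V> values, Function<? super V, ? extends K> secondaryKey) {
        TreeMap<K, List<V>> groups = new TreeMap<>();
        while (values.hasNext()) {
            V value = values.next();
            groups.computeIfAbsent(secondaryKey.apply(value), k -> new ArrayList<>()).add(value);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Stand-in for the value iterator passed to reduce().
        Iterator<String> values = List.of("b:2", "a:1", "b:3").iterator();

        TreeMap<String, List<String>> groups =
                groupBySecondaryKey(values, v -> v.substring(0, v.indexOf(':')));

        // Entries come back ordered by secondary key; each list can be re-iterated freely.
        for (Map.Entry<String, List<String>> entry : groups.entrySet()) {
            Iterator<String> it = entry.getValue().iterator();
            while (it.hasNext()) {
                System.out.println(entry.getKey() + " -> " + it.next());
            }
        }
    }
}
```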