[jira] Created: (HADOOP-1440) JobClient should not sort input-splits


Tim Allison (Jira)
JobClient should not sort input-splits
--------------------------------------

                 Key: HADOOP-1440
                 URL: https://issues.apache.org/jira/browse/HADOOP-1440
             Project: Hadoop
          Issue Type: Improvement
          Components: mapred
    Affects Versions: 0.12.3
         Environment: All
            Reporter: Milind Bhandarkar
             Fix For: 0.14.0


Currently, the JobClient sorts the InputSplits returned by InputFormat in descending order of size, so that the map tasks corresponding to larger input-splits are scheduled for execution before those for smaller ones. However, this causes problems in applications that use -reducer NONE to produce data-sets partitioned the same way as the input.

With -reducer NONE, map task i produces part-i. However, in the typical applications that use -reducer NONE, it should produce a partition that has the same index as the input partition.

(Of course, this requires that each partition should be fed in its entirety to a map, rather than splitting it into blocks, but that is a separate issue.)

Thus, sorting input splits should be either controllable via a configuration variable, or the FileInputFormat should sort the splits and JobClient should honor the order of splits.
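
To make the index mismatch concrete, here is a minimal, self-contained sketch. It does not use any Hadoop classes: `Split` is a hypothetical stand-in for an input split (a path plus a length), and `partNamesAfterSort` only mimics, under that assumption, what happens when splits are sorted by descending length before map task i writes part-i.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SplitSortDemo {
    // Hypothetical stand-in for an input split: a source path and its length in bytes.
    static final class Split {
        final String path;
        final long length;

        Split(String path, long length) {
            this.path = path;
            this.length = length;
        }
    }

    // Sorts a copy of the splits by descending length, then pairs each split
    // with the output name its map task would write (map task i -> part-i).
    static List<String> partNamesAfterSort(List<Split> splits) {
        List<Split> sorted = new ArrayList<>(splits);
        sorted.sort((a, b) -> Long.compare(b.length, a.length));
        List<String> result = new ArrayList<>();
        for (int i = 0; i < sorted.size(); i++) {
            result.add(sorted.get(i).path + " -> part-" + i);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Split> splits = Arrays.asList(
            new Split("in/part-0", 10L),
            new Split("in/part-1", 30L),
            new Split("in/part-2", 20L));
        // The largest input (in/part-1) becomes map task 0 and so writes part-0.
        for (String line : partNamesAfterSort(splits)) {
            System.out.println(line);
        }
    }
}
```

In this toy run no input keeps its own index, which is the behaviour the issue objects to.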

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-1440) JobClient should not sort input-splits

Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/HADOOP-1440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12499999 ]

Runping Qi commented on HADOOP-1440:
------------------------------------


To address the output file name problem associated with the -reducer NONE option, the only change you need to make
is to the value of finalName in the constructor of class DirectMapOutputCollector in MapTask.java:

    public DirectMapOutputCollector(TaskUmbilicalProtocol umbilical,
        JobConf job, Reporter reporter) throws IOException {
      this.umbilical = umbilical;
      this.job = job;
      this.reporter = reporter;
-     String finalName = getTipId();
+     String finalName = job.get("map.input.file") + "_" + getTipId();
      FileSystem fs = FileSystem.get(this.job);

      out = job.getOutputFormat().getRecordWriter(fs, job, finalName, reporter);
    }
This way, the output file names will be in the same order as the input file names.
Of course, you will run into the problem that the file names become longer and longer.
So you actually want to control it in a way like:

    // keepInputOrder and inputWasSplit are placeholder flags, not existing fields
    String finalName = getTipId();
    if (keepInputOrder && !inputWasSplit) {
        finalName = job.get("map.input.file");
    } else if (keepInputOrder) {
        finalName = job.get("map.input.file") + "_" + getTipId();
    }



[jira] Commented: (HADOOP-1440) JobClient should not sort input-splits

Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/HADOOP-1440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12500244 ]

Milind Bhandarkar commented on HADOOP-1440:
-------------------------------------------

Runping,

Your suggestion will break the current contract, right?


[jira] Commented: (HADOOP-1440) JobClient should not sort input-splits

Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/HADOOP-1440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12500245 ]

Milind Bhandarkar commented on HADOOP-1440:
-------------------------------------------

Also, what should map.input.file be if a single map input split is constructed from multiple files?


[jira] Commented: (HADOOP-1440) JobClient should not sort input-splits

Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/HADOOP-1440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12500287 ]

Runping Qi commented on HADOOP-1440:
------------------------------------


Milind,

What contract are you referring to?
The current way of choosing output file names is purely incidental.
The output file names do not bear any meaning, except that they are all distinct.

I don't think we support the case where a single split is constructed from multiple files.
If we do in the future, we can choose any one of them, as long as they are distinct.




[jira] Commented: (HADOOP-1440) JobClient should not sort input-splits

Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/HADOOP-1440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12500755 ]

Owen O'Malley commented on HADOOP-1440:
---------------------------------------

map.input.file is not always defined. In particular, it is only defined if the InputSplit happens to be a FileSplit. It was defined as a convenience for Mappers before we provided them access to the InputSplit directly. Certainly no part of the framework should assume it will be defined.


[jira] Commented: (HADOOP-1440) JobClient should not sort input-splits

Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/HADOOP-1440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12500833 ]

Runping Qi commented on HADOOP-1440:
------------------------------------


Clearly, it makes sense to use map.input.file only if the input split is a file split.
My point about the original issue this Jira tries to address is that it does not sound right for
the user to assume a particular mapping of mapper tasks to input splits.

It is reasonable to expect that the output file name of a mapper task is somehow derived from
the input split. Either the framework does that based on a certain convention, or the framework delegates
to a user plugin (the default would be a plugin that uses the input file name or task id, as appropriate).
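
The plugin idea could look roughly like the sketch below. Everything here is hypothetical: `OutputNamer` is an invented interface, a plain `Map` stands in for the job configuration, and the task id string is made up. The default namer uses the input file name when `map.input.file` is set and falls back to the task id otherwise; trimming the path to its last component is an extra assumption of this sketch, not something proposed in the thread.

```java
import java.util.HashMap;
import java.util.Map;

final class OutputNameDemo {
    // Hypothetical plugin interface: derives an output file name for one map task.
    interface OutputNamer {
        String name(Map<String, String> conf, String taskId);
    }

    // Default behaviour: use the input file name when map.input.file is set,
    // otherwise fall back to the task id alone.
    static final OutputNamer DEFAULT = (conf, taskId) -> {
        String input = conf.get("map.input.file");
        if (input == null) {
            return taskId;
        }
        // Assumption of this sketch: keep only the last path component.
        String base = input.substring(input.lastIndexOf('/') + 1);
        return base + "_" + taskId;
    };

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // No map.input.file set: the name is just the (made-up) task id.
        System.out.println(DEFAULT.name(conf, "tip_0001_m_000003"));
        conf.put("map.input.file", "/data/in/part-00007");
        // With map.input.file set: the input file name is used as a prefix.
        System.out.println(DEFAULT.name(conf, "tip_0001_m_000003"));
    }
}
```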




[jira] Commented: (HADOOP-1440) JobClient should not sort input-splits

Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/HADOOP-1440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12500835 ]

Doug Cutting commented on HADOOP-1440:
--------------------------------------

I'd vote for moving the sort into FileInputFormat, moving InputSplit#getLength() to FileSplit and documenting that the kernel will try to process the splits in the order they're generated.


[jira] Commented: (HADOOP-1440) JobClient should not sort input-splits

Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/HADOOP-1440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12500839 ]

Owen O'Malley commented on HADOOP-1440:
---------------------------------------

I'm ok with moving the sort into FileInputFormat, but I'd like to keep the getLength in the InputSplit so that we can do modeling of required disk space to run the map.


[jira] Assigned: (HADOOP-1440) JobClient should not sort input-splits

Tim Allison (Jira)

     [ https://issues.apache.org/jira/browse/HADOOP-1440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Milind Bhandarkar reassigned HADOOP-1440:
-----------------------------------------

    Assignee: Milind Bhandarkar


[jira] Commented: (HADOOP-1440) JobClient should not sort input-splits

Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/HADOOP-1440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12503017 ]

eric baldeschwieler commented on HADOOP-1440:
---------------------------------------------

I tend to agree with Runping. The framework should reserve the right to execute maps in any order it chooses to. Nailing down execution order will limit our ability to optimize later. Also, why put sorting into user code?

It sounds like the need is to name the reduces, not control their order. So why not address that directly? Perhaps outputs can be numbered according to their original submission order in the -reducer NONE case? This need not pin down execution order.

Sounds like perhaps we should deprecate map.input.file now that a more uniform mechanism exists to get this info?


[jira] Commented: (HADOOP-1440) JobClient should not sort input-splits

Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/HADOOP-1440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12503576 ]

Doug Cutting commented on HADOOP-1440:
--------------------------------------

> It sounds like the need is to name the reduces, not control their order.

Rather, it's useful to be able to name the maps when reduce is disabled and outputs correspond directly to splits.  It also may be useful to be able to determine map order, as the application may know things about their relative costs that the kernel cannot.  We should separate these two notions.  The returned order of splits can only be used to represent one or the other, not both.

Some time ago I'd proposed adding a 'float cost()' method to splits.  This could be used for sorting for performance.  Owen argued that 'long length()' was better, since it permitted space allocation.  Perhaps we need both: we should sort by cost() and potentially constrain task allocation by length().  By default, cost() would be length().
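
As a rough illustration of the cost()/length() idea, the sketch below defines a hypothetical `CostedSplit` interface whose `cost()` defaults to `length()`, and sorts by descending cost. The interface, the helper names, and the numbers are all assumptions made for this example and are not an existing Hadoop API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CostDemo {
    // Hypothetical split interface: length() is the size in bytes, cost() is a
    // relative estimate of processing effort that defaults to length().
    interface CostedSplit {
        long length();

        default float cost() {
            return length();
        }
    }

    // A split that relies on the default cost.
    static CostedSplit plain(long bytes) {
        return () -> bytes;
    }

    // A split whose application-supplied cost differs from its length.
    static CostedSplit withCost(long bytes, float estimate) {
        return new CostedSplit() {
            @Override public long length() { return bytes; }
            @Override public float cost() { return estimate; }
        };
    }

    // Returns the splits ordered by descending cost; the input list is not modified.
    static List<CostedSplit> byDescendingCost(List<CostedSplit> splits) {
        List<CostedSplit> sorted = new ArrayList<>(splits);
        sorted.sort((a, b) -> Float.compare(b.cost(), a.cost()));
        return sorted;
    }

    public static void main(String[] args) {
        List<CostedSplit> splits = Arrays.asList(plain(100L), withCost(10L, 500f), plain(50L));
        // The small but expensive split is scheduled first.
        for (CostedSplit s : byDescendingCost(splits)) {
            System.out.println("length=" + s.length() + " cost=" + s.cost());
        }
    }
}
```

Keeping `length()` separate means a scheduler could still use it for space allocation while ordering tasks by `cost()`.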

Regardless, if sorting is done by the kernel (as it is now) then we should probably use the returned order of splits to determine the output partition when reduce is disabled.  Do we agree on that?


[jira] Commented: (HADOOP-1440) JobClient should not sort input-splits

Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/HADOOP-1440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12503609 ]

Owen O'Malley commented on HADOOP-1440:
---------------------------------------

It is just an easier explanation to users, if the first map returned from getSplits is map-0, the second is map-1, and so on. The problem from my point of view is just that right now the name of the task controls the scheduling of the task. They should be independent of each other.


[jira] Commented: (HADOOP-1440) JobClient should not sort input-splits

Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/HADOOP-1440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12503614 ]

Doug Cutting commented on HADOOP-1440:
--------------------------------------

> The problem from my point of view is just that right now the name of the task controls the scheduling of the task. They should be independent of each other.

Right.  I agree.  So I think that, in the short term, to resolve this issue we should:

1. Use the order returned from getSplits() to determine the map name, and hence the output names when reduce is disabled.

2. Continue to sort by the length of the input to determine task execution order.

Does that make sense?
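
A minimal sketch of these two steps, assuming a hypothetical `Task` class rather than any real Hadoop type: each task keeps the index it had in the order returned by getSplits() (which fixes its output name), while the scheduling order is a separate sort by descending length.

```java
import java.util.ArrayList;
import java.util.List;

final class NameVsScheduleDemo {
    // Hypothetical map task: its index is its position in the getSplits() order.
    static final class Task {
        final int index;
        final long length;

        Task(int index, long length) {
            this.index = index;
            this.length = length;
        }

        // Step 1: the output name depends only on the original split order.
        String outputName() {
            return "part-" + index;
        }
    }

    // Step 2: the execution order is a sort by descending input length,
    // which leaves each task's index (and so its name) unchanged.
    static List<Task> schedule(long[] lengthsInSplitOrder) {
        List<Task> tasks = new ArrayList<>();
        for (int i = 0; i < lengthsInSplitOrder.length; i++) {
            tasks.add(new Task(i, lengthsInSplitOrder[i]));
        }
        tasks.sort((a, b) -> Long.compare(b.length, a.length));
        return tasks;
    }

    public static void main(String[] args) {
        // Split 1 is the largest, so it runs first but still writes part-1.
        for (Task t : schedule(new long[] {10L, 30L, 20L})) {
            System.out.println(t.outputName() + " (" + t.length + " bytes)");
        }
    }
}
```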



[jira] Commented: (HADOOP-1440) JobClient should not sort input-splits

Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/HADOOP-1440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12503622 ]

Runping Qi commented on HADOOP-1440:
------------------------------------


That does not really address the problem this Jira tries to address:

In the -reducer NONE case, the user wants to control two things:

1. Whether the input files are splittable.
2. If the input files are set to be non-splittable, the number of output files must be the same as
the number of input files, and the relative order of the output files must match that of the
corresponding input files.

That is why I proposed to use the input filenames as the prefix for the output filenames.

 


[jira] Commented: (HADOOP-1440) JobClient should not sort input-splits

Tim Allison (Jira)
In reply to this post by Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/HADOOP-1440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12503635 ]

Doug Cutting commented on HADOOP-1440:
--------------------------------------

> That does not really address the problem this Jira tries to address:

I think it does.  Whether the input files are splittable is up to the input format.  If reduce is disabled, then I proposed (above) that the order of the input splits should determine the numbering of output files.  So what's not addressed?

Changing the kernel to base the output file names directly on the input file names would break a number of abstraction boundaries.  But I don't see how this is required.  The list of input files and output files should correspond one-to-one.


[jira] Commented: (HADOOP-1440) JobClient should not sort input-splits

Tim Allison (Jira)
In reply to this post by Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/HADOOP-1440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12503666 ]

Runping Qi commented on HADOOP-1440:
------------------------------------


It addresses the problem only if the order returned from getSplits() agrees with the order of the input file names.
That constraint limits the flexibility of the framework.



[jira] Commented: (HADOOP-1440) JobClient should not sort input-splits

Tim Allison (Jira)
In reply to this post by Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/HADOOP-1440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12504338 ]

Doug Cutting commented on HADOOP-1440:
--------------------------------------

> It addresses the problem only if the order returned from getSplits() agrees with the order of the input file names.

Right.  Under this short-term proposal, if the application is disabling reduce and wishes to align output names with input names, then it must sort the input names.
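To illustrate the short-term approach, here is a minimal sketch of an application sorting its input names so that output numbering lines up with them. It assumes one unsplit input file per map task. The class SortedInputs, the method alignOutputs and the zero-padded "part-" naming are hypothetical and are not taken from the patch or from the Hadoop API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not a Hadoop class.
final class SortedInputs {
    // Sorts the input names and pairs the i-th name with output number i.
    // Assumes each input file becomes exactly one split (no splitting into blocks).
    static Map<String, String> alignOutputs(List<String> inputNames) {
        List<String> sorted = new ArrayList<>(inputNames);
        Collections.sort(sorted); // the application, not the framework, imposes the order
        Map<String, String> outputs = new LinkedHashMap<>();
        for (int i = 0; i < sorted.size(); i++) {
            // Illustrative naming only; the real output name format may differ.
            outputs.put(sorted.get(i), String.format("part-%05d", i));
        }
        return outputs;
    }

    public static void main(String[] args) {
        Map<String, String> outputs =
            alignOutputs(Arrays.asList("c.txt", "a.txt", "b.txt"));
        outputs.forEach((in, out) -> System.out.println(in + " -> " + out));
    }
}
```

If the splits are then handed to the framework in this sorted order, the n-th input name and the n-th output file correspond, which is the alignment described above.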

Alternately and eventually, when HADOOP-1230 is implemented, we could make the InputSplit available to the OutputFormat through the ReduceContext when reduce is disabled.  So we'd add a method OutputContext#getInputSplit() that would return null in normal reduces, but would return the input split when reduce is disabled.
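As a rough sketch of how that longer-term idea might look from the output side: the interfaces below are hypothetical stand-ins written only for illustration. Only the method name getInputSplit() comes from the comment above; InputSplitRef, getPartition() and OutputNaming are assumptions and do not describe any real Hadoop API.

```java
// Hypothetical stand-in for an input split; not Hadoop's InputSplit.
interface InputSplitRef {
    String getPath();
}

// Hypothetical context object, modeled on the proposal above.
interface OutputContext {
    // Proposed behavior: null in a normal reduce,
    // the map's input split when reduce is disabled.
    InputSplitRef getInputSplit();

    // Assumed accessor for the task's partition number.
    int getPartition();
}

final class OutputNaming {
    // Picks an output file name: numbered in the normal case,
    // derived from the input file name when the split is available.
    static String outputName(OutputContext ctx) {
        InputSplitRef split = ctx.getInputSplit();
        if (split == null) {
            return String.format("part-%05d", ctx.getPartition());
        }
        String path = split.getPath();
        // Keep only the last path component.
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        OutputContext mapOnly = new OutputContext() {
            public InputSplitRef getInputSplit() { return () -> "/in/data-b.txt"; }
            public int getPartition() { return 3; }
        };
        System.out.println(outputName(mapOnly));
    }
}
```

With something along these lines, an output format could name files after their inputs without the kernel itself depending on input file names.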



[jira] Updated: (HADOOP-1440) JobClient should not sort input-splits

Tim Allison (Jira)
In reply to this post by Tim Allison (Jira)

     [ https://issues.apache.org/jira/browse/HADOOP-1440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Senthil Subramanian updated HADOOP-1440:
----------------------------------------

    Attachment: HADOOP-1440_1.patch

Patch which implements the solution proposed by Doug:
>> 1. Use the order returned from getSplits() to determine the map name, and hence the output names when reduce is disabled.
>> 2. Continue to sort by the length of the input to determine task execution order.
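The two points above can be sketched roughly as follows. This is not the code from HADOOP-1440_1.patch; SplitInfo, NumberedSplit and SplitOrdering are hypothetical types, and the "part-" name format is illustrative. The idea shown is to record each split's index before sorting, so that the execution order can change without changing the numbering.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for an input split; not Hadoop's InputSplit.
final class SplitInfo {
    final String path;
    final long length;
    SplitInfo(String path, long length) { this.path = path; this.length = length; }
}

// A split together with the index it had in the getSplits() result.
final class NumberedSplit {
    final int index;      // determines the map name and, with reduce disabled, the output name
    final SplitInfo split;
    NumberedSplit(int index, SplitInfo split) { this.index = index; this.split = split; }
    String outputName() { return String.format("part-%05d", index); }
}

final class SplitOrdering {
    // Numbers the splits in the order received, then returns them largest first.
    static List<NumberedSplit> scheduleOrder(List<SplitInfo> splits) {
        List<NumberedSplit> numbered = new ArrayList<>();
        for (int i = 0; i < splits.size(); i++) {
            numbered.add(new NumberedSplit(i, splits.get(i)));
        }
        // List.sort is stable, so equal-length splits keep their original relative order.
        numbered.sort(
            Comparator.comparingLong((NumberedSplit n) -> n.split.length).reversed());
        return numbered;
    }

    public static void main(String[] args) {
        List<SplitInfo> splits = Arrays.asList(
            new SplitInfo("a", 10), new SplitInfo("b", 30), new SplitInfo("c", 20));
        for (NumberedSplit n : scheduleOrder(splits)) {
            System.out.println(n.split.path + " -> " + n.outputName());
        }
    }
}
```

In this sketch the largest split is still scheduled first, but its output name reflects its position in the original getSplits() order rather than its position after sorting.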




[jira] Updated: (HADOOP-1440) JobClient should not sort input-splits

Tim Allison (Jira)
In reply to this post by Tim Allison (Jira)

     [ https://issues.apache.org/jira/browse/HADOOP-1440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Senthil Subramanian updated HADOOP-1440:
----------------------------------------

    Status: Patch Available  (was: Open)
