[jira] Created: (HADOOP-1515) MultiFileSplit, MultiFileInputFormat


ASF GitHub Bot (Jira)
MultiFileSplit, MultiFileInputFormat
------------------------------------

                 Key: HADOOP-1515
                 URL: https://issues.apache.org/jira/browse/HADOOP-1515
             Project: Hadoop
          Issue Type: New Feature
          Components: mapred
    Affects Versions: 0.14.0
            Reporter: Enis Soztutar
            Assignee: Enis Soztutar
             Fix For: 0.14.0


An {{InputSplit}} and {{InputFormat}} implementation for jobs that need to read records from many files. The input is partitioned by file. This can be used, for example, to implement {{RecordReader}}s that read one record per file.
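
To illustrate the "one record per file" idea, here is a minimal sketch in plain Java. It does not use the Hadoop API: the class name OneRecordPerFileReader and its methods are hypothetical, it reads from the local filesystem via java.nio, and it assumes the files are small UTF-8 text files. A real {{RecordReader}} would read through Hadoop's FileSystem and produce Writable keys and values instead.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical stand-in for a reader over a set of files: one record per file. */
final class OneRecordPerFileReader {
    private final Iterator<Path> files;

    /** @param filesInSplit the files that make up one split (assumed small) */
    OneRecordPerFileReader(List<Path> filesInSplit) {
        this.files = filesInSplit.iterator();
    }

    /** True while at least one file has not been read yet. */
    boolean hasNext() {
        return files.hasNext();
    }

    /** Returns the entire content of the next file as a single record. */
    String next() throws IOException {
        if (!files.hasNext()) {
            throw new NoSuchElementException("no more files in this split");
        }
        // Assumption: the content is UTF-8 text and fits in memory.
        return new String(Files.readAllBytes(files.next()), StandardCharsets.UTF_8);
    }
}
```

Because the file is the atomic unit, a reader like this never has to deal with a record that straddles a split boundary.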






--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-1515) MultiFileSplit, MultiFileInputFormat

ASF GitHub Bot (Jira)

     [ https://issues.apache.org/jira/browse/HADOOP-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated HADOOP-1515:
----------------------------------

    Attachment: multiFile_v1.0.patch

{{multiFile_v1.0.patch}}

This patch implements two classes: MultiFileSplit and MultiFileInputFormat. Their javadocs are below:

{code}
/**
 * A sub-collection of input files. Unlike FileSplit, a MultiFileSplit
 * does not represent a split of a single file, but a split of the input
 * files into smaller sets. The atomic unit of a split is a file.
 * MultiFileSplit can be used to implement RecordReaders that
 * read one record per file.
 */
public class MultiFileSplit implements InputSplit
{code}

and

{code}
/**
 * An abstract InputFormat that returns MultiFileSplits
 * from its #getSplits(JobConf, int) method. Splits are constructed from
 * the files under the input paths. Each split returned contains nearly
 * equal content length.
 * Subclasses implement #getRecordReader(InputSplit, JobConf, Reporter)
 * to construct RecordReaders for MultiFileSplits.
 */
public abstract class MultiFileInputFormat extends FileInputFormat
{code}

I have successfully tested this implementation as part of a job with more than 15k input files (one record per file) and 2 GB of data.
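
As a rough illustration of how "nearly equal content length" per split can be achieved when the file is the atomic unit, here is a minimal sketch. It is not taken from the patch: the class FilePartitioner and the method partition are hypothetical names, files are identified only by their index into a lengths array, and the strategy (walk the files in order and close a split once the running total reaches that split's share of the total) is just one simple way to do it.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: groups whole files into splits of similar total length. */
final class FilePartitioner {

    /**
     * @param lengths   length in bytes of each input file, in listing order
     * @param numSplits requested number of splits (fewer may be returned)
     * @return for each split, the indices of the files it contains
     */
    static List<List<Integer>> partition(long[] lengths, int numSplits) {
        if (numSplits <= 0) {
            throw new IllegalArgumentException("numSplits must be positive");
        }
        long total = 0;
        for (long len : lengths) {
            total += len;
        }

        List<List<Integer>> splits = new ArrayList<>();
        long cumulative = 0; // bytes assigned so far
        int next = 0;        // index of the next unassigned file
        for (int i = 0; i < numSplits; i++) {
            // Cumulative length that should be reached after split i.
            double target = (double) total * (i + 1) / numSplits;
            boolean last = (i == numSplits - 1);
            List<Integer> split = new ArrayList<>();
            // Files are never cut: each one goes entirely into one split.
            while (next < lengths.length && (cumulative < target || last)) {
                split.add(next);
                cumulative += lengths[next];
                next++;
            }
            if (!split.isEmpty()) {
                splits.add(split); // skip empty splits when files run out
            }
        }
        return splits;
    }
}
```

With file lengths {10, 10, 10, 10} and two requested splits, this groups files 0 and 1 into the first split and files 2 and 3 into the second. A single very large file still ends up alone in its split, so the balance is only as good as the file size distribution allows.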



[jira] Updated: (HADOOP-1515) MultiFileSplit, MultiFileInputFormat

ASF GitHub Bot (Jira)

     [ https://issues.apache.org/jira/browse/HADOOP-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated HADOOP-1515:
----------------------------------

    Status: Patch Available  (was: Open)


[jira] Commented: (HADOOP-1515) MultiFileSplit, MultiFileInputFormat

ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/HADOOP-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506915 ]

Hadoop QA commented on HADOOP-1515:
-----------------------------------

+1

http://issues.apache.org/jira/secure/attachment/12360275/multiFile_v1.0.patch applied and successfully tested against trunk revision r549284.

Test results:   http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/317/testReport/
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/317/console


[jira] Updated: (HADOOP-1515) MultiFileSplit, MultiFileInputFormat

ASF GitHub Bot (Jira)

     [ https://issues.apache.org/jira/browse/HADOOP-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated HADOOP-1515:
---------------------------------

    Status: Open  (was: Patch Available)

This looks good to me.  But can you please add a unit test?  Something like TestSequenceFileInputFormat or TestTextFileInputFormat, that tests the public methods.  It doesn't need to run a job.  Thanks!


[jira] Work started: (HADOOP-1515) MultiFileSplit, MultiFileInputFormat

ASF GitHub Bot (Jira)

     [ https://issues.apache.org/jira/browse/HADOOP-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on HADOOP-1515 started by Enis Soztutar.


[jira] Commented: (HADOOP-1515) MultiFileSplit, MultiFileInputFormat

ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/HADOOP-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507032 ]

Enis Soztutar commented on HADOOP-1515:
---------------------------------------

> But can you please add a unit test?
I'll be looking into this ASAP.


[jira] Updated: (HADOOP-1515) MultiFileSplit, MultiFileInputFormat

ASF GitHub Bot (Jira)

     [ https://issues.apache.org/jira/browse/HADOOP-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated HADOOP-1515:
----------------------------------

    Attachment: multiFile_v1.1.patch

Attaching the patch with the unit test. The test runs in approx. 30 seconds.



[jira] Updated: (HADOOP-1515) MultiFileSplit, MultiFileInputFormat

ASF GitHub Bot (Jira)

     [ https://issues.apache.org/jira/browse/HADOOP-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated HADOOP-1515:
----------------------------------

    Status: Patch Available  (was: In Progress)


[jira] Commented: (HADOOP-1515) MultiFileSplit, MultiFileInputFormat

ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/HADOOP-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507947 ]

Hadoop QA commented on HADOOP-1515:
-----------------------------------

+1

http://issues.apache.org/jira/secure/attachment/12360492/multiFile_v1.1.patch applied and successfully tested against trunk revision r549977.

Test results:   http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/326/testReport/
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/326/console


[jira] Updated: (HADOOP-1515) MultiFileSplit, MultiFileInputFormat

ASF GitHub Bot (Jira)

     [ https://issues.apache.org/jira/browse/HADOOP-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated HADOOP-1515:
---------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

I just committed this.  Thanks, Enis.


[jira] Commented: (HADOOP-1515) MultiFileSplit, MultiFileInputFormat

ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/HADOOP-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508150 ]

Hudson commented on HADOOP-1515:
--------------------------------

Integrated in Hadoop-Nightly #136 (See [http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/136/])
