Deprecated ... damaged?


Deprecated ... damaged?

maha-2
Hi everyone,

  Using Hadoop-0.20.2, I'm trying to use MultiFileInputFormat which is supposed to put each file from the input directory in a SEPARATE split. That way the number of maps equals the number of input files. Yet what I get is that each split contains multiple input file paths, so the number of maps is less than the number of input files. Is it because MultiFileInputFormat is deprecated?

  My myMultiFileInputFormat implementation contains only the following:

public RecordReader<LongWritable, Text> getRecordReader(InputSplit split, JobConf job,
                                                        Reporter reporter) throws IOException {
    // Hand the MultiFileSplit straight to my custom record reader.
    return new myRecordReader((MultiFileSplit) split);
}

  Yet in myRecordReader, one split, for example, contains the following:

    /tmp/input/file1:0+300
    /tmp/input/file2:0+199

  instead of each file being in its own split.
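  (In case it helps with debugging: the listing above can be printed from inside the record reader through the MultiFileSplit accessors. A small sketch, assuming the split handed to myRecordReader is the MultiFileSplit passed on by getRecordReader above; printing to System.err is just an arbitrary choice:)

    // Inside myRecordReader(MultiFileSplit split), for example:
    for (int i = 0; i < split.getNumPaths(); i++) {
        // One "path:0+length" line per file packed into this split,
        // e.g. "/tmp/input/file1:0+300" (MultiFileSplit covers whole files, so the offset is 0).
        System.err.println(split.getPath(i) + ":0+" + split.getLength(i));
    }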

    Why? Any clues?

          Thank you,
              Maha
 

Re: Deprecated ... damaged?

maha-2
Actually, I just realized that numSplits can't be forced. Even if I write numSplits = 5, it's just a hint.

Then how come MultiFileInputFormat claims to use MultiFileSplit to put one file per split? Or is that also just a hint?
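(For what it's worth, my understanding of how the hint travels through the old mapred API is roughly the following; this sketch reuses myMultiFileInputFormat from the first post, and the submit-time call in the comment is a paraphrase, not quoted from the Hadoop source:)

    JobConf job = new JobConf();
    job.setNumMapTasks(5);                              // only a request, never a guarantee
    job.setInputFormat(myMultiFileInputFormat.class);

    // At submit time the framework does, roughly:
    //   InputSplit[] splits = job.getInputFormat().getSplits(job, job.getNumMapTasks());
    // and launches splits.length map tasks. The input format has the final say on
    // how many splits come back, which is why numSplits is only a hint.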

Maha

On Dec 15, 2010, at 2:13 AM, maha wrote:

> Hi everyone,
>
>  Using Hadoop-0.20.2, I'm trying to use MultiFileInputFormat which is supposed to put each file from the input directory in a SEPARATE split. [...]


Re: Deprecated ... damaged?

Allen Wittenauer
In reply to this post by maha-2

On Dec 15, 2010, at 2:13 AM, maha wrote:

> Hi everyone,
>
>  Using Hadoop-0.20.2, I'm trying to use MultiFileInputFormat which is supposed to put each file from the input directory in a SEPARATE split.


        Is there some reason you don't just use normal InputFormat with an extremely high min.split.size?
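
(A minimal sketch of that suggestion, using the old mapred API the thread is on; the class name, the /tmp/input path, and the idea of just printing the splits are my own additions, not from Allen's mail. Because FileInputFormat never combines files, pushing the minimum split size above the size of any input file should give exactly one split, and hence one map, per file:)

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;

    public class OneSplitPerFile {                              // placeholder class name
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(OneSplitPerFile.class);
            // Minimum split size larger than any input file: files are never chopped up,
            // and FileInputFormat never merges files, so each file is one split.
            conf.setLong("mapred.min.split.size", Long.MAX_VALUE);
            FileInputFormat.setInputPaths(conf, new Path("/tmp/input"));

            TextInputFormat in = new TextInputFormat();
            in.configure(conf);
            for (InputSplit split : in.getSplits(conf, 1)) {    // the hint of 1 is overridden by min.split.size
                System.out.println(split);                      // expect one "path:0+length" line per file
            }
        }
    }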


Re: Deprecated ... damaged?

maha-2
Hi Allen, and thanks for responding ..

   Your answer actually gave me another clue: I set numSplits = numFiles*100; in myInputFormat and it worked :D ... Do you think there are side effects of doing that?
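
   (For comparison with the numFiles*100 trick, a sketch of a more explicit route, untested and my own guess at how it would look on 0.20.2: OneFilePerSplitFormat is a made-up name and myRecordReader is the reader from the first post. It subclasses MultiFileInputFormat and overrides getSplits to build exactly one MultiFileSplit per file, so the result no longer depends on how the numSplits hint gets rounded:)

    import java.io.IOException;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MultiFileInputFormat;
    import org.apache.hadoop.mapred.MultiFileSplit;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;

    public class OneFilePerSplitFormat extends MultiFileInputFormat<LongWritable, Text> {

        @Override
        public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
            FileStatus[] files = listStatus(job);        // every file under the input paths
            InputSplit[] splits = new InputSplit[files.length];
            for (int i = 0; i < files.length; i++) {
                // One file per split: a single path covering the whole file.
                splits[i] = new MultiFileSplit(job,
                                               new Path[] { files[i].getPath() },
                                               new long[] { files[i].getLen() });
            }
            return splits;                               // the numSplits hint is simply ignored
        }

        @Override
        public RecordReader<LongWritable, Text> getRecordReader(InputSplit split, JobConf job,
                                                                Reporter reporter) throws IOException {
            return new myRecordReader((MultiFileSplit) split);   // reader from the first post
        }
    }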

   Thank you,

       Maha

On Dec 15, 2010, at 12:16 PM, Allen Wittenauer wrote:

>
> Is there some reason you don't just use normal InputFormat with an extremely high min.split.size?