Multiple final reduced outputs


Multiple final reduced outputs

Deepak Diwakar
I have set up a 2-node cluster and run many jobs, including wordcount. In all
the output folders I am getting two mutually exclusive output files,
part-00000 and part-00001, instead of a single output. I expected a merge
into one single output file, which is not happening here.

Could someone point out where I am going wrong?

Thanks & regards
- Deepak Diwakar,

Re: Multiple final reduced outputs

David Pellegrini-2
Perhaps I'm missing some subtlety, but that's what I would expect: 2
reduce tasks -> 2 outputs.  If you need them in one big file, cat them
together.

my 2 cents

David
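(Editorially, the "cat them together" step can be sketched as follows — a minimal sketch in which the directory layout and the key/value contents are made up for illustration; a real `output/` directory would hold whatever records the job wrote.)

```shell
# Simulate a job output directory holding two reducer outputs.
# (The key/value contents here are invented for the demo.)
mkdir -p output
printf 'apple\t3\nbanana\t2\n' > output/part-00000
printf 'mango\t5\nzebra\t1\n' > output/part-00001

# Part files sort lexically, so the glob preserves their order.
cat output/part-0000* > merged.txt
wc -l merged.txt
```

With real job output the same idea is just `cat output/part-* > merged.txt`.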

On 07/28/2010 12:16 PM, Deepak Diwakar wrote:


Re: Multiple final reduced outputs

chaitanya krishna-2
In reply to this post by Deepak Diwakar
Hi Deepak,

AFAIK, the number of output files depends on the number of reduce tasks (I
hope I'm not missing any other factors). So if a single output file is the
requirement, then setting the number of reduce tasks to 1 should work.
Another solution would be to run another job with these output files as
input and merge them.

Hope this helps,
Chaitanya.
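(The single-reducer suggestion can also be applied without recompiling, provided the job's driver goes through ToolRunner/GenericOptionsParser. A sketch assuming a 0.20-era cluster: the jar name, class name, and paths below are hypothetical, and `mapred.reduce.tasks` is the pre-0.21 property name.)

```shell
# Run the job with exactly one reduce task, so it emits a single part-00000.
hadoop jar wordcount.jar WordCount \
  -D mapred.reduce.tasks=1 \
  /user/deepak/input /user/deepak/output
```

Note the trade-off: a single reducer serializes the whole reduce phase, so it can slow large jobs considerably.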

On Thu, Jul 29, 2010 at 12:46 AM, Deepak Diwakar <[hidden email]>wrote:


Re: Multiple final reduced outputs

Harsh J
In reply to this post by David Pellegrini-2
Concatenating them is the easiest way to get the result back as a
single file (each part is grouped/sorted anyway). For files that can't
simply be 'cat'-ed together (headers, etc.), you may run your job with an
explicit number of reducers, or write special tools for such cases,
because otherwise the limited number of reducers may impact the
processing time.

Calling JobConf.setNumReduceTasks(int n) before submitting the job should
do it.

In case you have doubts about what 'merge' really means across the
map, intermediate, and reduce phases, this guide explains it
very well: http://wiki.apache.org/hadoop/HadoopMapReduce

On Thu, Jul 29, 2010 at 12:57 AM, David Pellegrini
<[hidden email]> wrote:




--
Harsh J
www.harshj.com

Re: Multiple final reduced outputs

Deepak Diwakar
Yep, Harsh. I was doing the same; I was just wondering why we don't have an
option at the master to combine them into a single file. That could be a
feature (and if it's already there, please let me know). Similar to setting
a reduce class on the job, we might set a merger/master-combiner class in
the code itself.

Also, thanks David and Chaitanya for your pointers. Actually, I was
wondering more about having a built-in option to merge after collecting
all the reduced outputs.
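(For what it's worth, the HDFS shell does ship a helper close to what is being asked for here: `hadoop fs -getmerge` concatenates every file under an HDFS directory into one local file. A sketch; the paths are hypothetical.)

```shell
# Merge all part files under the job's output directory into one local file.
hadoop fs -getmerge /user/deepak/output merged.txt
```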

Thanks & regards
- Deepak Diwakar,




On 29 July 2010 01:12, Harsh J <[hidden email]> wrote:
