SequenceFile (Text,Text) becomes plain text


Alejandro Abdelnur-2
I may be missing something silly here.

I have an MR job that generates output of type (Text, Text).

When another MR job consumes that output, it is treated as a plain text
file, so the input is (LongWritable, Text), with the long key being the byte
offset of the line and the text value being the key and value separated by a
tab. My second MR job blows up because it was expecting (Text, Text), and the
key is wrong as well.

Doing a cat of the file, I see it has become a flat file with lines of the
form "key \t value".

How can I force the output of the first MR job to remain a sequence file of
(Text, Text)?

Thxs.

A

Re: SequenceFile (Text,Text) becomes plain text

Dennis Kubes
You need to set the input format of the second job. It defaults to
TextInputFormat, which is why you are seeing the input as text. Use lines
like the ones below in the second job.

secondjob.setInputFormat(SequenceFileInputFormat.class);
secondjob.setInputKeyClass(Text.class);
secondjob.setInputValueClass(Text.class);

Dennis Kubes


Re: SequenceFile (Text,Text) becomes plain text

Bryan A. P. Pendleton
For that to work, the output format of the previous job will have to be set
to SequenceFileOutputFormat.

Note that, unless there are no tab characters in the keys of the output from
the first job, there's no way to read the existing output accurately back
in.
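
A minimal sketch of the ambiguity described above, written in plain Java
with no Hadoop dependency. The class and method names are hypothetical, and
the reader below is only an assumption about how a naive consumer would
split a "key<TAB>value" line; it is not the behaviour of any Hadoop class.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Hypothetical demo class: it only mimics the "key<TAB>value" line layout
// discussed in the thread.
final class TabAmbiguityDemo {

    // Write one record as text: key, a tab, then the value.
    static String writeLine(String key, String value) {
        return key + "\t" + value;
    }

    // Naive reader (an assumption): everything before the first tab is the key.
    static Map.Entry<String, String> readLine(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return new SimpleEntry<>(line, "");
        }
        return new SimpleEntry<>(line.substring(0, tab), line.substring(tab + 1));
    }

    public static void main(String[] args) {
        // A key without tabs survives the round trip.
        Map.Entry<String, String> ok = readLine(writeLine("user42", "clicked"));
        System.out.println(ok.getKey().equals("user42"));

        // A key containing a tab is cut at its own tab, so the original
        // key/value boundary cannot be recovered from the line alone.
        Map.Entry<String, String> bad = readLine(writeLine("user\t42", "clicked"));
        System.out.println(bad.getKey().equals("user\t42"));
    }
}
```

Splitting on the last tab instead only moves the problem to values that
contain tabs, which is why the text output cannot be read back reliably
without some extra convention.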

--
Bryan A. P. Pendleton
Ph: (877) geek-1-bp

Re: SequenceFile (Text,Text) becomes plain text

Owen O'Malley-5

On Feb 2, 2007, at 2:46 PM, Bryan A. P. Pendleton wrote:

> Note that, unless there are no tab characters in the keys of the output
> from the first job, there's no way to read the existing output accurately
> back in.

*Sigh* That asymmetry in Text{In,Out}putFormat has bothered me for a
while now. I think at some point, we should do a TabText{In,Out}putFormat
that looks like:

<key>\t<value>\n with tabs and newlines escaped in the keys and values.

That will give us a symmetric set of text formats. Furthermore, I'd say
that if value == NULL, the tab should be left off.
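
The proposed line layout could be sketched roughly as follows, in plain Java
with no Hadoop dependency. This is not an existing Hadoop class:
TabTextEscaper and its methods are hypothetical names, and using a backslash
as the escape character is an assumption, since the post does not say how
the escaping would be done.

```java
// Hypothetical sketch of a symmetric "<key>\t<value>\n" text layout.
final class TabTextEscaper {

    // Escape backslash, tab and newline so that no raw tab or newline
    // remains inside a key or a value.
    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '\\': out.append("\\\\"); break;
                case '\t': out.append("\\t"); break;
                case '\n': out.append("\\n"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    // Reverse of escape(): a backslash always introduces a two-character escape.
    static String unescape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\' && i + 1 < s.length()) {
                char next = s.charAt(++i);
                switch (next) {
                    case 't': out.append('\t'); break;
                    case 'n': out.append('\n'); break;
                    default:  out.append(next); // covers the escaped backslash
                }
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // One record per line; when the value is null the tab is left off.
    static String formatLine(String key, String value) {
        if (value == null) {
            return escape(key) + "\n";
        }
        return escape(key) + "\t" + escape(value) + "\n";
    }

    // Returns {key, value}; value is null when the line has no tab.
    static String[] parseLine(String line) {
        String body = line.endsWith("\n") ? line.substring(0, line.length() - 1) : line;
        int tab = body.indexOf('\t');
        if (tab < 0) {
            return new String[] { unescape(body), null };
        }
        return new String[] { unescape(body.substring(0, tab)),
                              unescape(body.substring(tab + 1)) };
    }
}
```

Because escaping removes every raw tab and newline from the fields, the
first tab on a line is always the separator, so a reader can split on it
safely. Leaving the tab off for a null value also keeps a null value
distinguishable from an empty one.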

Re: SequenceFile (Text,Text) becomes plain text

Bryan A. P. Pendleton
Yes, it would be nice to fix that at some point.

One possibility is a shadow file that keeps track of the offset of each
key/value in the file (probably using a VInt-encoded difference from the
last value). The existing output would be preserved, but someone reading the
file could use such a "cheat sheet" to reconstitute the proper key/value
sets, without having to do any unescaping of tabs or newlines. And normal
text tools could still do something, albeit possibly led astray by extra
tabs or newlines in the data.
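
A rough sketch of how such a shadow file might store offsets as
variable-length deltas, in plain Java. Everything here is an assumption made
for illustration: OffsetShadowIndex is a hypothetical name, and the 7-bits-
per-byte encoding below is a generic varint, not necessarily the byte layout
that Hadoop's own VInt helpers produce.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: ascending field offsets stored as varint deltas.
final class OffsetShadowIndex {

    // Encode each offset as the difference from the previous one,
    // 7 bits per byte, high bit set when more bytes follow.
    static byte[] encode(long[] offsets) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long previous = 0;
        for (long offset : offsets) {
            long delta = offset - previous;
            if (delta < 0) {
                throw new IllegalArgumentException("offsets must be ascending");
            }
            while ((delta & ~0x7FL) != 0) {
                out.write((int) ((delta & 0x7F) | 0x80));
                delta >>>= 7;
            }
            out.write((int) delta);
            previous = offset;
        }
        return out.toByteArray();
    }

    // Decode the deltas and rebuild the absolute offsets.
    // Assumes the input was produced by encode() and is not truncated.
    static long[] decode(byte[] encoded) {
        List<Long> offsets = new ArrayList<>();
        long previous = 0;
        int i = 0;
        while (i < encoded.length) {
            long delta = 0;
            int shift = 0;
            int b;
            do {
                b = encoded[i++] & 0xFF;
                delta |= (long) (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            previous += delta;
            offsets.add(previous);
        }
        long[] result = new long[offsets.size()];
        for (int j = 0; j < result.length; j++) {
            result[j] = offsets.get(j);
        }
        return result;
    }
}
```

A reader would decode the offsets and then slice the untouched text file at
those positions to recover each key and value. Since neighbouring fields are
usually close together, most deltas should fit in one or two bytes.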


Re: SequenceFile (Text,Text) becomes plain text

Alejandro Abdelnur-2
In reply to this post by Bryan A. P. Pendleton
Yes, I found the problem. It was something dumb: I was not setting the
output format to SequenceFileOutputFormat. Now things work.

Now that things work, I've noticed that the output of an MR job using
SequenceFileOutputFormat is not compressed, but when I create a
SequenceFile.Writer it is compressed by default.

How do I set the MR output to be compressed in the JobConf? I can set
compression for the map output but not for the MR output.

Thxs.

Alejandro

