Bzip2 files as an input to MR job

Georgi Ivanov
Hi guys,
I would like to compress the files on HDFS to save some storage.

As far as I can see, bzip2 is the only format that is splittable (and slow).

The actual files are Avro.

So in my driver class I have:

job.setInputFormatClass(AvroKeyInputFormat.class);

I have a number of jobs processing Avro files, so I would like to
keep the code changes to a minimum.

Is it possible to compress these Avro files with bzip2 and keep the code
of the MR jobs the same (or with little change)?
If it is, please give me some hints, as so far I haven't found any
good resources on the Internet.


Georgi

Re: Bzip2 files as an input to MR job

Niels Basjes
Hi,

You can use gzip-style compression (Avro's deflate codec) inside the Avro files and still have splittable Avro files.
This has to do with the fact that there is a block structure inside an Avro file, and these blocks are compressed individually.
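
To illustrate why compressing block by block keeps a file splittable, here is a minimal sketch using only the JDK. It is a toy model and not the real Avro container layout: the class name, the newline-joined "records" and the block size of 4 are all assumptions made for the example. The point it shows is that every block is deflated on its own, so a reader can decode one block without having read the ones before it.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Toy model of block-level compression. This is NOT the real Avro file layout.
class BlockCompressionSketch {

    // Deflate one block of records independently of all other blocks.
    static byte[] compressBlock(List<String> records) {
        byte[] raw = String.join("\n", records).getBytes(StandardCharsets.UTF_8);
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    // Inflate a single block without touching any other block.
    static List<String> decompressBlock(byte[] block) {
        Inflater inflater = new Inflater();
        inflater.setInput(block);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && inflater.needsInput()) {
                    break; // truncated input; stop instead of spinning
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt block", e);
        } finally {
            inflater.end();
        }
        String text = new String(out.toByteArray(), StandardCharsets.UTF_8);
        return Arrays.asList(text.split("\n"));
    }

    // Cut the record stream into fixed-size groups and compress each group.
    static List<byte[]> writeBlocks(List<String> records, int recordsPerBlock) {
        List<byte[]> blocks = new ArrayList<>();
        for (int i = 0; i < records.size(); i += recordsPerBlock) {
            int end = Math.min(i + recordsPerBlock, records.size());
            blocks.add(compressBlock(records.subList(i, end)));
        }
        return blocks;
    }

    public static void main(String[] args) {
        List<String> records = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            records.add("record-" + i);
        }
        List<byte[]> blocks = writeBlocks(records, 4);
        // A reader assigned only the second block can decode it on its own.
        System.out.println(decompressBlock(blocks.get(1)));
    }
}
```

A file that is compressed as one single gzip stream does not have this property, which is why a plain .gz file cannot be split across mappers while a block-compressed container can.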

I suggest you simply try it.

Niels


On Mon, Sep 22, 2014 at 4:40 PM, Georgi Ivanov <[hidden email]> wrote:
[...]

--
Best regards / Met vriendelijke groeten,

Niels Basjes

Re: Bzip2 files as an input to MR job

Georgi Ivanov
Hi Niels,
Thanks for the reply.
Changing the Avro files is not really an option for me, as it would take a lot of time (I have a lot of them).
The Avro files themselves are already compressed a bit.
But bzip2 still gives 50% compression on a single Avro file.

So what I want is to use bzip2-compressed files as input to my MR jobs.
Bzip2 is splittable.
It should be possible somehow, but I can't seem to find out how at the moment.

On 22.09.2014 17:13, Niels Basjes wrote:
[...]

RE: Bzip2 files as an input to MR job

java8964 java8964
Georgi:

I think you misunderstood the original answer.

If you already use the Avro format, then the files will be splittable. If you want to add compression on top of that, feel free to go ahead.

If you read the Avro DataFileWriter API, you will see there is a setCodec method, which allows you to specify a codec to compress your data.
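
For reference, a minimal sketch of what that can look like, assuming the Avro Java library (and its commons-compress dependency, needed for bzip2) is on the classpath. The one-field "Event" schema, the class and method names and the temp-file path are made up for illustration; only DataFileWriter.setCodec and the CodecFactory factory methods come from the Avro API. The codec name is stored in the file header, so the sketch reads it back from the "avro.codec" metadata entry, which is also how a reader such as AvroKeyInputFormat can pick the right codec without a change to the job code.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

class AvroCodecSketch {

    // Hypothetical single-field schema, used only for this example.
    static final Schema SCHEMA = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Event\","
        + "\"fields\":[{\"name\":\"id\",\"type\":\"long\"}]}");

    // Writes `count` records with the given codec, then returns the codec
    // name that Avro recorded in the file header.
    static String writeWithCodec(File file, CodecFactory codec, int count) {
        try {
            try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(SCHEMA))) {
                writer.setCodec(codec);      // must be set before create()
                writer.create(SCHEMA, file); // overwrites an existing file
                for (long i = 0; i < count; i++) {
                    GenericRecord record = new GenericData.Record(SCHEMA);
                    record.put("id", i);
                    writer.append(record);
                }
            }
            try (DataFileReader<GenericRecord> reader =
                     new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
                return reader.getMetaString("avro.codec");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        File file = new File(System.getProperty("java.io.tmpdir"), "events-bzip2.avro");
        // Other options include CodecFactory.deflateCodec(level) and snappyCodec().
        System.out.println(writeWithCodec(file, CodecFactory.bzip2Codec(), 1000));
    }
}
```

Note that this only covers files written from now on. Existing files would still have to be rewritten once with the new codec.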

In an Avro data file the compression is applied per block rather than per record, which is the more efficient approach.

You can use bzip2, gzip (deflate), snappy, or any other supported codec. You just need to use the above API and make sure the compression codec is available on all your task nodes.

Whether a compression format is splittable by itself doesn't matter in this case, as you are using Avro, which is splittable.

What you need to decide is which compression best fits your application's use case.

In production we use snappy, as it gives us a good balance between compression ratio, read/decompression speed, and CPU usage.

Different codecs have different trade-offs. You need to compare them based on your case.

Yong


Date: Mon, 22 Sep 2014 17:21:29 +0200
From: [hidden email]
To: [hidden email]
Subject: Re: Bzip2 files as an input to MR job

[...]