File Compression


File Compression

Michael Harris-13
I have a question about file compression in Hadoop. When I set io.seqfile.compression.type=BLOCK, does this also compress the actual files I load into the DFS, or does it only control the map/reduce file compression? If it doesn't compress the files on the file system, is there any way to compress a file when it's loaded? The concern here is that I am just getting started with Pig/Hadoop and have a very small cluster of around 5 nodes, and I want to limit I/O wait by compressing the actual data. As a test, when I compressed our 4 GB log file using rar it came out at only 280 MB.

Thanks,
Michael

Re: File Compression

Arun C Murthy-2
Michael,

On Tue, Nov 13, 2007 at 08:56:36AM -0800, Michael Harris wrote:
>I have a question about file compression in Hadoop. When I set the io.seqfile.compression.type=BLOCK does this also compress actual files I load in the DFS or does this only control the map/reduce file compression? If it doesnt compress the files on the file system, is there any way to compress a file when its loaded? The concern here is that I am just getting started with Pig/Hadoop and have a very small cluster of around 5 nodes. I want to limit IO wait by compressing the actual data. As a test when I compressed our 4GB log file using rar it was only 280mb.
>

If you are loading files into HDFS as a SequenceFile and you set io.seqfile.compression.type=BLOCK (or RECORD), the file will have compressed records. Equivalently, you can use one of the many SequenceFile.createWriter methods (see http://lucene.apache.org/hadoop/api/org/apache/hadoop/io/SequenceFile.html) to specify the compression type, compression codec, etc.
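
For instance, something along these lines should work when loading data into a block-compressed SequenceFile (the path, key/value types and codec below are just placeholders, and the exact createWriter overload may differ across Hadoop versions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class CompressedSeqFileLoad {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);      // the DFS when the default file system is HDFS
    Path out = new Path("/data/logs.seq");     // placeholder DFS path

    // Ask for block compression and name the (default zlib) codec explicitly.
    DefaultCodec codec = ReflectionUtils.newInstance(DefaultCodec.class, conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, out,
        LongWritable.class, Text.class,
        SequenceFile.CompressionType.BLOCK, codec);
    try {
      writer.append(new LongWritable(1L), new Text("first log line"));
      writer.append(new LongWritable(2L), new Text("second log line"));
    } finally {
      writer.close();
    }
  }
}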

Arun


RE: File Compression

Devaraj Das
In reply to this post by Michael Harris-13
Yes, io.seqfile.compression.type controls compression of only the map/reduce files. One way to compress files on the DFS, independent of map/reduce, is to use the java.util.zip package over the OutputStream that DistributedFileSystem.create returns. For example, you can use java.util.zip.GZIPOutputStream: pass the org.apache.hadoop.fs.FSDataOutputStream that org.apache.hadoop.dfs.DistributedFileSystem.create() returns as an argument to the GZIPOutputStream constructor.
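
Something along these lines (the local and DFS paths are just placeholders; this assumes the default file system is HDFS, so FileSystem.get(conf) hands back the DistributedFileSystem):

import java.io.FileInputStream;
import java.io.InputStream;
import java.util.zip.GZIPOutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GzipUpload {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);   // DistributedFileSystem when fs.default.name points at HDFS

    // Create the file on the DFS and wrap the returned FSDataOutputStream with gzip.
    GZIPOutputStream out = new GZIPOutputStream(fs.create(new Path("/data/access.log.gz")));
    InputStream in = new FileInputStream("/var/log/access.log");
    try {
      byte[] buf = new byte[64 * 1024];
      int n;
      while ((n = in.read(buf)) > 0) {
        out.write(buf, 0, n);   // bytes are compressed as they are written to the DFS
      }
    } finally {
      in.close();
      out.close();              // finishes the gzip stream and closes the underlying DFS stream
    }
  }
}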

> -----Original Message-----
> From: Michael Harris [mailto:[hidden email]]
> Sent: Tuesday, November 13, 2007 10:27 PM
> To: [hidden email]
> Subject: File Compression
>
> I have a question about file compression in Hadoop. When I
> set the io.seqfile.compression.type=BLOCK does this also
> compress actual files I load in the DFS or does this only
> control the map/reduce file compression? If it doesnt
> compress the files on the file system, is there any way to
> compress a file when its loaded? The concern here is that I
> am just getting started with Pig/Hadoop and have a very small
> cluster of around 5 nodes. I want to limit IO wait by
> compressing the actual data. As a test when I compressed our
> 4GB log file using rar it was only 280mb.
>
> Thanks,
> Michael
>