Hadoop Custom InputFormat (SequenceFileInputFormat vs FileInputFormat)

Travis Chung

I'm working with a single image file that consists of headers and many different types of data segments (each data segment has its own sub-header containing metadata).

Example file layout:

| Header | Seg A-1 Sub-Header | Seg A-1 Data | Seg A-2 Sub-Header | Seg A-2 Data | Seg B-1 Sub-Header | Seg B-1 Data | Seg C-1 Sub-Header | Seg C-1 Data | etc.

The headers vary from 1 KB to 10 KB in size, and each data segment varies anywhere from 10 KB to 10 GB. The headers are stored as characters and the data is stored as binary. The headers include some useful information, such as the number of segments and the sizes of the sub-headers and segment data (I'll need this to create my splits).
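To make the split idea concrete, here is a rough sketch in plain Java (no Hadoop classes) of how I imagine turning the sizes parsed from the headers into byte ranges. The names SplitPlanner, Range and plan are made up for illustration, and I'm assuming the sub-header and data sizes have already been parsed into arrays in file order.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical helper: plans byte ranges for a file laid out as
 * [header][sub-header][data][sub-header][data]...
 */
final class SplitPlanner {

    /** One byte range of the file; isHeader marks header and sub-header ranges. */
    static final class Range {
        final long offset;
        final long length;
        final boolean isHeader;

        Range(long offset, long length, boolean isHeader) {
            this.offset = offset;
            this.length = length;
            this.isHeader = isHeader;
        }
    }

    /**
     * @param headerLength   size of the file header in bytes
     * @param subHeaderSizes size of each segment's sub-header, in file order
     * @param dataSizes      size of each segment's data, in file order
     * @param maxSplitBytes  upper bound for a single data range
     */
    static List<Range> plan(long headerLength, long[] subHeaderSizes,
                            long[] dataSizes, long maxSplitBytes) {
        if (subHeaderSizes.length != dataSizes.length) {
            throw new IllegalArgumentException("one sub-header size per data segment expected");
        }
        if (maxSplitBytes <= 0) {
            throw new IllegalArgumentException("maxSplitBytes must be positive");
        }
        List<Range> ranges = new ArrayList<>();
        long pos = 0;

        // The file header becomes one range of its own.
        ranges.add(new Range(pos, headerLength, true));
        pos += headerLength;

        for (int i = 0; i < dataSizes.length; i++) {
            // Each sub-header is also a single range.
            ranges.add(new Range(pos, subHeaderSizes[i], true));
            pos += subHeaderSizes[i];

            // Chop the segment data into ranges that never cross a segment boundary.
            long remaining = dataSizes[i];
            while (remaining > 0) {
                long len = Math.min(maxSplitBytes, remaining);
                ranges.add(new Range(pos, len, false));
                pos += len;
                remaining -= len;
            }
        }
        return ranges;
    }
}
```

One thing I notice with this layout is that the data ranges would not line up exactly with HDFS block boundaries, because the headers shift every offset.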

To process it all, I'm wondering whether it's best to create a custom InputFormat inheriting from (1) FileInputFormat or (2) SequenceFileInputFormat.

If I go with (1), I will create HeaderSplits and DataSplits (each data split will be equivalent to the block size, 128 MB). I would also create a custom RecordReader for the DataSplits, where each record is one tile of 1024^2 bytes; the record reader would simply read one tile at a time. For the headers, each split will contain a single record.
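The tile-at-a-time reading I have in mind for the DataSplits could look roughly like the sketch below. It is plain Java rather than an actual RecordReader subclass: TileReader and its method names are hypothetical, loosely mirroring the nextKeyValue / getCurrentKey / getCurrentValue pattern of Hadoop's RecordReader, and it assumes the stream has already been positioned at the start of the split.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical stand-in for a RecordReader over one data split. */
final class TileReader {
    private final DataInputStream in;
    private final int tileSize;
    private long remaining;       // bytes of the split not yet read
    private long tileIndex = -1;  // index of the current tile within the split
    private byte[] current;       // bytes of the current tile

    /** The stream must already be positioned at the first byte of the split. */
    TileReader(InputStream in, long splitLength, int tileSize) {
        if (tileSize <= 0 || splitLength < 0) {
            throw new IllegalArgumentException("tileSize must be > 0 and splitLength >= 0");
        }
        this.in = new DataInputStream(in);
        this.remaining = splitLength;
        this.tileSize = tileSize;
    }

    /** Advances to the next tile; returns false once the split is exhausted. */
    boolean next() throws IOException {
        if (remaining <= 0) {
            return false;
        }
        // The last tile of a split may be shorter than tileSize.
        int len = (int) Math.min((long) tileSize, remaining);
        byte[] buf = new byte[len];
        in.readFully(buf); // throws EOFException if the stream ends early
        current = buf;
        remaining -= len;
        tileIndex++;
        return true;
    }

    /** Key of the current record: the tile's index within this split. */
    long currentKey() {
        return tileIndex;
    }

    /** Value of the current record: the raw bytes of the tile. */
    byte[] currentValue() {
        return current;
    }
}
```

With a tile size of 1024 * 1024, a full 128 MB split would yield 128 records, and a split at the end of a segment would end with one shorter tile.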

If I go with (2), I believe the bulk of my work would be in converting my image file to a SequenceFile. I would create a key/value pair for each header/sub-header, and a key/value pair for every 1024^2 bytes of my segment data. Once I do that, I would have to create a custom SequenceFileInputFormat that also splits the headers from the partitioned data segments. I read that SequenceFiles are great for dealing with the "large number of small files" problem, but I'm dealing with just one image file (although with possibly many different data segments).
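For the key/value pairs in (2), this is roughly the key I'm picturing. It is only a plain-Java sketch: RecordKey and its fields are made up, and in Hadoop I assume it would implement WritableComparable, whose write(DataOutput) and readFields(DataInput) methods this mirrors. The ordering is my own choice, so that a segment's sub-header compares ahead of that segment's tiles.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

/** Hypothetical composite key: one per header record or per data tile. */
final class RecordKey implements Comparable<RecordKey> {
    static final byte HEADER = 0; // compares before the tiles of the same segment
    static final byte TILE = 1;

    String segmentId = ""; // e.g. "A-1"
    byte kind = HEADER;
    long tileIndex = 0L;   // left at 0 for header records

    /** No-argument constructor so an instance can be filled by readFields. */
    RecordKey() { }

    RecordKey(String segmentId, byte kind, long tileIndex) {
        this.segmentId = segmentId;
        this.kind = kind;
        this.tileIndex = tileIndex;
    }

    /** Serializes the three fields in a fixed order. */
    void write(DataOutput out) throws IOException {
        out.writeUTF(segmentId);
        out.writeByte(kind);
        out.writeLong(tileIndex);
    }

    /** Reads the fields back in the same order they were written. */
    void readFields(DataInput in) throws IOException {
        segmentId = in.readUTF();
        kind = in.readByte();
        tileIndex = in.readLong();
    }

    /** Orders by segment id, then header before tile, then tile index. */
    @Override
    public int compareTo(RecordKey other) {
        int c = segmentId.compareTo(other.segmentId);
        if (c != 0) {
            return c;
        }
        c = Byte.compare(kind, other.kind);
        if (c != 0) {
            return c;
        }
        return Long.compare(tileIndex, other.tileIndex);
    }
}
```

The value for a HEADER key would be the header text, and the value for a TILE key would be the raw tile bytes.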

I also noticed that SequenceFileInputFormat uses FileInputFormat's getSplits implementation. I'm assuming I would have to override it to get the kinds of splits that I want (extract the header key/value pair, parse out the size info, etc.).

Is one approach better than the other? I feel (1) would be a simpler task, but (2) has a lot of nice features. Is there a better way? Thank you in advance!