FileSplit clarification

Travis Chung

I would like clarification on the start parameter of the FileSplit constructor. If I understand correctly, it is the byte offset from the beginning of the file.

/**
 * Constructs a split with host information
 *
 * @param file the file name
 * @param start the position of the first byte in the file to process
 * @param length the number of bytes in the file to process
 * @param hosts the list of hosts containing the block, possibly null
 */
public FileSplit(Path file, long start, long length, String[] hosts)
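
To make the byte-offset reading concrete, here is a minimal sketch in plain Java (no Hadoop dependency) that carves a file length into (start, length) pairs. The class SplitOffsets and the method computeSplits are hypothetical names used only for illustration, and the fixed split size is an assumption: Hadoop's real split computation also takes block size and other settings into account.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitOffsets {
    /**
     * Hypothetical helper: returns {start, length} pairs that cover a file of
     * fileLength bytes in chunks of at most splitSize bytes.
     */
    public static List<long[]> computeSplits(long fileLength, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            // The last split may be shorter than splitSize.
            long length = Math.min(splitSize, fileLength - start);
            splits.add(new long[] {start, length});
        }
        return splits;
    }

    public static void main(String[] args) {
        // A 25-byte file cut into 10-byte splits.
        for (long[] s : computeSplits(25, 10)) {
            System.out.println("start=" + s[0] + " length=" + s[1]);
        }
    }
}
```

Nothing in this arithmetic looks at the file's contents, so a split boundary can land in the middle of a line.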

In a blog post on Hadoop RecordReader, the author creates a custom RecordReader and checks whether it needs to skip the first line (assuming that line has already been processed by the previous split).

Why would he need to skip the first line if getStart() already points to the beginning of the current split?

In initialize() of CustomRecordReader:

// Split "S" is responsible for all records
// between the "start" and "end" byte positions
start = split.getStart();
end = start + split.getLength();

// Retrieve file containing Split "S"
final Path file = split.getPath();
FileSystem fs = file.getFileSystem(job);
FSDataInputStream fileIn = fs.open(split.getPath());

// If Split "S" starts at byte 0, the first line will be processed.
// If Split "S" does not start at byte 0, the first line has already been
// processed by "S-1" and therefore needs to be silently ignored.
boolean skipFirstLine = false;
if (start != 0) {
    skipFirstLine = true;
    // Set the file pointer to the "start - 1" position.
    // This makes sure we won't miss any line, which
    // could happen if "start" is located on an EOL.
    --start;
    fileIn.seek(start);
}
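
One way to see why the skip is needed: split boundaries are byte offsets, so getStart() usually falls in the middle of a line rather than at the beginning of one. The sketch below simulates the same convention over an in-memory byte array instead of an HDFS stream. SplitLineDemo, readSplit and nextLineStart are hypothetical names, and the loop condition (a split owns every line that begins before its end offset) is an assumption, since the rest of the blog's reader is not quoted above.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SplitLineDemo {
    /** Reads the lines owned by the split [start, start + length) of data. */
    public static List<String> readSplit(byte[] data, long start, long length) {
        int end = (int) Math.min(start + length, data.length);
        int pos = (int) start;
        if (start != 0) {
            // Same idea as the snippet above: back up one byte, then discard
            // everything through the next newline. If "start" is exactly the
            // beginning of a line, only the previous newline is discarded.
            pos = nextLineStart(data, (int) start - 1);
        }
        List<String> lines = new ArrayList<>();
        // Assumed ownership rule: keep every line that begins before "end",
        // even if it finishes past "end".
        while (pos < end) {
            int next = nextLineStart(data, pos);
            int lineEnd = (data[next - 1] == '\n') ? next - 1 : next;
            lines.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = next;
        }
        return lines;
    }

    /** Returns the index just past the next '\n' at or after pos, or data.length. */
    private static int nextLineStart(byte[] data, int pos) {
        while (pos < data.length && data[pos] != '\n') {
            pos++;
        }
        return Math.min(pos + 1, data.length);
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\ndelta\n".getBytes(StandardCharsets.UTF_8);
        // Three splits of a 26-byte file: [0,10), [10,20), [20,26).
        System.out.println(readSplit(data, 0, 10));
        System.out.println(readSplit(data, 10, 10));
        System.out.println(readSplit(data, 20, 6));
    }
}
```

In this example the first split reads "bravo" to completion even though the line runs past byte 10, so the second split, whose start lands inside "bravo", has to discard that partial line. The third split starts exactly at the beginning of "delta"; backing up to "start - 1" means only the preceding newline is discarded and "delta" is kept. Each line is read by exactly one split.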