I'd like clarification on the start parameter of FileSplit. If I understand correctly, it is the byte offset from the beginning of the file:
/** Constructs a split with host information
 *
 * @param file the file name
 * @param start the position of the first byte in the file to process
 * @param length the number of bytes in the file to process
 * @param hosts the list of hosts containing the block, possibly null
 */
public FileSplit(Path file, long start, long length, String[] hosts)
In a blog post about Hadoop RecordReaders, the author creates a custom RecordReader and checks whether it needs to skip the first line (on the assumption that the previous split has already processed it).
Why would he need to skip the first line if getStart() already points to the beginning of the current split?
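To make the question concrete, here is a minimal sketch in plain Java with no Hadoop dependencies. It assumes splits are cut at fixed byte offsets without regard to line boundaries. The class name, the startsMidLine helper, the sample data and the 16-byte split size are all made up for illustration. It only shows that a start offset chosen this way can land in the middle of a line. Whether that is the reason for the skip is what I am asking.

```java
import java.nio.charset.StandardCharsets;

public class SplitBoundaryDemo {

    // Hypothetical helper: returns true when byte offset `start` falls inside a line,
    // i.e. it is not the start of the data and the preceding byte is not a newline.
    static boolean startsMidLine(byte[] data, long start) {
        if (start <= 0 || start >= data.length) {
            return false;
        }
        return data[(int) (start - 1)] != '\n';
    }

    public static void main(String[] args) {
        byte[] data = "first line\nsecond line\nthird line\n"
                .getBytes(StandardCharsets.US_ASCII);
        long splitSize = 16; // assumed fixed split size in bytes, for illustration only

        // Walk the data in fixed-size byte ranges, the way (start, length) pairs would.
        for (long start = 0; start < data.length; start += splitSize) {
            long length = Math.min(splitSize, data.length - start);
            System.out.printf("start=%d length=%d midLine=%b%n",
                    start, length, startsMidLine(data, start));
        }
    }
}
```

With this sample data the second range begins at byte 16, which is inside "second line", so getStart() would point at the beginning of the split but not at the beginning of a record.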
In initialize() of CustomRecordReader: