S3AFileSystem premature EOF with random InputPolicy


Dave Christianson
I'm seeing a problem using the S3AFileSystem where I get a premature EOF, in particular when reading Parquet files using projection. Projection appears to cause a backward seek on the S3A input stream, which then triggers a bug: once the input stream switches to the "random" input policy, it can eventually (depending on the file and the amount of data read) read past the end of its readahead buffer without reopening the stream, resulting in an EOF.

I haven't seen this issue reported anywhere. I'm wondering whether this is worth a fix (it looks like the stream-reopening behavior just needs to be more aggressive) or whether it's better to retrieve the whole file sequentially before attempting to parse it (I was surprised it works at all).


I've written a sample program that illustrates the problem given a path to any object in S3 (no Parquet involved):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;

final Configuration conf = new Configuration();
conf.set("fs.s3a.readahead.range", "1K");
conf.set("fs.s3a.experimental.input.fadvise", "random");

// 'path' is a Path to any existing object in S3
final FileSystem fs = FileSystem.get(path.toUri(), conf);

// forward seek reading across readahead boundary
try (FSDataInputStream in = fs.open(path)) {
    final byte[] temp = new byte[5];
    in.readByte();
    in.readFully(1023, temp); // <-- works
}

// forward seek reading from end of readahead boundary
try (FSDataInputStream in = fs.open(path)) {
    final byte[] temp = new byte[5];
    in.readByte();
    in.readFully(1024, temp); // <-- throws EOFException
}
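To make the boundary arithmetic concrete, here is a minimal sketch of what the sample above exercises. This is illustrative only, not S3A internals: the helper name and the window model are my own. With fs.s3a.readahead.range set to 1K, the read at offset 0 leaves an open stream covering roughly [0, 1024); a positioned read starting inside that window succeeds, while one starting at or past its end needs the stream to be reopened (which is what doesn't happen in the failing case).

```java
// Illustrative model of the readahead window -- not S3A internals.
public class ReadaheadWindow {
    static final long READAHEAD = 1024; // fs.s3a.readahead.range = "1K"

    // Hypothetical helper: does a read starting at 'offset' begin
    // inside the window opened at 'streamStart'?
    static boolean insideWindow(long streamStart, long offset) {
        return offset >= streamStart && offset < streamStart + READAHEAD;
    }

    public static void main(String[] args) {
        // Mirrors the sample program: stream opened at offset 0.
        System.out.println(insideWindow(0, 1023)); // the read that works
        System.out.println(insideWindow(0, 1024)); // the read that throws EOFException
    }
}
```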


Regards, Dave Christianson