reviewers needed: HADOOP-16830 Add public IOStatistics API

Steve Loughran-4
Hi,

Can I get some reviews of this PR?
https://github.com/apache/hadoop/pull/2323

It adds a new API, IOStatisticsSource, which lets any class act as a source
of IOStatistics: a static or dynamic set of counters, gauges, minimums,
maximums and mean statistics.
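
To make the shape of the idea concrete, here is a minimal self-contained
sketch. The `Stats`, `StatsSource` and `CountingReader` types and the counter
names are hypothetical stand-ins invented for this illustration; they are not
the interfaces in the PR, which may well look different.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in: a read-only view of named counters. */
interface Stats {
  Map<String, Long> counters();
}

/** Hypothetical stand-in: any class that can offer statistics. */
interface StatsSource {
  Stats getStats();
}

/** A reader over a byte array which counts its own operations. */
class CountingReader implements StatsSource {
  private final AtomicLong reads = new AtomicLong();
  private final AtomicLong bytes = new AtomicLong();
  private final byte[] data;
  private int pos;

  CountingReader(byte[] data) {
    this.data = data;
  }

  /** Read up to buf.length bytes; returns -1 at end of data. */
  int read(byte[] buf) {
    // every call is counted, including the one that hits the end
    reads.incrementAndGet();
    int n = Math.min(buf.length, data.length - pos);
    if (n <= 0) {
      return -1;
    }
    System.arraycopy(data, pos, buf, 0, n);
    pos += n;
    bytes.addAndGet(n);
    return n;
  }

  @Override
  public Stats getStats() {
    // dynamic view: each counters() call reads the live values
    return () -> {
      Map<String, Long> m = new TreeMap<>();
      m.put("read_operations", reads.get());
      m.put("read_bytes", bytes.get());
      return m;
    };
  }
}
```

In this sketch the view is dynamic: a `Stats` reference obtained before any
reads still reports the up-to-date values afterwards. A static source would
instead return a fixed copy.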

The intent is to allow applications to collect statistics on streams,
iterators, and other classes they use to interact with filesystems/remote
stores, and so get detailed statistics on the number of operations,
latencies, etc. There are helpers to log these results, as well as to
aggregate them.


Here is the API specification:

https://github.com/steveloughran/hadoop/blob/s3/HADOOP-16830-iostatistics-common/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/iostatistics.md

The FSDataStreams do passthrough of this, and there's a set of remote
iterators which also do passthrough, making it easy to chain/wrap
iteration code.
https://github.com/steveloughran/hadoop/blob/s3/HADOOP-16830-iostatistics-common/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/functional/RemoteIterators.java
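
A rough sketch of what passthrough means for wrapped iterators is below. All
the types here (`CounterSource`, `IOIterator`, `IOFunction`, `PagedListing`,
`MappingIterator`) are hypothetical and written only for this illustration;
the real helpers are in the RemoteIterators class linked above and are not
reproduced here.

```java
import java.io.IOException;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;

/** Hypothetical stand-in for a statistics source. */
interface CounterSource {
  Map<String, Long> counters();
}

/** Hypothetical iterator whose operations may raise IOException. */
interface IOIterator<T> {
  boolean hasNext() throws IOException;
  T next() throws IOException;
}

/** Hypothetical function which may raise IOException. */
interface IOFunction<S, T> {
  T apply(S source) throws IOException;
}

/** Simulated paged listing: one "remote" request per page. */
class PagedListing implements IOIterator<String>, CounterSource {
  private final List<String> entries;
  private final int pageSize;
  private int index;
  private int fetched;
  private long requests;

  PagedListing(List<String> entries, int pageSize) {
    this.entries = entries;
    this.pageSize = pageSize;
  }

  @Override
  public boolean hasNext() {
    if (index < fetched) {
      return true;
    }
    if (fetched >= entries.size()) {
      return false;
    }
    // simulate fetching the next page with one list request
    requests++;
    fetched = Math.min(entries.size(), fetched + pageSize);
    return true;
  }

  @Override
  public String next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    return entries.get(index++);
  }

  @Override
  public Map<String, Long> counters() {
    Map<String, Long> m = new HashMap<>();
    m.put("list_requests", requests);
    return m;
  }
}

/** Wrapper which maps each element and passes statistics through. */
class MappingIterator<S, T> implements IOIterator<T>, CounterSource {
  private final IOIterator<S> inner;
  private final IOFunction<S, T> fn;

  MappingIterator(IOIterator<S> inner, IOFunction<S, T> fn) {
    this.inner = inner;
    this.fn = fn;
  }

  @Override
  public boolean hasNext() throws IOException {
    return inner.hasNext();
  }

  @Override
  public T next() throws IOException {
    return fn.apply(inner.next());
  }

  @Override
  public Map<String, Long> counters() {
    // passthrough: report the inner iterator's statistics, if it has any
    return inner instanceof CounterSource
        ? ((CounterSource) inner).counters()
        : Collections.<String, Long>emptyMap();
  }
}
```

The caller only ever holds the outermost wrapper, yet can still ask it how
many list requests the underlying listing made.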

It also includes a statistics snapshot, which can be serialized as JSON and
as Java objects, and which can aggregate results:
https://github.com/steveloughran/hadoop/blob/s3/HADOOP-16830-iostatistics-common/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/statistics/IOStatisticsSnapshot.java

This is how applications can aggregate results, and then propagate them
back to the AM/job driver/query engine.
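
As an illustration of that flow, here is a small sketch of a serializable
snapshot which a worker can ship back and a driver can merge. `StatsSnapshot`,
its methods and the statistic names are hypothetical, and the merge rules
(sum the counters, keep the larger maximum) are assumptions made for this
example; IOStatisticsSnapshot itself is not reproduced here, and JSON
serialization is left out to keep the sketch to the standard library.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.TreeMap;

/** Hypothetical snapshot of counters and maximums. */
class StatsSnapshot implements Serializable {
  private static final long serialVersionUID = 1L;

  private final TreeMap<String, Long> counters = new TreeMap<>();
  private final TreeMap<String, Long> maximums = new TreeMap<>();

  /** Add to a counter, creating it if needed. */
  void addCounter(String key, long value) {
    counters.merge(key, value, Long::sum);
  }

  /** Record a value, keeping the largest seen so far. */
  void addMaximum(String key, long value) {
    maximums.merge(key, value, Long::max);
  }

  /** Merge another snapshot in: counters are summed, maximums compared. */
  void aggregate(StatsSnapshot other) {
    other.counters.forEach(this::addCounter);
    other.maximums.forEach(this::addMaximum);
  }

  long counter(String key) {
    return counters.getOrDefault(key, 0L);
  }

  /** Returns null if no value has been recorded. */
  Long maximum(String key) {
    return maximums.get(key);
  }

  /** Java-serialize this snapshot, e.g. to send from a task to the driver. */
  byte[] toBytes() throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(this);
    }
    return bytes.toByteArray();
  }

  static StatsSnapshot fromBytes(byte[] data)
      throws IOException, ClassNotFoundException {
    try (ObjectInputStream in =
             new ObjectInputStream(new ByteArrayInputStream(data))) {
      return (StatsSnapshot) in.readObject();
    }
  }
}
```

Each task would fill in its own snapshot and send the bytes back; the driver
deserializes each one and calls `aggregate()` to build the job-wide totals.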

We already have PRs using this for S3A and ABFS on input streams, and in
S3A we also count LIST performance, which clients can pick up provided they
use listStatusIterator, listFiles and the other calls which return a
RemoteIterator.

I know it's a lot of code, but it's split into interface and
implementation: the public interface is for applications; the
implementation is what we are using internally, and which we will tune as
we adopt it more.

I have been working on this on and off for months, and yes, it has grown.
But now that we are supporting more complex storage systems, the existing
tracking of long/short reads isn't informative enough. I want to know how
many GET requests failed and had to be retried, how often the DELETE calls
were throttled, and what the real latency of list operations is over
long-haul connections.

Please take a look. As a new API it's unlikely to cause any regressions;
the main things to worry about are "is that API the one applications can
use?" and "has Steve got something fundamentally wrong in his implementation
code?"

-Steve