I now have a WIP of Cascading 1.2 that includes support for Riffle.
Riffle is an Apache-licensed library that includes Java annotations for
marking lifecycle and dependency methods on a 'process' object.
That is, you can create custom objects with 'start' and 'stop' methods, as
well as with getters for incoming/outgoing resources (input files, and
With a collection of such objects, each one for a particular task like
running a copy job or a Mahout process, you can have either Riffle or
Cascading chain and execute all the processes in dependency order.
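
To make the idea concrete, here is a rough sketch of chaining annotated objects by their declared resources. It is not Riffle's or Cascading's actual code: the @Start, @Incoming and @Outgoing annotations are hypothetical stand-ins declared inline so the example compiles on its own, and CopyJob, AnalysisJob, the resource names and runInOrder are likewise invented for illustration. Each process is assumed to have exactly one incoming and one outgoing resource.

```java
import java.lang.annotation.Annotation;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class RiffleSketch {
    // Hypothetical stand-ins for lifecycle/dependency annotations (not Riffle's real ones).
    @Retention(RetentionPolicy.RUNTIME) @interface Start {}
    @Retention(RetentionPolicy.RUNTIME) @interface Incoming {}
    @Retention(RetentionPolicy.RUNTIME) @interface Outgoing {}

    // Illustrative process: copies one resource to another.
    public static class CopyJob {
        @Incoming public String in() { return "raw/logs"; }
        @Outgoing public String out() { return "staged/logs"; }
        @Start public void start() { System.out.println("copying raw/logs -> staged/logs"); }
    }

    // Illustrative process: consumes what CopyJob produces.
    public static class AnalysisJob {
        @Incoming public String in() { return "staged/logs"; }
        @Outgoing public String out() { return "reports/summary"; }
        @Start public void start() { System.out.println("analyzing staged/logs -> reports/summary"); }
    }

    // Invoke the first public method carrying the given annotation.
    static Object call(Object target, Class<? extends Annotation> marker) throws Exception {
        for (Method m : target.getClass().getMethods()) {
            if (m.isAnnotationPresent(marker)) return m.invoke(target);
        }
        throw new IllegalArgumentException("no @" + marker.getSimpleName() + " method on " + target);
    }

    // Start each process only once nothing still pending produces its input.
    static List<Object> runInOrder(List<Object> processes) throws Exception {
        List<Object> pending = new ArrayList<>(processes);
        List<Object> executed = new ArrayList<>();
        while (!pending.isEmpty()) {
            Set<Object> unproduced = new HashSet<>();
            for (Object p : pending) unproduced.add(call(p, Outgoing.class));
            Object next = null;
            for (Object p : pending) {
                if (!unproduced.contains(call(p, Incoming.class))) { next = p; break; }
            }
            if (next == null) throw new IllegalStateException("cyclic dependency between processes");
            call(next, Start.class);
            pending.remove(next);
            executed.add(next);
        }
        return executed;
    }

    public static void main(String[] args) throws Exception {
        // Deliberately listed out of order; the copy still runs first.
        runInOrder(Arrays.asList(new AnalysisJob(), new CopyJob()));
    }
}
```

The runner only looks at the annotations, never at the concrete classes, which is why a third-party tool could execute objects it knows nothing else about.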
Note that Riffle is at a very early stage (and likely naive), and the Cascading
support is likely to evolve before the 1.2 final release (sometime this
The long-term goal here is to allow Mahout and other projects to apply the
annotations, so that third-party tools can be used to run the processes.
For you Cascading users, writing a simple DistCp wrapper (or putting the
annotations directly on the Hadoop DistCp object) would allow an efficient copy
to run inside of a Cascade process alongside your Flow instances.
Or more importantly, you can write iterative processes (e.g. PageRank)
that act like a single process even though internally there is an unknown
number of Flows being created on the fly. (I'm running a connected component
algorithm that requires multiple Flows/passes in production now as a Riffle