Sorry, cross posting to save time.

I now have a WIP of Cascading 1.2 that includes support for Riffle annotations.

Riffle is an Apache-licensed library that includes Java annotations for marking lifecycle and dependency methods on a 'process' object.

That is, you can create custom objects with 'start' and 'stop' methods, as well as getters for incoming and outgoing resources (input files and output files).
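To make that concrete, here is a rough sketch of such an object. The annotation types below (Start, Stop, Incoming, Outgoing) are hypothetical stand-ins defined locally so the example compiles on its own; they are not Riffle's actual annotation names, so check the Riffle source for those. CopyProcess and the call helper are likewise made up for illustration.

```java
import java.lang.annotation.Annotation;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

class AnnotatedProcessSketch {
    // Hypothetical stand-ins for Riffle-style markers, NOT the library's own types.
    @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.METHOD) public @interface Start {}
    @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.METHOD) public @interface Stop {}
    @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.METHOD) public @interface Incoming {}
    @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.METHOD) public @interface Outgoing {}

    // A custom 'process' object: lifecycle methods plus getters for its resources.
    public static class CopyProcess {
        private final String source;
        private final String sink;
        private boolean running = false;

        public CopyProcess(String source, String sink) {
            this.source = source;
            this.sink = sink;
        }

        @Incoming public String getSource() { return source; }
        @Outgoing public String getSink() { return sink; }
        @Start public void start() { running = true; }   // a real job would launch here
        @Stop public void stop() { running = false; }
        public boolean isRunning() { return running; }
    }

    // Find the first public method carrying the given marker and invoke it.
    // This is how a third-party runner could drive an object it knows nothing about.
    static Object call(Object target, Class<? extends Annotation> marker) {
        for (Method m : target.getClass().getMethods()) {
            if (m.isAnnotationPresent(marker)) {
                try {
                    m.setAccessible(true); // the process class itself need not be public
                    return m.invoke(target);
                } catch (ReflectiveOperationException e) {
                    throw new IllegalStateException(e);
                }
            }
        }
        throw new IllegalArgumentException("no method marked @" + marker.getSimpleName());
    }

    public static void main(String[] args) {
        CopyProcess copy = new CopyProcess("remote/logs", "hdfs/logs");
        System.out.println("reads:  " + call(copy, Incoming.class));
        System.out.println("writes: " + call(copy, Outgoing.class));
        call(copy, Start.class);
        System.out.println("running after start: " + copy.isRunning());
        call(copy, Stop.class);
        System.out.println("running after stop:  " + copy.isRunning());
    }
}
```

The point of the annotations is that the runner only needs reflection, not a shared base class or interface, so existing classes can be marked up without changing their type hierarchy.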

Given a collection of such objects, each handling a particular task such as a copy job or a Mahout process, you can have either Riffle or Cascading chain them together and execute all the processes in dependency order.
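As an illustration of what "dependency order" means here, the sketch below sorts processes so that whatever writes a resource runs before anything that reads it. It uses a plain topological sort (Kahn's algorithm), which is an assumption on my part and not necessarily how Riffle or Cascading schedules things internally. The Proc class and the resource names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class DependencyOrderSketch {
    // Hypothetical process description: a name plus the resources it reads and writes.
    public static final class Proc {
        final String name;
        final Set<String> incoming;
        final Set<String> outgoing;

        public Proc(String name, List<String> incoming, List<String> outgoing) {
            this.name = name;
            this.incoming = new HashSet<>(incoming);
            this.outgoing = new HashSet<>(outgoing);
        }
    }

    // Kahn's algorithm: repeatedly emit processes whose upstream producers have all run.
    static List<Proc> order(List<Proc> procs) {
        // Map each resource to the process that produces it.
        Map<String, Proc> producer = new HashMap<>();
        for (Proc p : procs) {
            for (String r : p.outgoing) producer.put(r, p);
        }

        Map<Proc, Integer> pending = new LinkedHashMap<>();   // unmet upstream edges
        Map<Proc, List<Proc>> dependents = new HashMap<>();   // upstream -> downstream
        for (Proc p : procs) {
            int count = 0;
            for (String r : p.incoming) {
                Proc upstream = producer.get(r);
                if (upstream != null && upstream != p) {
                    dependents.computeIfAbsent(upstream, k -> new ArrayList<>()).add(p);
                    count++;
                }
            }
            pending.put(p, count);
        }

        Deque<Proc> ready = new ArrayDeque<>();
        for (Proc p : procs) {
            if (pending.get(p) == 0) ready.add(p);
        }

        List<Proc> result = new ArrayList<>();
        while (!ready.isEmpty()) {
            Proc p = ready.poll();
            result.add(p);
            for (Proc d : dependents.getOrDefault(p, Collections.<Proc>emptyList())) {
                if (pending.merge(d, -1, Integer::sum) == 0) ready.add(d);
            }
        }
        if (result.size() != procs.size()) {
            throw new IllegalStateException("cyclic dependency between processes");
        }
        return result;
    }

    static List<String> names(List<Proc> procs) {
        List<String> out = new ArrayList<>();
        for (Proc p : procs) out.add(p.name);
        return out;
    }

    public static void main(String[] args) {
        Proc report = new Proc("report", Arrays.asList("hdfs/model", "hdfs/logs"), Arrays.asList("hdfs/report"));
        Proc mahout = new Proc("mahout", Arrays.asList("hdfs/logs"), Arrays.asList("hdfs/model"));
        Proc copy = new Proc("copy", Arrays.asList("remote/logs"), Arrays.asList("hdfs/logs"));
        // Deliberately handed over out of order.
        System.out.println(names(order(Arrays.asList(report, mahout, copy))));
    }
}
```

Resources with no producer in the collection (remote/logs above) are treated as already existing, so the copy job is free to run first.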

You can see more about Riffle here (which includes a tool to run a collection of processes):

You can download WIP builds for Cascading 1.2 (1.1 is the current stable version) here:

Note that Riffle is very early stage (and likely naive), and the Cascading support is likely to evolve before the 1.2 final release (sometime this fall).

The long term goal here is to allow Mahout and other projects to apply the annotations, and then third party tools can be used to run the processes.

For you Cascading users, writing a simple DistCp wrapper (or putting the annotations directly on the Hadoop DistCp object) would allow an efficient copy to run inside a Cascade process alongside your Flow instances.

Or, more importantly, you can write iterative processes (e.g. PageRank) that act like a single process even though internally there is an unknown number of Flows being created on the fly. (I'm running a connected components algorithm that requires multiple Flows/passes in production now as a Riffle object.)
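A toy version of that idea, to show the shape of it: the object below exposes a single 'start' method, but internally keeps running passes until nothing changes. Each pass stands in for one Flow. This is an in-memory label-propagation sketch with made-up names (ConnectedComponents, runPass), not the production code mentioned above and not tied to the Cascading API.

```java
import java.util.Arrays;

class IterativeProcessSketch {
    // Looks like one process from the outside; runs as many passes as it needs.
    public static final class ConnectedComponents {
        private final int[][] edges;
        private final int[] labels;
        private int passes = 0;

        public ConnectedComponents(int nodeCount, int[][] edges) {
            this.edges = edges;
            this.labels = new int[nodeCount];
            for (int i = 0; i < nodeCount; i++) labels[i] = i; // every node starts alone
        }

        // One pass (stand-in for one Flow): push the smaller label across each edge.
        // Returns true if any label changed, meaning another pass is required.
        private boolean runPass() {
            boolean changed = false;
            for (int[] e : edges) {
                int min = Math.min(labels[e[0]], labels[e[1]]);
                if (labels[e[0]] != min) { labels[e[0]] = min; changed = true; }
                if (labels[e[1]] != min) { labels[e[1]] = min; changed = true; }
            }
            passes++;
            return changed;
        }

        // The single lifecycle entry point a runner would call.
        public void start() {
            while (runPass()) {
                // keep going; the number of passes is not known up front
            }
        }

        public int[] getLabels() { return labels.clone(); }
        public int getPasses() { return passes; }
    }

    public static void main(String[] args) {
        int[][] edges = { {3, 4}, {1, 2}, {0, 1} };
        ConnectedComponents cc = new ConnectedComponents(5, edges);
        cc.start();
        System.out.println("labels: " + Arrays.toString(cc.getLabels()));
        System.out.println("passes: " + cc.getPasses());
    }
}
```

The caller never sees the loop. It calls start once and gets back a finished result, which is what lets an iterative job sit in a chain next to ordinary one-shot processes.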

Please feel free to fork and tweak.


Chris K Wensel
[hidden email]