FlumeJava: Easy, Efficient Data-Parallel Pipelines - Google, 2010

FlumeJava is a Java library to build data pipelines. While writing a single MapReduce job is not terribly difficult, writing the glue code between many jobs is cumbersome and prone to error. Additionally, code is not very modular as each stage tends to bake in the the glue code (reading the previous stages output for example). FlumeJava allows for computation to be modular and separate from the pipeline.

While FlumeJava translates to MapReduce steps, I’d like to see a FlumeJava that translates to MillWheel computations. While on the Streaming Compute team at Twitter, I noticed a lot of Storm users really liked the abstractions provided by Trident. Even if they didn’t need the exactly-once processing.

• Core Abstractions

I don’t cover all of them of course, just some of them.

• recordsOf(...) describes the serialization format
• PTable<K,V> represents a immutable multi-map
• It’s just a sublcass of PCollection<Pair<K,V>>
• parrallelDo(...) runs a function against each element in the collection
• i.e. map
• They also provide many other variants to convey the semantics of the computation to the optimizer
• groupByKey(...) represents the shuffle step in MapReduce
• combineValues(...) transforms PTable<L, Collection<V>> -> PTable<K, V> with an associative combining function
• FlumeJava also provides the obligatory join(), derived from the other operations, that transforms (PTable<K, V1>, PTable<K, V2>) -> PTable<K, Tuple2<Collection<V1>, Collection<V2>
• Optimizer
• ParallelDo Fusion
• Producer-Consumer Fusion: function-composition
• Sibling Fusion: Combines mappers that read the same input
• MapShuffleCombineReduce (MSCR)

“MSCR generalizes MapReduce by allowing multiple reducers and combiners, by allowing each reducer to produce multiple out-puts.”

• The generalization is great for optimization
• Strategy
• Sink Flattens - h(f(a) + g(b)) -> h(f(a)) + h(g(b))
• Creates opportunities for fusion
• Lift combineValues - A combineValues that follows a groupByKey is unnecessary
• Insert fusion blocks - Combine groupByKeys together
• Fuse parallelDos
• Fuse MCSRs
• Future Work
• Currently FlumeJava does not analyze the code in user functions
• If it did, it might be able to perform CSE
• Executor
• FlumeJava can cache intermediate results for reuse in debugging and testing
• I think this shows a great attention to the user
• The authors note that Cascading program is an explicit dataflow graph