FlumeJava is a Java library for building data pipelines. While writing a single MapReduce job is not terribly difficult, writing the glue code between many jobs is cumbersome and error-prone. The code is also not very modular, since each stage tends to bake in the glue code (for example, reading the previous stage's output). FlumeJava allows the computation to be modular and separate from the pipeline.
While FlumeJava translates to MapReduce steps, I’d like to see a FlumeJava that translates to MillWheel computations. While on the Streaming Compute team at Twitter, I noticed that a lot of Storm users really liked the abstractions provided by Trident, even when they didn’t need its exactly-once processing.
I don’t cover all of the abstractions here, of course, just some of them.
recordsOf(...) describes the serialization format
PTable<K, V> represents an immutable multi-map
parallelDo(...) runs a function against each element in the collection
groupByKey(...) represents the shuffle step in MapReduce
combineValues(...), which follows a groupByKey(...), transforms PTable<K, Collection<V>> -> PTable<K, V> with an associative combining function
join(), derived from the other operations, transforms
(PTable<K, V1>, PTable<K, V2>) -> PTable<K, Tuple2<Collection<V1>, Collection<V2>>>
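To make the semantics of these primitives concrete, here is a sketch in plain Java (not the real FlumeJava API, which runs distributed and lazily builds an execution plan) of what a parallelDo -> groupByKey -> combineValues pipeline computes, using an in-memory word count. The method names mirror the primitives above; everything else is an assumption for illustration.

```java
import java.util.*;
import java.util.stream.*;

public class PipelineSketch {

    // parallelDo: apply a function to each element (here: split lines into words)
    static List<String> parallelDo(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .collect(Collectors.toList());
    }

    // groupByKey: the shuffle -- gather all values that share a key
    static Map<String, List<Integer>> groupByKey(List<String> words) {
        return words.stream()
                .collect(Collectors.groupingBy(w -> w,
                        Collectors.mapping(w -> 1, Collectors.toList())));
    }

    // combineValues: reduce each key's values with an associative function (sum)
    static Map<String, Integer> combineValues(Map<String, List<Integer>> grouped) {
        Map<String, Integer> out = new HashMap<>();
        grouped.forEach((k, vs) -> out.put(k, vs.stream().reduce(0, Integer::sum)));
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("a b a", "b a");
        Map<String, Integer> counts = combineValues(groupByKey(parallelDo(lines)));
        System.out.println(counts.get("a") + " " + counts.get("b")); // 3 2
    }
}
```

In the real library each of these returns a deferred PCollection/PTable node; the optimizer then fuses the chain into as few MapReduce steps as possible.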
“MSCR generalizes MapReduce by allowing multiple reducers and combiners, by allowing each reducer to produce multiple outputs.”
h(f(a) + g(b)) -> h(f(a)) + h(g(b))
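This property is what lets part of the combining work move into the mappers: if the combining function is associative, each mapper can pre-combine its local values and the reducer merges the partial results, with the same answer as combining everything in one place. A small sketch (sum as the combining function; the shard names are illustrative, not from the paper):

```java
import java.util.*;
import java.util.stream.*;

public class CombinerSketch {
    // The associative combining function: here, integer sum.
    static int combineAll(List<Integer> values) {
        return values.stream().reduce(0, Integer::sum);
    }

    public static void main(String[] args) {
        List<Integer> shard1 = List.of(1, 2, 3); // values seen by mapper 1
        List<Integer> shard2 = List.of(4, 5);    // values seen by mapper 2

        // Reducer-only: combine every value in one place.
        int all = combineAll(Stream.concat(shard1.stream(), shard2.stream())
                                   .collect(Collectors.toList()));

        // With map-side combiners: pre-combine each shard, then merge partials.
        int merged = combineAll(List.of(combineAll(shard1), combineAll(shard2)));

        System.out.println(all + " " + merged); // 15 15
    }
}
```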