FlumeJava is a Java library for building data pipelines. While writing a single MapReduce job is not terribly difficult, writing the glue code between many jobs is cumbersome and error-prone. The code also tends not to be modular, since each stage bakes in its glue code (reading the previous stage's output, for example). FlumeJava lets the computation be modular and separate from the pipeline plumbing.
While FlumeJava translates to MapReduce steps, I'd like to see a FlumeJava that translates to MillWheel computations. While on the Streaming Compute team at Twitter, I noticed that a lot of Storm users really liked the abstractions provided by Trident, even when they didn't need its exactly-once processing.
Core Abstractions
I won't cover all of them, just some of the important ones; a small example follows the list.
- recordsOf(...): describes the serialization format of a collection's elements
- PTable<K, V>: represents an immutable multi-map; essentially a PCollection<Pair<K, V>>
- parallelDo(...): runs a function against each element in the collection
- groupByKey(...): represents the shuffle step in MapReduce; transforms PTable<K, V> -> PTable<K, Collection<V>>
- combineValues(...): transforms PTable<K, Collection<V>> -> PTable<K, V> with an associative combining function
- join(): derived from the other operations; transforms (PTable<K, V1>, PTable<K, V2>) -> PTable<K, Tuple2<Collection<V1>, Collection<V2>>>
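To see how these compose, here is a minimal word-count sketch loosely paraphrasing the paper's running example; readTextFileCollection, splitIntoWords, and SUM_INTS are helpers the paper assumes, not code I've run:

    // Read the input and declare its element type.
    PCollection<String> lines =
        readTextFileCollection("/gfs/data/shakes/hamlet.txt");

    // parallelDo: split each line into words.
    PCollection<String> words = lines.parallelDo(
        new DoFn<String, String>() {
          void process(String line, EmitFn<String> emitFn) {
            for (String word : splitIntoWords(line)) {
              emitFn.emit(word);
            }
          }
        }, collectionOf(strings()));

    // parallelDo again: pair each word with a count of one.
    PTable<String, Integer> wordsWithOnes = words.parallelDo(
        new DoFn<String, Pair<String, Integer>>() {
          void process(String word, EmitFn<Pair<String, Integer>> emitFn) {
            emitFn.emit(Pair.of(word, 1));
          }
        }, tableOf(strings(), ints()));

    // groupByKey is the shuffle; combineValues sums with an
    // associative combining function.
    PTable<String, Collection<Integer>> grouped = wordsWithOnes.groupByKey();
    PTable<String, Integer> wordCounts = grouped.combineValues(SUM_INTS);

    // Nothing has executed yet; the pipeline is a deferred plan.
    FlumeJava.run();

Every operation above just builds up a deferred execution plan; FlumeJava.run() is what triggers optimization and execution.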
MapShuffleCombineReduce (MSCR)
“MSCR generalizes MapReduce by allowing multiple reducers and combiners, by allowing each reducer to produce multiple outputs.”
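As a rough illustration of why multiple reducers matter, consider keying the same input two different ways; the DoFns below (keyByWordFn, keyByPrefixFn) are hypothetical, not from the paper:

    // Two keyed views of the same input collection.
    PTable<String, Integer> byWord =
        events.parallelDo(keyByWordFn, tableOf(strings(), ints()));
    PTable<String, Integer> byPrefix =
        events.parallelDo(keyByPrefixFn, tableOf(strings(), ints()));

    // Related groupByKeys like these can be fused into one MSCR:
    // a single shuffle stage feeding two reducers, each producing
    // its own output, instead of two separate MapReduce jobs.
    PTable<String, Collection<Integer>> groupedByWord = byWord.groupByKey();
    PTable<String, Collection<Integer>> groupedByPrefix = byPrefix.groupByKey();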
Optimizations
The optimizer rewrites the execution plan with a handful of transformations:
- Sinking flattens: a parallelDo h that consumes a flatten is pushed into the flatten's inputs, i.e. h(f(a) + g(b)) -> h(f(a)) + h(g(b))
- A combineValues that follows a groupByKey is unnecessary as a separate step; it can be lifted into the MapReduce combiner
- Fusing related groupByKeys together into a single MSCR
- Fusing parallelDos together (sketched below)
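Here is a sketch of what parallelDo fusion buys you, showing producer-consumer fusion; the DoFns are illustrative, not from the paper:

    // What the programmer writes: two logically separate steps.
    PCollection<String> trimmed = lines.parallelDo(
        new DoFn<String, String>() {
          void process(String line, EmitFn<String> emitFn) {
            emitFn.emit(line.trim());
          }
        }, collectionOf(strings()));
    PCollection<Integer> lengths = trimmed.parallelDo(
        new DoFn<String, Integer>() {
          void process(String s, EmitFn<Integer> emitFn) {
            emitFn.emit(s.length());
          }
        }, collectionOf(ints()));

    // What the optimizer effectively runs: one fused parallelDo,
    // avoiding materializing the intermediate collection.
    PCollection<Integer> fused = lines.parallelDo(
        new DoFn<String, Integer>() {
          void process(String line, EmitFn<Integer> emitFn) {
            emitFn.emit(line.trim().length());
          }
        }, collectionOf(ints()));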