MillWheel: Fault-Tolerant Stream Processing at Internet Scale - Google, 2013

Something I’ve noticed from this and other Google papers is that Google systems try to limit the API to allow for stronger guarantees. Other systems tend to relax guarantees in order to have more API features.

• Originally for Google’s Zeitgeist product
• Tracks trends in web queries
• Uses 1-second windows
• Data Model
• Each record is a (key, value, timestamp) tuple
• The key is how the data is partitioned across instances of a computation
• Low Watermarks
• They wanted to be able to tell the difference between data that is delayed and data that is not coming
• Each record coming in to the system is given a timestamp (often corresponding to the time the data was generated)
• The system tells the computations that all the data up to the low watermark has likely already arrived
• The application can decide what to do with late arriving data
• Some choose to drop it but log it
• Others actually go and correct their data
• Duplicate Prevention
• Exactly-once processing
• Used in billing systems
• Persistent State
• This is one of my favorite aspects of MillWheel
• Each computation for a given key has a persistent state updated atomically
• Allows for non-idempotent code to be treated as if it was idempotent
• This could be used to store the rolling window or anything else
• Treated as a binary blob, usually use Protocol Buffers for serialization
• The goal with this abstraction is to hide the replay, failures and relocation that the system needs to do
• Through the use of checkpointing and replay, they can guarantee exactly-once processing
• Though they note this does not automatically extend to external systems
• This was one of the issues for users of Storm
• Zombie Writers
• Old instances of computations changing state
• Add a sequence token to each write and the mediator to the store verifies this token
• Computations
• Processing is serialized per key, but parallelized over distinct keys
• They do not need have any order guarantees (beyond the low watermark)
• Any computation can produce to any stream and computation can subscribe to any stream
• Computations can be added and removed from the topology without restarting the system
• I wonder how they deal with noisy neighbors? Maybe it’s just part of the process isolation they do for Borg?
• Timers
• Per-key
• Computation can set them to run at a watermark level or at some time in the future
• Used for time based barriers (i.e. to end a window)
• Injectors
• Push data into MillWheel
• They decide on the low watermark value
• Can be distributed
• The aggregated low watermark value is used
• System
• Process for exactly once delivery:
• Record is checked against dedupe data
• User code is run
• Pending changes are committed
• Senders ACKed
• Productions sent to downstream services
• They can do the check pointing for multiple records in a batch
• Each record has a unique ID that is associated with the state modification
• As an atomic write to the backing store
• Use a bloom filter, records that have not been seen will likely pass
• Eventually GC the record IDs
• A replicated master handles load distribution
• Can move/split/merge intervals
• Strong Productions
• Produced records that are check-pointed before delivery
• Weak Productions
• Disable the exactly-once guarantee for applications that don’t need it
• Can’t simply skip the dedupe part for performance
• Optimistically send productions down the topology, waiting for downstream ACKs
• However, machine failure can quickly cause the pipeline to backup
• To alleviate this, the checkpoint some of the stragglers so that the computation can ACK it’s upstream producer
• (Note, this is a case where adding some guarantees actually increases performance)
• This is an interesting idea for Storm!