Kafka: a Distributed Messaging System for Log Processing - LinkedIn, 2011

I’m a big fan of Kafka. One of the coolest ideas Kafka exploits is that while random disk access is slow, sequential I/O has high throughput, so it’s fine to keep a queue / stream on disk.

  • Design
    • Topic: A stream of messages of a certain type
    • Brokers: Set of servers responsible for a topic
      • Store messages in segment files in append only fashion
      • Each file is around 1GB
      • Only flush a segment after a certain number of messages have been published or after a certain amount of time has passed
        • Configurable
      • Consumers can only see a message after it is flushed
      • Do not maintain their own cache; rely on the filesystem page cache
        • Little GC overhead
      • Uses the sendfile API to send file data straight to the socket, avoiding redundant copies and extra system calls
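The broker's storage layer is simple enough to sketch. Here's a minimal Python toy of an append-only segment log that rolls files near a size limit and only makes data visible to consumers after a flush (all names and defaults are mine; real Kafka is JVM code, and both the segment size and flush policy are configurable):

```python
import os

class SegmentLog:
    """Toy append-only segment log in the spirit of a Kafka broker."""

    def __init__(self, directory, segment_bytes=1 << 30, flush_every=500):
        self.dir = directory
        self.segment_bytes = segment_bytes   # roll to a new file near this size
        self.flush_every = flush_every       # flush after N appends (configurable)
        self.unflushed = 0
        self.log_end = 0                     # byte offset of the end of the log
        self.flushed_end = 0                 # consumers may only read below this
        os.makedirs(directory, exist_ok=True)
        self._open_segment()

    def _open_segment(self):
        # Name each segment file by the byte offset of its first message.
        path = os.path.join(self.dir, f"{self.log_end:020d}.log")
        self.segment = open(path, "ab")

    def append(self, payload: bytes):
        self.segment.write(payload)          # sequential write, never in place
        self.log_end += len(payload)
        self.unflushed += 1
        if self.unflushed >= self.flush_every:
            self.flush()
        if self.segment.tell() >= self.segment_bytes:
            self.flush()                     # seal the segment before rolling
            self.segment.close()
            self._open_segment()

    def flush(self):
        self.segment.flush()
        os.fsync(self.segment.fileno())
        self.flushed_end = self.log_end      # now visible to consumers
        self.unflushed = 0
```

Deferring the fsync amortizes disk syncs across many messages, which is also why consumers can only see flushed data: anything past `flushed_end` could still be lost in a crash.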
    • Partition: A topic is sharded across brokers, each called a partition
      • Balances load
      • A consumer is guaranteed to see messages in order for a given partition
      • Smallest unit of parallelism
    • Offset: The distance in bytes from the beginning of the log
      • Consumers do not request messages by IDs but by offset
      • Client libraries keep track of their offset
      • Brokers do not need to keep track of the offset
      • Makes replay of the log really easy
        • Very popular for ETL workloads
        • We made good use of this at Twitter
      • The client library may expose only one message at a time to the application, but it prefetches and buffers data from the broker
      • Broker deletes data after a configured time
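Addressing by byte offset rather than message ID works because messages are length-prefixed, so the next offset is just the current offset plus the size of what you read. A toy sketch (an in-memory bytearray stands in for the on-disk log, and the 4-byte length prefix is my assumption, not Kafka's exact wire format):

```python
import struct

def append(log: bytearray, payload: bytes) -> None:
    # 4-byte big-endian length prefix, then the payload.
    log += struct.pack(">I", len(payload)) + payload

def fetch(log, offset: int):
    """Return (message, next_offset) starting at a byte offset."""
    if offset >= len(log):
        return None, offset                  # nothing new yet
    (n,) = struct.unpack_from(">I", log, offset)
    start = offset + 4
    return bytes(log[start:start + n]), start + n
```

The consumer owns the offset, so replay is just resetting it to an earlier value, which is why this design is so ETL-friendly.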
    • Consumer Group: A set of consumers that jointly consume a topic; the group as a whole receives each message at least once
      • In other words, each message is delivered to only one consumer within the group
      • A partition can only be subscribed to by one consumer in each group
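The "one consumer per partition per group" rule means group membership boils down to deterministically dividing partitions among consumers. A sketch of a range-style assignment (my own simplification of the rebalance logic the paper describes):

```python
def assign_partitions(consumers, num_partitions):
    """Deterministically assign each partition to exactly one consumer."""
    consumers = sorted(consumers)            # same order on every member
    per = num_partitions // len(consumers)
    extra = num_partitions % len(consumers)
    assignment, p = {}, 0
    for i, c in enumerate(consumers):
        n = per + (1 if i < extra else 0)    # spread the remainder evenly
        assignment[c] = list(range(p, p + n))
        p += n
    return assignment
```

Because every consumer sorts the same membership list, each one can compute the same assignment independently without a central coordinator handing out partitions.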
  • Coordination
    • A topic is often over-partitioned so that a consumer group can have more consumers than there are brokers
    • Consumers keep track of their offsets in Zookeeper
    • Brokers use Zookeeper to detect changes in the system
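Offset checkpointing is just writing a small value under a well-known path per (group, topic, partition). A sketch with a plain dict standing in for Zookeeper's tree (the path layout is my guess at the spirit of it, not the exact znode structure):

```python
def checkpoint(store, group, topic, partition, offset):
    # Persist the last-consumed offset under a per-group path.
    store[f"/consumers/{group}/offsets/{topic}/{partition}"] = offset

def restore(store, group, topic, partition, default=0):
    # On restart (or rebalance), resume from the checkpoint if one exists.
    return store.get(f"/consumers/{group}/offsets/{topic}/{partition}", default)
```

Since the offset is the only consumer state, losing a consumer costs at most a re-read from the last checkpoint, consistent with the at-least-once delivery noted below.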
  • Misc:
    • At-least-once semantics
      • Some clients use unique IDs if it matters
    • Use a CRC for each message
      • The paper makes it sound like it transparently drops messages with a bad CRC
      • I hope there is some kind of logging for this!
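The per-message CRC is cheap to sketch: store a checksum alongside the payload and verify it on read. A Python toy using CRC-32 (the framing format is mine):

```python
import struct
import zlib

def frame(payload: bytes) -> bytes:
    """Prepend a CRC-32 so corruption is detectable later."""
    return struct.pack(">I", zlib.crc32(payload) & 0xFFFFFFFF) + payload

def unframe(record: bytes):
    """Return the payload, or None if the CRC doesn't match
    (the paper suggests bad records are simply skipped)."""
    (stored,) = struct.unpack_from(">I", record)
    payload = record[4:]
    if zlib.crc32(payload) & 0xFFFFFFFF != stored:
        return None   # ideally log this somewhere!
    return payload
```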
    • In the future they want to add replication (I believe this is in there now)
    • In their use, the producers periodically write a monitoring record which includes the count of messages they sent
      • Consumers then verify this count against what they actually received
      • I thought this was a great low-overhead way to verify the system
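The audit trick is easy to picture: producers interleave control records carrying the count of data messages sent in a window, and consumers tally what they saw and compare. A toy sketch (the record shapes are mine):

```python
def audit(stream):
    """stream: iterable of ("data", payload) or ("audit", expected_count).
    Returns True only if every window's count matches."""
    seen = 0
    ok = True
    for kind, value in stream:
        if kind == "data":
            seen += 1
        else:                      # an audit record closes the window
            ok = ok and (seen == value)
            seen = 0
    return ok
```

A mismatch doesn't tell you which message was lost or duplicated, but it flags that *something* went wrong in that window, at the cost of one extra record.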
    • They have a custom Hadoop input format that reads from Kafka
      • With the ability to tell the broker where you want to start reading from, the tasks can easily restart when they fail
    • They have a schema registry service for their Avro serialized messages