Kafka: a Distributed Messaging System for Log Processing - LinkedIn, 2011

I’m a big fan of Kafka. One of the coolest ideas Kafka exploits is that while random disk access is slow, sequential I/O has high throughput, so it’s fine to keep a queue / stream on disk.

  • Design
    • Topic: A stream of messages of a certain type
    • Brokers: Set of servers responsible for a topic
      • Store messages in segment files in append only fashion
      • Each file is around 1GB
      • Only flush a segment after a certain number of messages have been published or after a certain amount of time has passed
        • Configurable
      • Consumers can only see a message after it is flushed
      • Do not maintain their own cache; rely on the filesystem page cache
        • Little GC overhead
      • Uses the sendfile API to send file data straight to the socket, avoiding redundant copies and extra system calls
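The broker's storage layer is simple enough to sketch. Here's a minimal Python toy of an append-only segment log that rolls files near a size limit and only makes data visible to consumers after a flush (all names and defaults are mine; real Kafka is JVM code, and both the segment size and flush policy are configurable):

```python
import os

class SegmentLog:
    """Toy append-only segment log in the spirit of a Kafka broker."""

    def __init__(self, directory, segment_bytes=1 << 30, flush_every=500):
        self.dir = directory
        self.segment_bytes = segment_bytes   # roll to a new file near this size
        self.flush_every = flush_every       # flush after N appends (configurable)
        self.unflushed = 0
        self.log_end = 0                     # byte offset of the end of the log
        self.flushed_end = 0                 # consumers may only read below this
        os.makedirs(directory, exist_ok=True)
        self._open_segment()

    def _open_segment(self):
        # Name each segment file by the byte offset of its first message.
        path = os.path.join(self.dir, f"{self.log_end:020d}.log")
        self.segment = open(path, "ab")

    def append(self, payload: bytes):
        self.segment.write(payload)          # sequential write, never in place
        self.log_end += len(payload)
        self.unflushed += 1
        if self.unflushed >= self.flush_every:
            self.flush()
        if self.segment.tell() >= self.segment_bytes:
            self.flush()                     # seal the segment before rolling
            self.segment.close()
            self._open_segment()

    def flush(self):
        self.segment.flush()
        os.fsync(self.segment.fileno())
        self.flushed_end = self.log_end      # now visible to consumers
        self.unflushed = 0
```

Deferring the fsync amortizes disk syncs across many messages, which is also why consumers can only see flushed data: anything past `flushed_end` could still be lost in a crash.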
    • Partition: A topic is sharded across brokers, each called a partition
      • Balances load
      • A consumer is guaranteed to see messages in order for a given partition
      • Smallest unit of parallelism
    • Offset: The distance in bytes from the beginning of the log
      • Consumers do not request messages by IDs but by offset
      • Client libraries keep track of their offset
      • Brokers do not need to keep track of the offset
      • Makes replay of the log really easy
        • Very popular for ETL workloads
        • We made good use of this at Twitter
      • The client library may expose only one message at a time to the application, but it prefetches and buffers data from the broker
      • Broker deletes data after a configured time
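Addressing by byte offset rather than message ID works because messages are length-prefixed, so the next offset is just the current offset plus the size of what you read. A toy sketch (an in-memory bytearray stands in for the on-disk log, and the 4-byte length prefix is my assumption, not Kafka's exact wire format):

```python
import struct

def append(log: bytearray, payload: bytes) -> None:
    # 4-byte big-endian length prefix, then the payload.
    log += struct.pack(">I", len(payload)) + payload

def fetch(log, offset: int):
    """Return (message, next_offset) starting at a byte offset."""
    if offset >= len(log):
        return None, offset                  # nothing new yet
    (n,) = struct.unpack_from(">I", log, offset)
    start = offset + 4
    return bytes(log[start:start + n]), start + n
```

The consumer owns the offset, so replay is just resetting it to an earlier value, which is why this design is so ETL-friendly.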
    • Consumer Group: A set of consumers that jointly consume a topic; the group as a whole receives each message at least once
      • In other words, each message is delivered to only one consumer within the group
      • A partition can only be subscribed to by one consumer in each group
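The "one consumer per partition per group" rule means group membership boils down to deterministically dividing partitions among consumers. A sketch of a range-style assignment (my own simplification of the rebalance logic the paper describes):

```python
def assign_partitions(consumers, num_partitions):
    """Deterministically assign each partition to exactly one consumer."""
    consumers = sorted(consumers)            # same order on every member
    per = num_partitions // len(consumers)
    extra = num_partitions % len(consumers)
    assignment, p = {}, 0
    for i, c in enumerate(consumers):
        n = per + (1 if i < extra else 0)    # spread the remainder evenly
        assignment[c] = list(range(p, p + n))
        p += n
    return assignment
```

Because every consumer sorts the same membership list, each one can compute the same assignment independently without a central coordinator handing out partitions.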
  • Coordination
    • A topic is often over-partitioned so that a consumer group can have more consumers than there are brokers
    • Consumers keep track of their offsets in Zookeeper
    • Brokers use Zookeeper to detect changes in the system
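Offset checkpointing is just writing a small value under a well-known path per (group, topic, partition). A sketch with a plain dict standing in for Zookeeper's tree (the path layout is my guess at the spirit of it, not the exact znode structure):

```python
def checkpoint(store, group, topic, partition, offset):
    # Persist the last-consumed offset under a per-group path.
    store[f"/consumers/{group}/offsets/{topic}/{partition}"] = offset

def restore(store, group, topic, partition, default=0):
    # On restart (or rebalance), resume from the checkpoint if one exists.
    return store.get(f"/consumers/{group}/offsets/{topic}/{partition}", default)
```

Since the offset is the only consumer state, losing a consumer costs at most a re-read from the last checkpoint, consistent with the at-least-once delivery noted below.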
  • Misc:
    • At-least-once semantics
      • Some clients use unique IDs if it matters
    • Use a CRC for each message
      • The paper makes it sound like it transparently drops messages with a bad CRC
      • I hope there is some kind of logging for this!
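The per-message CRC is cheap to sketch: store a checksum alongside the payload and verify it on read. A Python toy using CRC-32 (the framing format is mine):

```python
import struct
import zlib

def frame(payload: bytes) -> bytes:
    """Prepend a CRC-32 so corruption is detectable later."""
    return struct.pack(">I", zlib.crc32(payload) & 0xFFFFFFFF) + payload

def unframe(record: bytes):
    """Return the payload, or None if the CRC doesn't match
    (the paper suggests bad records are simply skipped)."""
    (stored,) = struct.unpack_from(">I", record)
    payload = record[4:]
    if zlib.crc32(payload) & 0xFFFFFFFF != stored:
        return None   # ideally log this somewhere!
    return payload
```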
    • In the future they want to add replication (I believe this is in there now)
    • In their use, the producers periodically write a monitoring record which includes the count of messages they sent
      • Consumers then verify this count against what they actually received
      • I thought this was a great low-overhead way to verify the system
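The audit trick is easy to picture: producers interleave control records carrying the count of data messages sent in a window, and consumers tally what they saw and compare. A toy sketch (the record shapes are mine):

```python
def audit(stream):
    """stream: iterable of ("data", payload) or ("audit", expected_count).
    Returns True only if every window's count matches."""
    seen = 0
    ok = True
    for kind, value in stream:
        if kind == "data":
            seen += 1
        else:                      # an audit record closes the window
            ok = ok and (seen == value)
            seen = 0
    return ok
```

A mismatch doesn't tell you which message was lost or duplicated, but it flags that *something* went wrong in that window, at the cost of one extra record.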
    • They have a custom Hadoop input format that reads from Kafka
      • With the ability to tell the broker where you want to start reading from, the tasks can easily restart when they fail
    • They have a schema registry service for their Avro serialized messages