PNUTS: Yahoo!'s Hosted Data Serving Platform - Yahoo!, 2008

  • [Paper] [Mirror]

  • Designed for low latency; relaxes consistency (weaker than serializability) to get it
  • Uses an interesting technique of record level pub/sub instead of a replicated log
    • (Though I’d argue a replicated log can emulate pub/sub)
    • Pub/sub makes it easy for them to deliver asynchronous notifications to applications and caches
  • Transactional but not serializable
  • Schemas are flexible (records are stored as JSON in MySQL)
  • Can do selections and projections from a single table
  • Updates and deletes can only be done by a specified key
  • Tables can be hashed or ordered
  • Does not support joins or group by
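A tiny sketch of that data model: flexible JSON records stored as blobs in a relational table, addressed by primary key, with selection and projection limited to a single table. I use sqlite3 in place of MySQL and made-up names (`users`, `put`, `scan`) purely to keep the example self-contained.

```python
# Sketch only: JSON records in a relational table, keyed by primary key.
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (pk TEXT PRIMARY KEY, record TEXT)")

def put(pk, record):
    # Updates (like deletes) are only addressable by the key
    db.execute("INSERT OR REPLACE INTO users VALUES (?, ?)", (pk, json.dumps(record)))

def delete(pk):
    db.execute("DELETE FROM users WHERE pk = ?", (pk,))

def scan(predicate, fields):
    # Selection + projection over a single table; no joins or group-by
    for pk, blob in db.execute("SELECT pk, record FROM users"):
        record = json.loads(blob)
        if predicate(record):
            yield pk, {f: record.get(f) for f in fields}

put("alice", {"home": "SF", "karma": 42, "bio": "hello"})
put("bob", {"home": "NYC", "karma": 7})
print(list(scan(lambda r: r["karma"] > 10, ["home", "karma"])))
# [('alice', {'home': 'SF', 'karma': 42})]
```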
  • Provides per-record timeline consistency
    • That is, all replicas apply updates in the same order, though different replicas may have older versions for a period of time
    • Each record has a version number prefixed with a generation number
      • The generation number is incremented on an insert
      • v2.3 is the 3rd update of the second generation
      • They seem to use v2.0 to indicate the record has been deleted (the versioning scheme is sketched in code after this block)
    • The replica with the majority of writes becomes the master of the record
      • Each record has a list of the origin of the last N writes (typically 3)
      • This uses 2 bytes per record
      • The mastership can move between replicas
      • I’m not clear on how this prevents two replicas from believing they are the master and progressing separately
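A minimal sketch, based only on the notes above, of how a replica could enforce the per-record timeline with (generation, update) version numbers. In the real system the record's master assigns the numbers and mastership can migrate; neither is modeled here.

```python
# Sketch: a replica applies an update only if it is the next point on the
# record's timeline, so every replica that catches up sees the same history.
from dataclasses import dataclass

@dataclass(frozen=True)
class Version:
    generation: int  # bumped on insert (e.g. re-insert after a delete)
    update: int      # bumped on each write; v2.3 = 3rd update of generation 2

class Replica:
    def __init__(self):
        self.records = {}  # key -> (Version, value)

    def apply(self, key, version, value):
        current = self.records.get(key)
        if current is None:
            ok = version.update == 1  # first write of a fresh record
        else:
            cur, _ = current
            ok = version in (Version(cur.generation, cur.update + 1),  # next update
                             Version(cur.generation + 1, 1))           # re-insert
        if not ok:
            raise ValueError(f"out-of-order update for {key}: have {current}, got {version}")
        self.records[key] = (version, value)

west, east = Replica(), Replica()
for replica in (west, east):                     # same order everywhere, even if
    replica.apply("alice", Version(1, 1), "SF")  # one replica lags behind the other
    replica.apply("alice", Version(1, 2), "NYC")
```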
  • API
    • Read-any - Possibly stale read
    • Read-critical(version) - Read a version at least as new as the supplied version
    • Read-latest - Surprisingly, they do not always have to go to the master for this read, only in some circumstances
    • Write
    • Test-and-set-write(version) - Only writes if the current version is the same as the supplied version
    • Results can be streamed back to the client as they arrive
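A hedged, single-process sketch of what those calls could look like; the class name, integer versions, and in-memory dict are my simplifications, not the actual client.

```python
# Sketch of the per-record API; with real replicas the different read calls
# would be routed to different copies, which is where the latency/staleness
# trade-off comes from.
class PnutsClientSketch:
    def __init__(self):
        self.store = {}  # key -> (version, value)

    def read_any(self, key):
        # May be served by any replica, so the result can be stale
        return self.store.get(key)

    def read_critical(self, key, required_version):
        # Must be served by a copy at least as new as required_version
        version, value = self.store[key]
        assert version >= required_version
        return version, value

    def read_latest(self, key):
        # May have to consult the record's master in the real system
        return self.store[key]

    def write(self, key, value):
        version = self.store.get(key, (0, None))[0] + 1
        self.store[key] = (version, value)
        return version

    def test_and_set_write(self, key, expected_version, value):
        # Read-modify-write without lost updates
        if self.store.get(key, (0, None))[0] != expected_version:
            return None  # lost the race; caller re-reads and retries
        return self.write(key, value)

client = PnutsClientSketch()
v1 = client.write("alice", {"home": "SF"})
assert client.test_and_set_write("alice", v1, {"home": "NYC"}) == 2
assert client.test_and_set_write("alice", v1, {"home": "LA"}) is None  # stale version
```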
  • Architecture
    • Divided into regions
      • Each region holds a complete copy of each table
      • The authors claim that multiple copies allow them to eschew the use of backups
        • I question this mentality. What about software bugs or user error?
    • Each table is horizontally partitioned into tablets (a la Bigtable)
      • In the range of hundreds of megabytes to a few gigabytes
    • Tablet controller owns the mapping from key to tablet
      • Interval mapping, as some tables are ordered
      • Pair of active/standby servers
      • They use two-phase commit (2PC) when splitting tablets
    • Router holds a cached copy
      • If a request is misdirected, the tablet server will return an error that will cause the router to retrieve a new copy
      • They mention that the router is not on the data path, so I wonder how it receives the misdirection error
        • Does the client request the location then query the tablet?
          • On an error, the tablet server could send an error to the router based on a field in the request?
        • Or does the router forward the request, with the tablet server then sending the data to the client directly?
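A sketch of the router side: a cached interval map over the tablet split points, with a refresh-and-retry path when a stale cache misdirects a request. The retry protocol here is just one of the possibilities speculated about above, not something the paper spells out.

```python
# Sketch: interval-mapped routing for an ordered table, with cache refresh on
# a misdirected request.
import bisect

class TabletServer:
    def __init__(self, low, high):
        self.low, self.high = low, high  # serves keys in [low, high)

    def handle(self, key, request):
        if not (self.low <= key < self.high):
            return "WRONG_TABLET"  # signal that the caller's mapping is stale
        return f"{request} served by tablet [{self.low!r}, {self.high!r})"

class TabletController:
    """Owns the authoritative mapping (an active/standby pair in the paper)."""
    def __init__(self, splits):
        highs = splits[1:] + ["\uffff"]
        self.servers = [TabletServer(lo, hi) for lo, hi in zip(splits, highs)]

    def current_mapping(self):
        return [s.low for s in self.servers], list(self.servers)

class Router:
    def __init__(self, controller):
        self.controller = controller
        self.boundaries, self.tablets = controller.current_mapping()

    def lookup(self, key):
        # boundaries are the sorted lower bounds of each tablet's key range
        return self.tablets[bisect.bisect_right(self.boundaries, key) - 1]

    def route(self, key, request):
        response = self.lookup(key).handle(key, request)
        if response == "WRONG_TABLET":
            # Stale cache: fetch a fresh mapping from the controller, retry once
            self.boundaries, self.tablets = self.controller.current_mapping()
            response = self.lookup(key).handle(key, request)
        return response

controller = TabletController(["", "g", "m", "t"])
router = Router(controller)
# Simulate a stale cache that predates the split at "g"
router.boundaries, router.tablets = ["", "m"], [TabletServer("", "g"), TabletServer("m", "\uffff")]
print(router.route("hello", "get"))  # misdirected once, then retried with a fresh map
```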
  • Replication
    • “Updates are considered ‘committed’ when they have been published to YMB”
      • YMB is their globally distributed message broker
    • YMB can handle single broker failures by logging messages to multiple servers
      • What about failure of a broker during a maintenance event?
    • Initially two copies are logged and then more as the data propagates asynchronously
    • Messages delivered to a particular YMB cluster will be delivered in order
      • This differs from Kafka, which only promises ordering within a partition
    • To have per-record timeline consistency, all mutations for a given record must go to the same YMB cluster
      • Updates are forwarded to the master of the record to be committed
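A simplified sketch of that write path: the update is forwarded to the record's master region, "committed" by publishing it to the record's YMB cluster (a plain in-process list here), and every region applies messages in publish order, which is what produces the per-record timeline.

```python
# Sketch: commit-by-publish, with replicas applying updates in publish order.
class BrokerCluster:
    """Stand-in for a YMB cluster; delivery order = publish order."""
    def __init__(self):
        self.log = []
        self.subscribers = []

    def publish(self, message):
        self.log.append(message)      # the update is now "committed"
        for sub in self.subscribers:  # asynchronous in reality, inline here
            sub.apply(message)

class Region:
    def __init__(self, name):
        self.name = name
        self.records = {}

    def apply(self, message):
        key, version, value = message
        self.records[key] = (version, value)

def write(record_master_broker, key, version, value):
    # All mutations for a given record go through the same broker cluster,
    # so every region sees that record's updates in the same order.
    record_master_broker.publish((key, version, value))

ymb_west = BrokerCluster()
ymb_west.subscribers += [Region("west"), Region("east")]
write(ymb_west, "alice", 1, {"home": "SF"})
write(ymb_west, "alice", 2, {"home": "NYC"})
```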
  • Recovery
    1. Tablet controller requests a copy from a source tablet
    2. A checkpoint message is published to YMB with the goal of having “any in-flight updates at the time the copy is initiated are applied to the source tablet”
    3. Source tablet is copied to destination region
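A rough sketch of those three steps with in-memory stand-ins; the checkpoint marker forces the source region to apply any in-flight updates before its tablet is copied, which is my reading of step 2.

```python
# Sketch: tablet recovery via checkpoint-then-copy.
import copy

class Broker:
    def __init__(self):
        self.subscribers = []

    def publish(self, message):
        for sub in self.subscribers:
            sub.deliver(message)

class Region:
    def __init__(self):
        self.tablets = {}   # tablet_id -> {key: value}
        self.pending = []   # delivered but not yet applied messages

    def deliver(self, message):
        self.pending.append(message)

    def drain_until(self, marker):
        # Apply queued updates until the checkpoint marker is reached
        while self.pending:
            message = self.pending.pop(0)
            if message[0] == "checkpoint" and message[2] is marker:
                return
            _, tablet_id, key, value = message
            self.tablets.setdefault(tablet_id, {})[key] = value

def recover_tablet(source, dest, tablet_id, broker):
    # 1. the tablet controller asks the source region for a copy
    # 2. a checkpoint ensures in-flight updates are applied to the source first
    marker = object()
    broker.publish(("checkpoint", tablet_id, marker))
    source.drain_until(marker)
    # 3. ship the source tablet's contents to the destination region
    dest.tablets[tablet_id] = copy.deepcopy(source.tablets.get(tablet_id, {}))

broker, west, east = Broker(), Region(), Region()
broker.subscribers.append(west)
broker.publish(("update", "t1", "alice", "SF"))  # still "in flight" (unapplied) at west
recover_tablet(west, east, "t1", broker)
print(east.tablets)  # {'t1': {'alice': 'SF'}}
```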
  • Future Ideas
    • They’d like to have indexes and materialized views
      • Their plan is to asynchronously update these by monitoring the change stream
      • A clever idea if their customers are OK with delayed information
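A quick sketch of that asynchronous-index idea: a consumer tails the per-record change stream and maintains a secondary index that can lag the base table, which is exactly the delayed-information trade-off noted above.

```python
# Sketch: a secondary index maintained off the update stream, not on the write path.
from collections import defaultdict

class AsyncSecondaryIndex:
    def __init__(self, field):
        self.field = field
        self.index = defaultdict(set)  # field value -> set of primary keys
        self.last_seen = {}            # pk -> last indexed field value

    def on_update(self, pk, record):
        # Called for each message on the change stream; the index trails the table
        old = self.last_seen.get(pk)
        new = record.get(self.field)
        if old is not None and old != new:
            self.index[old].discard(pk)
        if new is not None:
            self.index[new].add(pk)
        self.last_seen[pk] = new

index = AsyncSecondaryIndex("home")
index.on_update("alice", {"home": "SF"})
index.on_update("alice", {"home": "NYC"})
print(index.index["NYC"])  # {'alice'}
```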