PNUTS: Yahoo!'s Hosted Data Serving Platform - Yahoo!, 2008

  • [Paper] [Mirror]

  • Designed for low latency; relaxes consistency (weaker than serializability) to get it
  • Uses an interesting technique of record level pub/sub instead of a replicated log
    • (Though I’d argue a replicated log can emulate pub/sub)
    • Pub/sub makes it easy for them to deliver asynchronous notifications to applications and caches
  • Transactional but not serializable
  • Schemas are flexible (records are stored as JSON in MySQL)
  • Can do selections and projections from a single table
  • Updates and deletes can only be done by a specified key
  • Tables can be hashed or ordered
  • Does not support joins or group by
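A tiny sketch of that data model: flexible JSON records stored as blobs in a relational table, addressed by primary key, with selection and projection limited to a single table. I use sqlite3 in place of MySQL and made-up names (`users`, `put`, `scan`) purely to keep the example self-contained.

```python
# Sketch only: JSON records in a relational table, keyed by primary key.
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (pk TEXT PRIMARY KEY, record TEXT)")

def put(pk, record):
    # Updates (like deletes) are only addressable by the key
    db.execute("INSERT OR REPLACE INTO users VALUES (?, ?)", (pk, json.dumps(record)))

def delete(pk):
    db.execute("DELETE FROM users WHERE pk = ?", (pk,))

def scan(predicate, fields):
    # Selection + projection over a single table; no joins or group-by
    for pk, blob in db.execute("SELECT pk, record FROM users"):
        record = json.loads(blob)
        if predicate(record):
            yield pk, {f: record.get(f) for f in fields}

put("alice", {"home": "SF", "karma": 42, "bio": "hello"})
put("bob", {"home": "NYC", "karma": 7})
print(list(scan(lambda r: r["karma"] > 10, ["home", "karma"])))
# [('alice', {'home': 'SF', 'karma': 42})]
```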
  • Provides per-record timeline consistency
    • That is, all replicas apply updates in the same order, though different replicas may have older versions for a period of time
    • Each record has a version number prefixed with a generation number
      • The generation number is incremented on an insert
      • v2.3 is the 3rd update of the second generation
      • They seem to use v2.0 to indicate the record has been deleted (the versioning scheme is sketched in code after this block)
    • The replica with the majority of writes becomes the master of the record
      • Each record has a list of the origin of the last N writes (typically 3)
      • This uses 2 bytes per record
      • The mastership can move between replicas
      • I’m not clear on how this prevents two replicas from believing they are the master and progressing separately
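A minimal sketch, based only on the notes above, of how a replica could enforce the per-record timeline with (generation, update) version numbers. In the real system the record's master assigns the numbers and mastership can migrate; neither is modeled here.

```python
# Sketch: a replica applies an update only if it is the next point on the
# record's timeline, so every replica that catches up sees the same history.
from dataclasses import dataclass

@dataclass(frozen=True)
class Version:
    generation: int  # bumped on insert (e.g. re-insert after a delete)
    update: int      # bumped on each write; v2.3 = 3rd update of generation 2

class Replica:
    def __init__(self):
        self.records = {}  # key -> (Version, value)

    def apply(self, key, version, value):
        current = self.records.get(key)
        if current is None:
            ok = version.update == 1  # first write of a fresh record
        else:
            cur, _ = current
            ok = version in (Version(cur.generation, cur.update + 1),  # next update
                             Version(cur.generation + 1, 1))           # re-insert
        if not ok:
            raise ValueError(f"out-of-order update for {key}: have {current}, got {version}")
        self.records[key] = (version, value)

west, east = Replica(), Replica()
for replica in (west, east):                     # same order everywhere, even if
    replica.apply("alice", Version(1, 1), "SF")  # one replica lags behind the other
    replica.apply("alice", Version(1, 2), "NYC")
```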
  • API
    • Read-any - Possibly stale read
    • Read-critical(version) - Read a version at least as new as the supplied version
    • Read-latest - Surprisingly, they do not always have to go to the master for this read, only in some circumstances
    • Write
    • Test-and-set-write(version) - Only writes if the current version is the same as the supplied version
    • Results can be streamed back to the client as they arrive
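A hedged, single-process sketch of what those calls could look like; the class name, integer versions, and in-memory dict are my simplifications, not the actual client.

```python
# Sketch of the per-record API; with real replicas the different read calls
# would be routed to different copies, which is where the latency/staleness
# trade-off comes from.
class PnutsClientSketch:
    def __init__(self):
        self.store = {}  # key -> (version, value)

    def read_any(self, key):
        # May be served by any replica, so the result can be stale
        return self.store.get(key)

    def read_critical(self, key, required_version):
        # Must be served by a copy at least as new as required_version
        version, value = self.store[key]
        assert version >= required_version
        return version, value

    def read_latest(self, key):
        # May have to consult the record's master in the real system
        return self.store[key]

    def write(self, key, value):
        version = self.store.get(key, (0, None))[0] + 1
        self.store[key] = (version, value)
        return version

    def test_and_set_write(self, key, expected_version, value):
        # Read-modify-write without lost updates
        if self.store.get(key, (0, None))[0] != expected_version:
            return None  # lost the race; caller re-reads and retries
        return self.write(key, value)

client = PnutsClientSketch()
v1 = client.write("alice", {"home": "SF"})
assert client.test_and_set_write("alice", v1, {"home": "NYC"}) == 2
assert client.test_and_set_write("alice", v1, {"home": "LA"}) is None  # stale version
```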
  • Architecture
    • Divided into regions
      • Each region holds a complete copy of each table
      • The authors claim that multiple copies allow them to eschew the use of backups
        • I question this mentality. What about software bugs or user error?
    • Each table is horizontally partitioned into tablets (a la Bigtable)
      • In the range of hundreds of megabytes to a few gigabytes
    • Tablet controller owns the mapping from key to tablet
      • Interval mapping, as some tables are ordered
      • Pair of active/standby servers
      • They use two-phase commit (2PC) when splitting tablets
    • Router holds a cached copy
      • If a request is misdirected, the tablet server will return an error that will cause the router to retrieve a new copy
      • They mention that the router is not on the data path, so I wonder how it receives the misdirection error
        • Does the client request the location then query the tablet?
          • On an error, the tablet server could send an error to the router based on a field in the request?
        • Or does the router forward the request, with the tablet server then sending the data to the client directly?
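A sketch of the router side: a cached interval map over the tablet split points, with a refresh-and-retry path when a stale cache misdirects a request. The retry protocol here is just one of the possibilities speculated about above, not something the paper spells out.

```python
# Sketch: interval-mapped routing for an ordered table, with cache refresh on
# a misdirected request.
import bisect

class TabletServer:
    def __init__(self, low, high):
        self.low, self.high = low, high  # serves keys in [low, high)

    def handle(self, key, request):
        if not (self.low <= key < self.high):
            return "WRONG_TABLET"  # signal that the caller's mapping is stale
        return f"{request} served by tablet [{self.low!r}, {self.high!r})"

class TabletController:
    """Owns the authoritative mapping (an active/standby pair in the paper)."""
    def __init__(self, splits):
        highs = splits[1:] + ["\uffff"]
        self.servers = [TabletServer(lo, hi) for lo, hi in zip(splits, highs)]

    def current_mapping(self):
        return [s.low for s in self.servers], list(self.servers)

class Router:
    def __init__(self, controller):
        self.controller = controller
        self.boundaries, self.tablets = controller.current_mapping()

    def lookup(self, key):
        # boundaries are the sorted lower bounds of each tablet's key range
        return self.tablets[bisect.bisect_right(self.boundaries, key) - 1]

    def route(self, key, request):
        response = self.lookup(key).handle(key, request)
        if response == "WRONG_TABLET":
            # Stale cache: fetch a fresh mapping from the controller, retry once
            self.boundaries, self.tablets = self.controller.current_mapping()
            response = self.lookup(key).handle(key, request)
        return response

controller = TabletController(["", "g", "m", "t"])
router = Router(controller)
# Simulate a stale cache that predates the split at "g"
router.boundaries, router.tablets = ["", "m"], [TabletServer("", "g"), TabletServer("m", "\uffff")]
print(router.route("hello", "get"))  # misdirected once, then retried with a fresh map
```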
  • Replication
    • “Updates are considered ‘committed’ when they have been published to YMB”
      • YMB is their globally distributed message broker
    • YMB can handle single broker failures by logging messages to multiple servers
      • What about failure of a broker during a maintenance event?
    • Initially two copies are logged and then more as the data propagates asynchronously
    • Messages delivered to a particular YMB cluster will be delivered in order
      • This differs from Kafka, which only promises ordering within a partition
    • To have per-record timeline consistency, all mutations for a given record must go to the same YMB cluster
      • Updates are forwarded to the master of the record to be committed
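A simplified sketch of that write path: the update is forwarded to the record's master region, "committed" by publishing it to the record's YMB cluster (a plain in-process list here), and every region applies messages in publish order, which is what produces the per-record timeline.

```python
# Sketch: commit-by-publish, with replicas applying updates in publish order.
class BrokerCluster:
    """Stand-in for a YMB cluster; delivery order = publish order."""
    def __init__(self):
        self.log = []
        self.subscribers = []

    def publish(self, message):
        self.log.append(message)      # the update is now "committed"
        for sub in self.subscribers:  # asynchronous in reality, inline here
            sub.apply(message)

class Region:
    def __init__(self, name):
        self.name = name
        self.records = {}

    def apply(self, message):
        key, version, value = message
        self.records[key] = (version, value)

def write(record_master_broker, key, version, value):
    # All mutations for a given record go through the same broker cluster,
    # so every region sees that record's updates in the same order.
    record_master_broker.publish((key, version, value))

ymb_west = BrokerCluster()
ymb_west.subscribers += [Region("west"), Region("east")]
write(ymb_west, "alice", 1, {"home": "SF"})
write(ymb_west, "alice", 2, {"home": "NYC"})
```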
  • Recovery
    1. Tablet controller requests a copy from a source tablet
    2. A checkpoint message is published to YMB with the goal of having “any in-flight updates at the time the copy is initiated are applied to the source tablet”
    3. Source tablet is copied to destination region
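A rough sketch of those three steps with in-memory stand-ins; the checkpoint marker forces the source region to apply any in-flight updates before its tablet is copied, which is my reading of step 2.

```python
# Sketch: tablet recovery via checkpoint-then-copy.
import copy

class Broker:
    def __init__(self):
        self.subscribers = []

    def publish(self, message):
        for sub in self.subscribers:
            sub.deliver(message)

class Region:
    def __init__(self):
        self.tablets = {}   # tablet_id -> {key: value}
        self.pending = []   # delivered but not yet applied messages

    def deliver(self, message):
        self.pending.append(message)

    def drain_until(self, marker):
        # Apply queued updates until the checkpoint marker is reached
        while self.pending:
            message = self.pending.pop(0)
            if message[0] == "checkpoint" and message[2] is marker:
                return
            _, tablet_id, key, value = message
            self.tablets.setdefault(tablet_id, {})[key] = value

def recover_tablet(source, dest, tablet_id, broker):
    # 1. the tablet controller asks the source region for a copy
    # 2. a checkpoint ensures in-flight updates are applied to the source first
    marker = object()
    broker.publish(("checkpoint", tablet_id, marker))
    source.drain_until(marker)
    # 3. ship the source tablet's contents to the destination region
    dest.tablets[tablet_id] = copy.deepcopy(source.tablets.get(tablet_id, {}))

broker, west, east = Broker(), Region(), Region()
broker.subscribers.append(west)
broker.publish(("update", "t1", "alice", "SF"))  # still "in flight" (unapplied) at west
recover_tablet(west, east, "t1", broker)
print(east.tablets)  # {'t1': {'alice': 'SF'}}
```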
  • Future Ideas
    • They’d like to have indexes and materialized views
      • Their plan is to asynchronously update these by monitoring the change stream
      • A clever idea if their customers are OK with delayed information
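A quick sketch of that asynchronous-index idea: a consumer tails the per-record change stream and maintains a secondary index that can lag the base table, which is exactly the delayed-information trade-off noted above.

```python
# Sketch: a secondary index maintained off the update stream, not on the write path.
from collections import defaultdict

class AsyncSecondaryIndex:
    def __init__(self, field):
        self.field = field
        self.index = defaultdict(set)  # field value -> set of primary keys
        self.last_seen = {}            # pk -> last indexed field value

    def on_update(self, pk, record):
        # Called for each message on the change stream; the index trails the table
        old = self.last_seen.get(pk)
        new = record.get(self.field)
        if old is not None and old != new:
            self.index[old].discard(pk)
        if new is not None:
            self.index[new].add(pk)
        self.last_seen[pk] = new

index = AsyncSecondaryIndex("home")
index.on_update("alice", {"home": "SF"})
index.on_update("alice", {"home": "NYC"})
print(index.index["NYC"])  # {'alice'}
```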