Finding a needle in Haystack: Facebook’s photo storage - Facebook, 2010

  • [Paper] [Mirror]
  • Facebook Engineering Blog Post

  • Facebook found that their CDNs were great for the most popular photos, but the long tail of requests for less popular photos was still huge. They couldn’t just cache everything.
  • In their existing NAS/NFS based system, “disk IOs for metadata was the limiting factor for [their] read throughput.”
    • They optimized to get down to about 3 disk ops to read an image
    • Haystack can generally get it down to 1
  • Architecture
    • Store: Persistent storage of photos
      • The storage on a single machine is segmented into physical volumes (~100 with each holding about 100 GB)
        • Each physical volume is basically a large file
          • It’s append only
          • Each needle (the photo) also contains a checksum, the key, and a cookie (see the layout sketch below)
            • Cookie is used to verify requests come from a Directory generated URL to avoid brute forcing
            • Though a user could still pass the URL to someone not permitted to see the photo
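
A minimal sketch of appending a needle to a volume file, assuming a simplified record layout of (magic, cookie, key, flags, data size, data, checksum). The real needle also carries an alternate key and padding, and the field widths and magic value here are illustrative rather than the paper’s exact format:

```python
import os
import struct
import zlib

HEADER_FMT = ">IQQBI"      # magic, cookie, key, flags, data size (illustrative widths)
HEADER_MAGIC = 0x4E45444C  # arbitrary marker for this sketch
FLAG_DELETED = 0x01

def append_needle(volume_path, key, cookie, data):
    """Append one needle to the append-only physical volume file and return
    its byte offset, so it can be recorded in the in-memory index."""
    header = struct.pack(HEADER_FMT, HEADER_MAGIC, cookie, key, 0, len(data))
    checksum = struct.pack(">I", zlib.crc32(data) & 0xFFFFFFFF)
    with open(volume_path, "ab") as f:
        offset = f.seek(0, os.SEEK_END)   # needle starts here (single writer assumed)
        f.write(header + data + checksum)
        f.flush()
        os.fsync(f.fileno())              # make the append durable before acking
    return offset
```
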
      • An in-memory index gives the server the needle’s offset, so it can read the image with approximately one disk operation
        • The index could be built from the file, but a check-pointed version is written to disk asynchronously (then only part of the file needs to be read)
      • Deleted photos are marked in the needle (in the file)
        • A deleted photo can still cause a read of the needle, but the server checks the needle’s deleted flag and updates its in-memory index (see the read-path sketch below)
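
A rough sketch of that read path, reusing the toy needle layout from the append sketch (constants repeated so this stands alone). The in-memory index here is a plain dict of photo key → (offset, size), with offset 0 standing in for “known deleted”; all names are illustrative:

```python
import struct

HEADER_FMT = ">IQQBI"                    # same toy layout as the append sketch
HEADER_SIZE = struct.calcsize(HEADER_FMT)
FLAG_DELETED = 0x01

class StoreVolume:
    """Toy read path: one seek + one read per photo, guided by the in-memory index."""

    def __init__(self, volume_path, index):
        self.path = volume_path
        self.index = index               # photo key -> (offset, size); offset 0 == deleted

    def read_photo(self, key, cookie):
        entry = self.index.get(key)
        if entry is None or entry[0] == 0:
            return None                  # unknown, or already known to be deleted
        offset, size = entry
        with open(self.path, "rb") as f:
            f.seek(offset)
            raw = f.read(HEADER_SIZE + size + 4)           # header + data + checksum
        _magic, n_cookie, _key, flags, n_size = struct.unpack(
            HEADER_FMT, raw[:HEADER_SIZE])
        if n_cookie != cookie:
            return None                  # cookie mismatch: reject guessed / forged URLs
        if flags & FLAG_DELETED:
            self.index[key] = (0, n_size)  # discovered a delete; fix the in-memory index
            return None
        return raw[HEADER_SIZE:HEADER_SIZE + n_size]
```
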
      • They use XFS - great for preallocating the physical volume file
        • 1 GB extents and 256 kilobyte RAID stripes to minimize the number of times a needle lookup crosses a boundary (thus requiring more disk ops)
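
The extent and stripe sizes are XFS / RAID configuration rather than application code, but the preallocation itself is just a syscall. A sketch of reserving the ~100 GB volume file up front (Linux/POSIX only) so the filesystem can lay it out in a few large contiguous extents; the size constant follows the paper’s numbers:

```python
import os

VOLUME_SIZE = 100 * 2**30   # ~100 GB per physical volume, per the paper

def preallocate_volume(path, size=VOLUME_SIZE):
    """Reserve the volume file's space up front so XFS can allocate it in a
    small number of large contiguous extents."""
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
    try:
        os.posix_fallocate(fd, 0, size)   # POSIX/Linux only
    finally:
        os.close(fd)
```
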
      • Logical volumes are made up of physical volumes on distinct machines
        • Each physical volume in the logical volume contains the same photos
        • Volumes are marked read-only at the granularity of a machine
      • While the paper doesn’t state it, it seems to me that these machines could be heterogeneous with different numbers of physical volumes per node
    • Directory
      • Mapping of logical -> physical volumes
      • Mapping of which logical volume each photo is on
      • Constructs a URL which allows the system to determine how to get the image without further interaction with the Directory (see the sketch after this list)
      • When a write comes in, the web server asks the Directory for a write-enabled logical volume. The web server decides on a photo ID and uploads the photo to each physical volume in that logical volume
      • Directory load balances
        • writes across logical volumes
        • reads across physical volumes
      • Data stored in a replicated database (presumably MySQL) with memcached
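
A toy sketch of the Directory’s two mappings, write-volume assignment, and the read URL. The URL follows the paper’s http://⟨CDN⟩/⟨Cache⟩/⟨Machine id⟩/⟨Logical volume, Photo⟩ shape, but the class, field names, random load balancing, and cookie query parameter are assumptions for illustration:

```python
import random

class Directory:
    def __init__(self):
        self.logical_to_physical = {}   # logical volume id -> list of Store machine hosts
        self.photo_to_logical = {}      # photo id -> logical volume id
        self.writable_volumes = set()   # logical volumes still accepting writes

    def assign_write_volume(self, photo_id):
        """Pick a write-enabled logical volume for an upload (balances writes);
        the web server then uploads to every physical volume in it."""
        volume = random.choice(sorted(self.writable_volumes))
        self.photo_to_logical[photo_id] = volume
        return volume, self.logical_to_physical[volume]

    def read_url(self, photo_id, cookie,
                 cdn="cdn.example.com", cache="cache.example.com"):
        """Build a URL shaped like
        http://<CDN>/<Cache>/<Machine id>/<Logical volume, Photo>
        so the read path never has to come back to the Directory."""
        volume = self.photo_to_logical[photo_id]
        machine = random.choice(self.logical_to_physical[volume])  # balances reads
        return (f"http://{cdn}/{cache}/{machine}"
                f"/{volume}/{photo_id}?cookie={cookie:x}")
```

Per the paper, the Directory also decides per request whether a photo should go through the CDN at all or be fetched straight from the internal Cache, which is how it shifts load between the two.
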
    • Cache
      • Internal CDN
      • Shelters them when CDN nodes fail
      • Only caches a photo if (see the admission check sketched after this list):
        1. the request comes from a user, not the CDN, and
        2. the photo is read from a write-enabled volume
      • They only need to shelter write-enabled Stores because read-only Stores handle read traffic fine; mixing reads and writes on the same Store is slow
      • Most photos are read soon after uploading (i.e. while they’re still on a write-enabled Store)
        • They want to implement active push to the Cache for new uploads
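
The admission rule boils down to a two-condition check; a tiny sketch (function and parameter names are made up):

```python
def should_cache(request_from_cdn: bool, volume_is_writable: bool) -> bool:
    """Haystack Cache admission: only cache photos requested directly by users
    from still write-enabled volumes. CDN-originated misses would be cached by
    the CDN itself, and read-only volumes already serve reads comfortably, so
    the Cache spends its space sheltering the write-enabled Stores."""
    return (not request_from_cdn) and volume_is_writable
```
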
  • Recovery
    • Have a system that regularly tests the volumes (see the sketch after this section)
      • Marks them as read-only if there are repeated issues
      • Then a human manually addresses the problem
    • Bulk syncs occur when they need to reset the data on a machine
      • Takes a very long time, but only happens a few times a month
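
A sketch of that kind of periodic health check (the paper’s background checker is called pitchfork, if memory serves). The Store probe methods and the failure threshold below are hypothetical stand-ins:

```python
FAILURE_THRESHOLD = 3   # consecutive failed probes before intervening (made up)

def check_stores(stores, failures):
    """stores: objects exposing hypothetical probe / mark-read-only methods.
    failures: dict tracking consecutive failures per Store host."""
    for store in stores:
        try:
            store.check_connection()     # hypothetical: can we reach it?
            store.read_test_needles()    # hypothetical: can it serve reads?
            failures[store.host] = 0
        except IOError:
            failures[store.host] = failures.get(store.host, 0) + 1
        if failures[store.host] >= FAILURE_THRESHOLD:
            store.mark_all_volumes_read_only()   # stop writes until a human looks
            print(f"needs manual repair: {store.host}")
```
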
  • Compaction
    • Online operation to reclaim space from deleted / modified photos
      • “Over the course of a year, about 25% of photos get deleted”
      • Photos are never overwritten in place; the most recently appended needle for a key is treated as the current version
    • Store machine copies the needles that are not deleted or duplicates to a new file
    • Deletes go to both files
    • Once the copy is complete, the Store briefly blocks modifications and atomically swaps in the new file and in-memory index (see the sketch below)
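
A compaction sketch using the same toy needle layout as the earlier snippets: walk the old volume, keep only needles that are live and are the current version according to the in-memory index, then swap files and hand back the rebuilt index. Applying deletes that arrive mid-copy to both files is glossed over here:

```python
import os
import struct

HEADER_FMT = ">IQQBI"                  # same toy layout as the append sketch
HEADER_SIZE = struct.calcsize(HEADER_FMT)
FLAG_DELETED = 0x01

def compact_volume(old_path, new_path, index):
    """Copy live, current needles into a fresh volume file.
    index maps photo key -> (offset, size) of the current version."""
    new_index = {}
    with open(old_path, "rb") as src, open(new_path, "wb") as dst:
        pos = 0
        while True:
            header = src.read(HEADER_SIZE)
            if len(header) < HEADER_SIZE:
                break                                   # end of volume file
            _magic, _cookie, key, flags, size = struct.unpack(HEADER_FMT, header)
            body = src.read(size + 4)                   # data + checksum
            current = index.get(key)
            # Keep the needle only if it isn't deleted and it is the copy the
            # index points at (older duplicates of the same key are dropped).
            if not (flags & FLAG_DELETED) and current and current[0] == pos:
                new_index[key] = (dst.tell(), size)
                dst.write(header + body)
            pos += HEADER_SIZE + size + 4
    # In the real system the Store blocks modifications briefly at this point,
    # then atomically swaps in the compacted file and the rebuilt index.
    os.replace(new_path, old_path)
    return new_index
```
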