Finding a needle in Haystack: Facebook’s photo storage - Facebook, 2010

  • [Paper] [Mirror]
  • Facebook Engineering Blog Post

  • Facebook found that their CDNs were great for the most popular photos, but the long tail of requests for less popular photos was still huge. They couldn’t just cache everything.
  • In their existing NAS/NFS based system, “disk IOs for metadata was the limiting factor for [their] read throughput.”
    • They optimized to get down to about 3 disk ops to read an image
    • Haystack can generally get it down to 1
  • Architecture
    • Store: Persistent storage of photos
      • The storage on a single machine is segmented into physical volumes (~100 with each holding about 100 GB)
        • Each physical volume is basically a large file
          • It’s append only
          • Each needle (the photo) also contains a checksum, the key, and a cookie (see the layout sketch below)
            • Cookie is used to verify requests come from a Directory generated URL to avoid brute forcing
            • Though a user could still pass the URL to someone not permitted to see the photo
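
A minimal sketch of appending a needle to a volume file, assuming a simplified record layout of (magic, cookie, key, flags, data size, data, checksum). The real needle also carries an alternate key and padding, and the field widths and magic value here are illustrative rather than the paper’s exact format:

```python
import os
import struct
import zlib

HEADER_FMT = ">IQQBI"      # magic, cookie, key, flags, data size (illustrative widths)
HEADER_MAGIC = 0x4E45444C  # arbitrary marker for this sketch
FLAG_DELETED = 0x01

def append_needle(volume_path, key, cookie, data):
    """Append one needle to the append-only physical volume file and return
    its byte offset, so it can be recorded in the in-memory index."""
    header = struct.pack(HEADER_FMT, HEADER_MAGIC, cookie, key, 0, len(data))
    checksum = struct.pack(">I", zlib.crc32(data) & 0xFFFFFFFF)
    with open(volume_path, "ab") as f:
        offset = f.seek(0, os.SEEK_END)   # needle starts here (single writer assumed)
        f.write(header + data + checksum)
        f.flush()
        os.fsync(f.fileno())              # make the append durable before acking
    return offset
```
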
      • An in-memory index gives the server the needle’s offset, so it can read the image with approximately one disk operation
        • The index could be built from the file, but a check-pointed version is written to disk asynchronously (then only part of the file needs to be read)
      • Deleted photos are marked in the needle (in the file)
        • A deleted photo can still cause a read of the needle, but the server checks the needle’s deleted flag and updates its in-memory index (see the read-path sketch below)
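
A rough sketch of that read path, reusing the toy needle layout from the append sketch (constants repeated so this stands alone). The in-memory index here is a plain dict of photo key → (offset, size), with offset 0 standing in for “known deleted”; all names are illustrative:

```python
import struct

HEADER_FMT = ">IQQBI"                    # same toy layout as the append sketch
HEADER_SIZE = struct.calcsize(HEADER_FMT)
FLAG_DELETED = 0x01

class StoreVolume:
    """Toy read path: one seek + one read per photo, guided by the in-memory index."""

    def __init__(self, volume_path, index):
        self.path = volume_path
        self.index = index               # photo key -> (offset, size); offset 0 == deleted

    def read_photo(self, key, cookie):
        entry = self.index.get(key)
        if entry is None or entry[0] == 0:
            return None                  # unknown, or already known to be deleted
        offset, size = entry
        with open(self.path, "rb") as f:
            f.seek(offset)
            raw = f.read(HEADER_SIZE + size + 4)           # header + data + checksum
        _magic, n_cookie, _key, flags, n_size = struct.unpack(
            HEADER_FMT, raw[:HEADER_SIZE])
        if n_cookie != cookie:
            return None                  # cookie mismatch: reject guessed / forged URLs
        if flags & FLAG_DELETED:
            self.index[key] = (0, n_size)  # discovered a delete; fix the in-memory index
            return None
        return raw[HEADER_SIZE:HEADER_SIZE + n_size]
```
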
      • They use XFS - great for preallocating the physical volume file
        • 1 GB extents and 256 kilobyte RAID stripes to minimize the number of times a needle lookup crosses a boundary (thus requiring more disk ops)
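
The extent and stripe sizes are XFS / RAID configuration rather than application code, but the preallocation itself is just a syscall. A sketch of reserving the ~100 GB volume file up front (Linux/POSIX only) so the filesystem can lay it out in a few large contiguous extents; the size constant follows the paper’s numbers:

```python
import os

VOLUME_SIZE = 100 * 2**30   # ~100 GB per physical volume, per the paper

def preallocate_volume(path, size=VOLUME_SIZE):
    """Reserve the volume file's space up front so XFS can allocate it in a
    small number of large contiguous extents."""
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
    try:
        os.posix_fallocate(fd, 0, size)   # POSIX/Linux only
    finally:
        os.close(fd)
```
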
      • Logical volumes are made up of physical volumes on distinct machines
        • Each physical volume in the logical volume contains the same photos
        • Volumes are marked read-only at the granularity of a machine
      • While the paper doesn’t state it, it seems to me that these machines could be heterogeneous with different numbers of physical volumes per node
    • Directory
      • Mapping of logical -> physical volumes
      • Mapping of which logical volume each photo is on
      • Constructs a URL which allows the system to determine how to get the image without further interaction with the Directory (see the sketch after this list)
      • When a write comes in, the web server asks the Directory for a write-enabled logical volume. The web server decides on a photo ID and uploads the photo to each physical volume in that logical volume
      • Directory load balances
        • writes across logical volumes
        • reads across physical volumes
      • Data stored in a replicated database (presumably MySQL) with memcached
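
A toy sketch of the Directory’s two mappings, write-volume assignment, and the read URL. The URL follows the paper’s http://⟨CDN⟩/⟨Cache⟩/⟨Machine id⟩/⟨Logical volume, Photo⟩ shape, but the class, field names, random load balancing, and cookie query parameter are assumptions for illustration:

```python
import random

class Directory:
    def __init__(self):
        self.logical_to_physical = {}   # logical volume id -> list of Store machine hosts
        self.photo_to_logical = {}      # photo id -> logical volume id
        self.writable_volumes = set()   # logical volumes still accepting writes

    def assign_write_volume(self, photo_id):
        """Pick a write-enabled logical volume for an upload (balances writes);
        the web server then uploads to every physical volume in it."""
        volume = random.choice(sorted(self.writable_volumes))
        self.photo_to_logical[photo_id] = volume
        return volume, self.logical_to_physical[volume]

    def read_url(self, photo_id, cookie,
                 cdn="cdn.example.com", cache="cache.example.com"):
        """Build a URL shaped like
        http://<CDN>/<Cache>/<Machine id>/<Logical volume, Photo>
        so the read path never has to come back to the Directory."""
        volume = self.photo_to_logical[photo_id]
        machine = random.choice(self.logical_to_physical[volume])  # balances reads
        return (f"http://{cdn}/{cache}/{machine}"
                f"/{volume}/{photo_id}?cookie={cookie:x}")
```

Per the paper, the Directory also decides per request whether a photo should go through the CDN at all or be fetched straight from the internal Cache, which is how it shifts load between the two.
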
    • Cache
      • Internal CDN
      • Shelters them when CDN nodes fail
      • Only caches a photo if (see the admission check sketched after this list):
        1. the request comes from a user, not the CDN, and
        2. the photo is read from a write-enabled volume
      • They only need to shelter write-enabled Stores because read-only Stores handle read traffic fine; mixing reads and writes on the same Store is slow
      • Most photos are read soon after uploading (i.e. while they’re still on a write-enabled Store)
        • They want to implement active push to the Cache for new uploads
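
The admission rule boils down to a two-condition check; a tiny sketch (function and parameter names are made up):

```python
def should_cache(request_from_cdn: bool, volume_is_writable: bool) -> bool:
    """Haystack Cache admission: only cache photos requested directly by users
    from still write-enabled volumes. CDN-originated misses would be cached by
    the CDN itself, and read-only volumes already serve reads comfortably, so
    the Cache spends its space sheltering the write-enabled Stores."""
    return (not request_from_cdn) and volume_is_writable
```
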
  • Recovery
    • Have a system that regularly tests the volumes (see the sketch after this section)
      • Marks them as read-only if there are repeated issues
      • Then a human manually addresses the problem
    • Bulk syncs occur when they need to reset the data on a machine
      • Takes a very long time, but only happens a few times a month
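
A sketch of that kind of periodic health check (the paper’s background checker is called pitchfork, if memory serves). The Store probe methods and the failure threshold below are hypothetical stand-ins:

```python
FAILURE_THRESHOLD = 3   # consecutive failed probes before intervening (made up)

def check_stores(stores, failures):
    """stores: objects exposing hypothetical probe / mark-read-only methods.
    failures: dict tracking consecutive failures per Store host."""
    for store in stores:
        try:
            store.check_connection()     # hypothetical: can we reach it?
            store.read_test_needles()    # hypothetical: can it serve reads?
            failures[store.host] = 0
        except IOError:
            failures[store.host] = failures.get(store.host, 0) + 1
        if failures[store.host] >= FAILURE_THRESHOLD:
            store.mark_all_volumes_read_only()   # stop writes until a human looks
            print(f"needs manual repair: {store.host}")
```
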
  • Compaction
    • Online operation to reclaim space from deleted / modified photos
      • “Over the course of a year, about 25% of photos get deleted”
      • Photos are never overwritten in place; the most recently appended needle for a key is treated as the current version
    • Store machine copies the needles that are not deleted or duplicates to a new file
    • Deletes go to both files
    • Once the copy is complete, the Store briefly blocks modifications and atomically swaps in the new file and in-memory index (see the sketch below)
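
A compaction sketch using the same toy needle layout as the earlier snippets: walk the old volume, keep only needles that are live and are the current version according to the in-memory index, then swap files and hand back the rebuilt index. Applying deletes that arrive mid-copy to both files is glossed over here:

```python
import os
import struct

HEADER_FMT = ">IQQBI"                  # same toy layout as the append sketch
HEADER_SIZE = struct.calcsize(HEADER_FMT)
FLAG_DELETED = 0x01

def compact_volume(old_path, new_path, index):
    """Copy live, current needles into a fresh volume file.
    index maps photo key -> (offset, size) of the current version."""
    new_index = {}
    with open(old_path, "rb") as src, open(new_path, "wb") as dst:
        pos = 0
        while True:
            header = src.read(HEADER_SIZE)
            if len(header) < HEADER_SIZE:
                break                                   # end of volume file
            _magic, _cookie, key, flags, size = struct.unpack(HEADER_FMT, header)
            body = src.read(size + 4)                   # data + checksum
            current = index.get(key)
            # Keep the needle only if it isn't deleted and it is the copy the
            # index points at (older duplicates of the same key are dropped).
            if not (flags & FLAG_DELETED) and current and current[0] == pos:
                new_index[key] = (dst.tell(), size)
                dst.write(header + body)
            pos += HEADER_SIZE + size + 4
    # In the real system the Store blocks modifications briefly at this point,
    # then atomically swaps in the compacted file and the rebuilt index.
    os.replace(new_path, old_path)
    return new_index
```
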