Facebook found that their CDNs were great at serving the really popular
photos, but the long tail of accesses was huge. They couldn’t just cache
everything.
In their existing NAS/NFS based system, “disk IOs for metadata was the
limiting factor for [their] read throughput.”
They optimized to get down to about 3 disk ops to read an image
Haystack can generally get it down to 1
Architecture
Store: Persistent storage of photos
The storage on a single machine is segmented into physical volumes (~100
per machine, each holding about 100 GB)
Each physical volume is basically a large file
It’s append-only
Each needle (one stored photo) also contains a checksum, the key, and
a cookie (a possible record layout is sketched below)
The cookie is used to verify that a request came from a Directory-generated
URL, preventing brute-force guessing of photo URLs
Though a user could still pass the URL to someone not permitted to
see the photo
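As a sketch only: a needle record might be laid out like this, with key, cookie, flags, and size in a fixed header followed by the photo bytes and a checksum. The exact fields and sizes here are my assumptions, not the paper’s on-disk format.

```python
import struct
import zlib

# Illustrative layout: key (8 bytes), cookie (8 bytes), flags (1 byte,
# e.g. a deleted bit), data size (4 bytes), then the photo bytes, then a
# CRC32 checksum. Not the paper's exact format.
NEEDLE_HEADER = struct.Struct("<QQBI")
FLAG_DELETED = 0x01

def encode_needle(key: int, cookie: int, data: bytes, flags: int = 0) -> bytes:
    """Serialize one photo as a needle, ready to append to the volume file."""
    header = NEEDLE_HEADER.pack(key, cookie, flags, len(data))
    checksum = struct.pack("<I", zlib.crc32(data))  # detects corruption on read
    return header + data + checksum

def is_deleted(needle: bytes) -> bool:
    """Check the (illustrative) deleted bit in a needle's flags."""
    _, _, flags, _ = NEEDLE_HEADER.unpack_from(needle)
    return bool(flags & FLAG_DELETED)
```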
An in-memory index allows the server to look up an image with
approximately one disk operation (the index holds the needle’s offset)
The index could be rebuilt by scanning the whole file, but a checkpointed
version is written to disk asynchronously, so on restart only the part of
the file after the checkpoint needs to be read
Deleted photos are marked in the needle (in the file)
A deleted photo can still cause a read of the needle, but the server
checks the needle for the deleted flag and updates its in-memory index
(see the read-path sketch below)
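Putting the index and the deleted-flag check together, the read path might look like the sketch below. The index structure and file handling are assumptions; is_deleted comes from the needle sketch above.

```python
import os

class StoreReader:
    """Sketch of a Store's read path: the in-memory index maps photo key ->
    (offset, size) in the volume file, so one pread fetches the whole
    needle."""

    def __init__(self, volume_path: str):
        self.fd = os.open(volume_path, os.O_RDONLY)
        # Built from the checkpoint plus a scan of the file's tail on startup.
        self.index: dict[int, tuple[int, int]] = {}

    def read_photo(self, key: int) -> bytes | None:
        entry = self.index.get(key)
        if entry is None:
            return None                           # not on this volume
        offset, size = entry
        needle = os.pread(self.fd, size, offset)  # the single disk operation
        if is_deleted(needle):
            del self.index[key]                   # lazily repair the index
            return None
        return needle
```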
They use XFS, which is great for preallocating the physical volume file
1 GB extents and 256 KB RAID stripes minimize the number of times a needle
read crosses an extent or stripe boundary and thus requires extra disk ops
(preallocation sketched below)
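A sketch of the preallocation idea, assuming posix_fallocate and a made-up path; the paper only says XFS preallocates extents well:

```python
import os

VOLUME_SIZE = 100 * 1024**3  # ~100 GB per physical volume, per the paper

# Preallocate the volume file up front so the filesystem can lay it out in
# large contiguous extents, keeping needle reads within one extent.
fd = os.open("/hay/volume_0", os.O_CREAT | os.O_WRONLY, 0o644)
os.posix_fallocate(fd, 0, VOLUME_SIZE)
os.close(fd)
```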
Logical volumes are made up of physical volumes on distinct machines
Each physical volume in the logical volume contains the same photos
Volumes are marked read-only at the granularity of a machine
While the paper doesn’t state it, it seems to me that these machines could
be heterogeneous, with different numbers of physical volumes per node
Directory
Mapping of logical -> physical volumes
Mapping of which logical volume each photo is on
Constructs a URL that lets each tier determine how to fetch the image
without further interaction with the Directory (sketched below)
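The paper gives the URL shape http://⟨CDN⟩/⟨Cache⟩/⟨Machine id⟩/⟨Logical volume, Photo⟩. A sketch of building it; passing the cookie as a query parameter is my assumption:

```python
def build_photo_url(cdn: str, cache: str, machine_id: int,
                    logical_volume: int, photo_id: int, cookie: int) -> str:
    """Each path segment tells one tier where to forward the request, so no
    further Directory lookups are needed along the way."""
    # Cookie as a query parameter is an assumption; the paper just says the
    # cookie is embedded in the URL.
    return (f"http://{cdn}/{cache}/{machine_id}"
            f"/{logical_volume}/{photo_id}?cookie={cookie:x}")
```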
When a write comes in, the web server asks the Directory for a write-enabled
logical volume. The server picks a photo ID and uploads the photo to each
physical volume in the logical volume (see the sketch after this list)
Directory load balances
writes across logical volumes
reads across physical volumes
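A sketch of that write flow; directory and the store client are hypothetical interfaces, not the paper’s actual API:

```python
def upload_photo(directory, data: bytes) -> int:
    # The Directory load-balances writes by handing back a write-enabled
    # logical volume (hypothetical interface).
    logical_volume = directory.pick_write_enabled_logical_volume()
    photo_id = directory.new_photo_id()
    # The needle is appended to every physical volume backing the logical
    # volume, each on a distinct machine.
    for store in logical_volume.physical_volumes:
        store.append_needle(photo_id, data)
    return photo_id
```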
Directory data is stored in a replicated database (presumably MySQL)
fronted by memcached
Cache
Internal CDN
Shelters the Store machines when CDN nodes fail
Only caches a photo if (rule sketched below):
the request comes from a user, not the CDN, and
the photo is read from a write-enabled volume
They only need to shield write-enabled Stores because read-only Stores can
handle reads fine. Mixing reads and writes on the same Store is slow.
Most photos are read soon after uploading (i.e. while still on a write-enabled Store)
They want to implement actively pushing newly uploaded photos to the Cache
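The admission rule itself is tiny; a sketch under the assumptions above:

```python
def should_cache(request_from_cdn: bool, volume_write_enabled: bool) -> bool:
    # Cache only for direct user traffic hitting a write-enabled Store:
    # read-only Stores serve reads fine on their own, and the CDN already
    # absorbs the popular photos.
    return (not request_from_cdn) and volume_write_enabled
```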
Recovery
Have a system that regularly tests the volumes
Marks them as read-only if there are repeated issues
Then a human manually addresses the problem
Bulk syncs occur when the data on a machine needs to be reset
They take a very long time, but only happen a few times a month
Compaction
Online operation to reclaim space from deleted / modified photos
“Over the course of a year, about 25% of photos get deleted”
Photos are never overwritten in place; the most recently appended needle
is considered the current version
The Store machine copies the needles that are not deleted or superseded
duplicates into a new file (sketched below)
Deletes during compaction go to both files
Once the operation is complete, the machine blocks modifications and
atomically swaps in the new file and in-memory index
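A sketch of the compaction loop, reusing NEEDLE_HEADER and is_deleted from the needle sketch earlier; concurrent deletes (which must be applied to both files) are omitted for brevity:

```python
import os

def needle_key(raw: bytes) -> int:
    key, _, _, _ = NEEDLE_HEADER.unpack_from(raw)
    return key

def iter_needles(f):
    """Yield (offset, raw_record) for each needle, using the illustrative
    layout from the needle sketch above."""
    offset = 0
    while True:
        header = f.read(NEEDLE_HEADER.size)
        if len(header) < NEEDLE_HEADER.size:
            return
        _, _, _, size = NEEDLE_HEADER.unpack(header)
        body = f.read(size + 4)  # photo bytes + CRC32
        yield offset, header + body
        offset += NEEDLE_HEADER.size + size + 4

def compact(volume_path: str, live_index: dict[int, tuple[int, int]]) -> None:
    new_path = volume_path + ".compact"
    new_index: dict[int, tuple[int, int]] = {}
    with open(volume_path, "rb") as old, open(new_path, "wb") as new:
        for offset, raw in iter_needles(old):
            if is_deleted(raw):
                continue                            # drop deleted photos
            key = needle_key(raw)
            if live_index.get(key, (None, None))[0] != offset:
                continue                            # stale duplicate, skip it
            new_index[key] = (new.tell(), len(raw))
            new.write(raw)
    # Block modifications, then atomically swap in the new file and index.
    os.replace(new_path, volume_path)
    live_index.clear()
    live_index.update(new_index)
```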