XORing Elephants: Novel Erasure Codes for Big Data - USC & Facebook, 2013

While this paper’s title implies that these are ‘novel’ erasure codes, I haven’t quite figured out how they differ from the LRC codes Microsoft researchers discussed in Erasure Coding in Windows Azure Storage (2012). There are a few papers from USC and Facebook on LRC codes. I think I’m missing something.

Regardless, I found that this paper explains their use of LRC codes very well and motivates the problem more thoroughly than some of the other literature.

  • Motivating Fast Local Repair
    • As we know from the Azure paper on LRCs, while RS codes are great for storage efficiency, their reconstruction cost is high
      • If you use an RS(10, 4) code you need 10 blocks to reconstruct any block
      • This means 10 network I/Os
        • In this paper, Facebook discusses how this is a burden on their live cluster
          • They use HDFS-RAID for some of their data
          • They use RS(10, 4)
        • Reconstruction traffic is not insubstantial
        • Typical nodes at Facebook hold 5 TB; when a node goes down, that’s a lot of traffic to reconstruct the missing blocks (a rough back-of-envelope sketch follows at the end of this section)

        “The repair traffic with the current configuration is estimated around 10-20% of the total average of 2 PB/day cluster network traffic.”

    • Transient errors account for 90% of their failure events
      • A read of the block is degraded: the missing data has to be reconstructed on the fly
      • In this case, the reconstructed block does not need to be written back to disk
    • Node decommissioning poses another problem
      • Takes a very long time
      • Having a way to repair data quickly would reduce the network traffic of copying blocks off the node
        • i.e. other nodes could just calculate some of the blocks (their aggregate bandwidth is greater than the network I/O of the server being decommissioned)
    • Often the network is the bottleneck in MapReduce jobs
      • If we can use less network I/O for repairs, we can speed up production jobs
    • Their final point is pretty cool: geo-replication
      • For systems that span geographically disparate data centers, being able to repair blocks locally is a huge win
      • Inter-DC links may be slow, small (relatively) or expensive to use
      • An interesting idea the Azure researchers pointed out is that you could just grab the 10 closest blocks and repair from those to minimize inter-DC traffic
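
To make the repair-traffic point above concrete, here is a rough back-of-envelope sketch. The 256 MB block size is my own illustrative assumption, not a number from the paper; the 5 TB per node and the 10-block vs. 5-block read counts come from the notes above.

```python
# Rough back-of-envelope: repair traffic for one failed node.
BLOCK_MB = 256     # assumed HDFS block size (illustrative)
NODE_TB = 5        # typical Facebook node holds ~5 TB (per the paper)
K = 10             # RS(10, 4): classic repair reads 10 surviving blocks
LRC_GROUP = 5      # LRC local repair reads 5 blocks (4 data + local parity)

blocks_lost = NODE_TB * 1024 * 1024 // BLOCK_MB
rs_traffic_tb = blocks_lost * K * BLOCK_MB / (1024 * 1024)
lrc_traffic_tb = blocks_lost * LRC_GROUP * BLOCK_MB / (1024 * 1024)

print(f"blocks to rebuild:      {blocks_lost}")
print(f"RS(10, 4) repair reads: ~{rs_traffic_tb:.0f} TB")
print(f"LRC local repair reads: ~{lrc_traffic_tb:.0f} TB (roughly half)")
```
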
  • Theory
    • They define a metric, deemed Block Locality, describing the number of blocks required to reconstruct a given block:

      “An (k, n − k) code has block locality r, when each coded block is a function of at most r other coded blocks of the code.”

    • The idea is that if we can construct a code with low locality, we need less network I/O to reconstruct a block
    • They then go on to describe how MDS codes actually have the worst block locality but optimal storage efficiency
    • They propose a middle ground (the other extreme being replication): codes deemed Locally Repairable Codes (a quick comparison follows after this section)
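
To put numbers on that trade-off, here is a quick comparison using the parameters from this paper (the 16-block Xorbas figure comes from the next section; the rest follows from the definition above):

```latex
% Block locality r vs. storage overhead for the schemes discussed here.
% RS(10, 4) is MDS: reconstructing any block needs k = 10 others, so r = 10.
% The Xorbas LRC adds local parities, shrinking r to 5 for a bit more storage.
\begin{align*}
  \text{3-way replication:}       \quad & r = 1,  && \text{overhead} = 3\times \\
  \text{Xorbas LRC:}              \quad & r = 5,  && \text{overhead} = \tfrac{16}{10} = 1.6\times \\
  \text{RS}(10, 4)\ \text{(MDS):} \quad & r = 10, && \text{overhead} = \tfrac{14}{10} = 1.4\times
\end{align*}
```
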
  • LRC in Xorbas
    • Xorbas is their implementation of LRC on HDFS
    • They use RS(10, 4) for the overall parity blocks
    • They show how you can just use XOR for the local parity (LRC) blocks
    • One could add a local parity block for the data blocks D0-D4, another for D5-D9, and then a final one for the parity blocks P0-P3
    • However, they show that through careful selection of the coefficients used for the first two local parities, the third one can be generated from them and never needs to be stored (a toy sketch follows at the end of this section)
    • That ends up being an extra 2 blocks of overhead compared to RS(10, 4), for a total of 16 blocks per stripe
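
Here is a toy sketch of the local-parity idea under a pure-XOR simplification. The real construction uses GF(2^8) coefficients chosen so that the third local parity (over P0-P3) is implied by the first two and never stored; the block size, random data, and helper names here are all mine.

```python
import os

BLOCK = 1024  # toy block size in bytes; real HDFS blocks are far larger


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


# Ten toy data blocks D0..D9 standing in for one stripe.
data = [os.urandom(BLOCK) for _ in range(10)]

# One local parity per group of five data blocks.
s1 = xor_blocks(data[0:5])   # covers D0-D4
s2 = xor_blocks(data[5:10])  # covers D5-D9

# Local repair: lose D2, rebuild it from the 4 surviving group members + S1.
lost = 2
survivors = [data[i] for i in range(5) if i != lost] + [s1]
rebuilt = xor_blocks(survivors)

assert rebuilt == data[lost]
print(f"repaired D{lost} by reading {len(survivors)} blocks instead of 10")
```
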
  • HDFS-Xorbas Architecture
    • Based on HDFS-RAID
    • RAID-NODE
      • Creates and maintains parity blocks on HDFS
      • Creation is done as a MapReduce job
    • BlockFixer
      • Checks for bad or missing blocks
      • Uses MapReduce
    • They let HDFS-RAID create the RS(10, 4) parity blocks and then incrementally create the LRC blocks with XOR
    • By default, Hadoop is configured to place blocks randomly while avoiding putting blocks of the same stripe on the same node
    • Repair
      • First they try a light decoder that uses the local LRC (XOR) parity
      • If the local group has more than one missing block, they fall back on full RS(10, 4) decoding (sketched below)
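
A sketch of that repair decision, reusing xor_blocks from the earlier sketch; the function name and data layout are mine rather than Xorbas's actual interface, and the heavy RS path is left as a stub.

```python
def repair_block(data_blocks, local_parities, missing):
    """Toy Xorbas-style repair decision (illustrative, not the real API).

    data_blocks: list of 10 blocks (bytes), with None where a block is lost.
    local_parities: [S1, S2], covering indices 0-4 and 5-9 respectively.
    missing: index of the data block to rebuild.
    """
    group = range(0, 5) if missing < 5 else range(5, 10)
    parity = local_parities[0] if missing < 5 else local_parities[1]
    survivors = [data_blocks[i] for i in group if i != missing]

    if all(block is not None for block in survivors):
        # Light decoder: single erasure in the group, XOR it back together.
        return xor_blocks(survivors + [parity])

    # Heavy decoder: more than one erasure in the group. A real implementation
    # would read any 10 surviving blocks of the stripe and run full RS(10, 4)
    # decoding here.
    raise NotImplementedError("fall back to full Reed-Solomon decoding")
```

The light path touches only the five blocks of the local group; the fallback is exactly the classic 10-read RS repair that motivated the paper.
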
  • Future Direction: Spin discs down!
    • I thought this was a great idea