While this paper’s title implies that these are ‘novel’ erasure codes, I haven’t
quite figured out how they differ from the LRC codes Microsoft researchers
discussed in Erasure Coding in Windows Azure Storage
in 2012. There are a few papers from USC and Facebook on LRC codes. I think I’m
missing something.
Regardless, I found that this paper explains its use of LRC codes very well and
motivates the problem better than some of the other literature does.
- Motivating Fast Local Repair
- As we know from the Azure paper on LRCs, while RS codes are great for
storage, the reconstruction cost is high
- If you use an RS(10, 4) code, you need 10 other blocks to reconstruct any single block
- This means 10 network I/Os
- In this paper, Facebook discusses how this is a burden on their live
cluster
- They use HDFS-RAID for some
of their data
- They use RS(10, 4)
- Reconstruction traffic is not insubstantial
- Typical nodes at Facebook hold 5 TB; when a node goes down, that’s a lot of
traffic to reconstruct the missing blocks
“The repair traffic with the current configuration is estimated around
10-20% of the total average of 2 PB/day cluster network traffic.”
- Transient errors account for 90% of their failure events
- Reads of a lost block are degraded while the block is reconstructed on the fly
- In this case, the reconstructed block does not need to be written back to disk
- Node decommissioning poses another problem
- Takes a very long time
- Being able to repair data quickly would reduce the network traffic spent
copying blocks off the node
- i.e., other nodes could just recompute some of the blocks (their aggregate
bandwidth is greater than the network I/O of the to-be-decommissioned server)
- Often the network is the bottleneck in MapReduce jobs
- If we can use less network I/O for repairs, we can speed up production
jobs
- Their final point is pretty cool: geo-replication
- For systems that span geographically disparate data centers, being able to
repair blocks locally is a huge win
- Inter-DC links may be slow, (relatively) low-bandwidth, or expensive to use
- An interesting idea that the Azure researchers pointed out was that you
could actually just grab the closest 10 blocks and do repair to minimize
inter-DC traffic
- Theory
- LRC in Xorbas
- Xorbas is their implementation of LRC on HDFS
- They use RS(10, 4) for the overall parity blocks
- They show how you can just use XOR for the LRC
- Then one could add an LRC block for the data blocks D0-D4,
another for D5-D9, and then a final LRC block for the
parity blocks P0-P3
- However, they show that through careful selection of the coefficients used for
the first two LRC blocks, they can easily generate the third LRC block (so it
never needs to be stored)
- This ends up with only 2 extra blocks of overhead compared to RS(10, 4), for a
total of 16 blocks (a toy sketch of the idea follows below)
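Here’s a toy Python sketch to make the local-repair idea concrete. This is not the Xorbas scheme itself: Xorbas works over a finite field with carefully chosen coefficients so that the third local parity is implied by the first two, whereas this sketch just uses plain XOR and helper names of my own to show how a single lost block in a local group can be rebuilt from 5 reads instead of 10.

```python
# Toy sketch of XOR-style local parities. This is NOT the Xorbas code: the real
# scheme works over GF(2^8) with coefficients chosen so the third local parity
# (for the RS parity blocks) is implied by the first two. Plain XOR is used here
# only to show why a single lost block needs 5 reads instead of 10.

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

# Ten tiny data blocks standing in for D0-D9.
data = [bytes([i]) * 4 for i in range(10)]

# Local parities over the two data groups, as in the notes above.
s1 = xor_blocks(data[0:5])   # covers D0-D4
s2 = xor_blocks(data[5:10])  # covers D5-D9

# Local repair: if exactly one block in a group is lost, XOR the four surviving
# data blocks with that group's local parity -- 5 block reads instead of 10.
lost = 2  # pretend D2 is missing
survivors = [d for i, d in enumerate(data[0:5]) if i != lost]
recovered = xor_blocks(survivors + [s1])
assert recovered == data[lost]
print("recovered D2 from 5 blocks:", recovered == data[lost])
```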
- HDFS-Xorbas Architecture
- Based on HDFS-RAID
- RAID-NODE
- Creates and maintains parity blocks on HDFS
- Creation is done as a MapReduce job
- BlockFixer
- Checks for bad or missing blocks
- Uses MapReduce
- They let HDFS-RAID create the RS(10, 4) parity blocks and then incrementally
create the LRC blocks with XOR
- By default, Hadoop places blocks randomly and avoids putting blocks of the
same stripe on the same node
- Repair
- First they try a light decoder that uses the LRC (a cheap XOR within the local group)
- If that can’t repair the block (more than one failure in the group), they fall
back on full RS(10, 4) decoding (see the sketch below)
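A rough sketch of that repair decision, under my own assumptions: the group layout and the helpers `xor_repair()` and `rs_decode()` are placeholders, not HDFS-Xorbas APIs. The point is just the control flow: try the cheap local decode first and only fall back to the expensive RS reconstruction when the local group has more than one erasure.

```python
# Rough sketch of the repair path: the light (XOR) decoder first, then the full
# RS(10, 4) decoder. GROUPS, xor_repair() and rs_decode() are placeholders of my
# own, not HDFS-Xorbas APIs.

GROUPS = {
    "data_low":  list(range(0, 5)),    # D0-D4, covered by local parity S1
    "data_high": list(range(5, 10)),   # D5-D9, covered by local parity S2
    "parity":    list(range(10, 14)),  # P0-P3, covered by the implied third LRC
}

def repair_block(lost, all_lost, xor_repair, rs_decode):
    """Repair one lost block, preferring the cheap local (XOR) decoder."""
    for group_name, members in GROUPS.items():
        if lost in members:
            lost_in_group = [b for b in all_lost if b in members]
            if len(lost_in_group) == 1:
                # Only one erasure in this local group: the light decoder can
                # rebuild it from a handful of surviving blocks.
                return xor_repair(group_name, lost)
            break
    # More than one failure in the group: fall back to reading 10 surviving
    # blocks and doing a full RS(10, 4) reconstruction.
    return rs_decode(lost, all_lost)

# Example: blocks 3 and 12 are lost; block 3 is the only failure in its group,
# so the light decoder handles it.
repaired = repair_block(
    3, {3, 12},
    xor_repair=lambda group, block: f"xor-repaired block {block} in {group}",
    rs_decode=lambda block, lost_set: f"rs-decoded block {block}",
)
print(repaired)
```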
- Future Direction: Spin discs down!
- I thought this was a great idea