Erasure Coding in Windows Azure Storage - Microsoft, 2012

  • [Paper] [Mirror]
  • [ATC’12] [Video]

  • Given a file, if we break it into 6 chunks we can achieve reliability with 3 parity chunks.
    • The overhead is then 1.5x (6 + 3)/6 = 1.5.
  • We can break the file into 12 chunks and we need 4 parity chunks.
    • The overhead is then 1.33x (12 + 4)/12 = 1.33.
  • Now, when there is a missing block, you need to retrieve 12 blocks instead of 6.
    • This means 2x disk and network IO.
  • They note that traditional erasure codes assume that 1 failure is as likely as 2 failures.
    • However, 1 failure is a lot less likely than 2 failures at the same time.
  • There solution is to break the file into two block of 6 chunks.
    • They create 2 file parity blocks as before.
    • Then they create 1 parity block for each half.
    • This means, when there is a single block missing, they only need to retrieve 6 blocks instead of 12.