# GraphChi: Large-Scale Graph Computation on Just a PC - CMU, 2012

• [Paper] [Mirror]
• [OSDI’12] [Video]

• Big Graph Problems are not the same as Big Data Problems
• The actual data can fit on a hard drive.
• Main Problem: Random access on a drive is slow
• Parallel Sliding Windows: Load/Process/Store all of the data of the adjacent edges at the same time.
• Graph is stored as in-edges.
• Split the vertices across P shards (each a file on disk, each file has all the incoming edges to that vertex).
• Sort the edges in each file by the source vertex.
• Let’s try and load the sub-graph of shard 1:
1. Load the first file, now we have all of the in-edges for the subgraph.
2. Since the other files are sorted by source vertex, all of the out-edges will be at the start of the other shards! We load those.
• In total, P^2 reads and writes for a full pass on the graph. But P is relatively small. Thus few random reads.
• Main Takeway: Sometimes by using algorithms and datastructures more cleverly we can achieve similar performance on a single machine.