This report poses the question:
“When does it make economic sense to make a piece of data resident in main memory and when does it make sense to have it resident in secondary memory (disc) where it must be moved to main memory prior to reading or writing?”
I love this report because it turns the problem into one of economics. Using money as a measure, we can actually determine a rule of thumb that makes sense. It reminds me of the Computer Architecture classes I took.
The report acknowledges that sometimes data must be resident in RAM due to latency requirements. However, they believe this is an uncommon case.
I have been guilty of blurting out that we could just solve a problem by putting the entire dataset in main memory, without really thinking about how much of the data actually needed to be in RAM. In a distributed system (where many of the problems I deal with live) it could be more useful to keep multiple copies of hot data in RAM and leave the long-tail data on disc. This report doesn’t address that concern head-on, but it gives a great framework to work in.
The authors present a rule of thumb for how long a 1KB page should be kept in memory, given the cost of the extra CPU, disc and channel needed to support the read. They calculate a disc access to cost about $2,000 per access per second, while main memory costs about $5 per kilobyte.
“Pages referenced every five minutes should be memory resident.”
This is based on the break-even point of a disc access every 2000/5 = 400 seconds, which they round to about five minutes. (I’m reading the February 1986 report, which has slightly different numbers from the May 1985 one.)
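To make the arithmetic concrete, here is a minimal sketch (my own, using the 1986 figures quoted above, not code from the report) of the break-even calculation:

```python
# Break-even residency interval, using the figures quoted above:
# a disc access per second costs about $2,000 (extra CPU + channel + disc arm),
# and main memory costs about $5 per kilobyte.
COST_PER_ACCESS_PER_SECOND = 2000.0  # dollars per (access/second) of disc capacity
MEMORY_COST_PER_KB = 5.0             # dollars per kilobyte of main memory


def break_even_seconds(page_kb: float) -> float:
    """Interval between accesses at which keeping the page memory-resident
    costs the same as fetching it from disc each time it is referenced."""
    memory_cost = page_kb * MEMORY_COST_PER_KB
    return COST_PER_ACCESS_PER_SECOND / memory_cost


print(break_even_seconds(1.0))  # 400.0 seconds -- roughly five minutes
```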
As the record size decreases, the break-even time increases, and conversely, for larger records the break-even time decreases. However, they note that at some point the record size exceeds the disc transfer size. The specifics may be different now, but it’s an indicator that there may be other thresholds in modern memory hierarchies.
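Running the same arithmetic over a few page sizes (again, my own illustration) shows the trend they describe:

```python
MEMORY_COST_PER_KB = 5.0             # $/KB of main memory (1986 figure)
COST_PER_ACCESS_PER_SECOND = 2000.0  # $ per (access/second) of disc capacity

for page_kb in (0.1, 1, 4, 32, 256):
    t = COST_PER_ACCESS_PER_SECOND / (page_kb * MEMORY_COST_PER_KB)
    print(f"{page_kb:>6} KB page: memory-resident if referenced every {t:,.1f} s or less")
# 0.1 KB -> 4,000 s; 1 KB -> 400 s; 4 KB -> 100 s; 32 KB -> 12.5 s; 256 KB -> 1.6 s
```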
The authors give an example of a customer with a 500 MB database that they wanted to keep entirely in main memory. An all-disc system with the same TPS could be built for a million dollars less than the main-memory system.
They then showed that, for the same TPS, an optimal hybrid memory/disc system could be built for $1.27 million less than the main-memory system.
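As a purely hypothetical illustration of that kind of comparison (my access rates and page split, not the report’s figures), here is the rule applied to a 500 MB database of 1KB pages divided into hot and cold groups:

```python
# Hypothetical illustration: price an all-memory vs. a hybrid configuration
# for a 500 MB database of 1KB pages, using the 1986 figures quoted above.
MEMORY_COST_PER_KB = 5.0             # $/KB of main memory
COST_PER_ACCESS_PER_SECOND = 2000.0  # $ per (access/second) of disc capacity

# Hypothetical workload: (number of 1KB pages, total accesses/second to that group).
workload = [
    (10_000, 500.0),    # hot pages, referenced constantly
    (490_000, 100.0),   # long tail, rarely touched
]

all_memory_cost = sum(pages * MEMORY_COST_PER_KB for pages, _ in workload)

hybrid_cost = 0.0
for pages, accesses_per_sec in workload:
    seconds_between_references = pages / accesses_per_sec  # per page, on average
    if seconds_between_references <= 400:  # five-minute rule: keep in RAM
        hybrid_cost += pages * MEMORY_COST_PER_KB
    else:                                  # leave on disc, pay for the arms instead
        hybrid_cost += accesses_per_sec * COST_PER_ACCESS_PER_SECOND

print(f"all-memory: ${all_memory_cost:,.0f}  hybrid: ${hybrid_cost:,.0f}")
# all-memory: $2,500,000  hybrid: $250,000
```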
The notion of using many parallel discs reminds me of Google’s Dremel.
What about the trade-off between memory and CPU cycles? When does it make sense to compress a bunch of data to save memory or cache some computations that might be used later?
Similar to the Five Minute Rule, we can compare the cost of memory with the price of an instruction. In the report they use $0.001 per byte of memory and $0.005 per instruction per second of CPU, creating the rule:
“Spend 5 bytes of main memory to save 1 instruction per second.”
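As a worked example of that trade-off (using the figures as quoted above, which vary between versions of the report):

```python
# Memory vs. CPU trade-off, using the figures quoted above.
MEMORY_COST_PER_BYTE = 0.001                 # dollars per byte of main memory
CPU_COST_PER_INSTRUCTION_PER_SECOND = 0.005  # dollars per (instruction/second) of CPU

# Break-even: bytes of memory that cost the same as one instruction per second.
bytes_per_instruction_per_second = (
    CPU_COST_PER_INSTRUCTION_PER_SECOND / MEMORY_COST_PER_BYTE
)
print(bytes_per_instruction_per_second)  # 5.0 -> spend 5 bytes to save 1 instruction/second
```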
Unsurprisingly, the numbers are not exactly the same now, and I don’t think the authors intended people to take them literally even then. What matters is the idea that we need to think about how we are trading off memory and disc. Really, memory and anything!
You should also check out two later reports on the topic: “The Five Minute Rule Ten Years Later” (1997) and “The Five-Minute Rule 20 Years Later” (2007).
They don’t talk about it in this paper, but one could see extending the idea to local caching of data from a network. How does the math change as we move to inter-planetary storage systems? Amazon’s CTO Werner Vogels references the report in the context of Amazon’s Glacier product, which offers cold storage at low cost and huge latency (on the order of hours).
The industry does this kind of analysis all the time, sometimes with complicated models, but as a newly minted engineer I will try to keep this framework in mind while thinking about systems.