Beyond Server Consolidation - Amazon, 2009

This article is by Amazon’s CTO, Werner Vogels.

  • Virtualization was originally designed to efficiently use hardware
    • Only a few companies could afford such systems, so machines were shared and virtualization isolated each customer from the others
    • Originally virtualization was coarse-grained time sharing
  • The big push at the time of the article was server consolidation
    • This is a cost saving exercise
      • From my time at Twitter, I learned how much money can be saved this way
    • Vogels believes a main reason for “server sprawl” was that software vendors required isolation for their applications, often demanding specific OS versions or configurations.
    • Many of the servers were underutilized
  • Vogels believes that virtualization is more than just consolidation:

    “Virtualization breaks the 1:1 relationship between applications and the operating system and between the operating system and the hardware.”

    • It’s not just the N:1 relationship (many apps, one resource) that virtualization provides, but also the 1:N relationship (one app, many resources)
      • Virtualization can allow for elastic applications that scale according to load (a minimal sketch follows)
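      A minimal sketch of that elastic 1:N scaling; the per-instance capacity, the 70% target (borrowed from Vogels’ suggestion later in the article), and the load numbers are all hypothetical:

      ```python
      import math

      def desired_instances(load_rps: float, per_instance_rps: float,
                            target_util: float = 0.7) -> int:
          """Scale out so each instance sits near the target utilization."""
          return max(1, math.ceil(load_rps / (per_instance_rps * target_util)))

      # Hypothetical: one instance handles 500 req/s flat out.
      for load in (200, 1_000, 5_000, 20_000):
          print(f"{load:>6} req/s -> {desired_instances(load, 500)} instances")
      ```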
  • Underutilized Servers

    “Single averages seldom tell the whole story.”

    • The utilization of a set of servers is not constant; it’s often periodic (see the sketch below).
      • Google found that even in their well-tuned systems, utilization can be anywhere between “10 and 50 percent when inspected over longer timeframes”
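      A quick illustration of why the single average misleads for periodic load; the diurnal workload shape and every number here are invented:

      ```python
      import math
      import statistics

      # Hypothetical diurnal load: CPU utilization sampled hourly for a week.
      util = [0.30 + 0.25 * math.sin(2 * math.pi * h / 24) for h in range(24 * 7)]

      mean = statistics.mean(util)
      p95 = sorted(util)[int(0.95 * len(util))]
      print(f"mean={mean:.0%}  p95={p95:.0%}  peak={max(util):.0%}")
      # The mean says ~30%, but capacity has to cover the ~55% peaks.
      ```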
    • One of the challenges is determining the resource requirements of an application
      • One of the things we wanted to do on the Streaming Compute team at Twitter was to figure out a good way for internal customers of Storm to characterize their resource usage. It’s really tough given the inherent spikes in their load. Even harder is accounting for how shared resources (like I/O) behave as load increases (you can’t assume the relationship is linear).
      • Vogels advocates for a profile that considers resource usage over time
      • He notes that it is also important to consider dependencies on other systems
        • I think we also need to consider failure or slowdown of dependencies
        • This is a common issue among Storm users. If a service their topology depends on is experiencing issues, their stream backs up, and the topology needs more resources to catch up once the service returns to normal (a back-of-the-envelope model follows).
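        A back-of-the-envelope model of that catch-up cost, assuming a constant input rate (the numbers are made up):

        ```python
        def catchup_rate(input_rate, outage_min, recovery_min):
            """Throughput needed after a dependency outage: the topology must
            drain the backlog while also keeping up with live traffic."""
            backlog = input_rate * outage_min
            return input_rate + backlog / recovery_min

        # Hypothetical: 10k events/min, dependency down for 30 min,
        # and we want to be caught up 15 min after it recovers.
        rate = 10_000
        need = catchup_rate(rate, outage_min=30, recovery_min=15)
        print(f"steady state {rate}/min, catch-up {need:.0f}/min ({need / rate:.1f}x)")
        ```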
      • He also asks: what happens when an application runs out of capacity?
        • Is it able to adapt?
      • A common practice is to put applications in an isolated environment for analysis
        • We did this on the Streaming Compute team, but it did not provide the whole picture
      • Vogels believes the biggest challenge is balancing workloads at runtime
        • With less slack due to consolidation, applications hit the resource boundaries faster
          • I’ve heard stories of poorly configured applications that ran fine on shared infrastructure for months because the system let them use extra resources when they were available. Once more slots on the server were inevitably filled, there was no longer room for ‘bursting’ and the applications failed.
      • Vogels doesn’t believe that 100% utilization should be the goal
        • He suggests 70% for highly tuned apps
        • and 50% for mixed workload environments
      • He talks about transparent migration, but also discusses some of the challenges with it
        • Some applications can checkpoint and restart
          • I pose the question: what applications do we have that don’t need to be clustered like a database but would be amenable to checkpointing?
            • There are a few papers researching checkpointing applications and stopping them when EC2 spot-instance prices change (a minimal checkpoint/restart sketch closes out this section)
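    A minimal checkpoint/restart sketch for such a job; the file name, batch size, and the loop standing in for real work are all placeholders:

    ```python
    import json
    import os
    import signal

    CHECKPOINT = "progress.json"  # hypothetical checkpoint location

    def load_checkpoint() -> int:
        if os.path.exists(CHECKPOINT):
            with open(CHECKPOINT) as f:
                return json.load(f)["next_item"]
        return 0

    def save_checkpoint(next_item: int) -> None:
        # Write then rename so an interruption never leaves a torn file.
        tmp = CHECKPOINT + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"next_item": next_item}, f)
        os.replace(tmp, CHECKPOINT)

    stopping = False

    def request_stop(signum, frame):
        global stopping
        stopping = True

    signal.signal(signal.SIGTERM, request_stop)  # e.g., instance being reclaimed

    item = load_checkpoint()  # resume wherever the last run left off
    while item < 1_000_000 and not stopping:
        # ... do the real work for `item` here ...
        item += 1
        if item % 1_000 == 0:
            save_checkpoint(item)
    save_checkpoint(item)
    ```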
  • Development - Virtualization allows for easier development of applications
    • They can be made self-serve
    • Developers can develop on a small VM and then switch to a larger instance when they need to evaluate with real workloads
    • Uniform deployment environment
    • Testing
      • Resource usage changes depending on where the team is in their dev cycle
      • Great for dealing with many different OSes or configurations
  • Procurement
    • Before EC2, teams had to deal with long acquisition times for servers that “often [ran] into several months”
    • The teams were then hesitant to return the resources because they didn’t want to wait for the acquisition again
    • Teams had to judge how many servers they needed for the project before they started development (in order to overlap the wait for servers with development)
      • I bet they erred on the side of caution, wasting lots of resources
      • I note that it’s hard enough to determine the resource requirements of an application while it’s running, let alone before it’s even finished being designed!
    • Vogels discusses how much this issue is magnified in government IT

      “One DoD IT architect reported that the department’s software prototype normally would cost $30,000 in server resources, but by building it in virtual machines for Amazon EC2, in the end it consumed only $5 in resources.”

      Article here

  • Utility Computing
    • If we treat the infrastructure as a utility (pay for usage) we get a whole host of benefits
      • Almost no initial acquisition costs
      • Run VMs on local infra then push to the cloud for production
      • Overflow / peak capacity
      • Great for apps that don’t need to run all the time
        • Indexing is an example
        • He gives the example of the New York Times using 100 machines for 24 hours to convert 4TB of document images to PDF “at the cost of a single server” (quick arithmetic below).
          • This would be prohibitive if they purchased all of the servers to meet their deadline
          • Article here
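          The arithmetic is striking even with rough figures; the $0.10/instance-hour is my recollection of 2008-era EC2 small-instance pricing, not a number from the article:

          ```python
          instances, hours, price_per_hour = 100, 24, 0.10
          print(f"rented on EC2: ${instances * hours * price_per_hour:.0f}")  # $240
          # versus buying 100 machines outright just to hit one deadline
          ```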
    • He also mentions that creating economic models for automated resource allocation “remains the Holy Grail”
      • I’ve heard some anecdotes of this at Google, super cool!