Friday, March 18, 2011

Understanding and using Amazon EBS - Elastic Block Store

There has been a lot of discussion in the last few days about EBS since it was implicated in a long outage at reddit.com.

Rule of Thumb

The benchmarking Netflix did when we started on AWS highlighted some inconsistent behavior in EBS. The conclusion we reached is a rule of thumb for EBS: if you sustain less than 100 iops (input+output operations per second) as a long-term average, it works fine. Short-term bursts can reach 1000 iops. By short term I mean less than a minute; by long term, more than 10 minutes. YMMV.

If you are doing benchmarks like this, collect response time and throughput and plot your data over time. You need to run long enough that the performance shows steady state behavior. The problem with EBS is that it doesn't have a particularly steady state. To explain why, we need to look at the underlying architecture. I don't know the details of how EBS is implemented, but there is enough information available to explain how it behaves.
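To make this concrete, here is a minimal Python sketch of the kind of measurement loop I mean (illustrative only, not something we run in production): it samples the read and write completion counters from /proc/diskstats and prints the iops for one device over time. The device name, interval and sample count are placeholders you would adjust for your own setup.

```python
import time

def io_counts(device):
    """Return (reads completed, writes completed) for one device from /proc/diskstats."""
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                return int(fields[3]), int(fields[7])
    raise ValueError("device %s not found in /proc/diskstats" % device)

def sample_iops(device="xvdf", interval=1.0, samples=600):
    """Print input+output operations per second for the device, once per interval."""
    prev_reads, prev_writes = io_counts(device)
    for _ in range(samples):
        time.sleep(interval)
        reads, writes = io_counts(device)
        iops = ((reads - prev_reads) + (writes - prev_writes)) / interval
        print("%s %6.0f iops" % (time.strftime("%H:%M:%S"), iops))
        prev_reads, prev_writes = reads, writes

if __name__ == "__main__":
    sample_iops()
```

Run it for long enough that the long-term average settles down; the first minute of a burst tells you very little about how EBS will behave under sustained load.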

EC2

The AWS EC2 architecture is built out of commodity low cost servers: each has a single 1Gbit network, a few CPUs, a few disks and a few GBytes of RAM. Over time the models have changed, and EC2 does have a 10Gbit network option now, but for the purposes of this discussion we will concentrate on the 1Gbit network models. Individual servers are virtualized into the familiar EC2 models by slicing up the RAM, CPUs and disk space, and sharing the network bandwidth and disk iops. When EC2 instances break or are de-configured, any data on the internal disks is lost.

Elastic Block Store http://aws.amazon.com/ebs/

The AWS EBS service provides a reliable place to store data that doesn't go away when EC2 instances are dropped, while still providing the same mounted filesystem capability as the internal disks. If you need more disk space or iops you can mount more EBS volumes on a single EC2 instance and spread out the load. The EBS volume is connected to the EC2 instance over the same 1Gbit network as everything else. In a datacenter this would normally be built using commercially available high end storage from NetApp, EMC or whoever; it would be quite expensive (costing much more than the EC2 instance itself) and would be fast and reliable up to the limits of the network. To build a low cost cloud, the alternative is to use RAIN (Redundant Array of Inexpensive Nodes), which could be based on standard EC2 instances or variants that have more disks per CPU. Software is then used to coordinate the RAIN systems and provide an EBS service that will be slower than high end storage, but still be very reliable and be limited by the 1Gbit network.
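As an illustration of spreading the load over several volumes, here is a minimal sketch using the boto Python library. This is not our production tooling, and the instance id, zone, sizes and device names are placeholder assumptions.

```python
import time
import boto

def attach_ebs_volumes(instance_id, zone, count=4, size_gb=1000):
    """Create several EBS volumes and attach them to one instance so that
    the iops load can be spread (e.g. striped) across them."""
    conn = boto.connect_ec2()  # credentials are picked up from the environment
    volumes = []
    for i in range(count):
        vol = conn.create_volume(size_gb, zone)
        while vol.update() != "available":  # wait for the volume to be created
            time.sleep(2)
        device = "/dev/sd%s" % chr(ord("f") + i)  # /dev/sdf, /dev/sdg, ...
        conn.attach_volume(vol.id, instance_id, device)
        volumes.append(vol)
    return volumes

# Example with placeholder values:
# attach_ebs_volumes("i-12345678", "us-east-1a")
```

Once attached, the volumes would typically be striped together with a software RAID or LVM layer inside the instance.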

S3 and Availability Zones

AWS also has an S3 storage service that behaves like a key/value store accessed via http requests and a REST API rather than a directly mounted filesystem. It is possible to rapidly snapshot an EBS volume to and from S3, including incremental backups and restores that fill in as they go, so you don't have to wait before using them. This implies to me that they share a common back-end infrastructure to some extent. The other key difference is that EBS volumes only exist in a single AWS Availability Zone, while S3 data is replicated across two or three Availability Zones. It takes longer to replicate the data for S3, so it is slower, but it is very robust and it is almost impossible to lose data. You can think of an Availability Zone as a complete datacenter. All the zones in a region are separate datacenters that are close enough together to support a high bandwidth and low latency network between them, but they have separate power sources and connections to the Internet.
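To make the snapshot behavior concrete, here is a hedged boto sketch of backing a volume up to S3 and restoring it; the volume id, size and zone are placeholders.

```python
import boto

conn = boto.connect_ec2()  # credentials are picked up from the environment

# Incremental backup: only the blocks changed since the previous snapshot
# need to be copied into S3.
snap = conn.create_snapshot("vol-12345678", "nightly backup")

# Restore: build a new volume from the snapshot in whichever zone you need.
# It can be attached and used immediately; blocks fill in lazily from S3.
restored = conn.create_volume(1000, "us-east-1a", snapshot=snap.id)
```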

Multi-Tenancy

The most efficient chunk of compute and storage resource to buy and deploy when building a cloud is either too big or too small for the actual use cases of real applications. Virtualization is used to sub-divide the chunks, but then each individual machine is supporting several independent tenants. For local disks, the space is divided between the tenants, and for network, everyone is sharing the same 1Gbit interface. This works well on average, because most use cases aren't network or disk bound, but you cannot control who you are sharing with and some of the time you will be impacted by the other tenants, increasing variance within each EC2 instance. You can minimize the variance by running on the biggest instance type, e.g. m1.xlarge, or m2.4xlarge. In this case there isn't room for another big tenant, so you get as much as possible of the disk space and network bandwidth to yourself. The virtualization layer reserves some of the capacity. It's possible to tell that another tenant is keeping the CPU busy by looking at the "stolen time", but there are no metrics for stolen iops or network bandwidth.
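For the CPU case, here is a minimal Python sketch of the kind of check I mean (illustrative only): it reads the aggregate counters from /proc/stat twice and reports what fraction of CPU time was stolen over the interval. There is no equivalent counter for stolen iops or network bandwidth.

```python
import time

def cpu_counters():
    """Aggregate CPU time counters (in jiffies) from the first line of /proc/stat."""
    with open("/proc/stat") as f:
        fields = f.readline().split()  # "cpu user nice system idle iowait irq softirq steal ..."
    return [int(x) for x in fields[1:]]

def steal_percent(interval=10):
    """Percentage of CPU time stolen by the hypervisor/other tenants over the interval."""
    before = cpu_counters()
    time.sleep(interval)
    after = cpu_counters()
    deltas = [a - b for a, b in zip(after, before)]
    total = sum(deltas)
    steal = deltas[7] if len(deltas) > 7 else 0  # the 8th counter is steal time
    return 100.0 * steal / total if total else 0.0

print("%.1f%% of CPU time stolen" % steal_percent())
```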

The EBS service is also multi-tenant. Many clients mount disk space from a common backend pool of EBS disks. You don't get to see how the disk space is allocated, or how data is replicated over more than one disk or instance for durability, but it is limited to that availability zone. A busy client can slow down other clients that share the same EBS service resources. EBS volumes are between 1GB and 1TB in size. If you allocate a 1TB volume, you reduce the amount of multi-tenant sharing that is going on for the resources you use, and you get more consistent performance. Netflix uses this technique: our high traffic EBS volumes are mostly 1TB, although we don't need that much space.

This is actually no different in principle from the large shared storage area network (SAN) backends (from companies like EMC or NetApp) that are in common datacenter use. Those also have unpredictable performance when pushed hard, and they mask this issue with lots of battery-backed memory. The difference is cost. EBS is 10c per GByte per month. If you build a competing public cloud service using high end storage, you could get better performance, but your cost base would be far higher.

Visualizing Multi-Tenant Disk Access

I have come up with some diagrams to help show what happens. I'm basing them on a simplified view of AWS where the only instance type family is m1 and everything is built out of one underlying building block. This consists of a fairly old specification system: 8 cores, 16GB RAM, four 500GB disks and a single 1Gbit network. In reality, AWS is much more complex than this, but the principles are the same.

Starting with internal disks, this is what an m1.xlarge looks like: it takes up the whole system apart from a small amount of memory, disk space and network traffic reserved for the VM and AWS configuration/management information. You can expect to have minimal multi-tenant contention for network or disk access.



The m1.large instance type halves the system: each instance has two disks rather than four, so it shares the network and some of the disk controller bandwidth, but it should have minimal iops contention with the other tenant.



The low cost m1.small instance type has 160GB of disk per instance, so we can fit three per disk for a total of 12 instances per machine. (Note that the memory for a real m1.small is 1.7GB, so only 9 would fit in 16GB RAM; however, the c1.medium instance has 1.7GB RAM, 350GB of disk, and more CPU, so six m1.small and three c1.medium fit.) You can see the multi-tenancy problem here: any of the instances could generate enough traffic to fill the network and make one of the disks busy, and that is going to affect other instances in an unpredictable and random manner.

Here's an analogy: you can rent a whole house, rent a room in a house, or rent a couch to sleep on; you get what you pay for.

If you ever see public benchmarks of AWS that only use m1.small, they are useless; it shows that the people running the benchmark either didn't know what they were doing or were deliberately trying to make some other system look better. You cannot expect to get consistent measurements of a system that has a very high probability of multi-tenant interference.



EBS Multi-Tenancy

The next few diagrams show the flow of traffic from an instance to the EBS service, which makes two copies of the data on disks connected to separate instances. I don't know if this is how EBS works, but if we wanted to build an EBS-like system using the same building block, it could look like this. In practice it would make sense to have specialized back-end building blocks with much more disk space.

The first diagram shows how Netflix runs EBS. We start with an instance that has the maximum network bandwidth and no other tenants, and we allocate maximum size 1TB volumes (we stripe many of them together), so the service has to use most of the disk space in the back-end to support us and we have less chance of another tenant making the EBS disks busy. The performance of EBS in this simplified case would be higher latency than local disk, but otherwise similar. I suspect that in reality the EBS volume is spread over more disks in the backend, which gives higher throughput but with higher variance.



If we drop down to a more typical m1.large configuration with 100GB of EBS each, two instances are sharing network bandwidth, the EBS service is servicing two sets of requests, and the EBS back end has many more tenants per disk, so we would expect better peak performance than the two internal disks in the m1.large but more variance.



For the case where we have many m1.small instances each accessing a 10GB EBS volume, it is clear that the peak performance is going to be far better than a share of a local disk, but the contention for network, EBS service and backend disks will be extremely variable, so performance will be very inconsistent.



How To Measure Disk and Network Performance

Someone should write a book on that (I already did, but for Solaris); however, there is a useful AWS forum post that explains how to interpret Linux iostat. This blog post is too long already, so Linux iostat will have to wait for another time.

Best Practices for Cloud Storage with Cassandra

There are two basic patterns for Cassandra. One is a persistent memory cache, where we size the data to fit in memory so that all reads are fast and writes go to disk; the m2.4xlarge instance type with 68GB RAM and two 850GB disks is best. The second pattern is where there is a much larger data set than memory, and the m1.xlarge with 16GB RAM and four 420GB disks will have the best iops for reads and a much lower overall cost per GB for storage. In both cases we get all the network bandwidth for servicing clients and the inter-node replication traffic, and minimal multi-tenant variance.
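As a trivial illustration of that sizing decision (this is not a Netflix tool; the instance figures are the ones quoted above and the replication factor is an assumption), here is a sketch that picks between the two patterns based on whether the replicated data set fits in memory across the cluster.

```python
def pick_cassandra_instance(data_set_gb, nodes, replication_factor=3):
    """Illustrative only: choose an instance pattern based on whether the
    replicated data set fits in RAM across the cluster."""
    # Figures quoted above: m2.4xlarge has 68GB RAM and two 850GB disks;
    # m1.xlarge has 16GB RAM and four 420GB disks.
    instances = {
        "m2.4xlarge": {"ram_gb": 68, "disk_gb": 2 * 850},
        "m1.xlarge":  {"ram_gb": 16, "disk_gb": 4 * 420},
    }
    needed_gb = data_set_gb * replication_factor
    if needed_gb <= nodes * instances["m2.4xlarge"]["ram_gb"]:
        return "m2.4xlarge"  # persistent memory cache: all reads served from RAM
    return "m1.xlarge"       # disk bound: more spindles and lower cost per GB

print(pick_cassandra_instance(data_set_gb=300, nodes=18))   # fits in memory -> m2.4xlarge
print(pick_cassandra_instance(data_set_gb=5000, nodes=18))  # too big -> m1.xlarge
```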

Saturday, March 12, 2011

How not to build a Private Cloud

It's all $, FUD, and internal politics. An MBO Cloud is what you get when the CEO tells the CIO to "figure out that cloud thing" (Management By Objective - i.e. the CIO's bonus depends on it).

There is no technical reason for private cloud to exist.

[update: to clarify, that doesn't mean that I'm against private clouds or don't think they exist, because $, FUD and internal politics are a fact of life that constrain what can be done. Change also takes time and you have to "go to war with the army you have". However, this post is about what happens if your organization reallocates the $, isn't afraid, and has effectively no internal politics getting in the way.

This post was written in the middle of a debate on Twitter between @adrianco, @reillyusa, @beaker and others, including key insights from @swardley.

You should also read Christian Reilly's follow-up post "The Hollywood Culture" http://bit.ly/ePsisJ and many thanks to @bernardgolden for pointing out the excellent Business Week cover story on Cloud Computing http://ow.ly/4dm07 - after reading it I was amazed how well it aligned with what I write here - then I saw that it was by Ashlee Vance, one of the most clueful journalists around.

Netflix ITops Security Architect Bill Burns also wrote a very interesting post on the security challenges of cloud, we've been working together and he's on the interview team for the "Global Cloud Security Architect" I mention below.]

Too big for public cloud? You should *be* a public cloud.

Organizations that run infrastructure at the scale of tens to hundreds of thousands of instances have created cloud based models and opened them up to other organizations as public clouds. Amazon, Google and Microsoft are the clearest examples; they have expertise in software architecture, which is why they dominate the API definition. Telcos and hosting companies are adopting this model to provide additional public cloud capacity, using clones and variants of the API. Other organizations at this scale are already figuring out how to expose their capacity to their customers, partners and supply chain. The task you take on is to simultaneously hire the best people to run your cloud (competing with Amazon, Google etc.) and run it at low cost, which is why you need to be at huge scale and you need to decide that running infrastructure is a core competency of your business. Netflix is too small, doesn't regard infrastructure as core, and doesn't want to hire a bunch of ITops people.

It costs too much to port our apps? Your $ are mis-allocated.

What does it cost to build a private cloud, how long does it take, and how many consultants and top tier ITops staff do you have to hire? Sounds like a nice empire building opportunity for the CIO. The alternative is to allocate that money to the development organization, hire more developers and rewrite your legacy apps to run on the public cloud, and give the development VP the budget to run public cloud directly. The payback is more incremental and manageable, but this is effectively a re-org of your business to move a large chunk of budget and headcount around. This is what happened at Netflix. It probably takes an act-of-CEO at most companies; the barriers are mostly political. Yes it will take time, but so will bringing up a private cloud.

Replace your apps with SaaS offerings.

Many internal apps can be replaced by cloud services; we just outsourced our internal help desk and incident management software. No one I know does payroll in-house. This is uncontroversial and is happening.

We can't put confidential data in a public cloud? This is just FUD.

The enterprise vendors are desperate to sell private clouds, so they are sowing this fear, uncertainty and doubt in their customer base to slow down adoption of public clouds. The reality is that many companies are already putting confidential data in public clouds. I asked the question "when will someone using PCI level 1 be in production on AWS" at a Cloud Connect panel and was told that it is already being done; Terremark pointed out that they host H&R Block's tax business. There are several models of public cloud with different SLA, cost and operational models that can support confidential data securely. There is also an argument that datacenter security is not as strong as people would like to think, and that the large cloud vendors can do a better job than most enterprises at keeping the infrastructure secure. At Netflix, we are about to transition to a global cloud based business, so we are currently hiring a "Cloud Security Architect" who understands compliance rules like PCI (the credit card standard) on a global basis (we didn't need global expertise before). Part of their job is going to be to implement this.

There is no way my execs will sign off on this! Do they care about being competitive?

The biggest champion at Netflix for doing public cloud and doing it "properly" with an optimized architecture was our CEO Reed Hastings. He personally argued that we should try NoSQL rather than MySQL to push the envelope. Why? Because the bigger risk for Netflix was that we wouldn't scale and wouldn't have the agility to compete. He was right: we have grown faster than our ability to build datacenters, and we have the agility we need to outrun our competition. Netflix has never had a CIO in the first place; we do have an excellent VP of operations, though, and there is plenty to do running the CDNs and SaaS vendors that support Enterprise IT.

Will private clouds be successful? I think there will be a few train wrecks.

The train wrecks will come as ITops discover that building a private cloud is much harder and more expensive than they thought, and takes a lot longer than expected. Meanwhile their developer organization won't be waiting for them, and will increasingly turn to public clouds to get their jobs done. We could argue about definitions, but there are private clouds that are effectively the back ends for specialized large scale ecosystems, like engineering companies that have to interface to the things that build stuff, or that operate in places where there is no effective connection into the public clouds. For example, on board a ship that has limited external bandwidth, or to support a third world construction project. My take is that these will be indistinguishable from specialized SaaS offerings within a supply chain ecosystem.

How not to build a private cloud - The Netflix Way

Re-org your company to give budget and headcount to the developers, and let them run the public cloud operations
Ignore the FUD; best practices and patterns for compliance and security already exist and are auditable
Get the CEO to give the CIO a different MBO: to shrink their datacenter.

Good luck with that :-)

Wednesday, March 02, 2011

Maslow's Hierarchy of NoSQL Reads (and Writes)

I tried out Prezi to create this presentation; it was more fun to create than PowerPoint, but it has a few limitations. I would like better drawing tools. The talk itself puts some of the main NoSQL solutions into context.