Want to Stay Up When AWS Is Down?
Amazon Web Services suffered another outage Sunday morning. The cause this time seems to have been the DynamoDB service, Amazon's highly scalable, cloud-hosted data store. Many other services reported "increased error rates" at the same time; apparently either DynamoDB underpins more of AWS than we realized, or both depend on some other layer that Amazon doesn't talk about publicly.
We’ve seen a pattern play out in the last few AWS outages that should make us all think carefully about how we use the cloud.
The problem: what if you can’t make changes?
One version of this pattern is best exemplified by the following list of status messages from the AWS Status Dashboard for the EC2 service:
3:12 AM PDT: We are investigating increased error rates and latencies for the EC2 APIs in the US-EAST region.
7:43 AM PDT: Error rates for impacted APIs are beginning to recover and new instance launches are starting to succeed.
In other words, early Sunday morning, it was hard to launch a new EC2 instance in the US-East region, and this condition persisted for four-and-a-half hours. Past EC2 outages have also affected this aspect of the service. Applications that weren't dependent on DynamoDB Sunday morning kept running fine—that is, as long as they didn't need to start new instances, or auto-scale, or do anything else that relies on the EC2 API.
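The standard client-side response to "increased error rates" is to retry with exponential backoff and jitter, which is what AWS's own SDKs do. Here's a minimal sketch of that pattern; the flaky endpoint is simulated, not a real AWS call, and all names are illustrative:

```python
import random
import time

def call_with_backoff(api_call, max_attempts=5, base_delay=0.1):
    """Retry a failing API call with exponential backoff and full jitter.

    `api_call` is any zero-argument callable that raises on failure.
    """
    for attempt in range(max_attempts):
        try:
            return api_call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error
            # Full jitter: sleep a random amount up to the exponential cap.
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))

# Simulated flaky launch endpoint: fails twice, then succeeds.
attempts = {"n": 0}
def flaky_launch():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("RequestLimitExceeded")
    return "i-0abc123"

instance_id = call_with_backoff(flaky_launch)
print(instance_id)  # succeeds on the third attempt
```

Note that backoff only papers over a brief brownout; in a four-and-a-half-hour outage like Sunday's, retries simply fail for four-and-a-half hours, which is the article's larger point.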
The second version of the pattern occurs when thousands of ops teams detect a problem in AWS and respond by launching new instances somewhere else, such as in another region. We have seen this in the past: launching a new instance goes from taking a few seconds to taking much longer.
For instance, imagine an East Coast AWS outage that affects more than just a few services; say a whole Availability Zone catches fire or is taken offline by an earthquake. Thousands of ops teams around the world start scrambling to launch new instances in other regions, or on other clouds. As huge as AWS is, that kind of load is going to strain the management systems, and those instances might take a while to launch.
Do you want to be the last team in line when that happens?
The importance of geographic redundancy
So what does this mean for us as we try to design robust systems that run in the public cloud? The first lesson is the one that you can read about all over the place. That's because Amazon did a smart and forward-thinking thing here: they divided their world into "regions" and "availability zones."
AZs are close together, but physically isolated. As a result, an AWS region is actually a super-data-center that includes multiple redundant physical facilities. It’s arguably a lot more redundant than anything any of us would build on our own.
Regions, however, are not just physically isolated but logically isolated. They run separate deployments of the AWS software stack, and they are connected only via the public Internet. As a result, a software problem in one region should not spread to another.
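Within a region, the same failure-domain thinking applies at the AZ level: spread instances evenly so that losing any one zone takes out the smallest possible slice of your fleet. A minimal placement sketch (the zone names are illustrative, not a real account's layout):

```python
from itertools import cycle

def spread_across_zones(instance_ids, zones):
    """Round-robin instances across availability zones so that no
    single zone failure takes down more than ceil(n / len(zones))
    of them."""
    return {inst: zone for inst, zone in zip(instance_ids, cycle(zones))}

zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
placement = spread_across_zones([f"i-{n}" for n in range(7)], zones)

# With 7 instances over 3 zones, the split is 3/2/2: losing any one
# zone leaves at least 4 instances running.
per_zone = [list(placement.values()).count(z) for z in zones]
print(per_zone)
```

In practice an AWS Auto Scaling group with multiple subnets does this balancing for you; the point of the sketch is only that even placement bounds the blast radius of a zone failure.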
That means that, as architects, one thing we should do is design our systems to take advantage of multiple regions. Of course, it's not as easy as the blogosphere makes it out to be.
For instance, at Apigee we have been running our service across regions since 2012, primarily by extending our Cassandra and Zookeeper clusters across regions. We were able to do this because our software is built around technology, like Cassandra, that inherently supports geographical replication on an unreliable network. (We also had to set up redundant VPN tunnels and do some careful planning).
If we had based our product on a traditional DBMS or some other type of technology, however, we probably would not have been able to achieve the SLAs that our customers demand.
But unless you were a big DynamoDB user, you didn't even need multi-region availability to stay alive last weekend. In fact, although we run across regions, we didn't need that capability to stay up; that part of our service kept doing its normal job, which is to reduce latency.
What you do need, however, is to be a bit conservative.
The importance of systems management
When we design systems, we tend to go through all the things that can go down. “What happens if a disk fails?” “What happens if a server fails?” “What happens if we need to upgrade the database server?”
Then we get to management, and someone asks a question like, “What happens when the systems management server goes down?” Too often, the answer is, “It’s just management. Everything else will keep running if it’s down. Don’t worry about it.”
This, of course, is precisely the wrong answer.
The infinite scalability and API-driven nature of the cloud leads to all sorts of interesting possibilities. Auto-scaling is easy to turn on. You never need to update the software on a server, because it's easier to launch a new one. And you certainly don't need to keep spare servers already running in case of a failure.
But every one of those possibilities assumes that you can make management changes to your cloud at any time. In fact, you are most likely to need those changes when something is going wrong in the public cloud, and that is precisely when the management APIs themselves may be degraded or unavailable.
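One way around that assumption is to keep a warm pool of spares: instances that are already launched and healthy, so failover is a local decision that never touches the (possibly degraded) launch API. A minimal sketch, with illustrative instance names:

```python
class WarmPool:
    """Hold pre-launched standby instances so that replacing a failed
    node requires no call to the cloud's management plane."""

    def __init__(self, standby_ids):
        self.standbys = list(standby_ids)

    def replace_failed(self, failed_id):
        """Promote a standby in place of `failed_id`, if one is left."""
        if not self.standbys:
            raise RuntimeError("warm pool exhausted; would need launch API")
        return self.standbys.pop(0)

pool = WarmPool(["i-spare1", "i-spare2"])
promoted = pool.replace_failed("i-dead")
print(promoted)  # a standby is promoted without any API call
```

The cost is paying for idle capacity; the benefit is that your recovery path doesn't depend on the very control plane that just failed.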
The importance of conservatism
So although it’s not an exciting and ground-breaking way to use the cloud, conservatism has its virtues, at least in this case.
- Provision more capacity than you need
- Keep that capacity up and running and actively serving traffic at all times
- Spread that capacity across multiple data centers (AZs for Amazon customers)
- If you can replicate across regions, and it makes sense cost-wise, then do it
- Don’t assume that you can scale up in minutes, let alone seconds
- Don’t assume that you can replace instances in minutes, or even hours
- If your disaster recovery plan involves another cloud, expect it to be slow when you need it most
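The "provision more than you need" advice above can be made concrete with simple N+1 arithmetic: run enough zones that peak traffic is still served after one of them goes dark. A sketch, with illustrative numbers:

```python
import math

def zones_needed(peak_load, per_zone_capacity, zone_failures_tolerated=1):
    """Zones to keep running so that peak traffic is still served after
    `zone_failures_tolerated` zones fail. Capacity units are arbitrary
    (e.g. requests per second)."""
    surviving = math.ceil(peak_load / per_zone_capacity)  # zones to serve peak
    return surviving + zone_failures_tolerated            # plus hot spares

# Serve 10,000 rps with zones that each handle 4,000 rps,
# while tolerating the loss of any one zone:
n = zones_needed(10_000, 4_000)
print(n)  # 4 zones: 3 to carry peak, 1 spare
```

All of that capacity should be live and serving traffic, per the list above: a "spare" zone that has never taken production load is itself an untested failure mode.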
By the way, this isn’t just an AWS thing. Do you build your own management infrastructure? Do you depend on Puppet or Chef or Ansible or SaltStack or a single Git server, or even GitHub itself? What happens when that’s down? Maybe it’s time to think about that.
Image: Kirill Ulitin/The Noun Project