The Internet is abuzz with talk about cloud failures again after a brief AWS outage yesterday in us-east that took some high-profile sites offline for the better part of an hour. According to Amazon's status report, a partial network outage within one availability zone cascaded into problems with instance connectivity, load balancers, and EBS.
While there were some high-profile outages, people seem to be getting better and better at building their applications to handle the partial failures that are common when running at scale; there appeared to be fewer casualties than in past incidents. And these outages occur everywhere — they are just higher profile when they occur in us-east.
So what are some steps you can take to help your app stay up when problems occur?
- Ensure that all pieces of your web app run in multiple availability zones. Running in a single availability zone exposes you to every problem within that zone. To spread traffic across zones, put your web nodes behind an actual load balancer, or use a smart DNS service that health-checks each endpoint and only hands out addresses of nodes that are passing.
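To make the health-checking DNS idea concrete, here is a minimal sketch. The `WebNode` type and `healthy_endpoints` helper are invented for illustration, not any real DNS provider's API: the point is that a node failing its checks (say, because its zone is having problems) simply drops out of the answers handed to clients.

```python
from dataclasses import dataclass

@dataclass
class WebNode:
    address: str
    zone: str        # e.g. "us-east-1a"
    healthy: bool    # result of the most recent health check

def healthy_endpoints(nodes):
    """Return the addresses a health-checking DNS service would serve,
    plus the set of zones still represented in rotation."""
    live = [n for n in nodes if n.healthy]
    return [n.address for n in live], {n.zone for n in live}

nodes = [
    WebNode("10.0.1.5", "us-east-1a", True),
    WebNode("10.0.2.5", "us-east-1b", False),  # zone 1b is having problems
    WebNode("10.0.3.5", "us-east-1c", True),
]
addrs, zones = healthy_endpoints(nodes)
```

Because the surviving nodes sit in different zones, losing one zone still leaves the service answering from the others.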
- Backend services that don’t have a load balancer in front of them can use something like ZooKeeper to coordinate which of several nodes is alive. Using something like this also helps you naturally grow and shard the work across multiple smaller instances rather than having to go for one very large one.
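ZooKeeper itself is out of scope here; this sketch assumes the coordinator hands you the current list of live nodes (e.g. the children of a directory of ephemeral znodes) and shows one way — rendezvous hashing — to shard work across them. Its nice property is that when a node dies, only that node's items get reassigned; work owned by survivors stays put.

```python
import hashlib

def owner(work_item, live_nodes):
    """Deterministically assign a work item to one of the live nodes.

    live_nodes would come from a coordinator such as ZooKeeper; here it
    is a plain list. Rendezvous (highest-random-weight) hashing: each
    node gets a score per item, and the highest score wins.
    """
    if not live_nodes:
        raise RuntimeError("no live nodes to assign work to")
    return max(live_nodes,
               key=lambda n: hashlib.md5(f"{work_item}|{n}".encode()).hexdigest())
```

If `node-2` disappears from the live list, every item previously owned by `node-1` or `node-3` keeps the same owner — only `node-2`'s share is redistributed.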
- Ensure that you have the capacity to handle the loss of a zone. You don’t want to run everything at 100% all the time. As with traditional capacity planning, build so that you know you can survive the loss of some of your capacity without overwhelming your remaining servers. I personally like to think in threes and keep multiples of three of things: rather than one server having to absorb the entire load when its partner fails (doubling from its normal 50%), two servers each rise to 50% from their normal 33%.
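The arithmetic behind thinking in threes generalizes. This small helper (illustrative, not from any library) computes the highest steady-state utilization each node can run at while still letting the survivors absorb the loss of some number of nodes.

```python
def max_safe_utilization(total_nodes, nodes_lost, ceiling=1.0):
    """Highest steady-state per-node utilization such that, after
    `nodes_lost` nodes fail, each survivor stays at or below `ceiling`.

    Total load is total_nodes * u; survivors each carry
    total_nodes * u / (total_nodes - nodes_lost), which must be <= ceiling.
    """
    survivors = total_nodes - nodes_lost
    if survivors <= 0:
        raise ValueError("cannot lose every node")
    return ceiling * survivors / total_nodes

# Two nodes: each must sit at <= 50% so one survivor can take 100%.
# Three nodes: each can run as high as ~67% and still survive losing one.
```

Running threes at 33% (as above) leaves comfortable margin below that 67% ceiling, which is the point: headroom for the bad day, not just the average day.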
- If using RDS in a production setting, definitely use a Multi-AZ deployment. While it’s not perfect and failover takes minutes rather than seconds, that’s still better than waiting hours for your instance to come back.
- When using a NoSQL data store (such as Stackdriver’s personal favorite, Cassandra), be sure that you are set up to balance your load across AZs and spread your instances across them. As with web nodes or other backend services, you also want to make sure that you can handle the increased load that comes with the loss of some of your nodes.
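Cassandra's real placement is governed by its snitch and replication strategy; as a toy illustration of the idea (not Cassandra's actual algorithm), the hypothetical placer below round-robins replicas across zones so that copies land in as many distinct AZs as possible before doubling up in any one.

```python
def place_replicas(node_zones, replication_factor):
    """Pick replica nodes, spreading copies across availability zones.

    node_zones: dict mapping node name -> availability zone.
    Returns a list of node names, one zone at a time, so the first
    len(zones) replicas all land in distinct zones.
    """
    by_zone = {}
    for node, zone in sorted(node_zones.items()):  # sorted for determinism
        by_zone.setdefault(zone, []).append(node)
    zone_lists = [by_zone[z] for z in sorted(by_zone)]

    replicas = []
    depth = 0
    while len(replicas) < replication_factor:
        progressed = False
        for nodes in zone_lists:
            if depth < len(nodes) and len(replicas) < replication_factor:
                replicas.append(nodes[depth])
                progressed = True
        if not progressed:
            break  # fewer nodes available than replicas requested
        depth += 1
    return replicas
```

With nodes in three zones and a replication factor of three, each copy ends up in a different zone, so losing one zone still leaves two replicas serving.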
- Watch for impaired instances by looking at the results of the DescribeInstanceStatus API call. Amazon has been beefing up its ability to report problems this way, and the results can help a lot in finding trouble. We use this as one of the inputs to our instance health alerting within Stackdriver, but you can also monitor it yourself with Nagios or another tool.
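A sketch of what acting on that data looks like. This assumes the result shape that the EC2 SDKs surface for DescribeInstanceStatus (a dict with an `InstanceStatuses` list carrying `SystemStatus` and `InstanceStatus` checks); the sample response below is fabricated for illustration, and in production the dict would come from a real API call such as boto3's `ec2.describe_instance_status()`.

```python
def impaired_instances(response):
    """Return IDs of instances whose system or instance status check
    is reporting anything other than 'ok'."""
    bad = []
    for status in response.get("InstanceStatuses", []):
        system_ok = status["SystemStatus"]["Status"] == "ok"
        instance_ok = status["InstanceStatus"]["Status"] == "ok"
        if not (system_ok and instance_ok):
            bad.append(status["InstanceId"])
    return bad

# Fabricated sample response for illustration:
sample = {
    "InstanceStatuses": [
        {"InstanceId": "i-0001",
         "SystemStatus": {"Status": "ok"},
         "InstanceStatus": {"Status": "ok"}},
        {"InstanceId": "i-0002",
         "SystemStatus": {"Status": "impaired"},
         "InstanceStatus": {"Status": "ok"}},
    ]
}
```

Feed the returned IDs into whatever alerting you already run, and an underlying-hardware problem pages you before your customers do.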
- If you can avoid using or depending on EBS, do so. While EBS provides some nice persistence advantages, its dependence on the network gives it more ways to fail than an instance’s ephemeral storage. Unfortunately, this also rules out the (otherwise quite nice) new m3 series of instances.
- Think about and understand the failure patterns of your systems. And if you can swing it, go a step further and try controlled scenarios where you take out a lot of systems. The Netflix Chaos Monkey and Chaos Gorilla are great for this.
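As a toy illustration of the controlled-failure idea — not Netflix's actual tooling — the hypothetical helper below picks a seeded random subset of a fleet to terminate during a drill. Killing individual instances is the Chaos Monkey flavor; pointing it at everything in one zone would be the Chaos Gorilla version.

```python
import random

def pick_victims(instances, fraction, seed=None):
    """Choose a random subset of the fleet to terminate in a drill.

    A seed makes the drill reproducible, so you can rerun the same
    failure scenario after fixing whatever it exposed.
    """
    rng = random.Random(seed)
    count = max(1, int(len(instances) * fraction))
    return rng.sample(sorted(instances), count)
```

The actual termination call is deliberately left out; the interesting part of the exercise is watching whether everything in the sections above — multi-zone spread, spare capacity, failover — actually does its job.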
Remember: failure can happen to anyone, and it will happen to you at some point. So take the opportunity to sit down and think through how to make your own systems more reliable.