Amazon had another outage today that impacted users of EBS in a single availability zone of the us-east-1 region. On the plus side, the outage did not appear to cascade and cause availability problems for EBS users in other availability zones. But a number of other Amazon services failed within that availability zone, and looking at them gives us some interesting insight into the dependencies among the services AWS provides.
The first and most obvious casualties of EBS problems are the APIs used to interact with AWS, the AWS console, and Elastic Beanstalk. When problems strike, automated failover mechanisms, both those provided by Amazon and custom solutions built by AWS users, drive a drastic increase in API load. This means automated failover capabilities are actually less helpful than one might hope. To remain up during these events, you need the capacity to sustain the loss of an availability zone without provisioning additional instances or changing the configuration of AWS services (such as ELBs).
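As a rough sketch of what that capacity requirement implies, here is a minimal back-of-the-envelope calculation, assuming a simplified model where peak load is expressed as an instance count and spread evenly across zones (the function name and numbers are illustrative, not anything AWS provides):

```python
def instances_per_zone(peak_load_instances: int, num_zones: int) -> int:
    """Instances to run in each zone so the remaining zones can absorb
    the full peak load after any single zone is lost."""
    if num_zones < 2:
        raise ValueError("surviving a zone loss requires at least two zones")
    surviving_zones = num_zones - 1
    # Ceiling division: spread the full peak over the surviving zones.
    return -(-peak_load_instances // surviving_zones)

# With a peak load of 30 instances spread across 3 zones, each zone
# must run 15 so that the 2 surviving zones still provide 30 total.
print(instances_per_zone(30, 3))  # → 15
```

The point of the exercise: running in more zones lowers the per-zone overhead (3 zones cost you 50% headroom, 4 zones only about 33%), but that headroom has to be running before the outage, since you cannot count on launching instances through an overloaded API.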
Another big component that tends to fail in concert with EBS is RDS. This is not surprising, especially after the announcement of provisioned IOPS support for RDS, but it drives home the fact that RDS depends on EBS for its data storage. And while Multi-AZ support promises to handle failover in exactly these situations, the reality is that, for at least some users, it seems to get stuck without failing over.
Two of Amazon’s newer services also saw some downtime today: CloudSearch and ElastiCache. CloudSearch provides search and indexing functionality, so its disk usage is presumably similar to that of a database, and a dependency on EBS makes some sense. ElastiCache, on the other hand, provides a memcached-compatible server for your in-memory caching needs. Why does an in-memory cache need disk at all, let alone disk with the persistence guarantees of EBS rather than ephemeral disk? I suspect Amazon is using EBS there to provide a volume to snapshot, which would help scale the cache out to new instances.
Did you notice problems with other services? Interested in helping us build tools to monitor and deal with these problems? Join us!