Report on October 23 Stackdriver Outage

On Wednesday, October 23, around 9:45am ET, the Stackdriver Intelligent Monitoring application suffered an outage.  We sincerely apologize for this incident.  We know that you depend on Stackdriver to help keep your application running, and when Stackdriver is unavailable, your job becomes more difficult.  We take this occurrence very seriously and will improve the service to reduce the chance of future failures.

Impact

The outage impacted customers in several ways. The longest-lasting issue was a gap of several hours in the data displayed in the application's charts. Additionally, alerting was delayed for approximately one hour, and after the alerting functionality was restored to its normal state, some alerts were generated in error.

TL;DR

The Cassandra cluster hosting the time series data that is displayed in charts failed. Backpressure from the queue feeding Cassandra propagated to other queues in the same message broker. Because of our design for responding to congested queues, independent, healthy, functioning pipelines, most notably alerting, stopped receiving new messages, degrading service even further. We restored most services by allowing messages to be delivered to the other subsystems independently, despite the problem on the Cassandra queue. When attempts to restore the Cassandra cluster failed, we deployed a new Cassandra cluster and restored historical data to it from an archive of all data, fully restoring service.

Background

To understand the outage, consider the highly abstracted view of the Stackdriver architecture below. Data is posted to the gateway web service. The gateway has a set of queues (shown in red) that buffer the messages as they are routed to multiple independent consumers. The archiving consumer pushes data to the archival pipeline that processes the data and eventually stores it in AWS S3. The indexing consumer stores time series data in Cassandra. The alerting consumer routes messages through the alerting pipeline, which processes streams of data to detect conditions that produce alerts. All of the components in the blue rectangle are deployed as a unit, or cell. Additional cells, each with its own gateway, queues, and consumers, are deployed to support higher load. Downstream components of each pipeline are not shown, as indicated by the dotted arrows.

The highly abstracted view of the Stackdriver architecture illustrates the queues that feed the various pipelines.
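As a rough illustration of this fan-out, the sketch below models a single cell with plain in-process queues standing in for the broker's queues shown in red. All names are illustrative; in the real system each consumer runs as a separate service.

```python
# Minimal sketch of the fan-out inside one cell. Plain in-process queues
# stand in for the broker's queues; names and handlers are illustrative.
import queue
import threading

archive_q = queue.Queue()  # feeds the archival pipeline (eventually AWS S3)
index_q = queue.Queue()    # feeds the time series indexer backed by Cassandra
alert_q = queue.Queue()    # feeds the alerting pipeline

def gateway_publish(message):
    """The gateway routes each incoming message to every consumer's queue."""
    for q in (archive_q, index_q, alert_q):
        q.put(message)

def run_consumer(q, handle):
    """Each consumer drains its own queue independently of the others."""
    while True:
        msg = q.get()
        handle(msg)
        q.task_done()

for q, handler in ((archive_q, lambda m: None),   # archive the raw data
                   (index_q, lambda m: None),     # write time series rows
                   (alert_q, lambda m: None)):    # evaluate alert conditions
    threading.Thread(target=run_consumer, args=(q, handler), daemon=True).start()

gateway_publish({"metric": "cpu", "value": 0.42})
```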

Postmortem

The timeline below shows the key events of the outage and its resolution.

The timeline shows key events of the outage and its resolution.

The primary cause of the problems was the failure of the 36-node Cassandra cluster that hosts the time series data used in all of the graphs in the Stackdriver application. The cluster runs in three AZs in AWS’s US-East region.  It uses 3-way replication, a write consistency level of 1, and a read consistency level of 1.  Around 8:45 AM, we discovered that many of the nodes in the cluster had been leaking memory for several days (see below).  At 9:45 AM, nearly simultaneously, most nodes in the cluster exhausted their memory and crashed.

The Stackdriver Cassandra cluster leaked memory for several days preceding the outage.
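For concreteness, the sketch below shows what a topology like the one described above can look like when expressed with the DataStax Python driver: 3-way replication spread across the region's availability zones and consistency level ONE for both reads and writes. The keyspace, table, seed addresses, and snitch details are our assumptions, not Stackdriver's actual schema.

```python
# A sketch, not Stackdriver's actual schema, of the replication and
# consistency settings described above, using the DataStax Python driver.
from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

# Seed nodes, one per availability zone (addresses are placeholders).
cluster = Cluster(["10.0.0.10", "10.0.1.10", "10.0.2.10"])
session = cluster.connect()

# With the EC2 snitch, the data center is the region and racks map to AZs,
# so three replicas are spread across the three availability zones.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS timeseries
    WITH replication = {'class': 'NetworkTopologyStrategy', 'us-east': 3}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS timeseries.metrics (
        id text, ts timestamp, value double,
        PRIMARY KEY (id, ts))
""")

# Both reads and writes use consistency level ONE: a single replica is
# enough to acknowledge an operation.
write = SimpleStatement(
    "INSERT INTO timeseries.metrics (id, ts, value) VALUES (%s, %s, %s)",
    consistency_level=ConsistencyLevel.ONE)
read = SimpleStatement(
    "SELECT value FROM timeseries.metrics WHERE id = %s AND ts = %s",
    consistency_level=ConsistencyLevel.ONE)
```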

We record measurements for many of the metrics relevant to the issue, and the Stackdriver service, which we use to monitor our own infrastructure, identified the trends leading to the outage. Still, a failure like this caught us by surprise because our experience with Cassandra over the past year has given us confidence in the system.  Cassandra has proven to be operationally stable and resilient to network outages, node failures, and even AZ outages.  We accept the failures leading to this outage as our own.  We do not intend in any way to deflect any blame to the Cassandra community.  We provide these details only in the spirit of transparency.

As the cluster failed, the write latency to the cluster increased and writes eventually stopped. As designed, the queues buffering data to be written to Cassandra began to grow. When the queues reach a threshold, the system is designed to spill data to disk at the gateways instead of adding to the already congested queues. This has successfully protected the system against transient failures in the past. When spilling data to disk, however, messages are not published to any queues, even if some downstream consumers are keeping the backlog on their queues small. In this case, the Cassandra failure prevented messages from flowing to the archiving and alerting subsystems, which were still operational. Clearly, this is a flaw in the system, which we address further below.
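The sketch below is a hedged rendering of the behavior just described, not the actual gateway code: a single depth check gates publishing to every queue, so one congested queue (indexing, during the outage) causes all traffic to be spilled to disk. The broker interface and the threshold are assumptions.

```python
# Hedged sketch (not actual Stackdriver code) of the spill-to-disk behavior
# described above: one congested queue flips a gateway-wide switch, so no
# queue receives new messages, even queues whose consumers are healthy.
# `broker.queue_depth` and `broker.publish` are assumed interfaces.
import json

SPILL_THRESHOLD = 100_000  # assumed queue-depth threshold

def publish_or_spill(broker, message, spill_file):
    depths = {name: broker.queue_depth(name)
              for name in ("archive", "index", "alert")}
    if max(depths.values()) > SPILL_THRESHOLD:
        # The flaw: a single congested queue (index, during the outage)
        # causes *all* messages to be spilled, so the healthy archiving
        # and alerting consumers stop receiving data too.
        spill_file.write(json.dumps(message) + "\n")
        return
    for name in depths:
        broker.publish(name, message)
```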

Due to the widespread failure in the Cassandra ring, we did not attempt to restore the cluster incrementally. We instead attempted a complete restart of the cluster, hoping that all the nodes would bootstrap and agree on the configuration of the cluster. We have used this approach successfully in the past. This time, the cluster did not converge to a single view of the system. We repeated the procedure with similar results. We spent approximately 90 minutes unsuccessfully trying to restore the existing cluster. We left the cluster running, even with its inconsistent view of state, to serve reads. Because we only append data in Cassandra (i.e., there are no updates), data returned from the cluster was valid, if incomplete.

When we saw that the quick restart was not going to be sufficient to return us to a normal operating mode, we broke the dependency between the queue feeding Cassandra and other subsystems. At approximately 11:00 AM, traffic began flowing through all other subsystems. This restored our ability to generate alerting notifications for new measurements. We also began to flush the older data that was buffered at the gateways through all subsystems except to Cassandra. As this buffered data–some of which was more than an hour old–was pushed through the alerting pipeline, we generated some alerts in error. The alerting pipeline is designed to handle data that is out of order on small time scales. Those mechanisms failed when presented with data that was an hour or more old.
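For illustration only, the sketch below shows the kind of small out-of-order tolerance window the alerting pipeline is described as being designed around; the window size and names are assumptions, not the actual alerting code. Measurements arriving an hour or more late fall far outside any such window, which is why some alerts fired in error.

```python
# Illustration only: a small tolerance window for late or out-of-order data.
# The window size and names are assumptions, not the actual alerting code.
import time

REORDER_WINDOW_SECONDS = 300  # assumed tolerance for late data

def is_within_tolerance(measurement_ts, now=None):
    """True if a measurement is recent enough to evaluate normally."""
    now = time.time() if now is None else now
    return (now - measurement_ts) <= REORDER_WINDOW_SECONDS

# The hour-old data flushed from the gateways fails this check by a wide
# margin, well outside what the reordering logic was built to handle.
print(is_within_tolerance(time.time() - 3600))  # -> False
```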

At about 11:30 AM, we abandoned our attempts to restore the existing Cassandra cluster and began our disaster recovery strategy of building a new one. By 1:30 PM, a new 36-node Cassandra cluster was deployed and bootstrapped. Deploying a new Cassandra cluster is not unprecedented in our system: we have tooling that automates the provisioning and deployment of a cluster, and we have previously used it to build new clusters in order to upgrade the version of Cassandra or change the schema we use in the database. Of course, in those cases we maintained both clusters and switched traffic only once the new cluster was fully populated.

When the new cluster was ready, we began to write new time series data to it. After observing the expected response of the cluster to writes of current data, we began restoring historical data from an archive that we maintain of all data we receive. Data restoration was completed for all users by 9 AM on October 26. No data was lost during this incident, and during the restore process we were able to show both data from before the outage window and the ongoing, current data from after we brought the new cluster online. The only data that was consistently unavailable was for the period between 9:45 AM and 1:30 PM on October 23, though queries that spanned both sides of that window displayed incomplete data in the charts.
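As a hedged sketch of what such a restore involves, the snippet below replays archived measurements from S3 into the new cluster. The bucket name, key layout, record format, and table schema are all assumptions for illustration; the actual restore tooling is not shown in this post.

```python
# Hedged sketch of replaying archived measurements from S3 into the new
# Cassandra cluster. Bucket, key layout, record format, and schema are
# assumptions for illustration, not the actual restore tooling.
import json
from datetime import datetime, timezone

import boto3
from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel

s3 = boto3.client("s3")
session = Cluster(["10.0.0.10"]).connect("timeseries")
insert = session.prepare("INSERT INTO metrics (id, ts, value) VALUES (?, ?, ?)")
insert.consistency_level = ConsistencyLevel.ONE

def restore_prefix(bucket, prefix):
    """Replay every archived record under an S3 prefix into Cassandra."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"]
            for line in body.iter_lines():
                rec = json.loads(line)  # assumed: one JSON record per line
                ts = datetime.fromtimestamp(rec["ts"], tz=timezone.utc)
                session.execute(insert, (rec["id"], ts, rec["value"]))

restore_prefix("example-archive-bucket", "metrics/")
```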

We do not yet fully understand why the Cassandra cluster failed, but we do know how to isolate and mitigate similar problems in the future. We are working with the Cassandra developer community, as we have done in the past, to identify the problem and work toward a fix. We remain confident that Cassandra is the right tool for this part of our system. The community has been very responsive to questions and issues with the software, and we have observed the stability of the system increase with each release. As part of deploying the new cluster, we made several important changes: we upgraded Cassandra from 1.2.5 to 1.2.10, upgraded Java from 1.6u45 to 1.7u40, and changed the RPC_SERVER_TYPE from 'sync' to 'hsha'. We had previously tested the configuration changes in production, but we had not changed read traffic to point to the new cluster.

Throughout the incident, we attempted to communicate regularly with our customers. The first update to our status page occurred at 9:45 AM, just as the Cassandra cluster was failing. The first email to customers identifying the incident as a major event was sent at 11 AM. The status page was updated regularly to communicate progress toward resolution.

Analysis

Loose coupling is an important architectural property for building distributed systems that scale. It is often achieved by inserting a message queue between producers and consumers. The queue provides buffering to handle transient failures. It provides a convenient point to measure the throughput and latency of the system, which helps gauge when capacity should be added to handle the request load. It also supports dynamic network topologies.

When multiple consumers require access to the same message, the publish-subscribe design pattern is often used. Publishers, or producers, send messages to a message broker that filters and routes them to queues or topics, which make the messages available to subscribers, or consumers, that have expressed an interest in messages of a certain type.
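The pattern is easy to see in code. The sketch below uses pika against a RabbitMQ-style broker purely for illustration; this post does not name the broker Stackdriver uses, so the broker choice, exchange, and queue names are assumptions.

```python
# Sketch of the publish-subscribe pattern with a fanout exchange, using
# pika. The broker choice (RabbitMQ) and all names here are assumptions
# made purely to illustrate the pattern.
import pika

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = conn.channel()

# The producer publishes once to an exchange...
channel.exchange_declare(exchange="measurements", exchange_type="fanout")

# ...and the broker copies each message to every bound queue, one per
# interested subscriber (archiving, indexing, alerting).
for name in ("archive", "index", "alert"):
    channel.queue_declare(queue=name, durable=True)
    channel.queue_bind(exchange="measurements", queue=name)

channel.basic_publish(exchange="measurements", routing_key="",
                      body=b'{"metric": "cpu", "value": 0.42}')
conn.close()
```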

What is easy to overlook in the decision to use these time-tested distributed systems design patterns to reduce coupling among independent components is that the message broker itself may, in fact, introduce a dependency among otherwise unrelated components. This is the key to the problems that caused the Stackdriver outage.

A failure of the indexing path led to an increase in the size of one queue. The growth of that queue was not properly isolated, and memory use in the broker began to grow. Our design for limiting the impact of one slow queue in the message broker spilled all data to disk and stopped publishing any messages to the broker. This design choice was made to protect the message broker during heavy load. This approach, however, kept messages from being processed by subsystems that were still healthy and functioning.

The lesson we take away is that use of a topic-based publish-subscribe design pattern does not, in itself, ensure independence of unrelated consumers. True independence can only be achieved with careful consideration and design of the broker itself as well.

Mitigation

We are planning changes to the Stackdriver infrastructure to mitigate similar problems in the future. Most critically, we are planning to add a greater degree of separation between the pipelines used for real-time data display and alerting. To do this, we will break the dependencies in the queueing service that the gateways use to route input messages to the multiple subsystems that consume each message type. This will ensure that, when one subsystem is experiencing problems, the other subsystems continue unaffected. More pointedly, the real-time alerting system, the subsystem most critical to our customers' success, will continue functioning despite other issues elsewhere in the system.
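A hedged sketch of that isolation, under the same assumed broker interface and threshold as the earlier spill example: each subsystem's queue gets its own backpressure decision, so congestion on the indexing queue no longer stops delivery to archiving or alerting.

```python
# Hedged sketch of the planned isolation: each subsystem's queue gets its
# own backpressure decision, so congestion on the indexing (Cassandra)
# queue no longer stops delivery to archiving or alerting.
# `broker.queue_depth` and `broker.publish` are assumed interfaces.
import json

SPILL_THRESHOLD = 100_000  # assumed per-queue depth threshold

def publish_with_isolation(broker, message, spill_files):
    for name in ("archive", "index", "alert"):
        if broker.queue_depth(name) > SPILL_THRESHOLD:
            # Only the congested subsystem's traffic is spilled to disk
            # for later replay; healthy subsystems keep receiving data.
            spill_files[name].write(json.dumps(message) + "\n")
        else:
            broker.publish(name, message)
```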
