Stackdriver is a new SaaS company. Like 80% of all new commercial software, our products are built on the public cloud. We chose the public cloud rather than dedicated datacenters because, like many start-ups, we wanted to spend our time on our product rather than our infrastructure, and we valued the ability to start small and scale up and down quickly as needed.
Amazon and Rackspace have been great for us so far. No outages. No surprises. No performance problems. Our engineers spin instances up and down all day, we all have our own admin accounts on AWS, and it’s cheap. The cloud has played a key role in helping us get off the ground quickly and efficiently. What could go wrong?
Fortunately for us, through dozens of interviews with major cloud users, surveys of hundreds, and years of our own experience as AWS and Rackspace customers, we have learned that a lot can (and does) go wrong as companies scale their cloud operations. Our key takeaways:
1. Costs grow out of control
The days of small Amazon bills will not last long. As the team and business grow, infrastructure requirements increase exponentially. At the same time, waste starts to add up (dormant/underutilized instances, unused EBS volumes and public IP addresses, etc.). One AWS customer, for example, told us that a new engineer on his team accidentally left a test cluster running for an entire month, costing the company over $30,000 before it was discovered. Without visibility into these trends, management and investors become concerned when monthly IaaS bills increase by an order of magnitude. The engineering team responds by identifying waste, performing “cleanup,” and implementing new reporting processes. Although the team invests countless hours in the effort, the core issues are never really addressed, and the company either accepts higher IaaS costs as a new reality or dedicates engineering time on an ongoing basis to infrastructure right-sizing and cleanup.
2. Performance issues impact customer satisfaction
Customer complaints about poor application performance begin to trickle in. The team investigates, but is overwhelmed by data from disparate sources (IaaS console, open source tools, the cloud vendor’s status page, and even Twitter) and struggles to identify the root cause of the issue. Frequently, the issues subside, only to arise again at a later date (which usually points to underlying infrastructure issues). In other cases, the team ultimately identifies the “needle in the haystack” (bug, configuration issue, etc.) after a drawn-out investigation. Either way, customers suffer and the team goes looking for better monitoring tools, with little success.
3. Outages put the company at risk
The cloud vendor suffers a major outage, and the application is down for a significant period of time. Existing customers are furious. Management demands a post-mortem, which highlights the fact that the team could have avoided significant downtime by using a different infrastructure configuration. The team changes the configuration to survive that particular type of outage (such as a loss of connectivity to Elastic Block Store volumes in one zone). A few months later, another infrastructure outage occurs, but this outage affects a different IaaS product, and the application goes down again. The team identifies another opportunity to improve the resiliency of the application. Meanwhile, customers complain and cancel. New prospects are hesitant to sign up due to concerns about the widely publicized outages. At this point, many companies begin to evaluate other infrastructure options.
4. Security and compliance concerns are a constant drag on the business
The team begins to gain traction with larger customers, a group with a great deal of sensitivity to security and compliance. These customers ask the team to demonstrate that its security model satisfies their internal standards. The team discovers that it does not, and this is obvious to the customer. The team begins the painstaking process of adapting controls designed for the physical infrastructure world to its cloud environment. The team defines processes, writes documentation, and even creates internal tools to help manage security, all at a significant cost in terms of engineering time.
There has to be a better way
Our research suggests that these issues are among the most important considerations for IaaS customers.
At Stackdriver, we are fortunate to have some advance warning, so we could avoid many of these pitfalls. Knowing that costs will escalate, we could set aside time and resources from the beginning to measure and manage the efficiency of our environment. Knowing that performance will be a major issue, we could implement a suite of monitoring tools and establish a world-class 24×7 NOC to stay on top of performance issues.
Knowing that there will be more infrastructure outages, we could also compile learnings from all of the public post-mortems on AWS outages and implement regular audits to ensure that we are following the findings of our peers. Finally, knowing that customers are going to be concerned about cloud security, we could create enterprise-grade security processes, controls, and tools.
But isn’t this exactly the opposite of what we intended to focus on when we decided to use the cloud in the first place?
Aren’t we supposed to be reducing operational overhead?
There has to be a better way.
There will be. Soon.