Announcing Policy-Driven Automation 

Available to All Stackdriver Pro Customers who Sign up by December 31st, 2013

At Stackdriver, we focus on helping DevOps identify, diagnose, and address performance issues as quickly as possible. Alerting policies play an important role in this process. Stackdriver customers rely on our system to notify them when issues occur in their environments. Typically, […]

Remember to build for failures 

Always be prepared for downtime in AWS or any other cloud host

The Internet is again abuzz with talk about cloud failures after a brief AWS outage yesterday in the us-east region that took some high-profile sites offline for most of an hour. According to Amazon's status report, there was a partial network outage within an availability zone which cascaded to […]

What are the most popular AWS services? 

Our sample set is still small (~30 AWS customers, about 40,000 resources under management), but we’re getting to the point where we have some interesting aggregate data on how people are using the cloud. One question that we asked ourselves early on was, “Which AWS services are used the most?” In the spirit of helping others who are asking themselves the same question, we provide a breakdown of the resources that we manage here.

We Can Do Better than Trial-and-Error 

Over the past three years, I have been responsible for managing a diverse set of applications running on AWS. The architecture of these applications has varied dramatically, and so has the experience of deploying and managing them. This makes it really important to have a high level of visibility into the applications and to follow best practices in managing them, both of which are difficult in the cloud. I joined Stackdriver because I’m passionate about tackling this problem head-on.

The 5 Most Important Lessons I’ve Learned Running on AWS 

At a recent AWS user group meeting, I presented the most important lessons that I have learned running applications in the cloud for the past six or so years. The key takeaways are:
1. Build for the cloud
2. State is a bug
3. Everything fails
4. It’s a new world
5. Just Say No to EBS

What Other Services Fail with EBS? 

Amazon had another outage today which impacted users of EBS in a single availability zone of the us-east-1 region. A number of other services, including RDS, the AWS Console, Elastic Beanstalk, CloudSearch, and ElastiCache, were affected. These simultaneous failures provide interesting insights into dependencies within the services provided by AWS.

We need better management for the cloud 

Companies that build applications using infrastructure-as-a-service quickly discover that cost, performance, availability, and security are difficult to manage using native tools from the cloud providers. Today, with no viable alternatives, they implement open-source tools, build custom software, and dedicate headcount to managing their cloud operations. These all consume resources that would be better directed to serving the company’s mission. There has to be a better way.