Over the past three years, I have been responsible for managing a diverse set of applications running on AWS. The architecture of these applications has varied dramatically. For example, one application was fairly straightforward (based on the traditional LAMP stack) and relied heavily on both the core compute and storage services (EC2, EBS, S3) and AWS platform services (such as ELB). Another application was a complex system that used EC2 as a primary building block to implement sophisticated services based on Node.js, MongoDB, Redis, and Elasticsearch. As you can imagine, the experience of deploying and managing these applications is completely different. This makes it really important to have a high level of visibility into the applications and to follow best practices in managing them, both of which are difficult in the cloud.
Gaining the Right Visibility
As an operations guy working in a cloud environment, I have been unable to find a single toolset that is flexible enough to monitor and instrument all of the applications I have been responsible for. Tools like CollectD and Nagios are great for system metrics but are ignorant of the cloud. They don’t, for example, take into account that I can have EBS volumes detaching from one server and attaching to another, or elastic load balancers terminating and re-launching unhealthy servers. In fact, they don’t even know that ELB exists. Likewise, their out-of-the-box health checks and documentation seem to be designed for traditional applications in dedicated data centers.
Tools like CloudWatch are good for cloud services (such as ELBs) but are difficult to use for deep system or application level visibility. They don’t, for example, have any insight into file systems. Likewise, they don’t provide any insight into the performance of services running within instances, such as MySQL or Apache.
So I typically do what any reasonable ops person would do: I roll my own scripts. They get the job done, at least initially, but they become more and more of a time investment as I try to resolve all of the gotchas and edge cases that the cloud introduces.
Applying “Best Practices”
If you have any experience running in a public cloud such as AWS or Rackspace, then you understand that operating in the cloud is very different from doing so in a traditional datacenter. Hopefully, you learned this the easy way, by phoning a friend, through blogs, documentation, presentations, and other publicly-available information. Unfortunately, most of us learned the hard way, through downtime, performance issues, or worse.
In any event, we have been struggling for years to determine the “right” way to run our applications in the cloud, and the best practices promoted by leading vendors are easier said than implemented. Here is one such best practice from AWS: “To reduce the chance of total data loss on an EBS volume, AWS recommends snapshotting the volume frequently.” (http://aws.amazon.com/ebs/)
This problem seems easy to solve on the surface. Create a script which queries all of your AWS volumes, loops through each volume, and takes a snapshot. Install this in a cron job and you are “done”.
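The naive cron script described above can be sketched as follows. This is a minimal illustration, assuming the boto3 AWS SDK for Python; the EC2 client is passed in as a parameter so the logic can be exercised without live credentials, and the snapshot description is arbitrary.

```python
# Sketch of the naive "snapshot everything" cron script, assuming boto3.
# In a real cron job you would call:
#     snapshot_all_volumes(boto3.client("ec2"))

def snapshot_all_volumes(ec2):
    """Snapshot every in-use EBS volume and return the new snapshot IDs."""
    snapshot_ids = []
    # Query all volumes currently attached to an instance.
    volumes = ec2.describe_volumes(
        Filters=[{"Name": "status", "Values": ["in-use"]}]
    )["Volumes"]
    for vol in volumes:
        # Kick off a snapshot of each volume. Note: this does NOT wait
        # for the snapshot to complete or verify that it succeeded.
        snap = ec2.create_snapshot(
            VolumeId=vol["VolumeId"],
            Description="nightly backup of %s" % vol["VolumeId"],
        )
        snapshot_ids.append(snap["SnapshotId"])
    return snapshot_ids
```

Note that the script fires off `create_snapshot` calls and exits; it never checks whether any snapshot actually finished, which is exactly where the trouble starts.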
Here are a few reasons that you will revisit this script down the road:
1. There is no verification that the snapshot completed successfully.
- I have seen many cases where a snapshot stalls before it reaches 100%. A stuck snapshot will block future snapshots of that volume from completing successfully. You will need to write a separate check for stuck snapshots, install it in a separate cron job, and test that it works.
2. There is no verification that you actually took snapshots of all active volumes.
- There may be a bug in your code that surfaces down the road due to a change in the AWS API, a change in your infrastructure, or a change made to the script. You will need to write a separate check to verify that all active volumes have recent snapshots, again another cron job, and another test.
3. There may be volumes that you do not want to snapshot.
- Snapshots aren’t free. You may skip snapshotting a volume because it isn’t important. You now have to bake this logic into your snapshot script and each of your monitoring scripts that support it.
4. You may be leaking snapshots.
- If you have a bug in your code that takes too many snapshots, or if you don’t go back and delete old snapshots, then you could hit your AWS limit for number of snapshots. You will then be unable to take additional snapshots, possibly affecting other parts of your system.
5. Snapshotting a volume degrades the performance of the EBS volume.
- While a snapshot is in progress, the EBS volume suffers a performance penalty, so you need to determine a good time to take each snapshot. That requires additional instrumentation to record and visualize this data, and you now have to look at each of your volumes to determine the ‘best possible time to snapshot’. How does this scale? How do you automate this for hundreds of volumes?
6. And so on…
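The verification checks from items 1 and 2 above can be sketched as pure functions over snapshot metadata, so each can run in its own cron job and be tested offline. The field names follow the shape of boto3's `describe_snapshots` response; the age thresholds are arbitrary assumptions, not AWS recommendations.

```python
from datetime import datetime, timedelta, timezone


def stuck_snapshots(snapshots, max_age=timedelta(hours=6), now=None):
    """Return IDs of snapshots still 'pending' after max_age.

    These are likely stuck and will block future snapshots of the
    same volume from completing.
    """
    now = now or datetime.now(timezone.utc)
    return [
        s["SnapshotId"]
        for s in snapshots
        if s["State"] == "pending" and now - s["StartTime"] > max_age
    ]


def volumes_missing_snapshots(volume_ids, snapshots,
                              max_age=timedelta(days=1), now=None):
    """Return active volume IDs with no completed snapshot within max_age."""
    now = now or datetime.now(timezone.utc)
    fresh = {
        s["VolumeId"]
        for s in snapshots
        if s["State"] == "completed" and now - s["StartTime"] <= max_age
    }
    return [v for v in volume_ids if v not in fresh]
```

Each of these is another script to schedule, monitor, and keep in sync with the snapshot script itself, which is the point: the "simple" cron job quietly grows into a small monitoring system.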
We Can Do Better
I joined Stackdriver because I’m passionate about making operations smarter in the cloud. At Acquia I designed and deployed a monitoring/instrumentation solution that scaled horizontally as capacity was added to the PaaS offering. I was also part of an operations team which went from managing 300 to 3000 servers in just over a year.
Taking a step back from the action, I realized that the AWS API exposes so much data and automation, but there is no way to compile it, visualize it, and make it actionable. There are so many best practices to follow, but no tool that can just TELL you what you are doing wrong. The same holds true with auditing. When do you find out you’re leaking snapshots? When your stack trace contains the words “snapshot limit exceeded”. (“!@#$ – There’s a snapshot limit?!”)
I’m excited to work with an awesome team on a product that I’ve really wanted since I started working in the cloud. Between the team and the customers, I know that my cloud experience will be taken to the next level. I’m looking forward to making a difference in how we build and operate in the cloud.