Using Lambda for a Server-less Stackdriver Alerting Webhook 

Stackdriver now supports creating dashboards to understand how your AWS Lambda functions are being used. These dashboards let you see how many times a given function is called, how often it fails, whether it is being throttled, and how long the calls take.
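The numbers behind such a dashboard correspond to CloudWatch's standard AWS/Lambda metrics (Invocations, Errors, Throttles, Duration). As a rough illustration, the boto3 sketch below pulls those four figures for a single function; the function name "my-function" and the one-hour window are placeholder assumptions, not anything from the post.

```python
# Sketch: pulling per-function Lambda numbers straight from CloudWatch with boto3.
# "my-function" and the 1-hour window are placeholders.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)

for metric, stat in [("Invocations", "Sum"), ("Errors", "Sum"),
                     ("Throttles", "Sum"), ("Duration", "Average")]:
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/Lambda",
        MetricName=metric,
        Dimensions=[{"Name": "FunctionName", "Value": "my-function"}],
        StartTime=start,
        EndTime=end,
        Period=3600,
        Statistics=[stat],
    )
    points = resp.get("Datapoints", [])
    value = points[0][stat] if points else None
    print(f"{metric} ({stat}) over the last hour: {value}")
```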

3 of the Most Important Agent-Driven Metrics to Monitor 

Amazon Web Services enables companies of all shapes, sizes, and industries to quickly deploy and scale web-based applications and sites. In order to succeed, visibility into your AWS infrastructure is key: you need to know what to monitor in order to stay on top of performance issues that may arise in your cloud stack. […]

Tracking Business Metrics (or Fun with Custom Metrics) 

Most Stackdriver users work in technical jobs. Those of us in such positions tend to think about problems from the point of view of technology. While Stackdriver’s support for custom metrics is a great way to supplement application monitoring with additional technical measurements (like the number of alerts sent to users or the failure […]
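For a concrete flavor of what a business-facing custom metric might look like, here is a rough Python sketch that reports an hourly signup count. The gateway URL, API-key header, metric name, and payload layout shown are illustrative assumptions rather than a definitive reference, so check the custom metrics documentation for the exact format.

```python
# Rough sketch of reporting a business metric ("signups_per_hour") to a custom
# metrics gateway. The URL, header name, and payload fields below are
# illustrative assumptions -- consult the custom metrics docs for the real format.
import time

import requests

API_KEY = "YOUR_STACKDRIVER_API_KEY"  # placeholder
GATEWAY_URL = "https://custom-gateway.stackdriver.com/v1/custom"  # assumed endpoint

now = int(time.time())
payload = {
    "timestamp": now,
    "proto_version": 1,
    "data": [
        {"name": "signups_per_hour", "value": 42, "collected_at": now},
    ],
}

resp = requests.post(
    GATEWAY_URL,
    json=payload,
    headers={"x-stackdriver-apikey": API_KEY},
)
resp.raise_for_status()
```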

The Value of No Data 

Never trust a quiet instance or server

In a modern SaaS organization, data defines every element of the business: how many messages are we sending, how long do they take, how much storage are we using, how many instances are we running? But sometimes it’s a lack of data that should draw our attention: cloud instances, like cats and children, are most dangerous when they’re quiet. This […]
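As a toy illustration of that idea, the snippet below treats prolonged silence from a metric series as an alert condition in its own right; the ten-minute window and the sample data are made-up assumptions.

```python
# Toy check for the "no data" case: if a series has reported nothing recently,
# that silence itself is treated as an alert. Timestamps are Unix seconds;
# the 10-minute window is an arbitrary example.
import time

def is_suspiciously_quiet(samples, max_silence_seconds=600, now=None):
    """samples: list of (unix_timestamp, value) points for one instance."""
    if now is None:
        now = time.time()
    if not samples:
        return True  # never reported at all
    latest = max(ts for ts, _ in samples)
    return (now - latest) > max_silence_seconds

# Example: an instance whose last datapoint is 30 minutes old should page someone.
stale = [(time.time() - 1800, 0.42)]
print(is_suspiciously_quiet(stale))  # True
```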

Intelligent Cloud Monitoring – Anomaly Detection 

At Stackdriver, we go beyond passive monitoring. Our goal is to proactively provide DevOps professionals like you with insights about your cloud deployments. By highlighting changes in your systems, we can guide you to focus on real problems. Using an analysis technique called “pattern detection,” we scan key metrics of your infrastructure for sharp changes in behavior. […]
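As a simplified illustration of the general idea (not Stackdriver's actual algorithm), the sketch below flags points that sit several standard deviations away from the mean of a trailing window; the window size and threshold are arbitrary assumptions.

```python
# Illustrative sketch only: flag points that deviate sharply from the mean of a
# trailing window of recent samples.
from statistics import mean, pstdev

def sharp_changes(series, window=20, threshold=3.0):
    """Return indexes where a point sits more than `threshold` standard
    deviations away from the mean of the preceding `window` points."""
    flagged = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = mean(history), pstdev(history)
        if sigma == 0:
            # Perfectly flat history: any departure at all counts as a sharp change.
            if series[i] != mu:
                flagged.append(i)
            continue
        if abs(series[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged

# Example: a steady CPU series with one sudden jump.
cpu = [20.0] * 30 + [85.0] + [20.0] * 10
print(sharp_changes(cpu))  # [30]
```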

The Broken State of Alerting for Cloud Apps 

Those of us who manage cloud-powered apps are frustrated with the current state of alerting. The task of notifying the right person at the right time when there is an important issue that needs attention (and not doing the inverse) is as easy to understand as it is difficult to implement. Tools that focus on threshold-based alerting are helpful, but they fall far short in supporting common patterns in the performance of our systems (random spikes, trending, seasonality, etc.). At Stackdriver, we believe that a new level of intelligence for alerting is within reach, incorporating time-tested statistical techniques and modern cloud and “big data” technologies. We still have a lot of work to do, but we are already breaking new ground in delivering more useful alerting capabilities to our customers.
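One hedged sketch of what going beyond static thresholds can mean in practice: compare each observation against a seasonal baseline built from the same hour in prior weeks, so routine daily and weekly cycles do not page anyone. The numbers and the 50% tolerance below are made-up examples, not a description of Stackdriver's implementation.

```python
# Sketch of a seasonal baseline as an alternative to a fixed threshold.
def seasonal_alert(current_value, same_hour_history, tolerance=0.5):
    """same_hour_history: values observed at this hour-of-week in past weeks."""
    if not same_hour_history:
        return False  # nothing to compare against yet
    baseline = sum(same_hour_history) / len(same_hour_history)
    return abs(current_value - baseline) > tolerance * baseline

# A Monday-9am request rate of 1200/min would trip a flat 1000/min threshold,
# but it is perfectly normal against Monday-9am history.
history = [1150, 1180, 1210, 1160]
print(seasonal_alert(1200, history))  # False: in line with the usual pattern
print(seasonal_alert(2400, history))  # True: genuinely anomalous
```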

Pingdom vs. New Relic 

Last week, Pingdom announced that they had added real user monitoring to their website monitoring platform, putting them in the sights of application monitoring giant New Relic. It’s easy to get caught up in the buzzwords used in press releases, so I’ve decided to draw up a quick comparison of the two vendors’ feature sets. The conclusion: Pingdom is more useful for websites, while New Relic retains its title as king of the APM market.

The Art of Integration 

Last month, we found that many teams were using as many as three monitoring packages for their public cloud stack. The issue with this setup (aside from those that come with each individual package) is that ops staff must be trained on all three or, alternatively, responsibility for each must be split among different staff. Stackdriver’s new system (currently in free beta) solves this problem, allowing you to monitor all layers of your stack.

The Cassandra Outage That Didn’t Happen 

Have you ever experienced that sweet satisfaction of seeing your system work just the way you hoped it would? We experienced exactly that feeling recently here at Stackdriver and thought it might be worthwhile to share a brief postmortem on a non-outage for our Cassandra cluster.

Why Monitoring Doesn’t Have to Suck 

I saw an article yesterday that used Nagios as an example to illustrate common monitoring woes, concluding that the solution “would need to be built from the ground up” rather than added on. Many DevOps teams now use a range of services, each with its own specialty, to assemble a solution that fits their needs, but as the article notes, a unified package would save them time and trouble. Despite its gloomy tone, the article was quite well received here at Stackdriver HQ, where we currently have a unified solution to these very problems in beta.