I am a Platform Engineer at Stackdriver, where our monitoring service processes over 100 million measurements per day on AWS. We are constantly evaluating options to improve the performance and availability of our service, so when Google invited us to take part in the GCE limited preview (which Google promises offers superior performance and consistency), I was excited to see how it stacks up against AWS. Based on my experience testing our Cassandra cluster on GCE over the past few weeks, Google seems to be living up to that promise: not only did GCE outperform equivalent AWS instance sizes, it also delivered consistent performance over a period of a few weeks, even across instances in different GCE zones.
The Stackdriver service currently runs on AWS and provides monitoring for both AWS and OpenStack services. Going forward, it is important to know what other public cloud offerings provide, both for our customers and for Stackdriver itself.
Our goal for the limited preview was to run part of Stackdriver’s data pipeline in GCE. This meant replicating our data pipeline from AWS to GCE and indexing the data into a Cassandra cluster running entirely in GCE. The test cluster consisted of:
- 1 x n1-standard-1d RabbitMQ Server (us-central1-a)
- 2 x n1-standard-1d Indexers (us-central1-a, us-central1-b)
- 3 x n1-standard-2d Cassandra Servers (us-central1-a, us-central1-b, us-central2-a)
  - Version 1.1.6
  - Replication Factor 2
The data pipeline works as follows:
1. RabbitMQ receives a portion of the data pipeline mirrored from Stackdriver’s production environment.
2. Indexers consume messages from RabbitMQ and write them into Cassandra with a consistency level of ONE. (~800 msg/s, just a small slice of the data we normally process)
3. Cassandra replicates the data between the zones, keeping two copies in total.
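The indexer step can be sketched roughly as follows. This is an illustrative sketch, not Stackdriver’s actual code: it assumes a pika consumer for RabbitMQ and the pycassa client for Cassandra (both common choices for this era of Cassandra), and the message format, keyspace, column family, queue, and host names are invented for the example.

```python
# Sketch of an indexer worker: consume measurements from RabbitMQ,
# write them into Cassandra at consistency level ONE.
import json


def measurement_to_row(message):
    """Turn one JSON measurement message into a (row_key, columns) pair.

    The row key groups points by instance and metric; the column name is
    the timestamp, so points sort chronologically within the row.
    (Schema is illustrative, not Stackdriver's actual layout.)
    """
    m = json.loads(message)
    row_key = "%s|%s" % (m["instance_id"], m["metric"])
    columns = {str(m["timestamp"]): str(m["value"])}
    return row_key, columns


def run_indexer():
    # Hypothetical wiring; host/queue/keyspace names are placeholders.
    import pika
    import pycassa

    pool = pycassa.ConnectionPool("metrics_ks", server_list=["cassandra-1:9160"])
    cf = pycassa.ColumnFamily(pool, "measurements")
    conn = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq-1"))
    channel = conn.channel()

    def on_message(ch, method, properties, body):
        key, cols = measurement_to_row(body)
        # ONE: the write is acknowledged once a single replica has it;
        # Cassandra copies it to the second replica asynchronously (RF=2).
        cf.insert(key, cols,
                  write_consistency_level=pycassa.ConsistencyLevel.ONE)
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(on_message, queue="measurements")
    channel.start_consuming()
```

Writing at ONE keeps indexer latency low; the trade-off is that a read immediately after a write may hit the replica that has not received the copy yet.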
After accepting my invite to GCE, I visited the management console to explore what the service had to offer. I was greeted by a beta-quality interface: required fields weren’t always marked, and it took me a few tries to launch an instance with the proper firewall rules. Contrast this with AWS’s new, simple, easy-to-use dashboard, where you can launch a server in just a few clicks.
Instance Sizes and Image Selection
Out of the box, GCE provides a very limited base image selection, but the essentials are there: CentOS 6, Ubuntu 10.04/12.04 LTS, and GCEL (Google Compute Engine Linux). While this pales in comparison to the breadth of the AWS marketplace, it offers enough of the basics to get started.
The instance sizes offered are similar to those available in AWS, missing just a couple of the smaller instance types, such as the t1.micro and m1.small.
Network Configuration and Firewalls
Firewalls are GCE’s equivalent of security groups in AWS, and this is an area where GCE really starts to shine.
A little background:
- Instances are launched into a Zone and Network.
- GCE Zones are conceptually similar to AWS regions.
- Networks can span multiple zones and provide instances with a private network to communicate over.
- Firewalls are specific to networks.
One significant advantage is that firewalls are not constrained to a single zone: you can configure a firewall once and have it available for instances launched in any GCE zone.
Another nice feature is that in addition to the normal address-based firewall rules (such as ALLOW from SOURCE 10.0.0.0/8), you can configure rules based on tags. This lets you tag instances with a role such as ‘cassandra’, and then write SOURCE or TARGET firewall rules against those tags. The IP or zone of an instance doesn’t matter, just that it carries the ‘cassandra’ tag.
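Conceptually, tag-based matching works like the sketch below: traffic is allowed if some rule’s source (a CIDR range or a source tag) matches the sender and its target tag matches the receiver. The rule and instance structures here are my own illustration of the idea, not GCE’s actual API.

```python
# Illustrative model of tag-based firewall matching. Rule/instance
# shapes and addresses are invented for this sketch.
import ipaddress


def allowed(rules, src_instance, dst_instance, src_ip):
    """Return True if any rule permits traffic from src to dst."""
    for rule in rules:
        # Target match: the rule applies to instances carrying the target
        # tag (or to all instances if no target tag is given).
        target_ok = (rule.get("target_tag") is None or
                     rule["target_tag"] in dst_instance["tags"])
        # Source match: either a CIDR range or a source tag.
        if "source_range" in rule:
            source_ok = (ipaddress.ip_address(src_ip) in
                         ipaddress.ip_network(rule["source_range"]))
        else:
            source_ok = rule["source_tag"] in src_instance["tags"]
        if target_ok and source_ok:
            return True
    return False


rules = [
    # Only instances tagged 'cassandra' may talk to other instances
    # tagged 'cassandra' -- regardless of zone or IP.
    {"source_tag": "cassandra", "target_tag": "cassandra"},
]
node_a = {"tags": ["cassandra"]}   # e.g. in us-central1-a
node_b = {"tags": ["cassandra"]}   # e.g. in us-central2-a
indexer = {"tags": ["indexer"]}

print(allowed(rules, node_a, node_b, "10.240.1.5"))   # True
print(allowed(rules, indexer, node_b, "10.240.2.9"))  # False
```

The practical upshot: adding a fourth Cassandra node in a new zone needs no firewall changes at all, just the ‘cassandra’ tag at launch.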
Quotas

Quotas work a little differently in GCE than in AWS, and there are a few more of them to keep an eye on. In addition to the usual suspects like number of instances, disk capacity, and IP addresses, there are some you may not be familiar with: number of cores, number of firewalls, and number of disks.
Having quotas on both cores and instances makes capacity planning a little more difficult. Not only do you have to estimate how many machines you will use, but also how large they will be. As you increase the size of your servers, remember to increase your core quota as well.
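The core-quota arithmetic is trivial but easy to forget when resizing: total cores is the sum over instance types of count times cores per instance. A quick sketch, using the six-instance test cluster above (the per-type core counts are the standard 1-core/2-core sizes these names imply):

```python
# Capacity-planning helper: cores consumed by a fleet of instances.
CORES_PER_TYPE = {"n1-standard-1d": 1, "n1-standard-2d": 2}


def cores_needed(fleet):
    """fleet: dict mapping instance type -> instance count."""
    return sum(CORES_PER_TYPE[t] * n for t, n in fleet.items())


# The test cluster: 1 RabbitMQ server + 2 indexers (1 core each),
# plus 3 Cassandra servers (2 cores each) = 9 cores.
fleet = {"n1-standard-1d": 3, "n1-standard-2d": 3}
print(cores_needed(fleet))  # 9
```

Note that doubling instance sizes doubles core usage even with the same instance count, so the core quota can run out well before the instance quota does.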
Performance

I don’t have hard numbers comparing AWS to GCE one-to-one, but after working on AWS for three years I can tell you how different GCE felt. The entire command-line experience felt faster: yum was fast, latency was low, and in general it felt like I was on ‘real’ hardware again. I used the same Puppet manifests to bootstrap Cassandra nodes into a cluster as I do on AWS. On AWS, a Cassandra node of equivalent size takes ~120 seconds to bootstrap into the cluster; on GCE, the same activity took ~40 seconds.
One of the promises GCE makes is that performance will be consistent from node to node. I ran the Cassandra cluster for a few weeks with a slice of our production load (~800 msg/s) against it. Examining the performance of the three nodes across zones, Google’s claim of consistent performance certainly appears to hold up. Pictured below is the load of the three nodes, which measured ~1.1 throughout the test.
Command Line Tool (gcutil)
I noticed a few nice touches while working with GCE for the first time. Setting up gcutil, the command-line tool for working with GCE, was extremely easy: download and run. gcutil prompted me for everything it needed, and was even nice enough to generate and upload an SSH key for me. Compared with getting started with the AWS command-line tools (or most other CLI tools), there were no environment variables to set, no additional software to download, and no paths to configure.
While GCE has some rough edges in the admin console UI, the service overall was a great experience. The selection of services and features is limited compared to AWS, but GCE delivers on its promise of consistent performance. The one concern I have going forward is whether GCE will continue to provide the same level of performance as it grows. Sure, it feels fast with my six instances in a limited preview. How will it feel when I am sharing with the rest of the world? And what has Google done to limit the host, network, and API contention that plagues large AWS customers?