CloudWatch says you’re at 90% CPU utilization, while your installed agent tells you you’re at 60%. Conventional wisdom says that the agent is right – after all, it’s installed at the system level, and they can’t both be right.
Although many people believe this, it’s not correct – despite the two significantly different numbers, both CloudWatch and your agent are right. The difference lies in what they are reporting, and that difference also defines “CPU Steal.”
When Amazon reports on your CPU usage, the reported number is a percentage of your assigned CPU. Since the physical hardware behind EC2 instances is shared among multiple customers, Amazon determines how much of the CPU “belongs” to you and reports on how much of that you’re using.
Agent-based reporting, on the other hand, is based on a core’s total CPU usage. It does not take into account any other tenants sharing the hardware or how much Amazon thinks should be available specifically for your app. It simply reports how many cycles your instance’s CPU is using at a given point in time.
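To make the two viewpoints concrete, here is a minimal sketch of one way to read the numbers from the opening example. It assumes, purely for illustration, that your instance has been allotted two thirds of a core; that figure and the function name `percent_of_allocation` are hypothetical and are not values that Amazon publishes or that either tool exposes.

```python
def percent_of_allocation(core_usage_pct, allocated_fraction):
    """Re-express usage of the whole core as a percentage of your allotted share.

    core_usage_pct: usage as a percentage of the full core (the agent's view).
    allocated_fraction: hypothetical share of the core assigned to you (0 < f <= 1).
    """
    if not 0 < allocated_fraction <= 1:
        raise ValueError("allocated_fraction must be in the range (0, 1]")
    return core_usage_pct / allocated_fraction


# Agent's view: 60% of the core. Assumed allotment: two thirds of the core.
cloudwatch_style_pct = percent_of_allocation(60.0, 2 / 3)
print(round(cloudwatch_style_pct, 1))
```

Under that assumption, 60% of the whole core works out to roughly 90% of your share, which is how both numbers can be right at once.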
The difference between these two metrics is what’s known as “CPU Steal.” Essentially, CPU steal is how much of your instance’s CPU is engaged by something other than your own virtual machine. This phenomenon is often chalked up to “noisy neighbors”: other tenants on the same hardware that are pulling significant usage.
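On a Linux guest, steal time can also be observed directly: the aggregate `cpu` line in `/proc/stat` carries a steal counter as its eighth value. The sketch below is not how any particular agent is implemented; it only shows the general technique of comparing two samples of that line. The function names are hypothetical.

```python
import time


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of tick counters."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
    return [int(value) for value in fields[1:]]


def steal_percent(before, after):
    """Return steal time as a percentage of all CPU ticks between two samples."""
    first = parse_cpu_line(before)
    second = parse_cpu_line(after)
    deltas = [b - a for a, b in zip(first, second)]
    # Counter order: user, nice, system, idle, iowait, irq, softirq, steal.
    # The guest counters that follow are already included in user and nice,
    # so only the first eight are summed.
    total = sum(deltas[:8])
    if total <= 0 or len(deltas) < 8:
        return 0.0
    return 100.0 * deltas[7] / total


def sample_steal(interval=1.0, path="/proc/stat"):
    """Take two samples `interval` seconds apart on a Linux host (assumption)."""
    with open(path) as stat_file:
        before = stat_file.readline()
    time.sleep(interval)
    with open(path) as stat_file:
        after = stat_file.readline()
    return steal_percent(before, after)
```

Calling `sample_steal()` on a Linux instance returns the share of the last second that the hypervisor spent running something other than your virtual machine.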
As you can see, although CPU utilization appears to be a single metric, the API-reported version and agent-reported version are very different, which is why both are valuable. It also means that, when generating policies, it’s generally better to use the CloudWatch-reported metric, as this is what Amazon uses as its basis for throttling your available CPU cycles.
That’s not to say agent-based CPU reporting isn’t important. Quite the opposite: understanding the system-level health of hardware you share with other AWS users can help you decide whether to spin up an extra box or to move entirely away from a host full of noisy neighbors. If you set it up correctly, you can even have a CPU steal policy to alert you when your instance’s CPU is being overloaded by someone other than you.
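As a rough illustration of what such a CPU steal policy might check, the sketch below fires only when steal stays above a threshold for several consecutive samples, so that a single spike does not page anyone. The 10% threshold, the three-sample window and the name `steal_alert` are illustrative assumptions, not recommended values or part of any product’s API.

```python
def steal_alert(samples, threshold_pct=10.0, sustained=3):
    """Return True when the most recent `sustained` samples all exceed the threshold.

    samples: steal percentages ordered oldest to newest.
    threshold_pct, sustained: illustrative defaults (assumptions), tune for your workload.
    """
    if sustained < 1:
        raise ValueError("sustained must be at least 1")
    if len(samples) < sustained:
        return False  # not enough history yet to judge
    return all(sample > threshold_pct for sample in samples[-sustained:])


recent = [2.0, 12.0, 15.0, 11.0]
if steal_alert(recent):
    print("CPU steal has stayed above the threshold; consider moving the instance")
```

Requiring a sustained breach is a design choice: steal is bursty by nature, and a short window of high steal is usually less interesting than a neighbor that stays noisy.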
This is why it’s so important to have both levels of CPU monitoring implemented in your AWS environment – no single number tells the entire story. Having unified system and infrastructure monitoring is necessary to fully understand what your environment is doing.
If you’re a Stackdriver customer, we highly recommend you install our custom agent onto your servers. If not, we invite you to check out Stackdriver’s intelligent monitoring – within two minutes of signing up, you’ll have the CloudWatch information, and a quick install of the agent will give you the rest.