Before reading this document further, please note that it, and any code snippets it lists, are distributed under the BSD license. If you disagree with any of the terms of the license, please close this page now.

This text provides some explanations and advice on how to set up custom CloudWatch metrics, using the example of reporting disk utilization.

Why CloudWatch

If you are on AWS, and you need to monitor more values than AWS monitoring provides, or use monitoring that AWS itself can not provide, CloudWatch may be a better approach than deploying your own monitoring stack (like Nagios, etc.). This is not an endorsement of one monitoring solution over another; I honestly don't have an opinion on which one is better. But if you are running a small operation, deploying a whole new monitoring framework takes up a significant amount of resources.

Note that this text references the AWS documentation on how to use custom metrics, and does describe functions that are documented on that page; however, I deemed the Perl scripts provided by AWS not as cool, and chose to roll my own.

Does it cost money?

The CloudWatch service is not free in all cases. When I was writing this, basic CloudWatch monitoring was free for any EC2 instance, but additional metrics were priced beyond a free tier. You need to understand the costs before applying this solution at scale.

How does it work

CloudWatch is exposed by AWS as a web service API, much like everything else. You can access the API using the command line tools provided by AWS, or from your own code, either by utilizing one of the AWS SDKs, or by making network calls into AWS web services directly.

As with virtually any other AWS web service, the calls must be authenticated, and authentication requires an Access Key and a Secret Key (there is a variety of names for these two tokens). The two keys act as a username and password, but can be disposable, or temporary. Once the call is authenticated, it must then be authorized. This means that whatever web service operation is attempted must be allowed for the entity that owns the corresponding Access Key. That owner can be an AWS user (account root), an IAM user, or an EC2 instance.

An AWS user is in essence the account root, so it has permission to do everything. The access enabled for IAM users is based on the policies attached to the user. Operations enabled for EC2 instances are based on the role the EC2 instance is associated with. The role can have a number of policies attached to it, which in the end define which operations are authorized.

To write CloudWatch metric data, the cloudwatch:PutMetricData permission must be enabled. Consult the CloudWatch operations list and the CloudWatch IAM permission map for more information on what else can be done with CloudWatch, and which permissions are required.
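For illustration, a minimal IAM policy document granting just that permission could look like the following (this is a sketch; CloudWatch metric permissions can not be scoped to a particular namespace through the Resource element, so the resource is a wildcard):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "cloudwatch:PutMetricData",
      "Resource": "*"
    }
  ]
}
```

Attach a policy like this to the IAM user, or to the role associated with the EC2 instance, that will be publishing the metrics.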

Measurement recording

CloudWatch, in essence, records measurements of values over time, and exposes API calls, a UI, and a set of default recordings as a service.

Metric is an important concept to understand. I would recommend first reading the CloudWatch concepts page, though I find it a bit misleading. A CloudWatch metric is the collection of all measurements of some specific value: for example, CPU utilization, or average water depth. A metric is not bound to a specific CPU, instance, or body of water. This means that a measurement of a metric means logically the same thing as any other measurement of the same metric, unless the metrics are from different namespaces, or the measurements have different dimension values. Different metrics are differentiated by their names.

Namespaces are containers for metrics. Metrics can not be compared across namespaces. AWS defines its own namespaces for its various services (for example, AWS/EC2 or AWS/RDS); all of those namespaces start with the prefix AWS/, and you can not post metrics to these namespaces, or create your own namespace with that prefix.

You should report metrics to the same namespace as much as possible, unless there is never a case where the values of these metrics should be compared. It doesn't make sense, for example, to compare disk utilization for RDS with disk utilization for EBS, so these would be reported to different namespaces. Within your own deployment, though, any disk utilization should probably be reported to the same namespace. So, sticking to the example of disk utilization, we should have a single metric, named, for example, DiskUtilization, and publish it to a single namespace. For the sake of certainty, let's call this namespace Deployment.

The way that metric data is attributed to a specific source is by having specific dimension values. In other words, two metric measurements are considered to be from the same source if, and only if, both measurements have the same dimension values. These values are like a source address that consists of potentially multiple elements, which makes it a vector, with coordinates in different dimensions. So, every time a metric value is submitted, the dimension values of that metric value must be specified. The CloudWatch UI is able to plot values of the same metric from multiple dimensions on the same chart. An example of a dimension can be an instance ID or, usable for disk utilization, the partition mount point, or a block device.
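As a sketch of how this looks on the wire, here is how two measurements of the hypothetical DiskUtilization metric in the Deployment namespace, differing only in the MountPath dimension, could be published. The instance ID and the values are made up, and the put-metric-data calls are printed rather than executed, so the shape of the call can be inspected:

```shell
# Made-up placeholder instance ID for illustration only.
INSTANCE_ID="i-0123456789abcdef0"

# Print (do not run) a put-metric-data call for one mount point.
# $1 = mount path (dimension value), $2 = utilization percentage.
build_put_cmd() {
    printf 'aws cloudwatch put-metric-data --namespace Deployment --metric-name DiskUtilization --unit Percent --value %s --dimensions InstanceId=%s,MountPath=%s\n' \
        "$2" "$INSTANCE_ID" "$1"
}

# Same metric, two different dimension values: two separate data streams.
build_put_cmd /     17.3
build_put_cmd /data 62.8
```

Removing the leading printf (i.e., actually invoking aws) turns this into real reporting, provided the caller is authorized as described above.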

Interestingly, some dimension names are considered "special", and are treated as such by the CloudWatch UI. For example, InstanceId is recognized as an AWS instance ID, and the UI will display the instance name along with it. I can't find any documentation that lists those special dimension names.

A word of caution about values and dimensions. Once a metric has values with a certain set of dimensions, it is no longer possible to report values with a dimension set that was not used before. The CloudWatch API will accept the value, but the value will be ignored, I guess because CloudWatch doesn't know how to place it. If you want to change how you specify value dimensions, you should use a new metric. It's impossible to delete existing metrics; however, if a metric receives no data for 2 weeks, it is automatically removed.

It is possible to specify a unit of measure along with metric values. It's unclear what happens if you mix multiple units of measure in the same metric; logically, this makes no sense. Units of measure are advisory only, but they help whoever reads the values to interpret them, and some parts of the CloudWatch UI use them to display the values in a more appropriate manner.

Reporting Disk Utilization

Let's say that we want to report disk utilization on an Amazon Linux system. Let's also say that we want to report it for all mounted partitions on the system. It's the logical volumes that matter, not the physical devices.

Of course, you can report all kinds of information on a logical volume: capacity, bytes used, bytes available, free inodes, etc. However, note that CloudWatch metrics are a priced resource, so you probably want to limit the number of metrics you create. What is usually important is the percentage of space left. If you expect a high number of small files to be created on a volume, then you will also need to monitor the percentage of consumed inodes.
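Both percentages can be read off df output. A minimal sketch, assuming GNU df on Linux and using the root file system as an example (the $2 > 0 guard avoids division by zero on file systems that don't report totals):

```shell
# Percentage of disk space used on the root file system.
# df -P gives POSIX-stable columns: total is $2, used is $3.
df -P / | awk 'NR > 1 && $2 > 0 { printf "space used: %.1f%%\n", $3 / $2 * 100 }'

# Percentage of inodes used on the root file system (df -i lists
# inode counts instead of block counts, same column layout).
df -P -i / | awk 'NR > 1 && $2 > 0 { printf "inodes used: %.1f%%\n", $3 / $2 * 100 }'
```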

I haven't found a reliable way to list all the file systems that are "interesting", as nowadays there is a bunch of virtual or temporary file systems that get mounted on any AWS (or Linux) machine. However, your volumes will typically be of type ext4 or xfs, so you can use the file system type as a filter.

So, this script will, for every mounted ext4 file system, compute the percentage of used space, and publish it as the DiskSpaceUtilization metric, with the instance ID and the mount path as dimensions:

# The curl calls query the EC2 instance metadata service; stripping the
# trailing availability zone letter turns the zone into the region name.
REGION=$(curl -s | sed 's/[a-z]$//')
INSTANCE_ID=$(curl -s

# For each mounted ext4 file system, compute used space as a percentage
# (df -P: total is $2, used is $3, mount point is $NF) and publish it.
df -P -t ext4 | awk 'NR > 1 { print $NF, $3 / $2 * 100 }' | \
while read MOUNT USED; do
    aws cloudwatch put-metric-data --region "$REGION" \
        --namespace 'System/Linux' --metric-name DiskSpaceUtilization \
        --unit Percent --value "$USED" \
        --dimensions "InstanceId=$INSTANCE_ID,MountPath=$MOUNT"
done

Note that this script can be saved to a file and run from cron, so it's easy to deploy.
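For example, if you save the script as an executable file (the path below is just an example), a crontab entry reporting every five minutes could look like this:

```
# Report disk utilization to CloudWatch every five minutes.
*/5 * * * * /usr/local/bin/report-disk-utilization.sh
```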

Note that you should run aws configure to configure your access and secret keys, unless you can use EC2 role authentication; in that case, the aws tool will pick the credentials up automatically.

That's it!
by Pawel S. Veselov, 2015