I have been doing a lot of work lately to generate volumes of metrics. Why bother, you might ask? I will tell you. If you do not measure what is going on with your resources (applications and machines in my case), you cannot put a number on things, which it turns out is quite useful. In putting numbers on things you'll feel you've passed from a land of make-believe and half-assed guesses into a world where you truly understand your resources and can make firmly backed assurances.
For example: it's one thing to tell your boss "the server is really slow because like lots of web requests and stuff" (trust me, business types love that sorta explanation), and another entirely to be able to say "we are experiencing 2x the normal request load and the server is doing its damnedest to use up all the available CPU and/or memory". That brings me to an important point: if you are not measuring anything, how do you describe normal? Without metrics it's probably in terms of "everything seems OK, I mean the pages load fast!" Even worse, it could be "hey, not a single crash today, the servers actually stayed up!"
How do you really start to create a well-informed solution to problems such as slowness and lack of robustness (shit crashes)? Without metrics it's likely "let's throw a bunch of hardware at it and hope it's enough".
When you have the actual numbers you can make much better inferences, the conclusions of which will fuel sound strategies based on more than gut feelings. Later in this post I'll point out how the marriage of proper metrics and load testing can give you deep insight into your application.
OK, So Like What Are Metrics?
There are three that, in my experience, are pretty important when you are building web apps.
First up is your simple gauge metric. This little fella is simply a reading of some value at any given time. A good example in the web application world is the count of database connections being pooled. Another very simple example is how much memory is available. Over time a gauge will likely go up and down, and sometimes stay pretty constant.
The next metric is the humble counter. A counter is pretty obvious, right? You count things with it. A good example is how many times an endpoint has been hit. Note that a counter doesn't have to increment by one each time; you could have a counter which counts the total number of bytes in and out of your server. In such a case you'd have pretty non-uniform increments to the counter with each web request.
The last but not least of the metrics is the timer. A timer measures how many milliseconds a given operation takes. Say your code has a section which makes a call to a third-party web service. Wrap it in a timer and you can see when that service starts to get flaky on you. Timers also work great for wrapping algorithms or tricky sections of code. They'll give you quite a bit of insight over time.
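To make the three types concrete, here's a minimal sketch of what gauges, counters, and timers might look like in plain Python. The class names and shapes here are my own illustration, not any particular metrics library's API:

```python
import time

class Gauge:
    """A reading of some value at a point in time, e.g. pooled DB connections."""
    def __init__(self):
        self.value = 0

    def set(self, value):
        self.value = value

class Counter:
    """An ever-growing count; increments don't have to be by one."""
    def __init__(self):
        self.count = 0

    def increment(self, amount=1):
        self.count += amount

class Timer:
    """Measures how many milliseconds a wrapped block of code takes."""
    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, *exc):
        self.last_ms = (time.perf_counter() - self._start) * 1000.0
        return False

# Gauge: a value right now
pooled_connections = Gauge()
pooled_connections.set(12)

# Counter: non-uniform increments, like bytes out per request
bytes_out = Counter()
bytes_out.increment(2048)

# Timer: wrap a flaky third-party call (a sleep stands in for it here)
with Timer() as t:
    time.sleep(0.01)
# t.last_ms now holds roughly 10 milliseconds
```

Real libraries add thread safety, histograms, and periodic reporting on top of this, but the core ideas really are this small.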
We Have Metrics Now What?
You will want a way to log each of these metrics at given intervals, and some way to visualize the data over time. Most likely you will be using standard line graphs to visualize this data, but it’s also possible you’ll use other forms of graphing and even custom visualizations. I’ll post later on tools which can be used to do the storage and graphing. These days there are quite a few packages that provide full and partial solutions to these problems out there including Ganglia, Graphite, and Cacti.
The other thing that you'll probably want to do is create alerts based on certain metrics. You might be taking an average of requests per second and, if it exceeds a custom threshold, sending an alert to one or more people. Other useful alerts are disk space (way too common to run out of) and memory and CPU alerts on machines. There are some solutions to alerting; the one I've used most is Nagios. Depending on what solution you use for storing and retrieving metrics, you might actually be able to just write your own little alerting app. In a language like Python or Ruby it could be rather trivial.
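Just how trivial? Here's a hedged sketch of a threshold checker in Python. The dicts stand in for whatever store you keep metrics in; the metric names and limits are made up for illustration:

```python
def check_thresholds(metrics, thresholds):
    """Return alert messages for any metric above its threshold.

    metrics:    dict of metric name -> current value
    thresholds: dict of metric name -> limit that triggers an alert
    """
    alerts = []
    for name, limit in thresholds.items():
        value = metrics.get(name)
        if value is not None and value > limit:
            alerts.append(f"{name} is {value}, above threshold {limit}")
    return alerts

# Example: average requests/sec is over the line, disk is fine
current = {"requests_per_sec": 250, "disk_used_pct": 62}
limits = {"requests_per_sec": 200, "disk_used_pct": 90}

for message in check_thresholds(current, limits):
    print(message)  # in real life you'd email or page someone here
```

Run it on a schedule (cron is fine to start), point the output at email or an on-call tool, and you have the bones of an alerting app.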
Once you have alerts you may also want to invest some work into integration with a tool like PagerDuty which makes it easy to alert the right person(s) at a given time, and escalate issues if nobody responds in a given time.
Help Dev Find Their Way
Wanna hear a really cool story about using a gauge to supercharge dev? I put gauges on each web app endpoint to report how many actual SQL statements are run during each web request. Using this information we were able to find some really nasty goings-on. We found endpoints which caused the dreaded N+1 issue and fixed them. We found endpoints that just ran way too many SQL statements, and often found we could cache much of the work in Memcached.
We also used counters, counting the total number of statements overall and the total per endpoint. This allows calculating the percentage of total SQL statements run by each endpoint on average, taking into account how many requests they get in total. This helped us focus on endpoints that get hit often and run lots of SQL vs ones that don't get hit often and run lots of SQL.
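The arithmetic behind that report is simple enough to sketch. This is my own illustration of the calculation, with made-up numbers; the endpoint names and data shape are hypothetical:

```python
def sql_share_by_endpoint(stats):
    """For each endpoint, compute its share of all SQL statements and its
    average statements per request.

    stats: dict of endpoint -> (request_count, total_sql_statements)
    """
    total_sql = sum(sql for _, sql in stats.values())
    report = {}
    for endpoint, (requests, sql) in stats.items():
        report[endpoint] = {
            "share_of_total_sql": sql / total_sql,
            "avg_sql_per_request": sql / requests,
        }
    return report

stats = {
    "/orders":  (10_000, 80_000),  # hit constantly, 8 statements each
    "/reports": (50, 5_000),       # query-heavy but rarely hit
}
report = sql_share_by_endpoint(stats)
```

In this toy data, /orders only runs 8 statements per request but accounts for the lion's share of total SQL because of its traffic, which is exactly the kind of endpoint worth optimizing first.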
Our app is quite bound by the DB end of things. We have some very big queries that can get generated. We wrapped up every endpoint so that it has a timer started when the request hits our app, reporting the value when the request terminates. With this we can see endpoints that take a long time, even though they may only be running a single query (or a few). Again, another way to allow dev to dig into the performance side of things. Of course, if endpoints were slow for reasons other than the DB we'd see that too (though it's really not been the case).
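That wrap-every-endpoint idea maps naturally onto a decorator in Python. This is a minimal sketch of the pattern, not our actual production code; the endpoint name and in-memory storage are stand-ins for whatever framework and metrics store you use:

```python
import time
from functools import wraps

# endpoint name -> list of request durations in milliseconds
endpoint_timings = {}

def timed_endpoint(name):
    """Time a request handler from entry to exit and record the duration."""
    def decorator(handler):
        @wraps(handler)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return handler(*args, **kwargs)
            finally:
                # record even if the handler raises
                elapsed_ms = (time.perf_counter() - start) * 1000.0
                endpoint_timings.setdefault(name, []).append(elapsed_ms)
        return wrapper
    return decorator

@timed_endpoint("/slow_report")
def slow_report():
    time.sleep(0.02)  # stand-in for one big, slow query
    return "done"

slow_report()
# endpoint_timings["/slow_report"] now holds one ~20ms reading
```

In a real app you'd hang this off your framework's middleware hook so every endpoint gets timed without decorating each one by hand.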
The best part is when the devs get around to making improvements, either shaving down the number of queries run or the time they take to run, we can see the results quickly through the metrics. Honestly it makes everyone feel that much more awesome (well I mean as long as their fix works). It’s a good thing.
At this point you have a good set of metrics in place for your application, and when you run your app it generates metrics. You can gain even more insight by running load testing to generate, wait for it, loads of metrics (yuk yuk yuk). Metrics by themselves will give you insights into running applications, such as which endpoints are receiving the most traffic and how fast they are. By pairing metrics with load testing you will gain insights into performance when things go past current production levels. Load testing is your crystal ball.
Wanna get a very good idea of how many requests per second your app can handle? Turn up the dial on your load tests slowly, ramping up until the server just doesn't want to respond (much/at all) anymore. Some may say "I already do this and I don't have any stinking metrics. I know how many requests per second my app handles on given hardware, yada yada yada". So you know at what point it breaks, but can you tell me why it breaks? Tell me which trends indicate certain types of performance bottlenecks. Oh, I see, now we are back to "it's slow because lots of web requests… and stuff".
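The "turn up the dial" loop itself is not much code. Here's a sketch of the ramping shape in Python; the `fake_request` function stands in for whatever actually fires an HTTP request at your app, and the step sizes are arbitrary:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def ramp_load(hit, steps, requests_per_step):
    """At each step, fire an increasing number of concurrent requests at
    `hit` and record the measured throughput (requests per second)."""
    results = []
    for step in range(1, steps + 1):
        concurrency = step * requests_per_step
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            # wait for the whole batch to finish
            list(pool.map(lambda _: hit(), range(concurrency)))
        elapsed = time.perf_counter() - start
        results.append((concurrency, concurrency / elapsed))
    return results

# Stand-in "request" that just sleeps like a 10ms response would
def fake_request():
    time.sleep(0.01)

for concurrency, rps in ramp_load(fake_request, steps=3, requests_per_step=5):
    print(f"{concurrency} concurrent -> {rps:.0f} req/s")
```

The point isn't this toy harness (real tools like JMeter or ab do the heavy lifting); it's that while the ramp runs, your gauges, counters, and timers are recording exactly what gives out first.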
Even if all you wanted to do was throw money at it, would you know if it's money well spent on memory, (faster) hard disks, more and/or faster CPUs, more nodes, etc.? That is hard to tell from a basic figure that says the app starts flaking out at X requests per second.
Luckily you have metrics, right? Now go and look at the metrics generated by your load testing. You’ll probably see a ton of things that will blow your mind. Like what you may ask.
Well, let's start with timers. You'll probably see the less optimized endpoints start taking a lot of time. You'll be thinking "hmmmm, these are weak points". You'll also maybe see your memory and/or CPU spike at a certain requests per second. Again you'll be thinking something like "hmmmm, something really eats up CPU and/or memory". Counters like my SQL counter example above will indicate other issues. Really, the sky is the limit here; you just have to start recording things!
What I've done here is explain how load testing will allow you to methodically find and deal with performance issues. You may find it interesting how interconnected load testing and gathering metrics are. It really should not be a surprise. Basically the two are parts of one whole, each useful on its own, but put together they enable very broad insights neither can provide alone.
Not Just for Developers
Metrics can provide useful information to people other than dev. How about customer service? Giving them a heads up when response times spike (expect lots of phone calls) or when a certain user just can't seem to log in (John J. just failed login 5 times!) will allow them to make better decisions.
Speaking of which, product dev guys love it too. They can see which features are used and by whom. They have numbers on the dev stuff, which always seems to resonate quite well. How important is that new feature? Well, I don't know, this current feature/endpoint is taking 3+ seconds on average to load; maybe we should fix that first.
Sales? Who isn't using the site, and who is? How can we engage customers and bring them back in when they may be ready to leave?
Marketing loves metrics, mostly for bragging rights. “We get X requests per month” for example.
More to Come
In my next post I will take some time to simply lay out a lot of the basic metrics most engineering teams will be interested in, and maybe a few that non-eng groups would be interested in too. Later on I'll cover actually putting some software together to deal with metrics, and to generate them too.