One of the essential ingredients of an ecosystem where you can deploy on Friday with confidence is the ability to know the state of your application(s) as they are running on your production environment. Are they up and running? Are they healthy from a technical and functional perspective? Are there any trends that indicate any problems might occur in the near future? Are alerts automated when there’s a problem? Are alerts relevant and actionable?
So what’s observability and how does it help us?
In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. (Wikipedia)
From this definition we can infer that a system that doesn’t log events and doesn’t expose metrics in any way is not very observable. There’s no view of the internal state of the system and the first indication that something is wrong is most likely a customer calling the service desk. Not a great position to be in because you’re always running behind the issues at hand.
On the opposite side of the spectrum: a system that streams log events into a dedicated log management system (e.g. ELK Stack, Splunk), exposes metrics to a dedicated APM system (e.g. Prometheus, AppDynamics, Dynatrace) and has a detailed health API (e.g. Spring Boot Actuator endpoints) could be considered highly observable; it’s very easy to collect and present KPIs and SLIs and to measure whether thresholds have been crossed, trends have been broken and whether our SLOs are in danger or not. This is a much better position to be in because you can catch issues before they actually become a problem for your customers.
Note that ‘system’ in this case is not just your application. It includes any dependencies and the infrastructure/platform that your application runs on. Essentially anything that can affect the SLA of the services/product you are on-call for.
The Google SRE Book is a really great place to start if you want to learn more about observability and what all those TLAs mean. The chapter about Monitoring Distributed Systems is particularly relevant in the context of this blog post.
But wait, isn’t observability just a fancy word for monitoring?
Almost: observability describes how well a system can be monitored, while monitoring is the act of actually observing the data your system outputs.
Before going into the details of any technical solution it’s essential to understand what kind of data and events we would like to monitor in the first place. The goal is not to look at a dashboard all the time, but be actively notified in case there’s an event that needs further investigation. Your dashboard should be very boring most of the time. Furthermore, notifications should be kept to a minimum or people will start ignoring them because of notification fatigue. We want to limit our metrics to what’s strictly necessary and we want to keep our alerting limited to critical or about-to-become critical events so that whenever we receive a notification we know it’s something that’s worthy of our attention and the notification itself will contain just enough information to be actionable.
So let’s start simple, with metrics that apply to most applications (e.g. technical metrics) rather than very specific metrics that only apply to a few (e.g. business-specific metrics). A good starting point is The Four Golden Signals from the SRE book.
In this screenshot only a handful of things are being monitored: CPU/memory/disk usage, number of requests, errors and response latency. This doesn’t look like much, but I agree with Google that if you don’t have anything right now and want to start with at least something, start with this and then expand from there. Logging is probably the next thing to look at, e.g. the number of lines logged at ERROR level.
Defining SLIs and SLOs
In this particular example ‘Request latency’ is an SLI (Service Level Indicator) and the accompanying SLO (Service Level Objective) could be something like ‘Request latency < 100ms’. Another SLI is ‘Error rate’ and the SLO could be ‘Error rate < 0.1%’. More advanced scenarios could include more complex metrics, e.g. more business-focused metrics or compound metrics.
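The relationship between an SLI and its SLO can be made concrete in a few lines of code. Below is a minimal, dependency-free sketch; the class name, method names and the 0.1% target are illustrative, not taken from the example project.

```java
// Minimal sketch: computing an error-rate SLI and checking it against an SLO.
public class SloCheck {

    // SLI: fraction of failed requests over some time window
    static double errorRate(long failedRequests, long totalRequests) {
        return totalRequests == 0 ? 0.0 : (double) failedRequests / totalRequests;
    }

    // SLO: error rate must stay below 0.1% (0.001)
    static boolean meetsSlo(long failed, long total) {
        return errorRate(failed, total) < 0.001;
    }

    public static void main(String[] args) {
        System.out.println(meetsSlo(50, 100_000));   // 0.05% -> true
        System.out.println(meetsSlo(200, 100_000));  // 0.2%  -> false
    }
}
```

The same shape works for any SLI: measure, compare against the objective, and act on the result.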
Notification and alerting
If we have an SLO that says our request latency should not exceed 100ms we could send an alert whenever that threshold is crossed. But there are a few problems with that.
First of all, this means we get a notification every single time this happens, and if you have a large number of requests coming into your system that may mean a very large number of notifications. E.g. if your system handles 100,000 requests per day and 0.1% of those have a latency over 100ms, you’ll get 100 notifications. Ouch!
Another problem is that we send a notification after our SLO has been broken. So it’s reactive rather than proactive. It would be much nicer to be notified before a problem occurs rather than after the fact; that way we can intervene early and avoid violating our SLAs in the first place. Let’s see how that’s solved in modern APM solutions.
Putting things together
So now that we have talked about the theory, let’s put some things together in practice. For this purpose we’re going to deploy a Spring Boot application with Actuator endpoints and deploy this onto Pivotal Cloud Foundry. Then we’ll expose some metrics into PCF Metrics and set some monitors so we get notifications in Slack.
Keep in mind that the stack above is just my favorite stack of the day. It’s perfectly possible to deploy a similar application onto a completely different stack (e.g. Kubernetes or IAAS directly) and feed the metrics into something like ELK, Prometheus or one of the other APM tools I mentioned earlier.
All code can be found at the GitHub repository. In essence, this application is a bog-standard Spring Boot application with starters for Web and Actuator. Actuator provides several production-ready features so an application is observable by default, with no custom code required out of the box (though customisations are possible). It’s a great place to start if Java is your language of choice.
The application has one RestController that will call a MetricsService to emit a custom metric on each request.
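The controller/service shape described above can be sketched without any framework dependencies. The real project uses Spring’s @RestController and related annotations; the names and in-memory counter below are illustrative and may not match the repository exactly.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Dependency-free sketch of a controller emitting a custom metric per request.
public class MetricsSketch {

    interface MetricsService {
        void increment(String metricName);
    }

    // In-memory stand-in for a real metrics backend (e.g. the PCF Metrics Forwarder)
    static class InMemoryMetricsService implements MetricsService {
        final Map<String, LongAdder> counters = new ConcurrentHashMap<>();
        public void increment(String metricName) {
            counters.computeIfAbsent(metricName, k -> new LongAdder()).increment();
        }
    }

    // Plays the role of the RestController: emit a custom metric on each request
    static class HelloController {
        private final MetricsService metrics;
        HelloController(MetricsService metrics) { this.metrics = metrics; }
        String hello() {
            metrics.increment("custom.hello.requests");
            return "Hello, observable world!";
        }
    }

    public static void main(String[] args) {
        InMemoryMetricsService metrics = new InMemoryMetricsService();
        HelloController controller = new HelloController(metrics);
        controller.hello();
        controller.hello();
        System.out.println(metrics.counters.get("custom.hello.requests").sum()); // 2
    }
}
```

In the real application the backend behind MetricsService would forward the datapoint over HTTP instead of counting in memory, but the call site in the controller looks the same.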
By default, Spring Boot already collects some default metrics and will expose these to PCF Metrics automatically in case it finds a service binding to the PCF Metrics Forwarder. There is no additional code required for this, this will simply work out of the box. And since the introduction of Spring Boot 2.0, Actuator can autoconfigure Micrometer so you can emit your metrics automatically to a variety of APM solutions in a transparent manner so you’re not locked into any specific vendor. Check the link for a list and an overview of what’s possible.
Deploying the example
Back to the example. Follow the instructions in the README to get the application up and running. Don’t worry about costs, PWS has a free trial tier that you can use for running these examples. After deploying the application and binding it to the metrics forwarder as described in the README, the application should begin outputting metrics fairly quickly (it’s near-realtime, but not at second accuracy, so give it a couple of minutes to collect). If you log into the apps manager it should show something like the screenshot below.
Click ‘View App‘ to go to the app. This will trigger a request and send a custom metric to the metrics forwarder. Click ‘View in PCF Metrics‘ to view the metrics. And that’s pretty much all there is to it.
In the screenshot above three charts have been set up: requests per minute, errors per minute and a custom metric. If you select a time frame with the mouse, it will also filter the log lines below accordingly so you can immediately match log lines to any potential anomalies in the charts. Similar and more extensive functionality can be found in other APM tooling.
In the source code, check out the ‘metrics’ package for the source code for emitting the custom metrics. In this case it’s specific code for communicating with the PCF Metrics Forwarder API, but feel free to adapt this to calling your own APM solution (or use Micrometer). Configuration to reach the API (endpoint URI and access key) is extracted from the runtime environment on application startup.
This information is then used to construct a CustomMetrics object graph along with the value of the metric (a simple floating point number) which is then passed to the API. Again, this is all very straightforward, with very little code of your own (which is great because you’ll be doing this for all your applications and probably more than one metric per application).
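Building the payload and handing it to an HTTP API can be sketched with just the JDK. The endpoint, header name and JSON shape below are assumptions for illustration only; consult the Metrics Forwarder API documentation for the real contract.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical sketch of posting one custom metric datapoint to a
// metrics-forwarder-style HTTP API. Endpoint and JSON shape are assumed.
public class MetricPayload {

    static String buildPayload(String appGuid, String name, double value) {
        // One datapoint: metric name, floating point value, current timestamp
        return String.format(
            "{\"applications\":[{\"id\":\"%s\",\"instances\":[{\"index\":\"0\"," +
            "\"metrics\":[{\"name\":\"%s\",\"type\":\"gauge\",\"value\":%s," +
            "\"timestamp\":%d}]}]}]}",
            appGuid, name, value, System.currentTimeMillis());
    }

    static HttpRequest buildRequest(String endpoint, String accessKey, String body) {
        return HttpRequest.newBuilder(URI.create(endpoint))
            .header("Authorization", accessKey)  // access key read from the environment at startup
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();
    }

    public static void main(String[] args) {
        String body = buildPayload("app-guid", "custom.hello.requests", 1.0);
        HttpRequest request = buildRequest("https://metrics.example.com/v1/metrics", "secret-key", body);
        System.out.println(request.method() + " " + request.uri());
        // An HttpClient.send(...) call here would actually transmit the datapoint.
    }
}
```

Micrometer hides exactly this kind of plumbing behind its registry abstraction, which is why it is the preferable route when your APM solution is supported.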
Setting up alerts
Staring at a graph all day is pretty boring, especially because most of the time nothing should be happening. So let’s set up a monitor so we can get notified in case something interesting happens (and by interesting we of course mean something bad).
In the screenshot below a monitor has been set up that monitors the number of requests per minute. The critical threshold is set at 20 req/m and the warning threshold at 10 req/m. These numbers are really low so it can be tested easily for the purpose of this article; in a production scenario these would probably be different. Note that there are two thresholds: the critical threshold indicates something is really broken while the warning threshold indicates something is going to break in the near future. Dedicated APMs offer much more flexible thresholds and trend monitoring as well. The great thing about this is that we can be notified before something goes horribly wrong!
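The two-threshold idea boils down to a simple evaluation rule: warn before things break, page when they do. A minimal sketch, using the same deliberately low values as the monitor above:

```java
// Sketch of two-threshold evaluation: warning before breakage, critical on breakage.
public class ThresholdMonitor {

    enum Severity { OK, WARNING, CRITICAL }

    static Severity evaluate(double requestsPerMinute, double warn, double critical) {
        if (requestsPerMinute >= critical) return Severity.CRITICAL;
        if (requestsPerMinute >= warn) return Severity.WARNING;
        return Severity.OK;
    }

    public static void main(String[] args) {
        // warning at 10 req/m, critical at 20 req/m, as configured in the monitor
        System.out.println(evaluate(5, 10, 20));   // OK
        System.out.println(evaluate(12, 10, 20));  // WARNING: a heads-up before breakage
        System.out.println(evaluate(25, 10, 20));  // CRITICAL: something is broken
    }
}
```

Real APM tooling adds refinements on top of this, such as requiring the threshold to be crossed for a sustained period before alerting, but the core logic is this comparison.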
Furthermore, a configured Slack webhook will ensure a notification is received in the Slack project/channel of choice in case the monitor is triggered. So whenever a threshold is passed a notification will be received automatically! Of course, you are free to replace this with whatever your company uses for alerting (e.g. text messages, PagerDuty, Telegram channels or whatever the preferred solution is).
Verifying everything works
Verifying the setup works should be rather easy because the thresholds are set very low. Hitting the main URL of the app by refreshing it in the browser for a few seconds should cross the threshold easily.
Shortly after hitting the endpoint with requests an alert is received via the configured Slack channel.
Checking back into PCF metrics (both in the charts and the monitor event list) we can see that the monitor has indeed been triggered. This demonstrates our setup is indeed working and alerting us when needed.
Setting up critical and warning alerts for the others of the Four Golden Signals should be fairly straightforward. After setting up the basics consider setting up some more advanced metrics and perhaps add some business metrics in the mix as well. Be sure to tune your thresholds to proper values for your particular setup (this may take some experimentation and time to get right).
That wraps it up for this article. Of course this was just a very simple example. To really flesh out monitoring and alerting for all the applications in your entire organisation will take a lot of time and effort. But you don’t need to change the whole world in one day! Start with one application, preferably the one that hurts you the most right now, so you’ll get the most value out of it. Get the setup right for that application, then expand to other applications. A quick summary:
- Start with the Four Golden Signals
- Set up a dashboard that anyone in the team can view
- Set up automated alerting
- Alert for both critical failures and about-to-become critical failures
- Start with one application, preferably the most painful one
- Start small and work from there
Further reading material
Feedback is always appreciated, either by commenting here or on Twitter or LinkedIn!