Custom Metrics (JMX, Golang expvar, Prometheus, statsd or many other), APM and Opentracing are different approaches on how to instrument code in order to monitor health, performance and troubleshoot your application more easily.
Why do we need code instrumentation?
We all agree that creating an application and not monitoring it, is not a good idea. I’ve personally seen projects where they weren’t monitoring the application; everything crashed, they had no idea about what was going on and troubleshooting was chaos. No one knew what was going on and it was crashing on a per day basis in production, creating a terrible user experience and calling the sysadmin team at 3 AM complaining that they couldn’t work. It was a mess. If a monitoring and alerting process was in place, troubleshooting would have been easy.
For example, if you are developing an online shop you are going to gain insights on how your application performs in production: which are your most visited products, the slowest to load products, which users are experiencing slowness, what device, browser or geography may be slow, how long does it take to load the frontend part versus the backend database, to mention a few typical use cases. To gain this visibility you can either use an APM tool, use OpenTracing libraries or generate metrics ad-hoc for these parts you are interested in.
Comparing Custom Metrics, APM and OpenTracing: How to instrument code?
Are custom metrics the same as APM? No, both are often complementary. Organizations sometimes need both to monitor and troubleshoot large-scale, distributed or complex applications, but many times they can do their job with infrastructure monitoring plus custom metrics.
It’s always important that application developers understand the behaviour of their application and how it performs so they can find issues and solve them quickly. They can do that leveraging an APM that gives you an overview of transactions or instrumenting their code through custom metrics to gain observability on the parts they are keen on inside their code.
APM tools are useful for ‘detecting and isolating’ a problem and enabling a code developer to troubleshooting at a code level. However, as we know, most production performance issues are not code-related but infrastructure and application related. APM tools seldom go deep to the layers below the application to help determine the root cause. APM can tell the what is slow , you need more full stack telemetry to determine why its slow.
There are multiple reasons why your application can fail and are completely unrelated to code and can’t be covered just using APM even if includes lightweight infrastructure monitoring, for example:
- Your deployments that have JVMs have high heap usage.
- Your disk available to an specific application is at its full capacity.
- Some nodes in your cluster are oversubscribed.
- Kubernetes is not running all the replicas you requested for a given deployment.
- A daemon that your application depends on is failing.
- Some other application is using too much CPU
- You are under a DDoS attack.
- Kafka queues are backed up.
- Stolen CPU is high on adjacent VMs in the cluster.
- Certain processes are over utilizing network bandwidth within the same namespace.
- Cassandra compactions have dropped indicating not enough data being backed up for your services.
- Network bottleneck caused by dropped packets.
While APM can help in a situation like:
- A page takes longer than expected to load, but only sometimes
- You need to identify which parts generate slow queries on your backend
A right DevOps strategy requires full stack visibility across all stages of the application lifecycle, but APM itself does not provide that; APM data just represents a small subset of all the information that a infrastructure monitoring system can process.
So, when do you need APM and when you should implement custom metrics instead? It depends.
If you are early in your development process and you know which parts can be problematic you can start implementing custom metrics from the beginning. If you know or you can guess where you problem comes from you can instrument only that part of your code base.
If you need metrics all over the code and per transaction observability APM is what it takes then. But if you don’t need per transaction because you are doing server-side microservices monitoring, a full stack scalable telemetry tool that encompasses APM golden signals, application metrics and low level infrastructure and network metrics would be more valuable.
APM at the same time is more resource intensive, sometimes actually can become quite hungry, so that needs to be considered when running in production environments. That’s why you will find that often only a subset of request are sampled. Custom metrics instead, are definitely more lightweight.
OpenTracing is an standard for distributed tracing. We can see this as the instrumentation part of an APM. Ideally you instrument your code one time and then you should be able to visualize your data in multiple places, including open-source UIs or commercial APMs.
Comparison table between Custom Metrics and APM
All this sounds like too many if… can you give me a table so we can get this quickly? You got it.
|Code-related problems||Devs need to provide metrics with performance in code but are not as easy to identify||Yes||Yes|
|Node and service level aggregation||Yes||No||No|
|Standard implementaton||Some languages include a standard way to implement them: (Prometheus, Java JMX, Go expvar, …)||No||Yes|
|Allows capacity planning||Yes||No||No|
|Allows complete statistical measurements||Yes||No||No|
|Cloud Native Computing Foundation standard||Prometheus metrics only||No||Yes|
|Distributed application analysis||Yes, without per trace analysis||Yes||Yes|
|Useful for developers for pre-production environments||Yes||Yes||Yes|
|Useful for complete DevOps strategy||Yes||No||No|
Custom Metrics instrumentation examples
We’ve prepared some custom metric instrumentation examples for you about how to implement them in:
- How to instrument Java code (with JMX custom metrics)
- How to instrument Go code with custom expvar metrics
As you can see there are multiple different ways to instrument your application. Each one has it’s benefits and it’s drawbacks and they are all focused in different approaches.
APM gives you transaction level observability and focuses on the code, custom metrics requires to know what you are doing, because it’s a more DIY but won’t give you per transaction observability, instead will provide full stack visibility: hosts, orchestration, services and applications.
You can use one, some or all of them to throw all the light to your application and be confident that’s performing correctly, will depend on your needs.
When operating a production platform we should prepare for failure and in addition to monitoring working metrics, we cannot forget about the most common failure points. From an agnostic monitoring solution perspective, and following a use-case driven approach, we will learn the most common failure points in a Kubernetes infrastructure and how to detect them (metrics, events, checks, etc).