Custom metrics (JMX, Golang expvar, Prometheus, StatsD and many others), APM and OpenTracing are different approaches to instrumenting code in order to monitor health and performance and to troubleshoot your application more easily. To instrument your application's code, you need to understand the differences between these options and their pros and cons for each use case. In this guide we cover how to create custom metrics in Java, Golang, JavaScript and Python, so you can get an idea of how it works.
Why do we need code instrumentation?
We all agree that building an application and not monitoring it is not a good idea. I’ve personally seen projects where the application wasn’t monitored: everything crashed, no one had any idea what was going on, and troubleshooting was chaos. It crashed daily in production, creating a terrible user experience, with users calling the sysadmin team at 3 AM to complain that they couldn’t work. It was a mess. If a monitoring and alerting process had been in place, troubleshooting would have been much easier.
For example, if you are developing an online shop, you will want insights into how your application performs in production: which products are visited most, which products are slowest to load, which users are experiencing slowness, which devices, browsers or geographies may be slow, and how long the frontend takes to load versus the backend database, to mention a few typical use cases. To gain this visibility you can use an APM tool, use OpenTracing libraries, or generate ad-hoc metrics for the parts you are interested in.
In this blog post series we compare all the custom metric instrumentation options (Prometheus, expvar, StatsD and JMX), with examples of how to implement them in different programming languages: Golang, Java, Python and JavaScript. We will also compare custom metrics with APM and OpenTracing, so you can understand exactly what you need.
Comparing Custom Metrics, APM and OpenTracing: How to instrument code?
Are custom metrics the same as APM? No, but the two are often complementary. Organizations sometimes need both to monitor and troubleshoot large-scale, distributed or complex applications, but many times they can do the job with infrastructure monitoring plus custom metrics. It’s always important that application developers understand the behaviour of their application and how it performs, so they can find issues and solve them quickly. They can do that by leveraging an APM tool that gives an overview of transactions, or by instrumenting their code with custom metrics to gain observability into the parts of the code they care about.
APM tools are useful for ‘detecting and isolating’ a problem and for enabling a developer to troubleshoot at the code level. However, many production performance issues are not caused by the code itself but by the infrastructure and the applications around it. APM tools seldom go deep into the layers below the application to help determine the root cause. APM can tell you what is slow; you need fuller-stack telemetry to determine why it’s slow.
There are many reasons your application can fail that are completely unrelated to code and can’t be covered by APM alone, even if it includes lightweight infrastructure monitoring, for example:
- Your deployments that run JVMs have high heap usage.
- The disk available to a specific application is at full capacity.
- Some nodes in your cluster are oversubscribed.
- Kubernetes is not running all the replicas you requested for a given deployment.
- A daemon that your application depends on is failing.
- Some other application is using too much CPU.
- You are under a DDoS attack.
- Kafka queues are backed up.
- Stolen CPU is high on adjacent VMs in the cluster.
- Certain processes are overutilizing network bandwidth within the same namespace.
- Cassandra compactions have dropped, indicating that not enough data is being backed up for your services.
- Network bottleneck caused by dropped packets.
- A page takes longer than expected to load, but only sometimes.
- You need to identify which parts generate slow queries on your backend.
Comparison table between Custom Metrics and APM
All this sounds like too many ifs… Can you give me a table so we can grasp this quickly? You got it.
Criterion | Custom Metrics | APM | OpenTracing |
---|---|---|---|
Code-related problems | Partially: developers must expose performance metrics in code, and problems are not as easy to identify | Yes | Yes |
Infrastructure-related problems | Yes | No | No |
Node and service level aggregation | Yes | No | No |
Standard implementation | Some languages include a standard way to implement them (Prometheus, Java JMX, Go expvar, …) | No | Yes |
Allows capacity planning | Yes | No | No |
Allows complete statistical measurements | Yes | No | No |
Cloud Native Computing Foundation standard | Prometheus metrics only | No | Yes |
Distributed application analysis | Yes, without per trace analysis | Yes | Yes |
Useful for developers for pre-production environments | Yes | Yes | Yes |
Useful for complete DevOps strategy | Yes | No | No |
Custom Metrics instrumentation examples
There are multiple ways to instrument your code with custom metrics. Some languages, like Java and Golang, provide standard ways to do so that are native to the language, while others, like JavaScript or Python, require third-party libraries (e.g. Prometheus or StatsD) to feed your monitoring system. We’ve prepared some custom metric instrumentation examples showing how to implement them:
- How to instrument Java code (with JMX custom metrics)
- How to instrument Go code with custom expvar metrics
- Monitoring StatsD: metric types, format & code examples
- Prometheus metrics / OpenMetrics code instrumentation
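To give a rough idea of what a custom metric looks like in practice, here is a minimal sketch in Python that emits StatsD-style metrics for the online shop example. It uses only the standard library and the plain StatsD line format (`<name>:<value>|<type>`, with `c` for counters and `ms` for timers) sent over UDP. The metric names, the `StatsdClient` class and the `127.0.0.1:8125` address are assumptions made up for illustration; in a real project you would more likely use an established client library and your own agent address.

```python
import socket
import time
from contextlib import contextmanager


def format_metric(name, value, metric_type):
    """Build a StatsD payload in the form <name>:<value>|<type>."""
    return f"{name}:{value}|{metric_type}"


class StatsdClient:
    """Hypothetical fire-and-forget StatsD client over UDP (illustration only)."""

    def __init__(self, host="127.0.0.1", port=8125):
        # 8125 is the conventional StatsD port; the host is an assumption.
        self.address = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def send(self, name, value, metric_type):
        payload = format_metric(name, value, metric_type)
        # UDP is connectionless, so the application does not wait for a reply.
        self.sock.sendto(payload.encode("ascii"), self.address)
        return payload

    def increment(self, name, count=1):
        # Counter: how many times something happened.
        return self.send(name, count, "c")

    def timing(self, name, millis):
        # Timer: how long something took, in milliseconds.
        return self.send(name, millis, "ms")

    @contextmanager
    def timer(self, name):
        # Measure the wrapped block and report its duration as a timer.
        start = time.monotonic()
        try:
            yield
        finally:
            elapsed_ms = int((time.monotonic() - start) * 1000)
            self.timing(name, elapsed_ms)


if __name__ == "__main__":
    statsd = StatsdClient()
    # Hypothetical metric names for the online shop example.
    statsd.increment("shop.product.views")
    with statsd.timer("shop.product.load_time"):
        time.sleep(0.05)  # stand-in for loading a product page
```

The counter answers "which products are visited most" and the timer answers "which products are slowest to load"; the monitoring backend takes care of aggregating them per node or per service.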