Comparing Custom Metrics, APM and OpenTracing: How to instrument code?

Are custom metrics the same as APM? No, but the two are often complementary. Organizations sometimes need both to monitor and troubleshoot large-scale, distributed or complex applications, but in many cases infrastructure monitoring plus custom metrics is enough to do the job.

It’s always important that application developers understand how their application behaves and performs so they can find issues and solve them quickly. They can do that by leveraging an APM that gives them an overview of transactions, or by instrumenting their code with custom metrics to gain observability into the parts of the code they care about.

APM tools are useful for ‘detecting and isolating’ a problem and for enabling a developer to troubleshoot at the code level. However, as we know, most production performance issues are not code-related but infrastructure and application related. APM tools seldom go deep into the layers below the application to help determine the root cause. APM can tell you what is slow; you need fuller stack telemetry to determine why it’s slow.

There are many reasons why your application can fail that are completely unrelated to code and can’t be covered by APM alone, even if it includes lightweight infrastructure monitoring, for example:
- The JVMs in your deployments have high heap usage.
- The disk available to a specific application is at full capacity.
- Some nodes in your cluster are oversubscribed.
- Kubernetes is not running all the replicas you requested for a given deployment.
- A daemon that your application depends on is failing.
- Some other application is using too much CPU.
- You are under a DDoS attack.
- Kafka queues are backed up.
- Stolen CPU is high on adjacent VMs in the cluster.
- Certain processes are overutilizing network bandwidth within the same namespace.
- Cassandra compactions have dropped, indicating that not enough data is being backed up for your services.
- Network bottleneck caused by dropped packets.
- A page takes longer than expected to load, but only sometimes.
- You need to identify which parts generate slow queries on your backend.
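To make "instrumenting code through custom metrics" concrete, here is a minimal sketch of timing a code path and counting requests using the StatsD line format (`<name>:<value>|<type>`). It uses only the Python standard library. The `MetricsClient` class, the metric names and the `load_page` function are hypothetical names made up for this illustration, and the address assumes a StatsD-compatible agent listening on the usual UDP port 8125.

```python
import socket
import time
from contextlib import contextmanager


def statsd_line(name, value, metric_type):
    """Format one metric in the StatsD line protocol: <name>:<value>|<type>."""
    return f"{name}:{value}|{metric_type}"


class MetricsClient:
    """Hypothetical fire-and-forget client; host and port are assumptions."""

    def __init__(self, host="127.0.0.1", port=8125):
        self.addr = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        self.sent = []  # kept only so this example can be inspected

    def send(self, line):
        self.sent.append(line)
        try:
            self.sock.sendto(line.encode("ascii"), self.addr)
        except OSError:
            pass  # a metrics failure should never break the application

    def incr(self, name, count=1):
        # "c" is the StatsD counter type
        self.send(statsd_line(name, count, "c"))

    @contextmanager
    def timer(self, name):
        # "ms" is the StatsD timer type; the value is elapsed milliseconds
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = int((time.perf_counter() - start) * 1000)
            self.send(statsd_line(name, elapsed_ms, "ms"))


metrics = MetricsClient()


def load_page():
    # Hypothetical code path standing in for the part you care about
    with metrics.timer("page.load_time"):
        time.sleep(0.01)
    metrics.incr("page.requests")


load_page()
print(metrics.sent)
```

Because every call is recorded as a timer sample, the monitoring backend can compute percentiles over time, which is one way to surface a page that is slow "only sometimes".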
Comparison table between Custom Metrics and APM

All this sounds like too many ifs… can you give me a table so we can get this quickly? You got it.

| | Custom Metrics | APM | OpenTracing |
|---|---|---|---|
| Code-related problems | Developers must expose code performance as metrics, and problems are not as easy to identify | Yes | Yes |
| Node and service level aggregation | Yes | No | No |
| Standard implementation | Some languages include a standard way to implement them (Prometheus, Java JMX, Go expvar, …) | No | Yes |
| Allows capacity planning | Yes | No | No |
| Allows complete statistical measurements | Yes | No | No |
| Cloud Native Computing Foundation standard | Prometheus metrics only | No | Yes |
| Distributed application analysis | Yes, without per-trace analysis | Yes | Yes |
| Useful for developers in pre-production environments | Yes | Yes | Yes |
| Useful for complete DevOps strategy | Yes | No | No |
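
The "Distributed application analysis" row says custom metrics work "without per-trace analysis". The sketch below illustrates that difference on a handful of made-up timing records: a metrics-style view aggregates per service and discards the request identity, while a trace-style view keeps one request's spans together. The `SPANS` data, the service names and both function names are invented for this illustration and do not come from any real tool or API.

```python
from collections import defaultdict

# Made-up records for illustration only: (trace_id, service, duration_ms)
SPANS = [
    ("t1", "frontend", 120), ("t1", "backend", 90), ("t1", "db", 70),
    ("t2", "frontend", 900), ("t2", "backend", 870), ("t2", "db", 60),
    ("t3", "frontend", 110), ("t3", "backend", 80), ("t3", "db", 65),
]


def metrics_view(spans):
    """Aggregate per service, as a metrics pipeline would; trace IDs are dropped."""
    by_service = defaultdict(list)
    for _trace_id, service, duration in spans:
        by_service[service].append(duration)
    return {
        service: {
            "count": len(durations),
            "max_ms": max(durations),
            "avg_ms": sum(durations) / len(durations),
        }
        for service, durations in by_service.items()
    }


def trace_view(spans, trace_id):
    """Keep the spans of a single request together, as a tracer would."""
    return [(service, duration) for t, service, duration in spans if t == trace_id]


print(metrics_view(SPANS)["db"])
print(trace_view(SPANS, "t2"))
```

In this toy data the aggregated view shows that `backend` has a high maximum across all requests, while the per-trace view shows that in request `t2` the time was spent in `backend` and not in `db`. The aggregated numbers are what make service-level statistics and capacity planning possible; the per-request view is what tracing adds.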
- How to instrument Java code (with JMX custom metrics)
- How to instrument Go code with custom expvar metrics
- Monitoring StatsD: metric types, format & code examples
- Prometheus metrics / OpenMetrics code instrumentation