How to monitor Golden signals in Kubernetes.

What are Golden signals metrics? How do you monitor Golden signals in Kubernetes applications? Golden signals help detect issues in a microservices application. These signals are a reduced set of metrics that offer a wide view of a service from a user or consumer perspective, so you can detect potential problems that might be directly affecting the behaviour of the application.

Golden signals, a standard for Kubernetes application monitoring

Congratulations, you have successfully deployed your application in Kubernetes. This is the moment you discover your old monitoring tools are pretty much useless and that you are not able to detect potential problems. Classic monitoring tools are usually based on static configuration files and were designed to monitor machines, not microservices or containers. In the container world things change fast. Containers are created and destroyed at an incredible pace and it is impossible to catch up without specific service discovery functions.

Most modern monitoring systems offer a huge variety of metrics for many different purposes. It is quite easy to drown in metrics and lose focus on what is really relevant to your application. Setting too many irrelevant alerts can drive you to a permanent emergency status and "alert burnout." Imagine a node that is heavily used and raising load alerts all the time; you don't do anything about it as long as the services in the node keep working. Having too many alerts is as bad as not having any: important alerts get masked in a sea of irrelevance.

This is a problem that many people have faced and, fortunately, someone has already solved. The answer is the four Golden signals, a term first used in the Google SRE book. Golden signals are four metrics that will give you a very good idea of the real health and performance of your application as seen by the actors interacting with that service, whether they are end users or other services in your microservice application.

golden signals

Picture from Denise Yu (@deniseyu21).

Golden signals metric: Latency explained

Latency is the time your system takes to serve a request against the service. This is an important signal for detecting performance degradation.

When measuring latency, it is not enough to look at average values, as they can be misleading. For example, say we have a service with an average response time of 100 ms. With only this information we might consider it pretty good, yet users report that the application feels slow.

The answer to this contradiction can be found using other statistical measures, like the standard deviation, which gives an idea of the dispersion of the latency values. Imagine we have two kinds of requests: one is very fast, and the other is slow because it is more database intensive. If a typical user interaction involves one slow request and ten fast ones, the mean will probably look low, but the application will feel slow. Bottleneck analysis matters too, not only mean values.

A great tool to avoid this pitfall is histogram metrics. These count the number of requests under different latency thresholds and allow aggregation into percentiles. A percentile is a value below which a given percentage of measures falls; for example, p99 means that 99% of requests have a latency lower than that value.
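To make this concrete, here is a minimal Python sketch of how a percentile can be estimated from cumulative histogram buckets, using the same linear interpolation that Prometheus's `histogram_quantile` applies. The bucket boundaries and counts are hypothetical:

```python
# Cumulative bucket counts: requests with latency <= upper bound (seconds).
buckets = [(0.1, 600), (0.25, 850), (0.5, 920), (1.0, 980), (float("inf"), 1000)]

def estimate_quantile(q, buckets):
    """Estimate the q-quantile (0 < q < 1) from cumulative histogram buckets,
    interpolating linearly inside the bucket where the target rank falls."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # cannot interpolate into an unbounded bucket
            # Linear interpolation within this bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

print(round(estimate_quantile(0.50, buckets), 3))  # 0.083
print(round(estimate_quantile(0.95, buckets), 3))  # 0.75
```

Note how p50 looks healthy while p95 is an order of magnitude higher: exactly the dispersion an average would hide.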

As you can see in the screenshot, average latency is acceptable, but if we look at the percentiles, we see a lot of dispersion in the values, which gives a better idea of the latency users actually perceive. Different percentiles convey different information: p50 usually expresses general performance degradation, while p95 – or p99 – allows detection of performance issues in specific requests or components of the system.

Another useful tool to analyze latency is the Apdex score which, given your SLA terms, provides a good general idea of how healthy your system is based on percentiles.
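As a rough sketch, the Apdex score classifies requests as satisfied (latency under a target T), tolerating (under 4T) or frustrated, and computes (satisfied + tolerating/2) / total. The sample latencies below are made up:

```python
def apdex(latencies, t):
    """Apdex score: (satisfied + tolerating/2) / total, where requests under
    the target t are 'satisfied' and requests under 4*t are 'tolerating'."""
    satisfied = sum(1 for l in latencies if l <= t)
    tolerating = sum(1 for l in latencies if t < l <= 4 * t)
    return (satisfied + tolerating / 2) / len(latencies)

# Seven fast requests, two tolerable, one frustrating, with a 100 ms target.
samples = [0.05, 0.06, 0.07, 0.08, 0.09, 0.09, 0.10, 0.25, 0.30, 0.90]
print(apdex(samples, t=0.1))  # 0.8
```

A score of 1.0 means every user is satisfied; values below roughly 0.7 are usually considered a sign of trouble.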

Golden signals metric: Errors explained

The rate of errors returned by your service is a very good indicator of deeper issues. It is very important to detect not only explicit errors, but implicit errors too.

An explicit error would be any kind of HTTP error code. These are pretty easy to identify, as the error code is easily obtained from the response headers, and codes are fairly consistent across systems. Some examples are authorization errors (401, 403), content not found (404) or internal server error (500). Error descriptions can be very specific in some cases (418 – I'm a teapot).

On the other hand, implicit errors can be trickier to detect. How about a request with HTTP response code 200 but with an error message in the content? Different policy violations should be considered as errors too:

  • Errors that do not generate an HTTP reply, such as a request that took longer than the timeout.
  • Content errors in an apparently successful request.

When using dashboards to analyze errors, mean values or percentiles do not make much sense. To properly see the impact of errors, the best approach is to use rates. The number of errors per second tells you when the system started to fail and with what impact.
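A rate is just the delta of a monotonic counter over a time window, which is what Prometheus's `rate()` computes. A minimal sketch with hypothetical counter samples:

```python
def error_rate(prev_count, curr_count, interval_seconds):
    """Per-second error rate from two samples of a monotonic error counter."""
    return (curr_count - prev_count) / interval_seconds

# Counter of non-200 responses sampled 60 seconds apart (hypothetical values).
print(error_rate(1200, 1320, 60))  # 2.0 errors per second
```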

Golden signals metric: Traffic / connections explained

Traffic or connections is an indicator of the amount of use of your service per time unit. It can be many different values depending on the nature of the system, like the number of requests to an API or the bandwidth consumed by a streaming app.

It can be useful to group the traffic indicators depending on different parameters, like response code or related to business logic.

Golden signals metric: Saturation explained

Saturation should be the answer to a question: how full is my system?

Usually saturation is expressed as a percentage of the maximum capacity, but each system will have different ways to measure saturation. The percentage could mean the number of users or requests obtained directly from the application or based upon estimations.

Most of the time, saturation is derived from system metrics, like CPU or memory, so it doesn't rely on instrumentation and is collected directly from the system using tools like the Prometheus node-exporter. Obtaining system metrics from a Kubernetes node is essentially the same as with any other system; at the end of the day, they are Linux machines.

It is important to choose adequate metrics and to use as few as possible. The key to successfully measuring saturation is choosing the metrics that actually constrain the performance of the system. If your application is processor intensive, use CPU load; if it is memory intensive, use memory usage. The process of choosing saturation metrics is often a good exercise for detecting bottlenecks in the application.

You should set alerts with some margin in order to detect saturation early, because performance usually falls drastically once saturation goes over 80%.
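As a sketch, a Prometheus alerting rule along these lines would warn before the service hits the wall. The expression, thresholds and labels are illustrative and assume node-exporter metrics are being scraped:

```yaml
groups:
  - name: saturation
    rules:
      - alert: HighCpuSaturation
        # CPU usage above 80% for 10 minutes, leaving margin before collapse.
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CPU saturation above 80% on {{ $labels.instance }}"
```

The `for: 10m` clause avoids paging on short spikes, which is part of keeping alerts actionable.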

Golden signals vs RED method vs USE method in Kubernetes

There are several approaches to designing an efficient monitoring system for an application, but they are commonly based on the four Golden signals. Some, like the RED method, give more weight to request-oriented metrics: rate, errors and duration (latency). Others, like the USE method, focus on system-level values: utilization, saturation and errors in resources like CPU, memory and I/O. When should we use each approach?

RED method

The RED method focuses on parameters of the application, without considering the infrastructure that runs it. It is an external view of the service – how clients see it. Golden signals add the infrastructure component through the saturation value, which is necessarily inferred from system metrics. This gives a deeper view, as every service is unavoidably tied to the infrastructure running it. An external view may be fine, but saturation gives you an idea of "how far" the service is from failure.

USE method

The USE method emphasizes utilization of resources, with errors in the requests as the only external indicator of problems. This method can miss issues that affect parts of the service. What if the database is slow due to poor query optimization? That would increase latency but would not be noticeable in saturation. Golden signals try to get the best of both methods by including externally observable behaviour as well as system parameters.

Having said this, all these methods share a common goal: they try to homogenize and simplify complex systems in order to make incident detection easier. If you can detect any issue with a small set of metrics, scaling your monitoring to a large number of systems becomes almost trivial.

Simplify monitoring, a good side effect

As a good side effect, reducing the number of metrics involved in incident detection helps reduce alert fatigue caused by arbitrary alerts set on metrics that do not necessarily indicate a real issue, or that lack a clear, direct action path.

As a weakness, any simplification removes detail from the information received. It is important to note that, although Golden signals are a good way to detect ongoing or future problems, once a problem is detected the investigation will require different inputs to dig into its root cause. Any tool at hand can be useful for troubleshooting, like custom metrics or different metric aggregations – for example, separate latency per deployment.

Golden signal metrics instrumentation in Kubernetes

Instrumenting code with Prometheus metrics / custom metrics

In order to get Golden signals with Prometheus, code changes (instrumentation) will be required. This topic is quite vast and has been covered in many previous articles like Prometheus metrics / OpenMetrics code instrumentation.

Prometheus has become a de facto standard for metrics collection, so most languages have a library to implement custom metrics in your application conveniently. Nevertheless, instrumenting custom metrics requires a deep understanding of what the application does.

Poorly implemented code instrumentation can end in a time series cardinality explosion, with a real chance of collapsing your metric collection system. I have seen a request ID used as a label, generating one time series per request (I promise this is real). Obviously, this is something you don't want in your monitoring system, as it increases the resources needed to collect the information and can potentially cause downtime. Choosing a correct aggregation can be key to a successful monitoring approach.
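The request-id anti-pattern is easy to see with a toy simulation. The metric and label names here are hypothetical; the point is that a time series exists per unique label combination:

```python
import uuid

# Simulate labeling a counter two ways for the same 10,000 requests.
series_bad, series_good = set(), set()
for _ in range(10000):
    request_id = str(uuid.uuid4())
    endpoint = "/greeting"  # only a handful of endpoints exist in practice
    series_bad.add(("http_requests_total", request_id))   # one series per request!
    series_good.add(("http_requests_total", endpoint))    # bounded cardinality

print(len(series_bad), len(series_good))  # 10000 1
```

Ten thousand series instead of one, for exactly the same information once aggregated.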

Sysdig eBPF system call visibility (no instrumentation)

Sysdig Monitor uses eBPF to get information on all the system calls directly from the kernel. This way, your application does not need any modification, neither in the code nor in the container runtime. What runs in your nodes is exactly the container you built, with the exact versions of the libraries and your code (or binaries) intact.

System calls give information about the processes running, memory allocation, network connections, filesystem access and resource usage, among other things. With this information it is possible to build meaningful metrics that reveal a lot about what is happening in your systems.

Golden signals are some of the metrics available out of the box, providing latency, requests rate, errors and saturation, with a special added value that all these metrics are correlated with the information collected from the Kubernetes API. This correlation allows you to do meaningful aggregations and represent the information using multiple dimensions:

  • Group latency by node -> This provides information about problems in your Kubernetes infrastructure.
  • Group latency by deployment -> This allows you to track problems in different microservices or applications.
  • Group latency by pod -> Maybe a pod in your deployment is unhealthy.

These different levels of aggregation allow us to slice the data and locate issues, helping with troubleshooting by digging through the different levels of Kubernetes entities: from cluster to node, to deployment, and then to pod.
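In PromQL terms, equivalent aggregations could look like the queries below. These assume the latency histogram carries `node`, `deployment` and `pod` labels (for example, added via relabeling or kube-state-metrics joins), which is not the case out of the box:

```promql
# p95 latency per node
histogram_quantile(0.95, sum(rate(greeting_seconds_bucket[5m])) by (node, le))

# p95 latency per deployment
histogram_quantile(0.95, sum(rate(greeting_seconds_bucket[5m])) by (deployment, le))

# p95 latency per pod
histogram_quantile(0.95, sum(rate(greeting_seconds_bucket[5m])) by (pod, le))
```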

Instrumenting code: APM / OpenTracing

Different APM (Application Performance Monitoring) tools can give very specific information about your application, including which part of the code is responsible for a specific action. This requires instrumentation, either through code changes or through changes to your application container.

This method requires the monitoring agent to load libraries into your application, explicitly in the code or implicitly (by binary preloading or modified runtimes). This means that what runs in production may not be the exact code you wrote in development, implying a risk of unforeseen problems and uncontrolled software updates. You are even exposed to a crash in your application caused by a crash in the instrumentation code. Performance degradation can be an issue too, as APM requires extra work to retrieve all the data.

In addition, you are running third party code in the instrumentation. Does your security team audit the code of the APM library?

OpenTracing can be a good alternative to commercial APM, as it offers a vendor-agnostic instrumentation method. It can be used with many different open source and commercial solutions, and it has a good community that takes care of the reliability and security of the libraries. One more thing: it is under the CNCF umbrella.

The relation between APM and Golden signals is somewhat complex, because some of the signals are related to the infrastructure – like saturation – and this is usually the weakest part of an APM approach.

You can find more information about this topic here: How to instrument code: Custom metrics vs APM vs OpenTracing.

Istio

Istio is a service mesh: a layer over the applications deployed in Kubernetes that provides different features to manage networking functions, like canary deployments, intelligent routing, circuit breakers, load balancing, network policy enforcement and health checks.

One of the features that Istio provides is visibility into the services, with a limited tracing feature. It gives information about latency, errors and requests, making it a very good way to easily obtain the Golden signals. You can learn more about getting Istio metrics in our blog: How to monitor Istio.

A practical example of Golden signals in Kubernetes

As an example to illustrate the use of Golden signals, we have deployed a simple Go application with Prometheus instrumentation. This application applies a random delay between 0 and 12 seconds in order to produce usable latency data. Traffic is generated with curl, using several infinite loops.

We have included a histogram to collect metrics related to latency and requests. These metrics will help us obtain the first three Golden signals: latency, request rate and error rate. We will obtain saturation directly with Prometheus and node-exporter, using in this example the percentage of CPU used in the nodes.

We have deployed the application in a Kubernetes cluster with Prometheus and Grafana, and generated a dashboard with the Golden signals. In order to obtain the data for the dashboards, we have used these PromQL queries:

  • Latency:
    sum(greeting_seconds_sum)/sum(greeting_seconds_count) //Average
    histogram_quantile(0.95, sum(rate(greeting_seconds_bucket[5m])) by (le)) //Percentile p95
  • Request rate:
    sum(rate(greeting_seconds_count{}[2m])) //Including errors
    rate(greeting_seconds_count{code="200"}[2m]) //Only 200 OK requests
  • Errors per second:
    sum(rate(greeting_seconds_count{code!="200"}[2m]))
  • Saturation: We have used cpu percentage obtained with node-exporter:
    100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

This way we obtain this dashboard with the Golden signals:

golden signals dashboard grafana

This cluster also has the Sysdig agent installed. Sysdig allows us to obtain the same Golden signals without any instrumentation (although Sysdig can pull in Prometheus metrics too!). With Sysdig we can use a default dashboard and obtain the same meaningful information out of the box.

golden signals dashboard sysdig

Depending on the nature of the application it is possible to do different aggregations:

  • Response time segmented by response code.
  • Error rate segmented by response code.
  • CPU usage per service or deployment.

Caveats and gotchas of Golden signals in Kubernetes

  • Golden signals are one of the best ways to detect possible problems, but once a problem is detected you will need additional metrics and steps to diagnose it further. Detecting issues and resolving them are two different tasks, and they require different tools and views of the application.

  • The mean is not always meaningful; check the standard deviation too, especially with latency. Take into consideration the request path of your application when looking for bottlenecks. Use percentiles instead of averages (or in addition to them).

  • Does it make sense to alert every time CPU or load is high? Probably not. Avoid "alert burnout" by setting alerts only on parameters that are clearly indicative of problems. If an alert is not actionable, just remove it.

  • When a parameter does not look good but is not directly affecting your application, do not set an alert. Instead, create tasks in your backlog to analyze the behaviour and avoid possible issues in the long term.

