The four Golden Signals of Kubernetes monitoring

By Javier Martínez - OCTOBER 27, 2022
Golden Signals Cover

Golden Signals are a reduced set of metrics that offer a wide view of a service from a user or consumer perspective: Latency, Traffic, Errors and Saturation. By focusing on these, you can be quicker at detecting potential problems that might be directly affecting the behavior of the application.

Google introduced the term “Golden Signals” to refer to the essential metrics that you need to measure in your applications. They are the following:

  • Errors – rate of requests that fail.
  • Saturation – consumption of your system resources.
  • Traffic – amount of use of your service per time unit.
  • Latency – the time it takes to serve a request.
The four Golden Signals of Kubernetes monitoring


This is just a set of essential signals to start monitoring in your system. In other words, if you’re wondering which signals to monitor, you will need to look at these four first.

Enter: Goldilocks and the four Monitoring Signals

Once upon a time, there was a little girl called Goldilocks, who lived at the other side of the wood and had been sent on an errand by her mother, passed by the house, and looked in at the window…

Errors

Goldilocks then tried the little chair, which belonged to the Little Bear, and found it just right, but she sat in it so hard that she broke it.

The four Golden Signals of Kubernetes monitoring
The error rate for the chairs is ⅓

The Errors golden signal measures the rate of requests that fail.

Note that measuring the bulk amount of errors might not be the best course of action. If your application has a sudden peak of requests, then logically the amount of failed requests may increase.

That’s why usually monitoring systems focus on the error rate, calculated as the percent of calls that are failing from the total.

If you’re managing a web application, typically you will discriminate between those calls returning HTTP status in the 400-499 range (client errors) and 500-599 (server errors).

Measuring errors in Kubernetes

One thermometer for the errors happening in Kubernetes is the Kubelet. You can use several Kubernetes State Metrics in Prometheus to measure the amount of errors.

The most important one is kubelet_runtime_operations_errors_total, which indicates low level issues in the node, like problems with container runtime.

If you want to visualize errors per operation, you can use kubelet_runtime_operations_total to divide.

Errors example

Here’s the Kubelet Prometheus metric for error rate in a Kubernetes cluster:

sum(rate(kubelet_runtime_operations_errors_total{cluster="", job="kubelet", metrics_path="/metrics"}[$__rate_interval])) by (instance, operation_type)
Code language: JavaScript (javascript)
The four Golden Signals of Kubernetes monitoring

Saturation

Goldilocks tasted the porridge in the dear little bowl, and it was just right, and it tasted so good that she tasted and tasted, and tasted and tasted until she was full.

The four Golden Signals of Kubernetes monitoring
After eating one small bowl, Goldilocks is unable to eat more. That’s saturation.

Saturation measures the consumption of your system resources, usually as a percentage of the maximum capacity. Examples include:

  • CPU usage
  • Disk space
  • Memory usage
  • Network bandwidth

In the end, cloud applications run on machines, which have a limited amount of these resources.

In order to correctly measure, you should be aware of the following:

  • What are the consequences if the resource is depleted? It could be that your entire system is unusable because this space has run out. Or maybe further requests are throttled until the system is less saturated.
  • Saturation is not always about resources about to be depleted. It’s also about over-resourcing, or allocating a higher quantity of resources than what is needed. This one is crucial for cost savings.

Measuring saturation in Kubernetes

Since saturation depends on the resource being observed, you can use different metrics for Kubernetes entities:

  • node_cpu_seconds_total to measure machine CPU utilization.
  • container_memory_usage_bytes to measure the memory utilization at container level (paired with container_memory_max_usage_bytes).
  • The amount of Pods that a Node can contain is also a Kubernetes resource.

Saturation example

Here’s a PromQL example of a Saturation signal, measuring CPU usage percent in a Kubernetes node.

100 - (avg by (instance) (rate(node_cpu_seconds_total{}[5m])) * 100)
Code language: JavaScript (javascript)
The four Golden Signals of Kubernetes monitoring

Traffic

And the Middle-sized Bear said:
“Somebody has been tumbling my bed!”
And the Little bear piped:
“Somebody has been tumbling my bed, and here she is!”

The four Golden Signals of Kubernetes monitoring
One of the beds is being used, but none should be being used instead. That’s an unusual traffic.

Traffic measures the amount of use of your service per time unit.

In essence, this will represent the usage of your current service. This is important not only for business reasons, but also to detect anomalies.

Is the amount of requests too high? This could be due to a peak of users or because of a misconfiguration causing retries.

Is the amount of requests too low? That may reflect that one of your systems is failing.

Still, traffic signals should always be measured with a time reference. As an example, this blog receives more visits from Tuesday to Thursday.

Depending on your application, you could be measuring traffic by:

  • Requests per minute for a web application
  • Queries per minute for a database application
  • Endpoint requests per minute for an API

Traffic example

Here’s a Google Analytics chart displaying traffic distributed by hour:

The four Golden Signals of Kubernetes monitoring

Latency

At that, Goldilocks woke in a fright, and jumped out of the window and ran away as fast as her legs could carry her, and never went near the Three Bears’ snug little house again.

The four Golden Signals of Kubernetes monitoring
Goldilocks ran down the stairs in just two seconds. That’s a very low latency.

Latency is defined as the time it takes to serve a request.

Average latency

When working with latencies, your first impulse may be to measure average latency, but depending on your system that might not be the best idea. There may be very fast or very slow requests distorting the results.

Instead, consider using a percentile, like p99, p95, and p50 (also known as median) to measure how the fastest 99%, 95%, or 50% of requests, respectively, took to complete.

Failed vs. successful

When measuring latency, it’s also important to discriminate between failed and successful requests, as failed ones might take sensibly less time than the correct ones.

Apdex Score

As described above, latency information may not be informative enough:

  • Some users might perceive applications as slower, depending on the action they are performing.
  • Some users might perceive applications as slower, based on the default latencies of the industry.

This is where the Apdex (Application Performance Index) comes in. It’s defined as:

The four Golden Signals of Kubernetes monitoring

Where t is the target latency that we consider as reasonable.

  • Satisfied will represent the amount of users with requests under the target latency.
  • Tolerant will represent the amount of non-satisfied users with requests below four times the target latency.
  • Frustrated will represent the amount of users with requests above the tolerant latency.

The output for the formula will be an index from 0 to 1, indicating how performant our system is in terms of latency.

Measuring latency in Kubernetes

In order to measure the latency in your Kubernetes cluster, you can use metrics like http_request_duration_seconds_sum.

You can also measure the latency for the api-server by using Prometheus metrics like apiserver_request_duration_seconds.

Latency example

Here’s an example of a Latency PromQL query for the 95% best performing HTTP requests in Prometheus:

histogram_quantile(0.95, sum(rate(prometheus_http_request_duration_seconds_bucket[5m])) by (le))
Code language: JavaScript (javascript)
The four Golden Signals of Kubernetes monitoring

RED Method

The RED Method was created by Tom Wilkie, from Weaveworks. It is heavily inspired by the Golden Signals and it’s focused on microservices architectures.

RED stands for:

  • Rate
  • Error
  • Duration

Rate measures the number of requests per second (equivalent to Traffic in the Golden Signals).

Error measures the number of failed requests (similar to the one in Golden Signals).

Duration measures the amount of time to process a request (similar to Latency in Golden Signals).

USE Method

The USE Method was created by Brendan Gregg and it’s used to measure infrastructure.

USE stands for:

  • Utilization
  • Saturation
  • Errors

That means for every resource in your system (CPU, disk, etc.), you need to check the three elements above.

Utilization is defined as the percentage of usage for that resource.

Saturation is defined as the queue for requests in the system.

Errors is defined as the number of errors happening in the system.

While it may not be intuitive, Saturation in Golden Signals is not similar to the Saturation in USE, but rather Utilization.

The four Golden Signals of Kubernetes monitoring

A practical example of Golden signals in Kubernetes

As an example to illustrate the use of Golden Signals, here’s a simple go application example with Prometheus instrumentation. This application will apply a random delay between 0 and 12 seconds in order to give usable information of latency. Traffic will be generated with curl, with several infinite loops.

An histogram was included to collect metrics related to latency and requests. These metrics will help us obtain the initial three Golden Signals: latency, request rate and error rate. To obtain saturation directly with Prometheus and node-exporter, use percentage of CPU in the nodes.


File: main.go ------------- package main import ( "fmt" "log" "math/rand" "net/http" "time" "github.com/gorilla/mux" "github.com/prometheus/client_golang/prometheus/promhttp" ) func main() { //Prometheus: Histogram to collect required metrics histogram := prometheus.NewHistogramVec(prometheus.HistogramOpts{ Name: "greeting_seconds", Help: "Time take to greet someone", Buckets: []float64{1, 2, 5, 6, 10}, //Defining small buckets as this app should not take more than 1 sec to respond }, []string{"code"}) //This will be partitioned by the HTTP code. router := mux.NewRouter() router.Handle("/sayhello/{name}", Sayhello(histogram)) router.Handle("/metrics", promhttp.Handler()) //Metrics endpoint for scrapping router.Handle("/{anything}", Sayhello(histogram)) router.Handle("/", Sayhello(histogram)) //Registering the defined metric with Prometheus prometheus.Register(histogram) log.Fatal(http.ListenAndServe(":8080", router)) } func Sayhello(histogram *prometheus.HistogramVec) http.HandlerFunc { return func(w http.ResponseWriter, r *http.Request) { //Monitoring how long it takes to respond start := time.Now() defer r.Body.Close() code := 500 defer func() { httpDuration := time.Since(start) histogram.WithLabelValues(fmt.Sprintf("%d", code)).Observe(httpDuration.Seconds()) }() if r.Method == "GET" { vars := mux.Vars(r) code = http.StatusOK if _, ok := vars["anything"]; ok { //Sleep random seconds rand.Seed(time.Now().UnixNano()) n := rand.Intn(2) // n will be between 0 and 3 time.Sleep(time.Duration(n) * time.Second) code = http.StatusNotFound w.WriteHeader(code) } //Sleep random seconds rand.Seed(time.Now().UnixNano()) n := rand.Intn(12) //n will be between 0 and 12 time.Sleep(time.Duration(n) * time.Second) name := vars["name"] greet := fmt.Sprintf("Hello %s \n", name) w.Write([]byte(greet)) } else { code = http.StatusBadRequest w.WriteHeader(code) } } }
Code language: JavaScript (javascript)

The application was deployed in a Kubernetes cluster with Prometheus and Grafana, and generated a dashboard with Golden Signals. In order to obtain the data for the dashboards, these are the PromQL queries:

Latency:

sum(greeting_seconds_sum)/sum(greeting_seconds_count) //Average histogram_quantile(0.95, sum(rate(greeting_seconds_bucket[5m])) by (le)) //Percentile p95
Code language: JavaScript (javascript)

Request rate:

sum(rate(greeting_seconds_count{}[2m])) //Including errors rate(greeting_seconds_count{code="200"}[2m]) //Only 200 OK requests
Code language: JavaScript (javascript)

Errors per second:

sum(rate(greeting_seconds_count{code!="200"}[2m]))
Code language: JavaScript (javascript)

Saturation:

100 - (avg by (instance) (irate(node_cpu_seconds_total{}[5m])) * 100)
Code language: JavaScript (javascript)

Conclusion

Golden Signals, RED, and USE are just guidelines on what you should be focusing on when looking at your systems. But these are just the bare minimum on what to measure.

Understand the errors in your system. They will be a thermometer of all the other metrics, as they will point to any unusual behavior. And remember that you need to correctly mark requests as erroneous, but only the ones that should be exceptionally incorrect. Otherwise, your system will be prone to false positives or false negatives.

Measure latency of your requests. Try to understand your bottlenecks and what the negative experiences are when latency is higher than expected.

Visualize saturation and understand the resources involved in your solution. What are the consequences if a resource gets depleted?

Measure traffic to understand your usage curves. You will be able to find the best time to take down your system for an update, or you could be alerted when there’s an unexpected amount of users.

Once metrics are in place, it’s important to set up alerts, which will notify you in case any of these metrics reach a certain threshold.


Track golden signals easily with Sysdig Monitor

With Sysdig Monitor, you can quickly review the golden signals in your system, out of the box.

Review easily the Latency, Errors, Saturation and Traffic for the Pods in your cluster. And thanks to its Container Observability with eBPF, you can do this without adding any app or code instrumentation.

Sysdig Advisor accelerates mean time to resolution (MTTR) with live logs, performance data, and suggested remediation steps. It’s the easy button for Kubernetes troubleshooting!

Sysdig Monitor Golden Signals

Try it free for 30 days!