The four Golden Signals of Monitoring

Javier Martínez
Published: October 27, 2022

Golden Signals are a reduced set of metrics that offer a wide view of a service from a user or consumer perspective: Latency, Traffic, Errors and Saturation. By focusing on these, you can more quickly detect potential problems that directly affect the behavior of the application.

Google introduced the term “Golden Signals” to refer to the essential metrics that you need to measure in your applications. They are the following:

  • Errors – rate of requests that fail.
  • Saturation – consumption of your system resources.
  • Traffic – amount of use of your service per time unit.
  • Latency – the time it takes to serve a request.

This is just a set of essential signals to start monitoring in your system. In other words, if you’re wondering which signals to monitor, you will need to look at these four first.

Errors

The Errors golden signal measures the rate of requests that fail.

Note that measuring the raw number of errors might not be the best course of action. If your application has a sudden peak of requests, the number of failed requests will likely increase too, even if nothing is wrong.

That’s why monitoring systems usually focus on the error rate, calculated as the percentage of calls that fail out of the total.
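To make the difference between raw counts and rates concrete, here is a minimal Python sketch. The function name error_rate and the request counts are made up for illustration; they do not come from any particular monitoring tool.

```python
def error_rate(failed: int, total: int) -> float:
    """Return the percentage of requests that failed (0.0 when there is no traffic)."""
    if total == 0:
        return 0.0
    return 100.0 * failed / total


# Hypothetical numbers: a quiet period and a traffic peak.
quiet = error_rate(failed=5, total=1_000)
peak = error_rate(failed=50, total=10_000)

# Ten times more failed requests during the peak, yet the same error rate.
print(quiet, peak)
```

In this sketch, an alert on the raw count would fire during the peak, while an alert on the rate would stay quiet because the proportion of failures did not change.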

If you’re managing a web application, you will typically distinguish between calls returning HTTP status codes in the 400-499 range (client errors) and those in the 500-599 range (server errors).
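As a rough sketch of that split, the following Python snippet groups status codes into the two error classes. The helper name classify_status and the sample status codes are hypothetical and only serve the example.

```python
from collections import Counter


def classify_status(status: int) -> str:
    """Map an HTTP status code to 'client_error', 'server_error', or 'ok'."""
    if 400 <= status <= 499:
        return "client_error"
    if 500 <= status <= 599:
        return "server_error"
    return "ok"


# Hypothetical status codes from a handful of requests.
statuses = [200, 201, 404, 500, 503, 302, 400]

# Count how many requests fall into each class.
counts = Counter(classify_status(s) for s in statuses)
print(counts)
```

Keeping the two classes separate lets you track server errors, which usually point at your own service, apart from client errors, which often point at the caller.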

Measuring errors in Kubernetes

One good thermometer for the errors happening in Kubernetes is the Kubelet. You can use several of the metrics that the Kubelet exposes to Prometheus to measure the number of errors.

The most important one is kubelet_runtime_operations_errors_total, which indicates low-level issues in the node, like problems with the container runtime.

If you want to visualize the error ratio per operation, you can divide it by kubelet_runtime_operations_total.
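The division described above can be sketched in Python. This assumes you already have per-operation rates for both metrics; the function name, the operation names and the values below are illustrative only, not real cluster data.

```python
def error_ratio_per_operation(
    errors: dict[str, float], totals: dict[str, float]
) -> dict[str, float]:
    """Divide the error rate by the total rate for each operation.

    Operations with no traffic are skipped to avoid dividing by zero,
    and operations without recorded errors get a ratio of 0.0.
    """
    return {
        op: errors.get(op, 0.0) / total
        for op, total in totals.items()
        if total > 0
    }


# Hypothetical per-second rates, keyed by operation.
totals = {"pull_image": 4.0, "create_container": 10.0, "remove_container": 0.0}
errors = {"pull_image": 1.0, "create_container": 0.5}

ratios = error_ratio_per_operation(errors, totals)
print(ratios)
```

A ratio per operation makes it easier to see which kind of runtime operation is failing, rather than only knowing that errors exist somewhere on the node.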

Errors example

Here’s a Prometheus query for the Kubelet error rate in a Kubernetes cluster:

sum(rate(kubelet_runtime_operations_errors_total{cluster="\
