How to Monitor etcd on Kubernetes

May 2, 2017

In the following tutorial we’ll show how to monitor etcd as the registry service of a Kubernetes cluster, including metrics, alerts, health checks, performance and common failure points. But first things first,

What is etcd?

The motivation of etcd is to provide a distributed, dynamic key-value database which maintains a “configuration registry”. This registry is one of the foundations of a Kubernetes cluster: service directory, peer discovery and centralized configuration management. It bears a certain resemblance to a Redis database, classical LDAP configuration backends or even the Windows Registry, if you are more familiar with those technologies.

According to its developers, etcd aims to be:

  • Simple: well-defined, user-facing API (JSON and gRPC)
  • Secure: automatic TLS with optional client cert authentication
  • Fast: benchmarked 10,000 writes/sec
  • Reliable: properly distributed using Raft

Kubernetes uses the etcd distributed database to store its REST API objects (under the /registry directory key): pods, secrets, daemonsets, deployments, namespaces, events, etc.

# etcdctl ls /registry
/registry/pods
/registry/events
/registry/minions
/registry/ranges
/registry/replicasets
/registry/serviceaccounts
/registry/controllers
/registry/namespaces
/registry/configmaps
/registry/services
/registry/clusterrolebindings
/registry/clusterroles
/registry/secrets
/registry/daemonsets
/registry/deployments
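
Note that the listing above uses the etcd v2 API. If your cluster stores its data through the etcd v3 API instead, a roughly equivalent key listing (assuming the default endpoint) would be:

# ETCDCTL_API=3 etcdctl get /registry --prefix --keys-only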

Raft is a “consensus” algorithm, that is, a method to achieve value convergence over a distributed and fault-tolerant set of cluster nodes. It is designed to overcome network errors, clock discrepancies, race conditions, complete node failures and the attachment of new nodes, abstracting away the core complexity of assembling a distributed system.

Without going into the gory details that you will find in the referenced articles, here are the basics of what you need to know:

  • Node status can be one of: Follower, Candidate (briefly), Leader
  • If a Follower cannot locate the current Leader, it will become Candidate
  • The voting system will elect a new Leader amongst the Candidates
  • Registry value updates (commits) always go through the Leader
  • Once the Leader has received the ack from a majority of Followers, the new value is considered “committed”
  • The cluster will survive as long as a majority (quorum) of its nodes remains alive
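
If you want to see these roles in practice, each etcd v2 member reports its own Raft state over HTTP (assuming the default client port, 2379); the JSON response to the call below includes a state field that reads StateLeader or StateFollower:

# curl -s http://127.0.0.1:2379/v2/stats/self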

Perhaps the most remarkable features of etcd are its straightforward REST-like HTTP interface, which makes integrating third-party agents as simple as it gets, and its master-master protocol, which automatically elects the cluster Leader and provides a fallback mechanism to switch this role if needed.
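
For example, checking the health of a member is a single HTTP call (again assuming the default client port); a healthy member will answer with something like:

# curl -s http://127.0.0.1:2379/health
{"health": "true"}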

etcd cluster common points of failure

Most of the time your etcd cluster works so smoothly that it is easy to forget its nodes are running. Keep in mind, however, that Kubernetes absolutely needs this registry to function: a major etcd failure will seriously cripple or even take down your container infrastructure. Pods currently running will continue to run, but you cannot perform any further operations. When etcd and Kubernetes are reconnected, state inconsistencies could cause additional malfunctions.

Caveats aside, you can start by monitoring these three common issues in your cluster:

etcd monitoring and alerting

You can run etcd in Kubernetes, inside Docker containers or as an independent cluster (either in Docker containers or directly on bare metal). Usually, for simple scenarios, etcd is deployed in a Docker container like other Kubernetes services, such as the API server, the controller manager, the scheduler or the kubelet. In more advanced scenarios etcd is often an external service; in those cases you will normally see three or more nodes to achieve the required redundancy.
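
As a quick way to see which flavor you are running, if etcd is deployed as a static pod in the kube-system namespace you can list it with a command like the following (the component=etcd label is an assumption that matches kubeadm-style deployments; adjust it to your own setup):

$ kubectl get pods -n kube-system -l component=etcd -o wide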

etcd node availability

An obvious error scenario for any cluster is losing one of its nodes. The cluster will continue operating, but it is probably a good idea to receive an alert, diagnose and recover before you continue losing nodes and risk facing the next scenario: total service failure.
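
Before setting up the alert, a quick manual check with the etcd v2 tooling shows the health of every member; the output below is just illustrative, reusing the member ID and IP address that appear in the logs later in this section:

# etcdctl cluster-health
member 3f744bbb6c6ab1c3 is healthy: got healthy result from http://10.0.2.15:2379
cluster is healthy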

basic alert

Tagging your monitoring agents provides a very convenient way of defining groups that can later be used as your monitoring context. Only two agent tags are used in this scenario: the etcd nodes and the Kubernetes nodes. The basic entity that the alert test applies to is defined in the Segmented by field; this alert will perform one verification per host.mac, but we can also use host.hostName. The check waits 5 minutes before considering a node down, to avoid alerting on temporary network glitches.

You can check the log of the etcd service on one of the remaining nodes at the failure time specified in the alert:

etcd3 etcd2[962]: failed to read 3f744bbb6c6ab1c3 on stream MsgApp v2 (read tcp 10.0.2.15:39894->
etcd3 etcd2[962]: the connection with 3f744bbb6c6ab1c3 became inactive

rpc errors

Its peers have detected the follower failure and the cluster will continue working, but as you can see from the chart above, the cluster Leader is registering that all the RPC requests to this specific follower are suddenly failing.

If the node which goes offline was actually the former cluster Leader, the Raft protocol will negotiate a new one:

etcd2 etcd2[958]: raft.node: 3f744bbb6c6ab1c3 lost leader 229b0bd139298d6b at term 1275
etcd2 etcd2[958]: raft.node: 3f744bbb6c6ab1c3 elected leader 2d4437615e2062b3 at term 1275
etcd2 etcd2[958]: failed to read 229b0bd139298d6b on stream MsgApp v2
etcd2 etcd2[958]: the connection with 229b0bd139298d6b became inactive

Take into account that your cluster will stop working if you lose the majority of its nodes.

But, what happens if you dynamically add or remove etcd nodes?

This is fully supported by etcd, but quorum rules still apply: if you have a cluster with 3 nodes, add 6 new nodes and then 5 nodes go down, your cluster will become inoperative (only 4 of the 9 members remain, fewer than the required majority of 5). You can explicitly remove an etcd node using the member remove command; in that case, the removed nodes won’t count towards the total number of nodes needed for cluster quorum. Thus, if you explicitly remove 2 nodes in a 3-node scenario, the remaining node will continue working in a healthy state (but without HA); if those same 2 nodes go down unexpectedly, the cluster will fail.
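
For reference, explicitly removing a member is a two-step operation: list the members to get their IDs, then remove the one you want (the ID below is simply the one that shows up in the logs above):

# etcdctl member list
# etcdctl member remove 3f744bbb6c6ab1c3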

etcd service unavailable

As mentioned before, losing the whole etcd registry means that Kubernetes cannot read or write values in the registry, leading to malfunction or even a sudden stop. This is an event that needs to be addressed immediately by the administrators, so you want to configure this alert as “Critical”.

etcd failure

This alert will monitor all the hosts that contain an agent with the role etcd, and each one of the physical nodes (Segmented by: host.mac) will be treated as a different entity.

Since the Across field is set to any, all the samples collected for every host.mac over a minute will be evaluated separately; therefore, each one of the etcd nodes can trigger the alert independently. This could be useful, for instance, to detect a network failure that has left one of the nodes isolated.

If you select every instead, the alert will only fire if all of the groups fulfill the condition simultaneously. This approach may be more accurate if what you want to detect is a complete etcd cluster failure.

Additionally, you can configure a somewhat more complete check:

etcd failure from k8s

In this case, the alert will check that each one of the pods can connect to an etcd host/port. To avoid firing the alarm from development/test scenarios you can make use of Kubernetes namespaces and restrict the check to production pods.

For this test, you need to customize (see Sysdig Monitor docs) the etcd check, providing the target host and port.
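
If you want to verify that connectivity by hand before relying on the alert, a one-off check from inside a pod could look like the following (the namespace, pod name and etcd endpoint are placeholders for your own values):

$ kubectl exec -ti -n production my-app-pod -- curl -s http://etcd-host:2379/health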

Cluster member latency

Another metric to consider is the member latency: delay until a Follower achieves data coherence with the cluster Leader.

latencies

In a testing scenario running on a local network, the sync latencies are fairly stable around 0.01-0.02 sec.
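
These per-Follower latencies are reported by the cluster Leader itself; with the etcd v2 API you can query the raw values directly (assuming the default client port), and the response lists, for each Follower, the current, average and standard deviation of the latency:

# curl -s http://127.0.0.1:2379/v2/stats/leader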

This is how we set an alert when this latency goes higher than 0.1 sec for any of the members, possibly signaling network congestion, packet loss, etc:

latency alert

Now, if we simulate a network jam in the internal LAN which interconnects the current etcd Leader with one of its Followers, we will see something like this:

leader latency

The “Leader Latency by Follower” graph, which you can find by browsing the etcd metrics, evidences the synchronization lag problem.

Sudden short bursts of latency can show up and disappear without any intervention. To avoid a noisy alert that does not require any specific action, you may prefer to configure the alert to trigger only if the high latencies are sustained, for example over 10+ minutes, as opposed to the at least once condition used in the example.

Monitoring etcd performance

From the “Explore” tab in Sysdig Monitor, you can filter the metrics by the string “etcd” in the lower left panel and look through all the etcd views and metrics available. You can choose the ones more relevant to your use case and compose a custom board.

Conveniently enough, if you go to the “Dashboards” tab and add a new dashboard, there is already a preconfigured “etcd” one where you can check the most useful metrics:

etcd dash

To simulate a high load scenario, let’s reproduce a common situation: you need to upgrade the software and/or hardware of one of the Kubernetes nodes. Moving all the containers away and later re-filling the node will generate a considerable amount of stress on the etcd registry.

Taking a quick look at the etcd dashboard, you realize that the “idle” state for this cluster is around 40 atomic CompareAndSwap cluster operations per second:

compareandswap
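
The counters behind these operations are also exposed by etcd itself; with the v2 API you can inspect its cumulative store statistics, which include compareAndSwapSuccess and compareAndSwapFail counters, among others (again assuming the default client port):

# curl -s http://127.0.0.1:2379/v2/stats/store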

Let’s configure a load warning if this metric reaches 100+ operations per second (on average):

100 ops

According to the “Overview” table, the cluster is already withstanding considerable loads:

cluster load

Now, let’s proceed to drain one of the Kubernetes workers, forcing a cluster re-balance:

$ kubectl drain --grace-period=10 kubeworker2

cluster load peak

If you check your etcd dashboard, there is a noticeable peak during this operation, more than enough to trigger the warning message. Un-cordoning the node will cause a similar peak while the pods return to the refreshed node:

$ kubectl uncordon kubeworker2

Additional etcd metrics

The default Sysdig Monitor dashboard for etcd will show you the most relevant metrics, and the graphs are a nice, intuitive place to start. But there is more to it: some metrics are not included in the default view, but you can add the ones you might be interested in.

To name a few additional metrics:

  • Standard deviation etcd.leader.latency.stddev: If you are expecting high loads of registry updates, for example during a node cordon/uncordon, an etcd member showing a higher standard deviation may be more relevant than the absolute load value per se.
  • Bandwidth and packets per second etcd.self.send.pkgrate, etcd.self.recv.pkgrate: In addition to the number of successful/failed operations shown in the graphs, you can also get a measure of the synchronization packets exchanged between the Leader and Followers, which could be useful to plan network capacity requirements.
  • Service availability with etcd.can_connect: This metric was already used in an alert above. The Sysdig Monitor agent checks service availability; simple but very useful.

Eventually, in a complex enough deployment, you are going to wish for a metric that is rather specific to your scenario. To continue with the article’s context, imagine that you want a metric and alerts for the etcd database size:

etcd1 etcd2 # pwd
/var/lib/etcd2
etcd1 etcd2 # du -csh
324M    .
324M    total

If you know which commands to use and the specific output you wish to monitor, you can always write your own custom checks and run them within the Sysdig Monitor agent.
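
As a minimal sketch (the exact plumbing into the agent is out of scope here), the raw value such a check would report can be obtained in bytes with a single command, reusing the data directory from the example above:

etcd1 etcd2 # du -sb /var/lib/etcd2 | cut -f1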

Conclusions

etcd is a simple and robust service which is required to deploy a Kubernetes cluster. Even though the Raft distributed consensus algorithm is able to overcome most temporary network failures, node losses, cluster splits, etc., if you are running Kubernetes in production it is essential to monitor and set up alerts on relevant cluster events before it’s too late.

If you dig into the list of etcd error codes, there are of course more advanced cases than the ones covered in this brief article: maximum number of peers in the cluster, anomaly detection between the Kubernetes and etcd nodes, Raft internal errors, registry size, etc., to name a few, although we leave those for a future second part of this article.

Monitoring etcd with Sysdig Monitor is really easy: just one tool to monitor both etcd and Kubernetes. The Sysdig Monitor agent will collect all the etcd metrics, and you can quickly set up the most important etcd alerts. If you haven’t tried Sysdig Monitor yet, you are just one click away from our free Trial!




Eager to learn more? Check out our online session: Building an Open Source Container Security Stack

In this session, Sysdig and Anchore present how you can use Falco and Anchore Engine to build a complete open source container security stack for Docker and Kubernetes.

This online session will live demo:

  • Using Falco, NATS and Kubeless to build a Kubernetes response engine and implement real-time attack remediation with security playbooks using FaaS.
  • How Anchore Engine can detect software vulnerabilities in your images, and how it can be integrated with Jenkins, Kubernetes and Falco.

