Many of the most common problems and outages in a Kubernetes cluster come from CoreDNS, so learning how to monitor CoreDNS is crucial.
Imagine that your frontend application suddenly goes down. After some time investigating, you discover it’s not resolving the backend endpoint because the DNS server keeps returning SERVFAIL errors. The sooner you can get to this conclusion, the faster you can recover your application.
Monitoring CoreDNS gives you time to fix issues before they take your cluster down at the worst possible moment, when it’s too late.
What is CoreDNS?
CoreDNS has been the default DNS server since Kubernetes v1.12, replacing kube-dns, and it’s the recommended DNS server. It’s a key component, as each pod and service has a fully qualified domain name (FQDN). If the cluster DNS goes down, your whole cluster effectively goes down with it.
How to monitor CoreDNS
You usually see CoreDNS running on your master nodes, but it can also run on bare metal to provide service discovery in non-Kubernetes environments that use containers, like Docker.
Getting metrics from CoreDNS
CoreDNS is instrumented and, like the rest of the Kubernetes control plane components, exposes Prometheus metrics, in this case on port 9153. These metrics provide information about requests to the DNS server and its plugins. Depending on the size of the cluster, CoreDNS may run one or more replicas, and you’ll need to scrape each of them.
You can get the metrics by querying the endpoint:
curl localhost:9153/metrics
It will return a long list of metrics with this structure (truncated):
# HELP coredns_build_info A metric with a constant '1' value labeled by version, revision, and goversion from which CoreDNS was built.
# TYPE coredns_build_info gauge
coredns_build_info{goversion="go1.14.4",revision="f59c03d",version="1.7.0"} 1
# HELP coredns_cache_entries The number of elements in the cache.
# TYPE coredns_cache_entries gauge
coredns_cache_entries{server="dns://:53",type="denial"} 41
coredns_cache_entries{server="dns://:53",type="success"} 15
# HELP coredns_cache_hits_total The count of cache hits.
# TYPE coredns_cache_hits_total counter
coredns_cache_hits_total{server="dns://:53",type="denial"} 366066
coredns_cache_hits_total{server="dns://:53",type="success"} 135
# HELP coredns_cache_misses_total The count of cache misses.
# TYPE coredns_cache_misses_total counter
coredns_cache_misses_total{server="dns://:53"} 106654
# HELP coredns_dns_request_duration_seconds Histogram of the time (in seconds) each request took.
# TYPE coredns_dns_request_duration_seconds histogram
coredns_dns_request_duration_seconds_bucket{server="dns://:53",type="A",zone=".",le="0.00025"} 189356
coredns_dns_request_duration_seconds_bucket{server="dns://:53",type="A",zone=".",le="0.0005"} 189945
coredns_dns_request_duration_seconds_bucket{server="dns://:53",type="A",zone=".",le="0.001"} 190102
coredns_dns_request_duration_seconds_bucket{server="dns://:53",type="A",zone=".",le="0.002"} 235026
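This exposition format is plain text, so it’s easy to inspect programmatically. As a quick sketch (a simplified parser for illustration only, not a full Prometheus text-format parser; it ignores escapes and timestamps):

```python
import re

# Matches lines like: metric_name{label="value",...} 42
LINE_RE = re.compile(r'^(\w+)(?:\{(.*)\})?\s+(\S+)$')

def parse_metrics(text):
    """Return a list of (name, labels dict, float value) samples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):  # skip HELP/TYPE comment lines
            continue
        m = LINE_RE.match(line)
        if not m:
            continue
        name, raw_labels, value = m.groups()
        labels = dict(re.findall(r'(\w+)="([^"]*)"', raw_labels or ''))
        samples.append((name, labels, float(value)))
    return samples

sample = '''\
# HELP coredns_cache_entries The number of elements in the cache.
# TYPE coredns_cache_entries gauge
coredns_cache_entries{server="dns://:53",type="denial"} 41
coredns_cache_entries{server="dns://:53",type="success"} 15
'''
for name, labels, value in parse_metrics(sample):
    print(name, labels, value)
```

In practice, you’d let Prometheus scrape and store these samples; a snippet like this is only handy for quick debugging of what an endpoint exposes.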
To monitor CoreDNS with Prometheus, you just have to add the corresponding scrape job:
- job_name: kube-dns
  honor_labels: true
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    - action: keep
      source_labels:
        - __meta_kubernetes_namespace
        - __meta_kubernetes_pod_name
      separator: '/'
      regex: 'kube-system/coredns.+'
    - source_labels:
        - __meta_kubernetes_pod_container_port_name
      action: keep
      regex: metrics
    - source_labels:
        - __meta_kubernetes_pod_name
      action: replace
      target_label: instance
    - action: labelmap
      regex: __meta_kubernetes_pod_label_(.+)
Monitor CoreDNS: What to look for?
Disclaimer: CoreDNS metrics might differ between Kubernetes versions. Here, we used Kubernetes 1.18 and CoreDNS 1.7.0. You can check the metrics available for your version in the Kubernetes repo (link for the 1.18.8 version).
Request latency: Following the golden signals, the latency of a request is an important metric for detecting any degradation in the service. To check this, always compare a high percentile (such as the 99th) against the average. The way to do this in Prometheus is with the histogram_quantile function:
histogram_quantile(0.99, sum(rate(coredns_dns_request_duration_seconds_bucket{job="kube-dns"}[5m])) by(server, zone, le))
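For comparison, you can compute the average latency over the same window from the histogram’s companion series. This is a sketch assuming the usual Prometheus histogram convention, where _sum and _count series are exposed alongside the _bucket series shown above:

```promql
sum(rate(coredns_dns_request_duration_seconds_sum{job="kube-dns"}[5m])) by(server, zone)
/
sum(rate(coredns_dns_request_duration_seconds_count{job="kube-dns"}[5m])) by(server, zone)
```

A p99 that drifts far above the average is a sign that a small fraction of queries is getting much slower, which a plain average would hide.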
Error rate: The error rate is another golden signal you have to monitor. Although errors are not always caused by the DNS server failing, it’s still a key metric that you have to watch carefully. The key CoreDNS error metric is coredns_dns_responses_total, and its rcode label tells you the response code. For example, the NXDOMAIN code means that a DNS query failed because the queried domain name does not exist.
# HELP coredns_dns_responses_total Counter of response status codes.
# TYPE coredns_dns_responses_total counter
coredns_dns_responses_total{rcode="NOERROR",server="dns://:53",zone="."} 1336
coredns_dns_responses_total{rcode="NXDOMAIN",server="dns://:53",zone="."} 471519
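To turn these counters into a rate, you can compute the share of non-NOERROR responses. This is a sketch; the label names follow the coredns_dns_responses_total samples above, and the job label assumes the kube-dns scrape job defined earlier:

```promql
sum(rate(coredns_dns_responses_total{job="kube-dns",rcode!="NOERROR"}[5m])) by(server, zone)
/
sum(rate(coredns_dns_responses_total{job="kube-dns"}[5m])) by(server, zone)
```

Keep in mind that some NXDOMAIN responses are normal in Kubernetes, since the resolver’s search-path expansion routinely queries names that don’t exist, so alert on changes in the ratio rather than on its absolute value.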
Monitor CoreDNS metrics in Sysdig Monitor
Similar to when we monitor etcd, the CoreDNS pod is not annotated by default. In order to scrape its metrics with the Sysdig agent, you have to annotate it.
Even easier, you can follow the steps given in PromCat to monitor the entire control plane, and federate a Prometheus server with only the metrics you need, discarding everything else.
If you already have helm and helmfile, the process is straightforward. If not, you can easily install helm by following the official instructions, then install helmfile. With these two tools, you can deploy a Prometheus server with the right rules and configuration out-of-the-box. This includes the corresponding files: helmfile.yaml, recording_rules.yaml, prometheus.yaml, and prometheus.yml.gotmpl.
You just have to execute this line:
helmfile sync
Once you have installed the Prometheus server, the next step is to configure the Sysdig agent. You can do it by copying the following into the ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
  name: sysdig-agent
  namespace: sysdig-agent
data:
  prometheus.yaml: |-
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    scrape_configs:
      - job_name: 'prometheus' # config for federation
        honor_labels: true
        metrics_path: '/federate'
        metric_relabel_configs:
          - regex: 'kubernetes_pod_name'
            action: labeldrop
        params:
          'match[]':
            - '{sysdig="true"}'
        sysdig_sd_configs:
          - tags:
              namespace: monitoring
To add this control-plane dashboard from PromCat, just execute this command:
docker run -it --rm \
  sysdiglabs/promcat-connect:0.1 \
  install \
  kubernetes-control-plane:1.18.0 \
  -t YOUR-API-TOKEN
Conclusion
CoreDNS is one of the most common sources of issues in a cluster. If DNS fails, many services (if not all of them) will fail, and your application will be down. Monitoring CoreDNS can help you fix issues before they become a problem, or troubleshoot and recover from problems faster.
Monitoring CoreDNS with Sysdig Monitor is really easy. With just one tool, you can monitor both CoreDNS and Kubernetes. The Sysdig Monitor agent will collect all of the CoreDNS metrics, and you can quickly set up the most important CoreDNS alerts.
If you haven’t tried Sysdig Monitor yet, you are just one click away from our free Trial!