How to monitor CoreDNS


The most common problems and outages in a Kubernetes cluster come from CoreDNS, so learning how to monitor CoreDNS is crucial.

Imagine that your frontend application suddenly goes down. After some time investigating, you discover it’s not resolving the backend endpoint because the DNS server keeps returning SERVFAIL responses. The sooner you can reach this conclusion, the faster you can recover your application.

Monitoring CoreDNS can give you time to fix issues before they bring your cluster down at the worst possible moment, when it’s too late.

Keep calm and check DNS

What is CoreDNS?

CoreDNS has been the default DNS server in Kubernetes since v1.12, replacing kube-dns, and it’s the recommended option. It’s a key component, as each pod and service gets a fully qualified domain name (FQDN). If DNS goes down, your whole cluster goes down with it.

How to monitor CoreDNS

You usually see CoreDNS running on your master nodes, but it can also run on bare metal to provide service discovery in non-Kubernetes environments that use containers, like Docker.

Getting metrics from CoreDNS

CoreDNS is instrumented and, like the rest of the Kubernetes control plane components, exposes Prometheus metrics, in this case on port 9153. These metrics provide information about the requests to the DNS server and its internal plugins. Depending on the size of the cluster, CoreDNS may run one or more replicas, and you’ll need to scrape each of them.

You can get the metrics by querying the endpoint:

curl localhost:9153/metrics

And it will return a long list of metrics with this structure (truncated):

# HELP coredns_build_info A metric with a constant '1' value labeled by version, revision, and goversion from which CoreDNS was built.
# TYPE coredns_build_info gauge
coredns_build_info{goversion="go1.14.4",revision="f59c03d",version="1.7.0"} 1
# HELP coredns_cache_entries The number of elements in the cache.
# TYPE coredns_cache_entries gauge
coredns_cache_entries{server="dns://:53",type="denial"} 41
coredns_cache_entries{server="dns://:53",type="success"} 15
# HELP coredns_cache_hits_total The count of cache hits.
# TYPE coredns_cache_hits_total counter
coredns_cache_hits_total{server="dns://:53",type="denial"} 366066
coredns_cache_hits_total{server="dns://:53",type="success"} 135
# HELP coredns_cache_misses_total The count of cache misses.
# TYPE coredns_cache_misses_total counter
coredns_cache_misses_total{server="dns://:53"} 106654
# HELP coredns_dns_request_duration_seconds Histogram of the time (in seconds) each request took.
# TYPE coredns_dns_request_duration_seconds histogram
coredns_dns_request_duration_seconds_bucket{server="dns://:53",type="A",zone=".",le="0.00025"} 189356
coredns_dns_request_duration_seconds_bucket{server="dns://:53",type="A",zone=".",le="0.0005"} 189945
coredns_dns_request_duration_seconds_bucket{server="dns://:53",type="A",zone=".",le="0.001"} 190102
coredns_dns_request_duration_seconds_bucket{server="dns://:53",type="A",zone=".",le="0.002"} 235026
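The cache counters shown above can be combined into a cache hit ratio, which is a quick health indicator for the cache plugin. As a sketch (the `job="kube-dns"` label assumes the scrape job configured in this article; adjust it to your setup):

```promql
# Fraction of DNS queries answered from cache, per server block
sum(rate(coredns_cache_hits_total{job="kube-dns"}[5m])) by (server)
  /
(
  sum(rate(coredns_cache_hits_total{job="kube-dns"}[5m])) by (server)
  + sum(rate(coredns_cache_misses_total{job="kube-dns"}[5m])) by (server)
)
```

A sudden drop in this ratio usually means more queries are being forwarded upstream, which increases latency.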

To monitor CoreDNS with Prometheus, you just have to add the corresponding scrape job:

- job_name: kube-dns
  honor_labels: true
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: keep
    source_labels:
    - __meta_kubernetes_namespace
    - __meta_kubernetes_pod_name
    separator: '/'
    regex: 'kube-system/coredns.+'
  - source_labels:
    - __meta_kubernetes_pod_container_port_name
    action: keep
    regex: metrics
  - source_labels:
    - __meta_kubernetes_pod_name
    action: replace
    target_label: instance
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)

Monitor CoreDNS: What to look for?

Disclaimer: CoreDNS metrics might differ between Kubernetes versions. Here, we used Kubernetes 1.18 and CoreDNS 1.7.0. You can check the metrics available for your version in the Kubernetes repo (link for the 1.18.8 version).

Request latency: Following the golden signals, request latency is an important metric for detecting any degradation in the service. To check it, always compare a high percentile (such as the 99th) against the average. In Prometheus, you can compute percentiles with the histogram_quantile function:

histogram_quantile(0.99, sum(rate(coredns_dns_request_duration_seconds_bucket{job="kube-dns"}[5m])) by(server, zone, le))
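To get the average that the percentile should be compared against, you can divide the histogram’s running sum by its count. A sketch of that query, using the same job label as above:

```promql
# Average request duration per server and zone over the last 5 minutes
sum(rate(coredns_dns_request_duration_seconds_sum{job="kube-dns"}[5m])) by (server, zone)
  /
sum(rate(coredns_dns_request_duration_seconds_count{job="kube-dns"}[5m])) by (server, zone)
```

If the 99th percentile drifts far above this average, a subset of queries is being served much more slowly than the rest.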

Error rate: The error rate is another golden signal you have to monitor. Although errors are not always caused by DNS failing, it’s still a key metric that you have to watch carefully. One of the key CoreDNS error metrics is coredns_dns_responses_total, where the response code (rcode) is also relevant. For example, the NXDOMAIN code means that a DNS query failed because the queried domain name does not exist.

# HELP coredns_dns_responses_total Counter of response status codes.
# TYPE coredns_dns_responses_total counter
coredns_dns_responses_total{rcode="NOERROR",server="dns://:53",zone="."} 1336
coredns_dns_responses_total{rcode="NXDOMAIN",server="dns://:53",zone="."} 471519
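From these counters you can derive an error ratio. As a sketch (the `job="kube-dns"` label assumes the scrape job configured earlier; also note that NXDOMAIN is not always a real failure, since negative answers are expected for non-existent names, so you may want to filter the rcodes to match your environment):

```promql
# Fraction of DNS responses with an rcode other than NOERROR
sum(rate(coredns_dns_responses_total{job="kube-dns",rcode!="NOERROR"}[5m])) by (server)
  /
sum(rate(coredns_dns_responses_total{job="kube-dns"}[5m])) by (server)
```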

Monitor CoreDNS metrics in Sysdig Monitor

Similar to when we monitor etcd, the CoreDNS pods are not annotated for scraping by default. In order to get the metrics with the Sysdig agent, you have to annotate them.

Even easier, you can follow the steps given in PromCat to monitor the entire control plane, and federate a Prometheus server with only the metrics you need, discarding everything else.

If you already have Helm and helmfile, the process will be straightforward. If not, you can easily install Helm by following the official instructions, and then install helmfile. With these two tools, you can deploy a Prometheus server with the right rules and configuration out-of-the-box. This includes the corresponding files: helmfile.yaml, recording_rules.yaml, prometheus.yaml, and prometheus.yml.gotmpl.

You just have to execute this line:

helmfile sync

Once you have installed the Prometheus server, the next step is to configure the Sysdig agent. You can do it by copying the following into the configmap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: sysdig-agent
  namespace: sysdig-agent
data:
  prometheus.yaml: |-
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    scrape_configs:
    - job_name: 'prometheus' # config for federation
      honor_labels: true
      metrics_path: '/federate'
      metric_relabel_configs:
      - regex: 'kubernetes_pod_name'
        action: labeldrop
      params:
        'match[]':
          - '{sysdig="true"}'
      sysdig_sd_configs:
      - tags:
          namespace: monitoring

To add this control-plane dashboard from PromCat, just execute this command:

docker run -it --rm \
    sysdiglabs/promcat-connect:0.1 \
    install \
    kubernetes-control-plane:1.18.0 \
    -t YOUR-API-TOKEN

Conclusion

CoreDNS is the most common source of issues in a cluster. If DNS fails, then a lot of services (if not all of them) will fail, and your application will be down. Monitoring CoreDNS can help you fix issues before they become a problem, or troubleshoot and recover from problems faster.

Monitoring CoreDNS with Sysdig Monitor is really easy. With just one tool, you can monitor both CoreDNS and Kubernetes. The Sysdig Monitor agent will collect all of the CoreDNS metrics, and you can quickly set up the most important CoreDNS alerts.

If you haven’t tried Sysdig Monitor yet, you are just one click away from our free trial!

