How to monitor Kubelet

Monitoring Kubelet is key when running Kubernetes in production.

Kubelet is a very important service inside Kubernetes’ control plane. It’s the component that ensures the containers described by pods are actually running on the nodes. Kubelet works in a declarative way: it receives PodSpecs and ensures that the current state matches the desired pods.

Kubelet differs from the other control plane components in that it is the only one that runs directly on the host OS of each node, rather than as a Kubernetes entity. This makes Kubelet monitoring a little special, but we can still rely on Prometheus service discovery (node role).

Diagram: how the Kubelet fits in the Kubernetes control plane

Getting metrics from Kubelet

Kubelet is instrumented and exposes Prometheus metrics by default on port 10255 of the host, providing information about pods, volumes, and internal operations. This endpoint can be easily scraped, obtaining useful information without the need for additional scripts or exporters.

You can scrape Kubelet metrics by accessing the port on the node directly, without authentication:

curl http://[Node_Internal_IP]:10255/metrics

If the container has access to the host network, you can also access it via localhost.

Note that the port and address may vary depending on your particular configuration.

It will return a long list of metrics with this structure (truncated):

# HELP apiserver_audit_event_total Counter of audit events generated and sent to the audit backend.
# TYPE apiserver_audit_event_total counter
apiserver_audit_event_total 0
# HELP apiserver_audit_requests_rejected_total Counter of apiserver requests rejected due to an error in audit logging backend.
# TYPE apiserver_audit_requests_rejected_total counter
apiserver_audit_requests_rejected_total 0
# HELP apiserver_client_certificate_expiration_seconds Distribution of the remaining lifetime on the certificate used to authenticate a request.
# TYPE apiserver_client_certificate_expiration_seconds histogram
apiserver_client_certificate_expiration_seconds_bucket{le="0"} 0
apiserver_client_certificate_expiration_seconds_bucket{le="1800"} 0
...

If we want to configure Prometheus to scrape Kubelet, we can add this job to our targets:
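
The exact job depends on your setup, but a minimal sketch using the Prometheus node service discovery role could look like the following. The job name, the labelmap relabeling, and the rewrite of the target address to the read-only port 10255 are illustrative assumptions; adjust them to your cluster:

scrape_configs:
  - job_name: 'kubernetes-nodes'
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      # Keep the Kubernetes node labels as Prometheus labels.
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      # Rewrite the discovered address (Kubelet port 10250 by default)
      # to the read-only port used in this post.
      - source_labels: [__address__]
        regex: '(.*):10250'
        replacement: '${1}:10255'
        target_label: __address__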

You can customize your own labels and relabeling configuration.

Monitor Kubelet: what to look for?

Disclaimer: Kubelet metrics might differ between Kubernetes versions. Here, we used Kubernetes 1.15. You can check the metrics available for your version in the Kubernetes repo (link for the 1.15.3 version).

Number of kubelet instances: This value gives an idea of the general health of Kubelet across the nodes. The expected value is the number of nodes in the cluster. You can obtain this value by counting the targets found by Prometheus, or by checking the process if you have low-level access to the node.
A possible PromQL query for a single stat graph would be:

sum(up{job="kubernetes-nodes"})

Number of pods and containers running: Kubelet provides insight into the number of pods and containers actually running in the node. You can compare this value with the one expected, or reported, by Kubernetes to detect possible issues in the nodes.

# HELP kubelet_running_pod_count Number of pods currently running
# TYPE kubelet_running_pod_count gauge
kubelet_running_pod_count 9
# HELP kubelet_running_container_count Number of containers currently running
# TYPE kubelet_running_container_count gauge
kubelet_running_container_count 9
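
For example, a hedged PromQL query to chart the number of pods running per node could be the following; the job label assumes the scrape configuration shown earlier:

sum(kubelet_running_pod_count{job="kubernetes-nodes"}) by (instance)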

Number of volumes: Kubelet mounts the volumes requested by the controllers, so it can provide information about them. This can be useful to diagnose issues with volumes that are not being mounted when a pod is recreated in a StatefulSet. It provides two metrics that can be represented together: the number of desired volumes and the number of volumes actually mounted:

# HELP volume_manager_total_volumes Number of volumes in Volume Manager
# TYPE volume_manager_total_volumes gauge
volume_manager_total_volumes{plugin_name="kubernetes.io/configmap",state="actual_state_of_world"} 1
volume_manager_total_volumes{plugin_name="kubernetes.io/configmap",state="desired_state_of_world"} 1
volume_manager_total_volumes{plugin_name="kubernetes.io/empty-dir",state="actual_state_of_world"} 1
volume_manager_total_volumes{plugin_name="kubernetes.io/empty-dir",state="desired_state_of_world"} 1
volume_manager_total_volumes{plugin_name="kubernetes.io/host-path",state="actual_state_of_world"} 55
volume_manager_total_volumes{plugin_name="kubernetes.io/host-path",state="desired_state_of_world"} 55
volume_manager_total_volumes{plugin_name="kubernetes.io/secret",state="actual_state_of_world"} 4
volume_manager_total_volumes{plugin_name="kubernetes.io/secret",state="desired_state_of_world"} 4


Differences between these two values (outside of transient phases) can be a good indicator of issues.
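
A sketch of a PromQL expression that surfaces this difference per node, using the metric and label names from the sample output above, could be:

sum(volume_manager_total_volumes{state="desired_state_of_world"}) by (instance)
  - sum(volume_manager_total_volumes{state="actual_state_of_world"}) by (instance)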

Config errors: This metric acts as a flag for configuration errors in the node.

# HELP kubelet_node_config_error This metric is true (1) if the node is experiencing a configuration-related error, false (0) otherwise.
# TYPE kubelet_node_config_error gauge
kubelet_node_config_error 0
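
Since this metric is a boolean gauge, a minimal alerting rule sketch could simply fire when it stays at 1 for a while; the alert name and duration here are illustrative assumptions:

- alert: KubeletConfigError
  expr: kubelet_node_config_error == 1
  for: 10m
  labels:
    severity: warning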

Golden signals of every operation performed by Kubelet (operation rate, operation error rate, and operation duration). Saturation can be measured with system metrics. Kubelet offers detailed information about the operations performed by the daemon. Metrics that can be used are listed below; example queries follow the list.

  • kubelet_runtime_operations_total: Total count of runtime operations of each type.
    # HELP kubelet_runtime_operations_total Cumulative number of runtime operations by operation type.
    # TYPE kubelet_runtime_operations_total counter
    kubelet_runtime_operations_total{operation_type="container_status"} 225
    kubelet_runtime_operations_total{operation_type="create_container"} 44
    kubelet_runtime_operations_total{operation_type="exec"} 5
    kubelet_runtime_operations_total{operation_type="exec_sync"} 1.050273e+06
    ...
  • kubelet_runtime_operations_errors_total: Count of errors in the operations. This can be a good indicator of low-level issues in the node, like problems with the container runtime.
    # HELP kubelet_runtime_operations_errors_total Cumulative number of runtime operation errors by operation type.
    # TYPE kubelet_runtime_operations_errors_total counter
    kubelet_runtime_operations_errors_total{operation_type="container_status"} 18
    kubelet_runtime_operations_errors_total{operation_type="create_container"} 1
    kubelet_runtime_operations_errors_total{operation_type="exec_sync"} 7
  • kubelet_runtime_operations_duration_seconds_bucket: Duration of the operations. Useful to calculate percentiles.
    # HELP kubelet_runtime_operations_duration_seconds Duration in seconds of runtime operations. Broken down by operation type.
    # TYPE kubelet_runtime_operations_duration_seconds histogram
    kubelet_runtime_operations_duration_seconds_bucket{operation_type="container_status",le="0.005"} 194
    kubelet_runtime_operations_duration_seconds_bucket{operation_type="container_status",le="0.01"} 207
    ...
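
As a sketch, the error ratio and the 99th percentile latency per operation type could be computed with queries like these; the 5m rate window is an arbitrary choice:

sum(rate(kubelet_runtime_operations_errors_total[5m])) by (operation_type)
  / sum(rate(kubelet_runtime_operations_total[5m])) by (operation_type)

histogram_quantile(0.99, sum(rate(kubelet_runtime_operations_duration_seconds_bucket[5m])) by (operation_type, le))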

Pod start rate and duration: This could indicate issues with the container runtime or with access to images. Example queries follow the list below.

  • kubelet_pod_start_duration_seconds_count: Number of pod start operations.
    # HELP kubelet_pod_start_duration_seconds Duration in seconds for a single pod to go from pending to running.
    # TYPE kubelet_pod_start_duration_seconds histogram
    ...
    kubelet_pod_start_duration_seconds_count 196
    ...
  • kubelet_pod_worker_duration_seconds_count:
    # HELP kubelet_pod_worker_duration_seconds Duration in seconds to sync a single pod. Broken down by operation type: create, update, or sync
    # TYPE kubelet_pod_worker_duration_seconds histogram
    ...
    kubelet_pod_worker_duration_seconds_count{operation_type="sync"} 196
    ...
  • kubelet_pod_start_duration_seconds_bucket:
    # HELP kubelet_pod_start_duration_seconds Duration in seconds for a single pod to go from pending to running.
    # TYPE kubelet_pod_start_duration_seconds histogram
    kubelet_pod_start_duration_seconds_bucket{le="0.005"} 194
    kubelet_pod_start_duration_seconds_bucket{le="0.01"} 195
    ...
  • kubelet_pod_worker_duration_seconds_bucket:
    # HELP kubelet_pod_worker_duration_seconds Duration in seconds to sync a single pod. Broken down by operation type: create, update, or sync
    # TYPE kubelet_pod_worker_duration_seconds histogram
    kubelet_pod_worker_duration_seconds_bucket{operation_type="sync",le="0.005"} 194
    kubelet_pod_worker_duration_seconds_bucket{operation_type="sync",le="0.01"} 195
    ...
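
A hedged pair of queries for the pod start rate and its 95th percentile duration, aggregated per node, could be:

sum(rate(kubelet_pod_start_duration_seconds_count[5m])) by (instance)

histogram_quantile(0.95, sum(rate(kubelet_pod_start_duration_seconds_bucket[5m])) by (instance, le))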

Storage golden signals (operation rate, error rate and duration).

  • storage_operation_duration_seconds_count:
    # HELP storage_operation_duration_seconds Storage operation duration
    # TYPE storage_operation_duration_seconds histogram
    ...
    storage_operation_duration_seconds_count{operation_name="verify_controller_attached_volume",volume_plugin="kubernetes.io/configmap"} 16
    …
  • storage_operation_errors_total:
    # HELP storage_operation_errors_total Storage errors total
    # TYPE storage_operation_errors_total counter
    storage_operation_errors_total{operation_name="volume_attach",volume_plugin="aws-ebs"} 0
    storage_operation_errors_total{operation_name="volume_detach",volume_plugin="aws-ebs"} 0
  • storage_operation_duration_seconds_bucket:
    # HELP storage_operation_duration_seconds Storage operation duration
    # TYPE storage_operation_duration_seconds histogram
    storage_operation_duration_seconds_bucket{operation_name="verify_controller_attached_volume",volume_plugin="kubernetes.io/configmap",le="0.1"} 16
    storage_operation_duration_seconds_bucket{operation_name="verify_controller_attached_volume",volume_plugin="kubernetes.io/configmap",le="0.25"} 16
    ...
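
As a sketch, the storage operation error rate per operation and volume plugin could be charted with:

sum(rate(storage_operation_errors_total[5m])) by (operation_name, volume_plugin)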

Cgroup manager operation rate and duration.

  • kubelet_cgroup_manager_duration_seconds_count:
    # HELP kubelet_cgroup_manager_duration_seconds Duration in seconds for cgroup manager operations. Broken down by method.
    # TYPE kubelet_cgroup_manager_duration_seconds histogram
    ...
    kubelet_cgroup_manager_duration_seconds_count{operation_type="create"} 28
    ...
  • kubelet_cgroup_manager_duration_seconds_bucket:
    # HELP kubelet_cgroup_manager_duration_seconds Duration in seconds for cgroup manager operations. Broken down by method.
    # TYPE kubelet_cgroup_manager_duration_seconds histogram
    kubelet_cgroup_manager_duration_seconds_bucket{operation_type="create",le="0.005"} 11
    kubelet_cgroup_manager_duration_seconds_bucket{operation_type="create",le="0.01"} 21
    ...
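
A hedged query for the 99th percentile of cgroup manager operation duration, broken down by operation type, could be:

histogram_quantile(0.99, sum(rate(kubelet_cgroup_manager_duration_seconds_bucket[5m])) by (operation_type, le))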

Pod Lifecycle Event Generator (PLEG): relist rate, relist interval and relist duration. Errors or excessive latency in these values can cause issues with the Kubernetes status of the pods. An example query follows the list below.

  • kubelet_pleg_relist_duration_seconds_count:
    # HELP kubelet_pleg_relist_duration_seconds Duration in seconds for relisting pods in PLEG.
    # TYPE kubelet_pleg_relist_duration_seconds histogram
    ...
    kubelet_pleg_relist_duration_seconds_count 5.344102e+06
    ...
  • kubelet_pleg_relist_interval_seconds_bucket:
    # HELP kubelet_pleg_relist_interval_seconds Interval in seconds between relisting in PLEG.
    # TYPE kubelet_pleg_relist_interval_seconds histogram
    kubelet_pleg_relist_interval_seconds_bucket{le="0.005"} 0
    kubelet_pleg_relist_interval_seconds_bucket{le="0.01"} 0
    ...
  • kubelet_pleg_relist_duration_seconds_bucket:
    # HELP kubelet_pleg_relist_duration_seconds Duration in seconds for relisting pods in PLEG.
    # TYPE kubelet_pleg_relist_duration_seconds histogram
    kubelet_pleg_relist_duration_seconds_bucket{le="0.005"} 2421
    kubelet_pleg_relist_duration_seconds_bucket{le="0.01"} 4.335858e+06
    ...
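
A query worth keeping an eye on here is the 99th percentile of the relist duration, which can surface PLEG latency problems. This is a sketch, with the 5m rate window as an arbitrary choice:

histogram_quantile(0.99, sum(rate(kubelet_pleg_relist_duration_seconds_bucket[5m])) by (instance, le))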

Examples of issues in Kubelet

Pods are not starting.
This is typically a sign of Kubelet having problems connecting to the container runtime running below. Check the pod start rate and duration metrics to see whether there is latency creating the containers or whether they are starting at all.

A node doesn’t seem to be scheduling new pods.
Check the number of Kubelet instances reported by the Prometheus job. There is a chance that Kubelet has died on that node and is unable to start new pods.

Kubernetes seems to be slow performing operations.
Check all the golden signals in the Kubelet metrics. It may have issues with storage, latency communicating with the container runtime engine, or load issues.

Monitor Kubelet metrics in Sysdig Monitor

In order to track Kubelet in Sysdig Monitor, you have to add some sections to the agent YAML configuration file.

With the metrics_filter part, you ensure that these metrics won’t be discarded if you hit the metrics limit. You can add any other metric offered by Kubelet that is not on this list, like this:

metrics_filter:
    - include: "kubelet_running_pod_count"
    ...
    - include: "go_goroutines"

Then, you configure how the Sysdig agent will scrape the metrics, searching the system for processes called kubelet and scraping localhost on port 10255. As the Sysdig agent is capable of switching network context and connecting to the pod as if it were local, we don’t need to use HTTPS.
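
As a sketch, that part of the agent configuration might look like the following; the exact field names depend on your Sysdig agent version, so treat this as an illustrative assumption and check it against the agent documentation:

prometheus:
  enabled: true
  process_filter:
    - include:
        process.name: kubelet
        conf:
          # Scrape the read-only Kubelet port on localhost, without TLS.
          host: 127.0.0.1
          port: 10255
          use_https: false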

You can then build custom dashboards using these metrics. We have some pre-built dashboards that we can share with you if you’re interested.

Kubelet monitoring dashboard in Sysdig Monitor

Conclusions

Monitoring Kubelet is fundamental, as it is a key piece of cluster operation. Remember, all of the communication with the container runtime goes through Kubelet. It is the bridge between Kubernetes and the OS running underneath.

Some issues in your Kubernetes cluster that appear to be random can be explained by a problem in the Kubelet. Monitoring kubelet metrics can save you time when these problems come, and they will.

Sysdig helps you follow Kubernetes monitoring best practices, which is just as important as monitoring your workloads and applications running inside the cluster. Don’t forget to monitor your control plane!
