Monitoring Ceph health with Prometheus

Monitoring Ceph health with Prometheus is straightforward since Ceph already exposes an endpoint with all of its metrics for Prometheus. In this article, we will put it all together to help you start monitoring your Ceph storage cluster and guide you through all the important metrics.

Ceph offers a great solution for object-based storage to manage large amounts of data even on economical hardware. The project is backed by the Ceph Foundation, which is organized as a directed fund under the Linux Foundation.

Monitoring Ceph is crucial for maintaining the health of your disk provider, as well as keeping the cluster’s quorum.

How to enable Prometheus monitoring for Ceph

If you deployed Ceph with Rook, you won’t have to do anything else. The Prometheus module is already enabled and the pod is annotated, so Prometheus will gather the metrics automatically.

If you didn’t deploy Ceph with Rook, there are a couple of additional steps.

Enable Prometheus monitoring

Use this command to enable Prometheus in your Ceph storage cluster. It enables an endpoint returning Prometheus metrics.

ceph mgr module enable prometheus

Please note that after doing this, you’ll need to restart the Prometheus manager module to completely enable Prometheus.
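Outside Kubernetes, Prometheus also needs a scrape job that points at the manager endpoint. The following is a minimal sketch, assuming the module listens on its default port 9283; the job name and the host name ceph-mgr.example.internal are placeholders, not values from a real cluster.

```yaml
# prometheus.yml fragment (sketch; host and job name are hypothetical)
scrape_configs:
  - job_name: ceph
    # Keep the labels exported by the Ceph module instead of
    # overwriting them with the scrape target's own labels.
    honor_labels: true
    static_configs:
      - targets:
          - ceph-mgr.example.internal:9283   # placeholder ceph-mgr host
```

If you run several manager daemons, each one can be listed as a target; standby managers may return empty or redirected responses depending on your Ceph version.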

Annotate Ceph pods with Prometheus metrics

Add these annotations to the ceph-mgr deployment so Prometheus service discovery can automatically detect your Ceph metrics endpoint.

annotations:
  prometheus.io/scrape: 'true'
  prometheus.io/port: '9283'
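These annotations are a convention rather than a built-in Prometheus feature: they only take effect if the Prometheus scrape configuration contains matching relabel rules. The following sketch is adapted from the widely used Kubernetes example configuration; the job name is an assumption, and your Prometheus installation may already ship an equivalent job.

```yaml
# prometheus.yml fragment (sketch; job name is hypothetical)
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only keep pods annotated with prometheus.io/scrape: 'true'
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Rewrite the scrape address to use the port from prometheus.io/port
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
```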

Monitoring Ceph health

Ceph status

The single most important metric to check is ceph_health_status. The Ceph manager reports 0 for HEALTH_OK, 1 for HEALTH_WARN, and 2 for HEALTH_ERR, so if this metric doesn’t exist or it returns anything other than 0, the cluster needs attention.

Let’s create an alert to be aware of this situation:

ceph_health_status != 0 or absent(ceph_health_status)
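A missing metric usually points at the metrics endpoint rather than at the cluster itself, so it can be useful to route that case as its own alert. The rule below is a sketch of how this might look in a Prometheus rule file; the group name, alert name, grace period, and severity label are all assumptions to adapt.

```yaml
# ceph-health.rules.yml (sketch; all names are hypothetical)
groups:
  - name: ceph-health
    rules:
      - alert: CephMetricsMissing
        # Fires when no ceph_health_status series is being scraped at all
        expr: absent(ceph_health_status)
        for: 5m                  # assumed grace period
        labels:
          severity: critical     # assumed severity convention
        annotations:
          summary: "ceph_health_status is missing; the ceph-mgr Prometheus endpoint may be down"
```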

Cluster remaining storage

As in all systems where you use disks, you need to check the remaining available storage. To check this, you can use ceph_cluster_total_bytes to get the total disk capacity (in bytes) and ceph_cluster_total_used_bytes to get the disk usage (in bytes).

Let’s create a PromQL query to alert when the space left is under 15% of the total disk space:

(ceph_cluster_total_bytes - ceph_cluster_total_used_bytes) / ceph_cluster_total_bytes < 0.15
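As a sketch, the same expression can be packaged as an alerting rule. The alert name, the 10-minute grace period, and the severity label are assumptions; only the expression comes from this article.

```yaml
# ceph-capacity.rules.yml (sketch; all names are hypothetical)
groups:
  - name: ceph-capacity
    rules:
      - alert: CephClusterLowFreeSpace
        # Free space as a fraction of total capacity, below 15%
        expr: (ceph_cluster_total_bytes - ceph_cluster_total_used_bytes) / ceph_cluster_total_bytes < 0.15
        for: 10m                 # assumed grace period to avoid flapping
        labels:
          severity: warning      # assumed severity convention
        annotations:
          # $value holds the free-space ratio returned by the expression
          summary: "Ceph cluster free space is at {{ $value | humanizePercentage }}"
```

The for clause keeps the alert from firing on a brief spike, for example while the cluster is rebalancing.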

Object Storage Daemon nodes down

The Object Storage Daemon (OSD) is responsible for storing objects on a local file system and providing access to them over the network. OSDs run on each storage node, typically one per disk. If an OSD goes down, you won’t have access to the physical disks it manages on that node.

Let’s create an alert that fires when an OSD is down:

ceph_osd_up == 0
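Wrapped as an alerting rule, this check might look like the sketch below. The alert name, grace period, and severity are assumptions, and the summary assumes that ceph_osd_up carries a ceph_daemon label identifying the OSD, which is worth confirming against your own metrics.

```yaml
# ceph-osd.rules.yml (sketch; all names are hypothetical)
groups:
  - name: ceph-osd
    rules:
      - alert: CephOSDDown
        # One alert instance per OSD that reports itself as down
        expr: ceph_osd_up == 0
        for: 5m                  # assumed grace period for restarts
        labels:
          severity: critical     # assumed severity convention
        annotations:
          # ceph_daemon is assumed to be the label naming the OSD
          summary: "Ceph OSD {{ $labels.ceph_daemon }} has been down for 5 minutes"
```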

Missing MDS replicas

It’s important to check that the actual number of MDS replicas isn’t lower than expected. Usually, for high availability (HA), the number is three. But in larger clusters, it can be higher.

ceph-mds is the metadata server daemon for the Ceph distributed file system (CephFS). It coordinates access to the shared OSD cluster. If no MDS is available, CephFS clients won’t be able to access the file system.

This PromQL query will alert you if there’s no MDS available. It uses absent() because count() returns no result at all, rather than 0, when nothing matches.

absent(ceph_mds_metadata == 1)
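The query above only covers the case where no MDS is left. To catch the earlier situation described in this section, where fewer replicas are running than expected, a rule along these lines could help. It is a sketch that assumes three expected replicas, the usual HA number mentioned above; the alert name, grace period, and severity are also assumptions.

```yaml
# ceph-mds.rules.yml (sketch; all names are hypothetical)
groups:
  - name: ceph-mds
    rules:
      - alert: CephMDSMissingReplicas
        # Counts the MDS daemons currently reporting metadata;
        # 3 is the assumed expected number of replicas.
        # Note: this returns nothing when no MDS series exist at all,
        # so it complements rather than replaces the "no MDS" check.
        expr: count(ceph_mds_metadata) < 3
        for: 5m                  # assumed grace period
        labels:
          severity: warning      # assumed severity convention
        annotations:
          summary: "Only {{ $value }} Ceph MDS daemons are running, fewer than expected"
```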

Quorum

If the Ceph monitors (MONs) cannot form a quorum, cephadm is unable to manage the cluster until the quorum is restored. Learn more about how Ceph uses Paxos to establish consensus about the master cluster map in the Ceph documentation.

It’s recommended to have three monitors to get a quorum. If any is down, then the quorum is at risk.

This can be alerted with the ceph_mon_quorum_status metric:

count(ceph_mon_quorum_status == 1) <= ((count(ceph_mon_metadata) % 2) + 1)
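As a sketch, the quorum check can be packaged as an alerting rule, written here without label selectors and with the modulo operator spelled out as %. With three monitors, the right-hand side evaluates to (3 % 2) + 1 = 2, so the rule fires as soon as one monitor drops out of quorum. The alert name, grace period, and severity are assumptions.

```yaml
# ceph-mon.rules.yml (sketch; all names are hypothetical)
groups:
  - name: ceph-mon
    rules:
      - alert: CephMonQuorumAtRisk
        # Left side: monitors currently in quorum.
        # Right side: threshold derived from the total monitor count.
        expr: count(ceph_mon_quorum_status == 1) <= ((count(ceph_mon_metadata) % 2) + 1)
        for: 5m                  # assumed grace period
        labels:
          severity: critical     # assumed severity convention
        annotations:
          summary: "Only {{ $value }} Ceph monitors are in quorum"
```

If one Prometheus instance scrapes several Ceph clusters, you would likely add a label selector or a by clause so that each cluster is evaluated separately.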

Add these metrics to Grafana or Sysdig Monitor in a few clicks

In this article, we’ve learned how monitoring Ceph health with Prometheus can easily help you check your Ceph cluster health, and identified the top five key metrics you need to look at.

In PromCat.io, you can find a dashboard and the alerts showcased in this article, ready to use in Grafana or Sysdig Monitor. These integrations are curated, tested, and maintained by Sysdig.

[Screenshot: the Dashboard section for the PromCat Ceph resource]

Also, learn how easy it is to monitor Ceph with Sysdig Monitor.

If you would like to try this integration, we invite you to sign up for a free trial of Sysdig Monitor.
