Monitoring Availability Metrics with Blackbox exporter and Sysdig

By Carlos Tolon - JUNE 24, 2021


The Prometheus Blackbox exporter lets you probe endpoints over several protocols, such as HTTP(S), DNS, TCP, and ICMP. It generates multiple metrics for each configured target, like general endpoint status, response time, redirect information, or certificate expiration dates.

The Blackbox exporter works out of the box because it only observes your services from the outside and requires no changes to them. To get more detailed, internal metrics, you can instrument your applications.

Since the Blackbox exporter usually runs as a self-hosted service in a container, it can be easily deployed in your Kubernetes cluster.

By default, when performing HTTP probes, this exporter uses the GET HTTP method to explore your targets and expects a 2xx status code. This behavior is easily configurable, along with other options like basic or bearer token-based authentication, custom headers, or the use of a proxy server to connect to targets.
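
As an illustration of those options, a module in the exporter's configuration file might look like the sketch below. The module name, header, credentials, and proxy address are placeholders chosen for this example, not values from a real setup:

```yaml
modules:
  http_post_custom:                 # hypothetical module name
    prober: http
    timeout: 5s
    http:
      method: POST                  # override the default GET
      valid_status_codes: [200, 201] # accept only these codes instead of any 2xx
      headers:
        Content-Type: application/json
      body: '{}'
      basic_auth:                   # placeholder credentials
        username: probe-user
        password: changeme
      proxy_url: http://proxy.example.internal:3128  # placeholder proxy
```

Basic and bearer token authentication are alternatives, so a module would normally set only one of them.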

Some interesting use cases where you can use Blackbox exporter with Sysdig are:

  • Detecting certificate expiration.
  • Analyzing the latency of specific targets and paths of services running in the same cluster.
  • Detecting endpoint failures.
  • Monitoring latency and reachability of an external endpoint, like a third-party service.

Installing and configuring Blackbox exporter in Kubernetes

You can use Helm to install and configure your Blackbox installation in Kubernetes.

$ helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
$ helm repo update

Under the config parameter, you can include your probe configuration. For example, you can define several HTTP probes for different scenarios. Put the following configuration in a Helm values file and pass that file to Helm with the “-f” flag.

config:
  modules:
    http_2xx:
      prober: http
      timeout: 2s
      http:
        follow_redirects: true
    http_k8s:
      prober: http
      timeout: 2s
      http:
        follow_redirects: true
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt

The http_2xx module keeps most options at their defaults. The http_k8s module adds a custom CA and bearer token-based authentication.
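
The modules map is not limited to HTTP. As a sketch, probes for the other protocols the exporter supports could sit next to the HTTP modules in the same values file. The module names and the DNS query name below are placeholders chosen for this example:

```yaml
config:
  modules:
    tcp_connect:          # hypothetical name: succeeds if a TCP connection opens
      prober: tcp
      timeout: 5s
    icmp_ping:            # hypothetical name: ICMP echo probe
      prober: icmp
      timeout: 5s
    dns_lookup:           # hypothetical name: resolves a record via the target DNS server
      prober: dns
      timeout: 5s
      dns:
        query_name: example.com   # placeholder record to query
        query_type: A
```

If you use these, merge them into the existing config.modules map rather than adding a second config key.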

Once you have correctly configured your Helm values, you can install the Blackbox exporter:

$ helm install [RELEASE_NAME] prometheus-community/prometheus-blackbox-exporter -f values.yaml [-n NAMESPACE]

Scraping Blackbox using Sysdig Agent

The Blackbox exporter can be invoked using the native Prometheus service discovery, available with the Sysdig Agent.

To use the native Prometheus service discovery, you’ll need to enable the prometheus.prom_service_discovery configuration by adding the following lines in your Agent configuration:

prometheus:
  enabled: true
  prom_service_discovery: true

Targets for the Blackbox exporter must be provided through Prometheus relabel_configs in a static scrape configuration. In your Sysdig Agent prometheus.yaml file, list the targets and pass the module you want to use as a parameter.

prometheus.yaml: |
  global:
    scrape_interval: 10s
  scrape_configs:
  - job_name: blackbox
    metrics_path: /probe
    params:
      module:
      - http_2xx
    relabel_configs:
    - source_labels:
      - __address__
      target_label: __param_target
    - source_labels:
      - __param_target
      target_label: instance
    - replacement: blackbox-service:9115  # Blackbox hostname:port
      target_label: __address__
    static_configs:
    - targets:
      - http://front-end.sock-shop.svc.cluster.local
      - https://docs.sysdig.com/
      - https://promcat.io
      - https://api.promcat.io/apps
      - https://my.failing.app

In the relabel_configs block, replace “blackbox-service:9115” with your Blackbox exporter hostname:port. You can directly use the Blackbox Kubernetes Service and port.
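
The same pattern works with any other module. As a sketch, a job like the following could probe the Kubernetes API server with the http_k8s module from the Helm values; the comments describe what each relabeling step does. The job name, the target URL, and the assumption that the exporter's service account is allowed to call /healthz are illustrative and not part of the original setup:

```yaml
scrape_configs:
- job_name: blackbox-k8s-api        # hypothetical job name
  metrics_path: /probe              # Blackbox exporter probe endpoint
  params:
    module:
    - http_k8s                      # module defined in the Helm values
  relabel_configs:
  # 1. Copy each listed target into the "target" URL parameter.
  - source_labels:
    - __address__
    target_label: __param_target
  # 2. Keep the probed URL as the instance label.
  - source_labels:
    - __param_target
    target_label: instance
  # 3. Send the actual scrape to the Blackbox exporter instead of the target.
  - replacement: blackbox-service:9115  # Blackbox hostname:port
    target_label: __address__
  static_configs:
  - targets:
    - https://kubernetes.default.svc/healthz  # assumed in-cluster API address
```

With this job, the request that reaches the exporter has the form http://blackbox-service:9115/probe?module=http_k8s&target=https://kubernetes.default.svc/healthz.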

In addition to static configuration, a common snippet uses service discovery to find the Services in the cluster and probes, through Blackbox, every Service that carries the annotation “prometheus.io/probe: true”.

prometheus.yaml: |
  global:
    scrape_interval: 10s
  scrape_configs:
  - job_name: kubernetes-services
    kubernetes_sd_configs:
    - role: service
    metrics_path: /probe
    params:
      module:
      - http_2xx
    relabel_configs:
    - action: keep
      regex: true
      source_labels:
      - __meta_kubernetes_service_annotation_prometheus_io_probe
    - source_labels:
      - __address__
      target_label: __param_target
    - replacement: blackbox-service:9115  # Blackbox hostname:port
      target_label: __address__
    - source_labels:
      - __param_target
      target_label: instance
    - action: labelmap
      regex: __meta_kubernetes_service_label_(.+)
    - source_labels:
      - __meta_kubernetes_namespace
      target_label: kubernetes_namespace
    - source_labels:
      - __meta_kubernetes_service_name
      target_label: kubernetes_name
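
For this job to pick up a Service, the Service needs the prometheus.io/probe annotation. A minimal sketch, reusing the sock-shop front-end from the static example; the selector and port values are placeholders, not taken from a real deployment:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: front-end
  namespace: sock-shop
  annotations:
    prometheus.io/probe: "true"   # quoted: annotation values must be strings
spec:
  selector:
    name: front-end               # placeholder selector
  ports:
  - port: 80                      # placeholder port
    targetPort: 8079              # placeholder container port
```

Services without the annotation are dropped by the keep rule at the top of relabel_configs.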

Note: Service-based discovery is generally disruptive because the Sysdig agent runs as a DaemonSet, so each agent instance can end up probing the same services. Whenever possible, avoid combining static configurations with service discovery. Doing so can cause overages from noisy targets and, in the worst case, containers being OOMKilled. With a pod-based approach, each agent scrapes only the pods running on its own node.

Visualize and Alert on available metrics

Once you set up the Sysdig agent to use the Blackbox exporter to probe your targets, metrics will flow to your Sysdig instance.

The probe_duration_seconds metric reports how long each probe took to complete. You can use the following query to analyze the probe duration for each of your targets:

avg(probe_duration_seconds) by(instance)

When your applications consume third-party services, it is important to detect outages or unexpected response codes. You can get this information with the probe_http_status_code metric.

Expired certificates can break communication between your services. The probe_ssl_earliest_cert_expiry metric gives you the Unix timestamp, in seconds, at which the certificate chain will no longer be valid.

Show the time remaining until expiry for all your endpoints in one dashboard with the following query:

probe_ssl_earliest_cert_expiry - time()

Set up your own alerts for certificates that will expire in less than 60 days by using the following PromQL query:

probe_ssl_earliest_cert_expiry - time() < 86400 * 60

You can also detect when your probes are failing by using the probe_success metric:

sum(probe_success) by (instance) != 1

You can create your own dashboard for a complete overview of all your targets based on the available information.

If you would like to try this integration, we invite you to sign up for a free trial of Sysdig Monitor.
