Integrating Prometheus alerts and events with Sysdig Monitor

Published on December 13, 2017
Prometheus alerts: Sysdig ♥ Prometheus (part II)

If you already use (or plan to use) Prometheus alerts and events for application performance monitoring in your Docker / Kubernetes containers, you can easily integrate them with Sysdig Monitor via the Alertmanager daemon. This post walks through that integration.

You can consider this piece the second part of the “Prometheus metrics: instrumenting your app with custom metrics and autodiscovery on Docker containers” blog post, where we detailed how to integrate custom Prometheus metrics.

Prometheus alert integration for your Docker and Kubernetes monitoring needs

Prometheus provides its own alerting system using a separate daemon called Alertmanager. What happens if you already have a Prometheus infrastructure for APM and plan to integrate the Sysdig container intelligence platform (or the other way around)?

They can work together without any migration or complex adaptation effort. In fact, there is a lot to be gained from combining the application-specific custom Prometheus monitoring that your developers love with the deep container and service visibility provided by Sysdig.

These two contexts together add more dimensions to your monitoring data. For example, you can easily detect that the MapReduce function on your backend container is taking longer than usual because kubernetes.replicas.running < kubernetes.replicas.desired: horizontal container scaling is failing, so the container that fired the alarm is receiving an order of magnitude more work.

Metrics Exporters, Prometheus, Alertmanager & Sysdig integration

Docker monitoring scenario

To put things in context, let's assume that you already have a Docker environment instrumented with Prometheus:

[Image: Prometheus alerts diagram]

It could be Swarm, Kubernetes, Amazon ECS… Whatever you're using, our integration with Prometheus works the same way.

To simplify: you have several exporters that emit metrics, and the Prometheus server aggregates them and checks the alert conditions. If those conditions are met, it sends the configured alert to Alertmanager. Alertmanager is in charge of alert filtering, silencing and cooldown times, and of sending the alert notifications to its receivers (mail and Slack chat in our example).

One of the available receivers for Alertmanager is a webhook. This method boils down to HTTP POSTing a JSON data structure. Its simplicity and standard format provide a lot of flexibility to integrate any pair of producer / consumer software.
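
To make the JSON data structure more concrete, here is a small Python sketch that reads a trimmed webhook payload. The top-level fields (version, status, receiver, alerts) follow the Alertmanager webhook format; the values, including the startsAt timestamp, are purely illustrative, and summarize is a hypothetical helper name.

```python
import json

# Trimmed example of the JSON body Alertmanager POSTs to a webhook receiver.
# All values are illustrative (assumption), not captured from a real run.
payload = json.loads('''
{
  "version": "4",
  "status": "firing",
  "receiver": "SysdigMonitor",
  "alerts": [
    {
      "status": "firing",
      "labels": {"alertname": "Function exec time too long", "source": "docker"},
      "annotations": {"severity": "5", "description": "Function exec time too long"},
      "startsAt": "2017-12-12T15:08:09Z"
    }
  ]
}
''')


def summarize(payload):
    """Return one (alert name, status, description) tuple per alert in the payload."""
    return [
        (
            alert.get("labels", {}).get("alertname", ""),
            alert.get("status", ""),
            alert.get("annotations", {}).get("description", ""),
        )
        for alert in payload.get("alerts", [])
    ]


print(summarize(payload))
```

Note that a single POST can carry several alerts, which is why the consumer iterates over the alerts list instead of assuming one alert per request.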

Accordingly, this is what we want to deploy:

[Image: Prometheus alert with Sysdig integration]

We add a new Alertmanager webhook receiver that retrieves the JSON, reformats it to fit the Sysdig API and uploads the alert data to Sysdig Monitor. It is really just that. The interesting bit is that you don't have to modify the monitoring / alerting infrastructure you already have. It's just a new data output.

Let's create an easily reproducible, please-do-try-this-at-home scenario:

Prometheus Metric Exporter

First, we need some data to be scraped. You can reuse the trivial Python script from the last article to get some available metrics.

You can get it as a Docker container:

docker run -d -p 9100:9100 mateobur/pythonmetric

If you try:

curl localhost:9100

You should see some raw metrics:

...
function_exec_time_count{func_name="func1"} 53.0
function_exec_time_sum{func_name="func1"} 10.620916843414307
function_exec_time_bucket{func_name="func2",le="0.005"} 0.0
function_exec_time_bucket{func_name="func2",le="0.01"} 0.0
function_exec_time_bucket{func_name="func2",le="0.025"} 0.0
...
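
If you want to inspect these raw metrics programmatically, the following Python sketch parses sample lines like the ones above into a name, a label dictionary and a value. It is a simplified reader, not a full implementation of the Prometheus text format: it assumes no escaped quotes inside label values and no trailing timestamps, and parse_sample is a hypothetical helper name.

```python
import re

# Matches lines such as: metric_name{label="value",other="x"} 12.5
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(?P<key>[a-zA-Z_][a-zA-Z0-9_]*)="(?P<val>[^"]*)"')


def parse_sample(line):
    """Parse one sample line into (name, labels, value).

    Returns None for blank lines, comments and anything that is not a sample.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = SAMPLE_RE.match(line)
    if match is None:
        return None
    labels = {
        m.group("key"): m.group("val")
        for m in LABEL_RE.finditer(match.group("labels") or "")
    }
    return match.group("name"), labels, float(match.group("value"))


raw = '''function_exec_time_count{func_name="func1"} 53.0
function_exec_time_sum{func_name="func1"} 10.620916843414307
function_exec_time_bucket{func_name="func2",le="0.005"} 0.0'''

samples = [s for s in map(parse_sample, raw.splitlines()) if s is not None]
for name, labels, value in samples:
    print(name, labels, value)
```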

Prometheus server & alerts

Next, we are going to configure the Prometheus server itself. Take a look at the Prometheus alerting guide if you want to go further than this example.

We are going to modify three sections of the main configuration file.

First, we declare the metrics endpoint we just mentioned:

  - job_name: 'pythonfunc'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ['pythonmetrics:9100']

Prometheus will then scrape it regularly.

Second, we declare a rule configuration file to load:

rule_files:
   - "/etc/prometheus/alerts.yml"

And third, we list the Alertmanager where we want to deliver our alerts:

alerting:
  alertmanagers:
  - static_configs:
    - targets:
       - alertmanager:9093

This configuration requires that the Prometheus container is able to resolve the pythonmetrics and alertmanager hostnames. Don't worry much about it: with the docker-compose file we provide below, everything should work out of the box.

This is the alerts.yml we are going to use for the example:

groups:
- name: example
  rules:
  - alert: Function exec time too long
    expr: function_exec_time_sum{job="pythonfunc"} > 0.5
    for: 1m
    annotations:
      severity: 5
      container_name: nginx-front
      host_mac: 08:00:27:b2:06:e8
      container_id: 8c72934ff648
      host_hostName: dockernode
      description: Function exec time too long
    labels:
      source: docker

Nothing too surprising here: it has a name, a description, a firing condition expressed in the Prometheus query language (PromQL) and a for duration (1m) that the condition must hold before the alert actually fires.
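
To illustrate how the condition and the for clause interact, here is a simplified Python model of the alert lifecycle (inactive, pending, firing). It is only a sketch of the idea, not Prometheus code: alert_state is a hypothetical function, and it assumes one (timestamp, value) sample per rule evaluation.

```python
def alert_state(samples, threshold, for_seconds):
    """Replay (unix_timestamp, value) samples and return the final alert state.

    The alert is 'pending' as soon as the value exceeds the threshold and
    becomes 'firing' once it has stayed above it for at least for_seconds.
    Any sample at or below the threshold resets it to 'inactive'.
    """
    active_since = None
    state = "inactive"
    for timestamp, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = timestamp
            if timestamp - active_since >= for_seconds:
                state = "firing"
            else:
                state = "pending"
        else:
            active_since = None
            state = "inactive"
    return state


# Condition true for a full minute: the alert fires.
print(alert_state([(0, 0.6), (30, 0.7), (60, 0.8)], threshold=0.5, for_seconds=60))

# Condition interrupted at t=30: the one-minute clock starts over.
print(alert_state([(0, 0.6), (30, 0.4), (60, 0.8)], threshold=0.5, for_seconds=60))
```

This is why a short spike above 0.5 does not reach the Alertmanager in our example: the expression has to stay true for the whole minute.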

We are going to use the annotations as the scope for our Sysdig alert. Our webhook script will also translate the '_' characters to '.' ('.' is not valid in an annotation name), so we can have exactly the same names as native Sysdig alerts.
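
The translation described above can be sketched in a few lines of Python. This is an illustration rather than the actual webhook script: annotations_to_scope and NON_SCOPE_KEYS are hypothetical names, and both the choice of which annotations are excluded from the scope and the 'key = "value" and ...' filter syntax are assumptions.

```python
# Annotation keys that describe the event itself rather than its scope
# (assumption made for this example).
NON_SCOPE_KEYS = {"severity", "description"}


def annotations_to_scope(annotations):
    """Build a scope filter string from Prometheus annotations.

    'container_name' becomes 'container.name', 'host_mac' becomes 'host.mac',
    and so on. Keys are sorted so that the output is deterministic.
    """
    clauses = []
    for key in sorted(annotations):
        if key in NON_SCOPE_KEYS:
            continue
        clauses.append('{} = "{}"'.format(key.replace("_", "."), annotations[key]))
    return " and ".join(clauses)


annotations = {
    "severity": "5",
    "container_name": "nginx-front",
    "host_mac": "08:00:27:b2:06:e8",
    "container_id": "8c72934ff648",
    "host_hostName": "dockernode",
    "description": "Function exec time too long",
}
print(annotations_to_scope(annotations))
```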

Alertmanager & webhook receivers - Prometheus alerts integration

The third piece of this puzzle is the Alertmanager; you can read about its configuration here. In particular, you can use the different routes and receivers in the routing tree to filter and classify the alerts and, for example, deliver alerts from different parts of your infrastructure to separate Sysdig teams.
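
As a rough mental model of the routing tree, the Python sketch below picks a receiver from the labels of an alert. It is a deliberate simplification of what Alertmanager does: it only handles exact-match rules, ignores options such as continue and regular-expression matchers, and the receiver names other than SysdigMonitor are hypothetical.

```python
def pick_receiver(route, labels, inherited=None):
    """Return the receiver of the deepest route whose match rules fit the labels.

    The first matching child wins, and a route without its own receiver
    inherits the one from its parent.
    """
    receiver = route.get("receiver", inherited)
    for child in route.get("routes", []):
        rules = child.get("match", {})
        if all(labels.get(key) == value for key, value in rules.items()):
            return pick_receiver(child, labels, receiver)
    return receiver


# Hypothetical routing tree: one Sysdig team per part of the infrastructure.
routing_tree = {
    "receiver": "SysdigMonitor",
    "routes": [
        {"match": {"team": "frontend"}, "receiver": "SysdigFrontend"},
        {
            "match": {"team": "backend"},
            "routes": [
                {"match": {"severity": "critical"}, "receiver": "SysdigBackendOnCall"},
            ],
        },
    ],
}

print(pick_receiver(routing_tree, {"team": "frontend"}))
print(pick_receiver(routing_tree, {"source": "docker"}))
```

An alert that matches no child route, like the second call, falls back to the default receiver at the root of the tree.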

In this example we are just going to configure a default webhook receiver:

  # A default receiver
  receiver: SysdigMonitor

And define it here:

- name: 'SysdigMonitor'
  webhook_configs:
    - url: 'http://sysdigwebhook:10000'

Any time the Alertmanager needs to notify about an alert, it will send an HTTP POST to that URL endpoint.

JSON Prometheus alerts to Sysdig API

For the last piece, we just need to catch that JSON output, do some minor rearrangements of the data and call the Sysdig API.

These ~30 lines of Python are enough to have a functioning starting point:
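
As a rough idea of what such a receiver could look like, here is a sketch that uses only the Python standard library. It is not the original script: alert_to_event, make_handler and serve are hypothetical names, the shape of the event dictionary and the default severity are assumptions, and the actual upload to the Sysdig API is left to a forward callable that you provide.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def alert_to_event(alert):
    """Rearrange one Alertmanager alert into a flat event dictionary."""
    labels = alert.get("labels", {})
    annotations = dict(alert.get("annotations", {}))
    description = annotations.pop("description", "")
    # Default severity of 4 is an arbitrary choice for this sketch.
    severity = int(annotations.pop("severity", 4))
    return {
        "name": labels.get("alertname", "Prometheus alert"),
        "description": description,
        "severity": severity,
        # Remaining annotations become the scope, with '_' turned into '.'.
        "scope": {key.replace("_", "."): value for key, value in annotations.items()},
        "tags": {key: value for key, value in labels.items() if key != "alertname"},
    }


def make_handler(forward):
    """Build a request handler that passes each converted alert to forward(event)."""

    class WebhookHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            payload = json.loads(self.rfile.read(length) or b"{}")
            for alert in payload.get("alerts", []):
                forward(alert_to_event(alert))
            self.send_response(200)
            self.end_headers()

    return WebhookHandler


def serve(forward, port=10000):
    """Listen for Alertmanager webhooks until interrupted."""
    HTTPServer(("0.0.0.0", port), make_handler(forward)).serve_forever()
```

Calling serve(print) is enough to watch the converted events arrive on port 10000. For real use, forward would be replaced by a function that uploads each event to Sysdig Monitor with your API token.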

Complete integration with Docker-compose

To spawn all the pieces of this example at once in a more convenient way, you can just use this docker-compose file.

Just fill out the SYSDIG_API_KEY variable with your token string and spawn it:

docker-compose up -d

Let's take a look at every step:

Accessing http://localhost:9100 you should see the raw metrics again, as we mentioned earlier.

Accessing http://localhost:9090/ you get the Prometheus interface:

[Image: Prometheus interface metric]

If you click on the Alerts tab:

[Image: Prometheus alert fired]

Your alert has fired, nice.

Next, you have the Alertmanager interface at http://localhost:9093:

[Image: Alertmanager Prometheus alert]

It looks like the Alertmanager is taking care of our alerts. If you click on Status, you can see the current configuration file with the routing tree and receivers.

[Image: Alertmanager Sysdig webhook]

Let's take a look at the docker-compose logs:

$ docker-compose logs
Attaching to sysdigwebhook_1, promserver_1, alertmanager_1, pythonmetrics_1
sysdigwebhook_1  |  * Running on http://0.0.0.0:10000/ (Press CTRL+C to quit)
sysdigwebhook_1  | 172.19.0.3 - - [12/Dec/2017 15:09:09] "POST / HTTP/1.1" 200

Your sysdigwebhook container has received an HTTP POST from the Alertmanager.

And now the last and most important part. If you open your Sysdig Monitor panel:

[Image: Sysdig Monitor with Prometheus alert integrated]

There it is!

Your custom event with full scope and tags, on top of any other Sysdig metric you need to correlate.

Supercharge Debugging

By adding multi-dimensional scope to your metrics and dashboards you can supercharge your debugging capacity and find data correlations that are extremely arduous to discover manually.

Also, webhooks are incredibly useful to easily integrate microservices.

The webhook receiver code is just a PoC. You can use it as a starting point, but make sure to add exception handling and fallback routines if you plan to do anything more serious than a local test.

Prometheus and Alertmanager are open source, and you can also get a free trial of Sysdig Monitor right away.
