How to monitor Istio, the Kubernetes service mesh

In this article, we are going to deploy and monitor Istio over a Kubernetes cluster. Istio is a service mesh platform that offers advanced routing, balancing, security, and high availability features, plus Prometheus-style metrics for your services out-of-the-box.

What is Istio?

Istio is a platform used to interconnect microservices. It provides advanced network features like load balancing, service-to-service authentication, monitoring, and more without requiring any changes in service code.

In the Kubernetes context, Istio deploys an Envoy proxy as a sidecar container inside every pod that provides a service.

These proxies mediate every connection, and from that position, they route the incoming/outgoing traffic and enforce the different security and network policies.

This dynamic group of proxies is managed by the Istio “control plane,” a separate set of pods that orchestrate the routing, security, live ruleset updates, etc.

Istio architecture overview

You can find detailed descriptions of each subsystem component in the Istio project docs.

Service mesh explained: The rise of the “service mesh”

Containers are incredibly light and fast, so it’s no surprise that their density is roughly one order of magnitude greater than that of virtual machines. Classical monolithic component interconnection diagrams are rapidly turning into highly dynamic, fault-tolerant N-to-N communications with their own internal security rules, labeling-based routes, DNS and service directories, etc. This is the famous microservice mesh.

This means that while autonomous software units (containers) are becoming simpler and more numerous, interconnecting them and troubleshooting distributed software behavior is actually getting harder.

And of course, we don’t want to burden containers with this complexity. We want them to stay thin and platform agnostic.

Kubernetes already offers a basic abstraction layer separating the service itself from the server pods. Several software projects are striving to tame this complexity, offering visibility, traceability, and other advanced pod networking features. We already covered how to monitor Linkerd, so now let’s talk about Istio.

Istio features overview

Intelligent routing and load balancing: Allows you to define policies to map static service interfaces to different backend versions, allowing for A/B testing, canary deployments, gradual migration, etc. Istio also allows you to define routing rules based on HTTP-layer metadata like session tokens or user agent string.

Network resilience and health checks: By setting timeouts, retry budgets, health checks, and circuit breakers, you can quickly weed unhealthy pods out of the service mesh.

Policy enforcement: Peer TLS authentication, pre-condition checking (whitelists and similar ACLs), and quota management to avoid service abuse and/or consumer starvation.

Telemetry, traceability, and troubleshooting: Telemetry is automatically injected in any service pod providing Prometheus-style network and L7 protocol metrics. Istio also dynamically traces the flow and chained connections of your microservices mesh.

How to monitor Istio using Prometheus

One of the major infrastructure enhancements of tunneling your service traffic through the Istio Envoy proxies is that you automatically collect metrics that are fine-grained and provide high-level application information (since they are reported for every service proxy).

These individual metrics are gathered by Prometheus, but you can also access the Prometheus-format endpoints directly.

Since Mixer is deprecated, the metrics come directly from the pods:

  1. pilot (15014): all metrics of Istio. Used to monitor the control plane of Istio.
  2. envoy (15090): raw stats generated by Envoy (and translated from StatsD to Prometheus format).
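
To make the shape of these endpoints concrete, here is a minimal sketch, assuming the plain-text exposition format that Prometheus scrapes. The sample payload, the metric names in it, and the `parse_metrics` helper are hypothetical illustrations rather than real Pilot or Envoy output, and the parser deliberately ignores edge cases such as escaped quotes and timestamps.

```python
import re

# Hypothetical sample; real Pilot/Envoy endpoints expose many more series.
SAMPLE = """\
# HELP demo_requests_total Illustrative counter.
# TYPE demo_requests_total counter
demo_requests_total{code="200",version="v1"} 42
demo_requests_total{code="503",version="v1"} 3
demo_up 1
"""

# metric_name, optional {labels}, then the sample value
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore lines this simple parser does not understand
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels, value)
```

In a real cluster you would feed this kind of parser the body returned by the pod endpoint instead of the inline sample.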

The Istio project also provides examples and documentation on configuring a Prometheus server to scrape and analyze the most relevant metrics.

kubectl apply -f install/kubernetes/addons/prometheus.yaml

Wait until the pod is ready, and forward the Prometheus server port to your local machine:

kubectl -n istio-system port-forward $(kubectl -n istio-system get pod -l app=prometheus -o jsonpath='{.items[0].metadata.name}') 9090:9090 &

You can now access the Prometheus server UI by opening http://localhost:9090/ in your web browser.

monitor Istio with Prometheus

There is also a Grafana deployment pre-configured and ready to test at the Istio repository:

$ kubectl create -f install/kubernetes/addons/grafana.yaml

Again, wait for the pod and service to be up and running, and redirect the Grafana service port:

kubectl -n istio-system port-forward $(kubectl -n istio-system get pod -l app=grafana -o jsonpath='{.items[0].metadata.name}') 3000:3000 &

You can access a pre-populated dashboard at http://localhost:3000/dashboard/db/istio-dashboard

Monitor Istio with a grafana dashboard

Monitoring Istio: Reference metrics and dashboards

Let’s start monitoring our services and application behavior.

Segmenting by service and service version, these are a few metrics that you usually want to monitor, coming from both the Istio Prometheus telemetry and Sysdig out-of-the-box metric collection:

  • Number of requests: istio_requests_total.
  • Request duration: istio_request_duration_milliseconds_bucket by source and by destination.
  • Request size: istio_request_bytes_bucket by source and by destination.
  • The duration and size metrics are histograms exposed as buckets, so we can calculate percentiles 50, 95, and 99 from them.
  • HTTP Error codes from the metric: istio_requests_total with the label code.
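
The percentile calculation mentioned in the list can be sketched as follows. This is a simplified approximation of what PromQL's `histogram_quantile` function does (linear interpolation inside the matching bucket), not its exact implementation, and the bucket counts are made-up sample data.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 <= q <= 1) from cumulative buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs, like the
    `le` label of a *_bucket series, and must include the +Inf bucket.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        # Skip empty buckets so the interpolation never divides by zero.
        if count >= rank and count > lower_count:
            if math.isinf(upper_bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            # Linear interpolation, assuming observations are spread uniformly.
            width = upper_bound - lower_bound
            return lower_bound + width * (rank - lower_count) / (count - lower_count)
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical cumulative counts for a request duration histogram (ms).
DURATION_BUCKETS = [(10, 20), (50, 70), (100, 95), (math.inf, 100)]

for q in (0.50, 0.95, 0.99):
    print(f"p{int(q * 100)}:", histogram_quantile(q, DURATION_BUCKETS))
```

In Prometheus itself, the equivalent query would look like `histogram_quantile(0.95, sum(rate(istio_request_duration_milliseconds_bucket[5m])) by (le))`.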

Using PromCat.io is the fastest way to create the dashboard. You just have to execute one command to get your dashboard with all metrics at once.

Promcat contains resources on how to monitor Istio

Monitor Istio in Sysdig Monitor: Scraping Istio Prometheus metrics

It is very convenient that the Istio core services use the Prometheus metric format because, as you probably know, Sysdig will automatically detect and scrape Prometheus endpoints.

Let’s edit the Sysdig agent configuration file (dragent.yaml) to configure which pods and ports should be scraped:

prometheus:
  enabled: true
…
   - include:
      process.name: envoy
      conf:
        port: 15090
        path: "/stats/prometheus"

Make sure that Prometheus is enabled and then write an include filter.

With this configuration, you can scrape all Envoy containers and get all of the metrics that Envoy provides without any additional configuration. This works for a small cluster because only a few pods are scraped. As the cluster grows, the volume of metrics can become overwhelming, so the easiest solution is to federate the Prometheus deployed by Istio with the Sysdig agent.

The process is not complicated but you have to know what you are doing. To help you in this process, the Istio page in PromCat.io contains all of the steps and files needed to get your dashboards and alerts working. Let’s take a quick look at them.

Promcat contains instructions on how to configure the exporter to monitor Istio

First, apply the recording rules. These rules are created to reduce the number of metrics you are going to ingest. In a small cluster you don’t create a lot of metrics, but the cardinality of these metrics will explode as your cluster grows.
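
The contents of rules.yaml are not reproduced in this article, so the following is only an illustration of the idea behind such rules: summing away high-cardinality labels so that fewer series need to be ingested. The label names and sample values are hypothetical.

```python
from collections import defaultdict


def aggregate(series, keep_labels):
    """Sum sample values across every label not listed in keep_labels."""
    totals = defaultdict(float)
    for labels, value in series:
        key = tuple((name, labels[name]) for name in keep_labels)
        totals[key] += value
    return dict(totals)


# Hypothetical raw series: one per (service, version, pod, response code).
RAW = [
    ({"service": "reviews", "version": "v1", "pod": "reviews-v1-a", "code": "200"}, 120.0),
    ({"service": "reviews", "version": "v1", "pod": "reviews-v1-a", "code": "503"}, 2.0),
    ({"service": "reviews", "version": "v1", "pod": "reviews-v1-b", "code": "200"}, 110.0),
    ({"service": "reviews", "version": "v3", "pod": "reviews-v3-c", "code": "200"}, 90.0),
    ({"service": "reviews", "version": "v3", "pod": "reviews-v3-c", "code": "503"}, 1.0),
    ({"service": "reviews", "version": "v3", "pod": "reviews-v3-d", "code": "200"}, 95.0),
]

# Dropping the pod and code labels collapses six series into two.
by_version = aggregate(RAW, ("service", "version"))
print(len(RAW), "raw series ->", len(by_version), "aggregated series")
```

The more pods and label values a cluster has, the larger the gap between the raw and the aggregated series counts becomes.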

$ kubectl apply -f rules.yaml

In order to get the recording rules in the Prometheus server, it’s necessary to mount them as a volume:

$ kubectl -n istio-system patch deploy prometheus -p '{"spec":{"template":{"spec":{"volumes":[{"name":"config-rules","configMap":{"defaultMode":420,"name":"rules"}}]}}}}'
$ kubectl -n istio-system patch deploy prometheus -p '{"spec":{"template":{"spec":{"containers":[{"name":"prometheus","volumeMounts": [{"mountPath": "/opt/rules","name": "config-rules"}]}]}}}}'
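
Nested JSON inside a shell string is easy to mistype. As a hedged alternative, the same two patch payloads can be built from Python dictionaries and serialized with the standard `json` module. The constant and helper names below are hypothetical; the keys and values are copied from the commands above.

```python
import json

# Same payload as the first patch command: add the ConfigMap as a volume.
VOLUME_PATCH = {
    "spec": {"template": {"spec": {"volumes": [
        {"name": "config-rules",
         "configMap": {"defaultMode": 420, "name": "rules"}},
    ]}}}
}

# Same payload as the second patch command: mount the volume in the container.
MOUNT_PATCH = {
    "spec": {"template": {"spec": {"containers": [
        {"name": "prometheus",
         "volumeMounts": [{"mountPath": "/opt/rules", "name": "config-rules"}]},
    ]}}}
}


def to_patch_arg(patch):
    """Serialize a patch dictionary into compact JSON for `kubectl patch -p`."""
    return json.dumps(patch, separators=(",", ":"))


print(to_patch_arg(VOLUME_PATCH))
print(to_patch_arg(MOUNT_PATCH))
```

Each printed string can then be passed as the `-p` argument of `kubectl -n istio-system patch deploy prometheus`.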

Prometheus only loads rule files that are referenced in its configuration, so edit its configmap:

$ kubectl -n istio-system edit cm prometheus

And add this line at the same level as the global configuration:

 rule_files:
  - /opt/rules/rules.yaml

Also, since Prometheus has to be scraped by the agent, it needs to be annotated:

$ kubectl -n istio-system patch deploy prometheus -p '{"spec":{"template":{"metadata":{"annotations":{"prometheus.io/scrape": "true", "prometheus.io/port": "9090"}}}}}'

To make sure the new configuration is picked up by Prometheus, just delete the pod so that it is recreated:

$ kubectl -n istio-system delete pods $(kubectl get pods --namespace istio-system -l "app=prometheus,release=istio" -o jsonpath="{.items[0].metadata.name}")

Last but not least, change the configuration of the agent. The fastest and simplest way, if you don’t already have another Prometheus configuration, is to patch the provided configmap:

$ kubectl -n sysdig-agent patch cm sysdig-agent -p "$(cat patch.yaml)"

Once the agent starts gathering the Prometheus metrics, it’s time to create the alerts and the dashboards. This simple command will create everything in your Sysdig account for you. You only need to replace $MONITOR_TOKEN with the Sysdig API key from your Sysdig Monitor settings.

docker run sysdiglabs/promcat-connect install istio:1.5 -t $MONITOR_TOKEN

How to monitor Istio internals

Apart from monitoring the services, you can use Istio and Sysdig aggregated metrics to monitor the health and performance of Istio’s internal services.

Istio provides its own Ingress controller, a very relevant piece of infrastructure to monitor. When your users are experiencing performance problems or errors, the edge router is one of the first points to check.

To assess the global health of your edge router connections, you can display its connections table, global HTTP response codes, resource usage, and number of requests per service or URL.

Connection Table

Connections Stats

Monitor Istio A/B deployments and canary deployments

One of Istio’s major features is the ability to establish intelligent routing based on the service version.

The pods that provide the backend for a certain service will have different Kubernetes labels:

Labels:         app=reviews
                pod-template-hash=3187719182
                version=v3

These different backends are transparent to the consumer (service or final user), but Istio can take advantage of this information to perform:

  • Content-based routing: For example, if the user-agent is a mobile phone, you can change the specific service that formats the final HTML template.
  • A/B deployments: Two similar versions of the service that you want to compare in production.
  • Canary deployment: Experimental service version that will only be triggered by certain conditions (like some specific test users).
  • Traffic Shifting: Progressive migration to the new service version maintaining the old version fully functional.
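
To get an intuition for traffic shifting, the following sketch simulates a weighted split between two backend versions. It only illustrates the concept of weight-proportional routing; it is not how Istio or Envoy implement load balancing, and the version names, weights, and helper functions are hypothetical.

```python
import random


def pick_version(weights, rng):
    """Pick a backend version with probability proportional to its weight."""
    versions = list(weights)
    return rng.choices(versions, weights=[weights[v] for v in versions], k=1)[0]


def simulate(weights, requests, seed=0):
    """Route `requests` simulated requests and count hits per version."""
    rng = random.Random(seed)  # fixed seed keeps the simulation repeatable
    counts = {version: 0 for version in weights}
    for _ in range(requests):
        counts[pick_version(weights, rng)] += 1
    return counts


# Send roughly 10% of the traffic to the new version, 90% to the old one.
counts = simulate({"v1": 90, "v3": 10}, requests=10000)
print(counts)
```

Raising the weight of the new version step by step, while watching its metrics, is the essence of a progressive migration.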

Aggregating Istio and Sysdig metrics, you can supervise these service migrations with all of the information you need to make further decisions.

For example, we are comparing the alpha and beta service pods. They provide the same Kubernetes service, and using Istio traffic shifting, we decide to split ingress traffic 50-50.

As you can see, the number of requests and duration of requests (two top graphs) are extremely similar, so we can assume it’s a fair comparison in terms of load.

If you look at the two bottom graphs, it turns out that service alpha is suffering almost three times the number of HTTP errors. Also, its worst-case response time (99th percentile, bottom-right graph) is significantly higher than that of service beta. Looks like our developers did a nice job with the new version :).
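
The same comparison can be expressed numerically. The sketch below computes an error ratio per version from request counts grouped by response code, assuming that "HTTP errors" means 5xx responses. The counts are invented for illustration and are not the values behind the graphs.

```python
def error_ratio(requests_by_code):
    """Share of requests that returned a 5xx response code."""
    total = sum(requests_by_code.values())
    if total == 0:
        return 0.0
    errors = sum(
        count for code, count in requests_by_code.items()
        if str(code).startswith("5")
    )
    return errors / total


# Hypothetical counts per response code, as you might read them from
# istio_requests_total grouped by version and code.
ALPHA = {"200": 970, "503": 30}
BETA = {"200": 989, "503": 11}

factor = error_ratio(ALPHA) / error_ratio(BETA)
print(f"alpha: {error_ratio(ALPHA):.1%}, beta: {error_ratio(BETA):.1%}, "
      f"alpha/beta: {factor:.1f}x")
```

Because both versions receive a similar load in a 50-50 split, comparing ratios and comparing absolute error counts lead to the same conclusion.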

Istio Sysdig A/B deployment

Conclusions

Istio solves the “mesh tangle” by adding a transparent proxy as a sidecar to your service-provider pods. From this vantage point, it can collect fine-grained metrics and dynamically modify the routing flow without interfering with the pod software.

This strategy nicely complements Sysdig’s analogous, non-intrusive, minimal-instrumentation approach to keeping your service pods simple and infrastructure agnostic (as they should be).

Now you are collecting and organizing your service metrics into nice-looking dashboards. Do you know which metrics are really important to measure service quality and diagnose correct application behavior? We recommend that you continue reading about the four golden signals of monitoring.

Sysdig helps you follow monitoring best practices. Try it today!
