How to monitor Istio with Sysdig

In a previous article, we talked about how to monitor the Istio service mesh in Kubernetes with the out-of-the-box observability stack. This time, we will walk you through monitoring the Istio service mesh with Sysdig Monitor and troubleshooting issues.

Istio service mesh provides special characteristics and functionalities for microservices running on Kubernetes. Some of these features are:

Fault injection for Chaos engineering in Kubernetes.
Network management for microservices (virtual services, routing options, load balancing, traffic optimization, etc.).
Security features, like transparent TLS encryption, authentication, authorization, and audit tools.

All these capabilities add an extra layer of complexity to the whole ecosystem, making it even more difficult to monitor applications and services running on Kubernetes.

Sysdig Monitor helps users with Istio monitoring, providing a comprehensive and unified portal where users can review their data. In addition, Sysdig Monitor brings extra features like Advisor and Inspect, a set of tools that will help you to troubleshoot applications and find out the root cause of issues very quickly.

Do you want to learn more about these Sysdig Monitor exclusive features? Congratulations, you’re in the right place!

Benefits of Sysdig Monitor for Istio service mesh

If you already read “How to monitor Istio, the Kubernetes service mesh,” you may already be asking yourself:

Why should I use Sysdig Monitor if I already have the default Istio monitoring stack?

In this section, we’ll answer this basic question. Check out the following list and learn how Sysdig Monitor can help you monitor the Istio service mesh.

Advisor helps you troubleshoot issues in your Istio service mesh infrastructure.
Inspect provides a web UI to analyze captures collected by Sysdig agents. You can run a post-mortem analysis of a problem right after it occurs.
Scalability is provided out-of-the-box with Sysdig Monitor. Because it is a SaaS offering, you won’t face the challenges of running Prometheus at scale.
Sysdig Monitor provides LTS (Long-Term Storage). You won’t need to worry about how and where time-series data is stored.
The Sysdig Agent collects all the Istio metrics you may need. Since Istio already exposes metrics in Prometheus format through prometheus.io annotations, you don’t need to deploy a Prometheus instance to scrape metrics from your Istio infrastructure. The Sysdig Agent takes care of that task.
A set of alert templates for Istio is available in Sysdig Monitor. You can even create your own alerts based on your preferences.
Istio control plane, services, and workloads dashboards are included out-of-the-box in Sysdig Monitor. As soon as the platform starts ingesting Istio traffic, dashboards will be automatically enabled for you.
Metrics Explorer lets you inspect all the metrics available for your cluster, and a PromQL UI lets you run your own PromQL queries.
Sysdig Monitor is a unified portal for any Kubernetes distribution and cloud provider. You have the monitoring data from all your environments in a single place.

As you can see, Sysdig Monitor provides a lot of exclusive features to help customers with Istio monitoring.

How to monitor Istio with Sysdig Monitor

First of all, if you are not a Sysdig Monitor user yet, request a 30-day trial account. It will be activated within a few minutes of registering in the Sysdig portal. This trial account gives you access to all the Sysdig Monitor features, and no credit card is required!

Sysdig Monitor gets its information from your Kubernetes cluster through agents deployed on it. The Sysdig Agent can be installed either by applying a few YAML manifests or by installing a Helm chart.

The agent deployed in the environment used for this article is 1.5.21. It is part of the sysdig-deploy Helm chart 1.3.13. If you need instructions for other versions, or further information on how to deploy the agent, check the official Sysdig Monitor documentation.

The Sysdig Agent pods are controlled by a DaemonSet named sysdig-agent. By its nature, a DaemonSet ensures that every node runs a copy of the Sysdig Agent pod.

$ kubectl get daemonset -n sysdig-agent
NAME                         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
sysdig-agent                 3         3         3       3            3           <none>          5m
sysdig-agent-node-analyzer   3         3         3       3            3           <none>          5m

Once you have deployed the agent, whether by applying the manifests by hand or by installing the Helm chart, wait a few seconds until the pods are up and running.

$ kubectl get pods -n sysdig-agent
NAME                               READY   STATUS    RESTARTS           AGE
sysdig-agent-d987v                 1/1     Running   0                  42s
sysdig-agent-ffr5j                 1/1     Running   0                  42s
sysdig-agent-node-analyzer-jgtbz   2/3     Running   0                  39s
sysdig-agent-node-analyzer-plrfz   2/3     Running   0                  39s
sysdig-agent-node-analyzer-qglg4   2/3     Running   0                  39s
sysdig-agent-s2nwh                 1/1     Running   0                  42s
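
If you would rather script this check than read the READY column by eye, the following is a minimal sketch. The helper name pods_not_ready is hypothetical (it is not part of kubectl or Sysdig), and it assumes the default kubectl get pods table layout shown above.

```shell
# pods_not_ready: hypothetical helper, not part of kubectl or Sysdig.
# Reads the default `kubectl get pods` table on stdin and prints the name
# of every pod whose READY column (for example 2/3) is not fully ready.
pods_not_ready() {
  awk '
    NR == 1 { next }                  # skip the header row
    {
      split($2, ready, "/")           # READY looks like "ready/total"
      if (ready[1] != ready[2]) print $1
    }
  '
}

# Usage against a live cluster (namespace as used in this article):
#   kubectl get pods -n sysdig-agent | pods_not_ready
```

Fed the output above, it would list the three sysdig-agent-node-analyzer pods, which are still at 2/3. If you only need to block until the rollout completes, kubectl rollout status daemonset/sysdig-agent -n sysdig-agent is another option.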

The agent is ready, sending information from your cluster to the Sysdig Monitor SaaS portal, but…

I want to monitor the Istio service mesh. What else should I do in Sysdig Monitor?

Nothing! 🥳

As we mentioned in a previous section, Istio metrics are exposed in Prometheus format through prometheus.io annotations. This makes them easy to discover and scrape for any Prometheus instance.

So, is the agent able to scrape metrics from Prometheus metrics endpoints?

Yes! 🙌

Actually, a lightweight Prometheus server is embedded into the Sysdig Agent, which enables the agent to collect metrics from the different endpoints exposing metrics in a Kubernetes cluster. These include other Prometheus instances, endpoints exposing metrics in Prometheus format, metrics exporters, and more.
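
To make the annotation mechanism more concrete, here is a small sketch of how a scraper could turn those annotations into a target URL. scrape_url is a hypothetical helper written for illustration, not the Sysdig Agent’s actual logic. The /metrics default path follows the common Prometheus convention, and the port and path in the usage comment are values typically seen on Istio sidecar-injected pods, so verify them against your own pods.

```shell
# scrape_url POD_IP PORT [PATH]: hypothetical helper for illustration only.
# Builds the URL a Prometheus-compatible scraper would request from the
# values of the prometheus.io/port and prometheus.io/path annotations.
scrape_url() {
  # Fall back to /metrics when no prometheus.io/path annotation is set.
  printf 'http://%s:%s%s\n' "$1" "$2" "${3:-/metrics}"
}

# Inspect the annotations of a real pod (replace the placeholders):
#   kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.metadata.annotations}'
#
# Example with values commonly seen on sidecar-injected pods (assumed):
#   scrape_url 10.0.0.12 15020 /stats/prometheus
```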

If you want more information about setting up monitoring with Sysdig, we have prepared a great guide that will help. Find out how to have a fully functional monitoring environment in a few steps with Sysdig Monitor!

Monitoring Istio service mesh control plane

Sysdig Monitor provides some out-of-the-box dashboards for monitoring Istio. In this section, we will start with the Istio service mesh control plane dashboard.

This is the dashboard you will want to check to ensure that everything is working properly in the Istio control plane.

The Pilot pushes and errors graphs give you enough information to determine whether Istio (Pilot) is propagating changes properly.

Istio dynamically configures its Envoy proxies with a set of discovery APIs, called xDS. Check the Envoy section to see how it is performing while applying these dynamic configurations.

Last but not least, the Webhook section shows the number of validations and injections performed by Galley.

If you want to learn more about the metrics used in this dashboard, refer to the Istio monitoring integration documentation.

Istio services dashboard

The Istio services dashboard gives you a complete view of how your services and applications are behaving within the Istio service mesh.

In terms of HTTP connections, you can check things like the volume of client and server requests, the duration of those requests, the rate of non-5xx HTTP responses, etc.

For TCP connections, Sysdig Monitor provides out-of-the-box graphs to check the TCP bytes received and sent.
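
As a rough illustration of what the non-5xx success rate measures, the sketch below computes the share of non-5xx responses from raw Prometheus text. It assumes Istio’s standard istio_requests_total counter with its response_code label, and input lines without trailing timestamps. success_rate is a hypothetical helper, and the dashboard itself works on rates over a time window rather than on lifetime counter totals.

```shell
# success_rate: hypothetical helper for illustration only.
# Reads Prometheus text-format metrics on stdin and prints the percentage
# of istio_requests_total samples whose response_code is below 500.
success_rate() {
  awk '
    index($0, "istio_requests_total{") == 1 {
      # Pull the response_code label value out of the label set.
      if (match($0, /response_code="[0-9]+"/)) {
        code = substr($0, RSTART + 15, RLENGTH - 16)
        total += $NF                       # sample value is the last field
        if ((code + 0) < 500) ok += $NF    # force a numeric comparison
      }
    }
    END {
      if (total == 0) { print "no data"; exit }
      printf "%.1f\n", 100 * ok / total
    }
  '
}

# Usage: pipe any Prometheus text-format dump into the helper, e.g.
#   <command that prints Prometheus text metrics> | success_rate
```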


For more information on the metrics used in this dashboard, check out the Istio Envoy monitoring integration documentation.

How to monitor Istio workloads

The Istio Workload dashboard provides a collection of graphs designed to easily spot the number of connections in your Istio service mesh.

In addition, it provides information about the health of the services running in the Istio service mesh, like response codes, latencies, and success rate, among others.

Thanks to this dashboard, you can quickly assess the health of your workloads running on Istio. Watch out for latencies and for 4xx and 5xx response codes. These graphs will give you insights into the health of your applications.


Troubleshooting issues in Istio service mesh

It’s time to test the troubleshooting capabilities that Sysdig Monitor provides.

Let’s see how Sysdig’s Advisor can help you to troubleshoot issues from the Sysdig Monitor portal.

In this testing scenario, we ran some workloads to generate HTTP and TCP traffic. You can easily reproduce a similar use case by deploying the Bookinfo example application and then running curl in an infinite loop to generate traffic.

$ export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].nodePort}')
$ export SECURE_INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="https")].nodePort}')
$ export INGRESS_HOST=$(kubectl get po -l istio=ingressgateway -n istio-system -o jsonpath='{.items[0].status.hostIP}')
$ export GATEWAY_URL=$INGRESS_HOST:$INGRESS_PORT
$ while true; do curl -s -o /dev/null "http://$GATEWAY_URL/productpage"; done

After a while, the Istio service dashboards start reporting drops in the volume of client and server requests. If you pay attention to the “Server Success Rate” (non-5xx responses), you’ll notice that the reviews-v3 workload seems to be failing.

Istio service dashboards reporting Client and Server requests volume drops.

Let’s go to the “Workload Status & Performance” dashboard. It will be super useful for confirming there is a problem in some of the workloads that make up the Bookinfo test application.

While there is not a prominent peak in the memory graph (it seems to grow linearly and constantly, though), the CPU took off like a rocket on its way to the moon!

The “Workload Status & Performance” dashboard - super useful for confirming there is a problem in some of the workloads that make up the Bookinfo test application.

In summary, there is a problem with the reviews-v3 workload. It could well be the culprit behind the traffic drop, the failed server responses, etc. It looks like you have some clues so far, but…

What else can you do to find the root cause of this problem?

Let’s play around with Advisor!

You can access Advisor from the top icon on the left bar menu.

In the Advisor section, checking the “Containers” tab, you will spot some of the data you have already seen before for this pod (memory and CPU usage).

You can explore other projects/pods/containers by navigating the tree. This is useful for ensuring there are no problems with other services/pods.

Image showing the Advisor tab on the Sysdig platform

The “Processes” tab gives more information on the processes involved with containers running in the pods. In this particular case, it seems like a Java process is consuming all the memory and CPU resources.

Finally, let’s use Sysdig’s Advisor to check the current container log.

Bingo! 🎉

You found the root cause of the issue. The Java application is reporting an OutOfMemory error in the log.

The Java application in Sysdig Advisor reporting an OutOfMemory error in the log.

Bonus track

You already figured out what was causing trouble. This time, it was a specific application that stopped working because of an OutOfMemoryError, preventing the whole service from running properly. But…

What if you could configure an alert that creates a capture every time it is triggered, so you can run a post-mortem analysis if something similar happens again?

Let’s create an alert that will be fired every time the latency of the reviews-v3 application reaches or exceeds 100ms.
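
Sysdig Monitor alerts are configured in the portal, so the following is only a sketch of the kind of condition such an alert evaluates, under the assumption that the 100ms threshold refers to request latency. It uses Istio’s standard istio_request_duration_milliseconds histogram (its _sum and _count series) with the destination_workload label. avg_latency_ms and latency_breached are hypothetical helpers, and a real alert would evaluate a windowed rate or percentile rather than a lifetime average.

```shell
# avg_latency_ms WORKLOAD: hypothetical helper for illustration only.
# Reads Prometheus text-format metrics on stdin and prints the average
# request duration (ms) for the given destination workload.
avg_latency_ms() {
  awk -v wl="destination_workload=\"$1\"" '
    index($0, wl) == 0 { next }      # skip lines for other workloads
    index($0, "istio_request_duration_milliseconds_sum")   == 1 { sum += $NF }
    index($0, "istio_request_duration_milliseconds_count") == 1 { cnt += $NF }
    END { if (cnt > 0) printf "%.1f\n", sum / cnt; else print "0.0" }
  '
}

# latency_breached WORKLOAD THRESHOLD_MS: succeeds (exit 0) when the
# average latency read from stdin reaches or exceeds the threshold.
latency_breached() {
  awk -v a="$(avg_latency_ms "$1")" -v t="$2" 'BEGIN { exit !(a >= t) }'
}

# Usage:
#   <command that prints Prometheus text metrics> | latency_breached reviews-v3 100 \
#     && echo "reviews-v3 latency is at or above 100ms"
```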

This alert will trigger a capture automatically, which will include tons of data (syscalls, processes, files, CPU and memory in use, etc). You’ll use this capture with Sysdig Inspect to figure out what happened at that time.

An alert that will trigger a capture automatically, which will include tons of data (syscalls, processes, files, CPU and memory in use, etc).

Sysdig Inspect is an open source tool integrated with Sysdig Monitor. It enables you to analyze what happened in a container at a specific time. It also shows you which processes were running at that time, memory and CPU consumption, network data, files, and more.

With this capture file, it’s easier and quicker to figure out what happened in your container when the problem came up. Just open the capture from the Sysdig Monitor portal, and Sysdig Inspect will provide a new UI to navigate through the container snapshot.

Conclusion

Istio service mesh for Kubernetes provides a lot of great capabilities for users. That includes network management for microservices, security features, and even an observability stack that allows you to not only monitor, but manage the Istio service mesh infrastructure.

This adds an extra layer of complexity to the application and Kubernetes. Monitoring the Istio service mesh shouldn’t be optional; it is a must. Sysdig Monitor offers extra capabilities that help customers monitor the Istio control plane, services, and workloads, and even troubleshoot issues in real time.

If you want to learn more about how Sysdig Monitor can help you with monitoring and troubleshooting your Kubernetes clusters, visit the Sysdig Monitor trial page and request a 30-day free account. You will be up and running in a few minutes!