What if you could monitor your applications for free, without complex dependencies or performance overhead? That’s why we recently launched Sysdig Tracers: a new feature that extends the open source Sysdig troubleshooting tool from the infrastructure layer up into the application layer.
Tracers let you measure the execution time of any segment of code, from a start point to an exit point. Tagging these code spans is as easy as writing a string to /dev/null, has almost no performance impact, and works from any programming language (even your bash scripts!). Sysdig also lets you go much deeper than measuring API requests: you can understand the application, file and network processing that makes up the latency in your processes. There is no need to link new libraries or instrument application frameworks, and if you don’t want to write tags to /dev/null manually, we have wrappers for some of the most popular programming languages, like Python and Go.
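To make that concrete, here is a minimal sketch of the raw mechanism with no wrapper at all. The tag name is made up for illustration; the `>:t:tag::` (enter) and `<:t:tag::` (exit) markers written to /dev/null are what sysdig’s tracer parser looks for, with `t` asking sysdig to use the writing thread’s id:

```python
def enter_marker(tag):
    # ">" opens a span, "t" means "use the calling thread's id",
    # tags are period-separated, and the trailing "::" leaves the
    # optional argument list empty.
    return ">:t:%s::" % tag

def exit_marker(tag):
    # "<" closes the span opened with the same id and tags.
    return "<:t:%s::" % tag

# Writing the markers to /dev/null is all the "instrumentation" there is:
with open("/dev/null", "w") as dev_null:
    dev_null.write(enter_marker("fibonacci_handler"))
    dev_null.flush()  # the code being measured would run here
    dev_null.write(exit_marker("fibonacci_handler"))
```

When sysdig is not capturing, these writes cost next to nothing; when it is, every syscall between the two markers gets tagged with the span name.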
In order to illustrate how to use Sysdig Tracers with a real example, we have created a simple API service. Let’s run some requests against it and see what Sysdig can show us.
API endpoint tracing instrumentation
The server uses Python SimpleHTTPServer and listens on port 8888 for incoming HTTP GET requests on these endpoints:
/fib/{number}: returns the Fibonacci number for the given parameter {number}
/empty: returns an empty response
/download: returns a randomly chosen file
We are going to trace the performance of every HTTP GET request to each endpoint. Unlike what Sysdig users are accustomed to, Sysdig Tracers requires a small amount of code instrumentation (really just tagging). Sysdig shines at giving you application-layer metrics without any code instrumentation, decoding application protocols like HTTP requests or SQL queries from the system calls that read and write file descriptors (sockets as well as files). Now we are digging even deeper into apps, and that requires tagging the code we want to profile.
Using Tracers in your Python code is as simple as importing sysdig_tracer and wrapping your code in a with block. To learn more about the Python wrapper, we suggest reading Introducing Python Sysdig Tracers.
Here we can see how the entire code that handles a request to /download falls under the scope of the download_handler tracer, thanks to a with statement. Inside, we have identified two subsections we want to measure separately: writing the HTTP headers and writing the body content of the response. Using tracer spans we can tag each one individually.
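A sketch of what that handler structure looks like, written here in Python 3 with a hand-rolled context manager built on the raw /dev/null markers (the real example uses the sysdig_tracer wrapper, whose with-statement usage reads much the same; the handler body and file name below are illustrative):

```python
import contextlib
from http.server import BaseHTTPRequestHandler

@contextlib.contextmanager
def span(tag, device="/dev/null"):
    # Emit sysdig tracer enter/exit markers around the enclosed code.
    # Spans opened on the same thread nest automatically, so the inner
    # tags below show up in reports as e.g. download_handler.download_writes.
    with open(device, "a") as fp:
        fp.write(">:t:%s::" % tag)
        fp.flush()
        try:
            yield
        finally:
            fp.write("<:t:%s::" % tag)
            fp.flush()

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/download":
            with span("download_handler"):
                with span("write_headers"):
                    self.send_response(200)
                    self.send_header("Content-Type", "application/octet-stream")
                    self.end_headers()
                with span("download_writes"):
                    # Hypothetical file name, standing in for the randomly
                    # chosen blob the example server serves.
                    with open("/tmp/blob.bin.1", "rb") as f:
                        self.wfile.write(f.read())
```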
In order to get some traffic we’ll use the HTTP load testing tool siege to simulate a bunch of clients accessing our API. siege will load a text file containing a long list of URLs covering the 3 endpoints with random parameters.
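A sketch of how building that URL list and driving siege might look (the host, port, request counts and concurrency below are our own choices, matching the example setup):

```shell
# Build a URL list exercising the three endpoints; /fib gets a random
# parameter on shells that support $RANDOM.
rm -f urls.txt
for i in $(seq 1 50); do
  echo "http://localhost:8888/fib/$((RANDOM % 25))" >> urls.txt
  echo "http://localhost:8888/empty"                >> urls.txt
  echo "http://localhost:8888/download"             >> urls.txt
done

# Run 10 concurrent clients for 30 seconds against the list
# (-f takes the URLs file, -c the concurrency, -t the duration).
if command -v siege >/dev/null 2>&1; then
  siege -c 10 -t 30S -f urls.txt
fi
```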
To run this scenario on your own you can either install the sysdig_tracers library from pip along with the siege package and run things manually, or let docker-compose do everything for you, running the server and the client in Docker containers.
With both the server and the client running, let’s use sysdig to dump a capture into a file. This capture will contain all the syscalls, some of them tagged with the name of the tracer that scoped the piece of code that generated them. Leave sysdig running for around 30 seconds to gather some data, then stop it with Ctrl+C:
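A capture invocation along these lines does the job (the file name is our choice; `-w` writes the raw capture to a file, and sysdig needs root to load its kernel driver). Here we bound the run with `timeout` instead of pressing Ctrl+C by hand:

```shell
# Record ~30 seconds of system activity, tracer markers included.
if command -v sysdig >/dev/null 2>&1; then
  sudo timeout 30 sysdig -w capture.scap
fi
```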
We can now stop both the server and the client to explore the capture with csysdig:
$ csysdig -v traces_summary -r capture.scap
This view gives us a report containing the average, minimum and maximum execution time of the different Tracers we instrumented. In this example we can see that the download handler is sometimes quite slow compared with the rest of the handlers. This is especially interesting as the maximum execution time peaks at 1.17 seconds. We can drill down and observe the spans inside that trace:
So far these views have given us good information, but it’s all aggregated. Sysdig also offers an excellent view to observe the execution time distribution: Sysdig spectrograms. This view can be accessed through F2 Views > Traces Spectrogram or with $ csysdig -v spectro_traces.
Here we can clearly identify how the execution time of a number of API requests goes up by an order of magnitude (~1.5 s). How can we leverage Sysdig to troubleshoot this? Well, all the Sysdig filters we know are also available when using Sysdig Tracers.
Troubleshooting slow requests with Sysdig
A good way to start would be to filter for the Tracers from requests that took more than 500 ms to complete:
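For instance, something along these lines (assuming the tracer `span` filter fields, with `span.duration` expressed in nanoseconds):

```shell
# 500 ms expressed in nanoseconds, the unit span.duration is reported in.
THRESHOLD_NS=$((500 * 1000 * 1000))

# Print the tags and duration of every span slower than the threshold.
if command -v sysdig >/dev/null 2>&1 && [ -r capture.scap ]; then
  sysdig -r capture.scap -p "%span.tags %span.duration" \
    "span.duration > $THRESHOLD_NS"
fi
```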
We are pretty sure that the code inside the download handler is very I/O intensive: it needs to read a file and then write it to the network socket. We can confirm this by grouping the syscalls by their span tags and reporting on I/O. Chisels are great for this kind of aggregation and reporting; if you haven’t used Sysdig chisels before, we wrote about them in Using Sysdig to Explore I/O with the “fdbytes_by” Chisel.
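The fdbytes_by chisel groups I/O bytes by whatever field you pass it, so a grouping along these lines should work (using the span tags as the key is our assumption here):

```shell
KEY=span.tags
# Sum I/O bytes per tracer span tag across the whole capture.
if command -v sysdig >/dev/null 2>&1 && [ -r capture.scap ]; then
  sysdig -r capture.scap -c fdbytes_by "$KEY"
fi
```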
8.82G of I/O on download_handler.download_writes: that’s a lot of I/O indeed!
Was this disk I/O or network I/O? Again, filters to the rescue. Let’s look at network traffic first:
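The `fd.type` field lets us restrict the same fdbytes_by report to sockets or to regular files, so the two reports can be compared side by side (a sketch, assuming the capture file name used earlier):

```shell
NET_FILTER='fd.type=ipv4'   # IPv4 sockets: network I/O
FILE_FILTER='fd.type=file'  # regular files: disk I/O
if command -v sysdig >/dev/null 2>&1 && [ -r capture.scap ]; then
  # Bytes per file descriptor name, network traffic only...
  sysdig -r capture.scap -c fdbytes_by fd.name "$NET_FILTER"
  # ...and the same report for files on disk.
  sysdig -r capture.scap -c fdbytes_by fd.name "$FILE_FILTER"
fi
```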
Interesting! Of all the files the download_handler is serving, we have identified one that generated far more I/O than the rest: /tmp/blob.bin.4. What do these files look like?
$ ls -lh /tmp/blob.bin.*
-rw-r--r-- 1 root root 1.0M Dec 14 10:50 /tmp/blob.bin.1
-rw-r--r-- 1 root root 1.5M Dec 14 10:50 /tmp/blob.bin.2
-rw-r--r-- 1 root root 1.3M Dec 14 10:50 /tmp/blob.bin.3
-rw-r--r-- 1 root root 512M Dec 14 10:50 /tmp/blob.bin.4
Bingo! The reason for the slowdown is this big (512M) file being sent over the network.
Summary
When trying to find performance bottlenecks in your application, traditional operating system tools are sometimes not enough, as they have limited visibility into what’s happening.
Today we learned how to combine system call profiling with code transaction tracing to analyze performance and find out why slow requests were happening in a simple application.
Instead of dealing with complex profiling solutions, Sysdig Tracers tags different spans of your code, and those tags can then be used to filter syscalls and to do aggregation and reporting, just as we already do with Sysdig.
Download this example from GitHub and install Sysdig to start profiling your application’s code. Find those bottlenecks and kill them all!