It’s fun to read about new tools on Hacker News, but I’ve always enjoyed getting my hands dirty and trying something new myself. That’s why we’re kicking off this series: to give users an easy-to-follow, hands-on experience using Sysdig to troubleshoot real problems.
Let’s get started with a scenario where we need to find out the IP address and username of someone trying to SSH into our system. We want you to find your own path to the bug, so we won’t tell you what to do right away, but we will offer some clues to get you on track! Each exercise takes a common troubleshooting use case and packages it up in a Sysdig capture, which is a file of all the system calls that happened on a host at a given point in time.
Before we begin
Setup
- Install the Sysdig troubleshooting tool
- Download Trace Capture 1
- Check out this Linux Troubleshooting Cheatsheet (if you’ve used strace, tcpdump, htop, lsof, iftop, or any similar tools, this will be gold to you ☺!)
A little bit about Sysdig
Find the Imposter
- Environment: A Kubernetes cluster running tons of stuff: WordPress, Tomcat, MySQL, Cassandra, Redis, MongoDB… with each service replicated via Deployments, and exposed via Services
- Problem: Someone repeatedly tried to log into the host via SSH and kept failing
- Goal: Find the IP address and username of the imposter
Hints
- Tip: To solve the problem, focus on what kind of I/O activity the SSH daemon does when serving a login request
- Tip: Relevant Sysdig chisels: topconns and echo_fds
- Tip: Relevant Sysdig filters: fd.port, fd.directory, fd.name
I always like to start off by opening our capture in csysdig to just get an idea of what we’re looking at. To do that, enter:
$ csysdig -r chuck_norris_capture.scap
As you can see here, we have thousands of processes running on this machine. Next, let’s check out which containers are running here by applying the csysdig container view: [Fn + F2]
This host is jam-packed with containers as well. The bottom right corner shows that there are 32 different containers running on this host. Since there’s so much data, we’ll need to get smart with the different filters and chisels if we want to get an answer here.
Hint #1
Since we know that SSH is doing some sort of network activity, that’s a good place to start.
The first possible solution here is to check the variety of network connections we have on port 22 (the port on which SSH access occurs):
$ sysdig -r chuck_norris_capture.scap -c topconns fd.port=22
This command reads the capture file, applies the topconns (top connections) chisel, and applies the filter fd.port=22, which keeps only activity on file descriptors that use port 22.
Bytes Proto Conn
--------------------------------------------------------------------------------
4.53KB tcp 73.2.104.195:29842->10.1.1.214:ssh
4.46KB tcp 73.2.104.195:35925->10.1.1.214:ssh
4.33KB tcp 73.2.104.195:55820->10.1.1.214:ssh
856B tcp 73.2.104.195:57352->10.1.1.214:ssh
200B tcp 73.2.104.195:33762->10.1.1.214:ssh
80B tcp 73.2.104.195:34435->10.1.1.214:ssh
We can see that there are a bunch of connections on port 22, but they all seem to come from the same host: 73.2.104.195. So it’s a fair guess that this is the IP address of the person trying to break into our system.
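With only six rows it’s easy to eyeball the source address, but on a busier capture you may want to script the tally. Here is a minimal Python sketch that groups the topconns rows above by source IP. It assumes the three-column text layout shown above; the names TOPCONNS_OUTPUT, CONN_RE, and count_sources are our own illustration and not part of Sysdig.

```python
import re
from collections import Counter

# Output copied from the topconns run above.
TOPCONNS_OUTPUT = """\
Bytes Proto Conn
--------------------------------------------------------------------------------
4.53KB tcp 73.2.104.195:29842->10.1.1.214:ssh
4.46KB tcp 73.2.104.195:35925->10.1.1.214:ssh
4.33KB tcp 73.2.104.195:55820->10.1.1.214:ssh
856B tcp 73.2.104.195:57352->10.1.1.214:ssh
200B tcp 73.2.104.195:33762->10.1.1.214:ssh
80B tcp 73.2.104.195:34435->10.1.1.214:ssh
"""

# Hypothetical pattern for a "src_ip:src_port->dst_ip:dst_port" connection
# string, based only on the rows shown above (IPv4 addresses).
CONN_RE = re.compile(r"(\d+\.\d+\.\d+\.\d+):\d+->(\d+\.\d+\.\d+\.\d+):\S+")


def count_sources(text):
    """Count how many connections each source IP opened."""
    sources = Counter()
    for line in text.splitlines():
        match = CONN_RE.search(line)
        if match:  # the header and separator lines simply don't match
            sources[match.group(1)] += 1
    return sources


if __name__ == "__main__":
    for ip, n in count_sources(TOPCONNS_OUTPUT).most_common():
        print(f"{ip}: {n} connection(s)")
```

Run against the output above, every row lands under 73.2.104.195, which matches what we spotted by eye.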
Hint #2
Once we have found the network connections, what else can we do? What happens if we inspect these connections? We know that there’s traffic on port 22. Now let’s see the content.
(Pro tip: this can be done with sysdig or csysdig, but since the file is so large, let’s use sysdig)
$ sysdig -r chuck_norris_capture.scap -A -c echo_fds fd.port=22
We can see all the interactions between the remote host and the SSH daemon on our local VM. But, of course, SSH is supposed to be a secure protocol. By the time the data hits the kernel and the network, it has already been encrypted. The only plaintext we can see is the list of algorithms that SSH exchanges, which is of no use to us here. So looking at the network activity doesn’t bring us a lot of value.
Hint #3
Since we’ve crossed network activity off the list, let’s think about what else SSH does that would leave a trail. What about log file activity?
$ sysdig -r chuck_norris_capture.scap -A -c echo_fds fd.name contains auth.log
This brings us much more useful information. We can see that rs, which is the rsyslog process, wrote a bunch of things in the auth.log file, among which are: Failed password for invalid user chucknorris and the IP address that we had found earlier.
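If the log holds many entries, pulling the username and address out by hand gets tedious. Here is a minimal Python sketch that extracts both from failed-login lines. The sample lines are NOT copied from the capture: they follow the usual OpenSSH auth.log wording, and the timestamps, hostname, PIDs, and the extra "Accepted" line are made up for illustration. The names SAMPLE_LOG, FAILED_RE, and failed_logins are ours as well.

```python
import re
from collections import Counter

# Illustrative lines in the usual OpenSSH auth.log style. They are NOT taken
# from the capture: timestamps, hostname, and PIDs are made up.
SAMPLE_LOG = """\
Mar 16 10:00:01 myhost sshd[1234]: Failed password for invalid user chucknorris from 73.2.104.195 port 29842 ssh2
Mar 16 10:00:05 myhost sshd[1234]: Failed password for invalid user chucknorris from 73.2.104.195 port 35925 ssh2
Mar 16 10:00:09 myhost sshd[1240]: Accepted publickey for admin from 10.1.1.5 port 50022 ssh2
"""

# "invalid user " only appears when the account does not exist on the host,
# so it is optional in the pattern.
FAILED_RE = re.compile(
    r"Failed password for (?:invalid user )?(?P<user>\S+) "
    r"from (?P<ip>\S+) port \d+"
)


def failed_logins(text):
    """Count failed password attempts per (username, source IP) pair."""
    attempts = Counter()
    for line in text.splitlines():
        match = FAILED_RE.search(line)
        if match:
            attempts[(match.group("user"), match.group("ip"))] += 1
    return attempts


if __name__ == "__main__":
    for (user, ip), n in failed_logins(SAMPLE_LOG).most_common():
        print(f"{user} from {ip}: {n} failed attempt(s)")
```

To try it on real data, you could save the echo_fds output to a file and pass its contents to failed_logins; lines that are not failed-password messages are ignored.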
Solution: Detecting SSH User Activity Video
Conclusion
While finding Chuck Norris was kind of trivial, the concept behind it is very powerful. Think about a container that is restarting constantly in your infrastructure and is not even giving you the time to go there and inspect what’s going on. By the time you are able to do a docker exec, the container has already restarted ten times. With a capture, you are able to take a snapshot of your system and look at what happens inside your container on your own time. For more real-world examples, check out How we found a bug in Amazon ELB, Fishing for Hackers, Troubleshooting Cassandra Column Selection, or many of the other posts on our blog.
Stay tuned for more! Future exercises will include:
- Unraveling Kubernetes messes: Finding a failing HTTP request, the respective TCP connection, and the Pod it originated from
- Snooping tar activity
- Troubleshooting container connectivity issues
How to detect SSH attempts by Chuck Norris https://t.co/hUvqIK58Vw
— Sysdig (@sysdig) March 16, 2017
If there are any examples with Kubernetes, Docker, Mesos, or anything else going on in your system that you’d want us to do a post on, let us know on @sysdig or in our Sysdig community Slack!