Attackers who gain control of a container use container escape techniques to extend their reach to the host, which greatly increases the damage they can cause. This is why container escapes are a recurring topic in infosec and why it is so important to have tools like Falco to detect them.
Container technologies rely on features such as namespaces, cgroups, seccomp filters, and capabilities to isolate services running on the same host and apply the principle of least privilege.
Capabilities provide a way to limit the level of access a container can have, splitting the power of the root user into more granular units. However, they are often misconfigured, granting excessive privileges to processes and threads.
CVEs published in recent years have shown that these features can be misconfigured or abused, allowing an attacker to escalate privileges inside the container and escape to the host. Some notable container breakout vulnerabilities are:
- CVE-2022-0847: “Dirty Pipe” Linux Local Privilege Escalation.
- CVE-2022-0492: Privilege escalation vulnerability causing container escape.
- CVE-2022-0185: Detecting and mitigating Linux Kernel vulnerability causing container escape.
- CVE-2019-5736: runc container breakout.
- CVE-2022-0811: Arbitrary code execution affecting CRI-O.
In this article, we explain how you can detect and monitor capabilities using Falco, analyzing a well-known container escaping technique.
What are capabilities?
Linux documentation clearly defines capabilities as:
“Starting with kernel 2.2, Linux divides the privileges traditionally associated with superuser into distinct units, known as capabilities, which can be independently enabled and disabled. Capabilities are a per-thread attribute.”
As of Linux 5.9, there are 41 capabilities, all of which are listed in the Linux documentation.
In other words, capabilities divide the privileges of the root user into small pieces so that a thread can be granted just enough power to perform specific privileged tasks. If the pieces are small enough and well chosen, then even if a privileged program is compromised, the possible damage is limited by the set of capabilities available to the process.
A diagram which shows the difference between running with default capabilities and with restricted capabilities by Snyk.
Among all capabilities available, the ones worth a special mention are CAP_SYS_ADMIN and CAP_NET_ADMIN, which are very broad and permissive capabilities.
CAP_SYS_ADMIN is required to perform a broad range of administrative operations, which makes it difficult to drop from containers that perform privileged operations. Because it is so broad, it can easily lead to additional capabilities or full root (typically, access to all capabilities).
CAP_NET_ADMIN is required to perform network-related operations such as changing interface configurations, administering the host firewall, and setting promiscuous mode. Here too, the potential damage is huge if the permissions are misused.
In containers, which are isolated environments by definition, the most permissive capabilities are removed by default. That means if you run a Docker container without specifying additional settings, Docker grants it only a limited default set of capabilities.
So where is the problem?
The key point from the explanation provided before is “small pieces” and this is where the problem begins. Splitting root privileges into small pieces is useful from a security perspective, although we don’t want too many pieces. In addition, the Linux development model doesn’t have a central authority determining how capabilities should be assigned and split.
This lack of guidance leaves developers unsure how to proceed. Lacking sufficient information to decide, a developer often chooses CAP_SYS_ADMIN or a similarly excessive capability for a new feature.
And that brings us to where we are today: CAP_SYS_ADMIN is the new root.
On one hand, the goal of capabilities is to limit the power of privileged programs to be less than root. On the other hand, if we give a program CAP_SYS_ADMIN, the game is more or less over.
In containers, even though a set of capabilities is removed by default, it’s always possible to expand the set by specifying the capabilities to add when running the container. As we know, containers can also be run directly as privileged and, in this case, the container can use all the capabilities available, CAP_SYS_ADMIN included.
Here is an easy example. Suppose we want to read the kernel messages exposed via /proc/kmsg. This operation isn’t allowed if the container runs without extra capabilities. Here is what happens when we run the command.
falco % docker run -it alpine:latest
/ # cat /proc/kmsg
cat: can't open '/proc/kmsg': Operation not permitted
Here is what happens if we run the container with CAP_SYS_ADMIN capability instead.
falco % docker run -it --cap-add CAP_SYS_ADMIN alpine:latest
/ # cat /proc/kmsg
<4>[6226394.148135] printk: cat (980141): Attempt to access syslog with CAP_SYS_ADMIN but no CAP_SYSLOG (deprecated).
As pointed out in the warning message, we should use the specific capability CAP_SYSLOG to perform this action since it has been created to segregate the permissions from CAP_SYS_ADMIN.
falco % docker run -it --cap-add CAP_SYSLOG alpine:latest
/ # cat /proc/kmsg
As you can see, with the right capability we can open the file without any warning message.
Let’s now have a look at another example where CAP_SYS_ADMIN is actually required to perform specific actions. In this example, we use the unshare command to create a new namespace, in this case inside a container. As reported in the command documentation, unshare requires the CAP_SYS_ADMIN capability to work. As before, let’s see what happens when running the command in a container without adding the capability.
falco % docker run -it alpine:latest
/ # unshare
unshare: unshare(0x0): Operation not permitted
Here is what happens if we run the container with CAP_SYS_ADMIN capability instead.
falco % docker run -it --cap-add CAP_SYS_ADMIN alpine:latest
/ # unshare
3b28503d0205:/#
This time, the extra privileges made it possible to create the new namespace.
As this example shows, some actions require CAP_SYS_ADMIN by design. Thus, the only way to spot misuse or malicious behavior is to monitor the capabilities of threads and processes.
Monitoring capabilities using Falco
Thankfully for us, Falco version 0.32 makes it possible to monitor thread capabilities and verify that only the allowed capabilities are available.
Three new fields have been added into Falco to accomplish this task:
- thread.cap_permitted: superset of capabilities a thread may ever get.
- thread.cap_inheritable: set of capabilities preserved across an execve, which may be added to the permitted set of the new program.
- thread.cap_effective: set of capabilities used by the kernel to perform permission checks for the thread.
In this case, we are in a container started with the CAP_SYS_ADMIN capability added, and we can easily see the list of capabilities applied. Using the command cat /proc/self/status, you can find the cap values applied.
CapInh: 00000000a82425fb
CapPrm: 00000000a82425fb
CapEff: 00000000a82425fb
CapBnd: 00000000a82425fb
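The same values can be read programmatically. The following is a minimal Python sketch, assuming the status text has the `Cap*: <hex>` layout shown above; `parse_cap_fields` and `SAMPLE_STATUS` are hypothetical names used for illustration, and the bit index 21 for CAP_SYS_ADMIN is taken from the kernel's capability numbering.

```python
def parse_cap_fields(status_text):
    """Return a dict such as {"CapEff": 0xa82425fb} from /proc/<pid>/status text."""
    caps = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        # Only the capability lines (CapInh, CapPrm, CapEff, CapBnd, ...) are kept.
        if sep and key.startswith("Cap"):
            caps[key] = int(value.strip(), 16)
    return caps


# Sample input mirroring the values shown above (hypothetical process name).
SAMPLE_STATUS = (
    "Name:\tsh\n"
    "CapInh:\t00000000a82425fb\n"
    "CapPrm:\t00000000a82425fb\n"
    "CapEff:\t00000000a82425fb\n"
    "CapBnd:\t00000000a82425fb\n"
)

CAP_SYS_ADMIN_BIT = 21  # assumption: numbering from linux/capability.h

caps = parse_cap_fields(SAMPLE_STATUS)
has_sys_admin = bool((caps["CapEff"] >> CAP_SYS_ADMIN_BIT) & 1)
print("CAP_SYS_ADMIN effective:", has_sys_admin)
```

Inside a real container, the text could instead be read from /proc/self/status and passed to the same function.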
Decoding the value 00000000a82425fb with capsh shows the list of capabilities.
capsh --decode=00000000a82425fb
0x00000000a82425fb=cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_sys_admin,cap_mknod,cap_audit_write,cap_setfcap
Among the others, we can see the famous CAP_SYS_ADMIN.
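The decoding itself is a simple bit lookup: bit N of the mask corresponds to capability number N. Below is a minimal Python sketch of that idea, assuming the capability numbering of Linux 5.9 (41 capabilities, numbered 0 to 40). `decode_caps` is a hypothetical helper written for illustration; it is not how capsh or Falco implement the decoding.

```python
# Capability names indexed by their bit number (assumed Linux 5.9 numbering).
CAP_NAMES = [
    "cap_chown", "cap_dac_override", "cap_dac_read_search", "cap_fowner",
    "cap_fsetid", "cap_kill", "cap_setgid", "cap_setuid", "cap_setpcap",
    "cap_linux_immutable", "cap_net_bind_service", "cap_net_broadcast",
    "cap_net_admin", "cap_net_raw", "cap_ipc_lock", "cap_ipc_owner",
    "cap_sys_module", "cap_sys_rawio", "cap_sys_chroot", "cap_sys_ptrace",
    "cap_sys_pacct", "cap_sys_admin", "cap_sys_boot", "cap_sys_nice",
    "cap_sys_resource", "cap_sys_time", "cap_sys_tty_config", "cap_mknod",
    "cap_lease", "cap_audit_write", "cap_audit_control", "cap_setfcap",
    "cap_mac_override", "cap_mac_admin", "cap_syslog", "cap_wake_alarm",
    "cap_block_suspend", "cap_audit_read", "cap_perfmon", "cap_bpf",
    "cap_checkpoint_restore",
]


def decode_caps(mask):
    """Return the names of all capabilities whose bit is set in mask."""
    return [name for bit, name in enumerate(CAP_NAMES) if (mask >> bit) & 1]


decoded = decode_caps(0x00000000A82425FB)
print(",".join(decoded))
```

For the mask used in this article, the sketch lists the same 15 capability names as the capsh output, including cap_sys_admin (bit 21).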
Having this information available in Falco allows us to create detections based on those capabilities and raise alerts if misconfigured capabilities are applied in our environment.
Detecting container escaping with Falco
In this scenario, we look at a well-known container escaping technique that relies on the cgroup v1 virtual filesystem and, big surprise, CAP_SYS_ADMIN.
The exploit was presented in this blog. To perform the escape, we need the following:
- We must be running as root inside the container.
- The container must be run with the CAP_SYS_ADMIN Linux capability.
- The container must lack an AppArmor profile, or otherwise allow the mount syscall.
- The cgroup v1 virtual filesystem must be mounted read-write inside the container.
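The prerequisites above can be turned into a quick self-check. The following is a minimal Python sketch, assuming the usual /proc/mounts column layout (device, mount point, filesystem type, options) and bit 21 for CAP_SYS_ADMIN; `escape_preconditions` and the sample mount table are hypothetical. It covers three of the four conditions and does not inspect the AppArmor profile.

```python
CAP_SYS_ADMIN_BIT = 21  # assumption: numbering from linux/capability.h


def escape_preconditions(uid, cap_eff, mounts_text):
    """Report which of the checked prerequisites hold for a process."""
    rw_cgroup_v1 = False
    for line in mounts_text.splitlines():
        fields = line.split()
        # cgroup v1 hierarchies have filesystem type "cgroup" (v2 is "cgroup2").
        if len(fields) >= 4 and fields[2] == "cgroup":
            if "rw" in fields[3].split(","):
                rw_cgroup_v1 = True
    return {
        "root_user": uid == 0,
        "cap_sys_admin": bool((cap_eff >> CAP_SYS_ADMIN_BIT) & 1),
        "cgroup_v1_rw": rw_cgroup_v1,
    }


# Hypothetical mount table for illustration only.
SAMPLE_MOUNTS = (
    "overlay / overlay rw,relatime 0 0\n"
    "cgroup /sys/fs/cgroup/memory cgroup ro,nosuid,nodev,noexec,relatime,memory 0 0\n"
    "cgroup /tmp/cgrp cgroup rw,relatime,rdma 0 0\n"
)

report = escape_preconditions(0, 0xA82425FB, SAMPLE_MOUNTS)
print(report)
```

A negative result for the mount check should be read with care: a process that holds CAP_SYS_ADMIN and is allowed to call mount may be able to mount a fresh cgroup v1 hierarchy read-write on its own.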
In particular, cgroup v1 relies on two files, notify_on_release and release_agent, which the exploit uses to execute commands as root and escape the container. Those files are also used by other exploits that pursue the same goal: escalating privileges and breaking isolation.
One very recent example is CVE-2022-0492.
If your container has this capability and this was not detected when the container was created, we have a last line of defense: runtime security with Falco.
Using Falco and the new visibility into the capabilities available to a thread, we are able to detect when the release_agent file is opened for writing by a thread that has excessive capabilities. In this case, we check whether the thread explicitly contains CAP_SYS_ADMIN in its set of effective capabilities.
- rule: Detect release_agent File Container Escapes
  desc: "This rule detects an attempt to exploit a container escape using release_agent file. By running a container with certain capabilities, a privileged user can modify release_agent file and escape from the container"
  condition: open_write and container and fd.name endswith release_agent and (user.uid=0 or thread.cap_effective contains CAP_DAC_OVERRIDE) and thread.cap_effective contains CAP_SYS_ADMIN
  output: "Detect an attempt to exploit a container escape using release_agent file (user=%user.name user_loginuid=%user.loginuid filename=%fd.name %container.info image=%container.image.repository:%container.image.tag cap_effective=%thread.cap_effective)"
  priority: CRITICAL
  tags: [container, mitre_privilege_escalation, mitre_lateral_movement]
Thanks to this rule, we get a strong, low-noise detection for all the techniques that use release_agent and excessive capabilities to break container isolation and compromise the entire node.
10:11:27.415074914: Critical Detect an attempt to exploit a container escape using release_agent file (user=<NA> user_loginuid=-1 filename=/tmp/cgrp/release_agent cool_williamson (id=8ed46a770162) image=ubuntu:latest cap_effective=CAP_CHOWN CAP_DAC_OVERRIDE CAP_FOWNER CAP_FSETID CAP_KILL CAP_SETGID CAP_SETUID CAP_SETPCAP CAP_NET_BIND_SERVICE CAP_NET_RAW CAP_SYS_CHROOT CAP_SYS_ADMIN CAP_MKNOD CAP_AUDIT_WRITE CAP_SETFCAP)
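To make the boolean logic of the rule condition easier to follow, here is a minimal Python sketch that mirrors it on a plain dictionary. This is not Falco's rule engine: the event keys (`open_write`, `in_container`, `fd_name`, `uid`, `cap_effective`) are hypothetical stand-ins for the Falco macros and fields, and the sample event only reuses the file name and capability list from the alert above.

```python
def release_agent_rule_matches(evt):
    """Mirror the rule condition on a hypothetical event dictionary."""
    caps = set(evt["cap_effective"].split())
    # Corresponds to: user.uid=0 or thread.cap_effective contains CAP_DAC_OVERRIDE
    can_write_any_file = evt["uid"] == 0 or "CAP_DAC_OVERRIDE" in caps
    return bool(
        evt["open_write"]
        and evt["in_container"]
        and evt["fd_name"].endswith("release_agent")
        and can_write_any_file
        and "CAP_SYS_ADMIN" in caps
    )


alert_event = {
    "open_write": True,
    "in_container": True,
    "fd_name": "/tmp/cgrp/release_agent",
    "uid": None,  # the alert reports user=<NA>, so the uid is left unknown here
    "cap_effective": (
        "CAP_CHOWN CAP_DAC_OVERRIDE CAP_FOWNER CAP_FSETID CAP_KILL CAP_SETGID "
        "CAP_SETUID CAP_SETPCAP CAP_NET_BIND_SERVICE CAP_NET_RAW CAP_SYS_CHROOT "
        "CAP_SYS_ADMIN CAP_MKNOD CAP_AUDIT_WRITE CAP_SETFCAP"
    ),
}

# Same event, but without CAP_SYS_ADMIN in the effective set.
no_sys_admin_event = dict(
    alert_event,
    cap_effective=alert_event["cap_effective"].replace(" CAP_SYS_ADMIN", ""),
)

print(release_agent_rule_matches(alert_event))
print(release_agent_rule_matches(no_sys_admin_event))
```

The second event shows why the rule stays quiet for containers that lack CAP_SYS_ADMIN: a write to a file named release_agent alone is not enough to trigger it.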
Capabilities provide a way to isolate containers. However, as we have seen, misconfigurations and excessive capabilities are often part of new CVEs and can cause significant security issues.
With a tool like Falco, it’s possible to monitor when specific capabilities like CAP_SYS_ADMIN are misused. Using the new Falco fields in rules, it’s now possible to raise security alerts and get notified when these malicious behaviors happen in your environment.
If you would like to find out more about Falco:
- Get started at Falco.org.
- Check out the Falco project on GitHub.
- Get involved with the Falco community.
- Meet the maintainers on the Falco Slack.
- Follow @falco_org on Twitter.