Once you have defined Kubernetes RBAC (users, service credentials and permissions), we can start leveraging Kubernetes orchestration capabilities to configure security at the pod level. In this part we will cover the Kubernetes Security Context, Kubernetes Pod Security Policy and Kubernetes Network Policy resources, which define container privileges, permissions, capabilities and network communication rules. We will also discuss how to limit resource starvation with allocation management. The first part of this Kubernetes security guide focuses on Kubernetes RBAC and TLS certificates, while part 3 covers securing the Kubernetes components themselves: kubelet, etcd and the Docker registry.
Kubernetes admission controllers
An admission controller is a piece of code that intercepts requests to the Kubernetes API server prior to persistence of the object, but after the request is authenticated and authorized. Admission controllers pre-process the requests and can provide utility functions (such as filling out empty parameters with default values), but can also be used to enforce security policies and other checks. Admission controllers are configured in the kube-apiserver configuration file:

--admission-control=Initializers,NamespaceLifecycle,LimitRanger,ServiceAccount,PersistentVolumeLabel,DefaultStorageClass,DefaultTolerationSeconds,NodeRestriction,ResourceQuota
Here are the admission controllers that can help you strengthen security:
DenyEscalatingExec: Forbids executing commands on an “escalated” container. This includes pods that run as privileged, have access to the host IPC namespace, or have access to the host PID namespace. Without this admission controller, a regular user could escalate privileges over the Kubernetes node simply by spawning a terminal in one of these containers.
NodeRestriction: Limits the node and pod objects a kubelet can modify. Using this controller, a Kubernetes node will only be able to modify the API representation of itself and the pods bound to this node.
PodSecurityPolicy: This admission controller acts on creation and modification of the pod and determines if it should be admitted based on the requested security context and the available Pod Security Policies. The PodSecurityPolicy objects define a set of conditions and security context that a pod must declare in order to be accepted into the cluster; we will cover PSP in more detail below.
ValidatingAdmissionWebhooks: Calls an external service that implements your custom security policies to decide whether a pod should be accepted in your cluster. For example, you can pre-validate container images using Grafeas, a container-oriented auditing and compliance engine, or validate Anchore scanned images.
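As a rough sketch of how such an external validator is wired in, here is a minimal ValidatingWebhookConfiguration; the webhook name and the image-validator service in the security namespace are hypothetical, and the validating service itself (e.g. one backed by Grafeas or Anchore) has to be deployed separately:

# Sketch: register an external webhook that validates pod creation requests.
apiVersion: admissionregistration.k8s.io/v1beta1
kind: ValidatingWebhookConfiguration
metadata:
  name: image-policy-webhook
webhooks:
- name: image-policy.example.com
  rules:
  - apiGroups: [""]
    apiVersions: ["v1"]
    operations: ["CREATE"]
    resources: ["pods"]
  failurePolicy: Fail
  clientConfig:
    service:
      # hypothetical in-cluster service implementing the policy check
      namespace: security
      name: image-validator
      path: "/validate"
    # caBundle: <base64-encoded CA certificate used to verify the webhook server>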
There is a recommended set of admission controllers to run depending on your Kubernetes version.
Kubernetes Security Context
When you declare a pod/deployment, you can group several security-related parameters, like the SELinux profile, Linux capabilities, etc., in a securityContext block:
...
spec:
securityContext:
runAsUser: 1000
fsGroup: 2000
...
You can configure the following parameters as part of your security context:
Privileged: Processes inside of a privileged container get almost the same privileges as those outside of a container, such as being able to directly configure the host kernel or host network stack.
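As a minimal sketch (the pod and container names are illustrative), privileged mode is requested per container:

spec:
  containers:
  - name: privileged-demo
    image: busybox
    securityContext:
      # grants near-host-level privileges; avoid unless strictly required
      privileged: true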
Other context parameters that you can enforce include:
User and Group ID for the processes, containers and volumes: When you run a container without any security context, the ‘entrypoint’ command runs as root. This is easy to verify:
$ kubectl run -i --tty busybox --image=busybox --restart=Never -- sh
/ # ps aux
PID USER TIME COMMAND
1 root 0:00 sh
Using the runAsUser parameter you can modify the user ID of the processes inside a container. For example:
apiVersion: v1
kind: Pod
metadata:
name: security-context-demo
spec:
securityContext:
runAsUser: 1000
fsGroup: 2000
volumes:
- name: sec-ctx-vol
emptyDir: {}
containers:
- name: sec-ctx-demo
image: gcr.io/google-samples/node-hello:1.0
volumeMounts:
- name: sec-ctx-vol
mountPath: /data/demo
securityContext:
allowPrivilegeEscalation: false
If you spawn a container using this definition, you can check that the initial process is running with UID 1000, for example with ps aux inside the container:

$ kubectl exec security-context-demo -- ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
1000 1 0.0 0.0 4336 724 ? Ss 18:16 0:00 /bin/sh -c node server.js
Any file you create inside the /data/demo volume will use GID 2000, due to the fsGroup parameter.
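A quick way to verify the fsGroup behavior (a sketch; exact output depends on the image):

$ kubectl exec security-context-demo -- sh -c "touch /data/demo/testfile && ls -ln /data/demo/testfile"
# the numeric group column should show 2000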
Security Enhanced Linux (SELinux): You can assign SELinuxOptions objects using the seLinuxOptions field. Note that the SELinux module needs to be loaded on the underlying Linux nodes for these policies to take effect.
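As a minimal sketch, assuming nodes with SELinux enabled (the MLS level value is purely illustrative):

securityContext:
  seLinuxOptions:
    # assign an SELinux level to the container processes and volumes
    level: "s0:c123,c456"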
Capabilities: Linux capabilities break down the full, unrestricted access of root into a set of separate permissions. This way, you can grant some privileges to your software, like binding to a port below 1024, without granting full root access.
There is a default set of capabilities granted to any container if you don’t modify the security context, for example CHOWN to change file ownership or NET_RAW to craft raw network packets.
Using the pod security context, you can drop default Linux capabilities and/or add non-default Linux capabilities. Again, applying the principle of least-privilege you can greatly reduce the damage of any malicious attack taking over the pod.
As a quick example, you can spawn the flask-cap pod:
$ kubectl create -f flask-cap.yaml
apiVersion: v1
kind: Pod
metadata:
name: flask-cap
namespace: default
spec:
containers:
- image: mateobur/flask
name: flask-cap
securityContext:
capabilities:
drop:
- NET_RAW
- CHOWN
Note that some securityContext fields apply at the pod level, while others apply at the container level; when both are set, the container-level values take precedence.
If you spawn a shell, you can verify that these capabilities have been dropped:
$ kubectl exec -it flask-cap bash
root@flask-cap:/# ping 8.8.8.8
ping: Lacking privilege for raw socket.
root@flask-cap:/# chown daemon /tmp
chown: changing ownership of '/tmp': Operation not permitted
AppArmor and Seccomp: You can also apply the profiles of these security frameworks to Kubernetes pods. This feature is in beta state as of Kubernetes 1.9; profile configurations are referenced using annotations for the time being.
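For example, a pod can reference the container runtime's default seccomp profile and a node-local AppArmor profile through annotations; a sketch (the pod name, container name and AppArmor profile name are illustrative, and the AppArmor profile must already be loaded on the node):

apiVersion: v1
kind: Pod
metadata:
  name: hardened-demo
  annotations:
    # beta: apply the runtime's default seccomp profile to the whole pod
    seccomp.security.alpha.kubernetes.io/pod: "runtime/default"
    # beta: apply a node-loaded AppArmor profile to the container "main"
    container.apparmor.security.beta.kubernetes.io/main: "localhost/my-profile"
spec:
  containers:
  - name: main
    image: busybox
    command: ["sh", "-c", "sleep 3600"]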
AppArmor, Seccomp or SELinux allow you to define run-time profiles for your containers, but if you want to define run-time profiles at a higher level, with more context, Sysdig Falco and Sysdig Secure can be better options. Sysdig Falco monitors the run-time security of your containers according to a set of user-defined rules; it has some similarities and some important differences with the other tools we just mentioned (reviewed in the “SELinux, Seccomp, Sysdig Falco, and you” article).
AllowPrivilegeEscalation: The execve system call can grant a newly-started program privileges that its parent did not have, for example via the setuid or setgid file bits. This is controlled by the allowPrivilegeEscalation boolean; privilege escalation should be allowed with care and only when required.
ReadOnlyRootFilesystem: This controls whether a container is able to write into its root filesystem. Containers commonly only need to write to mounted volumes that persist their state, since their root filesystem is supposed to be immutable. You can enforce this behavior using the readOnlyRootFilesystem flag:
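The referenced flask-ro.yaml manifest is not reproduced in the article; a minimal equivalent sketch would be:

apiVersion: v1
kind: Pod
metadata:
  name: flask-ro
  namespace: default
spec:
  containers:
  - image: mateobur/flask
    name: flask-ro
    securityContext:
      # mount the container root filesystem read-only
      readOnlyRootFilesystem: true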
$ kubectl create -f https://raw.githubusercontent.com/mateobur/kubernetes-securityguide/master/readonly/flask-ro.yaml
$ kubectl exec -it flask-ro bash
root@flask-ro:/# mount | grep "/ "
none on / type aufs (ro,relatime,si=e6100da9e6227a70,dio,dirperm1)
root@flask-ro:/# touch foo
touch: cannot touch 'foo': Read-only file system
Kubernetes Security Policy
Kubernetes Pod Security Policy (PSP), often shortened to Kubernetes Security Policy, is implemented as an admission controller. Using security policies you can restrict which pods are allowed to run on your cluster: only those that comply with the policies you have defined will be admitted. These are the control aspects that the cluster administrator can set:

| Control Aspect | Field Names |
| --- | --- |
| Running of privileged containers | privileged |
| Usage of the root namespaces | hostPID, hostIPC |
| Usage of host networking and ports | hostNetwork, hostPorts |
| Usage of volume types | volumes |
| Usage of the host filesystem | allowedHostPaths |
| White list of FlexVolume drivers | allowedFlexVolumes |
| Allocating an FSGroup that owns the pod’s volumes | fsGroup |
| Requiring the use of a read-only root file system | readOnlyRootFilesystem |
| The user and group IDs of the container | runAsUser, supplementalGroups |
| Restricting escalation to root privileges | allowPrivilegeEscalation, defaultAllowPrivilegeEscalation |
| Linux capabilities | defaultAddCapabilities, requiredDropCapabilities, allowedCapabilities |
| The SELinux context of the container | seLinux |
| The AppArmor profile used by containers | annotations |
| The seccomp profile used by containers | annotations |
| The sysctl profile used by containers | annotations |
Besides admitting or rejecting a pod, a Pod Security Policy also lets you define:

- Default pod security context values (i.e. defaultAddCapabilities)
- Mandatory pod security flags and values (i.e. allowPrivilegeEscalation: false)
- Whitelists and blacklists for the list-based security flags (i.e. the list of allowed host paths to mount), for example:
allowedHostPaths:
# This allows "/foo", "/foo/", "/foo/bar" etc., but
# disallows "/fool", "/etc/foo" etc.
# "/foo/../" is never valid.
- pathPrefix: "/foo"
You need the PodSecurityPolicy admission controller enabled in your API server to enforce these policies.
If you plan to enable PodSecurityPolicy, first configure (or have already present) a default PSP and the associated RBAC permissions; otherwise the cluster will fail to create new pods.
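Pods are authorized against a PSP through RBAC, using the use verb on the policy. A minimal sketch, assuming a policy named example (like the one created below) and granting it to every service account in the cluster (the ClusterRole and ClusterRoleBinding names are illustrative):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: psp-default
rules:
- apiGroups: ['policy', 'extensions']
  resources: ['podsecuritypolicies']
  verbs: ['use']
  resourceNames: ['example']
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: psp-default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: psp-default
subjects:
# every service account in the cluster
- kind: Group
  apiGroup: rbac.authorization.k8s.io
  name: system:serviceaccounts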
If your cloud provider / deployment design already supports and enables PSP, it will come pre-populated with a default set of policies, for example:
$ kubectl get psp
NAME PRIV CAPS SELINUX RUNASUSER FSGROUP SUPGROUP READONLYROOTFS VOLUMES
gce.event-exporter false [] RunAsAny RunAsAny RunAsAny RunAsAny false [hostPath secret]
gce.fluentd-gcp false [] RunAsAny RunAsAny RunAsAny RunAsAny false [configMap hostPath secret]
gce.persistent-volume-binder false [] RunAsAny RunAsAny RunAsAny RunAsAny false [nfs secret]
gce.privileged true [*] RunAsAny RunAsAny RunAsAny RunAsAny false [*]
gce.unprivileged-addon false [] RunAsAny RunAsAny RunAsAny RunAsAny false [emptyDir configMap secret]
In the event you enabled PSP for a cluster that didn’t have any pre-populated rules, you can create a permissive policy to avoid run-time disruption and then perform iterative adjustments over your configuration.
For example, the policy below will prevent the execution of any pod that tries to run as the root user or group, while allowing any other security context:
apiVersion: extensions/v1beta1
kind: PodSecurityPolicy
metadata:
name: example
spec:
privileged: true
seLinux:
rule: RunAsAny
supplementalGroups:
rule: RunAsAny
runAsUser:
rule: MustRunAsNonRoot
fsGroup:
rule: 'MustRunAs'
ranges:
- min: 1
max: 65535
volumes:
- '*'
$ kubectl create -f psp.yaml
podsecuritypolicy "example" created
$ kubectl get psp
NAME PRIV CAPS SELINUX RUNASUSER FSGROUP SUPGROUP READONLYROOTFS VOLUMES
example true [] RunAsAny MustRunAsNonRoot MustRunAs RunAsAny false [*]
If you try to create new pods without the runAsUser directive you will get:
$ kubectl create -f https://raw.githubusercontent.com/mateobur/kubernetes-securityguide/master/readonly/flask-ro.yaml
$ kubectl describe pod flask-ro
...
Failed Error: container has runAsNonRoot and image will run as root
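To get a pod admitted under this policy, its definition needs to declare a non-root security context explicitly; a minimal sketch of the relevant fragment:

spec:
  securityContext:
    # any non-zero UID satisfies the non-root requirement
    runAsUser: 1000
    fsGroup: 2000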
Kubernetes Network Policies
Kubernetes also defines security at the pod networking level. A network policy is a specification of how groups of pods are allowed to communicate with each other and with other network endpoints. You can compare Kubernetes network policies to classic network firewalling (à la iptables), but with one important advantage: they use Kubernetes context like pod labels, namespaces, etc.

Kubernetes supports several third-party plugins that implement pod overlay networks. You need to check your provider documentation (for example, Calico or Weave) to make sure that Kubernetes network policies are supported and enabled; otherwise the configuration will show up in your cluster but will not have any effect.

Let’s use the Kubernetes example scenario guestbook to show how these network policies work:

$ kubectl create -f https://raw.githubusercontent.com/fabric8io/kansible/master/vendor/k8s.io/kubernetes/examples/guestbook/all-in-one/guestbook-all-in-one.yaml
This will create ‘frontend’ and ‘backend’ pods:
$ kubectl describe pod frontend-685d7ff496-7s6kz | grep tier
tier=frontend
$ kubectl describe pod redis-master-7bd4d6ccfd-8dnlq | grep tier
tier=backend
You can configure your network policies using these logical labels. Relying on concrete attributes such as IP addresses or physical nodes won’t work, because Kubernetes can change them dynamically.
Let’s apply the following network policy:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: deny-backend-egress
namespace: default
spec:
podSelector:
matchLabels:
tier: backend
policyTypes:
- Egress
egress:
- to:
- podSelector:
matchLabels:
tier: backend
You can also find this policy in the example repository:
$ kubectl create -f netpol/guestbook-network-policy.yaml
Then you can get the pod names and local IP addresses using:
$ kubectl get pods -o wide
[...]
In order to check that the policy is working as expected, you can ‘exec’ into the ‘redis-master’ pod and try to ping first a ‘redis-slave’ (same tier) and then a ‘frontend’ pod:
$ kubectl exec -it redis-master-7bd4d6ccfd-8dnlq bash
$ ping 10.28.4.21
PING 10.28.4.21 (10.28.4.21) 56(84) bytes of data.
64 bytes from 10.28.4.21: icmp_seq=1 ttl=63 time=0.092 ms
$ ping 10.28.4.23
PING 10.28.4.23 (10.28.4.23) 56(84) bytes of data.
(no response, blocked)
This policy will be enforced even if the pods migrate to another node or are scaled up or down.
You can also use namespace selectors and CIDR IP blocks for your ingress and egress rules, like in the example below (each entry in the from array is evaluated independently, as a logical OR):
ingress:
- from:
- ipBlock:
cidr: 172.17.0.0/16
except:
- 172.17.1.0/24
- namespaceSelector:
matchLabels:
project: myproject
- podSelector:
matchLabels:
role: frontend
Kubernetes resource allocation management
Resource limits are usually established to avoid unintended saturation due to design limitations or software bugs, but they can also protect against malicious resource abuse. Unauthorized resource consumption that tries to remain undetected is becoming much more common due to cryptojacking attempts.

There are two basic concepts: requests and limits.

Requests: The Kubernetes scheduler will check that the node has enough resources left to fully satisfy the request before scheduling the pod on it.

You can run a quick example from the resources/flask-resources.yaml repository file:
apiVersion: v1
kind: Pod
metadata:
name: flask-resources
namespace: default
spec:
containers:
- image: mateobur/flask
name: flask-resources
resources:
requests:
memory: 512Mi
limits:
memory: 700Mi
$ kubectl create -f resources/flask-resources.yaml
Limits: The maximum amount of a resource that the container is allowed to consume. Kubernetes makes sure that actual resource consumption never goes over the configured limits.
Let’s use the stress load generator inside the pod to test the limits; asking for more memory than the 700Mi limit gets the worker process killed (signal 9):
root@flask-resources:/# stress --cpu 1 --io 1 --vm 2 --vm-bytes 800M
stress: info: [79] dispatching hogs: 1 cpu, 1 io, 2 vm, 0 hdd
stress: FAIL: [79] (416) <-- worker 83 got signal 9
The resources that you can reserve and limit by default using the pod description are (see the combined example after this list):
- CPU
- Main memory
- Local ephemeral storage
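As a sketch combining the three (all values are purely illustrative), a container spec fragment could request and cap each of them:

resources:
  requests:
    cpu: 250m
    memory: 512Mi
    ephemeral-storage: 1Gi
  limits:
    cpu: 500m
    memory: 700Mi
    ephemeral-storage: 2Gi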