Pod evictions and scheduling problems are side effects of Kubernetes limits and requests, usually caused by a lack of planning.
Beginners tend to think limits are optional, merely an obstacle that keeps their workloads from running. Why set a limit if I can have none? I may need all the CPU eventually.
With that way of thinking, Kubernetes wouldn't have gotten far. Fortunately, Kubernetes developers had this in mind, and the quota mechanism is designed to prevent misuse of resources.
When you create a pod for your application, you can set requests and limits for CPU and memory for every container inside.
Properly setting these values is how you instruct Kubernetes to manage your app's resources. Kubernetes gives every pod a score based on its limits and requests, and is ready to evict from the cluster any pods that don't comply with the fair-use rules.
Setting requests declares how many resources your containers need during normal operation.
Setting limits declares how much memory or CPU they can occasionally use.
Kubernetes QoS depending on requests and limits
There are two different quotas for containers in Kubernetes:
```yaml
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 768Mi
```
- Requests: This value is used for scheduling. It's the minimum amount of resources a container needs to run. Be careful: a request does not mean the resources are dedicated to that container.
- Limits: This is the maximum amount of this resource that the node will allow the containers to use.
Both of them can be used with CPU and memory with different implications, as we covered in this post.
There are three kinds of pods depending on their quota settings:
- Guaranteed: Pods that set requests and limits in all of their containers, and the two values must be equal.
- Burstable: Non-guaranteed pods where at least one container has a CPU or memory request.
- Best effort: Pods without requests or limits of any kind.
As we will see in this post, the QoS type of pod is important when allocating and reclaiming resources.
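As a sketch of how the classification works (the pod name and values here are made up for illustration), a pod like this one would be classed as Guaranteed, because its only container sets requests equal to limits:

```yaml
# Hypothetical pod classified as Guaranteed: every container
# sets requests equal to limits for both CPU and memory.
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-demo   # example name
spec:
  containers:
  - name: app
    image: busybox:1.28
    resources:
      requests:
        cpu: 250m
        memory: 256Mi
      limits:
        cpu: 250m         # equal to the request
        memory: 256Mi     # equal to the request
```

Raising the limits above the requests (or dropping them) would make the same pod Burstable, and removing the resources section entirely would make it Best effort. You can check the class Kubernetes assigned with `kubectl get pod guaranteed-demo -o jsonpath='{.status.qosClass}'`.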
Scheduling problems due to poorly set request values
Cluster pod allocation is based on requests (CPU and memory). If a pod requests more CPU or memory than is available in a node, the pod can't run on that node. If no node in the cluster has enough free resources to run the pod, it will remain unscheduled until resources are freed.
Requesting too many resources will make your pods difficult to schedule. In addition, if your requests are higher than your regular resource usage, the requested-but-unused resources cannot be allocated to other pods, reducing the operational capacity of the cluster. Your cluster admin won't be happy, and they will probably let you know.
If the scheduler can't find a node with enough resources to run a pod, the pod will remain in the "Pending" phase until there are enough resources:
```
NAME       READY   STATUS    RESTARTS   AGE
frontend   0/2     Pending   0          10s
```
A `kubectl describe pod` command will give information about the issue:
```
Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  44s (x2 over 44s)  default-scheduler  0/4 nodes are available: 4 Insufficient memory.
```
Be aware that the resources needed to deploy a pod are not always what they seem. The effective pod request used for allocation is the highest of these two values:

- The sum of the requests of all the containers in the pod.
- The largest request of any single init container (init containers run one at a time).
For example, if you have a pod with this definition:
```yaml
apiVersion: v1
kind: Pod
...
  containers:
  - name: myapp-container
    image: busybox:1.28
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 500m
        memory: 768Mi
  - name: myapp2-container
    image: busybox:1.28
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 500m
        memory: 768Mi
...
  initContainers:
  - name: init-myservice
    image: busybox:1.28
    command: ['sh', '-c', 'sleep 3']
    resources:
      requests:
        cpu: 300m
        memory: 750Mi
```
The containers that will run the application sum to 200 millicores of CPU and 256MiB of memory in requests. However, the effective pod request used to schedule the pod, and the amount of resources marked as occupied, will be 300m of CPU and 750MiB of memory, as requested by the init container.
A clear perspective of the allocatable resources in your cluster enables cluster admins to better plan for present and expected workloads. Good insight into which pods are really using their requested resources is an invaluable tool to maximize cluster occupation and the density of applications per node.
Find these metrics in Sysdig Monitor in the dashboard: Kubernetes → Resource usage → Kubernetes cluster and node capacity
Pod evicted problems
When a node in a Kubernetes cluster is running out of memory or disk, it activates a flag signaling that it is under pressure. This blocks any new allocation in the node and starts the eviction process.
Tip: You can find this information in Sysdig monitor dashboards.
At that moment, kubelet starts to reclaim resources, killing containers and declaring pods as failed until resource usage is back under the eviction threshold.
First, kubelet tries to free node resources, especially disk, by deleting dead pods and their containers, and then unused images. If this isn't enough, kubelet starts to evict end-user pods in the following order:
- Best effort pods.
- Burstable pods using more of the starved resource than they requested.
- Burstable pods using less of the starved resource than they requested.
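The thresholds that trigger this process are configurable in each node's kubelet. As a minimal sketch (the values below are illustrative, not recommendations), hard eviction thresholds can be set in a KubeletConfiguration:

```yaml
# Fragment of a KubeletConfiguration; the threshold values are
# illustrative only and should be tuned per node.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "200Mi"   # evict when free memory drops below 200Mi
  nodefs.available: "10%"     # evict when free node-filesystem space drops below 10%
```

When any of these signals crosses its threshold, the node reports pressure and kubelet begins evicting pods in the order described above.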
You can see messages like these if one of your pods is evicted due to memory use:
```
NAME       READY   STATUS    RESTARTS   AGE
frontend   0/2     Evicted   0          10s
```
```
Events:
  Type     Reason               Age    From                                                  Message
  ----     ------               ----   ----                                                  -------
  Normal   Scheduled            12m    default-scheduler                                     Successfully assigned test/frontend to gke-lab-kube-gke-default-pool-02126501-qcbb
  Normal   Pulling              12m    kubelet, gke-lab-kube-gke-default-pool-02126501-qcbb  pulling image "nginx"
  Normal   Pulled               12m    kubelet, gke-lab-kube-gke-default-pool-02126501-qcbb  Successfully pulled image "nginx"
  Normal   Created              12m    kubelet, gke-lab-kube-gke-default-pool-02126501-qcbb  Created container
  Normal   Started              12m    kubelet, gke-lab-kube-gke-default-pool-02126501-qcbb  Started container
  Warning  Evicted              4m8s   kubelet, gke-lab-kube-gke-default-pool-02126501-qcbb  The node was low on resource: memory. Container db was using 1557408Ki, which exceeds its request of 200Mi.
  Warning  ExceededGracePeriod  3m58s  kubelet, gke-lab-kube-gke-default-pool-02126501-qcbb  Container runtime did not kill the pod within specified grace period.
  Normal   Killing              3m27s  kubelet, gke-lab-kube-gke-default-pool-02126501-qcbb  Killing container with id docker://db:Need to kill Pod
```
Guaranteed pods are supposed to be safe from eviction. If you are not setting requests and limits in your pods, this is a very good reason to start: setting those values properly can protect you from unexpected outages.
There is a special exception. If system services require more resources than the amount reserved for them, and only guaranteed pods remain, kubelet will evict those pods in order of resource usage until the pressure is gone.
A good configuration of your requests and limits is important for the coexistence of different applications in the cluster. Understanding these mechanisms will allow you to have happy on-call rotations.
Some lessons you should learn from this are:
- Protect your critical pods by setting values so they are classified as Guaranteed.
- Burstable pods are fine for most tasks.
- Be careful and set reasonable limits and requests. This will help adjust cluster capacity and reduce pod eviction issues.
- Try to avoid Best effort pods; reserve them for workloads with no time constraints or that are fault tolerant.
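One way to enforce these lessons cluster-side is a LimitRange, which applies default requests and limits to containers that don't declare their own, so nothing in the namespace ends up as Best effort. A minimal sketch (the name, namespace, and values are hypothetical):

```yaml
# Hypothetical LimitRange: containers created in this namespace
# without explicit values inherit these defaults.
apiVersion: v1
kind: LimitRange
metadata:
  name: default-resources   # example name
  namespace: my-team        # example namespace
spec:
  limits:
  - type: Container
    defaultRequest:         # applied when a container sets no request
      cpu: 100m
      memory: 128Mi
    default:                # applied when a container sets no limit
      cpu: 500m
      memory: 512Mi
```

With this in place, a pod created without any resources section is still scheduled as Burstable rather than Best effort, which moves it down the eviction order.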
A good monitoring system like Sysdig Monitor will help you avoid pod evictions and pending pods. Request a demo today!