20 Jan 2023

feedKubernetes – Production-Grade Container Orchestration

Blog: Consider All Microservices Vulnerable — And Monitor Their Behavior

Author: David Hadas (IBM Research Labs)

This post warns Devops from a false sense of security. Following security best practices when developing and configuring microservices do not result in non-vulnerable microservices. The post shows that although all deployed microservices are vulnerable, there is much that can be done to ensure microservices are not exploited. It explains how analyzing the behavior of clients and services from a security standpoint, named here "Security-Behavior Analysis", can protect the deployed vulnerable microservices. It points to Guard, an open source project offering security-behavior monitoring and control of Kubernetes microservices presumed vulnerable.

As cyber attacks continue to intensify in sophistication, organizations deploying cloud services continue to grow their cyber investments aiming to produce safe and non-vulnerable services. However, the year-by-year growth in cyber investments does not result in a parallel reduction in cyber incidents. Instead, the number of cyber incidents continues to grow annually. Evidently, organizations are doomed to fail in this struggle - no matter how much effort is made to detect and remove cyber weaknesses from deployed services, it seems offenders always have the upper hand.

Considering the current spread of offensive tools, sophistication of offensive players, and ever-growing cyber financial gains to offenders, any cyber strategy that relies on constructing a non-vulnerable, weakness-free service in 2023 is clearly too naïve. It seems the only viable strategy is to:

Admit that your services are vulnerable!

In other words, consciously accept that you will never create completely invulnerable services. If your opponents find even a single weakness as an entry-point, you lose! Admitting that in spite of your best efforts, all your services are still vulnerable is an important first step. Next, this post discusses what you can do about it...

How to protect microservices from being exploited

Being vulnerable does not necessarily mean that your service will be exploited. Though your services are vulnerable in some ways unknown to you, offenders still need to identify these vulnerabilities and then exploit them. If offenders fail to exploit your service vulnerabilities, you win! In other words, having a vulnerability that can't be exploited, represents a risk that can't be realized.

Image of an example of offender gaining foothold in a service

Figure 1. An Offender gaining foothold in a vulnerable service

The above diagram shows an example in which the offender does not yet have a foothold in the service; that is, it is assumed that your service does not run code controlled by the offender on day 1. In our example the service has vulnerabilities in the API exposed to clients. To gain an initial foothold the offender uses a malicious client to try and exploit one of the service API vulnerabilities. The malicious client sends an exploit that triggers some unplanned behavior of the service.

More specifically, let's assume the service is vulnerable to an SQL injection. The developer failed to sanitize the user input properly, thereby allowing clients to send values that would change the intended behavior. In our example, if a client sends a query string with key "username" and value of "tom or 1=1", the client will receive the data of all users. Exploiting this vulnerability requires the client to send an irregular string as the value. Note that benign users will not be sending a string with spaces or with the equal sign character as a username, instead they will normally send legal usernames which for example may be defined as a short sequence of characters a-z. No legal username can trigger service unplanned behavior.

In this simple example, one can already identify several opportunities to detect and block an attempt to exploit the vulnerability (un)intentionally left behind by the developer, making the vulnerability unexploitable. First, the malicious client behavior differs from the behavior of benign clients, as it sends irregular requests. If such a change in behavior is detected and blocked, the exploit will never reach the service. Second, the service behavior in response to the exploit differs from the service behavior in response to a regular request. Such behavior may include making subsequent irregular calls to other services such as a data store, taking irregular time to respond, and/or responding to the malicious client with an irregular response (for example, containing much more data than normally sent in case of benign clients making regular requests). Service behavioral changes, if detected, will also allow blocking the exploit in different stages of the exploitation attempt.

More generally:

Combining both approaches may add a protection layer to the deployed vulnerable services, drastically decreasing the probability for anyone to successfully exploit any of the deployed vulnerable services. Next, let us identify four use cases where you need to use security-behavior monitoring.

Use cases

One can identify the following four different stages in the life of any service from a security standpoint. In each stage, security-behavior monitoring is required to meet different challenges:

Service State Use case What do you need in order to cope with this use case?
Normal No known vulnerabilities: The service owner is normally not aware of any known vulnerabilities in the service image or configuration. Yet, it is reasonable to assume that the service has weaknesses. Provide generic protection against any unknown, zero-day, service vulnerabilities - Detect/block irregular patterns sent as part of incoming client requests that may be used as exploits.
Vulnerable An applicable CVE is published: The service owner is required to release a new non-vulnerable revision of the service. Research shows that in practice this process of removing a known vulnerability may take many weeks to accomplish (2 months on average). Add protection based on the CVE analysis - Detect/block incoming requests that include specific patterns that may be used to exploit the discovered vulnerability. Continue to offer services, although the service has a known vulnerability.
Exploitable A known exploit is published: The service owner needs a way to filter incoming requests that contain the known exploit. Add protection based on a known exploit signature - Detect/block incoming client requests that carry signatures identifying the exploit. Continue to offer services, although the presence of an exploit.
Misused An offender misuses pods backing the service: The offender can follow an attack pattern enabling him/her to misuse pods. The service owner needs to restart any compromised pods while using non compromised pods to continue offering the service. Note that once a pod is restarted, the offender needs to repeat the attack pattern before he/she may again misuse it. Identify and restart instances of the component that is being misused - At any given time, some backing pods may be compromised and misused, while others behave as designed. Detect/remove the misused pods while allowing other pods to continue servicing client requests.

Fortunately, microservice architecture is well suited to security-behavior monitoring as discussed next.

Security-Behavior of microservices versus monoliths

Kubernetes is often used to support workloads designed with microservice architecture. By design, microservices aim to follow the UNIX philosophy of "Do One Thing And Do It Well". Each microservice has a bounded context and a clear interface. In other words, you can expect the microservice clients to send relatively regular requests and the microservice to present a relatively regular behavior as a response to these requests. Consequently, a microservice architecture is an excellent candidate for security-behavior monitoring.

Image showing why microservices are well suited for security-behavior monitoring

Figure 2. Microservices are well suited for security-behavior monitoring

The diagram above clarifies how dividing a monolithic service to a set of microservices improves our ability to perform security-behavior monitoring and control. In a monolithic service approach, different client requests are intertwined, resulting in a diminished ability to identify irregular client behaviors. Without prior knowledge, an observer of the intertwined client requests will find it hard to distinguish between types of requests and their related characteristics. Further, internal client requests are not exposed to the observer. Lastly, the aggregated behavior of the monolithic service is a compound of the many different internal behaviors of its components, making it hard to identify irregular service behavior.

In a microservice environment, each microservice is expected by design to offer a more well-defined service and serve better defined type of requests. This makes it easier for an observer to identify irregular client behavior and irregular service behavior. Further, a microservice design exposes the internal requests and internal services which offer more security-behavior data to identify irregularities by an observer. Overall, this makes the microservice design pattern better suited for security-behavior monitoring and control.

Security-Behavior monitoring on Kubernetes

Kubernetes deployments seeking to add Security-Behavior may use Guard, developed under the CNCF project Knative. Guard is integrated into the full Knative automation suite that runs on top of Kubernetes. Alternatively, you can deploy Guard as a standalone tool to protect any HTTP-based workload on Kubernetes.


The goal of this post is to invite the Kubernetes community to action and introduce Security-Behavior monitoring and control to help secure Kubernetes based deployments. Hopefully, the community as a follow up will:

  1. Analyze the cyber challenges presented for different Kubernetes use cases
  2. Add appropriate security documentation for users on how to introduce Security-Behavior monitoring and control.
  3. Consider how to integrate with tools that can help users monitor and control their vulnerable services.

Getting involved

You are welcome to get involved and join the effort to develop security behavior monitoring and control for Kubernetes; to share feedback and contribute to code or documentation; and to make or suggest improvements of any kind.

20 Jan 2023 12:00am GMT

12 Jan 2023

feedKubernetes – Production-Grade Container Orchestration

Blog: Protect Your Mission-Critical Pods From Eviction With PriorityClass

Author: Sunny Bhambhani (InfraCloud Technologies)

Kubernetes has been widely adopted, and many organizations use it as their de-facto orchestration engine for running workloads that need to be created and deleted frequently.

Therefore, proper scheduling of the pods is key to ensuring that application pods are up and running within the Kubernetes cluster without any issues. This article delves into the use cases around resource management by leveraging the PriorityClass object to protect mission-critical or high-priority pods from getting evicted and making sure that the application pods are up, running, and serving traffic.

Resource management in Kubernetes

The control plane consists of multiple components, out of which the scheduler (usually the built-in kube-scheduler) is one of the components which is responsible for assigning a node to a pod.

Whenever a pod is created, it enters a "pending" state, after which the scheduler determines which node is best suited for the placement of the new pod.

In the background, the scheduler runs as an infinite loop looking for pods without a nodeName set that are ready for scheduling. For each Pod that needs scheduling, the scheduler tries to decide which node should run that Pod.

If the scheduler cannot find any node, the pod remains in the pending state, which is not ideal.

The below diagram, from point number 1 through 4, explains the request flow:

A diagram showing the scheduling of three Pods that a client has directly created.

Scheduling in Kubernetes

Typical use cases

Below are some real-life scenarios where control over the scheduling and eviction of pods may be required.

  1. Let's say the pod you plan to deploy is critical, and you have some resource constraints. An example would be the DaemonSet of an infrastructure component like Grafana Loki. The Loki pods must run before other pods can on every node. In such cases, you could ensure resource availability by manually identifying and deleting the pods that are not required or by adding a new node to the cluster. Both these approaches are unsuitable since the former would be tedious to execute, and the latter could involve an expenditure of time and money.

  2. Another use case could be a single cluster that holds the pods for the below environments with associated priorities:

    • Production (prod): top priority
    • Preproduction (preprod): intermediate priority
    • Development (dev): least priority

In the event of high resource consumption in the cluster, there is competition for CPU and memory resources on the nodes. While cluster-level autoscaling may add more nodes, it takes time. In the interim, if there are no further nodes to scale the cluster, some Pods could remain in a Pending state, or the service could be degraded as they compete for resources. If the kubelet does evict a Pod from the node, that eviction would be random because the kubelet doesn't have any special information about which Pods to evict and which to keep.

  1. A third example could be a microservice backed by a queuing application or a database running into a resource crunch and the queue or database getting evicted. In such a case, all the other services would be rendered useless until the database can serve traffic again.

There can also be other scenarios where you want to control the order of scheduling or order of eviction of pods.

PriorityClasses in Kubernetes

PriorityClass is a cluster-wide API object in Kubernetes and part of the scheduling.k8s.io/v1 API group. It contains a mapping of the PriorityClass name (defined in .metadata.name) and an integer value (defined in .value). This represents the value that the scheduler uses to determine Pod's relative priority.

Additionally, when you create a cluster using kubeadm or a managed Kubernetes service (for example, Azure Kubernetes Service), Kubernetes uses PriorityClasses to safeguard the pods that are hosted on the control plane nodes. This ensures that critical cluster components such as CoreDNS and kube-proxy can run even if resources are constrained.

This availability of pods is achieved through the use of a special PriorityClass that ensures the pods are up and running and that the overall cluster is not affected.

$ kubectl get priorityclass
system-cluster-critical 2000000000 false 82m
system-node-critical 2000001000 false 82m

The diagram below shows exactly how it works with the help of an example, which will be detailed in the upcoming section.

A flow chart that illustrates how the kube-scheduler prioritizes new Pods and potentially preempts existing Pods

Pod scheduling and preemption

Pod priority and preemption

Pod preemption is a Kubernetes feature that allows the cluster to preempt pods (removing an existing Pod in favor of a new Pod) on the basis of priority. Pod priority indicates the importance of a pod relative to other pods while scheduling. If there aren't enough resources to run all the current pods, the scheduler tries to evict lower-priority pods over high-priority ones.

Also, when a healthy cluster experiences a node failure, typically, lower-priority pods get preempted to create room for higher-priority pods on the available node. This happens even if the cluster can bring up a new node automatically since pod creation is usually much faster than bringing up a new node.

PriorityClass requirements

Before you set up PriorityClasses, there are a few things to consider.

  1. Decide which PriorityClasses are needed. For instance, based on environment, type of pods, type of applications, etc.
  2. The default PriorityClass resource for your cluster. The pods without a priorityClassName will be treated as priority 0.
  3. Use a consistent naming convention for all PriorityClasses.
  4. Make sure that the pods for your workloads are running with the right PriorityClass.

PriorityClass hands-on example

Let's say there are 3 application pods: one for prod, one for preprod, and one for development. Below are three sample YAML manifest files for each of those.

# development
apiVersion: v1
kind: Pod
 name: dev-nginx
 env: dev
 - name: dev-nginx
 image: nginx
 memory: "256Mi"
 cpu: "0.2"
 memory: ".5Gi"
 cpu: "0.5"
# preproduction
apiVersion: v1
kind: Pod
 name: preprod-nginx
 env: preprod
 - name: preprod-nginx
 image: nginx
 memory: "1.5Gi"
 cpu: "1.5"
 memory: "2Gi"
 cpu: "2"
# production
apiVersion: v1
kind: Pod
 name: prod-nginx
 env: prod
 - name: prod-nginx
 image: nginx
 memory: "2Gi"
 cpu: "2"
 memory: "2Gi"
 cpu: "2"

You can create these pods with the kubectl create -f <FILE.yaml> command, and then check their status using the kubectl get pods command. You can see if they are up and look ready to serve traffic:

$ kubectl get pods --show-labels
dev-nginx 1/1 Running 0 55s env=dev
preprod-nginx 1/1 Running 0 55s env=preprod
prod-nginx 0/1 Pending 0 55s env=prod

Bad news. The pod for the Production environment is still Pending and isn't serving any traffic.

Let's see why this is happening:

$ kubectl get events
5s Warning FailedScheduling pod/prod-nginx 0/2 nodes are available: 1 Insufficient cpu, 2 Insufficient memory.

In this example, there is only one worker node, and that node has a resource crunch.

Now, let's look at how PriorityClass can help in this situation since prod should be given higher priority than the other environments.

PriorityClass API

Before creating PriorityClasses based on these requirements, let's see what a basic manifest for a PriorityClass looks like and outline some prerequisites:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
value: 0 # any integer value between -1000000000 to 1000000000 
description: >-
  (Optional) description goes here!
globalDefault: false # or true. Only one PriorityClass can be the global default.

Below are some prerequisites for PriorityClasses:

PriorityClass in action

Here's an example. Next, create some environment-specific PriorityClasses:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
 name: dev-pc
value: 1000000
globalDefault: false
description: >-
  (Optional) This priority class should only be used for all development pods.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
 name: preprod-pc
value: 2000000
globalDefault: false
description: >-
  (Optional) This priority class should only be used for all preprod pods.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
 name: prod-pc
value: 4000000
globalDefault: false
description: >-
  (Optional) This priority class should only be used for all prod pods.

Use kubectl create -f <FILE.YAML> command to create a pc and kubectl get pc to check its status.

$ kubectl get pc
dev-pc 1000000 false 3m13s
preprod-pc 2000000 false 2m3s
prod-pc 4000000 false 7s
system-cluster-critical 2000000000 false 82m
system-node-critical 2000001000 false 82m

The new PriorityClasses are in place now. A small change is needed in the pod manifest or pod template (in a ReplicaSet or Deployment). In other words, you need to specify the priority class name at .spec.priorityClassName (which is a string value).

First update the previous production pod manifest file to have a PriorityClass assigned, then delete the Production pod and recreate it. You can't edit the priority class for a Pod that already exists.

In my cluster, when I tried this, here's what happened. First, that change seems successful; the status of pods has been updated:

$ kubectl get pods --show-labels
dev-nginx 1/1 Terminating 0 55s env=dev
preprod-nginx 1/1 Running 0 55s env=preprod
prod-nginx 0/1 Pending 0 55s env=prod

The dev-nginx pod is getting terminated. Once that is successfully terminated and there are enough resources for the prod pod, the control plane can schedule the prod pod:

Warning FailedScheduling pod/prod-nginx 0/2 nodes are available: 1 Insufficient cpu, 2 Insufficient memory.
Normal Preempted pod/dev-nginx by default/prod-nginx on node node01
Normal Killing pod/dev-nginx Stopping container dev-nginx
Normal Scheduled pod/prod-nginx Successfully assigned default/prod-nginx to node01
Normal Pulling pod/prod-nginx Pulling image "nginx"
Normal Pulled pod/prod-nginx Successfully pulled image "nginx"
Normal Created pod/prod-nginx Created container prod-nginx
Normal Started pod/prod-nginx Started container prod-nginx


When you set up PriorityClasses, they exist just how you defined them. However, people (and tools) that make changes to your cluster are free to set any PriorityClass, or to not set any PriorityClass at all. However, you can use other Kubernetes features to make sure that the priorities you wanted are actually applied.

As an alpha feature, you can define a ValidatingAdmissionPolicy and a ValidatingAdmissionPolicyBinding so that, for example, Pods that go into the prod namespace must use the prod-pc PriorityClass. With another ValidatingAdmissionPolicyBinding you ensure that the preprod namespace uses the preprod-pc PriorityClass, and so on. In any cluster, you can enforce similar controls using external projects such as Kyverno or Gatekeeper, through validating admission webhooks.

However you do it, Kubernetes gives you options to make sure that the PriorityClasses are used how you wanted them to be, or perhaps just to warn users when they pick an unsuitable option.


The above example and its events show you what this feature of Kubernetes brings to the table, along with several scenarios where you can use this feature. To reiterate, this helps ensure that mission-critical pods are up and available to serve the traffic and, in the case of a resource crunch, determines cluster behavior.

It gives you some power to decide the order of scheduling and order of preemption for Pods. Therefore, you need to define the PriorityClasses sensibly. For example, if you have a cluster autoscaler to add nodes on demand, make sure to run it with the system-cluster-critical PriorityClass. You don't want to get in a situation where the autoscaler has been preempted and there are no new nodes coming online.

If you have any queries or feedback, feel free to reach out to me on LinkedIn.

12 Jan 2023 12:00am GMT

06 Jan 2023

feedKubernetes – Production-Grade Container Orchestration

Blog: Kubernetes 1.26: Eviction policy for unhealthy pods guarded by PodDisruptionBudgets

Authors: Filip Křepinský (Red Hat), Morten Torkildsen (Google), Ravi Gudimetla (Apple)

Ensuring the disruptions to your applications do not affect its availability isn't a simple task. Last month's release of Kubernetes v1.26 lets you specify an unhealthy pod eviction policy for PodDisruptionBudgets (PDBs) to help you maintain that availability during node management operations. In this article, we will dive deeper into what modifications were introduced for PDBs to give application owners greater flexibility in managing disruptions.

What problems does this solve?

API-initiated eviction of pods respects PodDisruptionBudgets (PDBs). This means that a requested voluntary disruption via an eviction to a Pod, should not disrupt a guarded application and .status.currentHealthy of a PDB should not fall below .status.desiredHealthy. Running pods that are Unhealthy do not count towards the PDB status, but eviction of these is only possible in case the application is not disrupted. This helps disrupted or not yet started application to achieve availability as soon as possible without additional downtime that would be caused by evictions.

Unfortunately, this poses a problem for cluster administrators that would like to drain nodes without any manual interventions. Misbehaving applications with pods in CrashLoopBackOff state (due to a bug or misconfiguration) or pods that are simply failing to become ready make this task much harder. Any eviction request will fail due to violation of a PDB, when all pods of an application are unhealthy. Draining of a node cannot make any progress in that case.

On the other hand there are users that depend on the existing behavior, in order to:

Kubernetes 1.26 introduced a new experimental field to the PodDisruptionBudget API: .spec.unhealthyPodEvictionPolicy. When enabled, this field lets you support both of those requirements.

How does it work?

API-initiated eviction is the process that triggers graceful pod termination. The process can be initiated either by calling the API directly, by using a kubectl drain command, or other actors in the cluster. During this process every pod removal is consulted with appropriate PDBs, to ensure that a sufficient number of pods is always running in the cluster.

The following policies allow PDB authors to have a greater control how the process deals with unhealthy pods.

There are two policies IfHealthyBudget and AlwaysAllow to choose from.

The former, IfHealthyBudget, follows the existing behavior to achieve the best availability that you get by default. Unhealthy pods can be disrupted only if their application has a minimum available .status.desiredHealthy number of pods.

By setting the spec.unhealthyPodEvictionPolicy field of your PDB to AlwaysAllow, you are choosing the best effort availability for your application. With this policy it is always possible to evict unhealthy pods. This will make it easier to maintain and upgrade your clusters.

We think that AlwaysAllow will often be a better choice, but for some critical workloads you may still prefer to protect even unhealthy Pods from node drains or other forms of API-initiated eviction.

How do I use it?

This is an alpha feature, which means you have to enable the PDBUnhealthyPodEvictionPolicy feature gate, with the command line argument --feature-gates=PDBUnhealthyPodEvictionPolicy=true to the kube-apiserver.

Here's an example. Assume that you've enabled the feature gate in your cluster, and that you already defined a Deployment that runs a plain webserver. You labelled the Pods for that Deployment with app: nginx. You want to limit avoidable disruption, and you know that best effort availability is sufficient for this app. You decide to allow evictions even if those webserver pods are unhealthy. You create a PDB to guard this application, with the AlwaysAllow policy for evicting unhealthy pods:

apiVersion: policy/v1
kind: PodDisruptionBudget
 name: nginx-pdb
 app: nginx
 maxUnavailable: 1
 unhealthyPodEvictionPolicy: AlwaysAllow

How can I learn more?

How do I get involved?

If you have any feedback, please reach out to us in the #sig-apps channel on Slack (visit https://slack.k8s.io/ for an invitation if you need one), or on the SIG Apps mailing list: kubernetes-sig-apps@googlegroups.com

06 Jan 2023 12:00am GMT