20 Jan 2023
Kubernetes – Production-Grade Container Orchestration
Blog: Consider All Microservices Vulnerable — And Monitor Their Behavior
Author: David Hadas (IBM Research Labs)
This post warns DevOps against a false sense of security. Following security best practices when developing and configuring microservices does not result in non-vulnerable microservices. The post shows that although all deployed microservices are vulnerable, there is much that can be done to ensure microservices are not exploited. It explains how analyzing the behavior of clients and services from a security standpoint, named here "Security-Behavior Analysis", can protect the deployed vulnerable microservices. It points to Guard, an open source project offering security-behavior monitoring and control of Kubernetes microservices presumed vulnerable.
As cyber attacks continue to intensify in sophistication, organizations deploying cloud services continue to grow their cyber investments aiming to produce safe and non-vulnerable services. However, the year-by-year growth in cyber investments does not result in a parallel reduction in cyber incidents. Instead, the number of cyber incidents continues to grow annually. Evidently, organizations are doomed to fail in this struggle - no matter how much effort is made to detect and remove cyber weaknesses from deployed services, it seems offenders always have the upper hand.
Considering the current spread of offensive tools, sophistication of offensive players, and ever-growing cyber financial gains to offenders, any cyber strategy that relies on constructing a non-vulnerable, weakness-free service in 2023 is clearly too naïve. It seems the only viable strategy is to:
➥ Admit that your services are vulnerable!
In other words, consciously accept that you will never create completely invulnerable services. If your opponents find even a single weakness as an entry-point, you lose! Admitting that in spite of your best efforts, all your services are still vulnerable is an important first step. Next, this post discusses what you can do about it...
How to protect microservices from being exploited
Being vulnerable does not necessarily mean that your service will be exploited. Though your services are vulnerable in some ways unknown to you, offenders still need to identify these vulnerabilities and then exploit them. If offenders fail to exploit your service vulnerabilities, you win! In other words, having a vulnerability that can't be exploited represents a risk that can't be realized.

Figure 1. An Offender gaining foothold in a vulnerable service
The above diagram shows an example in which the offender does not yet have a foothold in the service; that is, it is assumed that your service does not run code controlled by the offender on day 1. In our example the service has vulnerabilities in the API exposed to clients. To gain an initial foothold the offender uses a malicious client to try and exploit one of the service API vulnerabilities. The malicious client sends an exploit that triggers some unplanned behavior of the service.
More specifically, let's assume the service is vulnerable to an SQL injection. The developer failed to sanitize the user input properly, thereby allowing clients to send values that would change the intended behavior. In our example, if a client sends a query string with key "username" and value of "tom or 1=1", the client will receive the data of all users. Exploiting this vulnerability requires the client to send an irregular string as the value. Note that benign users will not be sending a string with spaces or with the equal sign character as a username; instead, they will normally send legal usernames, which for example may be defined as a short sequence of the characters a-z. No legal username can trigger unplanned service behavior.
In this simple example, one can already identify several opportunities to detect and block an attempt to exploit the vulnerability (un)intentionally left behind by the developer, making the vulnerability unexploitable. First, the malicious client behavior differs from the behavior of benign clients, as it sends irregular requests. If such a change in behavior is detected and blocked, the exploit will never reach the service. Second, the service behavior in response to the exploit differs from the service behavior in response to a regular request. Such behavior may include making subsequent irregular calls to other services such as a data store, taking irregular time to respond, and/or responding to the malicious client with an irregular response (for example, containing much more data than normally sent in case of benign clients making regular requests). Service behavioral changes, if detected, will also allow blocking the exploit in different stages of the exploitation attempt.
More generally:
- Monitoring the behavior of clients can help detect and block exploits against service API vulnerabilities. In fact, deploying efficient client behavior monitoring makes many vulnerabilities unexploitable and others very hard to exploit. To succeed, the offender needs to craft an exploit that is indistinguishable from regular requests.
- Monitoring the behavior of services can help detect services as they are being exploited, regardless of the attack vector used. Efficient service behavior monitoring limits what an attacker may be able to achieve, since the offender needs to ensure the service behavior remains indistinguishable from regular service behavior.
Combining both approaches may add a protection layer to the deployed vulnerable services, drastically decreasing the probability that anyone will successfully exploit any of the deployed vulnerable services. Next, let us identify four use cases where you need to use security-behavior monitoring.
Use cases
One can identify the following four different stages in the life of any service from a security standpoint. In each stage, security-behavior monitoring is required to meet different challenges:
Service State | Use case | What do you need in order to cope with this use case? |
---|---|---|
Normal | No known vulnerabilities: The service owner is normally not aware of any known vulnerabilities in the service image or configuration. Yet, it is reasonable to assume that the service has weaknesses. | Provide generic protection against any unknown, zero-day, service vulnerabilities - Detect/block irregular patterns sent as part of incoming client requests that may be used as exploits. |
Vulnerable | An applicable CVE is published: The service owner is required to release a new non-vulnerable revision of the service. Research shows that in practice this process of removing a known vulnerability may take many weeks to accomplish (2 months on average). | Add protection based on the CVE analysis - Detect/block incoming requests that include specific patterns that may be used to exploit the discovered vulnerability. Continue to offer services, although the service has a known vulnerability. |
Exploitable | A known exploit is published: The service owner needs a way to filter incoming requests that contain the known exploit. | Add protection based on a known exploit signature - Detect/block incoming client requests that carry signatures identifying the exploit. Continue to offer services, despite the presence of an exploit. |
Misused | An offender misuses pods backing the service: The offender can follow an attack pattern enabling him/her to misuse pods. The service owner needs to restart any compromised pods while using non-compromised pods to continue offering the service. Note that once a pod is restarted, the offender needs to repeat the attack pattern before he/she may again misuse it. | Identify and restart instances of the component that is being misused - At any given time, some backing pods may be compromised and misused, while others behave as designed. Detect/remove the misused pods while allowing other pods to continue servicing client requests. |
Fortunately, microservice architecture is well suited to security-behavior monitoring as discussed next.
Security-Behavior of microservices versus monoliths
Kubernetes is often used to support workloads designed with microservice architecture. By design, microservices aim to follow the UNIX philosophy of "Do One Thing And Do It Well". Each microservice has a bounded context and a clear interface. In other words, you can expect the microservice clients to send relatively regular requests and the microservice to present a relatively regular behavior as a response to these requests. Consequently, a microservice architecture is an excellent candidate for security-behavior monitoring.

Figure 2. Microservices are well suited for security-behavior monitoring
The diagram above clarifies how dividing a monolithic service into a set of microservices improves our ability to perform security-behavior monitoring and control. In a monolithic service approach, different client requests are intertwined, resulting in a diminished ability to identify irregular client behaviors. Without prior knowledge, an observer of the intertwined client requests will find it hard to distinguish between types of requests and their related characteristics. Further, internal client requests are not exposed to the observer. Lastly, the aggregated behavior of the monolithic service is a compound of the many different internal behaviors of its components, making it hard to identify irregular service behavior.
In a microservice environment, each microservice is expected by design to offer a more well-defined service and serve better-defined types of requests. This makes it easier for an observer to identify irregular client behavior and irregular service behavior. Further, a microservice design exposes the internal requests and internal services, which offers an observer more security-behavior data for identifying irregularities. Overall, this makes the microservice design pattern better suited for security-behavior monitoring and control.
Security-Behavior monitoring on Kubernetes
Kubernetes deployments seeking to add Security-Behavior may use Guard, developed under the CNCF project Knative. Guard is integrated into the full Knative automation suite that runs on top of Kubernetes. Alternatively, you can deploy Guard as a standalone tool to protect any HTTP-based workload on Kubernetes.
See:
- Guard on Github, for using Guard as a standalone tool.
- The Knative automation suite - Read about Knative in the blog post Opinionated Kubernetes, which describes how Knative simplifies and unifies the way web services are deployed on Kubernetes.
- You may contact Guard maintainers on the SIG Security Slack channel or on the Knative community security Slack channel. The Knative community channel will move soon to the CNCF Slack under the name #knative-security.
The goal of this post is to invite the Kubernetes community to action and to introduce Security-Behavior monitoring and control to help secure Kubernetes-based deployments. Hopefully, as a follow up, the community will:
- Analyze the cyber challenges presented for different Kubernetes use cases
- Add appropriate security documentation for users on how to introduce Security-Behavior monitoring and control.
- Consider how to integrate with tools that can help users monitor and control their vulnerable services.
Getting involved
You are welcome to get involved and join the effort to develop security behavior monitoring and control for Kubernetes; to share feedback and contribute to code or documentation; and to make or suggest improvements of any kind.
20 Jan 2023 12:00am GMT
12 Jan 2023
Kubernetes – Production-Grade Container Orchestration
Blog: Protect Your Mission-Critical Pods From Eviction With PriorityClass
Author: Sunny Bhambhani (InfraCloud Technologies)
Kubernetes has been widely adopted, and many organizations use it as their de-facto orchestration engine for running workloads that need to be created and deleted frequently.
Therefore, proper scheduling of the pods is key to ensuring that application pods are up and running within the Kubernetes cluster without any issues. This article delves into the use cases around resource management by leveraging the PriorityClass object to protect mission-critical or high-priority pods from getting evicted and making sure that the application pods are up, running, and serving traffic.
Resource management in Kubernetes
The control plane consists of multiple components. One of them is the scheduler (usually the built-in kube-scheduler), which is responsible for assigning a node to a pod.
Whenever a pod is created, it enters a "pending" state, after which the scheduler determines which node is best suited for the placement of the new pod.
In the background, the scheduler runs as an infinite loop looking for pods without a nodeName set that are ready for scheduling. For each Pod that needs scheduling, the scheduler tries to decide which node should run that Pod. If the scheduler cannot find any node, the pod remains in the pending state, which is not ideal.
nodeSelector, taints and tolerations, nodeAffinity, the rank of nodes based on available resources (for example, CPU and memory), and several other criteria are used to determine the pod's placement.
The below diagram, from point number 1 through 4, explains the request flow:
Scheduling in Kubernetes
Typical use cases
Below are some real-life scenarios where control over the scheduling and eviction of pods may be required.
- Let's say the pod you plan to deploy is critical, and you have some resource constraints. An example would be the DaemonSet of an infrastructure component like Grafana Loki. The Loki pods must run on every node before other pods can. In such cases, you could ensure resource availability by manually identifying and deleting the pods that are not required, or by adding a new node to the cluster. Both of these approaches are unsuitable, since the former would be tedious to execute and the latter could involve an expenditure of time and money.
- Another use case could be a single cluster that holds the pods for the below environments with associated priorities:
  - Production (prod): top priority
  - Preproduction (preprod): intermediate priority
  - Development (dev): least priority
In the event of high resource consumption in the cluster, there is competition for CPU and memory resources on the nodes. While cluster-level autoscaling may add more nodes, it takes time. In the interim, if there are no further nodes to scale the cluster, some Pods could remain in a Pending state, or the service could be degraded as they compete for resources. If the kubelet does evict a Pod from the node, that eviction would be random because the kubelet doesn't have any special information about which Pods to evict and which to keep.
- A third example could be a microservice backed by a queuing application or a database running into a resource crunch and the queue or database getting evicted. In such a case, all the other services would be rendered useless until the database can serve traffic again.
There can also be other scenarios where you want to control the order of scheduling or order of eviction of pods.
PriorityClasses in Kubernetes
PriorityClass is a cluster-wide API object in Kubernetes and part of the scheduling.k8s.io/v1 API group. It contains a mapping of the PriorityClass name (defined in .metadata.name) and an integer value (defined in .value). This represents the value that the scheduler uses to determine a Pod's relative priority.
Additionally, when you create a cluster using kubeadm or a managed Kubernetes service (for example, Azure Kubernetes Service), Kubernetes uses PriorityClasses to safeguard the pods that are hosted on the control plane nodes. This ensures that critical cluster components such as CoreDNS and kube-proxy can run even if resources are constrained.
This availability of pods is achieved through the use of a special PriorityClass that ensures the pods are up and running and that the overall cluster is not affected.
$ kubectl get priorityclass
NAME VALUE GLOBAL-DEFAULT AGE
system-cluster-critical 2000000000 false 82m
system-node-critical 2000001000 false 82m
The diagram below shows exactly how it works with the help of an example, which will be detailed in the upcoming section.
Pod scheduling and preemption
Pod priority and preemption
Pod preemption is a Kubernetes feature that allows the cluster to preempt pods (removing an existing Pod in favor of a new Pod) on the basis of priority. Pod priority indicates the importance of a pod relative to other pods while scheduling. If there aren't enough resources to run all the current pods, the scheduler tries to evict lower-priority pods over high-priority ones.
Also, when a healthy cluster experiences a node failure, typically, lower-priority pods get preempted to create room for higher-priority pods on the available node. This happens even if the cluster can bring up a new node automatically since pod creation is usually much faster than bringing up a new node.
PriorityClass requirements
Before you set up PriorityClasses, there are a few things to consider.
- Decide which PriorityClasses are needed. For instance, based on environment, type of pods, type of applications, etc.
- The default PriorityClass resource for your cluster. The pods without a priorityClassName will be treated as priority 0.
- Use a consistent naming convention for all PriorityClasses.
- Make sure that the pods for your workloads are running with the right PriorityClass.
PriorityClass hands-on example
Let's say there are 3 application pods: one for prod, one for preprod, and one for development. Below are three sample YAML manifest files for each of those.
---
# development
apiVersion: v1
kind: Pod
metadata:
name: dev-nginx
labels:
env: dev
spec:
containers:
- name: dev-nginx
image: nginx
resources:
requests:
memory: "256Mi"
cpu: "0.2"
limits:
memory: ".5Gi"
cpu: "0.5"
---
# preproduction
apiVersion: v1
kind: Pod
metadata:
name: preprod-nginx
labels:
env: preprod
spec:
containers:
- name: preprod-nginx
image: nginx
resources:
requests:
memory: "1.5Gi"
cpu: "1.5"
limits:
memory: "2Gi"
cpu: "2"
---
# production
apiVersion: v1
kind: Pod
metadata:
name: prod-nginx
labels:
env: prod
spec:
containers:
- name: prod-nginx
image: nginx
resources:
requests:
memory: "2Gi"
cpu: "2"
limits:
memory: "2Gi"
cpu: "2"
You can create these pods with the kubectl create -f <FILE.yaml> command, and then check their status using the kubectl get pods command. You can see if they are up and look ready to serve traffic:
$ kubectl get pods --show-labels
NAME READY STATUS RESTARTS AGE LABELS
dev-nginx 1/1 Running 0 55s env=dev
preprod-nginx 1/1 Running 0 55s env=preprod
prod-nginx 0/1 Pending 0 55s env=prod
Bad news. The pod for the Production environment is still Pending and isn't serving any traffic.
Let's see why this is happening:
$ kubectl get events
...
...
5s Warning FailedScheduling pod/prod-nginx 0/2 nodes are available: 1 Insufficient cpu, 2 Insufficient memory.
In this example, there is only one worker node, and that node has a resource crunch.
Now, let's look at how PriorityClass can help in this situation since prod should be given higher priority than the other environments.
PriorityClass API
Before creating PriorityClasses based on these requirements, let's see what a basic manifest for a PriorityClass looks like and outline some prerequisites:
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: PRIORITYCLASS_NAME
value: 0 # any integer value between -1000000000 and 1000000000
description: >-
(Optional) description goes here!
globalDefault: false # or true. Only one PriorityClass can be the global default.
Below are some prerequisites for PriorityClasses:
- The name of a PriorityClass must be a valid DNS subdomain name.
- When you make your own PriorityClass, the name should not start with system-, as those names are reserved by Kubernetes itself (for example, they are used for two built-in PriorityClasses).
- Its absolute value should be between -1000000000 and 1000000000 (1 billion).
- Larger numbers are reserved by PriorityClasses such as system-cluster-critical (this Pod is critically important to the cluster) and system-node-critical (the node critically relies on this Pod). system-node-critical is a higher priority than system-cluster-critical, because a cluster-critical Pod can only work well if the node where it is running has all its node-level critical requirements met.
- There are two optional fields:
  - globalDefault: When true, this PriorityClass is used for pods where a priorityClassName is not specified. Only one PriorityClass with globalDefault set to true can exist in a cluster. If there is no PriorityClass defined with globalDefault set to true, all the pods with no priorityClassName defined will be treated with 0 priority (i.e. the least priority).
  - description: A string with a meaningful value so that people know when to use this PriorityClass.

Setting globalDefault to true does not mean it will apply to existing pods that are already running. It applies only to pods that came into existence after the PriorityClass was created.

PriorityClass in action
Here's an example. Next, create some environment-specific PriorityClasses:
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: dev-pc
value: 1000000
globalDefault: false
description: >-
(Optional) This priority class should only be used for all development pods.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: preprod-pc
value: 2000000
globalDefault: false
description: >-
(Optional) This priority class should only be used for all preprod pods.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: prod-pc
value: 4000000
globalDefault: false
description: >-
(Optional) This priority class should only be used for all prod pods.
Use the kubectl create -f <FILE.YAML> command to create a PriorityClass, and kubectl get pc to check its status.
$ kubectl get pc
NAME VALUE GLOBAL-DEFAULT AGE
dev-pc 1000000 false 3m13s
preprod-pc 2000000 false 2m3s
prod-pc 4000000 false 7s
system-cluster-critical 2000000000 false 82m
system-node-critical 2000001000 false 82m
The new PriorityClasses are in place now. A small change is needed in the pod manifest or pod template (in a ReplicaSet or Deployment). In other words, you need to specify the priority class name at .spec.priorityClassName (which is a string value).
First update the previous production pod manifest file to have a PriorityClass assigned, then delete the Production pod and recreate it. You can't edit the priority class for a Pod that already exists.
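For illustration, here is a sketch of what the updated production manifest could look like; it mirrors the earlier prod-nginx manifest, and only the priorityClassName field is new (the value prod-pc refers to the PriorityClass created above):
---
# production (updated with a PriorityClass)
apiVersion: v1
kind: Pod
metadata:
  name: prod-nginx
  labels:
    env: prod
spec:
  priorityClassName: prod-pc # reference the PriorityClass created earlier
  containers:
  - name: prod-nginx
    image: nginx
    resources:
      requests:
        memory: "2Gi"
        cpu: "2"
      limits:
        memory: "2Gi"
        cpu: "2"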
In my cluster, when I tried this, here's what happened. First, that change seems successful; the status of pods has been updated:
$ kubectl get pods --show-labels
NAME READY STATUS RESTARTS AGE LABELS
dev-nginx 1/1 Terminating 0 55s env=dev
preprod-nginx 1/1 Running 0 55s env=preprod
prod-nginx 0/1 Pending 0 55s env=prod
The dev-nginx pod is getting terminated. Once that is successfully terminated and there are enough resources for the prod pod, the control plane can schedule the prod pod:
Warning FailedScheduling pod/prod-nginx 0/2 nodes are available: 1 Insufficient cpu, 2 Insufficient memory.
Normal Preempted pod/dev-nginx by default/prod-nginx on node node01
Normal Killing pod/dev-nginx Stopping container dev-nginx
Normal Scheduled pod/prod-nginx Successfully assigned default/prod-nginx to node01
Normal Pulling pod/prod-nginx Pulling image "nginx"
Normal Pulled pod/prod-nginx Successfully pulled image "nginx"
Normal Created pod/prod-nginx Created container prod-nginx
Normal Started pod/prod-nginx Started container prod-nginx
Enforcement
When you set up PriorityClasses, they exist exactly as you defined them. However, people (and tools) that make changes to your cluster are free to set any PriorityClass, or to not set any PriorityClass at all. Nevertheless, you can use other Kubernetes features to make sure that the priorities you wanted are actually applied.
As an alpha feature, you can define a ValidatingAdmissionPolicy and a ValidatingAdmissionPolicyBinding so that, for example, Pods that go into the prod namespace must use the prod-pc PriorityClass. With another ValidatingAdmissionPolicyBinding you ensure that the preprod namespace uses the preprod-pc PriorityClass, and so on. In any cluster, you can enforce similar controls using external projects such as Kyverno or Gatekeeper, through validating admission webhooks.
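As a rough sketch (not a drop-in manifest), a pair of objects along these lines could require the prod-pc PriorityClass for Pods created in the prod namespace; the object names, the CEL expression, and the use of the kubernetes.io/metadata.name namespace label are assumptions for this example, and the v1alpha1 API must be enabled in your cluster:
apiVersion: admissionregistration.k8s.io/v1alpha1
kind: ValidatingAdmissionPolicy
metadata:
  name: require-prod-priorityclass # hypothetical name
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups: [""]
      apiVersions: ["v1"]
      operations: ["CREATE", "UPDATE"]
      resources: ["pods"]
  validations:
  - expression: "object.spec.priorityClassName == 'prod-pc'"
    message: "Pods in the prod namespace must use the prod-pc PriorityClass."
---
apiVersion: admissionregistration.k8s.io/v1alpha1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: require-prod-priorityclass-binding # hypothetical name
spec:
  policyName: require-prod-priorityclass
  matchResources:
    namespaceSelector:
      matchLabels:
        kubernetes.io/metadata.name: prod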
However you do it, Kubernetes gives you options to make sure that the PriorityClasses are used how you wanted them to be, or perhaps just to warn users when they pick an unsuitable option.
Summary
The above example and its events show you what this feature of Kubernetes brings to the table, along with several scenarios where you can use this feature. To reiterate, this helps ensure that mission-critical pods are up and available to serve the traffic and, in the case of a resource crunch, determines cluster behavior.
It gives you some power to decide the order of scheduling and order of preemption for Pods. Therefore, you need to define the PriorityClasses sensibly. For example, if you have a cluster autoscaler to add nodes on demand, make sure to run it with the system-cluster-critical PriorityClass. You don't want to get in a situation where the autoscaler has been preempted and there are no new nodes coming online.
If you have any queries or feedback, feel free to reach out to me on LinkedIn.
12 Jan 2023 12:00am GMT
06 Jan 2023
Kubernetes – Production-Grade Container Orchestration
Blog: Kubernetes 1.26: Eviction policy for unhealthy pods guarded by PodDisruptionBudgets
Authors: Filip Křepinský (Red Hat), Morten Torkildsen (Google), Ravi Gudimetla (Apple)
Ensuring that disruptions to your applications do not affect their availability isn't a simple task. Last month's release of Kubernetes v1.26 lets you specify an unhealthy pod eviction policy for PodDisruptionBudgets (PDBs) to help you maintain that availability during node management operations. In this article, we will dive deeper into what modifications were introduced for PDBs to give application owners greater flexibility in managing disruptions.
What problems does this solve?
API-initiated eviction of pods respects PodDisruptionBudgets (PDBs). This means that a requested voluntary disruption via an eviction of a Pod should not disrupt a guarded application, and .status.currentHealthy of a PDB should not fall below .status.desiredHealthy. Running pods that are Unhealthy do not count towards the PDB status, but eviction of these is only possible in case the application is not disrupted. This helps disrupted or not-yet-started applications to achieve availability as soon as possible without additional downtime that would be caused by evictions.
Unfortunately, this poses a problem for cluster administrators that would like to drain nodes without any manual intervention. Misbehaving applications with pods in a CrashLoopBackOff state (due to a bug or misconfiguration), or pods that are simply failing to become ready, make this task much harder. Any eviction request will fail due to violation of a PDB when all pods of an application are unhealthy. Draining of a node cannot make any progress in that case.
On the other hand there are users that depend on the existing behavior, in order to:
- prevent data-loss that would be caused by deleting pods that are guarding an underlying resource or storage
- achieve the best availability possible for their application
Kubernetes 1.26 introduced a new experimental field to the PodDisruptionBudget API: .spec.unhealthyPodEvictionPolicy. When enabled, this field lets you support both of those requirements.
How does it work?
API-initiated eviction is the process that triggers graceful pod termination. The process can be initiated either by calling the API directly, by using a kubectl drain command, or by other actors in the cluster. During this process every pod removal is consulted with the appropriate PDBs, to ensure that a sufficient number of pods is always running in the cluster.
The following policies allow PDB authors to have greater control over how the process deals with unhealthy pods.
There are two policies, IfHealthyBudget and AlwaysAllow, to choose from.
The former, IfHealthyBudget, follows the existing behavior to achieve the best availability that you get by default. Unhealthy pods can be disrupted only if their application has a minimum available .status.desiredHealthy number of pods.
By setting the spec.unhealthyPodEvictionPolicy field of your PDB to AlwaysAllow, you are choosing the best effort availability for your application. With this policy it is always possible to evict unhealthy pods. This will make it easier to maintain and upgrade your clusters.
We think that AlwaysAllow will often be a better choice, but for some critical workloads you may still prefer to protect even unhealthy Pods from node drains or other forms of API-initiated eviction.
How do I use it?
This is an alpha feature, which means you have to enable the PDBUnhealthyPodEvictionPolicy feature gate, with the command line argument --feature-gates=PDBUnhealthyPodEvictionPolicy=true to the kube-apiserver.
Here's an example. Assume that you've enabled the feature gate in your cluster, and that you already defined a Deployment that runs a plain webserver. You labelled the Pods for that Deployment with app: nginx. You want to limit avoidable disruption, and you know that best effort availability is sufficient for this app. You decide to allow evictions even if those webserver pods are unhealthy. You create a PDB to guard this application, with the AlwaysAllow policy for evicting unhealthy pods:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: nginx-pdb
spec:
selector:
matchLabels:
app: nginx
maxUnavailable: 1
unhealthyPodEvictionPolicy: AlwaysAllow
How can I learn more?
- Read the KEP: Unhealthy Pod Eviction Policy for PDBs
- Read the documentation: Unhealthy Pod Eviction Policy for PodDisruptionBudgets
- Review the Kubernetes documentation for PodDisruptionBudgets, draining of Nodes and evictions
How do I get involved?
If you have any feedback, please reach out to us in the #sig-apps channel on Slack (visit https://slack.k8s.io/ for an invitation if you need one), or on the SIG Apps mailing list: kubernetes-sig-apps@googlegroups.com
06 Jan 2023 12:00am GMT
05 Jan 2023
Kubernetes – Production-Grade Container Orchestration
Blog: Kubernetes 1.26: Retroactive Default StorageClass
Author: Roman Bednář (Red Hat)
The v1.25 release of Kubernetes introduced an alpha feature to change how a default StorageClass was assigned to a PersistentVolumeClaim (PVC). With the feature enabled, you no longer need to create a default StorageClass first and PVC second to assign the class. Additionally, any PVCs without a StorageClass assigned can be updated later. This feature was graduated to beta in Kubernetes 1.26.
You can read about retroactive default StorageClass assignment in the Kubernetes documentation for more details about how to use it, or you can read on to learn about why the Kubernetes project is making this change.
Why did StorageClass assignment need improvements
Users might already be familiar with a similar feature that assigns default StorageClasses to new PVCs at the time of creation. This is currently handled by the admission controller.
But what if there wasn't a default StorageClass defined at the time of PVC creation? Users would end up with a PVC that would never be assigned a class. As a result, no storage would be provisioned, and the PVC would be somewhat "stuck" at this point. Generally, two main scenarios could result in "stuck" PVCs and cause problems later down the road. Let's take a closer look at each of them.
Changing default StorageClass
Before this feature, there were two options admins had when they wanted to change the default StorageClass:
- Creating a new StorageClass as default before removing the old one associated with the PVC. This would result in having two defaults for a short period. At this point, if a user were to create a PersistentVolumeClaim with storageClassName set to null (implying default StorageClass), the newest default StorageClass would be chosen and assigned to this PVC.
- Removing the old default first and creating a new default StorageClass. This would result in having no default for a short time. Subsequently, if a user were to create a PersistentVolumeClaim with storageClassName set to null (implying default StorageClass), the PVC would be in Pending state forever. The user would have to fix this by deleting the PVC and recreating it once the default StorageClass was available.
Resource ordering during cluster installation
If a cluster installation tool needed to create resources that required storage, for example, an image registry, it was difficult to get the ordering right. This is because any Pods that required storage would rely on the presence of a default StorageClass and would fail to be created if it wasn't defined.
What changed
We've changed the PersistentVolume (PV) controller to assign a default StorageClass to any unbound PersistentVolumeClaim that has the storageClassName set to null. We've also modified the PersistentVolumeClaim admission within the API server to allow the change of values from an unset value to an actual StorageClass name.
Null storageClassName versus storageClassName: "" - does it matter?
Before this feature was introduced, those values were equal in terms of behavior. Any PersistentVolumeClaim with the storageClassName set to null or "" would bind to an existing PersistentVolume resource with storageClassName also set to null or "".
With this new feature enabled we wanted to maintain this behavior but also be able to update the StorageClass name. With these constraints in mind, the feature changes the semantics of null. If a default StorageClass is present, null would translate to "Give me a default" and "" would mean "Give me PersistentVolume that also has "" StorageClass name." In the absence of a StorageClass, the behavior would remain unchanged.
Summarizing the above, we've changed the semantics of null so that its behavior depends on the presence or absence of a definition of default StorageClass.
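As a concrete illustration (the claim names are made up), the first claim below leaves storageClassName unset (null) and can be retroactively assigned a default StorageClass, while the second explicitly asks for no class and is left alone:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-wants-default # storageClassName unset (null): eligible for the default class
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-no-class # storageClassName: "" explicitly opts out of any StorageClass
spec:
  storageClassName: ""
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi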
The tables below show all these cases to better describe when PVC binds and when its StorageClass gets updated.
Default class | PV | PVC storageClassName = "" | PVC storageClassName = null |
---|---|---|---|
Without default class | PV storageClassName = "" | binds | binds |
Without default class | PV without storageClassName | binds | binds |
With default class | PV storageClassName = "" | binds | class updates |
With default class | PV without storageClassName | binds | class updates |
How to use it
If you want to test the feature whilst it's alpha, you need to enable the relevant feature gate in the kube-controller-manager and the kube-apiserver. Use the --feature-gates command line argument:
--feature-gates="...,RetroactiveDefaultStorageClass=true"
Test drive
If you would like to see the feature in action and verify it works fine in your cluster here's what you can try:
- Define a basic PersistentVolumeClaim:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-1
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
- Create the PersistentVolumeClaim when there is no default StorageClass. The PVC won't provision or bind (unless there is an existing, suitable PV already present) and will remain in Pending state.
$ kc get pvc
NAME    STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
pvc-1   Pending
- Configure one StorageClass as default.
$ kc patch sc -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
storageclass.storage.k8s.io/my-storageclass patched
- Verify that the PersistentVolumeClaim is now provisioned correctly and was updated retroactively with the new default StorageClass.
$ kc get pvc
NAME    STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
pvc-1   Bound    pvc-06a964ca-f997-4780-8627-b5c3bf5a87d8   1Gi        RWO            my-storageclass   87m
New metrics
To help you see that the feature is working as expected we also introduced a new retroactive_storageclass_total metric to show how many times the PV controller attempted to update a PersistentVolumeClaim, and retroactive_storageclass_errors_total to show how many of those attempts failed.
Getting involved
We always welcome new contributors so if you would like to get involved you can join our Kubernetes Storage Special-Interest-Group (SIG).
If you would like to share feedback, you can do so on our public Slack channel.
Special thanks to all the contributors that provided great reviews, shared valuable insight and helped implement this feature (alphabetical order):
- Deep Debroy (ddebroy)
- Divya Mohan (divya-mohan0209)
- Jan Šafránek (jsafrane)
- Joe Betz (jpbetz)
- Jordan Liggitt (liggitt)
- Michelle Au (msau42)
- Seokho Son (seokho-son)
- Shannon Kularathna (shannonxtreme)
- Tim Bannister (sftim)
- Tim Hockin (thockin)
- Wojciech Tyczynski (wojtek-t)
- Xing Yang (xing-yang)
05 Jan 2023 12:00am GMT
02 Jan 2023
Kubernetes – Production-Grade Container Orchestration
Blog: Kubernetes v1.26: Alpha support for cross-namespace storage data sources
Author: Takafumi Takahashi (Hitachi Vantara)
Kubernetes v1.26, released last month, introduced an alpha feature that lets you specify a data source for a PersistentVolumeClaim, even where the source data belongs to a different namespace. With the new feature enabled, you specify a namespace in the dataSourceRef field of a new PersistentVolumeClaim. Once Kubernetes checks that access is OK, the new PersistentVolume can populate its data from the storage source specified in that other namespace. Before Kubernetes v1.26, provided your cluster had the AnyVolumeDataSource feature enabled, you could already provision new volumes from a data source in the same namespace. However, that only worked for a data source in the same namespace, therefore users couldn't provision a PersistentVolume with a claim in one namespace from a data source in another namespace. To solve this problem, Kubernetes v1.26 added a new alpha namespace field to the dataSourceRef field in the PersistentVolumeClaim API.
How it works
Once the csi-provisioner finds that a data source is specified with a dataSourceRef that has a non-empty namespace name, it checks all reference grants within the namespace that's specified by the .spec.dataSourceRef.namespace field of the PersistentVolumeClaim, in order to see if access to the data source is allowed. If any ReferenceGrant allows access, the csi-provisioner provisions a volume from the data source.
Trying it out
The following things are required to use cross namespace volume provisioning:
- Enable the AnyVolumeDataSource and CrossNamespaceVolumeDataSource feature gates for the kube-apiserver and kube-controller-manager
- Install a CRD for the specific VolumeSnapShot controller
- Install the CSI Provisioner controller and enable the CrossNamespaceVolumeDataSource feature gate
- Install the CSI driver
- Install a CRD for ReferenceGrants
Putting it all together
To see how this works, you can install the sample and try it out. This sample creates a PVC in the dev namespace from a VolumeSnapshot in the prod namespace. That is a simple example. For real world use, you might want to use a more complex approach.
Assumptions for this example
- Your Kubernetes cluster was deployed with the AnyVolumeDataSource and CrossNamespaceVolumeDataSource feature gates enabled
- There are two namespaces, dev and prod
- CSI driver is being deployed
- There is an existing VolumeSnapshot named new-snapshot-demo in the prod namespace
- The ReferenceGrant CRD (from the Gateway API project) is already deployed
Grant ReferenceGrants read permission to the CSI Provisioner
Access to ReferenceGrants is only needed when the CSI driver has the CrossNamespaceVolumeDataSource controller capability. For this example, the external-provisioner needs get, list, and watch permissions for referencegrants (API group gateway.networking.k8s.io).
- apiGroups: ["gateway.networking.k8s.io"]
resources: ["referencegrants"]
verbs: ["get", "list", "watch"]
Enable the CrossNamespaceVolumeDataSource feature gate for the CSI Provisioner
Add --feature-gates=CrossNamespaceVolumeDataSource=true to the csi-provisioner command line. For example, use this manifest snippet to redefine the container:
- args:
- -v=5
- --csi-address=/csi/csi.sock
- --feature-gates=Topology=true
- --feature-gates=CrossNamespaceVolumeDataSource=true
image: csi-provisioner:latest
imagePullPolicy: IfNotPresent
name: csi-provisioner
Create a ReferenceGrant
Here's a manifest for an example ReferenceGrant.
apiVersion: gateway.networking.k8s.io/v1beta1
kind: ReferenceGrant
metadata:
name: allow-prod-pvc
namespace: prod
spec:
from:
- group: ""
kind: PersistentVolumeClaim
namespace: dev
to:
- group: snapshot.storage.k8s.io
kind: VolumeSnapshot
name: new-snapshot-demo
Create a PersistentVolumeClaim by using cross namespace data source
Kubernetes creates a PersistentVolumeClaim on dev and the CSI driver populates the PersistentVolume used on dev from snapshots on prod.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: example-pvc
namespace: dev
spec:
storageClassName: example
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 1Gi
dataSourceRef:
apiGroup: snapshot.storage.k8s.io
kind: VolumeSnapshot
name: new-snapshot-demo
namespace: prod
volumeMode: Filesystem
How can I learn more?
The enhancement proposal, Provision volumes from cross-namespace snapshots, includes lots of detail about the history and technical implementation of this feature.
Please get involved by joining the Kubernetes Storage Special Interest Group (SIG) to help us enhance this feature. There are a lot of good ideas already and we'd be thrilled to have more!
Acknowledgments
It takes a wonderful group to make wonderful software. Special thanks to the following people for the insightful reviews, thorough consideration and valuable contribution to the CrossNamespaceVolumeDataSource feature:
- Michelle Au (msau42)
- Xing Yang (xing-yang)
- Masaki Kimura (mkimuram)
- Tim Hockin (thockin)
- Ben Swartzlander (bswartz)
- Rob Scott (robscott)
- John Griffith (j-griffith)
- Michael Henriksen (mhenriks)
- Mustafa Elbehery (Elbehery)
It's been a joy to work with y'all on this.
02 Jan 2023 12:00am GMT
30 Dec 2022
Kubernetes – Production-Grade Container Orchestration
Blog: Kubernetes v1.26: Advancements in Kubernetes Traffic Engineering
Authors: Andrew Sy Kim (Google)
Kubernetes v1.26 includes significant advancements in network traffic engineering with the graduation of two features (Service internal traffic policy support, and EndpointSlice terminating conditions) to GA, and a third feature (Proxy terminating endpoints) to beta. The combination of these enhancements aims to address shortcomings in traffic engineering that people face today, and unlock new capabilities for the future.
Traffic Loss from Load Balancers During Rolling Updates
Prior to Kubernetes v1.26, clusters could experience loss of traffic from Service load balancers during rolling updates when setting the externalTrafficPolicy field to Local. There are a lot of moving parts at play here, so a quick overview of how Kubernetes manages load balancers might help!
In Kubernetes, you can create a Service with type: LoadBalancer to expose an application externally with a load balancer. The load balancer implementation varies between clusters and platforms, but the Service provides a generic abstraction representing the load balancer that is consistent across all Kubernetes installations.
apiVersion: v1
kind: Service
metadata:
name: my-service
spec:
selector:
app.kubernetes.io/name: my-app
ports:
- protocol: TCP
port: 80
targetPort: 9376
type: LoadBalancer
Under the hood, Kubernetes allocates a NodePort for the Service, which is then used by kube-proxy to provide a network data path from the NodePort to the Pod. A controller will then add all available Nodes in the cluster to the load balancer's backend pool, using the designated NodePort for the Service as the backend target port.

Figure 1: Overview of Service load balancers
Oftentimes it is beneficial to set externalTrafficPolicy: Local for Services, to avoid extra hops between Nodes that are not running healthy Pods backing that Service. When using externalTrafficPolicy: Local, an additional NodePort is allocated for health checking purposes, such that Nodes that do not contain healthy Pods are excluded from the backend pool for a load balancer.

Figure 2: Load balancer traffic to a healthy Node, when externalTrafficPolicy is Local
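To opt in, you set the field directly on the Service spec. A minimal sketch, extending the my-service example above:
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app.kubernetes.io/name: my-app
  ports:
  - protocol: TCP
    port: 80
    targetPort: 9376
  type: LoadBalancer
  externalTrafficPolicy: Local # only Nodes with healthy local Pods receive load balancer traffic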
One such scenario where traffic can be lost is when a Node loses all Pods for a Service, but the external load balancer has not probed the health check NodePort yet. The likelihood of this situation is largely dependent on the health checking interval configured on the load balancer. The larger the interval, the more likely this will happen, since the load balancer will continue to send traffic to a node even after kube-proxy has removed forwarding rules for that Service. This also occurs when Pods start terminating during rolling updates. Since Kubernetes does not consider terminating Pods as "Ready", traffic can be lost when there are only terminating Pods on any given Node during a rolling update.

Figure 3: Load balancer traffic to terminating endpoints, when externalTrafficPolicy is Local
Starting in Kubernetes v1.26, kube-proxy enables the ProxyTerminatingEndpoints feature by default, which adds automatic failover and routing to terminating endpoints in scenarios where the traffic would otherwise be dropped. More specifically, when there is a rolling update and a Node only contains terminating Pods, kube-proxy will route traffic to the terminating Pods based on their readiness. In addition, kube-proxy will actively fail the health check NodePort if there are only terminating Pods available. By doing so, kube-proxy alerts the external load balancer that new connections should not be sent to that Node but will gracefully handle requests for existing connections.

Figure 4: Load Balancer traffic to terminating endpoints with ProxyTerminatingEndpoints enabled, when externalTrafficPolicy is Local
EndpointSlice Conditions
In order to support this new capability in kube-proxy, the EndpointSlice API introduced new conditions for endpoints: serving and terminating.

Figure 5: Overview of EndpointSlice conditions
The serving condition is semantically identical to ready, except that it can be true or false while a Pod is terminating, unlike ready which will always be false for terminating Pods for compatibility reasons. The terminating condition is true for Pods undergoing termination (non-empty deletionTimestamp), false otherwise.
The addition of these two conditions enables consumers of this API to understand Pod states that were previously not possible. For example, we can now track "ready" and "not ready" Pods that are also terminating.

Figure 6: EndpointSlice conditions with a terminating Pod
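For illustration, an EndpointSlice endpoint carrying these conditions for a terminating but still serving Pod might look roughly like this (the slice name, Service name, and address are made up):
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: my-service-abc12 # hypothetical, generated name
  labels:
    kubernetes.io/service-name: my-service
addressType: IPv4
ports:
- name: http
  protocol: TCP
  port: 9376
endpoints:
- addresses:
  - "10.1.2.3"
  conditions:
    ready: false # always false once the Pod is terminating
    serving: true # the Pod still passes its readiness probe
    terminating: true # the Pod has a non-empty deletionTimestamp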
Consumers of the EndpointSlice API, such as kube-proxy and Ingress controllers, can now use these conditions to coordinate connection draining events, by continuing to forward traffic for existing connections but rerouting new connections to other non-terminating endpoints.
Optimizing Internal Node-Local Traffic
Similar to how Services can set externalTrafficPolicy: Local to avoid extra hops for externally sourced traffic, Kubernetes now supports internalTrafficPolicy: Local, to enable the same optimization for traffic originating within the cluster, specifically for traffic using the Service Cluster IP as the destination address. This feature graduated to Beta in Kubernetes v1.24 and is graduating to GA in v1.26.
Services default the internalTrafficPolicy field to Cluster, where traffic is randomly distributed to all endpoints.

Figure 7: Service routing when internalTrafficPolicy is Cluster
When internalTrafficPolicy is set to Local, kube-proxy will forward internal traffic for a Service only if there is an available endpoint that is local to the same Node.

Figure 8: Service routing when internalTrafficPolicy is Local
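A minimal sketch of a Service opting into this behavior (the Service name is made up; the field sits directly under spec):
apiVersion: v1
kind: Service
metadata:
  name: my-internal-service # hypothetical name
spec:
  selector:
    app.kubernetes.io/name: my-app
  ports:
  - protocol: TCP
    port: 80
    targetPort: 9376
  internalTrafficPolicy: Local # prefer endpoints on the same Node as the client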
With internalTrafficPolicy: Local, traffic will be dropped by kube-proxy when no local endpoints are available.
Getting Involved
If you're interested in future discussions on Kubernetes traffic engineering, you can get involved through SIG Network.
30 Dec 2022 12:00am GMT
29 Dec 2022
Kubernetes – Production-Grade Container Orchestration
Blog: Kubernetes 1.26: Job Tracking, to Support Massively Parallel Batch Workloads, Is Generally Available
Authors: Aldo Culquicondor (Google)
The Kubernetes 1.26 release includes a stable implementation of the Job controller that can reliably track a large number of Jobs with high levels of parallelism. SIG Apps and WG Batch have worked on this foundational improvement since Kubernetes 1.22. After multiple iterations and scale verifications, this is now the default implementation of the Job controller.
Paired with the Indexed completion mode, the Job controller can handle massively parallel batch Jobs, supporting up to 100k concurrent Pods.
The new implementation also made possible the development of Pod failure policy, which is in beta in the 1.26 release.
How do I use this feature?
To use Job tracking with finalizers, upgrade to Kubernetes 1.25 or newer and create new Jobs. You can also use this feature in v1.23 and v1.24, if you have the ability to enable the JobTrackingWithFinalizers feature gate.
If your cluster runs Kubernetes 1.26, Job tracking with finalizers is a stable feature. For v1.25, it's behind that feature gate, and your cluster administrators may have explicitly disabled it - for example, if you have a policy of not using beta features.
Jobs created before the upgrade will still be tracked using the legacy behavior. This is to avoid retroactively adding finalizers to running Pods, which might introduce race conditions.
For maximum performance on large Jobs, the Kubernetes project recommends using the Indexed completion mode. In this mode, the control plane is able to track Job progress with fewer API calls.
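As a sketch (the Job name, image, and sizes are made up), an Indexed Job for a large batch workload could look like this; each Pod receives its completion index through the JOB_COMPLETION_INDEX environment variable:
apiVersion: batch/v1
kind: Job
metadata:
  name: parallel-work # hypothetical name
spec:
  completionMode: Indexed
  completions: 10000
  parallelism: 500
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: busybox:1.36 # placeholder image
        command: ["sh", "-c", "echo processing shard $JOB_COMPLETION_INDEX"]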
If you are a developer of operator(s) for batch, HPC, AI, ML or related workloads, we encourage you to use the Job API to delegate accurate progress tracking to Kubernetes. If there is something missing in the Job API that forces you to manage plain Pods, the Working Group Batch welcomes your feedback and contributions.
Deprecation notices
During the development of the feature, the control plane added the annotation batch.kubernetes.io/job-tracking to the Jobs that were created when the feature was enabled. This allowed a safe transition for older Jobs, but it was never meant to stay.
In the 1.26 release, we deprecated the annotation batch.kubernetes.io/job-tracking and the control plane will stop adding it in Kubernetes 1.27. Along with that change, we will remove the legacy Job tracking implementation. As a result, the Job controller will track all Jobs using finalizers and it will ignore Pods that don't have the aforementioned finalizer.
Before you upgrade your cluster to 1.27, we recommend that you verify that there are no running Jobs that don't have the annotation, or you wait for those jobs to complete. Otherwise, you might observe the control plane recreating some Pods. We expect that this shouldn't affect any users, as the feature is enabled by default since Kubernetes 1.25, giving enough buffer for old jobs to complete.
What problem does the new implementation solve?
Generally, Kubernetes workload controllers, such as ReplicaSet or StatefulSet, rely on the existence of Pods or other objects in the API to determine the status of the workload and whether replacements are needed. For example, if a Pod that belonged to a ReplicaSet terminates or ceases to exist, the ReplicaSet controller needs to create a replacement Pod to satisfy the desired number of replicas (.spec.replicas).
Since its inception, the Job controller also relied on the existence of Pods in the API to track Job status. A Job has completion and failure handling policies, requiring the end state of a finished Pod to determine whether to create a replacement Pod or mark the Job as completed or failed. As a result, the Job controller depended on Pods, even terminated ones, to remain in the API in order to keep track of the status.
This dependency made the tracking of Job status unreliable, because Pods can be deleted from the API for a number of reasons, including:
- The garbage collector removing orphan Pods when a Node goes down.
- The garbage collector removing terminated Pods when they reach a threshold.
- The Kubernetes scheduler preempting a Pod to accommodate higher priority Pods.
- The taint manager evicting a Pod that doesn't tolerate a NoExecute taint.
- External controllers, not included as part of Kubernetes, or humans deleting Pods.
The new implementation
When a controller needs to take an action on objects before they are removed, it should add a finalizer to the objects that it manages. A finalizer prevents the objects from being deleted from the API until the finalizers are removed. Once the controller is done with the cleanup and accounting for the deleted object, it can remove the finalizer from the object and the control plane removes the object from the API.
This is what the new Job controller is doing: adding a finalizer during Pod creation, and removing the finalizer after the Pod has terminated and has been accounted for in the Job status. However, it wasn't that simple.
The main challenge is that there are at least two objects involved: the Pod and the Job. While the finalizer lives in the Pod object, the accounting lives in the Job object. There is no mechanism to atomically remove the finalizer in the Pod and update the counters in the Job status. Additionally, there could be more than one terminated Pod at a given time.
To solve this problem, we implemented a three-staged approach, each stage translating to an API call.
- For each terminated Pod, add the unique ID (UID) of the Pod into short-lived lists stored in the .status of the owning Job (.status.uncountedTerminatedPods); a sketch of such a status appears after this list.
- Remove the finalizer from the Pod(s).
- Atomically do the following operations:
  - remove UIDs from the short-lived lists
  - increment the overall succeeded and failed counters in the status of the Job.
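As an illustration of stage 1, the status of a Job part-way through this process might look roughly like this (the UIDs and counts are made up); once stage 3 completes, the UIDs disappear from the list and the succeeded counter is incremented accordingly:
status:
  active: 2
  succeeded: 10
  uncountedTerminatedPods:
    succeeded:
    - "2a3f9c41-7c6f-4bb5-9d6e-0d2a51e3a111" # terminated Pod awaiting finalizer removal and accounting
    - "6b1d2e90-45af-4c2b-8f3a-9c7d8e4f2222"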
Additional complications come from the fact that the Job controller might receive the results of the API changes in steps 1 and 2 out of order. We solved this by adding an in-memory cache for removed finalizers.
Still, we faced some issues during the beta stage, leaving some pods stuck with finalizers in some conditions (#108645, #109485, and #111646). As a result, we decided to switch that feature gate to be disabled by default for the 1.23 and 1.24 releases.
Once resolved, we re-enabled the feature for the 1.25 release. Since then, we have received reports from our customers running tens of thousands of Pods at a time in their clusters through the Job API. Seeing this success, we decided to graduate the feature to stable in 1.26, as part of our long term commitment to make the Job API the best way to run large batch Jobs in a Kubernetes cluster.
To learn more about the feature, you can read the KEP.
Acknowledgments
As with any Kubernetes feature, multiple people contributed to getting this done, from testing and filing bugs to reviewing code.
On behalf of SIG Apps, I would like to especially thank Jordan Liggitt (Google) for helping me debug and brainstorm solutions for more than one race condition and Maciej Szulik (Red Hat) for his thorough reviews.
29 Dec 2022 12:00am GMT
27 Dec 2022
Kubernetes – Production-Grade Container Orchestration
Blog: Kubernetes v1.26: CPUManager goes GA
Author: Francesco Romani (Red Hat)
The CPU Manager is a part of the kubelet, the Kubernetes node agent, which enables the user to allocate exclusive CPUs to containers. Since Kubernetes v1.10, where it graduated to Beta, the CPU Manager proved itself reliable and fulfilled its role of allocating exclusive CPUs to containers, so adoption has steadily grown making it a staple component of performance-critical and low-latency setups. Over time, most changes were about bugfixes or internal refactoring, with the following noteworthy user-visible changes:
- support explicit reservation of CPUs: it was already possible to request to reserve a given number of CPUs for system resources, including the kubelet itself, which will not be used for exclusive CPU allocation. Now it is possible to also explicitly select which CPUs to reserve instead of letting the kubelet pick them up automatically.
- report the exclusively allocated CPUs to containers, much like is already done for devices, using the kubelet-local PodResources API.
- optimize the usage of system resources, eliminating unnecessary sysfs changes.
The CPU Manager reached the point at which it "just works", so in Kubernetes v1.26 it has graduated to generally available (GA).
Customization options for CPU Manager
The CPU Manager supports two operation modes, configured using its policies. With the none policy, the CPU Manager allocates CPUs to containers without any specific constraint except the (optional) quota set in the Pod spec. With the static policy, provided that the Pod is in the Guaranteed QoS class and every container in that Pod requests an integer amount of vCPU cores, the CPU Manager allocates CPUs exclusively. Exclusive assignment means that other containers (whether from the same Pod, or from a different Pod) do not get scheduled onto that CPU.
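As a minimal sketch (names and image are illustrative), a Pod like the following qualifies for exclusive CPUs under the static policy, because requests equal limits (Guaranteed QoS) and the CPU amount is an integer:

apiVersion: v1
kind: Pod
metadata:
  name: exclusive-cpus-example   # hypothetical name
spec:
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
    resources:
      requests:
        cpu: "2"        # integer CPU count
        memory: "1Gi"
      limits:
        cpu: "2"        # requests == limits puts the Pod in the Guaranteed QoS class
        memory: "1Gi"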
This simple operational model served the user base pretty well, but as the CPU Manager matured more and more, users started to look at more elaborate use cases and how to better support them.
Rather than add more policies, the community realized that pretty much all the novel use cases are some variation of the behavior enabled by the static CPU Manager policy. Hence, it was decided to add options to tune the behavior of the static policy. The options have a varying degree of maturity, like any other Kubernetes feature, and in order to be accepted, each new option must provide backward-compatible behavior when disabled and document how it interacts with the other options, should they interact at all.
This enabled the Kubernetes project to graduate the CPU Manager core component and core CPU allocation algorithms to GA, while also enabling a new age of experimentation in this area. In Kubernetes v1.26, the CPU Manager supports three different policy options (a configuration sketch follows the list):
- full-pcpus-only: restrict the CPU Manager core allocation algorithm to full physical cores only, reducing noisy neighbor issues from hardware technologies that allow sharing cores.
- distribute-cpus-across-numa: drive the CPU Manager to evenly distribute CPUs across NUMA nodes, for cases where more than one NUMA node is required to satisfy the allocation.
- align-by-socket: change how the CPU Manager allocates CPUs to a container: consider CPUs to be aligned at the socket boundary, instead of NUMA node boundary.
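For reference, here is a minimal kubelet configuration sketch (values are illustrative) that enables the static policy together with one of these options; changing the policy typically also requires clearing the CPU manager state file and restarting the kubelet:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
cpuManagerPolicyOptions:
  full-pcpus-only: "true"
reservedSystemCPUs: "0,1"   # explicit CPU reservation, as mentioned above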
Further development
After graduating the main CPU Manager feature, each existing policy option will follow its own graduation process, independent from the CPU Manager and from each other option. There is room for new options to be added, but there's also a growing demand for even more flexibility than what the CPU Manager, and its policy options, currently grant.
Conversations are in progress in the community about splitting the CPU Manager and the other resource managers currently part of the kubelet executable into pluggable, independent kubelet plugins. If you are interested in this effort, please join the conversation on SIG Node communication channels (Slack, mailing list, weekly meeting).
Further reading
Please check out the Control CPU Management Policies on the Node task page to learn more about the CPU Manager, and how it fits in relation to the other node-level resource managers.
Getting involved
This feature is driven by the SIG Node community. Please join us to connect with the community and share your ideas and feedback around the above feature and beyond. We look forward to hearing from you!
27 Dec 2022 12:00am GMT
26 Dec 2022
Kubernetes – Production-Grade Container Orchestration
Blog: Kubernetes 1.26: Pod Scheduling Readiness
Author: Wei Huang (Apple), Abdullah Gharaibeh (Google)
Kubernetes 1.26 introduced a new Pod feature: scheduling gates. In Kubernetes, scheduling gates are keys that tell the scheduler when a Pod is ready to be considered for scheduling.
What problem does it solve?
When a Pod is created, the scheduler will continuously attempt to find a node that fits it. This infinite loop continues until the scheduler either finds a node for the Pod, or the Pod gets deleted.
Pods that remain unschedulable for long periods of time (e.g., ones that are blocked on some external event) waste scheduling cycles. A scheduling cycle may take ≅20ms or more depending on the complexity of the Pod's scheduling constraints. Therefore, at scale, those wasted cycles significantly impact the scheduler's performance. See the arrows in the "scheduler" box below.
Scheduling gates help address this problem. They allow declaring that newly created Pods are not ready for scheduling. When scheduling gates are present on a Pod, the scheduler ignores the Pod and therefore saves unnecessary scheduling attempts. Those Pods will also be ignored by Cluster Autoscaler if you have it installed in the cluster.
Clearing the gates is the responsibility of external controllers with knowledge of when the Pod should be considered for scheduling (e.g., a quota manager).
How does it work?
Scheduling gates in general work very similarly to finalizers. Pods with a non-empty spec.schedulingGates field show the status SchedulingGated and are blocked from scheduling. Note that more than one gate can be added, but they all should be added upon Pod creation (e.g., you can add them as part of the spec or via a mutating webhook).
NAME READY STATUS RESTARTS AGE
test-pod 0/1 SchedulingGated 0 10s
To clear the gates, you update the Pod by removing all of the items from the Pod's schedulingGates field. The gates do not need to be removed all at once, but only when all the gates are removed will the scheduler start to consider the Pod for scheduling.
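For illustration, a gated Pod could be created from a manifest like this minimal sketch (the gate name example.com/quota-check is a placeholder; any qualified name works):

apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  schedulingGates:
  - name: example.com/quota-check   # removed later by the controller that owns it
  containers:
  - name: pause
    image: registry.k8s.io/pause:3.6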
Under the hood, scheduling gates are implemented as a PreEnqueue scheduler plugin, a new scheduler framework extension point that is invoked at the beginning of each scheduling cycle.
Use Cases
An important use case this feature enables is dynamic quota management. Kubernetes supports ResourceQuota; however, the API server enforces quota at the time you attempt Pod creation. For example, if a new Pod exceeds the CPU quota, it gets rejected. The API server doesn't queue the Pod; therefore, whoever created the Pod needs to continuously attempt to recreate it. This either means a delay between resources becoming available and the Pod actually running, or it means load on the API server and scheduler due to constant attempts.
Scheduling gates allow an external quota manager to address the above limitation of ResourceQuota. Specifically, the manager could add an example.com/quota-check scheduling gate to all Pods created in the cluster (using a mutating webhook). The manager would then remove the gate when there is quota to start the Pod.
What's next?
To use this feature, the PodSchedulingReadiness feature gate must be enabled in the API Server and scheduler. You're more than welcome to test it out and tell us (SIG Scheduling) what you think!
Additional resources
- Pod Scheduling Readiness in the Kubernetes documentation
- Kubernetes Enhancement Proposal
26 Dec 2022 12:00am GMT
23 Dec 2022
Kubernetes – Production-Grade Container Orchestration
Blog: Kubernetes 1.26: Support for Passing Pod fsGroup to CSI Drivers At Mount Time
Authors: Fabio Bertinatto (Red Hat), Hemant Kumar (Red Hat)
Delegation of fsGroup to CSI drivers was first introduced as alpha in Kubernetes 1.22, and graduated to beta in Kubernetes 1.25. For Kubernetes 1.26, we are happy to announce that this feature has graduated to General Availability (GA).
In this release, if you specify an fsGroup in the security context for a (Linux) Pod, all processes in the Pod's containers are part of the additional group that you specified.
In previous Kubernetes releases, the kubelet would always apply the fsGroup ownership and permission changes to files in the volume according to the policy you specified in the Pod's .spec.securityContext.fsGroupChangePolicy field.
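As a minimal sketch (names and the group ID are illustrative), the relevant fields live in the Pod-level security context alongside a CSI-backed volume claim:

apiVersion: v1
kind: Pod
metadata:
  name: fsgroup-example   # hypothetical name
spec:
  securityContext:
    fsGroup: 1000
    fsGroupChangePolicy: "OnRootMismatch"
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: my-csi-claim   # assumed to be provisioned by a CSI driver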
Starting with Kubernetes 1.26, CSI drivers have the option to apply the fsGroup settings during volume mount time, which frees the kubelet from changing the permissions of files and directories in those volumes.
How does it work?
CSI drivers that support this feature should advertise the VOLUME_MOUNT_GROUP node capability.
After recognizing this information, the kubelet passes the fsGroup information to the CSI driver during pod startup. This is done through the NodeStageVolumeRequest and NodePublishVolumeRequest CSI calls.
Consequently, the CSI driver is expected to apply the fsGroup to the files in the volume using a mount option. As an example, Azure File CSIDriver utilizes the gid mount option to map the fsGroup information to all the files in the volume.
It should be noted that in the example above the kubelet refrains from directly applying the permission changes to the files and directories in that volume. Additionally, two policy definitions no longer have an effect: neither .spec.fsGroupPolicy for the CSIDriver object, nor .spec.securityContext.fsGroupChangePolicy for the Pod.
For more details about the inner workings of this feature, check out the enhancement proposal and the CSI Driver fsGroup Support in the CSI developer documentation.
Why is it important?
Without this feature, applying the fsGroup information to files is not possible in certain storage environments.
For instance, Azure File does not support a concept of POSIX-style ownership and permissions of files. The CSI driver is only able to set the file permissions at the volume level.
How do I use it?
This feature should be mostly transparent to users. If you maintain a CSI driver that should support this feature, read CSI Driver fsGroup Support for more information on how to support this feature in your CSI driver.
Existing CSI drivers that do not support this feature will continue to work as usual: they will not receive any fsGroup information from the kubelet. In addition to that, the kubelet will continue to perform the ownership and permissions changes to files for those volumes, according to the policies specified in .spec.fsGroupPolicy for the CSIDriver and .spec.securityContext.fsGroupChangePolicy for the relevant Pod.
23 Dec 2022 12:00am GMT
22 Dec 2022
Kubernetes – Production-Grade Container Orchestration
Blog: Kubernetes v1.26: GA Support for Kubelet Credential Providers
Authors: Andrew Sy Kim (Google), Dixita Narang (Google)
Kubernetes v1.26 introduced generally available (GA) support for kubelet credential provider plugins, offering an extensible plugin framework to dynamically fetch credentials for any container image registry.
Background
Kubernetes supports the ability to dynamically fetch credentials for a container registry service. Prior to Kubernetes v1.20, this capability was compiled into the kubelet and only available for Amazon Elastic Container Registry, Azure Container Registry, and Google Cloud Container Registry.

Figure 1: Kubelet built-in credential provider support for Amazon Elastic Container Registry, Azure Container Registry, and Google Cloud Container Registry.
Kubernetes v1.20 introduced alpha support for kubelet credential provider plugins, which provide a mechanism for the kubelet to dynamically authenticate and pull images for arbitrary container registries - whether these are public registries, managed services, or even a self-hosted registry. In Kubernetes v1.26, this feature is now GA.

Figure 2: Kubelet credential provider overview
Why is it important?
Prior to Kubernetes v1.20, if you wanted to dynamically fetch credentials for image registries other than ACR (Azure Container Registry), ECR (Elastic Container Registry), or GCR (Google Container Registry), you needed to modify the kubelet code. The new plugin mechanism can be used in any cluster, and lets you authenticate to new registries without any changes to Kubernetes itself. Any cloud provider or vendor can publish a plugin that lets you authenticate with their image registry.
How it works
The kubelet and the exec plugin binary communicate through stdio (stdin, stdout, and stderr) by sending and receiving json-serialized api-versioned types. If the exec plugin is enabled and the kubelet requires authentication information for an image that matches against a plugin, the kubelet will execute the plugin binary, passing the CredentialProviderRequest API via stdin. Then the exec plugin communicates with the container registry to dynamically fetch the credentials and returns the credentials in an encoded response of the CredentialProviderResponse API to the kubelet via stdout.

Figure 3: Kubelet credential provider plugin flow
On receiving credentials from the kubelet, the plugin can also indicate how long credentials can be cached for, to prevent unnecessary execution of the plugin by the kubelet for subsequent image pull requests to the same registry. In cases where the cache duration is not specified by the plugin, a default cache duration can be specified by the kubelet (more details below).
{
  "apiVersion": "kubelet.k8s.io/v1",
  "kind": "CredentialProviderResponse",
  "cacheDuration": "6h",
  "auth": {
    "private-registry.io/my-app": {
      "username": "exampleuser",
      "password": "token12345"
    }
  }
}
In addition, the plugin can specify the scope in which cached credentials are valid. This is specified through the cacheKeyType field in CredentialProviderResponse. When the value is Image, the kubelet will only use cached credentials for future image pulls that exactly match the image of the first request. When the value is Registry, the kubelet will use cached credentials for any subsequent image pulls destined for the same registry host but using different paths (for example, gcr.io/foo/bar and gcr.io/bar/foo refer to different images from the same registry). Lastly, when the value is Global, the kubelet will use returned credentials for all images that match against the plugin, including images that can map to different registry hosts (for example, gcr.io vs k8s.gcr.io). The cacheKeyType field is required by plugin implementations.
{
  "apiVersion": "kubelet.k8s.io/v1",
  "kind": "CredentialProviderResponse",
  "cacheKeyType": "Registry",
  "auth": {
    "private-registry.io/my-app": {
      "username": "exampleuser",
      "password": "token12345"
    }
  }
}
Using kubelet credential providers
You can configure credential providers by installing the exec plugin(s) into a local directory accessible by the kubelet on every node. Then you set two command line arguments for the kubelet:
- --image-credential-provider-config: the path to the credential provider plugin config file.
- --image-credential-provider-bin-dir: the path to the directory where credential provider plugin binaries are located.
The configuration file passed into --image-credential-provider-config is read by the kubelet to determine which exec plugins should be invoked for a container image used by a Pod. Note that the name of each provider must match the name of the binary located in the local directory specified in --image-credential-provider-bin-dir, otherwise the kubelet cannot locate the path of the plugin to invoke.
kind: CredentialProviderConfig
apiVersion: kubelet.config.k8s.io/v1
providers:
  - name: auth-provider-gcp
    apiVersion: credentialprovider.kubelet.k8s.io/v1
    matchImages:
    - "container.cloud.google.com"
    - "gcr.io"
    - "*.gcr.io"
    - "*.pkg.dev"
    args:
    - get-credentials
    - --v=3
    defaultCacheDuration: 1m
Below is an overview of how the Kubernetes project is using kubelet credential providers for end-to-end testing.

Figure 4: Kubelet credential provider configuration used for Kubernetes e2e testing
For more configuration details, see Kubelet Credential Providers.
Getting Involved
Come join SIG Node if you want to report bugs or have feature requests for the Kubelet Credential Provider. You can reach us through the usual SIG Node communication channels (Slack, mailing list, weekly meeting).
22 Dec 2022 12:00am GMT
20 Dec 2022
Kubernetes – Production-Grade Container Orchestration
Blog: Kubernetes 1.26: Introducing Validating Admission Policies
Authors: Joe Betz (Google), Cici Huang (Google)
In Kubernetes 1.26, the first alpha release of validating admission policies is available!
Validating admission policies use the Common Expression Language (CEL) to offer a declarative, in-process alternative to validating admission webhooks.
CEL was first introduced to Kubernetes for the Validation rules for CustomResourceDefinitions. This enhancement expands the use of CEL in Kubernetes to support a far wider range of admission use cases.
Admission webhooks can be burdensome to develop and operate. Webhook developers must implement and maintain a webhook binary to handle admission requests. Also, admission webhooks are complex to operate. Each webhook must be deployed, monitored, and have a well-defined upgrade and rollback plan. To make matters worse, if a webhook times out or becomes unavailable, the Kubernetes control plane can become unavailable. This enhancement avoids much of this complexity of admission webhooks by embedding CEL expressions into Kubernetes resources instead of calling out to a remote webhook binary.
For example, to set a limit on how many replicas a Deployment can have, start by defining a validation policy:
apiVersion: admissionregistration.k8s.io/v1alpha1
kind: ValidatingAdmissionPolicy
metadata:
  name: "demo-policy.example.com"
spec:
  matchConstraints:
    resourceRules:
    - apiGroups: ["apps"]
      apiVersions: ["v1"]
      operations: ["CREATE", "UPDATE"]
      resources: ["deployments"]
  validations:
  - expression: "object.spec.replicas <= 5"
The expression field contains the CEL expression that is used to validate admission requests. matchConstraints declares what types of requests this ValidatingAdmissionPolicy may validate.
Next bind the policy to the appropriate resources:
apiVersion: admissionregistration.k8s.io/v1alpha1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: "demo-binding-test.example.com"
spec:
  policyName: "demo-policy.example.com"
  matchResources:
    namespaceSelector:
      matchExpressions:
      - key: environment
        operator: In
        values:
        - test
This ValidatingAdmissionPolicyBinding resource binds the above policy only to namespaces where the environment label is set to test. Once this binding is created, the kube-apiserver will begin enforcing this admission policy.
To emphasize how much simpler this approach is than admission webhooks, if this example were instead implemented with a webhook, an entire binary would need to be developed and maintained just to perform a <= check. In our review of a wide range of admission webhooks used in production, the vast majority performed relatively simple checks, all of which can easily be expressed using CEL.
Validating admission policies are highly configurable, enabling policy authors to define policies that can be parameterized and scoped to resources as needed by cluster administrators.
For example, the above admission policy can be modified to make it configurable:
apiVersion: admissionregistration.k8s.io/v1alpha1
kind: ValidatingAdmissionPolicy
metadata:
  name: "demo-policy.example.com"
spec:
  paramKind:
    apiVersion: rules.example.com/v1 # You also need a CustomResourceDefinition for this API
    kind: ReplicaLimit
  matchConstraints:
    resourceRules:
    - apiGroups: ["apps"]
      apiVersions: ["v1"]
      operations: ["CREATE", "UPDATE"]
      resources: ["deployments"]
  validations:
  - expression: "object.spec.replicas <= params.maxReplicas"
Here, paramKind defines the resources used to configure the policy and the expression uses the params variable to access the parameter resource.
This allows multiple bindings to be defined, each configured differently. For example:
apiVersion: admissionregistration.k8s.io/v1alpha1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: "demo-binding-production.example.com"
spec:
  policyName: "demo-policy.example.com"
  paramRef:
    name: "demo-params-production.example.com"
  matchResources:
    namespaceSelector:
      matchExpressions:
      - key: environment
        operator: In
        values:
        - production
---
apiVersion: rules.example.com/v1 # defined via a CustomResourceDefinition
kind: ReplicaLimit
metadata:
  name: "demo-params-production.example.com"
maxReplicas: 1000
This binding and parameter resource pair limit Deployments in namespaces with the environment label set to production to a maximum of 1000 replicas. You can then use a separate binding and parameter pair to set a different limit for namespaces in the test environment.
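As a sketch of what that second pair could look like (the names and the limit of 5 replicas are illustrative, not prescribed by the release):

apiVersion: admissionregistration.k8s.io/v1alpha1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: "demo-binding-test.example.com"
spec:
  policyName: "demo-policy.example.com"
  paramRef:
    name: "demo-params-test.example.com"
  matchResources:
    namespaceSelector:
      matchExpressions:
      - key: environment
        operator: In
        values:
        - test
---
apiVersion: rules.example.com/v1 # defined via a CustomResourceDefinition
kind: ReplicaLimit
metadata:
  name: "demo-params-test.example.com"
maxReplicas: 5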
I hope this has given you a glimpse of what is possible with validating admission policies! There are many features that we have not yet touched on.
To learn more, read Validating Admission Policy.
We are working hard to add more features to admission policies and make the enhancement easier to use. Try it out, send us your feedback and help us build a simpler alternative to admission webhooks!
How do I get involved?
If you want to get involved in development of admission policies, discuss enhancement roadmaps, or report a bug, you can get in touch with developers at SIG API Machinery.
20 Dec 2022 12:00am GMT
19 Dec 2022
Kubernetes – Production-Grade Container Orchestration
Blog: Kubernetes 1.26: Device Manager graduates to GA
Author: Swati Sehgal (Red Hat)
The Device Plugin framework was introduced in the Kubernetes v1.8 release as a vendor independent framework to enable discovery, advertisement and allocation of external devices without modifying core Kubernetes. The feature graduated to Beta in v1.10. With the recent release of Kubernetes v1.26, Device Manager is now generally available (GA).
Within the kubelet, the Device Manager facilitates communication with device plugins using gRPC through Unix sockets. Device Manager and Device plugins both act as gRPC servers and clients by serving and connecting to the exposed gRPC services respectively. Device plugins serve a gRPC service that kubelet connects to for device discovery, advertisement (as extended resources) and allocation. Device plugins connect to the Registration gRPC service served by the kubelet to register themselves with the kubelet.
Please refer to the documentation for an example on how a pod can request a device exposed to the cluster by a device plugin.
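As a quick illustration (the extended resource name vendor.example.com/foo is hypothetical; real names are advertised by whichever device plugin you deploy), a Pod requests a device through its resource limits:

apiVersion: v1
kind: Pod
metadata:
  name: device-consumer
spec:
  containers:
  - name: demo
    image: registry.k8s.io/pause:3.9
    resources:
      limits:
        vendor.example.com/foo: 1   # one device exposed by the device plugin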
Here are some example implementations of device plugins:
- AMD GPU device plugin
- Collection of Intel device plugins for Kubernetes
- NVIDIA device plugin for Kubernetes
- SRIOV network device plugin for Kubernetes
Noteworthy developments since Device Plugin framework introduction
Kubelet APIs moved to kubelet staging repo
External facing deviceplugin API packages moved from k8s.io/kubernetes/pkg/kubelet/apis/ to k8s.io/kubelet/pkg/apis/ in v1.17. Refer to Move external facing kubelet apis to staging for more details on the rationale behind this change.
Device Plugin API updates
Additional gRPC endpoints were introduced:
- GetDevicePluginOptions is used by device plugins to communicate options to the DeviceManager in order to indicate whether PreStartContainer, GetPreferredAllocation or other future optional calls are supported and can be called before making devices available to the container.
- GetPreferredAllocation allows a device plugin to forward allocation preferences to the DeviceManager so it can incorporate this information into its allocation decisions. The DeviceManager will call out to a plugin at pod admission time asking for a preferred device allocation of a given size from a list of available devices to make a more informed decision. For example, specifying inter-device constraints to indicate a preference for the best-connected set of devices when allocating devices to a container.
- PreStartContainer is called before each container start if indicated by device plugins during the registration phase. It allows Device Plugins to run device-specific operations on the devices requested. For example, reconfiguring or reprogramming FPGAs before the container starts running.
Pull Requests that introduced these changes are here:
- Invoke preStart RPC call before container start, if desired by plugin
- Add GetPreferredAllocation() call to the v1beta1 device plugin API
With the introduction of the above endpoints, the interaction between the Device Manager in the kubelet and the device plugins can be shown as below:
Device Plugin framework Overview
Change in semantics of device plugin registration process
Device plugin code was refactored into a separate 'plugin' package under the devicemanager package to lay the groundwork for introducing a v1beta2 device plugin API. This would allow adding support in devicemanager to service multiple device plugin APIs at the same time.
With this refactoring work, it is now mandatory for a device plugin to start serving its gRPC service before registering itself with kubelet. Previously, these two operations were asynchronous and a device plugin could register itself before starting its gRPC server, which is no longer the case. For more details, refer to PR #109016 and Issue #112395.
Dynamic resource allocation
In Kubernetes 1.26, inspired by how Persistent Volumes are handled in Kubernetes, Dynamic Resource Allocation has been introduced to cater to devices that have more sophisticated resource requirements like:
- Decouple device initialization and allocation from the pod lifecycle.
- Facilitate dynamic sharing of devices between containers and pods.
- Support custom resource-specific parameters
- Enable resource-specific setup and cleanup actions
- Enable support for Network-attached resources, not just node-local resources
Is the Device Plugin API stable now?
No, the Device Plugin API is still not stable; the latest Device Plugin API version available is v1beta1. There are plans in the community to introduce v1beta2 API to service multiple plugin APIs at once. A per-API call with request/response types would allow adding support for newer API versions without explicitly bumping the API.
In addition to that, there are existing proposals in the community to introduce additional endpoints KEP-3162: Add Deallocate and PostStopContainer to Device Manager API.
19 Dec 2022 12:00am GMT
16 Dec 2022
Kubernetes – Production-Grade Container Orchestration
Blog: Kubernetes 1.26: Non-Graceful Node Shutdown Moves to Beta
Author: Xing Yang (VMware), Ashutosh Kumar (VMware)
Kubernetes v1.24 introduced an alpha quality implementation of improvements for handling a non-graceful node shutdown. In Kubernetes v1.26, this feature moves to beta. This feature allows stateful workloads to fail over to a different node after the original node is shut down or in a non-recoverable state, such as a hardware failure or a broken OS.
What is a node shutdown in Kubernetes?
In a Kubernetes cluster, it is possible for a node to shut down. This could happen either in a planned way or it could happen unexpectedly. You may plan for a security patch, or a kernel upgrade and need to reboot the node, or it may shut down due to preemption of VM instances. A node may also shut down due to a hardware failure or a software problem.
To trigger a node shutdown, you could run a shutdown or poweroff command in a shell, or physically press a button to power off a machine.
A node shutdown could lead to workload failure if the node is not drained before the shutdown.
In the following sections, we describe what a graceful node shutdown is and what a non-graceful node shutdown is.
What is a graceful node shutdown?
The kubelet's handling for a graceful node shutdown allows the kubelet to detect a node shutdown event, properly terminate the pods on that node, and release resources before the actual shutdown. Critical pods are terminated after all the regular pods are terminated, to ensure that the essential functions of an application can continue to work as long as possible.
What is a non-graceful node shutdown?
A node shutdown can be graceful only if the kubelet's node shutdown manager can detect the upcoming node shutdown action. However, there are cases where a kubelet does not detect a node shutdown action. This could happen because the shutdown command does not trigger the Inhibitor Locks mechanism used by the kubelet on Linux, or because of a user error, for example if the shutdownGracePeriod and shutdownGracePeriodCriticalPods details are not configured correctly for that node.
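For context, graceful shutdown detection is driven by kubelet configuration like the following minimal sketch (values are illustrative); if both durations are left at zero, graceful node shutdown handling is effectively disabled and any shutdown becomes non-graceful:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
shutdownGracePeriod: "30s"               # total time the node delays shutdown for pod termination
shutdownGracePeriodCriticalPods: "10s"   # portion of that time reserved for critical pods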
When a node is shut down (or crashes) and that shutdown was not detected by the kubelet node shutdown manager, it becomes a non-graceful node shutdown. Non-graceful node shutdown is a problem for stateful apps. If a node containing a Pod that is part of a StatefulSet is shut down in a non-graceful way, the Pod will be stuck in Terminating status indefinitely, and the control plane cannot create a replacement Pod for that StatefulSet on a healthy node. You can delete the failed Pods manually, but this is not ideal for a self-healing cluster. Similarly, Pods created by ReplicaSets as part of a Deployment that were bound to the now-shutdown node also stay in Terminating status indefinitely. If you have set a horizontal scaling limit, even those terminating Pods count against the limit, so your workload may struggle to self-heal if it was already at maximum scale. (By the way: if the node that had done a non-graceful shutdown comes back up, the kubelet does delete the old Pod, and the control plane can make a replacement.)
What's new for the beta?
For Kubernetes v1.26, the non-graceful node shutdown feature is beta and enabled by default. The NodeOutOfServiceVolumeDetach feature gate is enabled by default on kube-controller-manager instead of being opt-in; you can still disable it if needed (please also file an issue to explain the problem).
On the instrumentation side, the kube-controller-manager reports two new metrics.
- force_delete_pods_total: the number of pods that are being forcibly deleted (resets on Pod garbage collection controller restart)
- force_delete_pod_errors_total: the number of errors encountered when attempting forcible Pod deletion (also resets on Pod garbage collection controller restart)
How does it work?
In the case of a node shutdown, if a graceful shutdown is not working or the node is in a non-recoverable state due to hardware failure or broken OS, you can manually add an out-of-service taint on the Node. For example, this can be node.kubernetes.io/out-of-service=nodeshutdown:NoExecute or node.kubernetes.io/out-of-service=nodeshutdown:NoSchedule. This taint triggers pods on the node to be forcefully deleted if there are no matching tolerations on the pods. Persistent volumes attached to the shutdown node will be detached, and new pods will be created successfully on a different running node.
kubectl taint nodes <node-name> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute
Note: Before applying the out-of-service taint, you must verify that a node is already in shutdown or power-off state (not in the middle of restarting), either because the user intentionally shut it down or the node is down due to hardware failures, OS issues, etc.
Once all the workload pods that are linked to the out-of-service node have moved to a new running node, and the shutdown node has been recovered, you should remove that taint from the affected node.
What's next?
Depending on feedback and adoption, the Kubernetes team plans to push the Non-Graceful Node Shutdown implementation to GA in either 1.27 or 1.28.
This feature requires a user to manually add a taint to the node to trigger the failover of workloads and remove the taint after the node is recovered.
The cluster operator can automate this process by automatically applying the out-of-service taint if there is a programmatic way to determine that the node is really shut down and there isn't IO between the node and storage. The cluster operator can then automatically remove the taint after the workload fails over successfully to another running node and the shutdown node has been recovered.
In the future, we plan to find ways to automatically detect and fence nodes that are shut down or in a non-recoverable state and fail their workloads over to another node.
How can I learn more?
To learn more, read Non Graceful node shutdown in the Kubernetes documentation.
How to get involved?
We offer a huge thank you to all the contributors who helped with design, implementation, and review of this feature:
- Michelle Au (msau42)
- Derek Carr (derekwaynecarr)
- Danielle Endocrimes (endocrimes)
- Tim Hockin (thockin)
- Ashutosh Kumar (sonasingh46)
- Hemant Kumar (gnufied)
- Yuiko Mouri (YuikoTakada)
- Mrunal Patel (mrunalp)
- David Porter (bobbypage)
- Yassine Tijani (yastij)
- Jing Xu (jingxu97)
- Xing Yang (xing-yang)
There are many people who have helped review the design and implementation along the way. We want to thank everyone who has contributed to this effort including the about 30 people who have reviewed the KEP and implementation over the last couple of years.
This feature is a collaboration between SIG Storage and SIG Node. For those interested in getting involved with the design and development of any part of the Kubernetes Storage system, join the Kubernetes Storage Special Interest Group (SIG). For those interested in getting involved with the design and development of the components that support the controlled interactions between pods and host resources, join the Kubernetes Node SIG.
16 Dec 2022 6:00pm GMT
15 Dec 2022
Kubernetes – Production-Grade Container Orchestration
Blog: Kubernetes 1.26: Alpha API For Dynamic Resource Allocation
Authors: Patrick Ohly (Intel), Kevin Klues (NVIDIA)
Dynamic resource allocation is a new API for requesting resources. It is a generalization of the persistent volumes API for generic resources, making it possible to:
- access the same resource instance in different pods and containers,
- attach arbitrary constraints to a resource request to get the exact resource you are looking for,
- initialize a resource according to parameters provided by the user.
Third-party resource drivers are responsible for interpreting these parameters as well as tracking and allocating resources as requests come in.
Dynamic resource allocation is an alpha feature and only enabled when the DynamicResourceAllocation feature gate and the resource.k8s.io/v1alpha1 API group are enabled. For details, see the --feature-gates and --runtime-config kube-apiserver parameters. The kube-scheduler, kube-controller-manager and kubelet components all need the feature gate enabled as well.
The default configuration of kube-scheduler enables the DynamicResources plugin if and only if the feature gate is enabled. Custom configurations may have to be modified to include it.
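For a custom scheduler configuration, a minimal sketch (illustrative; with the feature gate on, the default profile already enables the plugin) could list the plugin explicitly like this:

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  plugins:
    multiPoint:
      enabled:
      - name: DynamicResources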
Once dynamic resource allocation is enabled, resource drivers can be installed to manage certain kinds of hardware. Kubernetes has a test driver that is used for end-to-end testing, but also can be run manually. See below for step-by-step instructions.
API
The new resource.k8s.io/v1alpha1 API group provides four new types:
- ResourceClass: defines which resource driver handles a certain kind of resource and provides common parameters for it. ResourceClasses are created by a cluster administrator when installing a resource driver.
- ResourceClaim: defines a particular resource instance that is required by a workload. Created by a user (lifecycle managed manually, can be shared between different Pods) or for individual Pods by the control plane based on a ResourceClaimTemplate (automatic lifecycle, typically used by just one Pod).
- ResourceClaimTemplate: defines the spec and some metadata for creating ResourceClaims. Created by a user when deploying a workload.
- PodScheduling: used internally by the control plane and resource drivers to coordinate pod scheduling when ResourceClaims need to be allocated for a Pod.
Parameters for ResourceClass and ResourceClaim are stored in separate objects, typically using the type defined by a CRD that was created when installing a resource driver.
With this alpha feature enabled, the spec of a Pod defines ResourceClaims that are needed for the Pod to run: this information goes into a new resourceClaims field. Entries in that list reference either a ResourceClaim or a ResourceClaimTemplate. When referencing a ResourceClaim, all Pods using this .spec (for example, inside a Deployment or StatefulSet) share the same ResourceClaim instance. When referencing a ResourceClaimTemplate, each Pod gets its own ResourceClaim instance.
For a container defined within a Pod, the resources.claims list defines whether that container gets access to these resource instances, which makes it possible to share resources between one or more containers inside the same Pod. For example, an init container could set up the resource before the application uses it.
Here is an example of a fictional resource driver. Two ResourceClaim objects will get created for this Pod and each container gets access to one of them.
Assuming a resource driver called resource-driver.example.com was installed together with the following resource class:
apiVersion: resource.k8s.io/v1alpha1
kind: ResourceClass
metadata:
  name: resource.example.com
driverName: resource-driver.example.com
An end-user could then allocate two specific resources of type resource.example.com as follows:
---
apiVersion: cats.resource.example.com/v1
kind: ClaimParameters
name: large-black-cats
spec:
  color: black
  size: large
---
apiVersion: resource.k8s.io/v1alpha1
kind: ResourceClaimTemplate
metadata:
  name: large-black-cats
spec:
  spec:
    resourceClassName: resource.example.com
    parametersRef:
      apiGroup: cats.resource.example.com
      kind: ClaimParameters
      name: large-black-cats
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-cats
spec:
  containers: # two example containers; each container claims one cat resource
  - name: first-example
    image: ubuntu:22.04
    command: ["sleep", "9999"]
    resources:
      claims:
      - name: cat-0
  - name: second-example
    image: ubuntu:22.04
    command: ["sleep", "9999"]
    resources:
      claims:
      - name: cat-1
  resourceClaims:
  - name: cat-0
    source:
      resourceClaimTemplateName: large-black-cats
  - name: cat-1
    source:
      resourceClaimTemplateName: large-black-cats
Scheduling
In contrast to native resources (such as CPU or RAM) and extended resources (managed by a device plugin, advertised by kubelet), the scheduler has no knowledge of what dynamic resources are available in a cluster or how they could be split up to satisfy the requirements of a specific ResourceClaim. Resource drivers are responsible for that. Drivers mark ResourceClaims as allocated once resources for it are reserved. This also then tells the scheduler where in the cluster a claimed resource is actually available.
ResourceClaims can get resources allocated as soon as the ResourceClaim is created (immediate allocation), without considering which Pods will use the resource. The default (wait for first consumer) is to delay allocation until a Pod that relies on the ResourceClaim becomes eligible for scheduling. This design with two allocation options is similar to how Kubernetes handles storage provisioning with PersistentVolumes and PersistentVolumeClaims.
In the wait for first consumer mode, the scheduler checks all ResourceClaims needed by a Pod. If the Pod has any ResourceClaims, the scheduler creates a PodScheduling object (a special object that requests scheduling details on behalf of the Pod). The PodScheduling object has the same name and namespace as the Pod, and the Pod as its owner. Using its PodScheduling object, the scheduler informs the resource drivers responsible for those ResourceClaims about nodes that the scheduler considers suitable for the Pod. The resource drivers respond by excluding nodes that don't have enough of the driver's resources left.
Once the scheduler has that resource information, it selects one node and stores that choice in the PodScheduling object. The resource drivers then allocate resources based on the relevant ResourceClaims so that the resources will be available on that selected node. Once that resource allocation is complete, the scheduler attempts to schedule the Pod to a suitable node. Scheduling can still fail at this point; for example, a different Pod could be scheduled to the same node in the meantime. If this happens, already allocated ResourceClaims may get deallocated to enable scheduling onto a different node.
As part of this process, ResourceClaims also get reserved for the Pod. Currently ResourceClaims can either be used exclusively by a single Pod or an unlimited number of Pods.
One key feature is that Pods do not get scheduled to a node unless all of their resources are allocated and reserved. This avoids the scenario where a Pod gets scheduled onto one node and then cannot run there, which is bad because such a pending Pod also blocks all other resources like RAM or CPU that were set aside for it.
Limitations
The scheduler plugin must be involved in scheduling Pods which use ResourceClaims. Bypassing the scheduler by setting the nodeName field leads to Pods that the kubelet refuses to start because the ResourceClaims are not reserved or not even allocated. It may be possible to remove this limitation in the future.
Writing a resource driver
A dynamic resource allocation driver typically consists of two separate-but-coordinating components: a centralized controller, and a DaemonSet of node-local kubelet plugins. Most of the work required by the centralized controller to coordinate with the scheduler can be handled by boilerplate code. Only the business logic required to actually allocate ResourceClaims against the ResourceClasses owned by the plugin needs to be customized. As such, Kubernetes provides a package that includes APIs for invoking this boilerplate code as well as a Driver interface that you can implement to provide your custom business logic.
Likewise, boilerplate code can be used to register the node-local plugin with the kubelet, as well as start a gRPC server to implement the kubelet plugin API. For drivers written in Go, a corresponding helper package is recommended.
It is up to the driver developer to decide how these two components communicate. The KEP outlines an approach using CRDs.
Within SIG Node, we also plan to provide a complete example driver that can serve as a template for other drivers.
Running the test driver
The following steps bring up a local, one-node cluster directly from the Kubernetes source code. As a prerequisite, your cluster must have nodes with a container runtime that supports the Container Device Interface (CDI). For example, you can run CRI-O v1.23.2 or later. Once containerd v1.7.0 is released, we expect that you can run that or any later version. In the example below, we use CRI-O.
First, clone the Kubernetes source code. Inside that directory, run:
$ hack/install-etcd.sh
...
$ RUNTIME_CONFIG=resource.k8s.io/v1alpha1 \
FEATURE_GATES=DynamicResourceAllocation=true \
DNS_ADDON="coredns" \
CGROUP_DRIVER=systemd \
CONTAINER_RUNTIME_ENDPOINT=unix:///var/run/crio/crio.sock \
LOG_LEVEL=6 \
ENABLE_CSI_SNAPSHOTTER=false \
API_SECURE_PORT=6444 \
ALLOW_PRIVILEGED=1 \
PATH=$(pwd)/third_party/etcd:$PATH \
./hack/local-up-cluster.sh -O
...
To start using your cluster, you can open up another terminal/tab and run:
export KUBECONFIG=/var/run/kubernetes/admin.kubeconfig
...
Once the cluster is up, in another terminal run the test driver controller. KUBECONFIG must be set for all of the following commands.
$ go run ./test/e2e/dra/test-driver --feature-gates ContextualLogging=true -v=5 controller
In another terminal, run the kubelet plugin:
$ sudo mkdir -p /var/run/cdi && \
sudo chmod a+rwx /var/run/cdi /var/lib/kubelet/plugins_registry /var/lib/kubelet/plugins/
$ go run ./test/e2e/dra/test-driver --feature-gates ContextualLogging=true -v=6 kubelet-plugin
Changing the permissions of the directories makes it possible to run and (when using delve) debug the kubelet plugin as a normal user, which is convenient because it uses the already populated Go cache. Remember to restore permissions with sudo chmod go-w when done. Alternatively, you can also build the binary and run that as root.
Now the cluster is ready to create objects:
$ kubectl create -f test/e2e/dra/test-driver/deploy/example/resourceclass.yaml
resourceclass.resource.k8s.io/example created
$ kubectl create -f test/e2e/dra/test-driver/deploy/example/pod-inline.yaml
configmap/test-inline-claim-parameters created
resourceclaimtemplate.resource.k8s.io/test-inline-claim-template created
pod/test-inline-claim created
$ kubectl get resourceclaims
NAME RESOURCECLASSNAME ALLOCATIONMODE STATE AGE
test-inline-claim-resource example WaitForFirstConsumer allocated,reserved 8s
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
test-inline-claim 0/2 Completed 0 21s
The test driver doesn't do much; it only sets environment variables as defined in the ConfigMap. The test pod dumps the environment, so the log can be checked to verify that everything worked:
$ kubectl logs test-inline-claim with-resource | grep user_a
user_a='b'
Next steps
- See the Dynamic Resource Allocation KEP for more information on the design.
- Read Dynamic Resource Allocation in the official Kubernetes documentation.
- You can participate in SIG Node and / or the CNCF Container Orchestrated Device Working Group.
- You can view or comment on the project board for dynamic resource allocation.
- In order to move this feature towards beta, we need feedback from hardware vendors, so here's a call to action: try out this feature, consider how it can help with problems that your users are having, and write resource drivers…
15 Dec 2022 12:00am GMT
13 Dec 2022
Kubernetes – Production-Grade Container Orchestration
Blog: Kubernetes 1.26: Windows HostProcess Containers Are Generally Available
Authors: Brandon Smith (Microsoft) and Mark Rossetti (Microsoft)
The long-awaited day has arrived: HostProcess containers, the Windows equivalent to Linux privileged containers, has finally made it to GA in Kubernetes 1.26!
What are HostProcess containers and why are they useful?
Cluster operators are often faced with the need to configure their nodes upon provisioning, such as installing Windows services, configuring registry keys, managing TLS certificates, making network configuration changes, or even deploying monitoring tools such as Prometheus's node-exporter. Previously, performing these actions on Windows nodes was usually done by running PowerShell scripts over SSH or WinRM sessions and/or working with your cloud provider's virtual machine management tooling. HostProcess containers now enable you to do all of this and more with minimal effort using Kubernetes native APIs.
With HostProcess containers you can now package any payload into the container image, map volumes into containers at runtime, and manage them like any other Kubernetes workload. You get all the benefits of containerized packaging and deployment methods combined with a reduction in both administrative and development cost. Gone are the days where cluster operators would need to manually log onto Windows nodes to perform administrative duties.
HostProcess containers differ quite significantly from regular Windows Server containers. They are run directly as processes on the host with the access policies of a user you specify. HostProcess containers run as either the built-in Windows system accounts or ephemeral users within a user group defined by you. HostProcess containers also share the host's network namespace and access/configure storage mounts visible to the host. On the other hand, Windows Server containers are highly isolated and exist in a separate execution namespace. Direct access to the host from a Windows Server container is explicitly disallowed by default.
How does it work?
Windows HostProcess containers are implemented with Windows Job Objects, a break from the previous container model which uses server silos. Job Objects are components of the Windows OS which offer the ability to manage a group of processes as a group (also known as a job) and assign resource constraints to the group as a whole. Job objects are specific to the Windows OS and are not associated with the Kubernetes Job API. They have no process or file system isolation, enabling the privileged payload to view and edit the host file system with the desired permissions, among other host resources. The init process, and any processes it launches (including processes explicitly launched by the user) are all assigned to the job object of that container. When the init process exits or is signaled to exit, all the processes in the job will be signaled to exit, the job handle will be closed and the storage will be unmounted.
HostProcess and Linux privileged containers enable similar scenarios but differ greatly in their implementation (hence the naming difference). HostProcess containers have their own PodSecurityContext fields. Those used to configure Linux privileged containers do not apply. Enabling privileged access to a Windows host is a fundamentally different process than with Linux so the configuration and capabilities of each differ significantly. Below is a diagram detailing the overall architecture of Windows HostProcess containers:
Two major features were added prior to moving to stable: the ability to run as local user accounts, and a simplified method of accessing volume mounts. To learn more, read Create a Windows HostProcess Pod.
HostProcess containers in action
Kubernetes SIG Windows has been busy putting HostProcess containers to use - even before GA! They've been very excited to use HostProcess containers for a number of important activities that were a pain to perform in the past.
Here are just a few of the many use cases with example deployments:
How do I use it?
A HostProcess container can be built using any base image of your choosing; however, for convenience we have created a HostProcess container base image. This image is only a few KB in size and does not inherit any of the same compatibility requirements as regular Windows server containers, which allows it to run on any Windows server version.
To use that Microsoft image, put this in your Dockerfile:
FROM mcr.microsoft.com/oss/kubernetes/windows-host-process-containers-base-image:v1.0.0
You can run HostProcess containers from within a HostProcess Pod.
To get started with running Windows containers, see the general guidance for deploying Windows nodes. If you have a compatible node (for example: Windows as the operating system with containerd v1.7 or later as the container runtime), you can deploy a Pod with one or more HostProcess containers. See the Create a Windows HostProcess Pod - Prerequisites for more information.
Please note that within a Pod, you can't mix HostProcess containers with normal Windows containers.
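For orientation, here is a minimal HostProcess Pod sketch (the Pod name, command, and payload are illustrative; the base image is the one mentioned above):

apiVersion: v1
kind: Pod
metadata:
  name: hostprocess-example
spec:
  securityContext:
    windowsOptions:
      hostProcess: true
      runAsUserName: "NT AUTHORITY\\SYSTEM"   # one of the built-in Windows accounts
  hostNetwork: true                           # required for HostProcess Pods
  containers:
  - name: configure-node
    image: mcr.microsoft.com/oss/kubernetes/windows-host-process-containers-base-image:v1.0.0
    command: ["powershell.exe", "-Command", "Get-Service kubelet"]   # illustrative payload
  nodeSelector:
    kubernetes.io/os: windows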
How can I learn more?
- Work through Create a Windows HostProcess Pod
- Read about Kubernetes Pod Security Standards and Pod Security Admission
- Read the enhancement proposal Windows Privileged Containers and Host Networking Mode (KEP-1981)
- Watch the Windows HostProcess for Configuration and Beyond KubeCon NA 2022 talk
How do I get involved?
Get involved with SIG Windows to contribute!
13 Dec 2022 12:00am GMT