20 Jan 2023

Kubernetes – Production-Grade Container Orchestration

Blog: Consider All Microservices Vulnerable — And Monitor Their Behavior

Author: David Hadas (IBM Research Labs)

This post warns DevOps against a false sense of security. Following security best practices when developing and configuring microservices does not result in non-vulnerable microservices. The post shows that although all deployed microservices are vulnerable, there is much that can be done to ensure microservices are not exploited. It explains how analyzing the behavior of clients and services from a security standpoint, named here "Security-Behavior Analysis", can protect the deployed vulnerable microservices. It points to Guard, an open source project offering security-behavior monitoring and control of Kubernetes microservices presumed vulnerable.

As cyber attacks continue to intensify in sophistication, organizations deploying cloud services continue to grow their cyber investments aiming to produce safe and non-vulnerable services. However, the year-by-year growth in cyber investments does not result in a parallel reduction in cyber incidents. Instead, the number of cyber incidents continues to grow annually. Evidently, organizations are doomed to fail in this struggle - no matter how much effort is made to detect and remove cyber weaknesses from deployed services, it seems offenders always have the upper hand.

Considering the current spread of offensive tools, sophistication of offensive players, and ever-growing cyber financial gains to offenders, any cyber strategy that relies on constructing a non-vulnerable, weakness-free service in 2023 is clearly too naïve. It seems the only viable strategy is to:

Admit that your services are vulnerable!

In other words, consciously accept that you will never create completely invulnerable services. If your opponents find even a single weakness as an entry-point, you lose! Admitting that in spite of your best efforts, all your services are still vulnerable is an important first step. Next, this post discusses what you can do about it...

How to protect microservices from being exploited

Being vulnerable does not necessarily mean that your service will be exploited. Though your services are vulnerable in some ways unknown to you, offenders still need to identify these vulnerabilities and then exploit them. If offenders fail to exploit your service vulnerabilities, you win! In other words, a vulnerability that can't be exploited represents a risk that can't be realized.

Figure 1. An Offender gaining foothold in a vulnerable service

The above diagram shows an example in which the offender does not yet have a foothold in the service; that is, it is assumed that your service does not run code controlled by the offender on day 1. In our example the service has vulnerabilities in the API exposed to clients. To gain an initial foothold the offender uses a malicious client to try and exploit one of the service API vulnerabilities. The malicious client sends an exploit that triggers some unplanned behavior of the service.

More specifically, let's assume the service is vulnerable to an SQL injection. The developer failed to sanitize the user input properly, thereby allowing clients to send values that would change the intended behavior. In our example, if a client sends a query string with the key "username" and the value "tom or 1=1", the client will receive the data of all users. Exploiting this vulnerability requires the client to send an irregular string as the value. Note that benign users will not send a string containing spaces or the equals sign as a username; they will normally send legal usernames, which may be defined, for example, as a short sequence of the characters a-z. No legal username can trigger unplanned service behavior.
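The client-side check described above can be sketched in a few lines of Python. This is only an illustration of the idea: the pattern (1 to 32 lowercase letters) and the function name `is_regular_username` are assumptions made for this example, not part of Guard or of any particular service.

```python
import re

# Assumed policy for this sketch: a legal username is 1-32 lowercase letters.
LEGAL_USERNAME = re.compile(r"[a-z]{1,32}")


def is_regular_username(value: str) -> bool:
    """Return True only if the whole value matches the legal username pattern."""
    return LEGAL_USERNAME.fullmatch(value) is not None


if __name__ == "__main__":
    for candidate in ("tom", "tom or 1=1"):
        verdict = "regular" if is_regular_username(candidate) else "irregular"
        print(f"{candidate!r}: {verdict}")
```

A request carrying "tom or 1=1" fails the check because of the spaces and the equals sign, so it could be blocked before it ever reaches the vulnerable query.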

In this simple example, one can already identify several opportunities to detect and block an attempt to exploit the vulnerability (un)intentionally left behind by the developer, making the vulnerability unexploitable. First, the malicious client behavior differs from the behavior of benign clients, as it sends irregular requests. If such a change in behavior is detected and blocked, the exploit will never reach the service. Second, the service behavior in response to the exploit differs from the service behavior in response to a regular request. Such behavior may include making subsequent irregular calls to other services such as a data store, taking irregular time to respond, and/or responding to the malicious client with an irregular response (for example, containing much more data than normally sent in case of benign clients making regular requests). Service behavioral changes, if detected, will also allow blocking the exploit in different stages of the exploitation attempt.
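The second opportunity, spotting irregular service behavior, can be illustrated with a similarly small sketch. It flags a response that is much larger than the responses seen for benign requests. The function name, the baseline numbers, and the three-standard-deviations threshold are all assumptions chosen for this example; a real monitor would likely learn and combine many more signals.

```python
from statistics import mean, pstdev
from typing import Sequence


def is_irregular_size(baseline: Sequence[int], observed: int, tolerance: float = 3.0) -> bool:
    """Flag a response whose size is far above what was seen for benign requests.

    `baseline` holds response sizes (in bytes) recorded for regular traffic.
    """
    avg = mean(baseline)
    # Fall back to 1.0 so a perfectly constant baseline still leaves some slack.
    spread = pstdev(baseline) or 1.0
    return observed > avg + tolerance * spread


if __name__ == "__main__":
    benign_sizes = [510, 495, 502, 498, 505]  # hypothetical single-user responses
    print(is_irregular_size(benign_sizes, 507))    # a typical response
    print(is_irregular_size(benign_sizes, 48000))  # e.g. the data of all users
```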

More generally, monitoring the behavior of clients and monitoring the behavior of services are two complementary approaches.

Combining both approaches may add a protection layer to the deployed vulnerable services, drastically decreasing the probability for anyone to successfully exploit any of the deployed vulnerable services. Next, let us identify four use cases where you need to use security-behavior monitoring.

Use cases

One can identify the following four different stages in the life of any service from a security standpoint. In each stage, security-behavior monitoring is required to meet different challenges:

Service state: Normal
Use case: No known vulnerabilities. The service owner is normally not aware of any known vulnerabilities in the service image or configuration. Yet, it is reasonable to assume that the service has weaknesses.
What you need: Provide generic protection against any unknown, zero-day, service vulnerabilities - Detect/block irregular patterns sent as part of incoming client requests that may be used as exploits.

Service state: Vulnerable
Use case: An applicable CVE is published. The service owner is required to release a new non-vulnerable revision of the service. Research shows that in practice this process of removing a known vulnerability may take many weeks to accomplish (2 months on average).
What you need: Add protection based on the CVE analysis - Detect/block incoming requests that include specific patterns that may be used to exploit the discovered vulnerability. Continue to offer services, although the service has a known vulnerability.

Service state: Exploitable
Use case: A known exploit is published. The service owner needs a way to filter incoming requests that contain the known exploit.
What you need: Add protection based on a known exploit signature - Detect/block incoming client requests that carry signatures identifying the exploit. Continue to offer services, despite the presence of an exploit.

Service state: Misused
Use case: An offender misuses pods backing the service. The offender can follow an attack pattern enabling him/her to misuse pods. The service owner needs to restart any compromised pods while using non-compromised pods to continue offering the service. Note that once a pod is restarted, the offender needs to repeat the attack pattern before he/she may again misuse it.
What you need: Identify and restart instances of the component that is being misused - At any given time, some backing pods may be compromised and misused, while others behave as designed. Detect/remove the misused pods while allowing other pods to continue servicing client requests.

Fortunately, microservice architecture is well suited to security-behavior monitoring as discussed next.

Security-Behavior of microservices versus monoliths

Kubernetes is often used to support workloads designed with microservice architecture. By design, microservices aim to follow the UNIX philosophy of "Do One Thing And Do It Well". Each microservice has a bounded context and a clear interface. In other words, you can expect the microservice clients to send relatively regular requests and the microservice to present a relatively regular behavior as a response to these requests. Consequently, a microservice architecture is an excellent candidate for security-behavior monitoring.

Figure 2. Microservices are well suited for security-behavior monitoring

The diagram above clarifies how dividing a monolithic service into a set of microservices improves our ability to perform security-behavior monitoring and control. In a monolithic service approach, different client requests are intertwined, resulting in a diminished ability to identify irregular client behaviors. Without prior knowledge, an observer of the intertwined client requests will find it hard to distinguish between types of requests and their related characteristics. Further, internal client requests are not exposed to the observer. Lastly, the aggregated behavior of the monolithic service is a compound of the many different internal behaviors of its components, making it hard to identify irregular service behavior.

In a microservice environment, each microservice is expected by design to offer a more well-defined service and to serve better-defined types of requests. This makes it easier for an observer to identify irregular client behavior and irregular service behavior. Further, a microservice design exposes the internal requests and internal services, which offers an observer more security-behavior data for identifying irregularities. Overall, this makes the microservice design pattern better suited for security-behavior monitoring and control.

Security-Behavior monitoring on Kubernetes

Kubernetes deployments seeking to add Security-Behavior monitoring may use Guard, developed under the CNCF project Knative. Guard is integrated into the full Knative automation suite that runs on top of Kubernetes. Alternatively, you can deploy Guard as a standalone tool to protect any HTTP-based workload on Kubernetes.

The goal of this post is to invite the Kubernetes community to action and introduce Security-Behavior monitoring and control to help secure Kubernetes-based deployments. Hopefully, as a follow-up, the community will:

  1. Analyze the cyber challenges presented for different Kubernetes use cases.
  2. Add appropriate security documentation for users on how to introduce Security-Behavior monitoring and control.
  3. Consider how to integrate with tools that can help users monitor and control their vulnerable services.

Getting involved

You are welcome to get involved and join the effort to develop security behavior monitoring and control for Kubernetes; to share feedback and contribute to code or documentation; and to make or suggest improvements of any kind.

20 Jan 2023 12:00am GMT

12 Jan 2023

Blog: Protect Your Mission-Critical Pods From Eviction With PriorityClass

Author: Sunny Bhambhani (InfraCloud Technologies)

Kubernetes has been widely adopted, and many organizations use it as their de-facto orchestration engine for running workloads that need to be created and deleted frequently.

Therefore, proper scheduling of the pods is key to ensuring that application pods are up and running within the Kubernetes cluster without any issues. This article delves into the use cases around resource management by leveraging the PriorityClass object to protect mission-critical or high-priority pods from getting evicted and making sure that the application pods are up, running, and serving traffic.

Resource management in Kubernetes

The control plane consists of multiple components. One of them, the scheduler (usually the built-in kube-scheduler), is responsible for assigning a node to each pod.

Whenever a pod is created, it enters a "pending" state, after which the scheduler determines which node is best suited for the placement of the new pod.

In the background, the scheduler runs as an infinite loop looking for pods without a nodeName set that are ready for scheduling. For each Pod that needs scheduling, the scheduler tries to decide which node should run that Pod.

If the scheduler cannot find any node, the pod remains in the pending state, which is not ideal.

The below diagram, from point number 1 through 4, explains the request flow:

Scheduling in Kubernetes

Typical use cases

Below are some real-life scenarios where control over the scheduling and eviction of pods may be required.

  1. Let's say the pod you plan to deploy is critical, and you have some resource constraints. An example would be the DaemonSet of an infrastructure component like Grafana Loki. The Loki pods must run before other pods can on every node. In such cases, you could ensure resource availability by manually identifying and deleting the pods that are not required or by adding a new node to the cluster. Both these approaches are unsuitable since the former would be tedious to execute, and the latter could involve an expenditure of time and money.

  2. Another use case could be a single cluster that holds the pods for the below environments with associated priorities:

    • Production (prod): top priority
    • Preproduction (preprod): intermediate priority
    • Development (dev): least priority

In the event of high resource consumption in the cluster, there is competition for CPU and memory resources on the nodes. While cluster-level autoscaling may add more nodes, it takes time. In the interim, if there are no further nodes to scale the cluster, some Pods could remain in a Pending state, or the service could be degraded as they compete for resources. If the kubelet does evict a Pod from the node, that eviction would be random because the kubelet doesn't have any special information about which Pods to evict and which to keep.

  3. A third example could be a microservice backed by a queuing application or a database running into a resource crunch and the queue or database getting evicted. In such a case, all the other services would be rendered useless until the database can serve traffic again.

There can also be other scenarios where you want to control the order of scheduling or order of eviction of pods.

PriorityClasses in Kubernetes

PriorityClass is a cluster-wide (non-namespaced) API object in Kubernetes, served from the scheduling.k8s.io/v1 API. It maps a PriorityClass name (defined in .metadata.name) to an integer value (defined in .value). The scheduler uses this value to determine a Pod's relative priority.

Additionally, when you create a cluster using kubeadm or a managed Kubernetes service (for example, Azure Kubernetes Service), Kubernetes uses PriorityClasses to safeguard the pods that are hosted on the control plane nodes. This ensures that critical cluster components such as CoreDNS and kube-proxy can run even if resources are constrained.

This availability of pods is achieved through the use of a special PriorityClass that ensures the pods are up and running and that the overall cluster is not affected.

$ kubectl get priorityclass
NAME                      VALUE        GLOBAL-DEFAULT   AGE
system-cluster-critical   2000000000   false            82m
system-node-critical      2000001000   false            82m

The diagram below shows exactly how it works with the help of an example, which will be detailed in the upcoming section.

Pod scheduling and preemption

Pod priority and preemption

Pod preemption is a Kubernetes feature that allows the cluster to preempt pods (removing an existing Pod in favor of a new Pod) on the basis of priority. Pod priority indicates the importance of a pod relative to other pods while scheduling. If there aren't enough resources to run a pending higher-priority pod, the scheduler tries to preempt (evict) lower-priority pods to make room for it.

Also, when a healthy cluster experiences a node failure, typically, lower-priority pods get preempted to create room for higher-priority pods on the available node. This happens even if the cluster can bring up a new node automatically since pod creation is usually much faster than bringing up a new node.
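To make the idea concrete, here is a deliberately simplified Python sketch of choosing preemption victims on a single node. It is not the kube-scheduler's algorithm, which also weighs PodDisruptionBudgets, affinity rules, graceful termination, and multiple candidate nodes. The `Pod` type, the function name `pick_victims`, the pod names, and the node capacity of 3500 millicores are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Pod:
    name: str
    priority: int
    cpu_millis: int  # CPU request in millicores


def pick_victims(capacity_millis: int, running: List[Pod], incoming: Pod) -> Optional[List[Pod]]:
    """Return the pods to preempt so that `incoming` fits, or None if it cannot fit.

    Only pods with a lower priority than `incoming` are candidates, and the
    lowest-priority candidates are chosen first.
    """
    free = capacity_millis - sum(p.cpu_millis for p in running)
    candidates = sorted(
        (p for p in running if p.priority < incoming.priority),
        key=lambda p: p.priority,
    )
    victims: List[Pod] = []
    for pod in candidates:
        if free >= incoming.cpu_millis:
            break  # enough room already, stop preempting
        victims.append(pod)
        free += pod.cpu_millis
    return victims if free >= incoming.cpu_millis else None


if __name__ == "__main__":
    running = [Pod("dev-nginx", 1000000, 200), Pod("preprod-nginx", 2000000, 1500)]
    incoming = Pod("prod-nginx", 4000000, 2000)
    victims = pick_victims(3500, running, incoming)
    print([v.name for v in victims] if victims is not None else "cannot schedule")
```

With these assumed numbers, removing only the lowest-priority pod frees enough CPU, so the higher-priority pod can be placed without touching the intermediate one.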

PriorityClass requirements

Before you set up PriorityClasses, there are a few things to consider.

  1. Decide which PriorityClasses are needed. For instance, based on environment, type of pods, type of applications, etc.
  2. Decide on the default PriorityClass for your cluster. Pods without a priorityClassName are treated as priority 0 unless a PriorityClass is marked as the global default.
  3. Use a consistent naming convention for all PriorityClasses.
  4. Make sure that the pods for your workloads are running with the right PriorityClass.

PriorityClass hands-on example

Let's say there are 3 application pods: one for prod, one for preprod, and one for development. Below are sample YAML manifests, one for each of them.

---
# development
apiVersion: v1
kind: Pod
metadata:
  name: dev-nginx
  labels:
    env: dev
spec:
  containers:
  - name: dev-nginx
    image: nginx
    resources:
      requests:
        memory: "256Mi"
        cpu: "0.2"
      limits:
        memory: ".5Gi"
        cpu: "0.5"
---
# preproduction
apiVersion: v1
kind: Pod
metadata:
  name: preprod-nginx
  labels:
    env: preprod
spec:
  containers:
  - name: preprod-nginx
    image: nginx
    resources:
      requests:
        memory: "1.5Gi"
        cpu: "1.5"
      limits:
        memory: "2Gi"
        cpu: "2"
---
# production
apiVersion: v1
kind: Pod
metadata:
  name: prod-nginx
  labels:
    env: prod
spec:
  containers:
  - name: prod-nginx
    image: nginx
    resources:
      requests:
        memory: "2Gi"
        cpu: "2"
      limits:
        memory: "2Gi"
        cpu: "2"

You can create these pods with the kubectl create -f <FILE.yaml> command, and then check their status using the kubectl get pods command. You can see if they are up and look ready to serve traffic:

$ kubectl get pods --show-labels
NAME            READY   STATUS    RESTARTS   AGE   LABELS
dev-nginx       1/1     Running   0          55s   env=dev
preprod-nginx   1/1     Running   0          55s   env=preprod
prod-nginx      0/1     Pending   0          55s   env=prod

Bad news. The pod for the Production environment is still Pending and isn't serving any traffic.

Let's see why this is happening:

$ kubectl get events
...
...
5s Warning FailedScheduling pod/prod-nginx 0/2 nodes are available: 1 Insufficient cpu, 2 Insufficient memory.

In this example, there is only one worker node, and that node has a resource crunch.

Now, let's look at how PriorityClass can help in this situation since prod should be given higher priority than the other environments.

PriorityClass API

Before creating PriorityClasses based on these requirements, let's see what a basic manifest for a PriorityClass looks like and outline some prerequisites:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: PRIORITYCLASS_NAME
value: 0 # any integer value from -1000000000 to 1000000000
description: >-
  (Optional) description goes here!
globalDefault: false # or true. Only one PriorityClass can be the global default.

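The comments in the manifest above note the main prerequisites: the value must be an integer within the stated range, and only one PriorityClass can be the global default.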

PriorityClass in action

Next, create some environment-specific PriorityClasses:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: dev-pc
value: 1000000
globalDefault: false
description: >-
  (Optional) This priority class should only be used for all development pods.
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: preprod-pc
value: 2000000
globalDefault: false
description: >-
  (Optional) This priority class should only be used for all preprod pods.
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: prod-pc
value: 4000000
globalDefault: false
description: >-
  (Optional) This priority class should only be used for all prod pods.

Use the kubectl create -f <FILE.yaml> command to create each PriorityClass, and kubectl get pc to list them.

$ kubectl get pc
NAME                      VALUE        GLOBAL-DEFAULT   AGE
dev-pc                    1000000      false            3m13s
preprod-pc                2000000      false            2m3s
prod-pc                   4000000      false            7s
system-cluster-critical   2000000000   false            82m
system-node-critical      2000001000   false            82m

The new PriorityClasses are in place now. A small change is needed in the pod manifest or pod template (in a ReplicaSet or Deployment). In other words, you need to specify the priority class name at .spec.priorityClassName (which is a string value).

First update the previous production pod manifest file to have a PriorityClass assigned, then delete the Production pod and recreate it. You can't edit the priority class for a Pod that already exists.
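As an illustration of that change, the following Python sketch takes the production Pod manifest as a plain dictionary and returns a copy with .spec.priorityClassName set. The helper name `with_priority_class` is an assumption for this example; editing the YAML file by hand achieves exactly the same result.

```python
import copy
import json


def with_priority_class(pod_manifest: dict, class_name: str) -> dict:
    """Return a copy of the Pod manifest with .spec.priorityClassName set."""
    updated = copy.deepcopy(pod_manifest)
    updated.setdefault("spec", {})["priorityClassName"] = class_name
    return updated


# The production Pod from the example above, expressed as a dictionary.
prod_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "prod-nginx", "labels": {"env": "prod"}},
    "spec": {
        "containers": [
            {
                "name": "prod-nginx",
                "image": "nginx",
                "resources": {
                    "requests": {"memory": "2Gi", "cpu": "2"},
                    "limits": {"memory": "2Gi", "cpu": "2"},
                },
            }
        ]
    },
}

if __name__ == "__main__":
    # kubectl accepts JSON as well as YAML, so this output can be saved to a
    # file and passed to `kubectl create -f`.
    print(json.dumps(with_priority_class(prod_pod, "prod-pc"), indent=2))
```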

In my cluster, when I tried this, here's what happened. First, that change seems successful; the status of pods has been updated:

$ kubectl get pods --show-labels
NAME            READY   STATUS        RESTARTS   AGE   LABELS
dev-nginx       1/1     Terminating   0          55s   env=dev
preprod-nginx   1/1     Running       0          55s   env=preprod
prod-nginx      0/1     Pending       0          55s   env=prod

The dev-nginx pod is getting terminated. Once that is successfully terminated and there are enough resources for the prod pod, the control plane can schedule the prod pod:

Warning FailedScheduling pod/prod-nginx 0/2 nodes are available: 1 Insufficient cpu, 2 Insufficient memory.
Normal Preempted pod/dev-nginx by default/prod-nginx on node node01
Normal Killing pod/dev-nginx Stopping container dev-nginx
Normal Scheduled pod/prod-nginx Successfully assigned default/prod-nginx to node01
Normal Pulling pod/prod-nginx Pulling image "nginx"
Normal Pulled pod/prod-nginx Successfully pulled image "nginx"
Normal Created pod/prod-nginx Created container prod-nginx
Normal Started pod/prod-nginx Started container prod-nginx

Enforcement

When you set up PriorityClasses, they exist just as you defined them. However, people (and tools) that make changes to your cluster are free to set any PriorityClass, or to not set any PriorityClass at all. Fortunately, you can use other Kubernetes features to make sure that the priorities you wanted are actually applied.

As an alpha feature, you can define a ValidatingAdmissionPolicy and a ValidatingAdmissionPolicyBinding so that, for example, Pods that go into the prod namespace must use the prod-pc PriorityClass. With another ValidatingAdmissionPolicyBinding you ensure that the preprod namespace uses the preprod-pc PriorityClass, and so on. In any cluster, you can enforce similar controls using external projects such as Kyverno or Gatekeeper, through validating admission webhooks.
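The rule that such a policy expresses is small. The Python sketch below only illustrates the check itself; it is not a ValidatingAdmissionPolicy, a webhook, or the API of Kyverno or Gatekeeper. The namespace-to-class mapping (including the dev entry) and the function name are assumptions made for this example.

```python
from typing import Optional

# Assumed mapping for this sketch: namespace name -> required PriorityClass name.
REQUIRED_CLASS = {"prod": "prod-pc", "preprod": "preprod-pc", "dev": "dev-pc"}


def check_priority_class(namespace: str, pod_spec: dict) -> Optional[str]:
    """Return an error message if the Pod violates the rule, otherwise None."""
    required = REQUIRED_CLASS.get(namespace)
    if required is None:
        return None  # no rule is defined for this namespace
    actual = pod_spec.get("priorityClassName")
    if actual != required:
        return (
            f"pods in namespace {namespace!r} must use PriorityClass "
            f"{required!r}, got {actual!r}"
        )
    return None


if __name__ == "__main__":
    print(check_priority_class("prod", {"priorityClassName": "prod-pc"}))
    print(check_priority_class("prod", {"priorityClassName": "dev-pc"}))
```

Returning a message instead of raising an error leaves the caller free to either reject the request or merely warn the user, which mirrors the two options mentioned in the text.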

However you do it, Kubernetes gives you options to make sure that the PriorityClasses are used how you wanted them to be, or perhaps just to warn users when they pick an unsuitable option.

Summary

The above example and its events show you what this feature of Kubernetes brings to the table, along with several scenarios where you can use this feature. To reiterate, this helps ensure that mission-critical pods are up and available to serve the traffic and, in the case of a resource crunch, determines cluster behavior.

It gives you some power to decide the order of scheduling and order of preemption for Pods. Therefore, you need to define the PriorityClasses sensibly. For example, if you have a cluster autoscaler to add nodes on demand, make sure to run it with the system-cluster-critical PriorityClass. You don't want to get in a situation where the autoscaler has been preempted and there are no new nodes coming online.

If you have any queries or feedback, feel free to reach out to me on LinkedIn.

12 Jan 2023 12:00am GMT

06 Jan 2023

Blog: Kubernetes 1.26: Eviction policy for unhealthy pods guarded by PodDisruptionBudgets

Authors: Filip Křepinský (Red Hat), Morten Torkildsen (Google), Ravi Gudimetla (Apple)

Ensuring that disruptions to your applications do not affect their availability isn't a simple task. Last month's release of Kubernetes v1.26 lets you specify an unhealthy pod eviction policy for PodDisruptionBudgets (PDBs) to help you maintain that availability during node management operations. In this article, we will dive deeper into what modifications were introduced for PDBs to give application owners greater flexibility in managing disruptions.

What problems does this solve?

API-initiated eviction of pods respects PodDisruptionBudgets (PDBs). This means that a voluntary disruption requested via an eviction of a Pod should not disrupt a guarded application, and .status.currentHealthy of a PDB should not fall below .status.desiredHealthy. Running pods that are Unhealthy do not count towards the PDB status, but eviction of these is only possible if the application is not disrupted. This helps a disrupted or not-yet-started application to achieve availability as soon as possible, without the additional downtime that evictions would cause.

Unfortunately, this poses a problem for cluster administrators that would like to drain nodes without any manual interventions. Misbehaving applications with pods in CrashLoopBackOff state (due to a bug or misconfiguration) or pods that are simply failing to become ready make this task much harder. Any eviction request will fail due to violation of a PDB, when all pods of an application are unhealthy. Draining of a node cannot make any progress in that case.

On the other hand, there are users who depend on the existing behavior.

Kubernetes 1.26 introduced a new experimental field to the PodDisruptionBudget API: .spec.unhealthyPodEvictionPolicy. When enabled, this field lets you support both of those requirements.

How does it work?

API-initiated eviction is the process that triggers graceful pod termination. The process can be initiated by calling the API directly, by using a kubectl drain command, or by other actors in the cluster. During this process, every pod removal is checked against the appropriate PDBs to ensure that a sufficient number of pods is always running in the cluster.

The following policies give PDB authors greater control over how the process deals with unhealthy pods.

There are two policies to choose from: IfHealthyBudget and AlwaysAllow.

The former, IfHealthyBudget, follows the existing behavior to achieve the best availability that you get by default. Unhealthy pods can be disrupted only if their application has at least the minimum number of healthy pods available, that is, .status.currentHealthy is at least .status.desiredHealthy.

By setting the .spec.unhealthyPodEvictionPolicy field of your PDB to AlwaysAllow, you are choosing best-effort availability for your application. With this policy it is always possible to evict unhealthy pods. This will make it easier to maintain and upgrade your clusters.

We think that AlwaysAllow will often be a better choice, but for some critical workloads you may still prefer to protect even unhealthy Pods from node drains or other forms of API-initiated eviction.
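The difference between the two policies can be summarized in a short Python sketch. It is a simplified reading of the rules described above, not the API server's eviction code, which also deals with cases such as multiple matching PDBs. The function name and parameters are assumptions made for this example.

```python
def eviction_allowed(
    pod_is_healthy: bool,
    current_healthy: int,
    desired_healthy: int,
    policy: str = "IfHealthyBudget",
) -> bool:
    """Decide whether a PDB-guarded running pod may be evicted."""
    if pod_is_healthy:
        # Evicting a healthy pod lowers currentHealthy, so there must be
        # spare budget left regardless of the unhealthy pod eviction policy.
        return current_healthy - 1 >= desired_healthy
    if policy == "AlwaysAllow":
        # Unhealthy pods can always be evicted.
        return True
    if policy == "IfHealthyBudget":
        # Unhealthy pods can be evicted only if the application is not disrupted.
        return current_healthy >= desired_healthy
    raise ValueError(f"unknown policy: {policy}")


if __name__ == "__main__":
    # An application whose pods are all crash-looping: 0 healthy, 2 desired.
    print(eviction_allowed(False, 0, 2, "IfHealthyBudget"))
    print(eviction_allowed(False, 0, 2, "AlwaysAllow"))
```

The crash-looping case is the one that blocks node drains under IfHealthyBudget and proceeds under AlwaysAllow.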

How do I use it?

This is an alpha feature, which means you have to enable the PDBUnhealthyPodEvictionPolicy feature gate, with the command line argument --feature-gates=PDBUnhealthyPodEvictionPolicy=true to the kube-apiserver.

Here's an example. Assume that you've enabled the feature gate in your cluster, and that you already defined a Deployment that runs a plain webserver. You labelled the Pods for that Deployment with app: nginx. You want to limit avoidable disruption, and you know that best effort availability is sufficient for this app. You decide to allow evictions even if those webserver pods are unhealthy. You create a PDB to guard this application, with the AlwaysAllow policy for evicting unhealthy pods:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nginx-pdb
spec:
  selector:
    matchLabels:
      app: nginx
  maxUnavailable: 1
  unhealthyPodEvictionPolicy: AlwaysAllow

How do I get involved?

If you have any feedback, please reach out to us in the #sig-apps channel on Slack (visit https://slack.k8s.io/ for an invitation if you need one), or on the SIG Apps mailing list: kubernetes-sig-apps@googlegroups.com

06 Jan 2023 12:00am GMT

05 Jan 2023

Blog: Kubernetes 1.26: Retroactive Default StorageClass

Author: Roman Bednář (Red Hat)

The v1.25 release of Kubernetes introduced an alpha feature to change how a default StorageClass is assigned to a PersistentVolumeClaim (PVC). With the feature enabled, you no longer need to create a default StorageClass first and the PVC second to assign the class. Additionally, any PVCs without a StorageClass assigned can be updated later. This feature graduated to beta in Kubernetes 1.26.

You can read about retroactive default StorageClass assignment in the Kubernetes documentation for more details on how to use it, or you can read on to learn why the Kubernetes project is making this change.

Why did StorageClass assignment need improvements

Users might already be familiar with a similar feature that assigns default StorageClasses to new PVCs at the time of creation. This is currently handled by the admission controller.

But what if there wasn't a default StorageClass defined at the time of PVC creation? Users would end up with a PVC that would never be assigned a class. As a result, no storage would be provisioned, and the PVC would be somewhat "stuck" at this point. Generally, two main scenarios could result in "stuck" PVCs and cause problems later down the road. Let's take a closer look at each of them.

Changing default StorageClass

Before this feature, admins had two options when they wanted to change the default StorageClass:

  1. Creating a new StorageClass as default before removing the old one associated with the PVC. This would result in having two defaults for a short period. At this point, if a user were to create a PersistentVolumeClaim with storageClassName set to null (implying default StorageClass), the newest default StorageClass would be chosen and assigned to this PVC.

  2. Removing the old default first and creating a new default StorageClass. This would result in having no default for a short time. Subsequently, if a user were to create a PersistentVolumeClaim with storageClassName set to null (implying default StorageClass), the PVC would be in Pending state forever. The user would have to fix this by deleting the PVC and recreating it once the default StorageClass was available.

Resource ordering during cluster installation

If a cluster installation tool needed to create resources that required storage, for example, an image registry, it was difficult to get the ordering right. This is because any Pods that required storage would rely on the presence of a default StorageClass and would fail to be created if it wasn't defined.

What changed

We've changed the PersistentVolume (PV) controller to assign a default StorageClass to any unbound PersistentVolumeClaim that has the storageClassName set to null. We've also modified the PersistentVolumeClaim admission within the API server to allow the change of values from an unset value to an actual StorageClass name.

Null storageClassName versus storageClassName: "" - does it matter?

Before this feature was introduced, those values were equal in terms of behavior. Any PersistentVolumeClaim with the storageClassName set to null or "" would bind to an existing PersistentVolume resource with storageClassName also set to null or "".

With this new feature enabled, we wanted to maintain this behavior but also be able to update the StorageClass name. With these constraints in mind, the feature changes the semantics of null. If a default StorageClass is present, null translates to "Give me a default" and "" means "Give me a PersistentVolume that also has "" as its StorageClass name." In the absence of a default StorageClass, the behavior remains unchanged.

Summarizing the above, we've changed the semantics of null so that its behavior depends on the presence or absence of a definition of default StorageClass.

The tables below show all these cases to better describe when PVC binds and when its StorageClass gets updated.

PVC binding behavior with Retroactive default StorageClass

                      | PV                          | PVC storageClassName = "" | PVC storageClassName = null
Without default class | PV storageClassName = ""    | binds                     | binds
Without default class | PV without storageClassName | binds                     | binds
With default class    | PV storageClassName = ""    | binds                     | class updates
With default class    | PV without storageClassName | binds                     | class updates
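
The "class updates" column of the table can be sketched in a few lines of Python. This is only an illustrative model of the semantics described above, not the PV controller's code: the function name `effective_storage_class` is hypothetical, `None` stands in for null, and binding itself is not modeled.

```python
from typing import Optional


def effective_storage_class(pvc_class: Optional[str],
                            default_class: Optional[str]) -> Optional[str]:
    """Return the storageClassName an unbound PVC ends up with.

    None models null (field unset); "" models an explicitly empty name.
    Hypothetical helper for illustration only.
    """
    if pvc_class is None and default_class is not None:
        # null now means "give me the default", so the class is updated.
        return default_class
    # "" is never rewritten, and null stays null while no default exists.
    return pvc_class


if __name__ == "__main__":
    print(effective_storage_class(None, "my-storageclass"))      # my-storageclass
    print(repr(effective_storage_class("", "my-storageclass")))  # ''
    print(effective_storage_class(None, None))                   # None
```

The key point is that only null is ever rewritten; an explicit "" keeps its pre-existing meaning.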

How to use it

If you want to test the feature whilst it's alpha, you need to enable the relevant feature gate in the kube-controller-manager and the kube-apiserver. Use the --feature-gates command line argument:

--feature-gates="...,RetroactiveDefaultStorageClass=true"

Test drive

If you would like to see the feature in action and verify that it works in your cluster, here's what you can try:

  1. Define a basic PersistentVolumeClaim:

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: pvc-1
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 1Gi
    
  2. Create the PersistentVolumeClaim when there is no default StorageClass. The PVC won't provision or bind (unless there is an existing, suitable PV already present) and will remain in Pending state.

    $ kc get pvc
    NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
    pvc-1 Pending
    
  3. Configure one StorageClass as default.

    $ kc patch sc my-storageclass -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
    storageclass.storage.k8s.io/my-storageclass patched
    
  4. Verify that the PersistentVolumeClaim is now provisioned correctly and was updated retroactively with the new default StorageClass.

    $ kc get pvc
    NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
    pvc-1 Bound pvc-06a964ca-f997-4780-8627-b5c3bf5a87d8 1Gi RWO my-storageclass 87m
    

New metrics

To help you see that the feature is working as expected, we also introduced a new retroactive_storageclass_total metric that shows how many times the PV controller attempted to update a PersistentVolumeClaim, and retroactive_storageclass_errors_total, which shows how many of those attempts failed.

Getting involved

We always welcome new contributors so if you would like to get involved you can join our Kubernetes Storage Special-Interest-Group (SIG).

If you would like to share feedback, you can do so on our public Slack channel.

Special thanks to all the contributors that provided great reviews, shared valuable insight and helped implement this feature (alphabetical order):

05 Jan 2023 12:00am GMT

02 Jan 2023

Blog: Kubernetes v1.26: Alpha support for cross-namespace storage data sources

Author: Takafumi Takahashi (Hitachi Vantara)

Kubernetes v1.26, released last month, introduced an alpha feature that lets you specify a data source for a PersistentVolumeClaim, even where the source data belong to a different namespace. With the new feature enabled, you specify a namespace in the dataSourceRef field of a new PersistentVolumeClaim. Once Kubernetes checks that access is OK, the new PersistentVolume can populate its data from the storage source specified in that other namespace.

Before Kubernetes v1.26, provided your cluster had the AnyVolumeDataSource feature enabled, you could already provision new volumes from a data source, but only when that data source was in the same namespace as the claim. Users therefore couldn't provision a PersistentVolume with a claim in one namespace from a data source in another namespace. To solve this problem, Kubernetes v1.26 added a new alpha namespace field to the dataSourceRef field in the PersistentVolumeClaim API.

How it works

Once the csi-provisioner finds that a data source is specified with a dataSourceRef that has a non-empty namespace name, it checks all ReferenceGrants within the namespace that's specified by the .spec.dataSourceRef.namespace field of the PersistentVolumeClaim, in order to see if access to the data source is allowed. If any ReferenceGrant allows access, the csi-provisioner provisions a volume from the data source.
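
The access check described above can be modeled roughly as follows. This is a simplified sketch, not the csi-provisioner's implementation: the dataclasses and the `access_allowed` function are hypothetical, and treating an unset `name` in a `to` entry as "any object of that kind" is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GrantFrom:
    group: str
    kind: str
    namespace: str


@dataclass
class GrantTo:
    group: str
    kind: str
    name: Optional[str] = None  # assumption: None matches any object name


@dataclass
class ReferenceGrant:
    namespace: str  # the namespace the grant lives in (the source side)
    from_: List[GrantFrom]
    to: List[GrantTo]


def access_allowed(grants: List[ReferenceGrant], pvc_namespace: str,
                   src_namespace: str, src_group: str, src_kind: str,
                   src_name: str) -> bool:
    """Return True if any grant in the source namespace permits the reference."""
    for grant in grants:
        if grant.namespace != src_namespace:
            continue  # only grants in the data source's namespace count
        from_ok = any(f.group == "" and f.kind == "PersistentVolumeClaim"
                      and f.namespace == pvc_namespace for f in grant.from_)
        to_ok = any(t.group == src_group and t.kind == src_kind
                    and t.name in (None, src_name) for t in grant.to)
        if from_ok and to_ok:
            return True
    return False
```

A single matching grant is enough; a claim from any namespace not listed in `from_` is refused.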

Trying it out

The following things are required to use cross namespace volume provisioning:

Putting it all together

To see how this works, you can install the sample and try it out. This sample creates a PVC in the dev namespace from a VolumeSnapshot in the prod namespace. That is a simple example. For real world use, you might want to use a more complex approach.

Assumptions for this example

Grant ReferenceGrants read permission to the CSI Provisioner

Access to ReferenceGrants is only needed when the CSI driver has the CrossNamespaceVolumeDataSource controller capability. For this example, the external-provisioner needs get, list, and watch permissions for referencegrants (API group gateway.networking.k8s.io).

 - apiGroups: ["gateway.networking.k8s.io"]
   resources: ["referencegrants"]
   verbs: ["get", "list", "watch"]

Enable the CrossNamespaceVolumeDataSource feature gate for the CSI Provisioner

Add --feature-gates=CrossNamespaceVolumeDataSource=true to the csi-provisioner command line. For example, use this manifest snippet to redefine the container:

 - args:
   - -v=5
   - --csi-address=/csi/csi.sock
   - --feature-gates=Topology=true
   - --feature-gates=CrossNamespaceVolumeDataSource=true
   image: csi-provisioner:latest
   imagePullPolicy: IfNotPresent
   name: csi-provisioner

Create a ReferenceGrant

Here's a manifest for an example ReferenceGrant.

apiVersion: gateway.networking.k8s.io/v1beta1
kind: ReferenceGrant
metadata:
  name: allow-prod-pvc
  namespace: prod
spec:
  from:
  - group: ""
    kind: PersistentVolumeClaim
    namespace: dev
  to:
  - group: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: new-snapshot-demo

Create a PersistentVolumeClaim by using cross namespace data source

Kubernetes creates a PersistentVolumeClaim on dev and the CSI driver populates the PersistentVolume used on dev from snapshots on prod.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-pvc
  namespace: dev
spec:
  storageClassName: example
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  dataSourceRef:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: new-snapshot-demo
    namespace: prod
  volumeMode: Filesystem

How can I learn more?

The enhancement proposal, Provision volumes from cross-namespace snapshots, includes lots of detail about the history and technical implementation of this feature.

Please get involved by joining the Kubernetes Storage Special Interest Group (SIG) to help us enhance this feature. There are a lot of good ideas already and we'd be thrilled to have more!

Acknowledgments

It takes a wonderful group to make wonderful software. Special thanks to the following people for the insightful reviews, thorough consideration and valuable contribution to the CrossNamespaceVolumeDataSource feature:

It's been a joy to work with y'all on this.

02 Jan 2023 12:00am GMT

30 Dec 2022

Blog: Kubernetes v1.26: Advancements in Kubernetes Traffic Engineering

Authors: Andrew Sy Kim (Google)

Kubernetes v1.26 includes significant advancements in network traffic engineering with the graduation of two features (Service internal traffic policy support, and EndpointSlice terminating conditions) to GA, and a third feature (Proxy terminating endpoints) to beta. The combination of these enhancements aims to address shortcomings in traffic engineering that people face today, and unlock new capabilities for the future.

Traffic Loss from Load Balancers During Rolling Updates

Prior to Kubernetes v1.26, clusters could experience loss of traffic from Service load balancers during rolling updates when setting the externalTrafficPolicy field to Local. There are a lot of moving parts at play here so a quick overview of how Kubernetes manages load balancers might help!

In Kubernetes, you can create a Service with type: LoadBalancer to expose an application externally with a load balancer. The load balancer implementation varies between clusters and platforms, but the Service provides a generic abstraction representing the load balancer that is consistent across all Kubernetes installations.

apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app.kubernetes.io/name: my-app
  ports:
  - protocol: TCP
    port: 80
    targetPort: 9376
  type: LoadBalancer

Under the hood, Kubernetes allocates a NodePort for the Service, which is then used by kube-proxy to provide a network data path from the NodePort to the Pod. A controller will then add all available Nodes in the cluster to the load balancer's backend pool, using the designated NodePort for the Service as the backend target port.

Figure 1: Overview of Service load balancers

Oftentimes it is beneficial to set externalTrafficPolicy: Local for Services, to avoid extra hops between Nodes that are not running healthy Pods backing that Service. When using externalTrafficPolicy: Local, an additional NodePort is allocated for health checking purposes, such that Nodes that do not contain healthy Pods are excluded from the backend pool for a load balancer.

Figure 2: Load balancer traffic to a healthy Node, when externalTrafficPolicy is Local

One scenario where traffic can be lost is when a Node loses all Pods for a Service, but the external load balancer has not probed the health check NodePort yet. The likelihood of this situation depends largely on the health checking interval configured on the load balancer. The larger the interval, the more likely this will happen, since the load balancer will continue to send traffic to a Node even after kube-proxy has removed forwarding rules for that Service. This also occurs when Pods start terminating during rolling updates. Since Kubernetes does not consider terminating Pods as "Ready", traffic can be lost when there are only terminating Pods on any given Node during a rolling update.

Figure 3: Load balancer traffic to terminating endpoints, when externalTrafficPolicy is Local

Starting in Kubernetes v1.26, kube-proxy enables the ProxyTerminatingEndpoints feature by default, which adds automatic failover and routing to terminating endpoints in scenarios where the traffic would otherwise be dropped. More specifically, when there is a rolling update and a Node only contains terminating Pods, kube-proxy will route traffic to the terminating Pods based on their readiness. In addition, kube-proxy will actively fail the health check NodePort if there are only terminating Pods available. By doing so, kube-proxy alerts the external load balancer that new connections should not be sent to that Node but will gracefully handle requests for existing connections.
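
The failover behavior described above can be sketched as a small selection function. This is a simplified model rather than kube-proxy's actual code: the `Endpoint` class and `pick_local_endpoints` are hypothetical names, and the `ready`, `serving`, and `terminating` flags correspond to the EndpointSlice conditions covered later in this post.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Endpoint:
    ip: str
    ready: bool
    serving: bool
    terminating: bool


def pick_local_endpoints(local: List[Endpoint]) -> Tuple[List[Endpoint], bool]:
    """Return (endpoints to route to, whether the health check NodePort passes).

    `local` holds only the endpoints on this Node, as with
    externalTrafficPolicy: Local.
    """
    ready = [e for e in local if e.ready]
    if ready:
        # Normal case: route to ready endpoints and report the Node healthy.
        return ready, True
    # Only terminating Pods remain: keep serving existing traffic through the
    # ones that are still serving, but fail the health check so the load
    # balancer stops sending new connections to this Node.
    fallback = [e for e in local if e.terminating and e.serving]
    return fallback, False
```

Failing the health check while still forwarding is what lets the load balancer drain the Node without dropping in-flight traffic.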

Figure 4: Load Balancer traffic to terminating endpoints with ProxyTerminatingEndpoints enabled, when externalTrafficPolicy is Local

EndpointSlice Conditions

In order to support this new capability in kube-proxy, the EndpointSlice API introduced new conditions for endpoints: serving and terminating.

Figure 5: Overview of EndpointSlice conditions

The serving condition is semantically identical to ready, except that it can be true or false while a Pod is terminating, unlike ready which will always be false for terminating Pods for compatibility reasons. The terminating condition is true for Pods undergoing termination (non-empty deletionTimestamp), false otherwise.
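
As a rough illustration of how the three conditions relate, the sketch below derives them from two inputs: whether the Pod passes its readiness checks, and whether it has a deletionTimestamp. The function name `endpoint_conditions` and its parameters are hypothetical; this is not the EndpointSlice controller's code.

```python
def endpoint_conditions(pod_ready: bool, deletion_timestamp_set: bool) -> dict:
    """Derive EndpointSlice-style conditions for one Pod (illustrative only)."""
    terminating = deletion_timestamp_set
    # serving tracks readiness regardless of termination.
    serving = pod_ready
    # ready is always false for terminating Pods, for compatibility.
    ready = serving and not terminating
    return {"ready": ready, "serving": serving, "terminating": terminating}


if __name__ == "__main__":
    # A Pod that still passes readiness checks while shutting down:
    print(endpoint_conditions(pod_ready=True, deletion_timestamp_set=True))
    # {'ready': False, 'serving': True, 'terminating': True}
```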

The addition of these two conditions lets consumers of this API distinguish Pod states that could not previously be expressed. For example, we can now track "ready" and "not ready" Pods that are also terminating.

Figure 6: EndpointSlice conditions with a terminating Pod

Consumers of the EndpointSlice API, such as kube-proxy and Ingress controllers, can now use these conditions to coordinate connection draining events, by continuing to forward traffic for existing connections but rerouting new connections to other non-terminating endpoints.

Optimizing Internal Node-Local Traffic

Similar to how Services can set externalTrafficPolicy: Local to avoid extra hops for externally sourced traffic, Kubernetes now supports internalTrafficPolicy: Local, to enable the same optimization for traffic originating within the cluster, specifically for traffic using the Service Cluster IP as the destination address. This feature graduated to Beta in Kubernetes v1.24 and is graduating to GA in v1.26.

Services default the internalTrafficPolicy field to Cluster, where traffic is randomly distributed to all endpoints.

Figure 7: Service routing when internalTrafficPolicy is Cluster

When internalTrafficPolicy is set to Local, kube-proxy will forward internal traffic for a Service only if there is an available endpoint that is local to the same Node.
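
The difference between the two policies can be sketched as follows. This is a toy model, not kube-proxy's data path: `route_internal` is a hypothetical function, endpoints are plain `(ip, node)` tuples, and returning `None` stands for traffic that is not forwarded.

```python
import random
from typing import List, Optional, Tuple


def route_internal(policy: str, endpoints: List[Tuple[str, str]],
                   client_node: str) -> Optional[str]:
    """Pick a destination IP for cluster-internal traffic, or None if dropped.

    endpoints is a list of (ip, node_name) tuples.
    """
    if policy == "Local":
        # Only endpoints on the same Node as the client are eligible.
        candidates = [ip for ip, node in endpoints if node == client_node]
    else:
        # "Cluster" (the default): every endpoint is eligible.
        candidates = [ip for ip, _ in endpoints]
    if not candidates:
        return None
    return random.choice(candidates)
```

With `Local`, a client on a Node that has no endpoint for the Service gets no destination at all, so the policy trades availability for locality.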

Figure 8: Service routing when internalTrafficPolicy is Local

Getting Involved

If you're interested in future discussions on Kubernetes traffic engineering, you can get involved in SIG Network through the following ways:

30 Dec 2022 12:00am GMT

29 Dec 2022

Blog: Kubernetes 1.26: Job Tracking, to Support Massively Parallel Batch Workloads, Is Generally Available

Authors: Aldo Culquicondor (Google)

The Kubernetes 1.26 release includes a stable implementation of the Job controller that can reliably track a large number of Jobs with high levels of parallelism. SIG Apps and WG Batch have worked on this foundational improvement since Kubernetes 1.22. After multiple iterations and scale verifications, this is now the default implementation of the Job controller.

Paired with the Indexed completion mode, the Job controller can handle massively parallel batch Jobs, supporting up to 100k concurrent Pods.

The new implementation also made possible the development of Pod failure policy, which is in beta in the 1.26 release.

How do I use this feature?

To use Job tracking with finalizers, upgrade to Kubernetes 1.25 or newer and create new Jobs. You can also use this feature in v1.23 and v1.24, if you have the ability to enable the JobTrackingWithFinalizers feature gate.

If your cluster runs Kubernetes 1.26, Job tracking with finalizers is a stable feature. For v1.25, it's behind that feature gate, and your cluster administrators may have explicitly disabled it - for example, if you have a policy of not using beta features.

Jobs created before the upgrade will still be tracked using the legacy behavior. This is to avoid retroactively adding finalizers to running Pods, which might introduce race conditions.

For maximum performance on large Jobs, the Kubernetes project recommends using the Indexed completion mode. In this mode, the control plane is able to track Job progress with fewer API calls.

If you are a developer of operator(s) for batch, HPC, AI, ML or related workloads, we encourage you to use the Job API to delegate accurate progress tracking to Kubernetes. If there is something missing in the Job API that forces you to manage plain Pods, the Working Group Batch welcomes your feedback and contributions.

Deprecation notices

During the development of the feature, the control plane added the annotation batch.kubernetes.io/job-tracking to the Jobs that were created when the feature was enabled. This allowed a safe transition for older Jobs, but it was never meant to stay.

In the 1.26 release, we deprecated the annotation batch.kubernetes.io/job-tracking and the control plane will stop adding it in Kubernetes 1.27. Along with that change, we will remove the legacy Job tracking implementation. As a result, the Job controller will track all Jobs using finalizers and it will ignore Pods that don't have the aforementioned finalizer.

Before you upgrade your cluster to 1.27, we recommend that you verify that there are no running Jobs that don't have the annotation, or that you wait for those Jobs to complete. Otherwise, you might observe the control plane recreating some Pods. We expect that this shouldn't affect any users, as the feature has been enabled by default since Kubernetes 1.25, giving enough buffer for old Jobs to complete.

What problem does the new implementation solve?

Generally, Kubernetes workload controllers, such as ReplicaSet or StatefulSet, rely on the existence of Pods or other objects in the API to determine the status of the workload and whether replacements are needed. For example, if a Pod that belonged to a ReplicaSet terminates or ceases to exist, the ReplicaSet controller needs to create a replacement Pod to satisfy the desired number of replicas (.spec.replicas).

Since its inception, the Job controller also relied on the existence of Pods in the API to track Job status. A Job has completion and failure handling policies, requiring the end state of a finished Pod to determine whether to create a replacement Pod or mark the Job as completed or failed. As a result, the Job controller depended on Pods, even terminated ones, to remain in the API in order to keep track of the status.

This dependency made the tracking of Job status unreliable, because Pods can be deleted from the API for a number of reasons, including:

The new implementation

When a controller needs to take an action on objects before they are removed, it should add a finalizer to the objects that it manages. A finalizer prevents the objects from being deleted from the API until the finalizers are removed. Once the controller is done with the cleanup and accounting for the deleted object, it can remove the finalizer from the object and the control plane removes the object from the API.

This is what the new Job controller is doing: adding a finalizer during Pod creation, and removing the finalizer after the Pod has terminated and has been accounted for in the Job status. However, it wasn't that simple.

The main challenge is that there are at least two objects involved: the Pod and the Job. While the finalizer lives in the Pod object, the accounting lives in the Job object. There is no mechanism to atomically remove the finalizer in the Pod and update the counters in the Job status. Additionally, there could be more than one terminated Pod at a given time.

To solve this problem, we implemented a three-stage approach, with each stage translating to an API call.

  1. For each terminated Pod, add the unique ID (UID) of the Pod into short-lived lists stored in the .status of the owning Job (.status.uncountedTerminatedPods).
  2. Remove the finalizer from the Pod(s).
  3. Atomically do the following operations:
    • remove UIDs from the short-lived lists
    • increment the overall succeeded and failed counters in the status of the Job.
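
The three stages above can be sketched as an in-memory simulation. This is only an illustration under simplifying assumptions, not the Job controller's code: the `Pod` and `JobStatus` classes and the `sync` function are hypothetical, the finalizer string is used purely as an illustrative constant, and each stage that would be a separate API call is just a block of Python here.

```python
from dataclasses import dataclass, field
from typing import List, Set

# Illustrative constant for this sketch.
FINALIZER = "batch.kubernetes.io/job-tracking"


@dataclass
class Pod:
    uid: str
    phase: str  # "Running", "Succeeded" or "Failed"
    finalizers: Set[str] = field(default_factory=lambda: {FINALIZER})


@dataclass
class JobStatus:
    succeeded: int = 0
    failed: int = 0
    # Short-lived lists, modeling .status.uncountedTerminatedPods.
    uncounted_succeeded: List[str] = field(default_factory=list)
    uncounted_failed: List[str] = field(default_factory=list)


def sync(job: JobStatus, pods: List[Pod]) -> None:
    finished = [p for p in pods
                if p.phase in ("Succeeded", "Failed") and FINALIZER in p.finalizers]
    # Stage 1: record the UIDs of terminated Pods in the Job status.
    for p in finished:
        target = job.uncounted_succeeded if p.phase == "Succeeded" else job.uncounted_failed
        if p.uid not in target:
            target.append(p.uid)
    # Stage 2: remove the finalizer so the Pods can be deleted.
    for p in finished:
        p.finalizers.discard(FINALIZER)
    # Stage 3: in one update, move the uncounted UIDs into the counters.
    job.succeeded += len(job.uncounted_succeeded)
    job.failed += len(job.uncounted_failed)
    job.uncounted_succeeded.clear()
    job.uncounted_failed.clear()
```

Because a Pod is only picked up while it still carries the finalizer, running `sync` again does not count the same Pod twice.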

Additional complications come from the fact that the Job controller might receive the results of the API changes in steps 1 and 2 out of order. We solved this by adding an in-memory cache for removed finalizers.

Still, we faced some issues during the beta stage, leaving some pods stuck with finalizers in some conditions (#108645, #109485, and #111646). As a result, we decided to switch that feature gate to be disabled by default for the 1.23 and 1.24 releases.

Once resolved, we re-enabled the feature for the 1.25 release. Since then, we have received reports from our customers running tens of thousands of Pods at a time in their clusters through the Job API. Seeing this success, we decided to graduate the feature to stable in 1.26, as part of our long term commitment to make the Job API the best way to run large batch Jobs in a Kubernetes cluster.

To learn more about the feature, you can read the KEP.

Acknowledgments

As with any Kubernetes feature, multiple people contributed to getting this done, from testing and filing bugs to reviewing code.

On behalf of SIG Apps, I would like to especially thank Jordan Liggitt (Google) for helping me debug and brainstorm solutions for more than one race condition and Maciej Szulik (Red Hat) for his thorough reviews.

29 Dec 2022 12:00am GMT

27 Dec 2022

Blog: Kubernetes v1.26: CPUManager goes GA

Author: Francesco Romani (Red Hat)

The CPU Manager is a part of the kubelet, the Kubernetes node agent, which enables the user to allocate exclusive CPUs to containers. Since Kubernetes v1.10, where it graduated to Beta, the CPU Manager has proved itself reliable and has fulfilled its role of allocating exclusive CPUs to containers, so adoption has steadily grown, making it a staple component of performance-critical and low-latency setups. Over time, most changes were bug fixes or internal refactoring, with the following noteworthy user-visible changes:

The CPU Manager reached the point at which it "just works", so in Kubernetes v1.26 it has graduated to generally available (GA).

Customization options for CPU Manager

The CPU Manager supports two operation modes, configured using its policies. With the none policy, the CPU Manager allocates CPUs to containers without any specific constraint except the (optional) quota set in the Pod spec. With the static policy, provided that the Pod is in the Guaranteed QoS class and every container in that Pod requests an integer amount of vCPU cores, the CPU Manager allocates CPUs exclusively. Exclusive assignment means that other containers (whether from the same Pod, or from a different Pod) do not get scheduled onto that CPU.
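
The eligibility rule for the static policy, as this post describes it, can be sketched as a small check. This is a simplified illustration and not kubelet code: the helper names are hypothetical, containers are plain dictionaries, and the Guaranteed QoS test is reduced to "CPU and memory requests are set and equal to limits" for every container.

```python
def millicpu(quantity: str) -> int:
    """Convert a CPU quantity such as "2", "1.5" or "500m" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(round(float(quantity) * 1000))


def is_guaranteed(containers: list) -> bool:
    """Simplified Guaranteed QoS check: requests equal limits for CPU and memory."""
    for c in containers:
        req, lim = c.get("requests", {}), c.get("limits", {})
        if req.get("cpu") is None or req.get("memory") is None:
            return False
        if lim.get("cpu") is None or lim.get("memory") is None:
            return False
        if millicpu(req["cpu"]) != millicpu(lim["cpu"]) or req["memory"] != lim["memory"]:
            return False
    return True


def gets_exclusive_cpus(containers: list) -> bool:
    """True if the Pod qualifies for exclusive CPUs under the rule described above."""
    if not is_guaranteed(containers):
        return False
    # Every container must request a whole number of CPUs.
    return all(millicpu(c["requests"]["cpu"]) % 1000 == 0 for c in containers)
```

For example, a container requesting and limited to `cpu: "2"` qualifies, while one requesting `cpu: "1500m"` does not.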

This simple operational model served the user base pretty well, but as the CPU Manager matured more and more, users started to look at more elaborate use cases and how to better support them.

Rather than add more policies, the community realized that pretty much all the novel use cases are some variation of the behavior enabled by the static CPU Manager policy. Hence, it was decided to add options to tune the behavior of the static policy. The options have a varying degree of maturity, like any other Kubernetes feature. To be accepted, each new option must provide backward compatible behavior when disabled, and must document how it interacts with the other options, should they interact at all.

This enabled the Kubernetes project to graduate the CPU Manager core component and core CPU allocation algorithms to GA, while also enabling a new age of experimentation in this area. In Kubernetes v1.26, the CPU Manager supports three different policy options:

full-pcpus-only
restrict the CPU Manager core allocation algorithm to full physical cores only, reducing noisy neighbor issues from hardware technologies that allow sharing cores.
distribute-cpus-across-numa
drive the CPU Manager to evenly distribute CPUs across NUMA nodes, for cases where more than one NUMA node is required to satisfy the allocation.
align-by-socket
change how the CPU Manager allocates CPUs to a container: consider CPUs to be aligned at the socket boundary, instead of NUMA node boundary.

Further development

After graduating the main CPU Manager feature, each existing policy option will follow its own graduation process, independent from the CPU Manager and from each other option. There is room for new options to be added, but there's also a growing demand for even more flexibility than what the CPU Manager, and its policy options, currently grant.

Conversations are in progress in the community about splitting the CPU Manager and the other resource managers currently part of the kubelet executable into pluggable, independent kubelet plugins. If you are interested in this effort, please join the conversation on SIG Node communication channels (Slack, mailing list, weekly meeting).

Further reading

Please check out the Control CPU Management Policies on the Node task page to learn more about the CPU Manager, and how it fits in relation to the other node-level resource managers.

Getting involved

This feature is driven by the SIG Node community. Please join us to connect with the community and share your ideas and feedback around the above feature and beyond. We look forward to hearing from you!

27 Dec 2022 12:00am GMT

26 Dec 2022

Blog: Kubernetes 1.26: Pod Scheduling Readiness

Authors: Wei Huang (Apple), Abdullah Gharaibeh (Google)

Kubernetes 1.26 introduced a new Pod feature: scheduling gates. In Kubernetes, scheduling gates are keys that tell the scheduler when a Pod is ready to be considered for scheduling.

What problem does it solve?

When a Pod is created, the scheduler will continuously attempt to find a node that fits it. This infinite loop continues until the scheduler either finds a node for the Pod, or the Pod gets deleted.

Pods that remain unschedulable for long periods of time (e.g., ones that are blocked on some external event) waste scheduling cycles. A scheduling cycle may take ≅20ms or more depending on the complexity of the Pod's scheduling constraints. Therefore, at scale, those wasted cycles significantly impact the scheduler's performance. See the arrows in the "scheduler" box below.

graph LR;
  pod((New Pod))-->queue
  subgraph Scheduler
    queue(scheduler queue)
    sched_cycle[/scheduling cycle/]
    schedulable{schedulable?}
    queue==>|Pop out|sched_cycle
    sched_cycle==>schedulable
    schedulable==>|No|queue
    subgraph note [Cycles wasted on keep rescheduling 'unready' Pods]
    end
  end
  classDef plain fill:#ddd,stroke:#fff,stroke-width:1px,color:#000;
  classDef k8s fill:#326ce5,stroke:#fff,stroke-width:1px,color:#fff;
  classDef Scheduler fill:#fff,stroke:#bbb,stroke-width:2px,color:#326ce5;
  classDef note fill:#edf2ae,stroke:#fff,stroke-width:1px;
  class queue,sched_cycle,schedulable k8s;
  class pod plain;
  class note note;
  class Scheduler Scheduler;

Scheduling gates help address this problem. They let you declare that newly created Pods are not ready for scheduling. When scheduling gates are present on a Pod, the scheduler ignores the Pod and therefore saves unnecessary scheduling attempts. Those Pods will also be ignored by Cluster Autoscaler if you have it installed in the cluster.

Clearing the gates is the responsibility of external controllers with knowledge of when the Pod should be considered for scheduling (e.g., a quota manager).

graph LR;
  pod((New Pod))-->queue
  subgraph Scheduler
    queue(scheduler queue)
    sched_cycle[/scheduling cycle/]
    schedulable{schedulable?}
    popout{Pop out?}
    queue==>|PreEnqueue check|popout
    popout-->|Yes|sched_cycle
    popout==>|No|queue
    sched_cycle-->schedulable
    schedulable-->|No|queue
    subgraph note [A knob to gate Pod's scheduling]
    end
  end
  classDef plain fill:#ddd,stroke:#fff,stroke-width:1px,color:#000;
  classDef k8s fill:#326ce5,stroke:#fff,stroke-width:1px,color:#fff;
  classDef Scheduler fill:#fff,stroke:#bbb,stroke-width:2px,color:#326ce5;
  classDef note fill:#edf2ae,stroke:#fff,stroke-width:1px;
  classDef popout fill:#f96,stroke:#fff,stroke-width:1px;
  class queue,sched_cycle,schedulable k8s;
  class pod plain;
  class note note;
  class popout popout;
  class Scheduler Scheduler;

How does it work?

Scheduling gates in general work very similarly to finalizers. Pods with a non-empty spec.schedulingGates field will show as status SchedulingGated and be blocked from scheduling. Note that more than one gate can be added, but they all should be added upon Pod creation (e.g., you can add them as part of the spec or via a mutating webhook).

NAME READY STATUS RESTARTS AGE
test-pod 0/1 SchedulingGated 0 10s

To clear the gates, you update the Pod by removing all of the items from the Pod's schedulingGates field. The gates do not need to be removed all at once, but the scheduler starts to consider the Pod for scheduling only once all of the gates have been removed.
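
The gating rule can be sketched with Pods modeled as plain dictionaries. This is an illustrative model only, not scheduler code: `is_gated` and `remove_gate` are hypothetical helpers, and the gate names in the usage example are made up.

```python
def is_gated(pod: dict) -> bool:
    """A Pod is held back while spec.schedulingGates is non-empty."""
    return bool(pod.get("spec", {}).get("schedulingGates"))


def remove_gate(pod: dict, name: str) -> None:
    """Drop one gate by name; other gates (if any) keep the Pod gated."""
    gates = pod.get("spec", {}).get("schedulingGates", [])
    pod.setdefault("spec", {})["schedulingGates"] = [
        g for g in gates if g["name"] != name
    ]


if __name__ == "__main__":
    pod = {"spec": {"schedulingGates": [{"name": "example.com/foo"},
                                        {"name": "example.com/bar"}]}}
    remove_gate(pod, "example.com/foo")
    print(is_gated(pod))  # True: one gate is still present
    remove_gate(pod, "example.com/bar")
    print(is_gated(pod))  # False: the scheduler may now consider the Pod
```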

Under the hood, scheduling gates are implemented as a PreEnqueue scheduler plugin, a new scheduler framework extension point that is invoked at the beginning of each scheduling cycle.

Use Cases

An important use case this feature enables is dynamic quota management. Kubernetes supports ResourceQuota; however, the API Server enforces quota at the time you attempt Pod creation. For example, if a new Pod exceeds the CPU quota, it gets rejected. The API Server doesn't queue the Pod; therefore, whoever created the Pod needs to keep retrying its creation. This means either a delay between resources becoming available and the Pod actually running, or load on the API server and Scheduler due to constant attempts.

Scheduling gates allow an external quota manager to address the above limitation of ResourceQuota. Specifically, the manager could add an example.com/quota-check scheduling gate to all Pods created in the cluster (using a mutating webhook). The manager would then remove the gate when there is quota to start the Pod.
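
A toy version of such a quota manager might look like the sketch below. Everything here is hypothetical and in-memory: the `QuotaManager` class, its method names, and the `cpu_milli` field are assumptions for illustration, and a real manager would act through a mutating webhook and Pod updates against the API server.

```python
GATE = "example.com/quota-check"


class QuotaManager:
    """Illustrative CPU quota manager that gates Pods until quota is free."""

    def __init__(self, cpu_quota_milli: int) -> None:
        self.free = cpu_quota_milli
        self.waiting = []

    def admit(self, pod: dict) -> None:
        # Stands in for the mutating webhook: gate every new Pod.
        pod.setdefault("schedulingGates", []).append({"name": GATE})
        self.waiting.append(pod)

    def release(self, cpu_milli: int) -> None:
        # Called when a running Pod finishes and frees its share of quota.
        self.free += cpu_milli

    def reconcile(self) -> None:
        # Remove the gate from every waiting Pod that now fits in the quota.
        still_waiting = []
        for pod in self.waiting:
            if pod["cpu_milli"] <= self.free:
                self.free -= pod["cpu_milli"]
                pod["schedulingGates"] = [
                    g for g in pod["schedulingGates"] if g["name"] != GATE
                ]
            else:
                still_waiting.append(pod)
        self.waiting = still_waiting
```

In this model the Pod is created once and simply waits, so the creator does not have to retry against the API server.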

What's next?

To use this feature, the PodSchedulingReadiness feature gate must be enabled in the API Server and scheduler. You're more than welcome to test it out and tell us (SIG Scheduling) what you think!

Additional resources

26 Dec 2022 12:00am GMT

23 Dec 2022

Blog: Kubernetes 1.26: Support for Passing Pod fsGroup to CSI Drivers At Mount Time

Authors: Fabio Bertinatto (Red Hat), Hemant Kumar (Red Hat)

Delegation of fsGroup to CSI drivers was first introduced as alpha in Kubernetes 1.22, and graduated to beta in Kubernetes 1.25. For Kubernetes 1.26, we are happy to announce that this feature has graduated to General Availability (GA).

In this release, if you specify an fsGroup in the security context for a (Linux) Pod, all processes in the Pod's containers are part of the additional group that you specified.

In previous Kubernetes releases, the kubelet would always apply the fsGroup ownership and permission changes to files in the volume according to the policy you specified in the Pod's .spec.securityContext.fsGroupChangePolicy field.

Starting with Kubernetes 1.26, CSI drivers have the option to apply the fsGroup settings during volume mount time, which frees the kubelet from changing the permissions of files and directories in those volumes.

How does it work?

CSI drivers that support this feature should advertise the VOLUME_MOUNT_GROUP node capability.

After recognizing this information, the kubelet passes the fsGroup information to the CSI driver during pod startup. This is done through the NodeStageVolumeRequest and NodePublishVolumeRequest CSI calls.

Consequently, the CSI driver is expected to apply the fsGroup to the files in the volume using a mount option. As an example, Azure File CSIDriver utilizes the gid mount option to map the fsGroup information to all the files in the volume.

Note that in the example above, the kubelet refrains from directly applying the permission changes to the files and directories in that volume. Additionally, two policy definitions no longer have an effect: neither .spec.fsGroupPolicy for the CSIDriver object, nor .spec.securityContext.fsGroupChangePolicy for the Pod.
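
The division of labor between the kubelet and the CSI driver can be summarized in a short decision sketch. This is a simplification for illustration, not kubelet code: `plan_fsgroup` and the keys of the returned dictionary are hypothetical, and the non-delegated branch ignores the details of fsGroupPolicy and fsGroupChangePolicy.

```python
from typing import Optional, Set


def plan_fsgroup(driver_capabilities: Set[str], fs_group: Optional[int]) -> dict:
    """Decide who applies fsGroup for a volume (illustrative only)."""
    if fs_group is None:
        # No fsGroup in the Pod's security context: nothing to apply.
        return {"pass_to_driver": False, "kubelet_changes_ownership": False}
    if "VOLUME_MOUNT_GROUP" in driver_capabilities:
        # The driver advertised the capability, so the kubelet hands over the
        # fsGroup in the node stage/publish calls and leaves the files alone.
        return {"pass_to_driver": True, "kubelet_changes_ownership": False}
    # Otherwise the kubelet keeps applying ownership and permission changes,
    # subject to the policies on the CSIDriver object and the Pod.
    return {"pass_to_driver": False, "kubelet_changes_ownership": True}
```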

For more details about the inner workings of this feature, check out the enhancement proposal and the CSI Driver fsGroup Support in the CSI developer documentation.

Why is it important?

Without this feature, applying the fsGroup information to files is not possible in certain storage environments.

For instance, Azure File does not support a concept of POSIX-style ownership and permissions of files. The CSI driver is only able to set the file permissions at the volume level.

How do I use it?

This feature should be mostly transparent to users. If you maintain a CSI driver that should support this feature, read CSI Driver fsGroup Support for more information on how to support this feature in your CSI driver.

Existing CSI drivers that do not support this feature will continue to work as usual: they will not receive any fsGroup information from the kubelet. In addition to that, the kubelet will continue to perform the ownership and permissions changes to files for those volumes, according to the policies specified in .spec.fsGroupPolicy for the CSIDriver and .spec.securityContext.fsGroupChangePolicy for the relevant Pod.
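The division of labor described above can be summarized in a small Python sketch. It is only an illustration of the decision, not kubelet code: the function name plan_fsgroup and the MountPlan type are hypothetical, and the kubelet fallback branch still depends on the fsGroupPolicy and fsGroupChangePolicy settings, which are not modeled here.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class MountPlan:
    applied_by: str                         # "csi-driver", "kubelet", or "nobody"
    fs_group_sent_to_driver: Optional[int]  # value passed in the CSI calls, if any


def plan_fsgroup(node_capabilities: Set[str], fs_group: Optional[int]) -> MountPlan:
    """Decide who applies the fsGroup for one volume (illustrative only)."""
    if fs_group is None:
        # The Pod did not request an fsGroup, so there is nothing to apply.
        return MountPlan("nobody", None)
    if "VOLUME_MOUNT_GROUP" in node_capabilities:
        # The driver advertised the capability: the kubelet hands the fsGroup
        # to the driver and does not touch file ownership itself.
        return MountPlan("csi-driver", fs_group)
    # Older drivers receive no fsGroup; the kubelet applies the change itself,
    # subject to the policies mentioned in the text (not modeled here).
    return MountPlan("kubelet", None)


print(plan_fsgroup({"VOLUME_MOUNT_GROUP"}, 2000))
print(plan_fsgroup({"STAGE_UNSTAGE_VOLUME"}, 2000))
```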

23 Dec 2022 12:00am GMT

22 Dec 2022

Blog: Kubernetes v1.26: GA Support for Kubelet Credential Providers

Authors: Andrew Sy Kim (Google), Dixita Narang (Google)

Kubernetes v1.26 introduced generally available (GA) support for kubelet credential provider plugins, offering an extensible plugin framework to dynamically fetch credentials for any container image registry.

Background

Kubernetes supports the ability to dynamically fetch credentials for a container registry service. Prior to Kubernetes v1.20, this capability was compiled into the kubelet and only available for Amazon Elastic Container Registry, Azure Container Registry, and Google Cloud Container Registry.

Figure 1: Kubelet built-in credential provider support for Amazon Elastic Container Registry, Azure Container Registry, and Google Cloud Container Registry.

Kubernetes v1.20 introduced alpha support for kubelet credential provider plugins, which provide a mechanism for the kubelet to dynamically authenticate and pull images for arbitrary container registries, whether these are public registries, managed services, or even a self-hosted registry. In Kubernetes v1.26, this feature is now GA.

Figure 2: Kubelet credential provider overview

Why is it important?

Prior to Kubernetes v1.20, if you wanted to dynamically fetch credentials for image registries other than ACR (Azure Container Registry), ECR (Elastic Container Registry), or GCR (Google Container Registry), you needed to modify the kubelet code. The new plugin mechanism can be used in any cluster, and lets you authenticate to new registries without any changes to Kubernetes itself. Any cloud provider or vendor can publish a plugin that lets you authenticate with their image registry.

How it works

The kubelet and the exec plugin binary communicate through stdio (stdin, stdout, and stderr) by sending and receiving JSON-serialized, API-versioned types. If the exec plugin is enabled and the kubelet requires authentication information for an image that matches against a plugin, the kubelet executes the plugin binary, passing a CredentialProviderRequest via stdin. The exec plugin then communicates with the container registry to dynamically fetch the credentials and returns them to the kubelet via stdout, encoded as a CredentialProviderResponse.
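To make the stdin/stdout exchange concrete, here is a minimal sketch of an exec plugin written in Python. It is an illustration under assumptions, not a reference implementation: fetch_token is a hypothetical stand-in for a real call to the registry, and the placement of the response fields follows the credentialprovider.kubelet.k8s.io/v1 API reference as we understand it, so verify it against that reference before relying on it.

```python
import json
import sys


def fetch_token(image: str) -> dict:
    """Hypothetical stand-in for a real token exchange with the registry."""
    return {"username": "exampleuser", "password": "token12345"}


def handle_request(raw_request: str) -> str:
    """Turn one serialized CredentialProviderRequest into a serialized response."""
    request = json.loads(raw_request)
    if request.get("kind") != "CredentialProviderRequest":
        raise ValueError("unexpected kind: %r" % request.get("kind"))
    image = request["image"]
    registry_host = image.split("/", 1)[0]
    response = {
        "apiVersion": request["apiVersion"],  # answer in the version that was asked for
        "kind": "CredentialProviderResponse",
        "cacheKeyType": "Registry",
        "cacheDuration": "6h",
        "auth": {registry_host: fetch_token(image)},
    }
    return json.dumps(response)


def main() -> None:
    # The kubelet writes the request to stdin and reads the response from stdout.
    sys.stdout.write(handle_request(sys.stdin.read()))


# A real plugin binary would end by calling main(); here the handler is
# exercised with an in-memory request instead.
example = json.dumps({
    "apiVersion": "credentialprovider.kubelet.k8s.io/v1",
    "kind": "CredentialProviderRequest",
    "image": "private-registry.io/my-app:1.0",
})
print(handle_request(example))
```

Keeping the request handling in a pure function makes the plugin easy to test without spawning a process.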

Figure 3: Kubelet credential provider plugin flow

When returning credentials to the kubelet, the plugin can also indicate how long the credentials can be cached, to prevent unnecessary execution of the plugin by the kubelet for subsequent image pull requests to the same registry. In cases where the cache duration is not specified by the plugin, a default cache duration can be specified by the kubelet (more details below).

{
  "apiVersion": "credentialprovider.kubelet.k8s.io/v1",
  "kind": "CredentialProviderResponse",
  "cacheDuration": "6h",
  "auth": {
    "private-registry.io/my-app": {
      "username": "exampleuser",
      "password": "token12345"
    }
  }
}

In addition, the plugin can specify the scope in which cached credentials are valid. This is specified through the cacheKeyType field in CredentialProviderResponse. When the value is Image, the kubelet will only use cached credentials for future image pulls that exactly match the image of the first request. When the value is Registry, the kubelet will use cached credentials for any subsequent image pulls destined for the same registry host but using different paths (for example, gcr.io/foo/bar and gcr.io/bar/foo refer to different images from the same registry). Lastly, when the value is Global, the kubelet will use returned credentials for all images that match against the plugin, including images that can map to different registry hosts (for example, gcr.io vs k8s.gcr.io). Plugin implementations are required to set the cacheKeyType field.

{
  "apiVersion": "credentialprovider.kubelet.k8s.io/v1",
  "kind": "CredentialProviderResponse",
  "cacheKeyType": "Registry",
  "auth": {
    "private-registry.io/my-app": {
      "username": "exampleuser",
      "password": "token12345"
    }
  }
}
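The three cacheKeyType values can be read as three ways of deriving a cache key from an image reference. The Python sketch below illustrates that reading; the function name cache_key and the constant used for the Global case are hypothetical, and the cache is assumed to be kept per plugin.

```python
def cache_key(cache_key_type: str, image: str) -> str:
    """Derive the key under which returned credentials would be cached."""
    if cache_key_type == "Image":
        # Only a later pull of exactly the same image reuses the credentials.
        return image
    if cache_key_type == "Registry":
        # Any image on the same registry host reuses the credentials.
        return image.split("/", 1)[0]
    if cache_key_type == "Global":
        # Every image that matches the plugin reuses the credentials.
        return "global"  # hypothetical constant
    raise ValueError("unknown cacheKeyType: %r" % cache_key_type)


print(cache_key("Image", "gcr.io/foo/bar"))     # gcr.io/foo/bar
print(cache_key("Registry", "gcr.io/foo/bar"))  # gcr.io
print(cache_key("Global", "k8s.gcr.io/pause"))  # global
```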

Using kubelet credential providers

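You can configure credential providers by installing the exec plugin(s) into a local directory accessible by the kubelet on every node. Then you set two command line arguments for the kubelet:

--image-credential-provider-config: the path to the credential provider plugin configuration file.
--image-credential-provider-bin-dir: the path to the directory where the credential provider plugin binaries are located.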

The configuration file passed into --image-credential-provider-config is read by the kubelet to determine which exec plugins should be invoked for a container image used by a Pod. Note that the name of each provider must match the name of the binary located in the local directory specified in --image-credential-provider-bin-dir, otherwise the kubelet cannot locate the path of the plugin to invoke.

kind: CredentialProviderConfig
apiVersion: kubelet.config.k8s.io/v1
providers:
- name: auth-provider-gcp
  apiVersion: credentialprovider.kubelet.k8s.io/v1
  matchImages:
  - "container.cloud.google.com"
  - "gcr.io"
  - "*.gcr.io"
  - "*.pkg.dev"
  args:
  - get-credentials
  - --v=3
  defaultCacheDuration: 1m
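How the kubelet decides that an image "matches against a plugin" is not spelled out in this post. The sketch below is a simplified approximation based on the kubelet documentation, under these assumptions: only the registry host is compared (ports and paths are ignored), and each glob matches exactly one dot-separated segment of the host. The helper names are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Dict, List


def host_matches(pattern: str, host: str) -> bool:
    """Compare a matchImages pattern with a registry host, segment by segment."""
    pattern_labels = pattern.split(".")
    host_labels = host.split(".")
    if len(pattern_labels) != len(host_labels):
        # A glob covers a single segment, so "*.gcr.io" does not match "gcr.io".
        return False
    return all(fnmatchcase(h, p) for p, h in zip(pattern_labels, host_labels))


def providers_for_image(image: str, providers: Dict[str, List[str]]) -> List[str]:
    """Return the names of the providers whose patterns match the image's host."""
    host = image.split("/", 1)[0]
    return [
        name
        for name, patterns in providers.items()
        if any(host_matches(p, host) for p in patterns)
    ]


config = {
    "auth-provider-gcp": [
        "container.cloud.google.com", "gcr.io", "*.gcr.io", "*.pkg.dev",
    ],
}
print(providers_for_image("us.gcr.io/project/app:v1", config))
print(providers_for_image("docker.io/library/nginx", config))
```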

Below is an overview of how the Kubernetes project is using kubelet credential providers for end-to-end testing.

Figure 4: Kubelet credential provider configuration used for Kubernetes e2e testing

For more configuration details, see Kubelet Credential Providers.

Getting Involved

Come join SIG Node if you want to report bugs or have feature requests for the Kubelet Credential Provider. You can reach us through the following ways:

22 Dec 2022 12:00am GMT

20 Dec 2022

Blog: Kubernetes 1.26: Introducing Validating Admission Policies

Authors: Joe Betz (Google), Cici Huang (Google)

In Kubernetes 1.26, the first alpha release of validating admission policies is available!

Validating admission policies use the Common Expression Language (CEL) to offer a declarative, in-process alternative to validating admission webhooks.

CEL was first introduced to Kubernetes for the Validation rules for CustomResourceDefinitions. This enhancement expands the use of CEL in Kubernetes to support a far wider range of admission use cases.

Admission webhooks can be burdensome to develop and operate. Webhook developers must implement and maintain a webhook binary to handle admission requests. Also, admission webhooks are complex to operate. Each webhook must be deployed, monitored and have a well defined upgrade and rollback plan. To make matters worse, if a webhook times out or becomes unavailable, the Kubernetes control plane can become unavailable. This enhancement avoids much of this complexity of admission webhooks by embedding CEL expressions into Kubernetes resources instead of calling out to a remote webhook binary.

For example, to set a limit on how many replicas a Deployment can have, start by defining a validation policy:

apiVersion: admissionregistration.k8s.io/v1alpha1
kind: ValidatingAdmissionPolicy
metadata:
  name: "demo-policy.example.com"
spec:
  matchConstraints:
    resourceRules:
    - apiGroups: ["apps"]
      apiVersions: ["v1"]
      operations: ["CREATE", "UPDATE"]
      resources: ["deployments"]
  validations:
  - expression: "object.spec.replicas <= 5"

The expression field contains the CEL expression that is used to validate admission requests. matchConstraints declares what types of requests this ValidatingAdmissionPolicy may validate.

Next, bind the policy to the appropriate resources:

apiVersion: admissionregistration.k8s.io/v1alpha1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: "demo-binding-test.example.com"
spec:
  policyName: "demo-policy.example.com"
  matchResources:
    namespaceSelector:
      matchExpressions:
      - key: environment
        operator: In
        values:
        - test

This ValidatingAdmissionPolicyBinding resource binds the above policy only to namespaces where the environment label is set to test. Once this binding is created, the kube-apiserver will begin enforcing this admission policy.
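The combined effect of the policy and its binding can be approximated in a few lines of Python. This is only a mental model: in a real cluster the check is a CEL expression evaluated inside the kube-apiserver, and the function names below are hypothetical.

```python
def namespace_selected(ns_labels: dict, key: str, values: list) -> bool:
    """Mimic a matchExpressions entry that uses the In operator."""
    return ns_labels.get(key) in values


def admit(deployment: dict, ns_labels: dict) -> bool:
    """Return True when the request would be admitted."""
    if not namespace_selected(ns_labels, "environment", ["test"]):
        # The binding does not select this namespace, so the policy is not enforced.
        return True
    # Equivalent of the CEL expression: object.spec.replicas <= 5
    return deployment["spec"]["replicas"] <= 5


print(admit({"spec": {"replicas": 3}}, {"environment": "test"}))
print(admit({"spec": {"replicas": 6}}, {"environment": "test"}))
print(admit({"spec": {"replicas": 6}}, {"environment": "production"}))
```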

To emphasize how much simpler this approach is than admission webhooks, if this example were instead implemented with a webhook, an entire binary would need to be developed and maintained just to perform a <= check. In our review of a wide range of admission webhooks used in production, the vast majority performed relatively simple checks, all of which can easily be expressed using CEL.

Validating admission policies are highly configurable, enabling policy authors to define policies that can be parameterized and scoped to resources as needed by cluster administrators.

For example, the above admission policy can be modified to make it configurable:

apiVersion: admissionregistration.k8s.io/v1alpha1
kind: ValidatingAdmissionPolicy
metadata:
  name: "demo-policy.example.com"
spec:
  paramKind:
    apiVersion: rules.example.com/v1 # You also need a CustomResourceDefinition for this API
    kind: ReplicaLimit
  matchConstraints:
    resourceRules:
    - apiGroups: ["apps"]
      apiVersions: ["v1"]
      operations: ["CREATE", "UPDATE"]
      resources: ["deployments"]
  validations:
  - expression: "object.spec.replicas <= params.maxReplicas"

Here, paramKind defines the resources used to configure the policy and the expression uses the params variable to access the parameter resource.

This allows multiple bindings to be defined, each configured differently. For example:

apiVersion: admissionregistration.k8s.io/v1alpha1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: "demo-binding-production.example.com"
spec:
  policyName: "demo-policy.example.com"
  paramRef:
    name: "demo-params-production.example.com"
  matchResources:
    namespaceSelector:
      matchExpressions:
      - key: environment
        operator: In
        values:
        - production
---
apiVersion: rules.example.com/v1 # defined via a CustomResourceDefinition
kind: ReplicaLimit
metadata:
  name: "demo-params-production.example.com"
maxReplicas: 1000

This binding and parameter resource pair limit deployments in namespaces with the environment label set to production to a max of 1000 replicas.

You can then use a separate binding and parameter pair to set a different limit for namespaces in the test environment.
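As a rough illustration of one policy serving several environments, the Python sketch below pairs each binding with its own parameter object. The production limit of 1000 comes from the example above; the limit of 5 for test namespaces and the names BINDINGS and admit_with_params are assumptions made for the sake of the example.

```python
# Each entry pairs a namespace selector value with the parameter resource
# that the corresponding binding would reference through paramRef.
BINDINGS = [
    {"environment": "production", "params": {"maxReplicas": 1000}},
    {"environment": "test", "params": {"maxReplicas": 5}},  # assumed test limit
]


def admit_with_params(replicas: int, ns_labels: dict, bindings=BINDINGS) -> bool:
    """Evaluate object.spec.replicas <= params.maxReplicas for every matching binding."""
    for binding in bindings:
        if ns_labels.get("environment") != binding["environment"]:
            continue  # this binding does not select the namespace
        if not replicas <= binding["params"]["maxReplicas"]:
            return False
    return True


print(admit_with_params(200, {"environment": "production"}))
print(admit_with_params(200, {"environment": "test"}))
```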

We hope this has given you a glimpse of what is possible with validating admission policies! There are many features that we have not yet touched on.

To learn more, read Validating Admission Policy.

We are working hard to add more features to admission policies and make the enhancement easier to use. Try it out, send us your feedback and help us build a simpler alternative to admission webhooks!

How do I get involved?

If you want to get involved in development of admission policies, discuss enhancement roadmaps, or report a bug, you can get in touch with developers at SIG API Machinery.

20 Dec 2022 12:00am GMT

19 Dec 2022

Blog: Kubernetes 1.26: Device Manager graduates to GA

Author: Swati Sehgal (Red Hat)

The Device Plugin framework was introduced in the Kubernetes v1.8 release as a vendor independent framework to enable discovery, advertisement and allocation of external devices without modifying core Kubernetes. The feature graduated to Beta in v1.10. With the recent release of Kubernetes v1.26, Device Manager is now generally available (GA).

Within the kubelet, the Device Manager facilitates communication with device plugins using gRPC through Unix sockets. The Device Manager and device plugins both act as gRPC servers and clients, serving and connecting to the exposed gRPC services respectively. Device plugins serve a gRPC service that the kubelet connects to for device discovery, advertisement (as extended resources) and allocation. Device plugins connect to the Registration gRPC service served by the kubelet to register themselves with the kubelet.

Please refer to the documentation for an example on how a pod can request a device exposed to the cluster by a device plugin.

Here are some example implementations of device plugins:

Noteworthy developments since Device Plugin framework introduction

Kubelet APIs moved to kubelet staging repo

External facing deviceplugin API packages moved from k8s.io/kubernetes/pkg/kubelet/apis/ to k8s.io/kubelet/pkg/apis/ in v1.17. Refer to Move external facing kubelet apis to staging for more details on the rationale behind this change.

Device Plugin API updates

Additional gRPC endpoints introduced:

  1. GetDevicePluginOptions is used by device plugins to communicate options to the DeviceManager in order to indicate if PreStartContainer, GetPreferredAllocation or other future optional calls are supported and can be called before making devices available to the container.
  2. GetPreferredAllocation allows a device plugin to forward its allocation preference to the DeviceManager so it can incorporate this information into its allocation decisions. The DeviceManager will call out to a plugin at pod admission time asking for a preferred device allocation of a given size from a list of available devices to make a more informed decision. For example, a plugin can specify inter-device constraints to indicate a preference for the best-connected set of devices when allocating devices to a container.
  3. PreStartContainer is called before each container start if indicated by device plugins during registration phase. It allows Device Plugins to run device specific operations on the Devices requested. E.g. reconfiguring or reprogramming FPGAs before the container starts running.

Pull Requests that introduced these changes are here:

  1. Invoke preStart RPC call before container start, if desired by plugin
  2. Add GetPreferredAllocation() call to the v1beta1 device plugin API

With the introduction of the above endpoints, the interaction between the Device Manager in the kubelet and a device plugin can be shown as below:

Representation of the Device Plugin framework showing the relationship between the kubelet and a device plugin

Device Plugin framework Overview

Change in semantics of device plugin registration process

Device plugin code was refactored into a separate 'plugin' package under the devicemanager package to lay the groundwork for introducing a v1beta2 device plugin API. This would allow adding support in devicemanager to service multiple device plugin APIs at the same time.

With this refactoring work, it is now mandatory for a device plugin to start serving its gRPC service before registering itself with the kubelet. Previously, these two operations were asynchronous, and a device plugin could register itself before starting its gRPC server; this is no longer the case. For more details, refer to PR #109016 and Issue #112395.
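The ordering requirement can be pictured with a toy model in Python. Nothing here is real gRPC: the Kubelet and DevicePlugin classes are hypothetical, and treating a too-early registration as a connection error is an assumption made only to show why the order matters.

```python
class Kubelet:
    """Toy stand-in for the Registration service served by the kubelet."""

    def __init__(self):
        self.registered = []

    def register(self, plugin) -> None:
        # Assumption: during registration the kubelet connects back to the
        # plugin's service, so the plugin must already be serving.
        if not plugin.serving:
            raise ConnectionError(
                "%s: gRPC service is not serving yet" % plugin.resource_name
            )
        self.registered.append(plugin.resource_name)


class DevicePlugin:
    """Toy stand-in for a device plugin."""

    def __init__(self, resource_name: str):
        self.resource_name = resource_name
        self.serving = False

    def start_server(self) -> None:
        self.serving = True

    def start(self, kubelet: Kubelet) -> None:
        # Required order: serve first, register second.
        self.start_server()
        kubelet.register(self)


kubelet = Kubelet()
DevicePlugin("example.com/fpga").start(kubelet)
print(kubelet.registered)
```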

Dynamic resource allocation

In Kubernetes 1.26, inspired by how Persistent Volumes are handled in Kubernetes, Dynamic Resource Allocation has been introduced to cater to devices that have more sophisticated resource requirements like:

  1. Decouple device initialization and allocation from the pod lifecycle.
  2. Facilitate dynamic sharing of devices between containers and pods.
  3. Support custom resource-specific parameters
  4. Enable resource-specific setup and cleanup actions
  5. Enable support for Network-attached resources, not just node-local resources

Is the Device Plugin API stable now?

No, the Device Plugin API is still not stable; the latest Device Plugin API version available is v1beta1. There are plans in the community to introduce v1beta2 API to service multiple plugin APIs at once. A per-API call with request/response types would allow adding support for newer API versions without explicitly bumping the API.

In addition to that, there are existing proposals in the community to introduce additional endpoints, such as KEP-3162: Add Deallocate and PostStopContainer to Device Manager API.

19 Dec 2022 12:00am GMT

16 Dec 2022

Blog: Kubernetes 1.26: Non-Graceful Node Shutdown Moves to Beta

Authors: Xing Yang (VMware), Ashutosh Kumar (VMware)

Kubernetes v1.24 introduced an alpha quality implementation of improvements for handling a non-graceful node shutdown. In Kubernetes v1.26, this feature moves to beta. This feature allows stateful workloads to fail over to a different node after the original node is shut down or in a non-recoverable state, such as a hardware failure or a broken OS.

What is a node shutdown in Kubernetes?

In a Kubernetes cluster, it is possible for a node to shut down. This could happen either in a planned way or it could happen unexpectedly. You may plan for a security patch, or a kernel upgrade and need to reboot the node, or it may shut down due to preemption of VM instances. A node may also shut down due to a hardware failure or a software problem.

To trigger a node shutdown, you could run a shutdown or poweroff command in a shell, or physically press a button to power off a machine.

A node shutdown could lead to workload failure if the node is not drained before the shutdown.

The following sections describe what a graceful node shutdown is and what a non-graceful node shutdown is.

What is a graceful node shutdown?

The kubelet's handling for a graceful node shutdown allows the kubelet to detect a node shutdown event, properly terminate the pods on that node, and release resources before the actual shutdown. Critical pods are terminated after all the regular pods are terminated, to ensure that the essential functions of an application can continue to work as long as possible.

What is a non-graceful node shutdown?

A node shutdown can be graceful only if the kubelet's node shutdown manager can detect the upcoming node shutdown action. However, there are cases where a kubelet does not detect a node shutdown action. This could happen because the shutdown command does not trigger the Inhibitor Locks mechanism used by the kubelet on Linux, or because of a user error, for example if the shutdownGracePeriod and shutdownGracePeriodCriticalPods details are not configured correctly for that node.

When a node is shut down (or crashes), and that shutdown was not detected by the kubelet node shutdown manager, it becomes a non-graceful node shutdown. Non-graceful node shutdown is a problem for stateful apps. If a node containing a pod that is part of a StatefulSet is shut down in a non-graceful way, the Pod will be stuck in Terminating status indefinitely, and the control plane cannot create a replacement Pod for that StatefulSet on a healthy node. You can delete the failed Pods manually, but this is not ideal for a self-healing cluster. Similarly, Pods that ReplicaSets created as part of a Deployment, and that were bound to the now-shutdown node, stay in Terminating status indefinitely. If you have set a horizontal scaling limit, even those terminating Pods count against the limit, so your workload may struggle to self-heal if it was already at maximum scale. (By the way: if the node that had done a non-graceful shutdown comes back up, the kubelet does delete the old Pod, and the control plane can make a replacement.)

What's new for the beta?

For Kubernetes v1.26, the non-graceful node shutdown feature is beta and enabled by default. The NodeOutOfServiceVolumeDetach feature gate is enabled by default on kube-controller-manager instead of being opt-in; you can still disable it if needed (please also file an issue to explain the problem).

On the instrumentation side, the kube-controller-manager reports two new metrics.

force_delete_pods_total
number of pods that are being forcibly deleted (resets on Pod garbage collection controller restart)
force_delete_pod_errors_total
number of errors encountered when attempting forcible Pod deletion (also resets on Pod garbage collection controller restart)

How does it work?

In the case of a node shutdown, if a graceful shutdown is not working or the node is in a non-recoverable state due to a hardware failure or a broken OS, you can manually add an out-of-service taint on the Node. For example, this can be node.kubernetes.io/out-of-service=nodeshutdown:NoExecute or node.kubernetes.io/out-of-service=nodeshutdown:NoSchedule. This taint triggers the forceful deletion of pods on the node that have no matching tolerations. Persistent volumes attached to the shutdown node will be detached, and new pods will be created successfully on a different running node.

kubectl taint nodes <node-name> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute
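To show which pods the taint affects, here is a small Python sketch of the "no matching tolerations" rule. It follows the general taint and toleration matching rules (an empty effect tolerates every effect, and the Exists operator ignores the value); the helper names and the sample pods are hypothetical, and the real logic lives in the control plane.

```python
OUT_OF_SERVICE = {
    "key": "node.kubernetes.io/out-of-service",
    "value": "nodeshutdown",
    "effect": "NoExecute",
}


def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True when a single toleration matches the taint."""
    effect = toleration.get("effect")
    if effect and effect != taint["effect"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        # With Exists, an empty key tolerates every key and the value is ignored.
        return toleration.get("key") in (None, "", taint["key"])
    return (
        toleration.get("key") == taint["key"]
        and toleration.get("value") == taint["value"]
    )


def pods_to_force_delete(pods: list, taint: dict = OUT_OF_SERVICE) -> list:
    """Names of pods on the tainted node that have no matching toleration."""
    return [
        pod["name"]
        for pod in pods
        if not any(tolerates(t, taint) for t in pod.get("tolerations", []))
    ]


pods = [
    {"name": "db-0"},
    {"name": "agent", "tolerations": [
        {"key": "node.kubernetes.io/out-of-service", "operator": "Exists"},
    ]},
]
print(pods_to_force_delete(pods))
```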

Note: Before applying the out-of-service taint, you must verify that a node is already in shutdown or power-off state (not in the middle of restarting), either because the user intentionally shut it down or the node is down due to hardware failures, OS issues, etc.

Once all the workload pods that are linked to the out-of-service node have moved to a new running node, and the shutdown node has been recovered, you should remove that taint from the affected node.

What's next?

Depending on feedback and adoption, the Kubernetes team plans to push the Non-Graceful Node Shutdown implementation to GA in either 1.27 or 1.28.

This feature requires a user to manually add a taint to the node to trigger the failover of workloads and remove the taint after the node is recovered.

The cluster operator can automate this process by automatically applying the out-of-service taint if there is a programmatic way to determine that the node is really shut down and there isn't IO between the node and storage. The cluster operator can then automatically remove the taint after the workload has failed over successfully to another running node and the shutdown node has been recovered.

In the future, we plan to find ways to automatically detect and fence nodes that are shut down or in a non-recoverable state and fail their workloads over to another node.

How can I learn more?

To learn more, read Non Graceful node shutdown in the Kubernetes documentation.

How to get involved?

We offer a huge thank you to all the contributors who helped with design, implementation, and review of this feature:

There are many people who have helped review the design and implementation along the way. We want to thank everyone who has contributed to this effort, including the roughly 30 people who have reviewed the KEP and implementation over the last couple of years.

This feature is a collaboration between SIG Storage and SIG Node. For those interested in getting involved with the design and development of any part of the Kubernetes Storage system, join the Kubernetes Storage Special Interest Group (SIG). For those interested in getting involved with the design and development of the components that support the controlled interactions between pods and host resources, join the Kubernetes Node SIG.

16 Dec 2022 6:00pm GMT

15 Dec 2022

Blog: Kubernetes 1.26: Alpha API For Dynamic Resource Allocation

Authors: Patrick Ohly (Intel), Kevin Klues (NVIDIA)

Dynamic resource allocation is a new API for requesting resources. It is a generalization of the persistent volumes API for generic resources, making it possible to:

Third-party resource drivers are responsible for interpreting these parameters as well as tracking and allocating resources as requests come in.

Dynamic resource allocation is an alpha feature and only enabled when the DynamicResourceAllocation feature gate and the resource.k8s.io/v1alpha1 API group are enabled. For details, see the --feature-gates and --runtime-config kube-apiserver parameters. The kube-scheduler, kube-controller-manager and kubelet components all need the feature gate enabled as well.

The default configuration of kube-scheduler enables the DynamicResources plugin if and only if the feature gate is enabled. Custom configurations may have to be modified to include it.

Once dynamic resource allocation is enabled, resource drivers can be installed to manage certain kinds of hardware. Kubernetes has a test driver that is used for end-to-end testing, but also can be run manually. See below for step-by-step instructions.

API

The new resource.k8s.io/v1alpha1 API group provides four new types:

ResourceClass
Defines which resource driver handles a certain kind of resource and provides common parameters for it. ResourceClasses are created by a cluster administrator when installing a resource driver.
ResourceClaim
Defines a particular resource instance that is required by a workload. Created by a user (lifecycle managed manually, can be shared between different Pods) or for individual Pods by the control plane based on a ResourceClaimTemplate (automatic lifecycle, typically used by just one Pod).
ResourceClaimTemplate
Defines the spec and some metadata for creating ResourceClaims. Created by a user when deploying a workload.
PodScheduling
Used internally by the control plane and resource drivers to coordinate pod scheduling when ResourceClaims need to be allocated for a Pod.

Parameters for ResourceClass and ResourceClaim are stored in separate objects, typically using the type defined by a CRD that was created when installing a resource driver.

With this alpha feature enabled, the spec of Pod defines ResourceClaims that are needed for a Pod to run: this information goes into a new resourceClaims field. Entries in that list reference either a ResourceClaim or a ResourceClaimTemplate. When referencing a ResourceClaim, all Pods using this .spec (for example, inside a Deployment or StatefulSet) share the same ResourceClaim instance. When referencing a ResourceClaimTemplate, each Pod gets its own ResourceClaim instance.
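The difference between the two kinds of reference can be sketched in Python as follows. This is an illustration under assumptions: resourceClaimName is taken to be the counterpart of the resourceClaimTemplateName field, and the generated claim name (pod name plus entry name) is a hypothetical naming scheme chosen for the example.

```python
def claims_for_pods(pod_names: list, entry: dict) -> dict:
    """Map each Pod to the ResourceClaim it would end up using for one entry."""
    source = entry["source"]
    if "resourceClaimName" in source:
        # Direct reference: every Pod built from this spec shares one claim.
        return {pod: source["resourceClaimName"] for pod in pod_names}
    if "resourceClaimTemplateName" in source:
        # Template reference: each Pod gets its own generated claim.
        # The "<pod>-<entry name>" naming is a hypothetical scheme.
        return {pod: "%s-%s" % (pod, entry["name"]) for pod in pod_names}
    raise ValueError("entry references neither a claim nor a template")


replicas = ["web-0", "web-1"]
shared = {"name": "cat-0", "source": {"resourceClaimName": "shared-cat"}}
templated = {"name": "cat-0", "source": {"resourceClaimTemplateName": "large-black-cats"}}
print(claims_for_pods(replicas, shared))
print(claims_for_pods(replicas, templated))
```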

For a container defined within a Pod, the resources.claims list defines whether that container gets access to these resource instances, which makes it possible to share resources between one or more containers inside the same Pod. For example, an init container could set up the resource before the application uses it.

Here is an example of a fictional resource driver. Two ResourceClaim objects will get created for this Pod and each container gets access to one of them.

Assuming a resource driver called resource-driver.example.com was installed together with the following resource class:

apiVersion: resource.k8s.io/v1alpha1
kind: ResourceClass
metadata:
  name: resource.example.com
driverName: resource-driver.example.com

An end-user could then allocate two specific resources of type resource.example.com as follows:

---
apiVersion: cats.resource.example.com/v1
kind: ClaimParameters
metadata:
  name: large-black-cats
spec:
  color: black
  size: large
---
apiVersion: resource.k8s.io/v1alpha1
kind: ResourceClaimTemplate
metadata:
  name: large-black-cats
spec:
  spec:
    resourceClassName: resource.example.com
    parametersRef:
      apiGroup: cats.resource.example.com
      kind: ClaimParameters
      name: large-black-cats
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-cats
spec:
  containers: # two example containers; each container claims one cat resource
  - name: first-example
    image: ubuntu:22.04
    command: ["sleep", "9999"]
    resources:
      claims:
      - name: cat-0
  - name: second-example
    image: ubuntu:22.04
    command: ["sleep", "9999"]
    resources:
      claims:
      - name: cat-1
  resourceClaims:
  - name: cat-0
    source:
      resourceClaimTemplateName: large-black-cats
  - name: cat-1
    source:
      resourceClaimTemplateName: large-black-cats

Scheduling

In contrast to native resources (such as CPU or RAM) and extended resources (managed by a device plugin, advertised by kubelet), the scheduler has no knowledge of what dynamic resources are available in a cluster or how they could be split up to satisfy the requirements of a specific ResourceClaim. Resource drivers are responsible for that. Drivers mark ResourceClaims as allocated once resources for them are reserved. This also then tells the scheduler where in the cluster a claimed resource is actually available.

ResourceClaims can get resources allocated as soon as the ResourceClaim is created (immediate allocation), without considering which Pods will use the resource. The default (wait for first consumer) is to delay allocation until a Pod that relies on the ResourceClaim becomes eligible for scheduling. This design with two allocation options is similar to how Kubernetes handles storage provisioning with PersistentVolumes and PersistentVolumeClaims.

In the wait for first consumer mode, the scheduler checks all ResourceClaims needed by a Pod. If the Pod has any ResourceClaims, the scheduler creates a PodScheduling (a special object that requests scheduling details on behalf of the Pod). The PodScheduling has the same name and namespace as the Pod and has the Pod as its owner. Using its PodScheduling, the scheduler informs the resource drivers responsible for those ResourceClaims about nodes that the scheduler considers suitable for the Pod. The resource drivers respond by excluding nodes that don't have enough of the driver's resources left.

Once the scheduler has that resource information, it selects one node and stores that choice in the PodScheduling object. The resource drivers then allocate resources based on the relevant ResourceClaims so that the resources will be available on that selected node. Once that resource allocation is complete, the scheduler attempts to schedule the Pod to a suitable node. Scheduling can still fail at this point; for example, a different Pod could be scheduled to the same node in the meantime. If this happens, already allocated ResourceClaims may get deallocated to enable scheduling onto a different node.
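The back-and-forth between the scheduler and the resource drivers can be pictured with a toy simulation in Python. It is a sketch under stated assumptions, not the scheduler plugin: every claim is assumed to need exactly one unit from its driver, the free-capacity table and the function names are hypothetical, and picking the first remaining node stands in for the scheduler's real scoring.

```python
from collections import Counter
from typing import Dict, List, Optional


def unsuitable_nodes(driver: str, needed: int, potential: List[str],
                     free: Dict[str, Dict[str, int]]) -> set:
    """Driver side: report nodes that lack enough of this driver's resources."""
    return {node for node in potential if free[driver].get(node, 0) < needed}


def select_node(pod_claims: Dict[str, str], potential: List[str],
                free: Dict[str, Dict[str, int]]) -> Optional[str]:
    """Scheduler side: pick a node that all drivers accept, then allocate there."""
    needed = Counter(pod_claims.values())  # units required from each driver
    excluded = set()
    for driver, count in needed.items():
        excluded |= unsuitable_nodes(driver, count, potential, free)
    remaining = [node for node in potential if node not in excluded]
    if not remaining:
        return None  # the Pod stays pending; nothing is allocated
    selected = remaining[0]
    for driver, count in needed.items():
        free[driver][selected] -= count  # drivers allocate on the selected node
    return selected


free_units = {"resource-driver.example.com": {"node-a": 1, "node-b": 3}}
claims = {"cat-0": "resource-driver.example.com",
          "cat-1": "resource-driver.example.com"}
print(select_node(claims, ["node-a", "node-b"], free_units))
print(select_node(claims, ["node-a", "node-b"], free_units))
```

In this run the first Pod lands on node-b, the only node with two free units; the second Pod finds no suitable node and is left unscheduled rather than placed somewhere it could not run.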

As part of this process, ResourceClaims also get reserved for the Pod. Currently ResourceClaims can either be used exclusively by a single Pod or an unlimited number of Pods.

One key feature is that Pods do not get scheduled to a node unless all of their resources are allocated and reserved. This avoids the scenario where a Pod gets scheduled onto one node and then cannot run there, which is bad because such a pending Pod also blocks all other resources like RAM or CPU that were set aside for it.

Limitations

The scheduler plugin must be involved in scheduling Pods which use ResourceClaims. Bypassing the scheduler by setting the nodeName field leads to Pods that the kubelet refuses to start because the ResourceClaims are not reserved or not even allocated. It may be possible to remove this limitation in the future.

Writing a resource driver

A dynamic resource allocation driver typically consists of two separate-but-coordinating components: a centralized controller, and a DaemonSet of node-local kubelet plugins. Most of the work required by the centralized controller to coordinate with the scheduler can be handled by boilerplate code. Only the business logic required to actually allocate ResourceClaims against the ResourceClasses owned by the plugin needs to be customized. As such, Kubernetes provides the following package, including APIs for invoking this boilerplate code as well as a Driver interface that you can implement to provide your custom business logic:

Likewise, boilerplate code can be used to register the node-local plugin with the kubelet, as well as start a gRPC server to implement the kubelet plugin API. For drivers written in Go, the following package is recommended:

It is up to the driver developer to decide how these two components communicate. The KEP outlines an approach using CRDs.

Within SIG Node, we also plan to provide a complete example driver that can serve as a template for other drivers.

Running the test driver

The following steps bring up a local, one-node cluster directly from the Kubernetes source code. As a prerequisite, your cluster must have nodes with a container runtime that supports the Container Device Interface (CDI). For example, you can run CRI-O v1.23.2 or later. Once containerd v1.7.0 is released, we expect that you can run that or any later version. In the example below, we use CRI-O.

First, clone the Kubernetes source code. Inside that directory, run:

$ hack/install-etcd.sh
...

$ RUNTIME_CONFIG=resource.k8s.io/v1alpha1 \
 FEATURE_GATES=DynamicResourceAllocation=true \
 DNS_ADDON="coredns" \
 CGROUP_DRIVER=systemd \
 CONTAINER_RUNTIME_ENDPOINT=unix:///var/run/crio/crio.sock \
 LOG_LEVEL=6 \
 ENABLE_CSI_SNAPSHOTTER=false \
 API_SECURE_PORT=6444 \
 ALLOW_PRIVILEGED=1 \
 PATH=$(pwd)/third_party/etcd:$PATH \
 ./hack/local-up-cluster.sh -O
...
To start using your cluster, you can open up another terminal/tab and run:

 export KUBECONFIG=/var/run/kubernetes/admin.kubeconfig
...

Once the cluster is up, in another terminal run the test driver controller. KUBECONFIG must be set for all of the following commands.

$ go run ./test/e2e/dra/test-driver --feature-gates ContextualLogging=true -v=5 controller

In another terminal, run the kubelet plugin:

$ sudo mkdir -p /var/run/cdi && \
 sudo chmod a+rwx /var/run/cdi /var/lib/kubelet/plugins_registry /var/lib/kubelet/plugins/
$ go run ./test/e2e/dra/test-driver --feature-gates ContextualLogging=true -v=6 kubelet-plugin

Changing the permissions of the directories makes it possible to run and (when using delve) debug the kubelet plugin as a normal user, which is convenient because it uses the already populated Go cache. Remember to restore permissions with sudo chmod go-w when done. Alternatively, you can also build the binary and run that as root.

Now the cluster is ready to create objects:

$ kubectl create -f test/e2e/dra/test-driver/deploy/example/resourceclass.yaml
resourceclass.resource.k8s.io/example created

$ kubectl create -f test/e2e/dra/test-driver/deploy/example/pod-inline.yaml
configmap/test-inline-claim-parameters created
resourceclaimtemplate.resource.k8s.io/test-inline-claim-template created
pod/test-inline-claim created

$ kubectl get resourceclaims
NAME RESOURCECLASSNAME ALLOCATIONMODE STATE AGE
test-inline-claim-resource example WaitForFirstConsumer allocated,reserved 8s

$ kubectl get pods
NAME READY STATUS RESTARTS AGE
test-inline-claim 0/2 Completed 0 21s

The test driver doesn't do much: it only sets environment variables as defined in the ConfigMap. The test pod dumps the environment, so the log can be checked to verify that everything worked:

$ kubectl logs test-inline-claim with-resource | grep user_a
user_a='b'

Next steps

15 Dec 2022 12:00am GMT

13 Dec 2022

Blog: Kubernetes 1.26: Windows HostProcess Containers Are Generally Available

Authors: Brandon Smith (Microsoft) and Mark Rossetti (Microsoft)

The long-awaited day has arrived: HostProcess containers, the Windows equivalent to Linux privileged containers, have finally made it to GA in Kubernetes 1.26!

What are HostProcess containers and why are they useful?

Cluster operators often need to configure their nodes upon provisioning, for example by installing Windows services, configuring registry keys, managing TLS certificates, making network configuration changes, or even deploying monitoring tools such as Prometheus's node-exporter. Previously, performing these actions on Windows nodes was usually done by running PowerShell scripts over SSH or WinRM sessions and/or working with your cloud provider's virtual machine management tooling. HostProcess containers now enable you to do all of this and more with minimal effort using Kubernetes native APIs.

With HostProcess containers you can now package any payload into the container image, map volumes into containers at runtime, and manage them like any other Kubernetes workload. You get all the benefits of containerized packaging and deployment methods combined with a reduction in both administrative and development cost. Gone are the days when cluster operators would need to manually log onto Windows nodes to perform administrative duties.

HostProcess containers differ quite significantly from regular Windows Server containers. They are run directly as processes on the host with the access policies of a user you specify. HostProcess containers run as either the built-in Windows system accounts or ephemeral users within a user group defined by you. HostProcess containers also share the host's network namespace and can access and configure storage mounts visible to the host. Windows Server containers, on the other hand, are highly isolated and exist in a separate execution namespace. Direct access to the host from a Windows Server container is explicitly disallowed by default.

How does it work?

Windows HostProcess containers are implemented with Windows Job Objects, a break from the previous container model, which uses server silos. Job Objects are components of the Windows OS which offer the ability to manage a set of processes as a group (also known as a job) and assign resource constraints to the group as a whole. Job objects are specific to the Windows OS and are not associated with the Kubernetes Job API. They have no process or file system isolation, enabling the privileged payload to view and edit the host file system with the desired permissions, among other host resources. The init process and any processes it launches (including processes explicitly launched by the user) are all assigned to the job object of that container. When the init process exits or is signaled to exit, all the processes in the job will be signaled to exit, the job handle will be closed, and the storage will be unmounted.
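The lifecycle described above can be pictured with a toy model. This is purely conceptual: the `JobObject` class and its methods are hypothetical and do not correspond to the Windows API or to any container runtime code. It only illustrates that every process launched by a job member joins the same job, and that the exit of the init process tears down the whole group.

```python
class JobObject:
    """Toy model of a job: a set of processes managed as one group."""

    def __init__(self, init_pid: int):
        self.init_pid = init_pid
        self.running = {init_pid}
        self.handle_open = True
        self.storage_mounted = True

    def spawn(self, parent: int, child: int) -> None:
        """A process started by a job member is assigned to the same job."""
        if parent not in self.running:
            raise ValueError(f"process {parent} is not running in this job")
        self.running.add(child)

    def exit_init(self) -> list:
        """Init exits: signal every process, close the handle, unmount storage."""
        signaled = sorted(self.running)
        self.running.clear()
        self.handle_open = False
        self.storage_mounted = False
        return signaled


job = JobObject(init_pid=100)
job.spawn(parent=100, child=101)
job.spawn(parent=101, child=102)   # a grandchild joins the same job
print(job.exit_init())             # [100, 101, 102]
print(job.handle_open, job.storage_mounted)  # False False
```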

HostProcess and Linux privileged containers enable similar scenarios but differ greatly in their implementation (hence the naming difference). HostProcess containers have their own PodSecurityContext fields. Those used to configure Linux privileged containers do not apply. Enabling privileged access to a Windows host is a fundamentally different process than on Linux, so the configuration and capabilities of each differ significantly. Below is a diagram detailing the overall architecture of Windows HostProcess containers:

[Figure: HostProcess Architecture]

Two major features were added prior to moving to stable: the ability to run as local user accounts, and a simplified method of accessing volume mounts. To learn more, read Create a Windows HostProcess Pod.

HostProcess containers in action

Kubernetes SIG Windows has been busy putting HostProcess containers to use - even before GA! They've been very excited to use HostProcess containers for a number of important activities that were a pain to perform in the past.

Here are just a few of the many use cases with example deployments:

How do I use it?

A HostProcess container can be built using any base image of your choosing. For convenience, however, we have created a HostProcess container base image. This image is only a few KB in size and does not inherit any of the compatibility requirements of regular Windows Server containers, which allows it to run on any Windows Server version.

To use that Microsoft image, put this in your Dockerfile:

FROM mcr.microsoft.com/oss/kubernetes/windows-host-process-containers-base-image:v1.0.0

You can run HostProcess containers from within a HostProcess Pod.

To get started with running Windows containers, see the general guidance for deploying Windows nodes. If you have a compatible node (for example: Windows as the operating system with containerd v1.7 or later as the container runtime), you can deploy a Pod with one or more HostProcess containers. See the Create a Windows HostProcess Pod - Prerequisites for more information.

Please note that within a Pod, you can't mix HostProcess containers with normal Windows containers.
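As an illustration of that rule, here is a minimal sketch of a manifest check, assuming the Pod spec is available as a Python dictionary. The helper names, container names, and image names are hypothetical. The `securityContext.windowsOptions.hostProcess` field can be set at Pod level and at container level; the sketch assumes the container-level value takes precedence when both are present.

```python
def is_host_process(container: dict, pod_default: bool) -> bool:
    """Effective hostProcess setting: container-level value wins over Pod-level."""
    opts = container.get("securityContext", {}).get("windowsOptions", {})
    return opts.get("hostProcess", pod_default)


def mixes_container_types(pod: dict) -> bool:
    """Return True if the Pod combines HostProcess and regular containers."""
    spec = pod.get("spec", {})
    pod_opts = spec.get("securityContext", {}).get("windowsOptions", {})
    pod_default = pod_opts.get("hostProcess", False)
    containers = spec.get("initContainers", []) + spec.get("containers", [])
    kinds = {is_host_process(c, pod_default) for c in containers}
    return len(kinds) > 1


# Hypothetical Pod: one HostProcess container next to a regular one.
pod = {
    "spec": {
        "hostNetwork": True,  # HostProcess containers share the host's network
        "containers": [
            {
                "name": "node-setup",
                "image": "example.com/node-setup:latest",
                "securityContext": {"windowsOptions": {"hostProcess": True}},
            },
            {"name": "app", "image": "example.com/app:latest"},
        ],
    }
}

print(mixes_container_types(pod))  # True: this Pod would be rejected
```

Setting `hostProcess` once at Pod level and leaving the containers without an override is one way to keep all containers of a Pod consistent.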

How can I learn more?

How do I get involved?

Get involved with SIG Windows to contribute!

13 Dec 2022 12:00am GMT