23 Aug 2024

feedKubernetes Blog

Kubernetes v1.31: kubeadm v1beta4

As part of the Kubernetes v1.31 release, kubeadm is adopting a new (v1beta4) version of its configuration file format. Configuration in the previous v1beta3 format is now formally deprecated, which means it's supported but you should migrate to v1beta4 and stop using the deprecated format. Support for v1beta3 configuration will be removed after a minimum of 3 Kubernetes minor releases.

In this article, I'll walk you through key changes; I'll explain about the kubeadm v1beta4 configuration format, and how to migrate from v1beta3 to v1beta4.

You can read the reference for the v1beta4 configuration format: kubeadm Configuration (v1beta4).

A list of changes since v1beta3

This version improves on the v1beta3 format by fixing some minor issues and adding a few new fields.

To put it simply,

For details, you can see the official document below:

These changes simplify the configuration of tools that use kubeadm and improve the extensibility of kubeadm itself.

How to migrate v1beta3 configuration to v1beta4?

If your configuration is not using the latest version, it is recommended that you migrate using the kubeadm config migrate command.

This command reads an existing configuration file that uses the old format, and writes a new file that uses the current format.

Example

Using kubeadm v1.31, run kubeadm config migrate --old-config old-v1beta3.yaml --new-config new-v1beta4.yaml

How do I get involved?

Huge thanks to all the contributors who helped with the design, implementation, and review of this feature:

For those interested in getting involved in future discussions on kubeadm configuration, you can reach out kubeadm or SIG-cluster-lifecycle by several means:

23 Aug 2024 12:00am GMT

22 Aug 2024

feedKubernetes Blog

Kubernetes v1.31: New Kubernetes CPUManager Static Policy: Distribute CPUs Across Cores

In Kubernetes v1.31, we are excited to introduce a significant enhancement to CPU management capabilities: the distribute-cpus-across-cores option for the CPUManager static policy. This feature is currently in alpha and hidden by default, marking a strategic shift aimed at optimizing CPU utilization and improving system performance across multi-core processors.

Understanding the feature

Traditionally, Kubernetes' CPUManager tends to allocate CPUs as compactly as possible, typically packing them onto the fewest number of physical cores. However, allocation strategy matters, CPUs on the same physical host still share some resources of the physical core, such as the cache and execution units, etc.

cpu-cache-architecture

While default approach minimizes inter-core communication and can be beneficial under certain scenarios, it also poses a challenge. CPUs sharing a physical core can lead to resource contention, which in turn may cause performance bottlenecks, particularly noticeable in CPU-intensive applications.

The new distribute-cpus-across-cores feature addresses this issue by modifying the allocation strategy. When enabled, this policy option instructs the CPUManager to spread out the CPUs (hardware threads) across as many physical cores as possible. This distribution is designed to minimize contention among CPUs sharing the same physical core, potentially enhancing the performance of applications by providing them dedicated core resources.

Technically, within this static policy, the free CPU list is reordered in the manner depicted in the diagram, aiming to allocate CPUs from separate physical cores.

cpu-ordering

Enabling the feature

To enable this feature, users firstly need to add --cpu-manager-policy=static kubelet flag or the cpuManagerPolicy: static field in KubeletConfiuration. Then user can add --cpu-manager-policy-options distribute-cpus-across-cores=true or distribute-cpus-across-cores=true to their CPUManager policy options in the Kubernetes configuration or. This setting directs the CPUManager to adopt the new distribution strategy. It is important to note that this policy option cannot currently be used in conjunction with full-pcpus-only or distribute-cpus-across-numa options.

Current limitations and future directions

As with any new feature, especially one in alpha, there are limitations and areas for future improvement. One significant current limitation is that distribute-cpus-across-cores cannot be combined with other policy options that might conflict in terms of CPU allocation strategies. This restriction can affect compatibility with certain workloads and deployment scenarios that rely on more specialized resource management.

Looking forward, we are committed to enhancing the compatibility and functionality of the distribute-cpus-across-cores option. Future updates will focus on resolving these compatibility issues, allowing this policy to be combined with other CPUManager policies seamlessly. Our goal is to provide a more flexible and robust CPU allocation framework that can adapt to a variety of workloads and performance demands.

Conclusion

The introduction of the distribute-cpus-across-cores policy in Kubernetes CPUManager is a step forward in our ongoing efforts to refine resource management and improve application performance. By reducing the contention on physical cores, this feature offers a more balanced approach to CPU resource allocation, particularly beneficial for environments running heterogeneous workloads. We encourage Kubernetes users to test this new feature and provide feedback, which will be invaluable in shaping its future development.

This draft aims to clearly explain the new feature while setting expectations for its current stage and future improvements.

Further reading

Please check out the Control CPU Management Policies on the Node task page to learn more about the CPU Manager, and how it fits in relation to the other node-level resource managers.

Getting involved

This feature is driven by the SIG Node. If you are interested in helping develop this feature, sharing feedback, or participating in any other ongoing SIG Node projects, please attend the SIG Node meeting for more details.

22 Aug 2024 12:00am GMT

Kubernetes 1.31: Fine-grained SupplementalGroups control

This blog discusses a new feature in Kubernetes 1.31 to improve the handling of supplementary groups in containers within Pods.

Motivation: Implicit group memberships defined in /etc/group in the container image

Although this behavior may not be popular with many Kubernetes cluster users/admins, kubernetes, by default, merges group information from the Pod with information defined in /etc/group in the container image.

Let's see an example, below Pod specifies runAsUser=1000, runAsGroup=3000 and supplementalGroups=4000 in the Pod's security context.

apiVersion: v1
kind: Pod
metadata:
 name: implicit-groups
spec:
 securityContext:
 runAsUser: 1000
 runAsGroup: 3000
 supplementalGroups: [4000]
 containers:
 - name: ctr
 image: registry.k8s.io/e2e-test-images/agnhost:2.45
 command: [ "sh", "-c", "sleep 1h" ]
 securityContext:
 allowPrivilegeEscalation: false

What is the result of id command in the ctr container?

# Create the Pod:
$ kubectl apply -f https://k8s.io/blog/2024-08-22-Fine-grained-SupplementalGroups-control/implicit-groups.yaml

# Verify that the Pod's Container is running:
$ kubectl get pod implicit-groups

# Check the id command
$ kubectl exec implicit-groups -- id

Then, output should be similar to this:

uid=1000 gid=3000 groups=3000,4000,50000

Where does group ID 50000 in supplementary groups (groups field) come from, even though 50000 is not defined in the Pod's manifest at all? The answer is /etc/group file in the container image.

Checking the contents of /etc/group in the container image should show below:

$ kubectl exec implicit-groups -- cat /etc/group
...
user-defined-in-image:x:1000:
group-defined-in-image:x:50000:user-defined-in-image

Aha! The container's primary user 1000 belongs to the group 50000 in the last entry.

Thus, the group membership defined in /etc/group in the container image for the container's primary user is implicitly merged to the information from the Pod. Please note that this was a design decision the current CRI implementations inherited from Docker, and the community never really reconsidered it until now.

What's wrong with it?

The implicitly merged group information from /etc/group in the container image may cause some concerns particularly in accessing volumes (see kubernetes/kubernetes#112879 for details) because file permission is controlled by uid/gid in Linux. Even worse, the implicit gids from /etc/group can not be detected/validated by any policy engines because there is no clue for the implicit group information in the manifest. This can also be a concern for Kubernetes security.

Fine-grained SupplementalGroups control in a Pod: SupplementaryGroupsPolicy

To tackle the above problem, Kubernetes 1.31 introduces new field supplementalGroupsPolicy in Pod's .spec.securityContext.

This field provies a way to control how to calculate supplementary groups for the container processes in a Pod. The available policy is below:

Let's see how Strict policy works.

apiVersion: v1
kind: Pod
metadata:
 name: strict-supplementalgroups-policy
spec:
 securityContext:
 runAsUser: 1000
 runAsGroup: 3000
 supplementalGroups: [4000]
 supplementalGroupsPolicy: Strict
 containers:
 - name: ctr
 image: registry.k8s.io/e2e-test-images/agnhost:2.45
 command: [ "sh", "-c", "sleep 1h" ]
 securityContext:
 allowPrivilegeEscalation: false
# Create the Pod:
$ kubectl apply -f https://k8s.io/blog/2024-08-22-Fine-grained-SupplementalGroups-control/strict-supplementalgroups-policy.yaml

# Verify that the Pod's Container is running:
$ kubectl get pod strict-supplementalgroups-policy

# Check the process identity:
kubectl exec -it strict-supplementalgroups-policy -- id

The output should be similar to this:

uid=1000 gid=3000 groups=3000,4000

You can see Strict policy can exclude group 50000 from groups!

Thus, ensuring supplementalGroupsPolicy: Strict (enforced by some policy mechanism) helps prevent the implicit supplementary groups in a Pod.

Attached process identity in Pod status

This feature also exposes the process identity attached to the first container process of the container via .status.containerStatuses[].user.linux field. It would be helpful to see if implicit group IDs are attached.

...
status:
 containerStatuses:
 - name: ctr
 user:
 linux:
 gid: 3000
 supplementalGroups:
 - 3000
 - 4000
 uid: 1000
...

Feature availability

To enable supplementalGroupsPolicy field, the following components have to be used:

You can see if the feature is supported in the Node's .status.features.supplementalGroupsPolicy field.

apiVersion: v1
kind: Node
...
status:
 features:
 supplementalGroupsPolicy: true

What's next?

Kubernetes SIG Node hope - and expect - that the feature will be promoted to beta and eventually general availability (GA) in future releases of Kubernetes, so that users no longer need to enable the feature gate manually.

Merge policy is applied when supplementalGroupsPolicy is not specified, for backwards compatibility.

How can I learn more?

How to get involved?

This feature is driven by the SIG Node community. Please join us to connect with the community and share your ideas and feedback around the above feature and beyond. We look forward to hearing from you!

22 Aug 2024 12:00am GMT

Kubernetes 1.31: Custom Profiling in Kubectl Debug Graduates to Beta

There are many ways of troubleshooting the pods and nodes in the cluster. However, kubectl debug is one of the easiest, highly used and most prominent ones. It provides a set of static profiles and each profile serves for a different kind of role. For instance, from the network administrator's point of view, debugging the node should be as easy as this:

$ kubectl debug node/mynode -it --image=busybox --profile=netadmin

On the other hand, static profiles also bring about inherent rigidity, which has some implications for some pods contrary to their ease of use. Because there are various kinds of pods (or nodes) that all have their specific necessities, and unfortunately, some can't be debugged by only using the static profiles.

Take an instance of a simple pod consisting of a container whose healthiness relies on an environment variable:

apiVersion: v1
kind: Pod
metadata:
 name: example-pod
spec:
 containers:
 - name: example-container
 image: customapp:latest
 env:
 - name: REQUIRED_ENV_VAR
 value: "value1"

Currently, copying the pod is the sole mechanism that supports debugging this pod in kubectl debug. Furthermore, what if user needs to modify the REQUIRED_ENV_VAR to something different for advanced troubleshooting?. There is no mechanism to achieve this.

Custom Profiling

Custom profiling is a new functionality available under --custom flag, introduced in kubectl debug to provide extensibility. It expects partial Container spec in either YAML or JSON format. In order to debug the example-container above by creating an ephemeral container, we simply have to define this YAML:

# partial_container.yaml
env:
 - name: REQUIRED_ENV_VAR
 value: value2

and execute:

kubectl debug example-pod -it --image=customapp --custom=partial_container.yaml

Here is another example that modifies multiple fields at once (change port number, add resource limits, modify environment variable) in JSON:

{
 "ports": [
 {
 "containerPort": 80
 }
 ],
 "resources": {
 "limits": {
 "cpu": "0.5",
 "memory": "512Mi"
 },
 "requests": {
 "cpu": "0.2",
 "memory": "256Mi"
 }
 },
 "env": [
 {
 "name": "REQUIRED_ENV_VAR",
 "value": "value2"
 }
 ]
}

Constraints

Uncontrolled extensibility hurts the usability. So that, custom profiling is not allowed for certain fields such as command, image, lifecycle, volume devices and container name. In the future, more fields can be added to the disallowed list if required.

Limitations

The kubectl debug command has 3 aspects: Debugging with ephemeral containers, pod copying, and node debugging. The largest intersection set of these aspects is the container spec within a Pod That's why, custom profiling only supports the modification of the fields that are defined with containers. This leads to a limitation that if user needs to modify the other fields in the Pod spec, it is not supported.

Acknowledgments

Special thanks to all the contributors who reviewed and commented on this feature, from the initial conception to its actual implementation (alphabetical order):

22 Aug 2024 12:00am GMT

21 Aug 2024

feedKubernetes Blog

Kubernetes 1.31: Autoconfiguration For Node Cgroup Driver (beta)

Historically, configuring the correct cgroup driver has been a pain point for users running new Kubernetes clusters. On Linux systems, there are two different cgroup drivers: cgroupfs and systemd. In the past, both the kubelet and CRI implementation (like CRI-O or containerd) needed to be configured to use the same cgroup driver, or else the kubelet would exit with an error. This was a source of headaches for many cluster admins. However, there is light at the end of the tunnel!

Automated cgroup driver detection

In v1.28.0, the SIG Node community introduced the feature gate KubeletCgroupDriverFromCRI, which instructs the kubelet to ask the CRI implementation which cgroup driver to use. A few minor releases of Kubernetes happened whilst we waited for support to land in the major two CRI implementations (containerd and CRI-O), but as of v1.31.0, this feature is now beta!

In addition to setting the feature gate, a cluster admin needs to ensure their CRI implementation is new enough:

Then, they should ensure their CRI implementation is configured to the cgroup_driver they would like to use.

Future work

Eventually, support for the kubelet's cgroupDriver configuration field will be dropped, and the kubelet will fail to start if the CRI implementation isn't new enough to have support for this feature.

21 Aug 2024 12:00am GMT

20 Aug 2024

feedKubernetes Blog

Kubernetes 1.31: Streaming Transitions from SPDY to WebSockets

In Kubernetes 1.31, by default kubectl now uses the WebSocket protocol instead of SPDY for streaming.

This post describes what these changes mean for you and why these streaming APIs matter.

Streaming APIs in Kubernetes

In Kubernetes, specific endpoints that are exposed as an HTTP or RESTful interface are upgraded to streaming connections, which require a streaming protocol. Unlike HTTP, which is a request-response protocol, a streaming protocol provides a persistent connection that's bi-directional, low-latency, and lets you interact in real-time. Streaming protocols support reading and writing data between your client and the server, in both directions, over the same connection. This type of connection is useful, for example, when you create a shell in a running container from your local workstation and run commands in the container.

Why change the streaming protocol?

Before the v1.31 release, Kubernetes used the SPDY/3.1 protocol by default when upgrading streaming connections. SPDY/3.1 has been deprecated for eight years, and it was never standardized. Many modern proxies, gateways, and load balancers no longer support the protocol. As a result, you might notice that commands like kubectl cp, kubectl attach, kubectl exec, and kubectl port-forward stop working when you try to access your cluster through a proxy or gateway.

As of Kubernetes v1.31, SIG API Machinery has modified the streaming protocol that a Kubernetes client (such as kubectl) uses for these commands to the more modern WebSocket streaming protocol. The WebSocket protocol is a currently supported standardized streaming protocol that guarantees compatibility and interoperability with different components and programming languages. The WebSocket protocol is more widely supported by modern proxies and gateways than SPDY.

How streaming APIs work

Kubernetes upgrades HTTP connections to streaming connections by adding specific upgrade headers to the originating HTTP request. For example, an HTTP upgrade request for running the date command on an nginx container within a cluster is similar to the following:

$ kubectl exec -v=8 nginx -- date
GET https://127.0.0.1:43251/api/v1/namespaces/default/pods/nginx/exec?command=date…
Request Headers:
 Connection: Upgrade
 Upgrade: websocket
 Sec-Websocket-Protocol: v5.channel.k8s.io
 User-Agent: kubectl/v1.31.0 (linux/amd64) kubernetes/6911225

If the container runtime supports the WebSocket streaming protocol and at least one of the subprotocol versions (e.g. v5.channel.k8s.io), the server responds with a successful 101 Switching Protocols status, along with the negotiated subprotocol version:

Response Status: 101 Switching Protocols in 3 milliseconds
Response Headers:
 Upgrade: websocket
 Connection: Upgrade
 Sec-Websocket-Accept: j0/jHW9RpaUoGsUAv97EcKw8jFM=
 Sec-Websocket-Protocol: v5.channel.k8s.io

At this point the TCP connection used for the HTTP protocol has changed to a streaming connection. Subsequent STDIN, STDOUT, and STDERR data (as well as terminal resizing data and process exit code data) for this shell interaction is then streamed over this upgraded connection.

How to use the new WebSocket streaming protocol

If your cluster and kubectl are on version 1.29 or later, there are two control plane feature gates and two kubectl environment variables that govern the use of the WebSockets rather than SPDY. In Kubernetes 1.31, all of the following feature gates are in beta and are enabled by default:

If you're connecting to an older cluster but can manage the feature gate settings, turn on both TranslateStreamCloseWebsocketRequests (added in Kubernetes v1.29) and PortForwardWebsockets (added in Kubernetes v1.30) to try this new behavior. Version 1.31 of kubectl can automatically use the new behavior, but you do need to connect to a cluster where the server-side features are explicitly enabled.

Learn more about streaming APIs

20 Aug 2024 12:00am GMT

19 Aug 2024

feedKubernetes Blog

Kubernetes 1.31: Pod Failure Policy for Jobs Goes GA

This post describes Pod failure policy, which graduates to stable in Kubernetes 1.31, and how to use it in your Jobs.

About Pod failure policy

When you run workloads on Kubernetes, Pods might fail for a variety of reasons. Ideally, workloads like Jobs should be able to ignore transient, retriable failures and continue running to completion.

To allow for these transient failures, Kubernetes Jobs include the backoffLimit field, which lets you specify a number of Pod failures that you're willing to tolerate during Job execution. However, if you set a large value for the backoffLimit field and rely solely on this field, you might notice unnecessary increases in operating costs as Pods restart excessively until the backoffLimit is met.

This becomes particularly problematic when running large-scale Jobs with thousands of long-running Pods across thousands of nodes.

The Pod failure policy extends the backoff limit mechanism to help you reduce costs in the following ways:

For example, you can use a Pod failure policy to run your workload on more affordable spot machines by ignoring Pod failures caused by graceful node shutdown.

The policy allows you to distinguish between retriable and non-retriable Pod failures based on container exit codes or Pod conditions in a failed Pod.

How it works

You specify a Pod failure policy in the Job specification, represented as a list of rules.

For each rule you define match requirements based on one of the following properties:

Additionally, for each rule, you specify one of the following actions to take when a Pod matches the rule:

When Pod failures occur in a running Job, Kubernetes matches the failed Pod status against the list of Pod failure policy rules, in the specified order, and takes the corresponding actions for the first matched rule.

Note that when specifying the Pod failure policy, you must also set the Job's Pod template with restartPolicy: Never. This prevents race conditions between the kubelet and the Job controller when counting Pod failures.

Kubernetes-initiated Pod disruptions

To allow matching Pod failure policy rules against failures caused by disruptions initiated by Kubernetes, this feature introduces the DisruptionTarget Pod condition.

Kubernetes adds this condition to any Pod, regardless of whether it's managed by a Job controller, that fails because of a retriable disruption scenario. The DisruptionTarget condition contains one of the following reasons that corresponds to these disruption scenarios:

In all other disruption scenarios, like eviction due to exceeding Pod container limits, Pods don't receive the DisruptionTarget condition because the disruptions were likely caused by the Pod and would reoccur on retry.

Example

The Pod failure policy snippet below demonstrates an example use:

podFailurePolicy:
 rules:
 - action: Ignore
 onPodConditions:
 - type: DisruptionTarget
 - action: FailJob
 onPodConditions:
 - type: ConfigIssue
 - action: FailJob
 onExitCodes:
 operator: In
 values: [ 42 ]

In this example, the Pod failure policy does the following:

Learn more

Related work

Based on the concepts introduced by Pod failure policy, the following additional work is in progress:

Get involved

This work was sponsored by batch working group in close collaboration with the SIG Apps, and SIG Node, and SIG Scheduling communities.

If you are interested in working on new features in the space we recommend subscribing to our Slack channel and attending the regular community meetings.

Acknowledgments

I would love to thank everyone who was involved in this project over the years - it's been a journey and a joint community effort! The list below is my best-effort attempt to remember and recognize people who made an impact. Thank you!

19 Aug 2024 12:00am GMT

16 Aug 2024

feedKubernetes Blog

Kubernetes 1.31: Read Only Volumes Based On OCI Artifacts (alpha)

The Kubernetes community is moving towards fulfilling more Artificial Intelligence (AI) and Machine Learning (ML) use cases in the future. While the project has been designed to fulfill microservice architectures in the past, it's now time to listen to the end users and introduce features which have a stronger focus on AI/ML.

One of these requirements is to support Open Container Initiative (OCI) compatible images and artifacts (referred as OCI objects) directly as a native volume source. This allows users to focus on OCI standards as well as enables them to store and distribute any content using OCI registries. A feature like this gives the Kubernetes project a chance to grow into use cases which go beyond running particular images.

Given that, the Kubernetes community is proud to present a new alpha feature introduced in v1.31: The Image Volume Source (KEP-4639). This feature allows users to specify an image reference as volume in a pod while reusing it as volume mount within containers:


kind: Pod
spec:
 containers:
 - …
 volumeMounts:
 - name: my-volume
 mountPath: /path/to/directory
 volumes:
 - name: my-volume
 image:
 reference: my-image:tag

The above example would result in mounting my-image:tag to /path/to/directory in the pod's container.

Use cases

The goal of this enhancement is to stick as close as possible to the existing container image implementation within the kubelet, while introducing a new API surface to allow more extended use cases.

For example, users could share a configuration file among multiple containers in a pod without including the file in the main image, so that they can minimize security risks and the overall image size. They can also package and distribute binary artifacts using OCI images and mount them directly into Kubernetes pods, so that they can streamline their CI/CD pipeline as an example.

Data scientists, MLOps engineers, or AI developers, can mount large language model weights or machine learning model weights in a pod alongside a model-server, so that they can efficiently serve them without including them in the model-server container image. They can package these in an OCI object to take advantage of OCI distribution and ensure efficient model deployment. This allows them to separate the model specifications/content from the executables that process them.

Another use case is that security engineers can use a public image for a malware scanner and mount in a volume of private (commercial) malware signatures, so that they can load those signatures without baking their own combined image (which might not be allowed by the copyright on the public image). Those files work regardless of the OS or version of the scanner software.

But in the long term it will be up to you as an end user of this project to outline further important use cases for the new feature. SIG Node is happy to retrieve any feedback or suggestions for further enhancements to allow more advanced usage scenarios. Feel free to provide feedback by either using the Kubernetes Slack (#sig-node) channel or the SIG Node mailinglist.

Detailed example

The Kubernetes alpha feature gate ImageVolume needs to be enabled on the API Server as well as the kubelet to make it functional. If that's the case and the container runtime has support for the feature (like CRI-O ≥ v1.31), then an example pod.yaml like this can be created:

apiVersion: v1
kind: Pod
metadata:
 name: pod
spec:
 containers:
 - name: test
 image: registry.k8s.io/e2e-test-images/echoserver:2.3
 volumeMounts:
 - name: volume
 mountPath: /volume
 volumes:
 - name: volume
 image:
 reference: quay.io/crio/artifact:v1
 pullPolicy: IfNotPresent

The pod declares a new volume using the image.reference of quay.io/crio/artifact:v1, which refers to an OCI object containing two files. The pullPolicy behaves in the same way as for container images and allows the following values:

The volumeMounts field is indicating that the container with the name test should mount the volume under the path /volume.

If you now create the pod:

kubectl apply -f pod.yaml

And exec into it:

kubectl exec -it pod -- sh

Then you're able to investigate what has been mounted:

/ # ls /volume
dir file
/ # cat /volume/file
2
/ # ls /volume/dir
file
/ # cat /volume/dir/file
1

You managed to consume an OCI artifact using Kubernetes!

The container runtime pulls the image (or artifact), mounts it to the container and makes it finally available for direct usage. There are a bunch of details in the implementation, which closely align to the existing image pull behavior of the kubelet. For example:

Thank you for reading through the end of this blog post! SIG Node is proud and happy to deliver this feature as part of Kubernetes v1.31.

As writer of this blog post, I would like to emphasize my special thanks to all involved individuals out there! You all rock, let's keep on hacking!

Further reading

16 Aug 2024 12:00am GMT

Kubernetes 1.31: Prevent PersistentVolume Leaks When Deleting out of Order

PersistentVolume (or PVs for short) are associated with Reclaim Policy. The reclaim policy is used to determine the actions that need to be taken by the storage backend on deletion of the PVC Bound to a PV. When the reclaim policy is Delete, the expectation is that the storage backend releases the storage resource allocated for the PV. In essence, the reclaim policy needs to be honored on PV deletion.

With the recent Kubernetes v1.31 release, a beta feature lets you configure your cluster to behave that way and honor the configured reclaim policy.

How did reclaim work in previous Kubernetes releases?

PersistentVolumeClaim (or PVC for short) is a user's request for storage. A PV and PVC are considered Bound if a newly created PV or a matching PV is found. The PVs themselves are backed by volumes allocated by the storage backend.

Normally, if the volume is to be deleted, then the expectation is to delete the PVC for a bound PV-PVC pair. However, there are no restrictions on deleting a PV before deleting a PVC.

First, I'll demonstrate the behavior for clusters running an older version of Kubernetes.

Retrieve a PVC that is bound to a PV

Retrieve an existing PVC example-vanilla-block-pvc

kubectl get pvc example-vanilla-block-pvc

The following output shows the PVC and its bound PV; the PV is shown under the VOLUME column:

NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
example-vanilla-block-pvc Bound pvc-6791fdd4-5fad-438e-a7fb-16410363e3da 5Gi RWO example-vanilla-block-sc 19s

Delete PV

When I try to delete a bound PV, the kubectl session blocks and the kubectl tool does not return back control to the shell; for example:

kubectl delete pv pvc-6791fdd4-5fad-438e-a7fb-16410363e3da
persistentvolume "pvc-6791fdd4-5fad-438e-a7fb-16410363e3da" deleted
^C

Retrieving the PV

kubectl get pv pvc-6791fdd4-5fad-438e-a7fb-16410363e3da

It can be observed that the PV is in a Terminating state

NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
pvc-6791fdd4-5fad-438e-a7fb-16410363e3da 5Gi RWO Delete Terminating default/example-vanilla-block-pvc example-vanilla-block-sc 2m23s

Delete PVC

kubectl delete pvc example-vanilla-block-pvc

The following output is seen if the PVC gets successfully deleted:

persistentvolumeclaim "example-vanilla-block-pvc" deleted

The PV object from the cluster also gets deleted. When attempting to retrieve the PV it will be observed that the PV is no longer found:

kubectl get pv pvc-6791fdd4-5fad-438e-a7fb-16410363e3da
Error from server (NotFound): persistentvolumes "pvc-6791fdd4-5fad-438e-a7fb-16410363e3da" not found

Although the PV is deleted, the underlying storage resource is not deleted and needs to be removed manually.

To sum up, the reclaim policy associated with the PersistentVolume is currently ignored under certain circumstances. For a Bound PV-PVC pair, the ordering of PV-PVC deletion determines whether the PV reclaim policy is honored. The reclaim policy is honored if the PVC is deleted first; however, if the PV is deleted prior to deleting the PVC, then the reclaim policy is not exercised. As a result of this behavior, the associated storage asset in the external infrastructure is not removed.

PV reclaim policy with Kubernetes v1.31

The new behavior ensures that the underlying storage object is deleted from the backend when users attempt to delete a PV manually.

How to enable new behavior?

To take advantage of the new behavior, you must have upgraded your cluster to the v1.31 release of Kubernetes and run the CSI external-provisioner version 5.0.1 or later.

How does it work?

For CSI volumes, the new behavior is achieved by adding a finalizer external-provisioner.volume.kubernetes.io/finalizer on new and existing PVs. The finalizer is only removed after the storage from the backend is deleted. `

An example of a PV with the finalizer, notice the new finalizer in the finalizers list

kubectl get pv pvc-a7b7e3ba-f837-45ba-b243-dec7d8aaed53 -o yaml
apiVersion: v1
kind: PersistentVolume
metadata:
 annotations:
 pv.kubernetes.io/provisioned-by: csi.vsphere.vmware.com
 creationTimestamp: "2021-11-17T19:28:56Z"
 finalizers:
 - kubernetes.io/pv-protection
 - external-provisioner.volume.kubernetes.io/finalizer
 name: pvc-a7b7e3ba-f837-45ba-b243-dec7d8aaed53
 resourceVersion: "194711"
 uid: 087f14f2-4157-4e95-8a70-8294b039d30e
spec:
 accessModes:
 - ReadWriteOnce
 capacity:
 storage: 1Gi
 claimRef:
 apiVersion: v1
 kind: PersistentVolumeClaim
 name: example-vanilla-block-pvc
 namespace: default
 resourceVersion: "194677"
 uid: a7b7e3ba-f837-45ba-b243-dec7d8aaed53
 csi:
 driver: csi.vsphere.vmware.com
 fsType: ext4
 volumeAttributes:
 storage.kubernetes.io/csiProvisionerIdentity: 1637110610497-8081-csi.vsphere.vmware.com
 type: vSphere CNS Block Volume
 volumeHandle: 2dacf297-803f-4ccc-afc7-3d3c3f02051e
 persistentVolumeReclaimPolicy: Delete
 storageClassName: example-vanilla-block-sc
 volumeMode: Filesystem
status:
 phase: Bound

The finalizer prevents this PersistentVolume from being removed from the cluster. As stated previously, the finalizer is only removed from the PV object after it is successfully deleted from the storage backend. To learn more about finalizers, please refer to Using Finalizers to Control Deletion.

Similarly, the finalizer kubernetes.io/pv-controller is added to dynamically provisioned in-tree plugin volumes.

What about CSI migrated volumes?

The fix applies to CSI migrated volumes as well.

Some caveats

The fix does not apply to statically provisioned in-tree plugin volumes.

References

How do I get involved?

The Kubernetes Slack channel SIG Storage communication channels are great mediums to reach out to the SIG Storage and migration working group teams.

Special thanks to the following people for the insightful reviews, thorough consideration and valuable contribution:

Join the Kubernetes Storage Special Interest Group (SIG) if you're interested in getting involved with the design and development of CSI or any part of the Kubernetes Storage system. We're rapidly growing and always welcome new contributors.

16 Aug 2024 12:00am GMT

Kubernetes 1.31: MatchLabelKeys in PodAffinity graduates to beta

Kubernetes 1.29 introduced new fields matchLabelKeys and mismatchLabelKeys in podAffinity and podAntiAffinity.

In Kubernetes 1.31, this feature moves to beta and the corresponding feature gate (MatchLabelKeysInPodAffinity) gets enabled by default.

matchLabelKeys - Enhanced scheduling for versatile rolling updates

During a workload's (e.g., Deployment) rolling update, a cluster may have Pods from multiple versions at the same time. However, the scheduler cannot distinguish between old and new versions based on the labelSelector specified in podAffinity or podAntiAffinity. As a result, it will co-locate or disperse Pods regardless of their versions.

This can lead to sub-optimal scheduling outcome, for example:

matchLabelKeys is a set of Pod label keys and addresses this problem. The scheduler looks up the values of these keys from the new Pod's labels and combines them with labelSelector so that podAffinity matches Pods that have the same key-value in labels.

By using label pod-template-hash in matchLabelKeys, you can ensure that only Pods of the same version are evaluated for podAffinity or podAntiAffinity.

apiVersion: apps/v1
kind: Deployment
metadata:
 name: application-server
...
 affinity:
 podAffinity:
 requiredDuringSchedulingIgnoredDuringExecution:
 - labelSelector:
 matchExpressions:
 - key: app
 operator: In
 values:
 - database
 topologyKey: topology.kubernetes.io/zone
 matchLabelKeys:
 - pod-template-hash

The above matchLabelKeys will be translated in Pods like:

kind: Pod
metadata:
 name: application-server
 labels:
 pod-template-hash: xyz
...
 affinity:
 podAffinity:
 requiredDuringSchedulingIgnoredDuringExecution:
 - labelSelector:
 matchExpressions:
 - key: app
 operator: In
 values:
 - database
 - key: pod-template-hash # Added from matchLabelKeys; Only Pods from the same replicaset will match this affinity.
 operator: In
 values:
 - xyz
 topologyKey: topology.kubernetes.io/zone
 matchLabelKeys:
 - pod-template-hash

mismatchLabelKeys - Service isolation

mismatchLabelKeys is a set of Pod label keys, like matchLabelKeys, which looks up the values of these keys from the new Pod's labels, and merge them with labelSelector as key notin (value) so that podAffinity does not match Pods that have the same key-value in labels.

Suppose all Pods for each tenant get tenant label via a controller or a manifest management tool like Helm.

Although the value of tenant label is unknown when composing each workload's manifest, the cluster admin wants to achieve exclusive 1:1 tenant to domain placement for a tenant isolation.

mismatchLabelKeys works for this usecase; By applying the following affinity globally using a mutating webhook, the cluster admin can ensure that the Pods from the same tenant will land on the same domain exclusively, meaning Pods from other tenants won't land on the same domain.

affinity:
 podAffinity: # ensures the pods of this tenant land on the same node pool
 requiredDuringSchedulingIgnoredDuringExecution:
 - matchLabelKeys:
 - tenant
 topologyKey: node-pool
 podAntiAffinity: # ensures only Pods from this tenant lands on the same node pool
 requiredDuringSchedulingIgnoredDuringExecution:
 - mismatchLabelKeys:
 - tenant
 labelSelector:
 matchExpressions:
 - key: tenant
 operator: Exists
 topologyKey: node-pool

The above matchLabelKeys and mismatchLabelKeys will be translated to like:

kind: Pod
metadata:
 name: application-server
 labels:
 tenant: service-a
spec:
 affinity:
 podAffinity: # ensures the pods of this tenant land on the same node pool
 requiredDuringSchedulingIgnoredDuringExecution:
 - matchLabelKeys:
 - tenant
 topologyKey: node-pool
 labelSelector:
 matchExpressions:
 - key: tenant
 operator: In
 values:
 - service-a 
 podAntiAffinity: # ensures only Pods from this tenant lands on the same node pool
 requiredDuringSchedulingIgnoredDuringExecution:
 - mismatchLabelKeys:
 - tenant
 labelSelector:
 matchExpressions:
 - key: tenant
 operator: Exists
 - key: tenant
 operator: NotIn
 values:
 - service-a
 topologyKey: node-pool

Getting involved

These features are managed by Kubernetes SIG Scheduling.

Please join us and share your feedback. We look forward to hearing from you!

How can I learn more?

16 Aug 2024 12:00am GMT

15 Aug 2024

feedKubernetes Blog

Kubernetes v1.31: Accelerating Cluster Performance with Consistent Reads from Cache

Kubernetes is renowned for its robust orchestration of containerized applications, but as clusters grow, the demands on the control plane can become a bottleneck. A key challenge has been ensuring strongly consistent reads from the etcd datastore, requiring resource-intensive quorum reads.

Today, the Kubernetes community is excited to announce a major improvement: consistent reads from cache, graduating to Beta in Kubernetes v1.31.

Why consistent reads matter

Consistent reads are essential for ensuring that Kubernetes components have an accurate view of the latest cluster state. Guaranteeing consistent reads is crucial for maintaining the accuracy and reliability of Kubernetes operations, enabling components to make informed decisions based on up-to-date information. In large-scale clusters, fetching and processing this data can be a performance bottleneck, especially for requests that involve filtering results. While Kubernetes can filter data by namespace directly within etcd, any other filtering by labels or field selectors requires the entire dataset to be fetched from etcd and then filtered in-memory by the Kubernetes API server. This is particularly impactful for components like the kubelet, which only needs to list pods scheduled to its node - but previously required the API Server and etcd to process all pods in the cluster.

The breakthrough: Caching with confidence

Kubernetes has long used a watch cache to optimize read operations. The watch cache stores a snapshot of the cluster state and receives updates through etcd watches. However, until now, it couldn't serve consistent reads directly, as there was no guarantee the cache was sufficiently up-to-date.

The consistent reads from cache feature addresses this by leveraging etcd's progress notifications mechanism. These notifications inform the watch cache about how current its data is compared to etcd. When a consistent read is requested, the system first checks if the watch cache is up-to-date. If the cache is not up-to-date, the system queries etcd for progress notifications until it's confirmed that the cache is sufficiently fresh. Once ready, the read is efficiently served directly from the cache, which can significantly improve performance, particularly in cases where it would require fetching a lot of data from etcd. This enables requests that filter data to be served from the cache, with only minimal metadata needing to be read from etcd.

Important Note: To benefit from this feature, your Kubernetes cluster must be running etcd version 3.4.31+ or 3.5.13+. For older etcd versions, Kubernetes will automatically fall back to serving consistent reads directly from etcd.

Performance gains you'll notice

This seemingly simple change has a profound impact on Kubernetes performance and scalability:

5k Node Scalability Test Results: In recent scalability tests on 5,000 node clusters, enabling consistent reads from cache delivered impressive improvements:

What's next?

With the graduation to beta, consistent reads from cache are enabled by default, offering a seamless performance boost to all Kubernetes users running a supported etcd version.

Our journey doesn't end here. Kubernetes community is actively exploring pagination support in the watch cache, which will unlock even more performance optimizations in the future.

Getting started

Upgrading to Kubernetes v1.31 and ensuring you are using etcd version 3.4.31+ or 3.5.13+ is the easiest way to experience the benefits of consistent reads from cache. If you have any questions or feedback, don't hesitate to reach out to the Kubernetes community.

Let us know how consistent reads from cache transforms your Kubernetes experience!

Special thanks to @ah8ad3 and @p0lyn0mial for their contributions to this feature!

15 Aug 2024 12:00am GMT

Kubernetes 1.31: VolumeAttributesClass for Volume Modification Beta

Volumes in Kubernetes have been described by two attributes: their storage class, and their capacity. The storage class is an immutable property of the volume, while the capacity can be changed dynamically with volume resize.

This complicates vertical scaling of workloads with volumes. While cloud providers and storage vendors often offer volumes which allow specifying IO quality of service (Performance) parameters like IOPS or throughput and tuning them as workloads operate, Kubernetes has no API which allows changing them.

We are pleased to announce that the VolumeAttributesClass KEP, alpha since Kubernetes 1.29, will be beta in 1.31. This provides a generic, Kubernetes-native API for modifying volume parameters like provisioned IO.

Like all new volume features in Kubernetes, this API is implemented via the container storage interface (CSI). In addition to the VolumeAttributesClass feature gate, your provisioner-specific CSI driver must support the new ModifyVolume API which is the CSI side of this feature.

See the full documentation for all details. Here we show the common workflow.

Dynamically modifying volume attributes.

A VolumeAttributesClass is a cluster-scoped resource that specifies provisioner-specific attributes. These are created by the cluster administrator in the same way as storage classes. For example, a series of gold, silver and bronze volume attribute classes can be created for volumes with greater or lessor amounts of provisioned IO.

apiVersion: storage.k8s.io/v1alpha1
kind: VolumeAttributesClass
metadata:
 name: silver
driverName: your-csi-driver
parameters:
 provisioned-iops: "500"
 provisioned-throughput: "50MiB/s"
---
apiVersion: storage.k8s.io/v1alpha1
kind: VolumeAttributesClass
metadata:
 name: gold
driverName: your-csi-driver
parameters:
 provisioned-iops: "10000"
 provisioned-throughput: "500MiB/s"

An attribute class is added to a PVC in much the same way as a storage class.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
 name: test-pv-claim
spec:
 storageClassName: any-storage-class
 volumeAttributesClassName: silver
 accessModes:
 - ReadWriteOnce
 resources:
 requests:
 storage: 64Gi

Unlike a storage class, the volume attributes class can be changed:

kubectl patch pvc test-pv-claim -p '{"spec": "volumeAttributesClassName": "gold"}'

Kubernetes will work with the CSI driver to update the attributes of the volume. The status of the PVC will track the current and desired attributes class. The PV resource will also be updated with the new volume attributes class which will be set to the currently active attributes of the PV.

Limitations with the beta

As a beta feature, there are still some features which are planned for GA but not yet present. The largest is quota support, see the KEP and discussion in sig-storage for details.

See the Kubernetes CSI driver list for up-to-date information of support for this feature in CSI drivers.

15 Aug 2024 12:00am GMT

14 Aug 2024

feedKubernetes Blog

Kubernetes v1.31: PersistentVolume Last Phase Transition Time Moves to GA

Announcing the graduation to General Availability (GA) of the PersistentVolume lastTransitionTime status field, in Kubernetes v1.31!

The Kubernetes SIG Storage team is excited to announce that the "PersistentVolumeLastPhaseTransitionTime" feature, introduced as an alpha in Kubernetes v1.28, has now reached GA status and is officially part of the Kubernetes v1.31 release. This enhancement helps Kubernetes users understand when a PersistentVolume transitions between different phases, allowing for more efficient and informed resource management.

For a v1.31 cluster, you can now assume that every PersistentVolume object has a .status.lastTransitionTime field, that holds a timestamp of when the volume last transitioned its phase. This change is not immediate; the new field will be populated whenever a PersistentVolume is updated and first transitions between phases (Pending, Bound, or Released) after upgrading to Kubernetes v1.31.

What changed?

The API strategy for updating PersistentVolume objects has been modified to populate the .status.lastTransitionTime field with the current timestamp whenever a PersistentVolume transitions phases. Users are allowed to set this field manually if needed, but it will be overwritten when the PersistentVolume transitions phases again.

For more details, read about Phase transition timestamp in the Kubernetes documentation. You can also read the previous blog post announcing the feature as alpha in v1.28.

To provide feedback, join our Kubernetes Storage Special-Interest-Group (SIG) or participate in discussions on our public Slack channel.

14 Aug 2024 12:00am GMT

Kubernetes 1.31: Moving cgroup v1 Support into Maintenance Mode

As Kubernetes continues to evolve and adapt to the changing landscape of container orchestration, the community has decided to move cgroup v1 support into maintenance mode in v1.31. This shift aligns with the broader industry's move towards cgroup v2, offering improved functionalities: including scalability and a more consistent interface. Before we dive into the consequences for Kubernetes, let's take a step back to understand what cgroups are and their significance in Linux.

Understanding cgroups

Control groups, or cgroups, are a Linux kernel feature that allows the allocation, prioritization, denial, and management of system resources (such as CPU, memory, disk I/O, and network bandwidth) among processes. This functionality is crucial for maintaining system performance and ensuring that no single process can monopolize system resources, which is especially important in multi-tenant environments.

There are two versions of cgroups: v1 and v2. While cgroup v1 provided sufficient capabilities for resource management, it had limitations that led to the development of cgroup v2. Cgroup v2 offers a more unified and consistent interface, on top of better resource control features.

Cgroups in Kubernetes

For Linux nodes, Kubernetes relies heavily on cgroups to manage and isolate the resources consumed by containers running in pods. Each container in Kubernetes is placed in its own cgroup, which allows Kubernetes to enforce resource limits, monitor usage, and ensure fair resource distribution among all containers.

How Kubernetes uses cgroups

Resource Allocation
Ensures that containers do not exceed their allocated CPU and memory limits.
Isolation
Isolates containers from each other to prevent resource contention.
Monitoring
Tracks resource usage for each container to provide insights and metrics.

Transitioning to Cgroup v2

The Linux community has been focusing on cgroup v2 for new features and improvements. Major Linux distributions and projects like systemd are transitioning towards cgroup v2. Using cgroup v2 provides several benefits over cgroupv1, such as Unified Hierarchy, Improved Interface, Better Resource Control, cgroup aware OOM killer, rootless support etc.

Given these advantages, Kubernetes is also making the move to embrace cgroup v2 more fully. However, this transition needs to be handled carefully to avoid disrupting existing workloads and to provide a smooth migration path for users.

Moving cgroup v1 support into maintenance mode

What does maintenance mode mean?

When cgroup v1 is placed into maintenance mode in Kubernetes, it means that:

  1. Feature Freeze: No new features will be added to cgroup v1 support.
  2. Security Fixes: Critical security fixes will still be provided.
  3. Best-Effort Bug Fixes: Major bugs may be fixed if feasible, but some issues might remain unresolved.

Why move to maintenance mode?

The move to maintenance mode is driven by the need to stay in line with the broader ecosystem and to encourage the adoption of cgroup v2, which offers better performance, security, and usability. By transitioning cgroup v1 to maintenance mode, Kubernetes can focus on enhancing support for cgroup v2 and ensure it meets the needs of modern workloads. It's important to note that maintenance mode does not mean deprecation; cgroup v1 will continue to receive critical security fixes and major bug fixes as needed.

What this means for cluster administrators

Users currently relying on cgroup v1 are highly encouraged to plan for the transition to cgroup v2. This transition involves:

  1. Upgrading Systems: Ensuring that the underlying operating systems and container runtimes support cgroup v2.
  2. Testing Workloads: Verifying that workloads and applications function correctly with cgroup v2.

Further reading

14 Aug 2024 12:00am GMT

13 Aug 2024

feedKubernetes Blog

Kubernetes v1.31: Elli

Editors: Matteo Bianchi, Yigit Demirbas, Abigail McCarthy, Edith Puclla, Rashan Smith

Announcing the release of Kubernetes v1.31: Elli!

Similar to previous releases, the release of Kubernetes v1.31 introduces new stable, beta, and alpha features. The consistent delivery of high-quality releases underscores the strength of our development cycle and the vibrant support from our community. This release consists of 45 enhancements. Of those enhancements, 11 have graduated to Stable, 22 are entering Beta, and 12 have graduated to Alpha.

Release theme and logo

The Kubernetes v1.31 Release Theme is "Elli".

Kubernetes v1.31's Elli is a cute and joyful dog, with a heart of gold and a nice sailor's cap, as a playful wink to the huge and diverse family of Kubernetes contributors.

Kubernetes v1.31 marks the first release after the project has successfully celebrated its first 10 years. Kubernetes has come a very long way since its inception, and it's still moving towards exciting new directions with each release. After 10 years, it is awe-inspiring to reflect on the effort, dedication, skill, wit and tiring work of the countless Kubernetes contributors who have made this a reality.

And yet, despite the herculean effort needed to run the project, there is no shortage of people who show up, time and again, with enthusiasm, smiles and a sense of pride for contributing and being part of the community. This "spirit" that we see from new and old contributors alike is the sign of a vibrant community, a "joyful" community, if we might call it that.

Kubernetes v1.31's Elli is all about celebrating this wonderful spirit! Here's to the next decade of Kubernetes!

Highlights of features graduating to Stable

This is a selection of some of the improvements that are now stable following the v1.31 release.

AppArmor support is now stable

Kubernetes support for AppArmor is now GA. Protect your containers using AppArmor by setting the appArmorProfile.type field in the container's securityContext. Note that before Kubernetes v1.30, AppArmor was controlled via annotations; starting in v1.30 it is controlled using fields. It is recommended that you should migrate away from using annotations and start using the appArmorProfile.type field.

To learn more read the AppArmor tutorial. This work was done as a part of KEP #24, by SIG Node.

Improved ingress connectivity reliability for kube-proxy

Kube-proxy improved ingress connectivity reliability is stable in v1.31. One of the common problems with load balancers in Kubernetes is the synchronization between the different components involved to avoid traffic drop. This feature implements a mechanism in kube-proxy for load balancers to do connection draining for terminating Nodes exposed by services of type: LoadBalancer and externalTrafficPolicy: Cluster and establish some best practices for cloud providers and Kubernetes load balancers implementations.

To use this feature, kube-proxy needs to run as default service proxy on the cluster and the load balancer needs to support connection draining. There are no specific changes required for using this feature, it has been enabled by default in kube-proxy since v1.30 and been promoted to stable in v1.31.

For more details about this feature please visit the Virtual IPs and Service Proxies documentation page.

This work was done as part of KEP #3836 by SIG Network.

Persistent Volume last phase transition time

Persistent Volume last phase transition time feature moved to GA in v1.31. This feature adds a PersistentVolumeStatus field which holds a timestamp of when a PersistentVolume last transitioned to a different phase. With this feature enabled, every PersistentVolume object will have a new field .status.lastTransitionTime, that holds a timestamp of when the volume last transitioned its phase. This change is not immediate; the new field will be populated whenever a PersistentVolume is updated and first transitions between phases (Pending, Bound, or Released) after upgrading to Kubernetes v1.31. This allows you to measure time between when a PersistentVolume moves from Pending to Bound. This can be also useful for providing metrics and SLOs.

For more details about this feature please visit the PersistentVolume documentation page.

This work was done as a part of KEP #3762 by SIG Storage.

Highlights of features graduating to Beta

This is a selection of some of the improvements that are now beta following the v1.31 release.

nftables backend for kube-proxy

The nftables backend moves to beta in v1.31, behind the NFTablesProxyMode feature gate which is now enabled by default.

The nftables API is the successor to the iptables API and is designed to provide better performance and scalability than iptables. The nftables proxy mode is able to process changes to service endpoints faster and more efficiently than the iptables mode, and is also able to more efficiently process packets in the kernel (though this only becomes noticeable in clusters with tens of thousands of services).

As of Kubernetes v1.31, the nftables mode is still relatively new, and may not be compatible with all network plugins; consult the documentation for your network plugin. This proxy mode is only available on Linux nodes, and requires kernel 5.13 or later. Before migrating, note that some features, especially around NodePort services, are not implemented exactly the same in nftables mode as they are in iptables mode. Check the migration guide to see if you need to override the default configuration.

This work was done as part of KEP #3866 by SIG Network.

Changes to reclaim policy for PersistentVolumes

The Always Honor PersistentVolume Reclaim Policy feature has advanced to beta in Kubernetes v1.31. This enhancement ensures that the PersistentVolume (PV) reclaim policy is respected even after the associated PersistentVolumeClaim (PVC) is deleted, thereby preventing the leakage of volumes.

Prior to this feature, the reclaim policy linked to a PV could be disregarded under specific conditions, depending on whether the PV or PVC was deleted first. Consequently, the corresponding storage resource in the external infrastructure might not be removed, even if the reclaim policy was set to "Delete". This led to potential inconsistencies and resource leaks.

With the introduction of this feature, Kubernetes now guarantees that the "Delete" reclaim policy will be enforced, ensuring the deletion of the underlying storage object from the backend infrastructure, regardless of the deletion sequence of the PV and PVC.

This work was done as a part of KEP #2644 and by SIG Storage.

Bound service account token improvements

The ServiceAccountTokenNodeBinding feature is promoted to beta in v1.31. This feature allows requesting a token bound only to a node, not to a pod, which includes node information in claims in the token and validates the existence of the node when the token is used. For more information, read the bound service account tokens documentation.

This work was done as part of KEP #4193 by SIG Auth.

Multiple Service CIDRs

Support for clusters with multiple Service CIDRs moves to beta in v1.31 (disabled by default).

There are multiple components in a Kubernetes cluster that consume IP addresses: Nodes, Pods and Services. Nodes and Pods IP ranges can be dynamically changed because depend on the infrastructure or the network plugin respectively. However, Services IP ranges are defined during the cluster creation as a hardcoded flag in the kube-apiserver. IP exhaustion has been a problem for long lived or large clusters, as admins needed to expand, shrink or even replace entirely the assigned Service CIDR range. These operations were never supported natively and were performed via complex and delicate maintenance operations, often causing downtime on their clusters. This new feature allows users and cluster admins to dynamically modify Service CIDR ranges with zero downtime.

For more details about this feature please visit the Virtual IPs and Service Proxies documentation page.

This work was done as part of KEP #1880 by SIG Network.

Traffic distribution for Services

Traffic distribution for Services moves to beta in v1.31 and is enabled by default.

After several iterations on finding the best user experience and traffic engineering capabilities for Services networking, SIG Networking implemented the trafficDistribution field in the Service specification, which serves as a guideline for the underlying implementation to consider while making routing decisions.

For more details about this feature please read the 1.30 Release Blog or visit the Service documentation page.

This work was done as part of KEP #4444 by SIG Network.

Kubernetes VolumeAttributesClass ModifyVolume

VolumeAttributesClass API is moving to beta in v1.31. The VolumeAttributesClass provides a generic, Kubernetes-native API for modifying dynamically volume parameters like provisioned IO. This allows workloads to vertically scale their volumes on-line to balance cost and performance, if supported by their provider. This feature had been alpha since Kubernetes 1.29.

This work was done as a part of KEP #3751 and lead by SIG Storage.

New features in Alpha

This is a selection of some of the improvements that are now alpha following the v1.31 release.

New DRA APIs for better accelerators and other hardware management

Kubernetes v1.31 brings an updated dynamic resource allocation (DRA) API and design. The main focus in the update is on structured parameters because they make resource information and requests transparent to Kubernetes and clients and enable implementing features like cluster autoscaling. DRA support in the kubelet was updated such that version skew between kubelet and the control plane is possible. With structured parameters, the scheduler allocates ResourceClaims while scheduling a pod. Allocation by a DRA driver controller is still supported through what is now called "classic DRA".

With Kubernetes v1.31, classic DRA has a separate feature gate named DRAControlPlaneController, which you need to enable explicitly. With such a control plane controller, a DRA driver can implement allocation policies that are not supported yet through structured parameters.

This work was done as part of KEP #3063 by SIG Node.

Support for image volumes

The Kubernetes community is moving towards fulfilling more Artificial Intelligence (AI) and Machine Learning (ML) use cases in the future.

One of the requirements to fulfill these use cases is to support Open Container Initiative (OCI) compatible images and artifacts (referred as OCI objects) directly as a native volume source. This allows users to focus on OCI standards as well as enables them to store and distribute any content using OCI registries.

Given that, v1.31 adds a new alpha feature to allow using an OCI image as a volume in a Pod. This feature allows users to specify an image reference as volume in a pod while reusing it as volume mount within containers. You need to enable the ImageVolume feature gate to try this out.

This work was done as part of KEP #4639 by SIG Node and SIG Storage.

Exposing device health information through Pod status

Expose device health information through the Pod Status is added as a new alpha feature in v1.31, disabled by default.

Before Kubernetes v1.31, the way to know whether or not a Pod is associated with the failed device is to use the PodResources API.

By enabling this feature, the field allocatedResourcesStatus will be added to each container status, within the .status for each Pod. The allocatedResourcesStatus field reports health information for each device assigned to the container.

This work was done as part of KEP #4680 by SIG Node.

Finer-grained authorization based on selectors

This feature allows webhook authorizers and future (but not currently designed) in-tree authorizers to allow list and watch requests, provided those requests use label and/or field selectors. For example, it is now possible for an authorizer to express: this user cannot list all pods, but can list all pods where .spec.nodeName matches some specific value. Or to allow a user to watch all Secrets in a namespace that are not labelled as confidential: true. Combined with CRD field selectors (also moving to beta in v1.31), it is possible to write more secure per-node extensions.

This work was done as part of KEP #4601 by SIG Auth.

Restrictions on anonymous API access

By enabling the feature gate AnonymousAuthConfigurableEndpoints users can now use the authentication configuration file to configure the endpoints that can be accessed by anonymous requests. This allows users to protect themselves against RBAC misconfigurations that can give anonymous users broad access to the cluster.

This work was done as a part of KEP #4633 and by SIG Auth.

Graduations, deprecations, and removals in 1.31

Graduations to Stable

This lists all the features that graduated to stable (also known as general availability). For a full list of updates including new features and graduations from alpha to beta, see the release notes.

This release includes a total of 11 enhancements promoted to Stable:

Deprecations and Removals

As Kubernetes develops and matures, features may be deprecated, removed, or replaced with better ones for the project's overall health. See the Kubernetes deprecation and removal policy for more details on this process.

Cgroup v1 enters the maintenance mode

As Kubernetes continues to evolve and adapt to the changing landscape of container orchestration, the community has decided to move cgroup v1 support into maintenance mode in v1.31. This shift aligns with the broader industry's move towards cgroup v2, offering improved functionality, scalability, and a more consistent interface. Kubernetes maintance mode means that no new features will be added to cgroup v1 support. Critical security fixes will still be provided, however, bug-fixing is now best-effort, meaning major bugs may be fixed if feasible, but some issues might remain unresolved.

It is recommended that you start switching to use cgroup v2 as soon as possible. This transition depends on your architecture, including ensuring the underlying operating systems and container runtimes support cgroup v2 and testing workloads to verify that workloads and applications function correctly with cgroup v2.

Please report any problems you encounter by filing an issue.

This work was done as part of KEP #4569 by SIG Node.

A note about SHA-1 signature support

In go1.18 (released in March 2022), the crypto/x509 library started to reject certificates signed with a SHA-1 hash function. While SHA-1 is established to be unsafe and publicly trusted Certificate Authorities have not issued SHA-1 certificates since 2015, there might still be cases in the context of Kubernetes where user-provided certificates are signed using a SHA-1 hash function through private authorities with them being used for Aggregated API Servers or webhooks. If you have relied on SHA-1 based certificates, you must explicitly opt back into its support by setting GODEBUG=x509sha1=1 in your environment.

Given Go's compatibility policy for GODEBUGs, the x509sha1 GODEBUG and the support for SHA-1 certificates will fully go away in go1.24 which will be released in the first half of 2025. If you rely on SHA-1 certificates, please start moving off them.

Please see Kubernetes issue #125689 to get a better idea of timelines around the support for SHA-1 going away, when Kubernetes releases plans to adopt go1.24, and for more details on how to detect usage of SHA-1 certificates via metrics and audit logging.

Deprecation of status.nodeInfo.kubeProxyVersion field for Nodes (KEP 4004)

The .status.nodeInfo.kubeProxyVersion field of Nodes has been deprecated in Kubernetes v1.31, and will be removed in a later release. It's being deprecated because the value of this field wasn't (and isn't) accurate. This field is set by the kubelet, which does not have reliable information about the kube-proxy version or whether kube-proxy is running.

The DisableNodeKubeProxyVersion feature gate will be set to true in by default in v1.31 and the kubelet will no longer attempt to set the .status.kubeProxyVersion field for its associated Node.

Removal of all in-tree integrations with cloud providers

As highlighted in a previous article, the last remaining in-tree support for cloud provider integration has been removed as part of the v1.31 release. This doesn't mean you can't integrate with a cloud provider, however you now must use the recommended approach using an external integration. Some integrations are part of the Kubernetes project and others are third party software.

This milestone marks the completion of the externalization process for all cloud providers' integrations from the Kubernetes core (KEP-2395), a process started with Kubernetes v1.26. This change helps Kubernetes to get closer to being a truly vendor-neutral platform.

For further details on the cloud provider integrations, read our v1.29 Cloud Provider Integrations feature blog. For additional context about the in-tree code removal, we invite you to check the (v1.29 deprecation blog).

The latter blog also contains useful information for users who need to migrate to version v1.29 and later.

Removal of in-tree provider feature gates

In Kubernetes v1.31, the following alpha feature gates InTreePluginAWSUnregister, InTreePluginAzureDiskUnregister, InTreePluginAzureFileUnregister, InTreePluginGCEUnregister, InTreePluginOpenStackUnregister, and InTreePluginvSphereUnregister have been removed. These feature gates were introduced to facilitate the testing of scenarios where in-tree volume plugins were removed from the codebase, without actually removing them. Since Kubernetes 1.30 had deprecated these in-tree volume plugins, these feature gates were redundant and no longer served a purpose. The only CSI migration gate still standing is InTreePluginPortworxUnregister, which will remain in alpha until the CSI migration for Portworx is completed and its in-tree volume plugin will be ready for removal.

Removal of kubelet --keep-terminated-pod-volumes command line flag

The kubelet flag --keep-terminated-pod-volumes, which was deprecated in 2017, has been removed as part of the v1.31 release.

You can find more details in the pull request #122082.

Removal of CephFS volume plugin

CephFS volume plugin was removed in this release and the cephfs volume type became non-functional.

It is recommended that you use the CephFS CSI driver as a third-party storage driver instead. If you were using the CephFS volume plugin before upgrading the cluster version to v1.31, you must re-deploy your application to use the new driver.

CephFS volume plugin was formally marked as deprecated in v1.28.

Removal of Ceph RBD volume plugin

The v1.31 release removes the Ceph RBD volume plugin and its CSI migration support, making the rbd volume type non-functional.

It's recommended that you use the RBD CSI driver in your clusters instead. If you were using Ceph RBD volume plugin before upgrading the cluster version to v1.31, you must re-deploy your application to use the new driver.

The Ceph RBD volume plugin was formally marked as deprecated in v1.28.

Deprecation of non-CSI volume limit plugins in kube-scheduler

The v1.31 release will deprecate all non-CSI volume limit scheduler plugins, and will remove some already deprected plugins from the default plugins, including:

It's recommended that you use the NodeVolumeLimits plugin instead because it can handle the same functionality as the removed plugins since those volume types have been migrated to CSI. Please replace the deprecated plugins with the NodeVolumeLimits plugin if you explicitly use them in the scheduler config. The AzureDiskLimits, CinderLimits, EBSLimits, and GCEPDLimits plugins will be removed in a future release.

These plugins will be removed from the default scheduler plugins list as they have been deprecated since Kubernetes v1.14.

Release notes and upgrade actions required

Check out the full details of the Kubernetes v1.31 release in our release notes.

Scheduler now uses QueueingHint when SchedulerQueueingHints is enabled

Added support to the scheduler to start using a QueueingHint registered for Pod/Updated events, to determine whether updates to previously unschedulable Pods have made them schedulable. The new support is active when the feature gate SchedulerQueueingHints is enabled.

Previously, when unschedulable Pods were updated, the scheduler always put Pods back to into a queue (activeQ / backoffQ). However not all updates to Pods make Pods schedulable, especially considering many scheduling constraints nowadays are immutable. Under the new behaviour, once unschedulable Pods are updated, the scheduling queue checks with QueueingHint(s) whether the update may make the pod(s) schedulable, and requeues them to activeQ or backoffQ only when at least one QueueingHint returns Queue.

Action required for custom scheduler plugin developers: Plugins have to implement a QueueingHint for Pod/Update event if the rejection from them could be resolved by updating unscheduled Pods themselves. Example: suppose you develop a custom plugin that denies Pods that have a schedulable=false label. Given Pods with a schedulable=false label will be schedulable if the schedulable=false label is removed, this plugin would implement QueueingHint for Pod/Update event that returns Queue when such label changes are made in unscheduled Pods. You can find more details in the pull request #122234.

Removal of kubelet --keep-terminated-pod-volumes command line flag

The kubelet flag --keep-terminated-pod-volumes, which was deprecated in 2017, was removed as part of the v1.31 release.

You can find more details in the pull request #122082.

Availability

Kubernetes v1.31 is available for download on GitHub or on the Kubernetes download page.

To get started with Kubernetes, check out these interactive tutorials or run local Kubernetes clusters using minikube. You can also easily install v1.31 using kubeadm.

Release team

Kubernetes is only possible with the support, commitment, and hard work of its community. Each release team is made up of dedicated community volunteers who work together to build the many pieces that make up the Kubernetes releases you rely on. This requires the specialized skills of people from all corners of our community, from the code itself to its documentation and project management.

We would like to thank the entire release team for the hours spent hard at work to deliver the Kubernetes v1.31 release to our community. The Release Team's membership ranges from first-time shadows to returning team leads with experience forged over several release cycles. A very special thanks goes out our release lead, Angelos Kolaitis, for supporting us through a successful release cycle, advocating for us, making sure that we could all contribute in the best way possible, and challenging us to improve the release process.

Project velocity

The CNCF K8s DevStats project aggregates a number of interesting data points related to the velocity of Kubernetes and various sub-projects. This includes everything from individual contributions to the number of companies that are contributing and is an illustration of the depth and breadth of effort that goes into evolving this ecosystem.

In the v1.31 release cycle, which ran for 14 weeks (May 7th to August 13th), we saw contributions to Kubernetes from 113 different companies and 528 individuals.

In the whole Cloud Native ecosystem we have 379 companies counting 2268 total contributors - which means that respect to the previous release cycle we experienced an astounding 63% increase on individuals contributing!

Source for this data:

By contribution we mean when someone makes a commit, code review, comment, creates an issue or PR, reviews a PR (including blogs and documentation) or comments on issues and PRs.

If you are interested in contributing visit this page to get started.

Check out DevStats to learn more about the overall velocity of the Kubernetes project and community.

Event update

Explore the upcoming Kubernetes and cloud-native events from August to November 2024, featuring KubeCon, KCD, and other notable conferences worldwide. Stay informed and engage with the Kubernetes community.

August 2024

September 2024

October 2024

November 2024

Upcoming release webinar

Join members of the Kubernetes v1.31 release team on Thursday, Thu Sep 12, 2024 10am PT to learn about the major features of this release, as well as deprecations and removals to help plan for upgrades. For more information and registration, visit the event page on the CNCF Online Programs site.

Get involved

The simplest way to get involved with Kubernetes is by joining one of the many Special Interest Groups (SIGs) that align with your interests. Have something you'd like to broadcast to the Kubernetes community? Share your voice at our weekly community meeting, and through the channels below. Thank you for your continued feedback and support.

13 Aug 2024 12:00am GMT

12 Aug 2024

feedKubernetes Blog

Introducing Feature Gates to Client-Go: Enhancing Flexibility and Control

Kubernetes components use on-off switches called feature gates to manage the risk of adding a new feature. The feature gate mechanism is what enables incremental graduation of a feature through the stages Alpha, Beta, and GA.

Kubernetes components, such as kube-controller-manager and kube-scheduler, use the client-go library to interact with the API. The same library is used across the Kubernetes ecosystem to build controllers, tools, webhooks, and more. client-go now includes its own feature gating mechanism, giving developers and cluster administrators more control over how they adopt client features.

To learn more about feature gates in Kubernetes, visit Feature Gates.

Motivation

In the absence of client-go feature gates, each new feature separated feature availability from enablement in its own way, if at all. Some features were enabled by updating to a newer version of client-go. Others needed to be actively configured in each program that used them. A few were configurable at runtime using environment variables. Consuming a feature-gated functionality exposed by the kube-apiserver sometimes required a client-side fallback mechanism to remain compatible with servers that don't support the functionality due to their age or configuration. In cases where issues were discovered in these fallback mechanisms, mitigation required updating to a fixed version of client-go or rolling back.

None of these approaches offer good support for enabling a feature by default in some, but not all, programs that consume client-go. Instead of enabling a new feature at first only for a single component, a change in the default setting immediately affects the default for all Kubernetes components, which broadens the blast radius significantly.

Feature gates in client-go

To address these challenges, substantial client-go features will be phased in using the new feature gate mechanism. It will allow developers and users to enable or disable features in a way that will be familiar to anyone who has experience with feature gates in the Kubernetes components.

Out of the box, simply by using a recent version of client-go, this offers several benefits.

For people who use software built with client-go:

For people who develop software built with client-go:

Overriding client-go feature gates

Note: This describes the default method for overriding client-go feature gates at runtime. It can be disabled or customized by the developer of a particular program. In Kubernetes components, client-go feature gate overrides are controlled by the --feature-gates flag.

Features of client-go can be enabled or disabled by setting environment variables prefixed with KUBE_FEATURE. For example, to enable a feature named MyFeature, set the environment variable as follows:

 KUBE_FEATURE_MyFeature=true

To disable the feature, set the environment variable to false:

 KUBE_FEATURE_MyFeature=false

Note: Environment variables are case-sensitive on some operating systems. Therefore, KUBE_FEATURE_MyFeature and KUBE_FEATURE_MYFEATURE would be considered two different variables.

Customizing client-go feature gates

The default environment-variable based mechanism for feature gate overrides can be sufficient for many programs in the Kubernetes ecosystem, and requires no special integration. Programs that require different behavior can replace it with their own custom feature gate provider. This allows a program to do things like force-disable a feature that is known to work poorly, read feature gates directly from a remote configuration service, or accept feature gate overrides through command-line options.

The Kubernetes components replace client-go's default feature gate provider with a shim to the existing Kubernetes feature gate provider. For all practical purposes, client-go feature gates are treated the same as other Kubernetes feature gates: they are wired to the --feature-gates command-line flag, included in feature enablement metrics, and logged on startup.

To replace the default feature gate provider, implement the Gates interface and call ReplaceFeatureGates at package initialization time, as in this simple example:

import (
 "k8s.io/client-go/features"
)

type AlwaysEnabledGates struct{}

func (AlwaysEnabledGates) Enabled(features.Feature) bool {
 return true
}

func init() {
 features.ReplaceFeatureGates(AlwaysEnabledGates{})
}

Implementations that need the complete list of defined client-go features can get it by implementing the Registry interface and calling AddFeaturesToExistingFeatureGates. For a complete example, refer to the usage within Kubernetes.

Summary

With the introduction of feature gates in client-go v1.30, rolling out a new client-go feature has become safer and easier. Users and developers can control the pace of their own adoption of client-go features. The work of Kubernetes contributors is streamlined by having a common mechanism for graduating features that span both sides of the Kubernetes API boundary.

Special shoutout to @sttts and @deads2k for their help in shaping this feature.

12 Aug 2024 12:00am GMT