25 Apr 2025

Kubernetes v1.33: User Namespaces enabled by default!

In Kubernetes v1.33, support for user namespaces is enabled by default. This means that, when the stack requirements are met, pods can opt in to use user namespaces, and there is no longer any Kubernetes feature flag to enable!

In this blog post we answer some common questions about user namespaces. But, before we dive into that, let's recap what user namespaces are and why they are important.

What is a user namespace?

Note: Linux user namespaces are a different concept from Kubernetes namespaces. The former is a Linux kernel feature; the latter is a Kubernetes feature.

Linux provides different namespaces to isolate processes from each other. For example, a typical Kubernetes pod runs within a network namespace to isolate the network identity and a PID namespace to isolate the processes.

One Linux namespace that was left behind is the user namespace. It isolates the UIDs and GIDs of the containers from the ones on the host. The identifiers in a container can be mapped to identifiers on the host in a way where host and container(s) never end up in overlapping UID/GIDs. Furthermore, the identifiers can be mapped to unprivileged, non-overlapping UIDs and GIDs on the host. This brings significant security benefits.

[Figure: user namespace IDs allocation. IDs 0-65535 are reserved to the host; pods use higher IDs.]

If a pod running as the root user without a user namespace manages to break out, it has root privileges on the node. If some capabilities were granted to the container, the capabilities are valid on the host too. None of this is true when using user namespaces (modulo bugs, of course 🙂).

Demos

Rodrigo created demos to show how some CVEs are mitigated when user namespaces are in use. We have shown them here before (see here and here), but take a look if you haven't:

Mitigation of CVE-2024-21626 with user namespaces:

Mitigation of CVE-2022-0492 with user namespaces:

Everything you wanted to know about user namespaces in Kubernetes

Here we try to answer some of the questions we have been asked about user namespaces support in Kubernetes.

1. What are the requirements to use it?

The requirements are documented here, but we will elaborate a bit more in the following questions.

Note this is a Linux-only feature.

2. How do I configure a pod to opt-in?

A complete step-by-step guide is available here, but the short version is that you need to set hostUsers: false in the pod spec. For example:

apiVersion: v1
kind: Pod
metadata:
  name: userns
spec:
  hostUsers: false
  containers:
  - name: shell
    command: ["sleep", "infinity"]
    image: debian

Yes, it is that simple. Applications will run just fine, without any other changes needed (unless your application needs host privileges).

User namespaces allow you to run as root inside the container without having privileges on the host. However, if your application needs privileges on the host, for example an app that needs to load a kernel module, then you can't use user namespaces.

3. What are idmap mounts, and why do the file-systems used need to support them?

Idmap mounts are a Linux kernel feature that applies a mapping of UIDs/GIDs when accessing a mount. When combined with user namespaces, they greatly simplify the support for volumes, as you can forget about the host UIDs/GIDs the user namespace is using.

Support for idmap mounts in the kernel is per file-system, and different kernel releases added idmap mount support for different file-systems.

To find which kernel version added support for each file-system, you can check out the mount_setattr man page, or the online version of it here.

Most popular file-systems are supported; the notable absence is NFS, which isn't supported yet.

4. Can you clarify exactly which file-systems need to support idmap mounts?

The file-systems that need to support idmap mounts are all the file-systems used by a pod in the pod.spec.volumes field.

This means: for PV/PVC volumes, the file-system used in the PV needs to support idmap mounts; for hostPath volumes, the file-system used in the hostPath needs to support idmap mounts.

What does this mean for secrets/configmaps/projected/downwardAPI volumes? For these volumes, the kubelet creates a tmpfs file-system. So, you will need a 6.3 or newer kernel to use these volumes (note that consuming the same data as environment variables is fine).

And what about emptyDir volumes? Those volumes are created by the kubelet by default in /var/lib/kubelet/pods/. You can also use a custom directory for this. But what needs to support idmap mounts is the file-system used in that directory.

The kubelet creates some more files for the container, such as /etc/hostname, /etc/resolv.conf, /dev/termination-log, and /etc/hosts. These files are also created in /var/lib/kubelet/pods/ by default, so it's important for the file-system used in that directory to support idmap mounts.

Also, some container runtimes may put some of these ephemeral volumes inside a tmpfs file-system, in which case you will need support for idmap mounts in tmpfs.

5. Can I use a kernel older than 6.3?

Yes, but you will need to make sure you are not using a tmpfs file-system. If you avoid that, you can use a 5.19 kernel (as long as all the other file-systems you use support idmap mounts in that kernel).

It can be tricky to avoid using tmpfs, though, as we just described above. Besides having to avoid those volume types, you will also have to avoid mounting the service account token. Every pod has it mounted by default, and it uses a projected volume that, as we mentioned, uses a tmpfs file-system.
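If you want to experiment with this, the token mount can be switched off per pod. Here is a minimal sketch, building on the earlier example (the pod name is illustrative); automountServiceAccountToken is a standard pod spec field, though whether this is sufficient still depends on the rest of your volumes:

apiVersion: v1
kind: Pod
metadata:
  name: userns-old-kernel
spec:
  hostUsers: false
  # The projected service account token volume is backed by tmpfs,
  # which only supports idmap mounts on kernels 6.3 and newer.
  automountServiceAccountToken: false
  containers:
  - name: shell
    command: ["sleep", "infinity"]
    image: debian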

You could even go lower than 5.19, all the way to 5.12. However, your container rootfs probably uses an overlayfs file-system, and idmap mount support for overlayfs was added in 5.19. We wouldn't recommend using a kernel older than 5.19, as not being able to use idmap mounts for the rootfs is a big limitation. If you absolutely need to, you can check a blog post Rodrigo wrote some years ago about tricks to use user namespaces when you can't support idmap mounts on the rootfs.

6. If my stack supports user namespaces, do I need to configure anything else?

No, if your stack supports it and you are using Kubernetes v1.33, there is nothing you need to configure. You should be able to follow the task: Use a user namespace with a pod.

However, if you have specific requirements, you may configure various options; you can find more information here. You can also enable a feature gate to relax the Pod Security Standards (PSS) rules.

7. The demos are nice, but are there more CVEs that this mitigates?

Yes, quite a lot, actually! Besides the ones in the demos, the KEP lists more CVEs you can check. That list is not exhaustive; there are many more.

8. Can you sum up why user namespaces are important?

Think about running a process as root, maybe even an untrusted process. Do you think that is secure? What if we try to contain it by adding seccomp and AppArmor profiles, masking some files in /proc (so it can't crash the node), and a few more tweaks?

Wouldn't it be better not to give it privileges in the first place, instead of playing whack-a-mole with all the possible ways root can escape?

This is what user namespaces do, along with some other goodies.

9. Is there container runtime documentation for user namespaces?

Yes, we have containerd documentation. This explains different limitations of containerd 1.7 and how to use user namespaces in containerd without Kubernetes pods (using ctr). Note that if you use containerd, you need containerd 2.0 or higher to use user namespaces with Kubernetes.

CRI-O doesn't have special documentation for user namespaces; it works out of the box.

10. What about the other container runtimes?

No other container runtime that we are aware of supports user namespaces with Kubernetes. That sadly includes cri-dockerd too.

11. I'd like to learn more about it, what would you recommend?

Rodrigo did an introduction to user namespaces at KubeCon 2022:

Also, this presentation at KubeCon 2023 can be useful as motivation for user namespaces:

Bear in mind that the presentations are a few years old and some things have changed since then. Use the Kubernetes documentation as the source of truth.

If you would like to learn more about the low-level details of user namespaces, you can check man 7 user_namespaces and man 1 unshare. You can easily create namespaces and experiment with how they behave. Be aware that the unshare tool has a lot of flexibility, and with that flexibility come options to create incomplete setups.

If you would like to know more about idmap mounts, you can check its Linux kernel documentation.

Conclusions

Running pods as root is not ideal, and running them as non-root is also hard with containers, as it can require a lot of changes to the applications. User namespaces are a unique feature that lets you have the best of both worlds: run as non-root, without any changes to your application.

This post covered what user namespaces are, why they are important, some real-world examples of CVEs mitigated by user namespaces, and some common questions. Hopefully, it helped you eliminate any remaining doubts, and you will now try user namespaces (if you haven't already!).

How do I get involved?

You can reach SIG Node through its usual channels, such as the #sig-node channel on the Kubernetes Slack and the SIG Node mailing list.

24 Apr 2025

Continuing the transition from Endpoints to EndpointSlices

Since the addition of EndpointSlices (KEP-752) as alpha in v1.15 and later GA in v1.21, the Endpoints API in Kubernetes has been gathering dust. New Service features like dual-stack networking and traffic distribution are only supported via the EndpointSlice API, so all service proxies, Gateway API implementations, and similar controllers have had to be ported from using Endpoints to using EndpointSlices. At this point, the Endpoints API is really only there to avoid breaking end user workloads and scripts that still make use of it.

As of Kubernetes 1.33, the Endpoints API is now officially deprecated, and the API server will return warnings to users who read or write Endpoints resources rather than using EndpointSlices.

Eventually, the plan (as documented in KEP-4974) is to change the Kubernetes Conformance criteria to no longer require that clusters run the Endpoints controller (which generates Endpoints objects based on Services and Pods), to avoid doing work that is unneeded in most modern-day clusters.

Thus, while the Kubernetes deprecation policy means that the Endpoints type itself will probably never completely go away, users who still have workloads or scripts that use the Endpoints API should start migrating them to EndpointSlices.

Notes on migrating from Endpoints to EndpointSlices

Consuming EndpointSlices rather than Endpoints

For end users, the biggest change between the Endpoints API and the EndpointSlice API is that while every Service with a selector has exactly 1 Endpoints object (with the same name as the Service), a Service may have any number of EndpointSlices associated with it:

$ kubectl get endpoints myservice
Warning: v1 Endpoints is deprecated in v1.33+; use discovery.k8s.io/v1 EndpointSlice
NAME        ENDPOINTS         AGE
myservice   10.180.3.17:443   1h

$ kubectl get endpointslice -l kubernetes.io/service-name=myservice
NAME              ADDRESSTYPE   PORTS   ENDPOINTS          AGE
myservice-7vzhx   IPv4          443     10.180.3.17        21s
myservice-jcv8s   IPv6          443     2001:db8:0123::5   21s

In this case, because the service is dual stack, it has 2 EndpointSlices: 1 for IPv4 addresses and 1 for IPv6 addresses. (The Endpoints API does not support dual stack, so the Endpoints object shows only the addresses in the cluster's primary address family.) Although any Service with multiple endpoints can have multiple EndpointSlices, there are three main cases where you will see this, dual-stack Services being one of them.

Because there is not a predictable 1-to-1 mapping between Services and EndpointSlices, there is no way to know what the actual name of the EndpointSlice resource(s) for a Service will be ahead of time; thus, instead of fetching the EndpointSlice(s) by name, you instead ask for all EndpointSlices with a "kubernetes.io/service-name" label pointing to the Service:

$ kubectl get endpointslice -l kubernetes.io/service-name=myservice

A similar change is needed in Go code. With Endpoints, you would do something like:

// Get the Endpoints named `name` in `namespace`.
endpoint, err := client.CoreV1().Endpoints(namespace).Get(ctx, name, metav1.GetOptions{})
if err != nil {
	if apierrors.IsNotFound(err) {
		// No Endpoints exists for the Service (yet?)
		...
	}
	// handle other errors
	...
}

// process `endpoint`
...

With EndpointSlices, this becomes:

// Get all EndpointSlices for Service `name` in `namespace`.
slices, err := client.DiscoveryV1().EndpointSlices(namespace).List(ctx,
	metav1.ListOptions{LabelSelector: discoveryv1.LabelServiceName + "=" + name})
if err != nil {
	// handle errors
	...
} else if len(slices.Items) == 0 {
	// No EndpointSlices exist for the Service (yet?)
	...
}

// process `slices.Items`
...

Generating EndpointSlices rather than Endpoints

For people (or controllers) generating Endpoints, migrating to EndpointSlices is slightly easier, because in most cases you won't have to worry about multiple slices. You just need to update your YAML or Go code to use the new type (which organizes the information in a slightly different way than Endpoints did).

For example, this Endpoints object:

apiVersion: v1
kind: Endpoints
metadata:
  name: myservice
subsets:
  - addresses:
      - ip: 10.180.3.17
        nodeName: node-4
      - ip: 10.180.5.22
        nodeName: node-9
      - ip: 10.180.18.2
        nodeName: node-7
    notReadyAddresses:
      - ip: 10.180.6.6
        nodeName: node-8
    ports:
      - name: https
        protocol: TCP
        port: 443

would become something like:

apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: myservice
  labels:
    kubernetes.io/service-name: myservice
addressType: IPv4
endpoints:
  - addresses:
      - 10.180.3.17
    nodeName: node-4
  - addresses:
      - 10.180.5.22
    nodeName: node-9
  - addresses:
      - 10.180.18.2
    nodeName: node-7
  - addresses:
      - 10.180.6.6
    nodeName: node-8
    conditions:
      ready: false
ports:
  - name: https
    protocol: TCP
    port: 443

Some points to note:

  1. This example uses an explicit name, but you could also use generateName and let the API server append a unique suffix. The name itself does not matter: what matters is the "kubernetes.io/service-name" label pointing back to the Service.

  2. You have to explicitly indicate addressType: IPv4 (or IPv6).

  3. An EndpointSlice is similar to a single element of the "subsets" array in Endpoints. An Endpoints object with multiple subsets will normally need to be expressed as multiple EndpointSlices, each with different "ports".

  4. The endpoints and addresses fields are both arrays, but by convention, each addresses array only contains a single element. If your Service has multiple endpoints, then you need to have multiple elements in the endpoints array, each with a single element in its addresses array.

  5. The Endpoints API lists "ready" and "not-ready" endpoints separately, while the EndpointSlice API allows each endpoint to have conditions (such as "ready: false") associated with it.

And of course, once you have ported to EndpointSlice, you can make use of EndpointSlice-specific features, such as topology hints and terminating endpoints. Consult the EndpointSlice API documentation for more information.
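For reference, here is a hedged sketch of what a single endpoint using those features can look like (field names from the discovery.k8s.io/v1 API; the zone values are placeholders):

endpoints:
  - addresses:
      - 10.180.3.17
    nodeName: node-4
    zone: us-east-1a            # placeholder zone name
    conditions:
      ready: true
      serving: true
      terminating: false
    hints:
      forZones:                 # topology hint: prefer routing from this zone
        - name: us-east-1a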

23 Apr 2025

Kubernetes v1.33: Octarine

Editors: Agustina Barbetta, Aakanksha Bhende, Udi Hofesh, Ryota Sawada, Sneha Yadav

Similar to previous releases, the release of Kubernetes v1.33 introduces new stable, beta, and alpha features. The consistent delivery of high-quality releases underscores the strength of our development cycle and the vibrant support from our community.

This release consists of 64 enhancements. Of those enhancements, 18 have graduated to Stable, 20 are entering Beta, 24 have entered Alpha, and 2 are deprecated or withdrawn.

There are also several notable deprecations and removals in this release; make sure to read about those if you already run an older version of Kubernetes.

Release theme and logo

The theme for Kubernetes v1.33 is Octarine: The Color of Magic[1], inspired by Terry Pratchett's Discworld series. This release highlights the open source magic[2] that Kubernetes enables across the ecosystem.

If you're familiar with the world of Discworld, you might recognize a small swamp dragon perched atop the tower of the Unseen University, gazing up at the Kubernetes moon above the city of Ankh-Morpork with 64 stars[3] in the background.

As Kubernetes moves into its second decade, we celebrate the wizardry of its maintainers, the curiosity of new contributors, and the collaborative spirit that fuels the project. The v1.33 release is a reminder that, as Pratchett wrote, "It's still magic even if you know how it's done." Even if you know the ins and outs of the Kubernetes code base, stepping back at the end of the release cycle, you'll realize that Kubernetes remains magical.

Kubernetes v1.33 is a testament to the enduring power of open source innovation, where hundreds of contributors[4] from around the world work together to create something truly extraordinary. Behind every new feature, the Kubernetes community works to maintain and improve the project, ensuring it remains secure, reliable, and released on time. Each release builds upon the other, creating something greater than we could achieve alone.

[1] Octarine is the mythical eighth color, visible only to those attuned to the arcane: wizards, witches, and, of course, cats. And occasionally, someone who's stared at iptables rules for too long.
[2] Any sufficiently advanced technology is indistinguishable from magic…?
[3] It's not a coincidence that 64 KEPs (Kubernetes Enhancement Proposals) are also included in v1.33.
[4] See the Project Velocity section for v1.33 🚀

Spotlight on key updates

Kubernetes v1.33 is packed with new features and improvements. Here are a few select updates the Release Team would like to highlight!

Stable: Sidecar containers

The sidecar pattern involves deploying separate auxiliary container(s) to handle extra capabilities in areas such as networking, logging, and metrics gathering. Sidecar containers graduate to stable in v1.33.

Kubernetes implements sidecars as a special class of init containers with restartPolicy: Always, ensuring that sidecars start before application containers, remain running throughout the pod's lifecycle, and terminate automatically after the main containers exit.

Additionally, sidecars can utilize probes (startup, readiness, liveness) to signal their operational state, and their Out-Of-Memory (OOM) score adjustments are aligned with primary containers to prevent premature termination under memory pressure.
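As an illustration, a sidecar is declared as an init container with restartPolicy: Always. A minimal sketch (names and images are placeholders):

spec:
  initContainers:
  - name: log-shipper            # sidecar: starts before the app and outlives it
    image: alpine:latest
    restartPolicy: Always        # this is what makes it a sidecar
    command: ["sh", "-c", "tail -F /var/log/app.log"]
  containers:
  - name: app
    image: registry.example/app:1.0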

To learn more, read Sidecar Containers.

This work was done as part of KEP-753: Sidecar Containers led by SIG Node.

Beta: In-place resource resize for vertical scaling of Pods

Workloads can be defined using APIs like Deployment, StatefulSet, etc. These describe the template for the Pods that should run, including memory and CPU resources, as well as the number of Pod replicas that should run. Workloads can be scaled horizontally by updating the Pod replica count, or vertically by updating the resources required in the Pod's container(s). Before this enhancement, container resources defined in a Pod's spec were immutable, and updating any of these details within a Pod template would trigger Pod replacement.

But what if you could dynamically update the resource configuration for your existing Pods without restarting them?

KEP-1287 allows precisely such in-place Pod updates. It was released as alpha in v1.27 and has graduated to beta in v1.33. This opens up various possibilities: vertical scale-up of stateful processes without any downtime, seamless scale-down when traffic is low, and even allocating larger resources during startup that can be reduced once the initial setup is complete.
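Containers can also declare how each resource change should be applied via resizePolicy. A minimal sketch, assuming the beta feature is enabled in your cluster (container name, image, and values are placeholders):

spec:
  containers:
  - name: app
    image: registry.example/app:1.0
    resources:
      requests:
        cpu: "500m"
        memory: "256Mi"
    resizePolicy:
    - resourceName: cpu
      restartPolicy: NotRequired       # CPU can be resized in place
    - resourceName: memory
      restartPolicy: RestartContainer  # memory changes restart this container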

This work was done as part of KEP-1287: In-Place Update of Pod Resources led by SIG Node and SIG Autoscaling.

Alpha: New configuration option for kubectl with .kuberc for user preferences

In v1.33, kubectl introduces a new alpha feature: an opt-in configuration file, .kuberc, for user preferences. This file can contain kubectl aliases and overrides (e.g. defaulting to use server-side apply), while leaving cluster credentials and host information in kubeconfig. This separation allows sharing the same user preferences for kubectl interaction, regardless of the target cluster and kubeconfig used.

To enable this alpha feature, users can set the environment variable of KUBECTL_KUBERC=true and create a .kuberc configuration file. By default, kubectl looks for this file in ~/.kube/kuberc. You can also specify an alternative location using the --kuberc flag, for example: kubectl --kuberc /var/kube/rc.
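As a rough, hypothetical sketch of what such a preferences file might contain (the exact alpha schema may differ; treat the field names below as assumptions and consult the kubectl documentation for the authoritative format):

# ~/.kube/kuberc (hypothetical sketch; verify field names against the docs)
apiVersion: kubectl.config.k8s.io/v1alpha1
kind: Preference
aliases:
- name: getn                  # "kubectl getn" as an alias for "kubectl get"
  command: get
defaults:
- command: apply
  options:
  - name: server-side         # default "kubectl apply" to server-side apply
    default: "true"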

This work was done as part of KEP-3104: Separate kubectl user preferences from cluster configs led by SIG CLI.

Features graduating to Stable

This is a selection of some of the improvements that are now stable following the v1.33 release.

Backoff limits per index for indexed Jobs

This release graduates a feature that allows setting backoff limits on a per-index basis for Indexed Jobs. Traditionally, the backoffLimit parameter in Kubernetes Jobs specifies the number of retries before considering the entire Job as failed. This enhancement allows each index within an Indexed Job to have its own backoff limit, providing more granular control over retry behavior for individual tasks. This ensures that the failure of specific indices does not prematurely terminate the entire Job, allowing the other indices to continue processing independently.
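A sketch of an Indexed Job using a per-index backoff limit (names, image, and values are illustrative):

apiVersion: batch/v1
kind: Job
metadata:
  name: per-index-retries
spec:
  completions: 10
  parallelism: 3
  completionMode: Indexed
  backoffLimitPerIndex: 1   # each index may be retried once
  maxFailedIndexes: 5       # fail the whole Job if more than 5 indexes fail
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: registry.example/worker:1.0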

This work was done as part of KEP-3850: Backoff Limit Per Index For Indexed Jobs led by SIG Apps.

Job success policy

Using .spec.successPolicy, users can specify which pod indexes must succeed (succeededIndexes), how many pods must succeed (succeededCount), or a combination of both. This feature benefits various workloads, including simulations where partial completion is sufficient, and leader-worker patterns where only the leader's success determines the Job's overall outcome.
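A sketch of a leader-worker style Job where only index 0 (the leader) has to succeed (names and image are placeholders):

apiVersion: batch/v1
kind: Job
metadata:
  name: leader-workers
spec:
  completions: 4
  parallelism: 4
  completionMode: Indexed
  successPolicy:
    rules:
    - succeededIndexes: "0"   # the Job succeeds once the leader index succeeds
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: registry.example/worker:1.0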

This work was done as part of KEP-3998: Job success/completion policy led by SIG Apps.

Bound ServiceAccount token security improvements

This enhancement introduced features such as including a unique token identifier (i.e. JWT ID Claim, also known as JTI) and node information within the tokens, enabling more precise validation and auditing. Additionally, it supports node-specific restrictions, ensuring that tokens are only usable on designated nodes, thereby reducing the risk of token misuse and potential security breaches. These improvements, now generally available, aim to enhance the overall security posture of service account tokens within Kubernetes clusters.

This work was done as part of KEP-4193: Bound service account token improvements led by SIG Auth.

Subresource support in kubectl

The --subresource argument is now generally available for kubectl subcommands such as get, patch, edit, apply and replace, allowing users to fetch and update subresources for all resources that support them. To learn more about the subresources supported, visit the kubectl reference.

This work was done as part of KEP-2590: Add subresource support to kubectl led by SIG CLI.

Multiple Service CIDRs

This enhancement introduced a new implementation of allocation logic for Service IPs. Across the whole cluster, every Service of type: ClusterIP must have a unique IP address assigned to it. Trying to create a Service with a specific cluster IP that has already been allocated will return an error. The updated IP address allocator logic uses two newly stable API objects: ServiceCIDR and IPAddress. Now generally available, these APIs allow cluster administrators to dynamically increase the number of IP addresses available for type: ClusterIP Services (by creating new ServiceCIDR objects).
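For example, an administrator could extend the Service IP space by creating an additional ServiceCIDR object; a sketch (the name and range are placeholders):

apiVersion: networking.k8s.io/v1
kind: ServiceCIDR
metadata:
  name: extra-service-cidr
spec:
  cidrs:
  - 10.96.100.0/24   # additional range for ClusterIP allocation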

This work was done as part of KEP-1880: Multiple Service CIDRs led by SIG Network.

nftables backend for kube-proxy

The nftables backend for kube-proxy is now stable, adding a new implementation that significantly improves performance and scalability for Services implementation within Kubernetes clusters. For compatibility reasons, iptables remains the default on Linux nodes. Check the migration guide if you want to try it out.

This work was done as part of KEP-3866: nftables kube-proxy backend led by SIG Network.

Topology aware routing with trafficDistribution: PreferClose

This release graduates topology-aware routing and traffic distribution to GA, which allows you to optimize service traffic in multi-zone clusters. Topology-aware hints in EndpointSlices enable components like kube-proxy to prioritize routing traffic to endpoints within the same zone, thereby reducing latency and cross-zone data transfer costs. Building upon this, the trafficDistribution field was added to the Service specification, with the PreferClose option directing traffic to the nearest available endpoints based on network topology. This configuration enhances performance and cost-efficiency by minimizing inter-zone communication.
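In a Service manifest this is a single field; a sketch (name, selector, and port are placeholders):

apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: my-app
  ports:
  - port: 443
  trafficDistribution: PreferClose   # prefer endpoints close to the client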

This work was done as part of KEP-4444: Traffic Distribution for Services and KEP-2433: Topology Aware Routing led by SIG Network.

Options to reject non SMT-aligned workload

This feature added policy options to the CPU Manager, enabling it to reject workloads that do not align with Simultaneous Multithreading (SMT) configurations. This enhancement, now generally available, ensures that when a pod requests exclusive use of CPU cores, the CPU Manager can enforce allocation of entire core pairs (comprising primary and sibling threads) on SMT-enabled systems, thereby preventing scenarios where workloads share CPU resources in unintended ways.
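This is configured through the CPU Manager's policy options in the kubelet configuration; a sketch of the relevant fragment:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
cpuManagerPolicyOptions:
  full-pcpus-only: "true"   # reject pods that cannot be allocated whole core pairs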

This work was done as part of KEP-2625: node: cpumanager: add options to reject non SMT-aligned workload led by SIG Node.

Defining Pod affinity or anti-affinity using matchLabelKeys and mismatchLabelKeys

The matchLabelKeys and mismatchLabelKeys fields are available in Pod affinity terms, enabling users to finely control the scope where Pods are expected to co-exist (Affinity) or not (AntiAffinity). These newly stable options complement the existing labelSelector mechanism. The affinity fields facilitate enhanced scheduling for versatile rolling updates, as well as isolation of services managed by tools or controllers based on global configurations.
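A sketch of a podAffinity term using matchLabelKeys to scope co-location to Pods from the same rollout (pod-template-hash is the label the Deployment controller sets on each ReplicaSet's Pods; the app label is a placeholder):

spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: my-app
        matchLabelKeys:
        - pod-template-hash   # only consider Pods from the same rollout generation
        topologyKey: kubernetes.io/hostname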

This work was done as part of KEP-3633: Introduce MatchLabelKeys to Pod Affinity and Pod Anti Affinity led by SIG Scheduling.

Considering taints and tolerations when calculating Pod topology spread skew

This enhancement improved PodTopologySpread by introducing two fields: nodeAffinityPolicy and nodeTaintsPolicy. These fields allow users to specify whether node affinity rules and node taints should be considered when calculating pod distribution across nodes. By default, nodeAffinityPolicy is set to Honor, meaning only nodes matching the pod's node affinity or selector are included in the distribution calculation. The nodeTaintsPolicy defaults to Ignore, indicating that node taints are not considered unless specified. This enhancement provides finer control over pod placement, ensuring that pods are scheduled on nodes that meet both affinity and taint toleration requirements, thereby preventing scenarios where pods remain pending due to unsatisfied constraints.
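A sketch of a topology spread constraint that opts into both fields (the label selector is a placeholder):

spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: my-app
    nodeAffinityPolicy: Honor   # only count nodes matching the pod's node affinity
    nodeTaintsPolicy: Honor     # only count nodes whose taints the pod tolerates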

This work was done as part of KEP-3094: Take taints/tolerations into consideration when calculating PodTopologySpread skew led by SIG Scheduling.

Volume populators

After being released as beta in v1.24, volume populators have graduated to GA in v1.33. This newly stable feature provides a way to allow users to pre-populate volumes with data from various sources, and not just from PersistentVolumeClaim (PVC) clones or volume snapshots. The mechanism relies on the dataSourceRef field within a PersistentVolumeClaim. This field offers more flexibility than the existing dataSource field, and allows for custom resources to be used as data sources.

A special controller, volume-data-source-validator, validates these data source references, alongside a newly stable CustomResourceDefinition (CRD) for an API kind named VolumePopulator. The VolumePopulator API allows volume populator controllers to register the types of data sources they support. You need to set up your cluster with the appropriate CRD in order to use volume populators.
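A sketch of a PVC populated from a custom resource; the API group and Kind below are hypothetical and would come from a volume populator you install:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: populated-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi
  dataSourceRef:
    apiGroup: populators.example.io   # hypothetical populator API group
    kind: DataImport                  # hypothetical custom resource kind
    name: my-data-import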

This work was done as part of KEP-1495: Generic data populators led by SIG Storage.

Always honor PersistentVolume reclaim policy

This enhancement addressed an issue where the Persistent Volume (PV) reclaim policy is not consistently honored, leading to potential storage resource leaks. Specifically, if a PV is deleted before its associated Persistent Volume Claim (PVC), the "Delete" reclaim policy may not be executed, leaving the underlying storage assets intact. To mitigate this, Kubernetes now sets finalizers on relevant PVs, ensuring that the reclaim policy is enforced regardless of the deletion sequence. This enhancement prevents unintended retention of storage resources and maintains consistency in PV lifecycle management.

This work was done as part of KEP-2644: Always Honor PersistentVolume Reclaim Policy led by SIG Storage.

New features in Beta

This is a selection of some of the improvements that are now beta following the v1.33 release.

Support for Direct Service Return (DSR) in Windows kube-proxy

DSR provides performance optimizations by allowing the return traffic routed through load balancers to bypass the load balancer and respond directly to the client, reducing load on the load balancer and also reducing overall latency. For information on DSR on Windows, read Direct Server Return (DSR) in a nutshell.

Initially introduced in v1.14, support for DSR has been promoted to beta by SIG Windows as part of KEP-5100: Support for Direct Service Return (DSR) and overlay networking in Windows kube-proxy.

Structured parameter support

While structured parameter support continues as a beta feature in Kubernetes v1.33, this core part of Dynamic Resource Allocation (DRA) has seen significant improvements. A new v1beta2 version simplifies the resource.k8s.io API, and regular users with the namespaced cluster edit role can now use DRA.

The kubelet now includes seamless upgrade support, enabling drivers deployed as DaemonSets to use a rolling update mechanism. For DRA implementations, this prevents the deletion and re-creation of ResourceSlices, allowing them to remain unchanged during upgrades. Additionally, a 30-second grace period has been introduced before the kubelet cleans up after unregistering a driver, providing better support for drivers that do not use rolling updates.

This work was done as part of KEP-4381: DRA: structured parameters by WG Device Management, a cross-functional team including SIG Node, SIG Scheduling, and SIG Autoscaling.

Dynamic Resource Allocation (DRA) for network interfaces

The standardized reporting of network interface data via DRA, introduced in v1.32, has graduated to beta in v1.33. This enables more native Kubernetes network integrations, simplifying the development and management of networking devices. This was covered previously in the v1.32 release announcement blog.

This work was done as part of KEP-4817: DRA: Resource Claim Status with possible standardized network interface data led by SIG Network, SIG Node, and WG Device Management.

Handle unscheduled pods early when scheduler does not have any pod on activeQ

This feature improves queue scheduling behavior. Behind the scenes, when the activeQ is empty, the scheduler pops pods from the backoffQ that are not backed off due to errors. Previously, the scheduler could sit idle in that situation; this enhancement improves scheduling efficiency by preventing it.

This work was done as part of KEP-5142: Pop pod from backoffQ when activeQ is empty led by SIG Scheduling.

Asynchronous preemption in the Kubernetes Scheduler

Preemption ensures higher-priority pods get the resources they need by evicting lower-priority ones. Asynchronous Preemption, introduced in v1.32 as alpha, has graduated to beta in v1.33. With this enhancement, heavy operations such as API calls to delete pods are processed in parallel, allowing the scheduler to continue scheduling other pods without delays. This improvement is particularly beneficial in clusters with high Pod churn or frequent scheduling failures, ensuring a more efficient and resilient scheduling process.

This work was done as part of KEP-4832: Asynchronous preemption in the scheduler led by SIG Scheduling.

ClusterTrustBundles

ClusterTrustBundle, a cluster-scoped resource designed for holding X.509 trust anchors (root certificates), has graduated to beta in v1.33. This API makes it easier for in-cluster certificate signers to publish and communicate X.509 trust anchors to cluster workloads.

This work was done as part of KEP-3257: ClusterTrustBundles (previously Trust Anchor Sets) led by SIG Auth.

Fine-grained SupplementalGroups control

Introduced in v1.31, this feature graduates to beta in v1.33 and is now enabled by default. Provided that your cluster has the SupplementalGroupsPolicy feature gate enabled, the supplementalGroupsPolicy field within a Pod's securityContext supports two policies: the default Merge policy maintains backward compatibility by combining specified groups with those from the container image's /etc/group file, whereas the new Strict policy applies only to explicitly defined groups.

This enhancement helps to address security concerns where implicit group memberships from container images could lead to unintended file access permissions and bypass policy controls.
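A sketch of a Pod opting into the Strict policy (name, image, and GID are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: strict-groups
spec:
  securityContext:
    supplementalGroups: [4000]
    supplementalGroupsPolicy: Strict   # ignore groups from the image's /etc/group
  containers:
  - name: app
    image: registry.example/app:1.0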

This work was done as part of KEP-3619: Fine-grained SupplementalGroups control led by SIG Node.

Support for mounting images as volumes

Support for using Open Container Initiative (OCI) images as volumes in Pods, introduced in v1.31, has graduated to beta. This feature allows users to specify an image reference as a volume in a Pod while reusing it as a volume mount within containers. It opens up the possibility of packaging the volume data separately, and sharing them among containers in a Pod without including them in the main image, thereby reducing vulnerabilities and simplifying image creation.
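A sketch of mounting an OCI image as a volume (the image reference is a placeholder):

apiVersion: v1
kind: Pod
metadata:
  name: image-volume
spec:
  containers:
  - name: shell
    image: debian
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    image:
      reference: registry.example/data-artifact:1.0   # placeholder OCI reference
      pullPolicy: IfNotPresent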

This work was done as part of KEP-4639: VolumeSource: OCI Artifact and/or Image led by SIG Node and SIG Storage.

Support for user namespaces within Linux Pods

One of the oldest open KEPs as of writing is KEP-127, Pod security improvement by using Linux User namespaces for Pods. This KEP was first opened in late 2016, and after multiple iterations, had its alpha release in v1.25, initial beta in v1.30 (where it was disabled by default), and has moved to on-by-default beta as part of v1.33.

This support will not impact existing Pods unless you manually specify pod.spec.hostUsers to opt in. As highlighted in the v1.30 sneak peek blog, this is an important milestone for mitigating vulnerabilities.

This work was done as part of KEP-127: Support User Namespaces in pods led by SIG Node.

Pod procMount option

The procMount option, introduced as alpha in v1.12, and off-by-default beta in v1.31, has moved to an on-by-default beta in v1.33. This enhancement improves Pod isolation by allowing users to fine-tune access to the /proc filesystem. Specifically, it adds a field to the Pod securityContext that lets you override the default behavior of masking and marking certain /proc paths as read-only. This is particularly useful for scenarios where users want to run unprivileged containers inside the Kubernetes Pod using user namespaces. Normally, the container runtime (via the CRI implementation) starts the outer container with strict /proc mount settings. However, to successfully run nested containers with an unprivileged Pod, users need a mechanism to relax those defaults, and this feature provides exactly that.
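A sketch of a Pod that relaxes the default /proc masking; note that the unmasked mode is meant to be combined with user namespaces (hostUsers: false), and the image here is a placeholder:

apiVersion: v1
kind: Pod
metadata:
  name: nested-containers
spec:
  hostUsers: false            # run in a user namespace
  containers:
  - name: runner
    image: registry.example/nested-runtime:1.0   # placeholder image
    securityContext:
      procMount: Unmasked     # don't mask or read-only protect /proc paths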

This work was done as part of KEP-4265: add ProcMount option led by SIG Node.

CPUManager policy to distribute CPUs across NUMA nodes

This feature adds a new policy option for the CPU Manager to distribute CPUs across Non-Uniform Memory Access (NUMA) nodes, rather than concentrating them on a single node. It optimizes CPU resource allocation by balancing workloads across multiple NUMA nodes, thereby improving performance and resource utilization in multi-NUMA systems.

This work was done as part of KEP-2902: Add CPUManager policy option to distribute CPUs across NUMA nodes instead of packing them led by SIG Node.

Zero-second sleeps for container PreStop hooks

Kubernetes 1.29 introduced a Sleep action for the preStop lifecycle hook in Pods, allowing containers to pause for a specified duration before termination. This provides a straightforward method to delay container shutdown, facilitating tasks such as connection draining or cleanup operations.

The Sleep action in a preStop hook can now accept a zero-second duration as a beta feature. This allows defining a no-op preStop hook, which is useful when a preStop hook is required but no delay is desired.
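A sketch of a no-op preStop hook using the zero-second sleep (container name and image are placeholders):

containers:
- name: app
  image: registry.example/app:1.0
  lifecycle:
    preStop:
      sleep:
        seconds: 0   # valid with the beta feature: a required-but-no-op hook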

This work was done as part of KEP-3960: Introducing Sleep Action for PreStop Hook and KEP-4818: Allow zero value for Sleep Action of PreStop Hook led by SIG Node.

Internal tooling for declarative validation of Kubernetes-native types

Behind the scenes, the internals of Kubernetes are starting to use a new mechanism for validating objects and changes to objects. Kubernetes v1.33 introduces validation-gen, an internal tool that Kubernetes contributors use to generate declarative validation rules. The overall goal is to improve the robustness and maintainability of API validations by enabling developers to specify validation constraints declaratively, reducing manual coding errors and ensuring consistency across the codebase.

This work was done as part of KEP-5073: Declarative Validation Of Kubernetes Native Types With validation-gen led by SIG API Machinery.

New features in Alpha

This is a selection of some of the improvements that are now alpha following the v1.33 release.

Configurable tolerance for HorizontalPodAutoscalers

This feature introduces configurable tolerance for HorizontalPodAutoscalers, which dampens scaling reactions to small metric variations.
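Under the alpha feature gate, the tolerance can be set per scaling direction in the HPA's behavior section; a hedged sketch (the names and the 5% value are illustrative):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  behavior:
    scaleDown:
      tolerance: 0.05   # ignore metric deviations under 5% when scaling down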

This work was done as part of KEP-4951: Configurable tolerance for Horizontal Pod Autoscalers led by SIG Autoscaling.

Configurable container restart delay

Introduced as alpha in v1.32, this feature provides a set of kubelet-level configurations to fine-tune how CrashLoopBackOff is handled.

This work was done as part of KEP-4603: Tune CrashLoopBackOff led by SIG Node.

Custom container stop signals

Before Kubernetes v1.33, stop signals could only be set in container image definitions (for example, via the StopSignal configuration field in the image metadata). If you wanted to modify termination behavior, you needed to build a custom container image. By enabling the (alpha) ContainerStopSignals feature gate in Kubernetes v1.33, you can now define custom stop signals directly within Pod specifications. This is defined in the container's lifecycle.stopSignal field and requires the Pod's spec.os.name field to be present. If unspecified, containers fall back to the image-defined stop signal (if present), or the container runtime default (typically SIGTERM for Linux).
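A sketch of a Pod using a custom stop signal, per the fields described above (name and image are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: custom-stop
spec:
  os:
    name: linux              # required when setting a custom stop signal
  containers:
  - name: app
    image: registry.example/app:1.0
    lifecycle:
      stopSignal: SIGUSR1    # overrides the image-defined or runtime default signal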

This work was done as part of KEP-4960: Container Stop Signals led by SIG Node.

DRA enhancements galore!

Kubernetes v1.33 continues to develop Dynamic Resource Allocation (DRA) with features designed for today's complex infrastructures. DRA is an API for requesting and sharing resources between pods and containers inside a pod. Typically those resources are devices such as GPUs, FPGAs, and network adapters.

Several alpha DRA feature gates were introduced in v1.33; they correspond to the KEPs listed below.

These feature gates have no effect unless you also enable the DynamicResourceAllocation feature gate.

This work was done as part of KEP-5055: DRA: device taints and tolerations, KEP-4816: DRA: Prioritized Alternatives in Device Requests, KEP-5018: DRA: AdminAccess for ResourceClaims and ResourceClaimTemplates, and KEP-4815: DRA: Add support for partitionable devices, led by SIG Node, SIG Scheduling and SIG Auth.

Robust image pull policy to authenticate images for IfNotPresent and Never

This feature allows users to ensure that kubelet requires an image pull authentication check for each new set of credentials, regardless of whether the image is already present on the node.

This work was done as part of KEP-2535: Ensure secret pulled images led by SIG Auth.

Node topology labels are available via downward API

This feature enables Node topology labels to be exposed via the downward API. Prior to Kubernetes v1.33, a workaround involved using an init container to query the Kubernetes API for the underlying node; this alpha feature simplifies how workloads can access Node topology information.

This work was done as part of KEP-4742: Expose Node labels via downward API led by SIG Node.

Better pod status with generation and observed generation

Prior to this change, the metadata.generation field was unused in pods. Along with extending support for metadata.generation, this feature introduces status.observedGeneration to provide clearer pod status.

This work was done as part of KEP-5067: Pod Generation led by SIG Node.

Support for split level 3 cache architecture with kubelet's CPU Manager

Previously, the kubelet's CPU Manager was unaware of split L3 cache (also known as Last Level Cache, or LLC) architectures and could distribute CPU assignments without considering the split L3 cache, causing a noisy neighbor problem. This alpha feature improves the CPU Manager to assign CPU cores with the cache topology in mind, for better performance.

This work was done as part of KEP-5109: Split L3 Cache Topology Awareness in CPU Manager led by SIG Node.

PSI (Pressure Stall Information) metrics for scheduling improvements

This feature adds support on Linux nodes for providing PSI stats and metrics using cgroupv2. It can detect resource shortages and provide nodes with more granular control for pod scheduling.

This work was done as part of KEP-4205: Support PSI based on cgroupv2 led by SIG Node.

Secret-less image pulls with kubelet

The kubelet's on-disk credential provider now supports optional Kubernetes ServiceAccount (SA) token fetching. This simplifies authentication with image registries by allowing cloud providers to better integrate with OIDC compatible identity solutions.

This work was done as part of KEP-4412: Projected service account tokens for Kubelet image credential providers led by SIG Auth.

Graduations, deprecations, and removals in v1.33

Graduations to stable

This lists all the features that have graduated to stable (also known as general availability). For a full list of updates including new features and graduations from alpha to beta, see the release notes.

This release includes a total of 18 enhancements promoted to stable.

Deprecations and removals

As Kubernetes develops and matures, features may be deprecated, removed, or replaced with better ones to improve the project's overall health. See the Kubernetes deprecation and removal policy for more details on this process. Many of these deprecations and removals were announced in the Deprecations and Removals blog post.

Deprecation of the stable Endpoints API

The EndpointSlices API, which effectively replaced the original Endpoints API, has been stable since v1.21. While the original Endpoints API was simple and straightforward, it also posed some challenges when scaling to large numbers of network endpoints. The EndpointSlices API introduced new features such as dual-stack networking, making the original Endpoints API ready for deprecation.

This deprecation affects only those who use the Endpoints API directly from workloads or scripts; these users should migrate to use EndpointSlices instead. There will be a dedicated blog post with more details on the deprecation implications and migration plans.

You can find more in KEP-4974: Deprecate v1.Endpoints.

Removal of kube-proxy version information in node status

Following its deprecation in v1.31, as highlighted in the v1.31 release announcement, the .status.nodeInfo.kubeProxyVersion field for Nodes was removed in v1.33.

The field was set by the kubelet, but its value was not consistently accurate, and it had already been disabled by default since v1.31.

You can find more in KEP-4004: Deprecate status.nodeInfo.kubeProxyVersion field.

Removal of in-tree gitRepo volume driver

The gitRepo volume type has been deprecated since v1.11, nearly 7 years ago. Since its deprecation, there have been security concerns, including how gitRepo volume types can be exploited to gain remote code execution as root on the nodes. In v1.33, the in-tree driver code is removed.

There are alternatives such as git-sync and initContainers. The gitRepo volume type is not removed from the Kubernetes API, so pods with gitRepo volumes will still be admitted by the kube-apiserver, but kubelets with the GitRepoVolumeDriver feature gate set to false will not run them and will return an appropriate error to the user. This allows users to opt in to re-enabling the driver for 3 versions, to give them enough time to fix workloads.

The feature gate in kubelet and in-tree plugin code is planned to be removed in the v1.39 release.

You can find more in KEP-5040: Remove gitRepo volume driver.

Removal of host network support for Windows pods

Windows Pod networking aimed to achieve feature parity with Linux and provide better cluster density by allowing containers to use the Node's networking namespace. The original implementation landed as alpha with v1.26, but because it faced unexpected containerd behaviours and alternative solutions were available, the Kubernetes project has decided to withdraw the associated KEP. Support was fully removed in v1.33.

Please note that this does not affect HostProcess containers, which provide host network as well as host-level access. The KEP withdrawn in v1.33 was about providing the host network only, which was never stable due to technical limitations with Windows networking logic.

You can find more in KEP-3503: Host network support for Windows pods.

Release notes

Check out the full details of the Kubernetes v1.33 release in our release notes.

Availability

Kubernetes v1.33 is available for download on GitHub or on the Kubernetes download page.

To get started with Kubernetes, check out these interactive tutorials or run local Kubernetes clusters using minikube. You can also easily install v1.33 using kubeadm.

Release Team

Kubernetes is only possible with the support, commitment, and hard work of its community. The Release Team is made up of dedicated community volunteers who work together to build the many pieces that make up the Kubernetes releases you rely on. This requires the specialized skills of people from all corners of our community, from the code itself to its documentation and project management.

We would like to thank the entire Release Team for the hours spent hard at work to deliver the Kubernetes v1.33 release to our community. The Release Team's membership ranges from first-time shadows to returning team leads with experience forged over several release cycles. There was a new team structure adopted in this release cycle, which was to combine Release Notes and Docs subteams into a unified subteam of Docs. Thanks to the meticulous effort in organizing the relevant information and resources from the new Docs team, both Release Notes and Docs tracking have seen a smooth and successful transition. Finally, a very special thanks goes out to our release lead, Nina Polshakova, for her support throughout a successful release cycle, her advocacy, her efforts to ensure that everyone could contribute effectively, and her challenges to improve the release process.

Project velocity

The CNCF K8s DevStats project aggregates several interesting data points related to the velocity of Kubernetes and various subprojects. This includes everything from individual contributions, to the number of companies contributing, and illustrates the depth and breadth of effort that goes into evolving this ecosystem.

During the v1.33 release cycle, which spanned 15 weeks from January 13 to April 23, 2025, Kubernetes received contributions from as many as 121 different companies and 570 individuals (as of writing, a few weeks before the release date). In the wider cloud native ecosystem, the figure goes up to 435 companies and 2400 total contributors. You can find the data source in this dashboard. Compared to the velocity data from the previous release, v1.32, we see a similar level of contribution from companies and individuals, indicating strong community interest and engagement.

Note that, "contribution" counts when someone makes a commit, code review, comment, creates an issue or PR, reviews a PR (including blogs and documentation) or comments on issues and PRs. If you are interested in contributing, visit Getting Started on our contributor website.

Check out DevStats to learn more about the overall velocity of the Kubernetes project and community.

Event update

Explore upcoming Kubernetes and cloud native events, including KubeCon + CloudNativeCon, KCD, and other notable conferences worldwide. Stay informed and get involved with the Kubernetes community!

May 2025

June 2025

July 2025

August 2025

You can find the latest KCD details here.

Upcoming release webinar

Join members of the Kubernetes v1.33 Release Team on Friday, May 16th 2025 at 4:00 PM (UTC) to learn about the highlights of this release, as well as deprecations and removals, to help plan for upgrades. For more information and registration, visit the event page on the CNCF Online Programs site.

Get involved

The simplest way to get involved with Kubernetes is by joining one of the many Special Interest Groups (SIGs) that align with your interests. Have something you'd like to broadcast to the Kubernetes community? Share your voice at our weekly community meeting, and through the channels below. Thank you for your continued feedback and support.

22 Apr 2025

Kubernetes Multicontainer Pods: An Overview

As cloud-native architectures continue to evolve, Kubernetes has become the go-to platform for deploying complex, distributed systems. One of the most powerful yet nuanced design patterns in this ecosystem is the sidecar pattern: a technique that allows developers to extend application functionality without diving deep into source code.

The origins of the sidecar pattern

Think of a sidecar like a trusty companion motorcycle attachment. Historically, IT infrastructures have always used auxiliary services to handle critical tasks. Before containers, we relied on background processes and helper daemons to manage logging, monitoring, and networking. The microservices revolution transformed this approach, making sidecars a structured and intentional architectural choice. With the rise of microservices, the sidecar pattern became more clearly defined, allowing developers to offload specific responsibilities from the main service without altering its code. Service meshes like Istio and Linkerd have popularized sidecar proxies, demonstrating how these companion containers can elegantly handle observability, security, and traffic management in distributed systems.

Kubernetes implementation

In Kubernetes, sidecar containers operate within the same Pod as the main application, enabling communication and resource sharing. Does this sound just like defining multiple containers alongside each other inside the Pod? It actually does, and this is how sidecar containers had to be implemented before Kubernetes v1.29.0, which introduced native support for sidecars. Sidecar containers can now be defined within a Pod manifest using the spec.initContainers field. What makes a container a sidecar is that you specify it with restartPolicy: Always. You can see an example of this below, which is a partial snippet of the full Kubernetes manifest:

initContainers:
  - name: logshipper
    image: alpine:latest
    restartPolicy: Always
    command: ['sh', '-c', 'tail -F /opt/logs.txt']
    volumeMounts:
      - name: data
        mountPath: /opt

That field name, spec.initContainers, may sound confusing. Why, when you want to define a sidecar container, do you have to put an entry in the spec.initContainers array? Classic init containers run to completion just before the main application starts, so they're one-off, whereas sidecars often run in parallel to the main app container. It's the restartPolicy: Always setting that distinguishes Kubernetes-native sidecar containers from classic init containers and ensures they are always up.

When to embrace (or avoid) sidecars

While the sidecar pattern can be useful in many cases, it is generally not the preferred approach unless the use case justifies it. Adding a sidecar increases complexity, resource consumption, and potential network latency. Instead, simpler alternatives such as built-in libraries or shared infrastructure should be considered first.

Deploy a sidecar when:

  1. You need to extend application functionality without touching the original code
  2. Implementing cross-cutting concerns like logging, monitoring or security
  3. Working with legacy applications requiring modern networking capabilities
  4. Designing microservices that demand independent scaling and updates

Proceed with caution if:

  1. Resource efficiency is your primary concern
  2. Minimal network latency is critical
  3. Simpler alternatives exist
  4. You want to minimize troubleshooting complexity

Four essential multi-container patterns

Init container pattern

The Init container pattern is used to execute (often critical) setup tasks before the main application container starts. Unlike regular containers, init containers run to completion and then terminate, ensuring that preconditions for the main application are met.

Ideal for:

  1. Preparing configurations
  2. Loading secrets
  3. Verifying dependency availability
  4. Running database migrations

The init container ensures your application starts in a predictable, controlled environment without code modifications.
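A sketch of a classic init container, which has no restartPolicy and therefore runs to completion before the app starts (images and command are placeholders):

spec:
  initContainers:
  - name: run-migrations
    image: registry.example/migrate:1.0              # placeholder image
    command: ["sh", "-c", "migrate --wait-for-db"]   # placeholder command
  containers:
  - name: app
    image: registry.example/app:1.0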

Ambassador pattern

An ambassador container provides Pod-local helper services that expose a simple way to access a network service. Commonly, ambassador containers send network requests on behalf of an application container and take care of challenges such as service discovery, peer identity verification, or encryption in transit.

Perfect when you need to:

  1. Offload client connectivity concerns
  2. Implement language-agnostic networking features
  3. Add security layers like TLS
  4. Create robust circuit breakers and retry mechanisms

Configuration helper

A configuration helper sidecar provides configuration updates to an application dynamically, ensuring it always has access to the latest settings without disrupting the service. Often the helper needs to provide an initial configuration before the application is able to start successfully.

Use cases:

  1. Fetching environment variables and secrets
  2. Polling configuration changes
  3. Decoupling configuration management from application logic

Adapter pattern

An adapter (or sometimes façade) container enables interoperability between the main application container and external services. It does this by translating data formats, protocols, or APIs.

Strengths:

  1. Transforming legacy data formats
  2. Bridging communication protocols
  3. Facilitating integration between mismatched services

Wrap-up

While sidecar patterns offer tremendous flexibility, they're not a silver bullet. Each added sidecar introduces complexity, consumes resources, and potentially increases operational overhead. Always evaluate simpler alternatives first. The key is strategic implementation: use sidecars as precision tools to solve specific architectural challenges, not as a default approach. When used correctly, they can improve security, networking, and configuration management in containerized environments. Choose wisely, implement carefully, and let your sidecars elevate your container ecosystem.

22 Apr 2025 12:00am GMT

07 Apr 2025

feedKubernetes Blog

Introducing kube-scheduler-simulator

The Kubernetes Scheduler is a crucial control plane component that determines which node a Pod will run on. Thus, anyone utilizing Kubernetes relies on a scheduler.

kube-scheduler-simulator is a simulator for the Kubernetes scheduler that started as a Google Summer of Code 2021 project developed by me (Kensei Nakada) and has since received a lot of contributions. This tool allows users to closely examine the scheduler's behavior and decisions.

It is useful for casual users who employ scheduling constraints (for example, inter-Pod affinity) and experts who extend the scheduler with custom plugins.

Motivation

The scheduler often appears as a black box, composed of many plugins that each contribute to the scheduling decision-making process from their unique perspectives. Understanding its behavior can be challenging due to the multitude of factors it considers.

Even if a Pod appears to be scheduled correctly in a simple test cluster, it might have been scheduled based on different calculations than expected. This discrepancy could lead to unexpected scheduling outcomes when deployed in a large production environment.

Also, testing a scheduler is a complex challenge. There are countless patterns of operations executed within a real cluster, making it unfeasible to anticipate every scenario with a finite number of tests. More often than not, bugs are discovered only when the scheduler is deployed in an actual cluster. In fact, many bugs are found by users after a release ships, even in the upstream kube-scheduler.

Having a development or sandbox environment for testing the scheduler - or, indeed, any Kubernetes controllers - is a common practice. However, this approach falls short of capturing all the potential scenarios that might arise in a production cluster because a development cluster is often much smaller with notable differences in workload sizes and scaling dynamics. It never sees the exact same use or exhibits the same behavior as its production counterpart.

The kube-scheduler-simulator aims to solve those problems. It enables users to test their scheduling constraints, scheduler configurations, and custom plugins while checking every detailed part of scheduling decisions. It also allows users to create a simulated cluster environment, where they can test their scheduler with the same resources as their production cluster without affecting actual workloads.

Features of the kube-scheduler-simulator

The kube-scheduler-simulator's core feature is its ability to expose the scheduler's internal decisions. The scheduler operates based on the scheduling framework, using various plugins at different extension points to filter nodes (Filter phase), score nodes (Score phase), and ultimately determine the best node for the Pod.

The simulator allows users to create Kubernetes resources and observe how each plugin influences the scheduling decisions for Pods. This visibility helps users understand the scheduler's workings and define appropriate scheduling constraints.

Screenshot of the simulator web frontend that shows the detailed scheduling results per node and per extension point

The simulator web frontend

Inside the simulator, a debuggable scheduler runs instead of the vanilla scheduler. This debuggable scheduler outputs the results of each scheduler plugin at every extension point to the Pod's annotations, as the following manifest shows, and the web frontend formats and visualizes the scheduling results based on these annotations.

kind: Pod
apiVersion: v1
metadata:
  # The JSONs within these annotations are manually formatted for clarity in the blog post.
  annotations:
    kube-scheduler-simulator.sigs.k8s.io/bind-result: '{"DefaultBinder":"success"}'
    kube-scheduler-simulator.sigs.k8s.io/filter-result: >-
      {
        "node-jjfg5":{
          "NodeName":"passed",
          "NodeResourcesFit":"passed",
          "NodeUnschedulable":"passed",
          "TaintToleration":"passed"
        },
        "node-mtb5x":{
          "NodeName":"passed",
          "NodeResourcesFit":"passed",
          "NodeUnschedulable":"passed",
          "TaintToleration":"passed"
        }
      }
    kube-scheduler-simulator.sigs.k8s.io/finalscore-result: >-
      {
        "node-jjfg5":{
          "ImageLocality":"0",
          "NodeAffinity":"0",
          "NodeResourcesBalancedAllocation":"52",
          "NodeResourcesFit":"47",
          "TaintToleration":"300",
          "VolumeBinding":"0"
        },
        "node-mtb5x":{
          "ImageLocality":"0",
          "NodeAffinity":"0",
          "NodeResourcesBalancedAllocation":"76",
          "NodeResourcesFit":"73",
          "TaintToleration":"300",
          "VolumeBinding":"0"
        }
      }
    kube-scheduler-simulator.sigs.k8s.io/permit-result: '{}'
    kube-scheduler-simulator.sigs.k8s.io/permit-result-timeout: '{}'
    kube-scheduler-simulator.sigs.k8s.io/postfilter-result: '{}'
    kube-scheduler-simulator.sigs.k8s.io/prebind-result: '{"VolumeBinding":"success"}'
    kube-scheduler-simulator.sigs.k8s.io/prefilter-result: '{}'
    kube-scheduler-simulator.sigs.k8s.io/prefilter-result-status: >-
      {
        "AzureDiskLimits":"",
        "EBSLimits":"",
        "GCEPDLimits":"",
        "InterPodAffinity":"",
        "NodeAffinity":"",
        "NodePorts":"",
        "NodeResourcesFit":"success",
        "NodeVolumeLimits":"",
        "PodTopologySpread":"",
        "VolumeBinding":"",
        "VolumeRestrictions":"",
        "VolumeZone":""
      }
    kube-scheduler-simulator.sigs.k8s.io/prescore-result: >-
      {
        "InterPodAffinity":"",
        "NodeAffinity":"success",
        "NodeResourcesBalancedAllocation":"success",
        "NodeResourcesFit":"success",
        "PodTopologySpread":"",
        "TaintToleration":"success"
      }
    kube-scheduler-simulator.sigs.k8s.io/reserve-result: '{"VolumeBinding":"success"}'
    kube-scheduler-simulator.sigs.k8s.io/result-history: >-
      [
        {
          "kube-scheduler-simulator.sigs.k8s.io/bind-result":"{\"DefaultBinder\":\"success\"}",
          "kube-scheduler-simulator.sigs.k8s.io/filter-result":"{\"node-jjfg5\":{\"NodeName\":\"passed\",\"NodeResourcesFit\":\"passed\",\"NodeUnschedulable\":\"passed\",\"TaintToleration\":\"passed\"},\"node-mtb5x\":{\"NodeName\":\"passed\",\"NodeResourcesFit\":\"passed\",\"NodeUnschedulable\":\"passed\",\"TaintToleration\":\"passed\"}}",
          "kube-scheduler-simulator.sigs.k8s.io/finalscore-result":"{\"node-jjfg5\":{\"ImageLocality\":\"0\",\"NodeAffinity\":\"0\",\"NodeResourcesBalancedAllocation\":\"52\",\"NodeResourcesFit\":\"47\",\"TaintToleration\":\"300\",\"VolumeBinding\":\"0\"},\"node-mtb5x\":{\"ImageLocality\":\"0\",\"NodeAffinity\":\"0\",\"NodeResourcesBalancedAllocation\":\"76\",\"NodeResourcesFit\":\"73\",\"TaintToleration\":\"300\",\"VolumeBinding\":\"0\"}}",
          "kube-scheduler-simulator.sigs.k8s.io/permit-result":"{}",
          "kube-scheduler-simulator.sigs.k8s.io/permit-result-timeout":"{}",
          "kube-scheduler-simulator.sigs.k8s.io/postfilter-result":"{}",
          "kube-scheduler-simulator.sigs.k8s.io/prebind-result":"{\"VolumeBinding\":\"success\"}",
          "kube-scheduler-simulator.sigs.k8s.io/prefilter-result":"{}",
          "kube-scheduler-simulator.sigs.k8s.io/prefilter-result-status":"{\"AzureDiskLimits\":\"\",\"EBSLimits\":\"\",\"GCEPDLimits\":\"\",\"InterPodAffinity\":\"\",\"NodeAffinity\":\"\",\"NodePorts\":\"\",\"NodeResourcesFit\":\"success\",\"NodeVolumeLimits\":\"\",\"PodTopologySpread\":\"\",\"VolumeBinding\":\"\",\"VolumeRestrictions\":\"\",\"VolumeZone\":\"\"}",
          "kube-scheduler-simulator.sigs.k8s.io/prescore-result":"{\"InterPodAffinity\":\"\",\"NodeAffinity\":\"success\",\"NodeResourcesBalancedAllocation\":\"success\",\"NodeResourcesFit\":\"success\",\"PodTopologySpread\":\"\",\"TaintToleration\":\"success\"}",
          "kube-scheduler-simulator.sigs.k8s.io/reserve-result":"{\"VolumeBinding\":\"success\"}",
          "kube-scheduler-simulator.sigs.k8s.io/score-result":"{\"node-jjfg5\":{\"ImageLocality\":\"0\",\"NodeAffinity\":\"0\",\"NodeResourcesBalancedAllocation\":\"52\",\"NodeResourcesFit\":\"47\",\"TaintToleration\":\"0\",\"VolumeBinding\":\"0\"},\"node-mtb5x\":{\"ImageLocality\":\"0\",\"NodeAffinity\":\"0\",\"NodeResourcesBalancedAllocation\":\"76\",\"NodeResourcesFit\":\"73\",\"TaintToleration\":\"0\",\"VolumeBinding\":\"0\"}}",
          "kube-scheduler-simulator.sigs.k8s.io/selected-node":"node-mtb5x"
        }
      ]
    kube-scheduler-simulator.sigs.k8s.io/score-result: >-
      {
        "node-jjfg5":{
          "ImageLocality":"0",
          "NodeAffinity":"0",
          "NodeResourcesBalancedAllocation":"52",
          "NodeResourcesFit":"47",
          "TaintToleration":"0",
          "VolumeBinding":"0"
        },
        "node-mtb5x":{
          "ImageLocality":"0",
          "NodeAffinity":"0",
          "NodeResourcesBalancedAllocation":"76",
          "NodeResourcesFit":"73",
          "TaintToleration":"0",
          "VolumeBinding":"0"
        }
      }
    kube-scheduler-simulator.sigs.k8s.io/selected-node: node-mtb5x

Users can also integrate their custom plugins or extenders into the debuggable scheduler and visualize their results.

This debuggable scheduler can also run standalone, for example, on any Kubernetes cluster or in integration tests. This would be useful to custom plugin developers who want to test their plugins or examine their custom scheduler in a real cluster with better debuggability.

The simulator as a better dev cluster

As mentioned earlier, with a limited set of tests, it is impossible to predict every possible scenario in a real-world cluster. Typically, users test the scheduler in a small development cluster before deploying it to production, hoping that no issues arise.

The simulator's importing feature provides a solution by allowing users to simulate deploying a new scheduler version in a production-like environment without impacting their live workloads.

By continuously syncing between a production cluster and the simulator, users can safely test a new scheduler version with the same resources their production cluster handles. Once confident in its performance, they can proceed with the production deployment, reducing the risk of unexpected issues.

What are the use cases?

  1. Cluster users: Examine if scheduling constraints (for example, PodAffinity, PodTopologySpread) work as intended.
  2. Cluster admins: Assess how a cluster would behave with changes to the scheduler configuration.
  3. Scheduler plugin developers: Test custom scheduler plugins or extenders, use the debuggable scheduler in integration tests or development clusters, or use the syncing feature to test within a production-like environment.

Getting started

The simulator only requires Docker to be installed on a machine; a Kubernetes cluster is not necessary.

git clone git@github.com:kubernetes-sigs/kube-scheduler-simulator.git
cd kube-scheduler-simulator
make docker_up

You can then access the simulator's web UI at http://localhost:3000.

Visit the kube-scheduler-simulator repository for more details!

Getting involved

The scheduler simulator is developed by Kubernetes SIG Scheduling. Your feedback and contributions are welcome!

Open issues or PRs at the kube-scheduler-simulator repository. Join the conversation on the #sig-scheduling Slack channel.

Acknowledgments

The simulator has been maintained by dedicated volunteer engineers, overcoming many challenges to reach its current form.

A big shout out to all the awesome contributors!

07 Apr 2025 12:00am GMT

26 Mar 2025

feedKubernetes Blog

Kubernetes v1.33 sneak peek

As the release of Kubernetes v1.33 approaches, the Kubernetes project continues to evolve. Features may be deprecated, removed, or replaced to improve the overall health of the project. This blog post outlines some planned changes for the v1.33 release, which the release team believes you should be aware of to ensure the continued smooth operation of your Kubernetes environment and to keep you up-to-date with the latest developments. The information below is based on the current status of the v1.33 release and is subject to change before the final release date.

The Kubernetes API removal and deprecation process

The Kubernetes project has a well-documented deprecation policy for features. This policy states that stable APIs may only be deprecated when a newer, stable version of that same API is available and that APIs have a minimum lifetime for each stability level. A deprecated API has been marked for removal in a future Kubernetes release. It will continue to function until removal (at least one year from the deprecation), but usage will result in a warning being displayed. Once an API has been removed, it is no longer available in the current version, and you must migrate to using the replacement.

Whether an API is removed as a result of a feature graduating from beta to stable, or because that API simply did not succeed, all removals comply with this deprecation policy. Whenever an API is removed, migration options are communicated in the deprecation guide.

Deprecations and removals for Kubernetes v1.33

Deprecation of the stable Endpoints API

The EndpointSlices API, which has been stable since v1.21, effectively replaced the original Endpoints API. While the original Endpoints API was simple and straightforward, it posed some challenges when scaling to large numbers of network endpoints. The EndpointSlices API introduced new features such as dual-stack networking, making the original Endpoints API ready for deprecation.

This deprecation only impacts those who use the Endpoints API directly from workloads or scripts; these users should migrate to use EndpointSlices instead. There will be a dedicated blog post with more details on the deprecation implications and migration plans in the coming weeks.
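
If you want to see the replacement objects for a given Service, you can list its EndpointSlices by the standard kubernetes.io/service-name label (the Service name my-service below is a placeholder):

kubectl get endpointslices -l kubernetes.io/service-name=my-service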

You can find more in KEP-4974: Deprecate v1.Endpoints.

Removal of kube-proxy version information in node status

Following its deprecation in v1.31, as highlighted in the release announcement, the status.nodeInfo.kubeProxyVersion field will be removed in v1.33. This field was set by kubelet, but its value was not consistently accurate. As it has been disabled by default since v1.31, the v1.33 release will remove this field entirely.

You can find more in KEP-4004: Deprecate status.nodeInfo.kubeProxyVersion field.

Removal of host network support for Windows pods

Windows Pod networking aimed to achieve feature parity with Linux and provide better cluster density by allowing containers to use the Node's networking namespace. The original implementation landed as alpha with v1.26, but because it faced unexpected containerd behaviours and alternative solutions were available, the Kubernetes project decided to withdraw the associated KEP. We expect support to be fully removed in v1.33.

You can find more in KEP-3503: Host network support for Windows pods.

Featured improvement of Kubernetes v1.33

As authors of this article, we picked one improvement as the most significant change to call out!

Support for user namespaces within Linux Pods

One of the oldest open KEPs today is KEP-127, Pod security improvement by using Linux User namespaces for Pods. This KEP was first opened in late 2016, and after multiple iterations, had its alpha release in v1.25, initial beta in v1.30 (where it was disabled by default), and now is set to be a part of v1.33, where the feature is available by default.

This support will not impact existing Pods unless you manually specify pod.spec.hostUsers to opt in. As highlighted in the v1.30 sneak peek blog, this is an important milestone for mitigating vulnerabilities.

You can find more in KEP-127: Support User Namespaces in pods.

Selected other Kubernetes v1.33 improvements

The following list of enhancements is likely to be included in the upcoming v1.33 release. This is not a commitment and the release content is subject to change.

In-place resource resize for vertical scaling of Pods

When provisioning a Pod, you can use various resources such as Deployment, StatefulSet, etc. Scalability requirements may need horizontal scaling by updating the Pod replica count, or vertical scaling by updating resources allocated to Pod's container(s). Before this enhancement, container resources defined in a Pod's spec were immutable, and updating any of these details within a Pod template would trigger Pod replacement.

But what if you could dynamically update the resource configuration for your existing Pods without restarting them?

KEP-1287 is precisely about allowing such in-place Pod updates. It opens up various possibilities: vertical scale-up for stateful processes without any downtime, seamless scale-down when traffic is low, and even allocating larger resources during startup that are eventually reduced once the initial setup is complete. This was released as alpha in v1.27, and is expected to land as beta in v1.33.
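
As a hedged sketch of what opting in can look like (the container name, image, and sizes are illustrative), the resizePolicy field introduced by this KEP lets you state, per resource, whether resizing requires a container restart:

apiVersion: v1
kind: Pod
metadata:
  name: resizable
spec:
  containers:
  - name: app
    image: nginx
    resizePolicy:
    - resourceName: cpu
      restartPolicy: NotRequired # CPU can be resized in place
    - resourceName: memory
      restartPolicy: RestartContainer # memory changes restart this container
    resources:
      requests:
        cpu: "500m"
        memory: "128Mi"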

You can find more in KEP-1287: In-Place Update of Pod Resources.

DRA's ResourceClaim Device Status graduates to beta

The devices field in ResourceClaim status, originally introduced in the v1.32 release, is likely to graduate to beta in v1.33. This field allows drivers to report device status data, improving both observability and troubleshooting capabilities.

For example, reporting the interface name, MAC address, and IP addresses of network interfaces in the status of a ResourceClaim can significantly help in configuring and managing network services, as well as in debugging network-related issues. You can read more about ResourceClaim Device Status in the Dynamic Resource Allocation: ResourceClaim Device Status document.

Also, you can find more about the planned enhancement in KEP-4817: DRA: Resource Claim Status with possible standardized network interface data.

Ordered namespace deletion

This KEP introduces a more structured deletion process for Kubernetes namespaces to ensure secure and deterministic resource removal. The current semi-random deletion order can create security gaps or unintended behaviour, such as Pods persisting after their associated NetworkPolicies are deleted. By enforcing a structured deletion sequence that respects logical and security dependencies, this approach ensures Pods are removed before other resources. The design improves Kubernetes's security and reliability by mitigating risks associated with non-deterministic deletions.

You can find more in KEP-5080: Ordered namespace deletion.

Enhancements for indexed job management

These two KEPs are both set to graduate to GA to provide better reliability for job handling, specifically for indexed jobs. KEP-3850 provides per-index backoff limits for indexed jobs, which allows each index to be fully independent of the others. Also, KEP-3998 extends the Job API so that an indexed job can be declared successfully completed even when not all indexes succeed.

You can find more in KEP-3850: Backoff Limit Per Index For Indexed Jobs and KEP-3998: Job success/completion policy.

Want to know more?

New features and deprecations are also announced in the Kubernetes release notes. We will formally announce what's new in Kubernetes v1.33 as part of the CHANGELOG for that release.

The Kubernetes v1.33 release is planned for Wednesday, 23rd April, 2025. Stay tuned for updates!

You can also see the announcements of changes in the release notes for earlier releases.

Get involved

The simplest way to get involved with Kubernetes is by joining one of the many Special Interest Groups (SIGs) that align with your interests. Have something you'd like to broadcast to the Kubernetes community? Share your voice at our weekly community meeting, and through the channels below. Thank you for your continued feedback and support.

26 Mar 2025 6:30pm GMT

25 Mar 2025

feedKubernetes Blog

Fresh Swap Features for Linux Users in Kubernetes 1.32

Swap is a fundamental and invaluable Linux feature. It offers numerous benefits, such as effectively increasing a node's memory by swapping out unused data, shielding nodes from system-level memory spikes, preventing Pods from crashing when they hit their memory limits, and much more. As a result, the node special interest group within the Kubernetes project has invested significant effort into supporting swap on Linux nodes.

The 1.22 release introduced Alpha support for configuring swap memory usage for Kubernetes workloads running on Linux, on a per-node basis. Later, in release 1.28, support for swap on Linux nodes graduated to Beta, along with many new improvements. In subsequent Kubernetes releases further improvements were made, paving the way to GA in the near future.

Prior to version 1.22, Kubernetes did not provide support for swap memory on Linux systems. This was due to the inherent difficulty in guaranteeing and accounting for pod memory utilization when swap memory was involved. As a result, swap support was deemed out of scope in the initial design of Kubernetes, and the default behavior of a kubelet was to fail to start if swap memory was detected on a node.

In version 1.22, the swap feature for Linux was initially introduced in its Alpha stage. This provided Linux users the opportunity to experiment with the swap feature for the first time. However, as an Alpha version, it was not fully developed and only partially worked on limited environments.

In version 1.28, swap support on Linux nodes was promoted to Beta. The Beta version was a drastic leap forward. Not only did it fix a large number of bugs and make swap work in a stable way, but it also brought cgroup v2 support, introduced a wide variety of tests covering complex scenarios such as node-level pressure, and more. It also brought many exciting new capabilities, such as the LimitedSwap behavior, which sets an auto-calculated swap limit for containers, OpenMetrics instrumentation support (through the /metrics/resource endpoint), Summary API support for VerticalPodAutoscalers (through the /stats/summary endpoint), and more.

Today we are working on more improvements, paving the way for GA. Currently, the focus is especially on ensuring node stability, enhancing debugging abilities, addressing user feedback, and polishing the feature to make it stable. For example, to increase stability, containers in high-priority pods cannot access swap, which ensures the memory they need is ready to use. In addition, the UnlimitedSwap behavior was removed since it might compromise the node's health. Protection of Secret content against swapping has also been introduced (see the relevant security-risk section for more info).

To conclude, compared to previous releases, the kubelet's support for running with swap enabled is more stable and robust, more user-friendly, and addresses many known shortcomings. That said, the NodeSwap feature introduces basic swap support, and this is just the beginning. In the near future, additional features are planned to enhance swap functionality in various ways, such as improving evictions, extending the API, increasing customizability, and more!

How do I use it?

In order for the kubelet to initialize on a swap-enabled node, the failSwapOn field must be set to false in the kubelet's configuration, or the deprecated --fail-swap-on command line flag must be deactivated.

It is possible to configure the memorySwap.swapBehavior option to define the manner in which a node utilizes swap memory. For instance,

# this fragment goes into the kubelet's configuration file
memorySwap:
 swapBehavior: LimitedSwap

The currently available configuration options for swapBehavior are:

  1. NoSwap (default): Kubernetes workloads will not use swap.
  2. LimitedSwap: Kubernetes workloads may use swap memory, limited to an automatically calculated amount; only containers within Burstable QoS Pods are permitted to use swap.

If configuration for memorySwap is not specified, by default the kubelet will apply the same behaviour as the NoSwap setting.

On Linux nodes, Kubernetes only supports running with swap enabled on hosts that use cgroup v2. On cgroup v1 systems, Kubernetes workloads are not allowed to use swap memory.

Install a swap-enabled cluster with kubeadm

Before you begin

This demo requires the kubeadm tool to be installed, following the steps outlined in the kubeadm installation guide. If swap is already enabled on the node, cluster creation may proceed. If swap is not enabled, please refer to the provided instructions for enabling swap.

Create a swap file and turn swap on

I'll demonstrate creating 4GiB of swap, in both the encrypted and unencrypted cases.

Setting up unencrypted swap

An unencrypted swap file can be set up as follows.

# Allocate storage and restrict access
fallocate --length 4GiB /swapfile
chmod 600 /swapfile

# Format the swap space
mkswap /swapfile

# Activate the swap space for paging
swapon /swapfile

Setting up encrypted swap

An encrypted swap file can be set up as follows. Bear in mind that this example uses the cryptsetup binary (which is available on most Linux distributions).

# Allocate storage and restrict access
fallocate --length 4GiB /swapfile
chmod 600 /swapfile

# Create an encrypted device backed by the allocated storage
cryptsetup --type plain --cipher aes-xts-plain64 --key-size 256 -d /dev/urandom open /swapfile cryptswap

# Format the swap space
mkswap /dev/mapper/cryptswap

# Activate the swap space for paging
swapon /dev/mapper/cryptswap

Verify that swap is enabled

You can verify that swap is enabled with either the swapon -s command or the free command:

> swapon -s
Filename Type Size Used Priority
/dev/dm-0 partition 4194300 0 -2
> free -h
total used free shared buff/cache available
Mem: 3.8Gi 1.3Gi 249Mi 25Mi 2.5Gi 2.5Gi
Swap: 4.0Gi 0B 4.0Gi

Enable swap on boot

After setting up swap, to start the swap file at boot time, you either set up a systemd unit to activate (encrypted) swap, or you add a line similar to /swapfile swap swap defaults 0 0 into /etc/fstab.

Set up a Kubernetes cluster that uses swap-enabled nodes

To make things clearer, here is an example kubeadm configuration file, kubeadm-config.yaml, for the swap-enabled cluster.

---
apiVersion: "kubeadm.k8s.io/v1beta3"
kind: InitConfiguration
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
failSwapOn: false
memorySwap:
 swapBehavior: LimitedSwap

Then create a single-node cluster using kubeadm init --config kubeadm-config.yaml. During init, a warning is emitted if swap is enabled on the node while the kubelet's failSwapOn is set to true. We plan to remove this warning in a future release.

How is the swap limit being determined with LimitedSwap?

The configuration of swap memory, including its limitations, presents a significant challenge. Not only is it prone to misconfiguration, but as a system-level property, any misconfiguration could potentially compromise the entire node rather than just a specific workload. To mitigate this risk and ensure the health of the node, we have implemented swap with automatic configuration of its limits.

With LimitedSwap, Pods that do not fall under the Burstable QoS classification (i.e. BestEffort/Guaranteed QoS Pods) are prohibited from utilizing swap memory. BestEffort QoS Pods exhibit unpredictable memory consumption patterns and lack information regarding their memory usage, making it difficult to determine a safe allocation of swap memory. Conversely, Guaranteed QoS Pods are typically employed for applications that rely on the precise allocation of resources specified by the workload, with memory being immediately available. To maintain the aforementioned security and node health guarantees, these Pods are not permitted to use swap memory when LimitedSwap is in effect. In addition, high-priority pods are not permitted to use swap, in order to ensure the memory they consume always resides in RAM and is hence ready to use.

Prior to detailing the calculation of the swap limit, it is necessary to define the following terms:

  1. nodeTotalMemory: the total amount of physical memory available on the node.
  2. totalPodsSwapAvailable: the total amount of swap memory on the node that is available for use by Pods (some swap memory may be reserved for system use).
  3. containerMemoryRequest: the container's memory request.

Swap limitation is configured as: (containerMemoryRequest / nodeTotalMemory) × totalPodsSwapAvailable

In other words, a container's swap limit is its memory request as a fraction of the node's total physical memory, multiplied by the total amount of swap memory on the node that is available for use by Pods.
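
For example, with illustrative numbers: a container requesting 4GiB of memory on a node with 32GiB of physical memory and 16GiB of swap available to Pods would be limited to (4GiB / 32GiB) × 16GiB = 2GiB of swap.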

It is important to note that, for containers within Burstable QoS Pods, it is possible to opt-out of swap usage by specifying memory requests that are equal to memory limits. Containers configured in this manner will not have access to swap memory.

How does it work?

There are a number of possible ways that one could envision swap use on a node. When swap is already provisioned and available on a node, the kubelet can be configured to control whether and how the node's workloads use it.

Swap configuration on a node is exposed to a cluster admin via the memorySwap field in the KubeletConfiguration. As a cluster administrator, you can specify the node's behaviour in the presence of swap memory by setting memorySwap.swapBehavior.

The kubelet employs the CRI (container runtime interface) API, and directs the container runtime to configure specific cgroup v2 parameters (such as memory.swap.max) in a manner that will enable the desired swap configuration for a container. For runtimes that use control groups, the container runtime is then responsible for writing these settings to the container-level cgroup.

How can I monitor swap?

Node and container level metric statistics

The kubelet now collects node- and container-level metric statistics, which can be accessed at the /metrics/resource (used mainly by monitoring tools like Prometheus) and /stats/summary (used mainly by autoscalers) kubelet HTTP endpoints. This allows clients that can directly interrogate the kubelet to monitor swap usage and remaining swap memory when using LimitedSwap. Additionally, a machine_swap_bytes metric has been added to cAdvisor to show the total physical swap capacity of the machine. See this page for more info.
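
If you have cluster-admin access, one quick way to inspect these statistics (assuming a node named node-1; substitute your own node name) is to proxy the request through the API server:

kubectl get --raw "/api/v1/nodes/node-1/proxy/stats/summary"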

Node Feature Discovery (NFD)

Node Feature Discovery is a Kubernetes addon for detecting hardware features and configuration. It can be utilized to discover which nodes are provisioned with swap.

As an example, to figure out which nodes are provisioned with swap, use the following command:

kubectl get nodes -o jsonpath='{range .items[?(@.metadata.labels.feature\.node\.kubernetes\.io/memory-swap)]}{.metadata.name}{"\t"}{.metadata.labels.feature\.node\.kubernetes\.io/memory-swap}{"\n"}{end}'

This will result in an output similar to:

k8s-worker1: true
k8s-worker2: true
k8s-worker3: false

In this example, swap is provisioned on nodes k8s-worker1 and k8s-worker2, but not on k8s-worker3.

Caveats

Having swap available on a system reduces predictability. While swap can enhance performance by making more RAM available, swapping data back to memory is a heavy operation, sometimes slower by many orders of magnitude, which can cause unexpected performance regressions. Furthermore, swap changes a system's behaviour under memory pressure. Enabling swap increases the risk of noisy neighbors, where Pods that frequently use their RAM may cause other Pods to swap. In addition, since swap allows for greater memory usage for workloads in Kubernetes that cannot be predictably accounted for, and due to unexpected packing configurations, the scheduler currently does not account for swap memory usage. This heightens the risk of noisy neighbors.

The performance of a node with swap memory enabled depends on the underlying physical storage. When swap memory is in use, performance will be significantly worse in an I/O operations per second (IOPS) constrained environment, such as a cloud VM with I/O throttling, when compared to faster storage mediums like solid-state drives or NVMe. As swap might cause IO pressure, it is recommended to give a higher IO latency priority to system critical daemons. See the relevant section in the recommended practices section below.

Memory-backed volumes

On Linux nodes, memory-backed volumes (such as secret volume mounts, or emptyDir with medium: Memory) are implemented with a tmpfs filesystem. The contents of such volumes should remain in memory at all times, hence should not be swapped to disk. To ensure the contents of such volumes remain in memory, the noswap tmpfs option is used.

The Linux kernel officially supports the noswap option from version 6.3 (more info can be found in Linux Kernel Version Requirements). However, distributions often choose to backport this mount option to older kernel versions as well.

The kubelet verifies whether the node supports the noswap option before relying on it.

It is strongly encouraged to encrypt the swap space. See the section above for an example of setting up encrypted swap. However, handling encrypted swap is not within the scope of the kubelet; rather, it is a general OS configuration concern and should be addressed at that level. It is the administrator's responsibility to provision encrypted swap to mitigate this risk.

Good practice for using swap in a Kubernetes cluster

Disable swap for system-critical daemons

During the testing phase and based on user feedback, it was observed that the performance of system-critical daemons and services might degrade. This implies that system daemons, including the kubelet, could operate slower than usual. If this issue is encountered, it is advisable to configure the cgroup of the system slice to prevent swapping (i.e., set memory.swap.max=0).
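
As a hedged sketch for systemd-based nodes (the drop-in path is an assumption; on non-systemd hosts you would write memory.swap.max directly under the relevant cgroup), this can be expressed as a drop-in for the system slice:

# /etc/systemd/system/system.slice.d/99-no-swap.conf
[Slice]
MemorySwapMax=0

Run systemctl daemon-reload afterwards so systemd re-applies the resource-control settings.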

Protect system-critical daemons for I/O latency

Swap can increase the I/O load on a node. When memory pressure causes the kernel to rapidly swap pages in and out, system-critical daemons and services that rely on I/O operations may experience performance degradation.

To mitigate this, it is recommended for systemd users to prioritize the system slice in terms of I/O latency. For non-systemd users, setting up a dedicated cgroup for system daemons and processes and prioritizing I/O latency in the same way is advised. This can be achieved by setting io.latency for the system slice, thereby granting it higher I/O priority. See cgroup's documentation for more info.
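
For systemd users, one hedged way to express this (the device path and latency target are illustrative) is another drop-in on the system slice, which systemd translates into the cgroup io.latency setting:

# /etc/systemd/system/system.slice.d/99-io-latency.conf
[Slice]
IODeviceLatencyTargetSec=/dev/sda 50ms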

Swap and control plane nodes

The Kubernetes project recommends running control plane nodes without any swap space configured. The control plane primarily hosts Guaranteed QoS Pods, so swap can generally be disabled. The main concern is that swapping critical services on the control plane could negatively impact performance.

Use of a dedicated disk for swap

It is recommended to use a separate, encrypted disk for the swap partition. If swap resides on a partition or the root filesystem, workloads may interfere with system processes that need to write to disk. When they share the same disk, processes can overwhelm swap, disrupting the I/O of kubelet, container runtime, and systemd, which would impact other workloads. Since swap space is located on a disk, it is crucial to ensure the disk is fast enough for the intended use cases. Alternatively, one can configure I/O priorities between different mapped areas of a single backing device.

Looking ahead

As you can see, the swap feature has been dramatically improved recently, paving the way toward GA. However, this is just the beginning: it's a foundational implementation marking the start of enhanced swap functionality.

In the near future, additional features are planned to further improve swap capabilities, including better eviction mechanisms, extended API support, increased customizability, better debug abilities and more!

How can I learn more?

You can review the current documentation for using swap with Kubernetes.

For more information, please see KEP-2400 and its design proposal.

How do I get involved?

Your feedback is always welcome! SIG Node meets regularly and can be reached via Slack (channel #sig-node), or the SIG's mailing list. A Slack channel dedicated to swap is also available at #sig-node-swap.

Feel free to reach out to me, Itamar Holder (@iholder101 on Slack and GitHub) if you'd like to help or ask further questions.

25 Mar 2025 6:00pm GMT

24 Mar 2025

feedKubernetes Blog

Ingress-nginx CVE-2025-1974: What You Need to Know

Today, the ingress-nginx maintainers have released patches for a batch of critical vulnerabilities that could make it easy for attackers to take over your Kubernetes cluster. If you are among the over 40% of Kubernetes administrators using ingress-nginx, you should take action immediately to protect your users and data.

Background

Ingress is the traditional Kubernetes feature for exposing your workload Pods to the world so that they can be useful. In an implementation-agnostic way, Kubernetes users can define how their applications should be made available on the network. Then, an ingress controller uses that definition to set up local or cloud resources as required for the user's particular situation and needs.

Many different ingress controllers are available, to suit users of different cloud providers or brands of load balancers. Ingress-nginx is a software-only ingress controller provided by the Kubernetes project. Because of its versatility and ease of use, ingress-nginx is quite popular: it is deployed in over 40% of Kubernetes clusters!

Ingress-nginx translates the requirements from Ingress objects into configuration for nginx, a powerful open source webserver daemon. Then, nginx uses that configuration to accept and route requests to the various applications running within a Kubernetes cluster. Proper handling of these nginx configuration parameters is crucial, because ingress-nginx needs to allow users significant flexibility while preventing them from accidentally or intentionally tricking nginx into doing things it shouldn't.

Vulnerabilities Patched Today

Four of today's ingress-nginx vulnerabilities are improvements to how ingress-nginx handles particular bits of nginx config. Without these fixes, a specially-crafted Ingress object can cause nginx to misbehave in various ways, including revealing the values of Secrets that are accessible to ingress-nginx. By default, ingress-nginx has access to all Secrets cluster-wide, so this can often lead to complete cluster takeover by any user or entity that has permission to create an Ingress.

The most serious of today's vulnerabilities, CVE-2025-1974, rated 9.8 CVSS, allows anything on the Pod network to exploit configuration injection vulnerabilities via the Validating Admission Controller feature of ingress-nginx. This makes such vulnerabilities far more dangerous: ordinarily one would need to be able to create an Ingress object in the cluster, which is a fairly privileged action. When combined with today's other vulnerabilities, CVE-2025-1974 means that anything on the Pod network has a good chance of taking over your Kubernetes cluster, with no credentials or administrative access required. In many common scenarios, the Pod network is accessible to all workloads in your cloud VPC, or even anyone connected to your corporate network! This is a very serious situation.

Today, we have released ingress-nginx v1.12.1 and v1.11.5, which have fixes for all five of these vulnerabilities.

Your next steps

First, determine if your clusters are using ingress-nginx. In most cases, you can check this by running kubectl get pods --all-namespaces --selector app.kubernetes.io/name=ingress-nginx with cluster administrator permissions.

If you are using ingress-nginx, make a plan to remediate these vulnerabilities immediately.

The best and easiest remedy is to upgrade to the new patch release of ingress-nginx. All five of today's vulnerabilities are fixed by installing today's patches.

If you can't upgrade right away, you can significantly reduce your risk by turning off the Validating Admission Controller feature of ingress-nginx.
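
How you disable the feature depends on how ingress-nginx was installed. As a hedged sketch (the release name, namespace, and chart values are assumptions; check your own installation before applying):

# Helm-based installs: redeploy with the admission webhooks disabled
helm upgrade ingress-nginx ingress-nginx/ingress-nginx --namespace ingress-nginx \
 --reuse-values --set controller.admissionWebhooks.enabled=false

# Manifest-based installs: delete the webhook configuration
kubectl delete validatingwebhookconfiguration ingress-nginx-admission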

If you turn off the Validating Admission Controller feature as a mitigation for CVE-2025-1974, remember to turn it back on after you upgrade. This feature provides important quality of life improvements for your users, warning them about incorrect Ingress configurations before they can take effect.

Conclusion, thanks, and further reading

The ingress-nginx vulnerabilities announced today, including CVE-2025-1974, present a serious risk to many Kubernetes users and their data. If you use ingress-nginx, you should take action immediately to keep yourself safe.

Thanks go out to Nir Ohfeld, Sagi Tzadik, Ronen Shustin, and Hillai Ben-Sasson from Wiz for responsibly disclosing these vulnerabilities, and for working with the Kubernetes SRC members and ingress-nginx maintainers (Marco Ebert and James Strong) to ensure we fixed them effectively.

For further information about the maintenance and future of ingress-nginx, please see this GitHub issue and/or attend James and Marco's KubeCon/CloudNativeCon EU 2025 presentation.

For further information about the specific vulnerabilities discussed in this article, please see the appropriate GitHub issue: CVE-2025-24513, CVE-2025-24514, CVE-2025-1097, CVE-2025-1098, or CVE-2025-1974

24 Mar 2025 8:00pm GMT

23 Mar 2025

feedKubernetes Blog

Introducing JobSet

Authors: Daniel Vega-Myhre (Google), Abdullah Gharaibeh (Google), Kevin Hannon (Red Hat)

In this article, we introduce JobSet, an open source API for representing distributed jobs. The goal of JobSet is to provide a unified API for distributed ML training and HPC workloads on Kubernetes.

Why JobSet?

The Kubernetes community's recent enhancements to the batch ecosystem on Kubernetes have attracted ML engineers, who have found it to be a natural fit for the requirements of running distributed training workloads.

Large ML models (particularly LLMs) which cannot fit into the memory of the GPU or TPU chips on a single host are often distributed across tens of thousands of accelerator chips, which in turn may span thousands of hosts.

As such, the model training code is often containerized and executed simultaneously on all these hosts, performing distributed computations that often shard the model parameters and/or the training dataset across the target accelerator chips, using collective communication primitives like all-gather and all-reduce to perform distributed computations and synchronize gradients between hosts.

These workload characteristics make Kubernetes a great fit for this type of workload, as efficiently scheduling and managing the lifecycle of containerized applications across a cluster of compute resources is an area where it shines.

It is also very extensible, allowing developers to define their own Kubernetes APIs, objects, and controllers which manage the behavior and life cycle of these objects, allowing engineers to develop custom distributed training orchestration solutions to fit their needs.

However, as distributed ML training techniques continue to evolve, existing Kubernetes primitives do not adequately model them alone anymore.

Furthermore, the landscape of Kubernetes distributed training orchestration APIs has become fragmented, and each of the existing solutions in this fragmented landscape has certain limitations that make it non-optimal for distributed ML training.

For example, the Kubeflow training operator defines custom APIs for different frameworks (e.g. PyTorchJob, TFJob, MPIJob, etc.); however, each of these job types is in fact a solution fit specifically to the target framework, each with different semantics and behavior.

On the other hand, the Job API fixed many gaps for running batch workloads, including Indexed completion mode, higher scalability, Pod failure policies and Pod backoff policy to mention a few of the most recent enhancements. However, running ML training and HPC workloads using the upstream Job API requires extra orchestration to fill the following gaps:

Multi-template Pods: Most HPC or ML training jobs include more than one type of Pod. The different Pods are part of the same workload, but they need to run a different container, request different resources, or have different failure policies. A common example is the driver-worker pattern.

Job groups: Large scale training workloads span multiple network topologies, running across multiple racks for example. Such workloads are network latency sensitive, and aim to localize communication and minimize traffic crossing the higher-latency network links. To facilitate this, the workload needs to be split into groups of Pods, each assigned to a network topology.

Inter-Pod communication: Create and manage the resources (e.g. headless Services) necessary to establish communication between the Pods of a job.

Startup sequencing: Some jobs require a specific start sequence of pods; sometimes the driver is expected to start first (like Ray or Spark), in other cases the workers are expected to be ready before starting the driver (like MPI).

JobSet aims to address those gaps using the Job API as a building block to build a richer API for large-scale distributed HPC and ML use cases.

How JobSet Works

JobSet models a distributed batch workload as a group of Kubernetes Jobs. This allows a user to easily specify different pod templates for distinct groups of pods (e.g. a leader, workers, parameter servers, etc.).

It uses the abstraction of a ReplicatedJob to manage child Jobs, where a ReplicatedJob is essentially a Job Template with some desired number of Job replicas specified. This provides a declarative way to easily create identical child-jobs to run on different islands of accelerators, without resorting to scripting or Helm charts to generate many versions of the same job but with different names.

JobSet Architecture

Some other key JobSet features which address the problems described above include:

Replicated Jobs: In modern data centers, hardware accelerators like GPUs and TPUs are allocated in islands of homogeneous accelerators, connected via specialized, high-bandwidth network links. For example, a user might provision nodes containing a group of hosts co-located on a rack, each with H100 GPUs, where the GPU chips within each host are connected via NVLink, with an NVLink Switch connecting the multiple NVLinks. TPU Pods are another example of this: TPU ViperLitePods consist of 64 hosts, each with 4 TPU v5e chips attached, all connected via an ICI mesh. When running a distributed training job across multiple of these islands, we often want to partition the workload into a group of smaller identical jobs, one per island, where each pod primarily communicates with the pods within the same island to do segments of distributed computation, keeping the gradient synchronization over DCN (data center network, which is lower bandwidth than ICI) to a bare minimum.

Automatic headless service creation, configuration, and lifecycle management: Pod-to-pod communication via pod hostname is enabled by default, with automatic configuration and lifecycle management of the headless service enabling this.

Configurable success policies: JobSet has configurable success policies which target specific ReplicatedJobs, with operators to target "Any" or "All" of their child jobs. For example, you can configure the JobSet to be marked complete if and only if all pods that are part of the "worker" ReplicatedJob are completed; a minimal sketch follows after this list.

Configurable failure policies: JobSet has configurable failure policies which allow the user to specify a maximum number of times the JobSet should be restarted in the event of a failure. If any job is marked failed, the entire JobSet will be recreated, allowing the workload to resume from the last checkpoint. When no failure policy is specified, if any job fails, the JobSet simply fails.

Exclusive placement per topology domain: JobSet allows users to express that child jobs have 1:1 exclusive assignment to a topology domain, typically an accelerator island like a rack. For example, if the JobSet creates two child jobs, then this feature will enforce that the pods of each child job will be co-located on the same island, and that only one child job is allowed to schedule per island. This is useful for scenarios where we want to use a distributed data parallel (DDP) training strategy to train a model using multiple islands of compute resources (GPU racks or TPU slices), running one model replica in each accelerator island, ensuring that the forward and backward passes within a single model replica occur over the high-bandwidth interconnect linking the accelerator chips within the island, and that only the gradient synchronization between model replicas crosses accelerator islands over the lower-bandwidth data center network.

Integration with Kueue: Users can submit JobSets via Kueue to oversubscribe their clusters, queue workloads to run as capacity becomes available, prevent partial scheduling and deadlocks, enable multi-tenancy, and more.
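
As a minimal sketch of the success policy mentioned above (the ReplicatedJob name workers and the busybox workload are illustrative), the policy lives in the JobSet spec:

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: success-policy-example
spec:
  # Mark the JobSet complete once all child jobs of the "workers"
  # ReplicatedJob have completed.
  successPolicy:
    operator: All
    targetReplicatedJobs:
    - workers
  replicatedJobs:
  - name: workers
    replicas: 2
    template:
      spec:
        parallelism: 1
        completions: 1
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: worker
              image: busybox:1.36
              command: ["sh", "-c", "echo done"]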

Example use case

Distributed ML training on multiple TPU slices with Jax

The following example is a JobSet spec for running a TPU Multislice workload on 4 TPU v5e slices. To learn more about TPU concepts and terminology, please refer to these docs.

This example uses Jax, an ML framework with native support for Just-In-Time (JIT) compilation targeting TPU chips via OpenXLA. However, you can also use PyTorch/XLA to do ML training on TPUs.

This example makes use of several JobSet features (both explicitly and implicitly) to support the unique scheduling requirements of TPU multislice training out-of-the-box with very little configuration required by the user.

# Run a simple Jax workload on multiple TPU slices
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: multislice
  annotations:
    # Give each child Job exclusive usage of a TPU slice
    alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
spec:
  failurePolicy:
    maxRestarts: 3
  replicatedJobs:
  - name: workers
    replicas: 4 # Set to number of TPU slices
    template:
      spec:
        parallelism: 2 # Set to number of VMs per TPU slice
        completions: 2 # Set to number of VMs per TPU slice
        backoffLimit: 0
        template:
          spec:
            hostNetwork: true
            dnsPolicy: ClusterFirstWithHostNet
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
              cloud.google.com/gke-tpu-topology: 2x4
            containers:
            - name: jax-tpu
              image: python:3.8
              ports:
              - containerPort: 8471
              - containerPort: 8080
              securityContext:
                privileged: true
              command:
              - bash
              - -c
              - |
                pip install "jax[tpu]" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
                python -c 'import jax; print("Global device count:", jax.device_count())'
                sleep 60
              resources:
                limits:
                  google.com/tpu: 4

Future work and getting involved

We have a number of features planned for development this year; you can find them in the JobSet roadmap.

Please feel free to reach out with feedback of any kind. We're also open to additional contributors, whether it is to fix or report bugs, or help add new features or write documentation.

You can get in touch with us via our repo, mailing list or on Slack.

Last but not least, thanks to all our contributors who made this project possible!

23 Mar 2025 12:00am GMT

12 Mar 2025

feedKubernetes Blog

Spotlight on SIG Apps

In our ongoing SIG Spotlight series, we dive into the heart of the Kubernetes project by talking to the leaders of its various Special Interest Groups (SIGs). This time, we focus on SIG Apps, the group responsible for everything related to developing, deploying, and operating applications on Kubernetes. Sandipan Panda (DevZero) had the opportunity to interview Maciej Szulik (Defense Unicorns) and Janet Kuo (Google), the chairs and tech leads of SIG Apps. They shared their experiences, challenges, and visions for the future of application management within the Kubernetes ecosystem.

Introductions

Sandipan: Hello, could you start by telling us a bit about yourself, your role, and your journey within the Kubernetes community that led to your current roles in SIG Apps?

Maciej: Hey, my name is Maciej, and I'm one of the leads for SIG Apps. Aside from this role, you can also find me helping SIG CLI and also being one of the Steering Committee members. I've been contributing to Kubernetes since late 2014 in various areas, including controllers, apiserver, and kubectl.

Janet: Certainly! I'm Janet, a Staff Software Engineer at Google, and I've been deeply involved with the Kubernetes project since its early days, even before the 1.0 launch in 2015. It's been an amazing journey!

My current role within the Kubernetes community is one of the chairs and tech leads of SIG Apps. My journey with SIG Apps started organically. I started with building the Deployment API and adding rolling update functionalities. I naturally gravitated towards SIG Apps and became increasingly involved. Over time, I took on more responsibilities, culminating in my current leadership roles.

About SIG Apps

All following answers were jointly provided by Maciej and Janet.

Sandipan: For those unfamiliar, could you provide an overview of SIG Apps' mission and objectives? What key problems does it aim to solve within the Kubernetes ecosystem?

As described in our charter, we cover a broad area related to developing, deploying, and operating applications on Kubernetes. That, in short, means we're open to each and everyone showing up at our bi-weekly meetings and discussing the ups and downs of writing and deploying various applications on Kubernetes.

Sandipan: What are some of the most significant projects or initiatives currently being undertaken by SIG Apps?

At this point in time, the main factors driving the development of our controllers are the challenges coming from running various AI-related workloads. It's worth giving credit here to two working groups we've sponsored over the past years:

  1. The Batch Working Group, which is looking at running HPC, AI/ML, and data analytics jobs on top of Kubernetes.
  2. The Serving Working Group, which is focusing on hardware-accelerated AI/ML inference.

Best practices and challenges

Sandipan: SIG Apps plays a crucial role in developing application management best practices for Kubernetes. Can you share some of these best practices and how they help improve application lifecycle management?

  1. Implementing health checks and readiness probes ensures that your applications are healthy and ready to serve traffic, leading to improved reliability and uptime. The above, combined with comprehensive logging, monitoring, and tracing solutions, will provide insights into your application's behavior, enabling you to identify and resolve issues quickly.

  2. Auto-scale your application based on resource utilization or custom metrics, optimizing resource usage and ensuring your application can handle varying loads.

  3. Use Deployment for stateless applications, StatefulSet for stateful applications, Job and CronJob for batch workloads, and DaemonSet for running a daemon on each node. Use Operators and CRDs to extend the Kubernetes API to automate the deployment, management, and lifecycle of complex applications, making them easier to operate and reducing manual intervention.

Sandipan: What are some of the common challenges SIG Apps faces, and how do you address them?

The biggest challenge we're facing all the time is the need to reject a lot of features, ideas, and improvements. This requires a lot of discipline and patience to be able to explain the reasons behind those decisions.

Sandipan: How has the evolution of Kubernetes influenced the work of SIG Apps? Are there any recent changes or upcoming features in Kubernetes that you find particularly relevant or beneficial for SIG Apps?

The main benefit for both us and the whole community around SIG Apps is the ability to extend Kubernetes with Custom Resource Definitions, and the fact that users can build their own custom controllers, leveraging the built-in ones, to achieve whatever sophisticated use cases they might have that we, as the core maintainers, haven't considered or weren't able to efficiently resolve inside Kubernetes.

Contributing to SIG Apps

Sandipan: What opportunities are available for new contributors who want to get involved with SIG Apps, and what advice would you give them?

We get the question, "What good first issue might you recommend we start with?" a lot :-) But unfortunately, there's no easy answer to it. We always tell everyone that the best option to start contributing to core controllers is to find one you are willing to spend some time with. Read through the code, then try running unit tests and integration tests focusing on that controller. Once you grasp the general idea, try breaking it and the tests again to verify your breakage. Once you start feeling confident you understand that particular controller, you may want to search through open issues affecting that controller and either provide suggestions, explaining the problem users have, or maybe attempt your first fix.

Like we said, there are no shortcuts on that road; you need to spend the time with the codebase to understand all the edge cases we've slowly built up to get to the point where we are. Once you're successful with one controller, you'll need to repeat that same process with others all over again.

Sandipan: How does SIG Apps gather feedback from the community, and how is this feedback integrated into your work?

We always encourage everyone to show up and present their problems and solutions during our bi-weekly meetings. As long as you're solving an interesting problem on top of Kubernetes and you can provide valuable feedback about any of the core controllers, we're always happy to hear from everyone.

Looking ahead

Sandipan: Looking ahead, what are the key focus areas or upcoming trends in application management within Kubernetes that SIG Apps is excited about? How is the SIG adapting to these trends?

Definitely the current AI hype is the major driving factor; as mentioned above, we have two working groups, each covering a different aspect of it.

Sandipan: What are some of your favorite things about this SIG?

Without a doubt, the people who participate in our meetings and on Slack, who tirelessly help triage issues and pull requests, and who invest a lot of their time (very frequently their private time) into making Kubernetes great!


SIG Apps is an essential part of the Kubernetes community, helping to shape how applications are deployed and managed at scale. From its work on improving Kubernetes' workload APIs to driving innovation in AI/ML application management, SIG Apps is continually adapting to meet the needs of modern application developers and operators. Whether you're a new contributor or an experienced developer, there's always an opportunity to get involved and make an impact.

If you're interested in learning more or contributing to SIG Apps, be sure to check out their SIG README and join their bi-weekly meetings.

12 Mar 2025 12:00am GMT

04 Mar 2025

feedKubernetes Blog

Spotlight on SIG etcd

In this SIG etcd spotlight we talked with James Blair, Marek Siarkowicz, Wenjia Zhang, and Benjamin Wang to learn a bit more about this Kubernetes Special Interest Group.

Introducing SIG etcd

Frederico: Hello, thank you for the time! Let's start with some introductions, could you tell us a bit about yourself, your role and how you got involved in Kubernetes.

Benjamin: Hello, I am Benjamin. I am a SIG etcd Tech Lead and one of the etcd maintainers. I work for VMware, which is part of the Broadcom group. I got involved in Kubernetes & etcd & CSI (Container Storage Interface) because of work and also a big passion for open source. I have been working on Kubernetes & etcd (and also CSI) since 2020.

James: Hey team, I'm James, a co-chair for SIG etcd and etcd maintainer. I work at Red Hat as a Specialist Architect helping people adopt cloud native technology. I got involved with the Kubernetes ecosystem in 2019. Around the end of 2022 I noticed how the etcd community and project needed help, so I started contributing as often as I could. There is a saying in our community that "you come for the technology, and stay for the people": for me this is absolutely real, it's been a wonderful journey so far and I'm excited to support our community moving forward.

Marek: Hey everyone, I'm Marek, the SIG etcd lead. At Google, I lead the GKE etcd team, ensuring a stable and reliable experience for all GKE users. My Kubernetes journey began with SIG Instrumentation, where I created and led the Kubernetes Structured Logging effort.
I'm still the main project lead for Kubernetes Metrics Server, providing crucial signals for autoscaling in Kubernetes. I started working on etcd 3 years ago, right around the 3.5 release. We faced some challenges, but I'm thrilled to see etcd now the most scalable and reliable it's ever been, with the highest contribution numbers in the project's history. I'm passionate about distributed systems, extreme programming, and testing.

Wenjia: Hi there, my name is Wenjia, I am the co-chair of SIG etcd and one of the etcd maintainers. I work at Google as an Engineering Manager, working on GKE (Google Kubernetes Engine) and GDC (Google Distributed Cloud). I have been working in the area of open source Kubernetes and etcd since the Kubernetes v1.10 and etcd v3.1 releases. I got involved in Kubernetes because of my job, but what keeps me in the space is the charm of the container orchestration technology, and more importantly, the awesome open source community.

Becoming a Kubernetes Special Interest Group (SIG)

Frederico: Excellent, thank you. I'd like to start with the origin of the SIG itself: SIG etcd is a very recent SIG, could you quickly go through the history and reasons behind its creation?

Marek: Absolutely! SIG etcd was formed because etcd is a critical component of Kubernetes, serving as its data store. However, etcd was facing challenges like maintainer turnover and reliability issues. Creating a dedicated SIG allowed us to focus on addressing these problems, improving development and maintenance processes, and ensuring etcd evolves in sync with the cloud-native landscape.

Frederico: And has becoming a SIG worked out as expected? Better yet, are the motivations you just described being addressed, and to what extent?

Marek: It's been a positive change overall. Becoming a SIG has brought more structure and transparency to etcd's development. We've adopted Kubernetes processes like KEPs (Kubernetes Enhancement Proposals) and PRRs (Production Readiness Reviews), which has improved our feature development and release cycle.

Frederico: On top of those, what would you single out as the major benefit that has resulted from becoming a SIG?

Marek: The biggest benefit for me was adopting the Kubernetes testing infrastructure -- tools like Prow and TestGrid. For large projects like etcd there is just no comparison to the default GitHub tooling. Having known, easy-to-use, clear tools is a major boost to etcd, as it makes it much easier for Kubernetes contributors to also help etcd.

Wenjia: Totally agree, while challenges remain, the SIG structure provides a solid foundation for addressing them and ensuring etcd's continued success as a critical component of the Kubernetes ecosystem.

The positive impact on the community is another crucial aspect of SIG etcd's success that I'd like to highlight. The Kubernetes SIG structure has created a welcoming environment for etcd contributors, leading to increased participation from the broader Kubernetes community. We have had greater collaboration with other SIGs like SIG API Machinery, SIG Scalability, SIG Testing, SIG Cluster Lifecycle, etc.

This collaboration helps ensure etcd's development aligns with the needs of the wider Kubernetes ecosystem. The formation of the etcd Operator Working Group under the joint effort between SIG etcd and SIG Cluster Lifecycle exemplifies this successful collaboration, demonstrating a shared commitment to improving etcd's operational aspects within Kubernetes.

Frederico: Since you mentioned collaboration, have you seen changes in terms of contributors and community involvement in recent months?

James: Yes -- as shown in our unique PR author data, we recently hit an all-time high in March and are trending in a positive direction:

Unique PR author data stats

Additionally, looking at our overall contributions across all etcd project repositories we are also observing a positive trend showing a resurgence in etcd project activity:

Overall contributions stats

The road ahead

Frederico: That's quite telling, thank you. In terms of the near future, what are the current priorities for SIG etcd?

Marek: Reliability is always top of mind -- we need to make sure etcd is rock-solid. We're also working on making etcd easier to use and manage for operators. And we have our sights set on making etcd a viable standalone solution for infrastructure management, not just for Kubernetes. Oh, and of course, scaling -- we need to ensure etcd can handle the growing demands of the cloud-native world.

Benjamin: I agree that reliability should always be our top guiding principle. We need to ensure not only correctness but also compatibility. Additionally, we should continuously strive to improve the understandability and maintainability of etcd. Our focus should be on addressing the pain points that the community cares about the most.

Frederico: Are there any specific SIGs that you work closely with?

Marek: SIG API Machinery, for sure - they own the structure of the data etcd stores, so we're constantly working together. And SIG Cluster Lifecycle - etcd is a key part of Kubernetes clusters, so we collaborate on the newly created etcd Operator Working Group.

Wenjia: Other than SIG API Machinery and SIG Cluster Lifecycle, which Marek mentioned above, SIG Scalability and SIG Testing are other groups that we work closely with.

Frederico: In a more general sense, how would you list the key challenges for SIG etcd in the evolving cloud native landscape?

Marek: Well, reliability is always a challenge when you're dealing with critical data. The cloud-native world is evolving so fast that scaling to meet those demands is a constant effort.

Getting involved

Frederico: We're almost at the end of our conversation, but for those interested in etcd, how can they get involved?

Marek: We'd love to have them! The best way to start is to join our SIG etcd meetings, follow discussions on the etcd-dev mailing list, and check out our GitHub issues. We're always looking for people to review proposals, test code, and contribute to documentation.

Wenjia: I love this question 😀 . There are numerous ways for people interested in contributing to SIG etcd to get involved and make a difference. Here are some key areas where you can help:

Code Contributions:

Getting Started:

By contributing to etcd, you'll not only be helping to improve a critical piece of the cloud-native ecosystem but also gaining valuable experience and skills. So, jump in and start contributing!

Frederico: Excellent, thank you. Lastly, one piece of advice that you'd like to give to other newly formed SIGs?

Marek: Absolutely! My advice would be to embrace the established processes of the larger community, prioritize collaboration with other SIGs, and focus on building a strong community.

Wenjia: Here are some tips I myself found very helpful in my OSS journey:

Frederico: A great way to end this spotlight, thank you all!


For more information and resources, please take a look at :

  1. etcd website: https://etcd.io/
  2. etcd GitHub repository: https://github.com/etcd-io/etcd
  3. etcd community: https://etcd.io/community/

04 Mar 2025 12:00am GMT

28 Feb 2025

feedKubernetes Blog

NFTables mode for kube-proxy

A new nftables mode for kube-proxy was introduced as an alpha feature in Kubernetes 1.29. Currently in beta, it is expected to be GA as of 1.33. The new mode fixes long-standing performance problems with the iptables mode and all users running on systems with reasonably-recent kernels are encouraged to try it out. (For compatibility reasons, even once nftables becomes GA, iptables will still be the default.)

Why nftables? Part 1: data plane latency

The iptables API was designed for implementing simple firewalls, and has problems scaling up to support Service proxying in a large Kubernetes cluster with tens of thousands of Services.

In general, the ruleset generated by kube-proxy in iptables mode has a number of iptables rules proportional to the sum of the number of Services and the total number of endpoints. In particular, at the top level of the ruleset, there is one rule to test each possible Service IP (and port) that a packet might be addressed to:

# If the packet is addressed to 172.30.0.41:80, then jump to the chain
# KUBE-SVC-XPGD46QRK7WJZT7O for further processing
-A KUBE-SERVICES -m comment --comment "namespace1/service1:p80 cluster IP" -m tcp -p tcp -d 172.30.0.41 --dport 80 -j KUBE-SVC-XPGD46QRK7WJZT7O
# If the packet is addressed to 172.30.0.42:443, then...
-A KUBE-SERVICES -m comment --comment "namespace2/service2:p443 cluster IP" -m tcp -p tcp -d 172.30.0.42 --dport 443 -j KUBE-SVC-GNZBNJ2PO5MGZ6GT
# etc...
-A KUBE-SERVICES -m comment --comment "namespace3/service3:p80 cluster IP" -m tcp -p tcp -d 172.30.0.43 --dport 80 -j KUBE-SVC-X27LE4BHSL4DOUIK

This means that when a packet comes in, the time it takes the kernel to check it against all of the Service rules is O(n) in the number of Services. As the number of Services increases, both the average and the worst-case latency for the first packet of a new connection increases (with the difference between best-case, average, and worst-case being mostly determined by whether a given Service IP address appears earlier or later in the KUBE-SERVICES chain).

kube-proxy iptables first packet latency, at various percentiles, in clusters of various sizes

By contrast, with nftables, the normal way to write a ruleset like this is to have a single rule, using a "verdict map" to do the dispatch:

table ip kube-proxy {
    # The service-ips verdict map indicates the action to take for each matching packet.
    map service-ips {
        type ipv4_addr . inet_proto . inet_service : verdict
        comment "ClusterIP, ExternalIP and LoadBalancer IP traffic"
        elements = { 172.30.0.41 . tcp . 80 : goto service-ULMVA6XW-namespace1/service1/tcp/p80,
                     172.30.0.42 . tcp . 443 : goto service-42NFTM6N-namespace2/service2/tcp/p443,
                     172.30.0.43 . tcp . 80 : goto service-4AT6LBPK-namespace3/service3/tcp/p80,
                     ... }
    }

    # Now we just need a single rule to process all packets matching an
    # element in the map. (This rule says, "construct a tuple from the
    # destination IP address, layer 4 protocol, and destination port; look
    # that tuple up in "service-ips"; and if there's a match, execute the
    # associated verdict.")
    chain services {
        ip daddr . meta l4proto . th dport vmap @service-ips
    }
    ...
}

Since there's only a single rule, with a roughly O(1) map lookup, packet processing time is more or less constant regardless of cluster size, and the best/average/worst cases are very similar:

kube-proxy nftables first packet latency, at various percentiles, in clusters of various sizes

But note the huge difference in the vertical scale between the iptables and nftables graphs! In the clusters with 5000 and 10,000 Services, the p50 (average) latency for nftables is about the same as the p01 (approximately best-case) latency for iptables. In the 30,000 Service cluster, the p99 (approximately worst-case) latency for nftables manages to beat out the p01 latency for iptables by a few microseconds! Here are both sets of data together, but you may have to squint to see the nftables results:

kube-proxy iptables-vs-nftables first packet latency, at various percentiles, in clusters of various sizes

Why nftables? Part 2: control plane latency

While the improvements to data plane latency in large clusters are great, there's another problem with iptables kube-proxy that often keeps users from even being able to grow their clusters to that size: the time it takes kube-proxy to program new iptables rules when Services and their endpoints change.

With both iptables and nftables, the total size of the ruleset as a whole (actual rules, plus associated data) is O(n) in the combined number of Services and their endpoints. Originally, the iptables backend would rewrite every rule on every update, and with tens of thousands of Services, this could grow to be hundreds of thousands of iptables rules. Starting in Kubernetes 1.26, we began improving kube-proxy so that it could skip updating most of the unchanged rules in each update, but the limitations of iptables-restore as an API meant that it was still always necessary to send an update that's O(n) in the number of Services (though with a noticeably smaller constant than it used to be). Even with those optimizations, it can still be necessary to make use of kube-proxy's minSyncPeriod config option to ensure that it doesn't spend every waking second trying to push iptables updates.
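
As a minimal sketch of that mitigation (the 10s value is an arbitrary illustration, not a recommendation), setting minSyncPeriod in a kube-proxy config file looks like this:

apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "iptables"
iptables:
  # Wait at least this long between rule syncs, batching bursts of
  # Service/endpoint changes into fewer iptables-restore calls.
  minSyncPeriod: 10s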

The nftables APIs allow for doing much more incremental updates, and when kube-proxy in nftables mode does an update, the size of the update is only O(n) in the number of Services and endpoints that have changed since the last sync, regardless of the total number of Services and endpoints. The fact that the nftables API allows each nftables-using component to have its own private table also means that there is no global lock contention between components like with iptables. As a result, kube-proxy's nftables updates can be done much more efficiently than with iptables.

(Unfortunately I don't have cool graphs for this part.)

Why not nftables?

All that said, there are a few reasons why you might not want to jump right into using the nftables backend for now.

First, the code is still fairly new. While it has plenty of unit tests, performs correctly in our CI system, and has now been used in the real world by multiple users, it has not seen anything close to as much real-world usage as the iptables backend has, so we can't promise that it is as stable and bug-free.

Second, the nftables mode will not work on older Linux distributions; currently it requires a 5.13 or newer kernel. Additionally, because of bugs in early versions of the nft command line tool, you should not run kube-proxy in nftables mode on nodes that have an old (earlier than 1.0.0) version of nft in the host filesystem (or else kube-proxy's use of nftables may interfere with other uses of nftables on the system).

Third, you may have other networking components in your cluster, such as the pod network or NetworkPolicy implementation, that do not yet support kube-proxy in nftables mode. You should consult the documentation (or forums, bug tracker, etc.) for any such components to see if they have problems with nftables mode. (In many cases they will not; as long as they don't try to directly interact with or override kube-proxy's iptables rules, they shouldn't care whether kube-proxy is using iptables or nftables.) Additionally, observability and monitoring tools that have not been updated may report less data for kube-proxy in nftables mode than they do for kube-proxy in iptables mode.

Finally, kube-proxy in nftables mode is intentionally not 100% compatible with kube-proxy in iptables mode. There are a few old kube-proxy features whose default behaviors are less secure, less performant, or less intuitive than we'd like, but where we felt that changing the default would be a compatibility break. Since the nftables mode is opt-in, this gave us a chance to fix those bad defaults without breaking users who weren't expecting changes. (In particular, with nftables mode, NodePort Services are now only reachable on their nodes' default IPs, as opposed to being reachable on all IPs, including 127.0.0.1, with iptables mode.) The kube-proxy documentation has more information about this, including information about metrics you can look at to determine if you are relying on any of the changed functionality, and what configuration options are available to get more backward-compatible behavior.

Trying out nftables mode

Ready to try it out? In Kubernetes 1.31 and later, you just need to pass --proxy-mode nftables to kube-proxy (or set mode: nftables in your kube-proxy config file).
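
For example, a minimal kube-proxy configuration file selecting the new mode might look like the following sketch (a real configuration will usually carry additional settings, such as clusterCIDR):

apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
# Select the nftables backend instead of the default iptables mode.
mode: "nftables"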

If you are using kubeadm to set up your cluster, the kubeadm documentation explains how to pass a KubeProxyConfiguration to kubeadm init. You can also deploy nftables-based clusters with kind.

You can also convert existing clusters from iptables (or ipvs) mode to nftables by updating the kube-proxy configuration and restarting the kube-proxy pods. (You do not need to reboot the nodes: when restarting in nftables mode, kube-proxy will delete any existing iptables or ipvs rules, and likewise, if you later revert back to iptables or ipvs mode, it will delete any existing nftables rules.)

Future plans

As mentioned above, while nftables is now the best kube-proxy mode, it is not the default, and we do not yet have a plan for changing that. We will continue to support the iptables mode for a long time.

The future of the IPVS mode of kube-proxy is less certain: its main advantage over iptables was that it was faster, but certain aspects of the IPVS architecture and APIs were awkward for kube-proxy's purposes (for example, the fact that the kube-ipvs0 device needs to have every Service IP address assigned to it), and some parts of Kubernetes Service proxying semantics were difficult to implement using IPVS (particularly the fact that some Services had to have different endpoints depending on whether you connected to them from a local or remote client). And now, the nftables mode has the same performance as IPVS mode (actually, slightly better), without any of the downsides:

kube-proxy ipvs-vs-nftables first packet latency, at various percentiles, in clusters of various sizes

(In theory the IPVS mode also has the advantage of being able to use various other IPVS functionality, like alternative "schedulers" for balancing endpoints. In practice, this ended up not being very useful, because kube-proxy runs independently on every node, and the IPVS schedulers on each node had no way of sharing their state with the proxies on other nodes, thus thwarting the effort to balance traffic more cleverly.)

While the Kubernetes project does not have an immediate plan to drop the IPVS backend, it is probably doomed in the long run, and people who are currently using IPVS mode should try out the nftables mode instead (and file bugs if you think there is missing functionality in nftables mode that you can't work around).

Learn more

28 Feb 2025 12:00am GMT

14 Feb 2025

feedKubernetes Blog

The Cloud Controller Manager Chicken and Egg Problem

Kubernetes 1.31 completed the largest migration in Kubernetes history, removing the in-tree cloud provider. While the component migration is now done, this leaves some additional complexity for users and installer projects (for example, kOps or Cluster API). We will go over those additional steps and failure points and make recommendations for cluster owners. This migration was complex, and some logic had to be extracted from the core components into four new subsystems.

  1. Cloud controller manager (KEP-2392)
  2. API server network proxy (KEP-1281)
  3. kubelet credential provider plugins (KEP-2133)
  4. Storage migration to use CSI (KEP-625)

The cloud controller manager is part of the control plane. It is a critical component that replaces some functionality that existed previously in the kube-controller-manager and the kubelet.

Components of Kubernetes

One of the most critical functionalities of the cloud controller manager is the node controller, which is responsible for the initialization of the nodes.

As you can see in the following diagram, when the kubelet starts, it registers the Node object with the apiserver, tainting the node so it can be processed first by the cloud-controller-manager. The initial Node is missing the cloud-provider-specific information, like the node addresses and the labels carrying details such as the zone, region, and instance type.

Chicken and egg problem sequence diagram

This new initialization process adds some latency to node readiness. Previously, the kubelet was able to initialize the node at the same time it created it. Since the logic has moved to the cloud-controller-manager, this can cause a chicken-and-egg problem during cluster bootstrapping for those Kubernetes architectures that do not deploy the controller manager in the same way as the other components of the control plane, commonly as static pods, standalone binaries, or daemonsets/deployments with tolerations for the taints and using hostNetwork (more on this below).

Examples of the dependency problem

As noted above, it is possible during bootstrapping for the cloud-controller-manager to be unschedulable and as such the cluster will not initialize properly. The following are a few concrete examples of how this problem can be expressed and the root causes for why they might occur.

These examples assume you are running your cloud-controller-manager using a Kubernetes resource (e.g. Deployment, DaemonSet, or similar) to control its lifecycle. Because these methods rely on Kubernetes to schedule the cloud-controller-manager, care must be taken to ensure it will schedule properly.

Example: Cloud controller manager not scheduling due to uninitialized taint

As noted in the Kubernetes documentation, when the kubelet is started with the command line flag --cloud-provider=external, its corresponding Node object will have a no schedule taint named node.cloudprovider.kubernetes.io/uninitialized added. Because the cloud-controller-manager is responsible for removing the no schedule taint, this can create a situation where a cloud-controller-manager that is being managed by a Kubernetes resource, such as a Deployment or DaemonSet, may not be able to schedule.
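
For reference, here is a sketch of how that taint appears on a freshly registered, not-yet-initialized Node (the node name is hypothetical; fields are abridged):

apiVersion: v1
kind: Node
metadata:
  name: node-a # hypothetical node name
spec:
  taints:
  # Added by the kubelet when started with --cloud-provider=external;
  # removed by the cloud-controller-manager once the node is initialized.
  - key: node.cloudprovider.kubernetes.io/uninitialized
    value: "true"
    effect: NoSchedule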

If the cloud-controller-manager is not able to be scheduled during the initialization of the control plane, then the resulting Node objects will all have the node.cloudprovider.kubernetes.io/uninitialized no schedule taint. It also means that this taint will not be removed as the cloud-controller-manager is responsible for its removal. If the no schedule taint is not removed, then critical workloads, such as the container network interface controllers, will not be able to schedule, and the cluster will be left in an unhealthy state.

Example: Cloud controller manager not scheduling due to not-ready taint

The next example would be possible in situations where the container network interface (CNI) is waiting for IP address information from the cloud-controller-manager (CCM), and the CCM has not tolerated the taint which would be removed by the CNI.

The Kubernetes documentation describes the node.kubernetes.io/not-ready taint as follows:

"The Node controller detects whether a Node is ready by monitoring its health and adds or removes this taint accordingly."

One of the conditions that can lead to a Node resource having this taint is when the container network has not yet been initialized on that node. As the cloud-controller-manager is responsible for adding the IP addresses to a Node resource, and the IP addresses are needed by the container network controllers to properly configure the container network, it is possible in some circumstances for a node to become stuck as not ready and uninitialized permanently.

This situation occurs for a similar reason as the first example, although in this case, the node.kubernetes.io/not-ready taint is used with the no execute effect and thus will cause the cloud-controller-manager not to run on the node with the taint. If the cloud-controller-manager is not able to execute, then it will not initialize the node. It will cascade into the container network controllers not being able to run properly, and the node will end up carrying both the node.cloudprovider.kubernetes.io/uninitialized and node.kubernetes.io/not-ready taints, leaving the cluster in an unhealthy state.

Our Recommendations

There is no one "correct way" to run a cloud-controller-manager. The details will depend on the specific needs of the cluster administrators and users. When planning your clusters and the lifecycle of the cloud-controller-managers please consider the following guidance:

For cloud-controller-managers running in the same cluster they are managing:

  1. Use host network mode, rather than the pod network: in most cases, a cloud controller manager will need to communicate with an API service endpoint associated with the infrastructure. Setting "hostNetwork" to true will ensure that the cloud controller is using the host networking instead of the container network and, as such, will have the same network access as the host operating system. It will also remove the dependency on the networking plugin. This will ensure that the cloud controller has access to the infrastructure endpoint (always check your networking configuration against your infrastructure provider's instructions).
  2. Use a scalable resource type. Deployments and DaemonSets are useful for controlling the lifecycle of a cloud controller. They allow easy access to running multiple copies for redundancy as well as using the Kubernetes scheduling to ensure proper placement in the cluster. When using these primitives to control the lifecycle of your cloud controllers and running multiple replicas, you must remember to enable leader election, or else your controllers will collide with each other which could lead to nodes not being initialized in the cluster.
  3. Target the controller manager containers to the control plane. There might exist other controllers which need to run outside the control plane (for example, Azure's node manager controller). Still, the controller managers themselves should be deployed to the control plane. Use a node selector or affinity stanza to direct the scheduling of cloud controllers to the control plane to ensure that they are running in a protected space. Cloud controllers are vital to adding and removing nodes to a cluster as they form a link between Kubernetes and the physical infrastructure. Running them on the control plane will help to ensure that they run with a similar priority as other core cluster controllers and that they have some separation from non-privileged user workloads.
    1. It is worth noting that an anti-affinity stanza to prevent cloud controllers from running on the same host is also very useful to ensure that a single node failure will not degrade the cloud controller performance.
  4. Ensure that the tolerations allow operation. Use tolerations on the manifest for the cloud controller container to ensure that it will schedule to the correct nodes and that it can run in situations where a node is initializing. This means that cloud controllers should tolerate the node.cloudprovider.kubernetes.io/uninitialized taint, and it should also tolerate any taints associated with the control plane (for example, node-role.kubernetes.io/control-plane or node-role.kubernetes.io/master). It can also be useful to tolerate the node.kubernetes.io/not-ready taint to ensure that the cloud controller can run even when the node is not yet available for health monitoring.

For cloud-controller-managers that will not be running on the cluster they manage (for example, in a hosted control plane on a separate cluster), then the rules are much more constrained by the dependencies of the environment of the cluster running the cloud-controller-manager. The advice for running on a self-managed cluster may not be appropriate as the types of conflicts and network constraints will be different. Please consult the architecture and requirements of your topology for these scenarios.

Example

This is an example of a Kubernetes Deployment highlighting the guidance shown above. It is important to note that this is for demonstration purposes only; for production use, please consult your cloud provider's documentation.

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/name: cloud-controller-manager
  name: cloud-controller-manager
  namespace: kube-system
spec:
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: cloud-controller-manager
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app.kubernetes.io/name: cloud-controller-manager
      annotations:
        kubernetes.io/description: Cloud controller manager for my infrastructure
    spec:
      containers: # the container details will depend on your specific cloud controller manager
      - name: cloud-controller-manager
        command:
        - /bin/my-infrastructure-cloud-controller-manager
        - --leader-elect=true
        - -v=1
        image: registry/my-infrastructure-cloud-controller-manager@latest
        resources:
          requests:
            cpu: 200m
            memory: 50Mi
      hostNetwork: true # these Pods are part of the control plane
      nodeSelector:
        node-role.kubernetes.io/control-plane: ""
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - topologyKey: "kubernetes.io/hostname"
            labelSelector:
              matchLabels:
                app.kubernetes.io/name: cloud-controller-manager
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
        operator: Exists
      - effect: NoExecute
        key: node.kubernetes.io/unreachable
        operator: Exists
        tolerationSeconds: 120
      - effect: NoExecute
        key: node.kubernetes.io/not-ready
        operator: Exists
        tolerationSeconds: 120
      - effect: NoSchedule
        key: node.cloudprovider.kubernetes.io/uninitialized
        operator: Exists
      - effect: NoSchedule
        key: node.kubernetes.io/not-ready
        operator: Exists

When deciding how to deploy your cloud controller manager it is worth noting that cluster-proportional, or resource-based, pod autoscaling is not recommended. Running multiple replicas of a cloud controller manager is good practice for ensuring high-availability and redundancy, but does not contribute to better performance. In general, only a single instance of a cloud controller manager will be reconciling a cluster at any given time.

14 Feb 2025 12:00am GMT

21 Jan 2025

feedKubernetes Blog

Spotlight on SIG Architecture: Enhancements

This is the fourth interview of a SIG Architecture Spotlight series covering its different subprojects; this time, we look at SIG Architecture: Enhancements.

In this SIG Architecture spotlight we talked with Kirsten Garrison, lead of the Enhancements subproject.

The Enhancements subproject

Frederico (FSM): Hi Kirsten, very happy to have the opportunity to talk about the Enhancements subproject. Let's start with some quick information about yourself and your role.

Kirsten Garrison (KG): I'm a lead of the Enhancements subproject of SIG-Architecture and currently work at Google. I first got involved by contributing to the service-catalog project with the help of Carolyn Van Slyck. With time, I joined the Release team, eventually becoming the Enhancements Lead and a Release Lead shadow. While on the release team, I worked on some ideas to make the process better for the SIGs and Enhancements team (the opt-in process) based on my team's experiences. Eventually, I started attending Subproject meetings and contributing to the Subproject's work.

FSM: You mentioned the Enhancements subproject: how would you describe its main goals and areas of intervention?

KG: The Enhancements Subproject primarily concerns itself with the Kubernetes Enhancement Proposal (KEP for short): the "design" documents required for all features and significant changes to the Kubernetes project.

The KEP and its impact

FSM: The improvement of the KEP process was (and is) one in which SIG Architecture was heavily involved. Could you explain the process to those that aren't aware of it?

KG: Every release, the SIGs let the Release Team know which features they intend to work on to be put into the release. As mentioned above, the prerequisite for these changes is a KEP - a standardized design document that all authors must fill out and approve in the first weeks of the release cycle. Most features will move through 3 phases: alpha, beta, and finally GA, so approving a feature represents a significant commitment for the SIG.

The KEP serves as the full source of truth of a feature. The KEP template has different requirements based on what stage a feature is in, but it generally requires a detailed discussion of the design and the impact as well as providing artifacts of stability and performance. The KEP takes quite a bit of iterative work between authors, SIG reviewers, the API review team and the Production Readiness Review team1 before it is approved. Each set of reviewers is looking to make sure that the proposal meets their standards in order to have a stable and performant Kubernetes release. Only after all approvals are secured can an author go forth and merge their feature into the Kubernetes code base.

FSM: I see, quite a bit of additional structure was added. Looking back, what were the most significant improvements of that approach?

KG: In general, I think that the improvements with the most impact had to do with focusing on the core intent of the KEP. KEPs exist not just to memorialize designs, but provide a structured way to discuss and come to an agreement about different facets of the change. At the core of the KEP process is communication and consideration.

To that end, some of the significant changes revolve around a more detailed and accessible KEP template. A significant amount of work was put in over time to get the k/enhancements repo into its current form -- a directory structure organized by SIG with the contours of the modern KEP template (with Proposal/Motivation/Design Details subsections). We might take that basic structure for granted today, but it really represents the work of many people trying to get the foundation of this process in place over time.

As Kubernetes matures, we've needed to think about more than just the end goal of getting a single feature merged. We need to think about things like: stability, performance, setting and meeting user expectations. And as we've thought about those things the template has grown more detailed. The addition of the Production Readiness Review was major as well as the enhanced testing requirements (varying at different stages of a KEP's lifecycle).

Current areas of focus

FSM: Speaking of maturing, we've recently released Kubernetes v1.31, and work on v1.32 has started. Are there any areas that the Enhancements sub-project is currently addressing that might change the way things are done?

KG: We're currently working on two things:

  1. Creating a Process KEP template. Sometimes people want to harness the KEP process for significant changes that are more process oriented rather than feature oriented. We want to support this because memorializing changes is important and giving people a better tool to do so will only encourage more discussion and transparency.
  2. KEP versioning. While our template changes aim to be as non-disruptive as possible, we believe that it will be easier to track and communicate those changes to the community better with a versioned KEP template and the policies that go alongside such versioning.

Both features will take some time to get right and fully roll out (just like a KEP feature) but we believe that they will both provide improvements that will benefit the community at large.

FSM: You mentioned improvements: I remember when project boards for Enhancement tracking were introduced in recent releases, to great effect and unanimous applause from release team members. Was this a particular area of focus for the subproject?

KG: The Subproject provided support to the Release Team's Enhancement team in the migration away from using the spreadsheet to a project board. The collection and tracking of enhancements has always been a logistical challenge. During my time on the Release Team, I helped with the transition to an opt-in system of enhancements, whereby the SIG leads "opt-in" KEPs for release tracking. This helped to enhance communication between authors and SIGs before any significant work was undertaken on a KEP and removed toil from the Enhancements team. This change used the existing tools to avoid introducing too many changes at once to the community. Later, the Release Team approached the Subproject with an idea of leveraging GitHub Project Boards to further improve the collection process. This was to be a move away from the use of complicated spreadsheets to using repo-native labels on k/enhancement issues and project boards.

FSM: That surely adds an impact on simplifying the workflow...

KG: Removing sources of friction and promoting clear communication is very important to the Enhancements Subproject. At the same time, it's important to give careful consideration to decisions that impact the community as a whole. We want to make sure that changes are balanced to give an upside while not causing regressions or pain in the rollout. We supported the Release Team in ideation as well as through the actual migration to the project boards. It was a great success and exciting to see the team make high impact changes that helped everyone involved in the KEP process!

Getting involved

FSM: For those reading that might be curious and interested in helping, how would you describe the required skills for participating in the sub-project?

KG: Familiarity with KEPs either via experience or taking time to look through the kubernetes/enhancements repo is helpful. All are welcome to participate if interested - we can take it from there.

FSM: Excellent! Many thanks for your time and insight -- any final comments you would like to share with our readers?

KG: The Enhancements process is one of the most important parts of Kubernetes and requires enormous amounts of coordination and collaboration of people and teams across the project to make it successful. I'm thankful and inspired by everyone's continued hard work and dedication to making the project great. This is truly a wonderful community.


  1. For more information, check the Production Readiness Review spotlight interview in this series. ↩︎

21 Jan 2025 12:00am GMT

18 Dec 2024

feedKubernetes Blog

Kubernetes 1.32: Moving Volume Group Snapshots to Beta

Volume group snapshots were introduced as an Alpha feature with the Kubernetes 1.27 release. The recent release of Kubernetes v1.32 moved that support to beta. The support for volume group snapshots relies on a set of extension APIs for group snapshots. These APIs allow users to take crash consistent snapshots for a set of volumes. Behind the scenes, Kubernetes uses a label selector to group multiple PersistentVolumeClaims for snapshotting. A key aim is to allow you to restore that set of snapshots to new volumes and recover your workload from a crash consistent recovery point.

This new feature is only supported for CSI volume drivers.

An overview of volume group snapshots

Some storage systems provide the ability to create a crash consistent snapshot of multiple volumes. A group snapshot represents copies made from multiple volumes, that are taken at the same point-in-time. A group snapshot can be used either to rehydrate new volumes (pre-populated with the snapshot data) or to restore existing volumes to a previous state (represented by the snapshots).

Why add volume group snapshots to Kubernetes?

The Kubernetes volume plugin system already provides a powerful abstraction that automates the provisioning, attaching, mounting, resizing, and snapshotting of block and file storage.

Underpinning all these features is the Kubernetes goal of workload portability: Kubernetes aims to create an abstraction layer between distributed applications and underlying clusters so that applications can be agnostic to the specifics of the cluster they run on and application deployment requires no cluster specific knowledge.

There was already a VolumeSnapshot API that provides the ability to take a snapshot of a persistent volume to protect against data loss or data corruption. However, there are other snapshotting functionalities not covered by the VolumeSnapshot API.

Some storage systems support consistent group snapshots that allow a snapshot to be taken from multiple volumes at the same point-in-time to achieve write order consistency. This can be useful for applications that contain multiple volumes. For example, an application may have data stored in one volume and logs stored in another volume. If snapshots for the data volume and the logs volume are taken at different times, the application will not be consistent and will not function properly if it is restored from those snapshots when a disaster strikes.

It is true that you can quiesce the application first, take an individual snapshot from each volume that is part of the application one after the other, and then unquiesce the application after all the individual snapshots are taken. This way, you would get application consistent snapshots.

However, sometimes the application quiesce can be so time consuming that you want to do it less frequently, or it may not be possible to quiesce an application at all. For example, a user may want to run weekly backups with application quiesce and nightly backups without application quiesce but with consistent group support which provides crash consistency across all volumes in the group.

Kubernetes APIs for volume group snapshots

Kubernetes' support for volume group snapshots relies on three API kinds that are used for managing snapshots:

VolumeGroupSnapshot
Created by a Kubernetes user (or perhaps by your own automation) to request creation of a volume group snapshot for multiple persistent volume claims. It contains information about the volume group snapshot operation such as the timestamp when the volume group snapshot was taken and whether it is ready to use. The creation and deletion of this object represents a desire to create or delete a cluster resource (a group snapshot).
VolumeGroupSnapshotContent
Created by the snapshot controller for a dynamically created VolumeGroupSnapshot. It contains information about the volume group snapshot including the volume group snapshot ID. This object represents a provisioned resource on the cluster (a group snapshot). The VolumeGroupSnapshotContent object binds to the VolumeGroupSnapshot for which it was created with a one-to-one mapping.
VolumeGroupSnapshotClass
Created by cluster administrators to describe how volume group snapshots should be created, including the driver information, the deletion policy, etc.

These three API kinds are defined as CustomResourceDefinitions (CRDs). These CRDs must be installed in a Kubernetes cluster for a CSI Driver to support volume group snapshots.

What components are needed to support volume group snapshots

Volume group snapshots are implemented in the external-snapshotter repository. Implementing volume group snapshots meant adding or changing several components.

The volume snapshot controller and CRDs are deployed once per cluster, while the sidecar is bundled with each CSI driver.

Therefore, it makes sense to deploy the volume snapshot controller and CRDs as a cluster addon.

The Kubernetes project recommends that Kubernetes distributors bundle and deploy the volume snapshot controller and CRDs as part of their Kubernetes cluster management process (independent of any CSI Driver).

What's new in Beta?

How do I use Kubernetes volume group snapshots

Creating a new group snapshot with Kubernetes

Once a VolumeGroupSnapshotClass object is defined and you have volumes you want to snapshot together, you may request a new group snapshot by creating a VolumeGroupSnapshot object.

The source of the group snapshot specifies whether the underlying group snapshot should be dynamically created or if a pre-existing VolumeGroupSnapshotContent should be used.

A pre-existing VolumeGroupSnapshotContent is created by a cluster administrator. It contains the details of the real volume group snapshot on the storage system which is available for use by cluster users.

One of two members in the source of the group snapshot must be set: a label selector (for dynamic provisioning) or a volumeGroupSnapshotContentName (for a pre-existing group snapshot).

Dynamically provision a group snapshot

In the following example, there are two PVCs.

NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS VOLUMEATTRIBUTESCLASS AGE
pvc-0 Bound pvc-6e1f7d34-a5c5-4548-b104-01e72c72b9f2 100Mi RWO csi-hostpath-sc <unset> 2m15s
pvc-1 Bound pvc-abc640b3-2cc1-4c56-ad0c-4f0f0e636efa 100Mi RWO csi-hostpath-sc <unset> 2m7s

Label the PVCs.

% kubectl label pvc pvc-0 group=myGroup
persistentvolumeclaim/pvc-0 labeled

% kubectl label pvc pvc-1 group=myGroup
persistentvolumeclaim/pvc-1 labeled

For dynamic provisioning, a selector must be set so that the snapshot controller can find PVCs with the matching labels to be snapshotted together.

apiVersion: groupsnapshot.storage.k8s.io/v1beta1
kind: VolumeGroupSnapshot
metadata:
  name: snapshot-daily-20241217
  namespace: demo-namespace
spec:
  volumeGroupSnapshotClassName: csi-groupSnapclass
  source:
    selector:
      matchLabels:
        group: myGroup

In the VolumeGroupSnapshot spec, a user can specify the VolumeGroupSnapshotClass which has the information about which CSI driver should be used for creating the group snapshot. A VolumeGroupSnapshotClass is required for dynamic provisioning.

apiVersion: groupsnapshot.storage.k8s.io/v1beta1
kind: VolumeGroupSnapshotClass
metadata:
  name: csi-groupSnapclass
  annotations:
    kubernetes.io/description: "Example group snapshot class"
driver: example.csi.k8s.io
deletionPolicy: Delete

As a result of the volume group snapshot creation, a corresponding VolumeGroupSnapshotContent object will be created with a volumeGroupSnapshotHandle pointing to a resource on the storage system.

Two individual volume snapshots will be created as part of the volume group snapshot creation.

NAME READYTOUSE SOURCEPVC RESTORESIZE SNAPSHOTCONTENT AGE
snapshot-0962a745b2bf930bb385b7b50c9b08af471f1a16780726de19429dd9c94eaca0 true pvc-0 100Mi snapcontent-0962a745b2bf930bb385b7b50c9b08af471f1a16780726de19429dd9c94eaca0 16m
snapshot-da577d76bd2106c410616b346b2e72440f6ec7b12a75156263b989192b78caff true pvc-1 100Mi snapcontent-da577d76bd2106c410616b346b2e72440f6ec7b12a75156263b989192b78caff 16m

Importing an existing group snapshot with Kubernetes

To import a pre-existing volume group snapshot into Kubernetes, you must also import the corresponding individual volume snapshots.

Identify the individual volume snapshot handles, manually construct a VolumeSnapshotContent object first, then create a VolumeSnapshot object pointing to the VolumeSnapshotContent object. Repeat this for every individual volume snapshot.
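
As a sketch of that per-snapshot step, importing one of the individual snapshots could look like the following. The object names here are hypothetical, and the snapshotHandle reuses a handle from the group example below:

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotContent
metadata:
  name: static-content-0 # hypothetical name
spec:
  deletionPolicy: Delete
  driver: hostpath.csi.k8s.io
  source:
    # The handle of the pre-existing snapshot on the storage system.
    snapshotHandle: e8779147-a93e-11ef-9549-66940726f2fd
  volumeSnapshotRef:
    name: static-snapshot-0
    namespace: demo-namespace
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: static-snapshot-0 # hypothetical name
  namespace: demo-namespace
spec:
  source:
    volumeSnapshotContentName: static-content-0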

Then manually create a VolumeGroupSnapshotContent object, specifying the volumeGroupSnapshotHandle and individual volumeSnapshotHandles already existing on the storage system.

apiVersion: groupsnapshot.storage.k8s.io/v1beta1
kind: VolumeGroupSnapshotContent
metadata:
  name: static-group-content
spec:
  deletionPolicy: Delete
  driver: hostpath.csi.k8s.io
  source:
    groupSnapshotHandles:
      volumeGroupSnapshotHandle: e8779136-a93e-11ef-9549-66940726f2fd
      volumeSnapshotHandles:
      - e8779147-a93e-11ef-9549-66940726f2fd
      - e8783cd0-a93e-11ef-9549-66940726f2fd
  volumeGroupSnapshotRef:
    name: static-group-snapshot
    namespace: demo-namespace

After that, create a VolumeGroupSnapshot object pointing to the VolumeGroupSnapshotContent object.

apiVersion: groupsnapshot.storage.k8s.io/v1beta1
kind: VolumeGroupSnapshot
metadata:
  name: static-group-snapshot
  namespace: demo-namespace
spec:
  source:
    volumeGroupSnapshotContentName: static-group-content

How to use group snapshot for restore in Kubernetes

At restore time, the user can request a new PersistentVolumeClaim to be created from a VolumeSnapshot object that is part of a VolumeGroupSnapshot. This will trigger provisioning of a new volume that is pre-populated with data from the specified snapshot. The user should repeat this until all volumes are created from all the snapshots that are part of a group snapshot.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: examplepvc-restored-2024-12-17
  namespace: demo-namespace
spec:
  storageClassName: example-foo-nearline
  dataSource:
    name: snapshot-0962a745b2bf930bb385b7b50c9b08af471f1a16780726de19429dd9c94eaca0
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
  - ReadWriteOncePod
  resources:
    requests:
      storage: 100Mi # must be enough storage to fit the existing snapshot

As a storage vendor, how do I add support for group snapshots to my CSI driver?

To implement the volume group snapshot feature, a CSI driver must support the group controller service and the volume group snapshot RPCs defined in the CSI specification.

See the CSI spec and the Kubernetes-CSI Driver Developer Guide for more details.

As mentioned earlier, it is strongly recommended that Kubernetes distributors bundle and deploy the volume snapshot controller and CRDs as part of their Kubernetes cluster management process (independent of any CSI Driver).

As part of this recommended deployment process, the Kubernetes team provides a number of sidecar (helper) containers, including the external-snapshotter sidecar container which has been updated to support volume group snapshot.

The external-snapshotter watches the Kubernetes API server for VolumeGroupSnapshotContent objects, and triggers CreateVolumeGroupSnapshot and DeleteVolumeGroupSnapshot operations against a CSI endpoint.

What are the limitations?

The beta implementation of volume group snapshots for Kubernetes has some limitations.

What's next?

Depending on feedback and adoption, the Kubernetes project plans to push the volume group snapshot implementation to general availability (GA) in a future release.

How can I learn more?

How do I get involved?

This project, like all of Kubernetes, is the result of hard work by many contributors from diverse backgrounds working together. On behalf of SIG Storage, I would like to offer a huge thank you to the contributors who stepped up these last few quarters to help the project reach beta.

For those interested in getting involved with the design and development of CSI or any part of the Kubernetes Storage system, join the Kubernetes Storage Special Interest Group (SIG). We always welcome new contributors.

We also hold regular Data Protection Working Group meetings. New attendees are welcome to join our discussions.

18 Dec 2024 12:00am GMT

17 Dec 2024

feedKubernetes Blog

Enhancing Kubernetes API Server Efficiency with API Streaming

Managing Kubernetes clusters efficiently is critical, especially as they grow in size. A significant challenge with large clusters is the memory overhead caused by list requests.

In the existing implementation, the kube-apiserver processes list requests by assembling the entire response in-memory before transmitting any data to the client. But what if the response body is substantial, say hundreds of megabytes? Additionally, imagine a scenario where multiple list requests flood in simultaneously, perhaps after a brief network outage. While API Priority and Fairness has proven to reasonably protect kube-apiserver from CPU overload, its impact is visibly smaller for memory protection. This can be explained by the differing nature of resource consumption by a single API request - the CPU usage at any given time is capped by a constant, whereas memory, being uncompressible, can grow proportionally with the number of processed objects and is unbounded. This situation poses a genuine risk, potentially overwhelming and crashing any kube-apiserver within seconds due to out-of-memory (OOM) conditions. To better visualize the issue, let's consider the graph below.

Monitoring graph showing kube-apiserver memory usage

The graph shows the memory usage of a kube-apiserver during a synthetic test (see the synthetic test section for more details). The results clearly show that increasing the number of informers significantly boosts the server's memory consumption. Notably, at approximately 16:40, the server crashed when serving only 16 informers.

Why does kube-apiserver allocate so much memory for list requests?

Our investigation revealed that this substantial memory allocation occurs because, before sending the first byte to the client, the server must fetch the full set of objects, deserialize them, apply any conversions and filtering, and assemble the entire response in memory.

This sequence results in significant temporary memory consumption. The actual usage depends on many factors like the page size, applied filters (e.g. label selectors), query parameters, and sizes of individual objects.

Unfortunately, neither API Priority and Fairness nor Golang's garbage collection or Golang memory limits can prevent the system from exhausting memory under these conditions. The memory is allocated suddenly and rapidly, and just a few requests can quickly deplete the available memory, leading to resource exhaustion.

Depending on how the API server is run on the node, it might either be killed through OOM by the kernel when exceeding the configured memory limits during these uncontrolled spikes, or if limits are not configured it might have an even worse impact on the control plane node. And worse, after the first API server failure, the same requests will likely hit another control plane node in an HA setup, probably with the same impact. This is potentially a situation that is hard to diagnose and hard to recover from.

Streaming list requests

Today, we're excited to announce a major improvement. With the graduation of the watch list feature to beta in Kubernetes 1.32, client-go users can opt-in (after explicitly enabling WatchListClient feature gate) to streaming lists by switching from list to (a special kind of) watch requests.

Watch requests are served from the watch cache, an in-memory cache designed to improve scalability of read operations. By streaming each item individually instead of returning the entire collection, the new method maintains constant memory overhead. The API server is bound by the maximum allowed size of an object in etcd plus a few additional allocations. This approach drastically reduces the temporary memory usage compared to traditional list requests, ensuring a more efficient and stable system, especially in clusters with a large number of objects of a given type or large average object sizes where despite paging memory consumption used to be high.

Building on the insight gained from the synthetic test (see the synthetic test section), we developed an automated performance test to systematically evaluate the impact of the watch list feature. This test replicates the same scenario, generating a large number of Secrets with a large payload, and scaling the number of informers to simulate heavy list request patterns. The automated test is executed periodically to monitor memory usage of the server with the feature enabled and disabled.

The results showed significant improvements with the watch list feature enabled. With the feature turned on, the kube-apiserver's memory consumption stabilized at approximately 2 GB. By contrast, with the feature disabled, memory usage increased to approximately 20 GB, a 10x increase! These results confirm the effectiveness of the new streaming API, which reduces the temporary memory footprint.

Enabling API Streaming for your component

  1. Upgrade to Kubernetes 1.32.
  2. Make sure your cluster uses etcd in version 3.4.31+ or 3.5.13+.
  3. Change your client software to use watch lists. If your client code is written in Golang, you'll want to enable WatchListClient for client-go. For details on enabling that feature, read Introducing Feature Gates to Client-Go: Enhancing Flexibility and Control.
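
As an illustrative sketch, client-go feature gates can typically be toggled through environment variables, so a Golang component could opt in by setting one in its manifest; verify the exact mechanism for your client-go version in the post linked above:

apiVersion: v1
kind: Pod
metadata:
  name: my-controller # hypothetical component built with client-go v1.32+
spec:
  containers:
  - name: controller
    image: example.com/my-controller:1.0 # hypothetical image
    env:
    # client-go reads feature gates from KUBE_FEATURE_-prefixed environment
    # variables; this opts the component in to streaming watch lists.
    - name: KUBE_FEATURE_WatchListClient
      value: "true"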

What's next?

In Kubernetes 1.32, the feature is enabled in kube-controller-manager by default despite its beta state. This will eventually be expanded to other core components, like kube-scheduler or the kubelet, once the feature becomes generally available, if not earlier. Other 3rd-party components are encouraged to opt in to the feature during the beta phase, especially when they are at risk of accessing a large number of resources or kinds with potentially large object sizes.

For the time being, API Priority and Fairness assigns a reasonably small cost to list requests. This is necessary to allow enough parallelism for the average case where list requests are cheap enough. But it does not match the spiky exceptional situation of many and large objects. Once the majority of the Kubernetes ecosystem has switched to watch list, the list cost estimation can be changed to larger values without risking degraded performance in the average case, and with that increase the protection against this kind of request, which could still hit the API server in the future.

The synthetic test

In order to reproduce the issue, we conducted a manual test to understand the impact of list requests on kube-apiserver memory usage. In the test, we created 400 Secrets, each containing 1 MB of data, and used informers to retrieve all Secrets.

The results were alarming: only 16 informers were needed to cause the test server to run out of memory and crash, demonstrating how quickly memory consumption can grow under such conditions.

Special shout out to @deads2k for his help in shaping this feature.

17 Dec 2024 12:00am GMT