16 Sep 2025
Kubernetes Blog
Kubernetes v1.34: Moving Volume Group Snapshots to v1beta2
Volume group snapshots were introduced as an Alpha feature in the Kubernetes v1.27 release and moved to Beta in the Kubernetes v1.32 release. The recent release of Kubernetes v1.34 moved that support to a second beta. The support for volume group snapshots relies on a set of extension APIs for group snapshots. These APIs allow users to take crash-consistent snapshots of a set of volumes. Behind the scenes, Kubernetes uses a label selector to group multiple PersistentVolumeClaims for snapshotting. A key aim is to allow you to restore that set of snapshots to new volumes and recover your workload from a crash-consistent recovery point.
This new feature is only supported for CSI volume drivers.
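To illustrate how the label selector groups PVCs, here is a minimal sketch. It assumes the external-snapshotter group snapshot CRDs are installed; the class name and labels are hypothetical, and you should check the external-snapshotter documentation for the exact v1beta2 schema.

apiVersion: groupsnapshot.storage.k8s.io/v1beta2
kind: VolumeGroupSnapshot
metadata:
  name: my-group-snapshot
  namespace: demo
spec:
  volumeGroupSnapshotClassName: my-group-snapshot-class
  source:
    selector:
      matchLabels:
        app: my-app   # every PVC in this namespace carrying this label is snapshotted together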
What's new in Beta 2?
While testing the beta version, we encountered an issue where the restoreSize field is not set for individual VolumeSnapshotContents and VolumeSnapshots if the CSI driver does not implement the ListSnapshots RPC call. We evaluated various options and decided to address this by releasing a new beta version of the API.
Specifically, v1beta2 adds a VolumeSnapshotInfo struct, which contains information for an individual volume snapshot that is a member of a volume group snapshot. VolumeSnapshotInfoList, a list of VolumeSnapshotInfo, is added to VolumeGroupSnapshotContentStatus, replacing VolumeSnapshotHandlePairList. This list identifies the snapshots on the storage system and is populated by the csi-snapshotter sidecar based on the CSI CreateVolumeGroupSnapshotResponse returned by the CSI driver's CreateVolumeGroupSnapshot call.
The existing v1beta1 API objects will be converted to the new v1beta2 API objects by a conversion webhook.
What's next?
Depending on feedback and adoption, the Kubernetes project plans to push the volume group snapshot implementation to general availability (GA) in a future release.
How can I learn more?
- The design spec for the volume group snapshot feature.
- The code repository for volume group snapshot APIs and controller.
- CSI documentation on the group snapshot feature.
How do I get involved?
This project, like all of Kubernetes, is the result of hard work by many contributors from diverse backgrounds working together. On behalf of SIG Storage, I would like to offer a huge thank you to the contributors who stepped up these last few quarters to help the project reach beta:
- Ben Swartzlander (bswartz)
- Hemant Kumar (gnufied)
- Jan Šafránek (jsafrane)
- Madhu Rajanna (Madhu-1)
- Michelle Au (msau42)
- Niels de Vos (nixpanic)
- Leonardo Cecchi (leonardoce)
- Saad Ali (saad-ali)
- Xing Yang (xing-yang)
- Yati Padia (yati1998)
For those interested in getting involved with the design and development of CSI or any part of the Kubernetes Storage system, join the Kubernetes Storage Special Interest Group (SIG). We always welcome new contributors.
We also hold regular Data Protection Working Group meetings. New attendees are welcome to join our discussions.
16 Sep 2025 6:30pm GMT
15 Sep 2025
Kubernetes Blog
Kubernetes v1.34: Decoupled Taint Manager Is Now Stable
This enhancement separates the responsibility of managing node lifecycle and pod eviction into two distinct components. Previously, the node lifecycle controller handled both marking nodes as unhealthy with NoExecute taints and evicting pods from them. Now, a dedicated taint eviction controller manages the eviction process, while the node lifecycle controller focuses solely on applying taints. This separation not only improves code organization but also makes it easier to improve the taint eviction controller or to build custom implementations of taint-based eviction.
What's new?
The feature gate SeparateTaintEvictionController has been promoted to GA in this release. Users can optionally disable taint-based eviction by setting --controllers=-taint-eviction-controller in kube-controller-manager.
How can I learn more?
For more details, refer to the KEP and to the beta announcement article: Kubernetes 1.29: Decoupling taint manager from node lifecycle controller.
How to get involved?
We offer a huge thank you to all the contributors who helped with design, implementation, and review of this feature and helped move it from beta to stable:
- Ed Bartosh (@bart0sh)
- Yuan Chen (@yuanchen8911)
- Aldo Culquicondor (@alculquicondor)
- Baofa Fan (@carlory)
- Sergey Kanzhelev (@SergeyKanzhelev)
- Tim Bannister (@lmktfy)
- Maciej Skoczeń (@macsko)
- Maciej Szulik (@soltysh)
- Wojciech Tyczynski (@wojtek-t)
15 Sep 2025 6:30pm GMT
12 Sep 2025
Kubernetes Blog
Kubernetes v1.34: Autoconfiguration for Node Cgroup Driver Goes GA
Historically, configuring the correct cgroup driver has been a pain point for users running new Kubernetes clusters. On Linux systems, there are two different cgroup drivers: cgroupfs and systemd. In the past, both the kubelet and the CRI implementation (such as CRI-O or containerd) needed to be configured to use the same cgroup driver, or else the kubelet would misbehave without any explicit error message. This was a source of headaches for many cluster admins. Now, we've (almost) arrived at the end of that headache.
Automated cgroup driver detection
In v1.28.0, the SIG Node community introduced the feature gate KubeletCgroupDriverFromCRI, which instructs the kubelet to ask the CRI implementation which cgroup driver to use. You can read more here. After waiting several releases for each CRI implementation to publish major versions and for those versions to be packaged in major operating systems, this feature has gone GA as of Kubernetes 1.34.0.
In addition to setting the feature gate, a cluster admin needs to ensure their CRI implementation is new enough:
- containerd: Support was added in v2.0.0
- CRI-O: Support was added in v1.28.0
Announcement: Kubernetes is deprecating containerd v1.y support
While CRI-O releases versions that match Kubernetes versions, and thus CRI-O versions without this behavior are no longer supported, containerd maintains its own release cycle. containerd support for this feature is only in v2.0 and later, but Kubernetes 1.34 still supports containerd 1.7 and other LTS releases of containerd.
The Kubernetes SIG Node community has formally agreed upon a final support timeline for containerd v1.y. The last Kubernetes release to offer this support will be the last released version of v1.35, and support will be dropped in v1.36.0. To assist administrators in managing this transition, a new detection mechanism is available. You can monitor the kubelet_cri_losing_support metric to determine whether any nodes in your cluster are using a containerd version that will soon be outdated. The presence of this metric with a version label of 1.36.0 indicates that the node's containerd runtime is not new enough for the upcoming requirements. Consequently, an administrator will need to upgrade containerd to v2.0 or a later version before, or at the same time as, upgrading the kubelet to v1.36.0.
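If you scrape kubelet metrics with a Prometheus-compatible system, a query along these lines can surface the affected nodes. Treat it as a sketch: the exact label set on the result depends on your scrape configuration.

# Nodes whose containerd will be too old once the kubelet reaches v1.36.0
kubelet_cri_losing_support{version="1.36.0"}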
12 Sep 2025 6:30pm GMT
11 Sep 2025
Kubernetes Blog
Kubernetes v1.34: Mutable CSI Node Allocatable Graduates to Beta
The functionality for CSI drivers to update information about attachable volume count on the nodes, first introduced as Alpha in Kubernetes v1.33, has graduated to Beta in the Kubernetes v1.34 release! This marks a significant milestone in enhancing the accuracy of stateful pod scheduling by reducing failures due to outdated attachable volume capacity information.
Background
Traditionally, Kubernetes CSI drivers report a static maximum volume attachment limit when initializing. However, actual attachment capacities can change during a node's lifecycle for various reasons, such as:
- Manual or external operations attaching/detaching volumes outside of Kubernetes control.
- Dynamically attached network interfaces or specialized hardware (GPUs, NICs, etc.) consuming available slots.
- Multi-driver scenarios, where one CSI driver's operations affect available capacity reported by another.
Static reporting can cause Kubernetes to schedule pods onto nodes that appear to have capacity but don't, leading to pods stuck in a ContainerCreating state.
Dynamically adapting CSI volume limits
With this new feature, Kubernetes enables CSI drivers to dynamically adjust and report node attachment capacities at runtime. This ensures that the scheduler, as well as other components relying on this information, have the most accurate, up-to-date view of node capacity.
How it works
Kubernetes supports two mechanisms for updating the reported node volume limits:
- Periodic Updates: CSI drivers specify an interval to periodically refresh the node's allocatable capacity.
- Reactive Updates: An immediate update triggered when a volume attachment fails due to exhausted resources (a ResourceExhausted error).
Enabling the feature
To use this beta feature, the MutableCSINodeAllocatableCount feature gate must be enabled in these components (a sketch of how to enable it follows the list):
- kube-apiserver
- kubelet
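How you enable the gate depends on how your control plane is deployed; as a rough sketch, feature gates are typically set via a component flag or, for the kubelet, via its configuration file:

# kube-apiserver (command-line flag):
#   --feature-gates=MutableCSINodeAllocatableCount=true
#
# kubelet, via its KubeletConfiguration file:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  MutableCSINodeAllocatableCount: true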
Example CSI driver configuration
Below is an example of configuring a CSI driver to enable periodic updates every 60 seconds:
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: example.csi.k8s.io
spec:
  nodeAllocatableUpdatePeriodSeconds: 60
This configuration directs the kubelet to call the CSI driver's NodeGetInfo method every 60 seconds, updating the node's allocatable volume count. Kubernetes enforces a minimum update interval of 10 seconds to balance accuracy and resource usage.
Immediate updates on attachment failures
When a volume attachment operation fails with a ResourceExhausted error (gRPC code 8), Kubernetes immediately updates the allocatable count instead of waiting for the next periodic update. The kubelet then marks the affected pods as Failed, enabling their controllers to recreate them. This prevents pods from getting permanently stuck in the ContainerCreating state.
Getting started
To enable this feature in your Kubernetes v1.34 cluster:
- Enable the MutableCSINodeAllocatableCount feature gate on the kube-apiserver and kubelet components.
- Update your CSI driver configuration by setting nodeAllocatableUpdatePeriodSeconds.
- Monitor and observe improvements in scheduling accuracy and pod placement reliability.
Next steps
This feature is currently in beta and the Kubernetes community welcomes your feedback. Test it, share your experiences, and help guide its evolution to GA stability.
Join discussions in the Kubernetes Storage Special Interest Group (SIG-Storage) to shape the future of Kubernetes storage capabilities.
11 Sep 2025 6:30pm GMT
10 Sep 2025
Kubernetes Blog
Kubernetes v1.34: Use An Init Container To Define App Environment Variables
Kubernetes typically uses ConfigMaps and Secrets to set environment variables, which introduces additional API calls and complexity. For example, you need to separately manage the Pods of your workloads and their configurations, while ensuring orderly updates for both the configurations and the workload Pods.
Alternatively, you might be using a vendor-supplied container that requires environment variables (such as a license key or a one-time token), but you don't want to hard-code them or mount volumes just to get the job done.
If that's the situation you are in, you now have a new (alpha) way to achieve it. Provided you have the EnvFiles feature gate enabled across your cluster, you can tell the kubelet to load a container's environment variables from a volume (the volume must be part of the Pod that the container belongs to). This feature lets you load environment variables directly from a file in an emptyDir volume without actually mounting that file into the container. It's a simple yet elegant solution to some surprisingly common problems.
What's this all about?
At its core, this feature allows you to point your container at a file, one generated by an initContainer, and have Kubernetes parse that file to set your environment variables. The file lives in an emptyDir volume (a temporary storage space that lasts as long as the pod does). Your main container doesn't need to mount the volume; the kubelet reads the file and injects the variables when the container starts.
How It Works
Here's a simple example:
apiVersion: v1
kind: Pod
spec:
  initContainers:
  - name: generate-config
    image: busybox
    command: ['sh', '-c', 'echo "CONFIG_VAR=HELLO" > /config/config.env']
    volumeMounts:
    - name: config-volume
      mountPath: /config
  containers:
  - name: app-container
    image: gcr.io/distroless/static
    env:
    - name: CONFIG_VAR
      valueFrom:
        fileKeyRef:
          path: config.env
          volumeName: config-volume
          key: CONFIG_VAR
  volumes:
  - name: config-volume
    emptyDir: {}
Using this approach is a breeze. You define your environment variables in the pod spec using the fileKeyRef field, which tells Kubernetes where to find the file and which key to pull. The file itself follows standard .env syntax (think KEY=VALUE), and (for this alpha stage at least) you must ensure that it is written into an emptyDir volume; other volume types aren't supported for this feature. At least one init container must mount that emptyDir volume (to write the file), but the main container doesn't need to; it just gets the variables handed to it at startup.
A word on security
While this feature supports handling sensitive data such as keys or tokens, note that its implementation relies on emptyDir volumes mounted into the Pod. Operators with node filesystem access could therefore easily retrieve this sensitive data through pod directory paths.
If storing sensitive data like keys or tokens using this feature, ensure your cluster security policies effectively protect nodes against unauthorized access to prevent exposure of confidential information.
Summary
This feature eliminates a number of complex workarounds used today, simplifying application authoring and opening the door to more use cases. Kubernetes stays flexible and open to feedback. Tell us how you use this feature or what is missing.
10 Sep 2025 6:30pm GMT
09 Sep 2025
Kubernetes Blog
Kubernetes v1.34: Snapshottable API server cache
For years, the Kubernetes community has been on a mission to improve the stability and performance predictability of the API server. A major focus of this effort has been taming list requests, which have historically been a primary source of high memory usage and heavy load on the etcd datastore. With each release, we've chipped away at the problem, and today, we're thrilled to announce the final major piece of this puzzle.
The snapshottable API server cache feature has graduated to Beta in Kubernetes v1.34, culminating a multi-release effort to allow virtually all read requests to be served directly from the API server's cache.
Evolving the cache for performance and stability
The path to the current state involved several key enhancements over recent releases that paved the way for today's announcement.
Consistent reads from cache (Beta in v1.31)
While the API server has long used a cache for performance, a key milestone was guaranteeing consistent reads of the latest data from it. This v1.31 enhancement allowed the watch cache to be used for strongly-consistent read requests for the first time, a huge win as it enabled filtered collections (e.g. "a list of pods bound to this node") to be safely served from the cache instead of etcd, dramatically reducing its load for common workloads.
Taming large responses with streaming (Beta in v1.33)
Another key improvement was tackling the problem of memory spikes when transmitting large responses. The streaming encoder, introduced in v1.33, allowed the API server to send list items one by one, rather than buffering the entire multi-gigabyte response in memory. This made the memory cost of sending a response predictable and minimal, regardless of its size.
The missing piece
Despite these huge improvements, a critical gap remained. Any request for a historical LIST, most commonly used for paginating through large result sets, still had to bypass the cache and query etcd directly. This meant that the cost of retrieving the data was still unpredictable and could put significant memory pressure on the API server.
Kubernetes 1.34: snapshots complete the picture
The snapshottable API server cache solves this final piece of the puzzle. This feature enhances the watch cache, enabling it to generate efficient, point-in-time snapshots of its state.
Here's how it works: for each update, the cache creates a lightweight snapshot. These snapshots are "lazy copies," meaning they don't duplicate objects but simply store pointers, making them incredibly memory-efficient.
When a list request for a historical resourceVersion arrives, the API server now finds the corresponding snapshot and serves the response directly from its memory. This closes the final major gap, allowing paginated requests to be served entirely from the cache.
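To make this concrete: pagination in the Kubernetes API works by passing a limit and a continue token, and the continue token pins later pages to the resourceVersion of the first page, which is exactly the kind of historical read that snapshots now serve from memory. A rough illustration (the resource and page size are arbitrary):

# First page: the server returns up to 500 pods plus a continue token tied to a specific resourceVersion
kubectl get --raw '/api/v1/pods?limit=500'

# Later pages replay that token; with the snapshottable cache these reads no longer need to hit etcd
kubectl get --raw '/api/v1/pods?limit=500&continue=<token-from-previous-response>'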
A new era of API Server performance 🚀
With this final piece in place, the synergy of these three features ushers in a new era of API server predictability and performance:
- Get data from cache: Consistent reads and the snapshottable cache work together to ensure nearly all read requests, whether for the latest data or a historical snapshot, are served from the API server's memory.
- Send data via stream: Streaming list responses ensure that sending this data to the client has a minimal and constant memory footprint.
The result is a system where the resource cost of read operations is almost fully predictable and much more resilient to spikes in request load. This means dramatically reduced memory pressure, a lighter load on etcd, and a more stable, scalable, and reliable control plane for all Kubernetes clusters.
How to get started
With its graduation to Beta, the SnapshottableCache feature gate is enabled by default in Kubernetes v1.34. There are no actions required to start benefiting from these performance and stability improvements.
Acknowledgements
Special thanks for designing, implementing, and reviewing these critical features go to:
- Ahmad Zolfaghari (@ah8ad3)
- Ben Luddy (@benluddy) - Red Hat
- Chen Chen (@z1cheng) - Microsoft
- Davanum Srinivas (@dims) - Nvidia
- David Eads (@deads2k) - Red Hat
- Han Kang (@logicalhan) - CoreWeave
- haosdent (@haosdent) - Shopee
- Joe Betz (@jpbetz) - Google
- Jordan Liggitt (@liggitt) - Google
- Łukasz Szaszkiewicz (@p0lyn0mial) - Red Hat
- Maciej Borsz (@mborsz) - Google
- Madhav Jivrajani (@MadhavJivrajani) - UIUC
- Marek Siarkowicz (@serathius) - Google
- NKeert (@NKeert)
- Tim Bannister (@lmktfy)
- Wei Fu (@fuweid) - Microsoft
- Wojtek Tyczyński (@wojtek-t) - Google
...and many others in SIG API Machinery. This milestone is a testament to the community's dedication to building a more scalable and robust Kubernetes.
09 Sep 2025 6:30pm GMT
08 Sep 2025
Kubernetes Blog
Kubernetes v1.34: VolumeAttributesClass for Volume Modification GA
The VolumeAttributesClass API, which empowers users to dynamically modify volume attributes, has officially graduated to General Availability (GA) in Kubernetes v1.34. This marks a significant milestone, providing a robust and stable way to tune your persistent storage directly within Kubernetes.
What is VolumeAttributesClass?
At its core, VolumeAttributesClass is a cluster-scoped resource that defines a set of mutable parameters for a volume. Think of it as a "profile" for your storage, allowing cluster administrators to expose different quality-of-service (QoS) levels or performance tiers.
Users can then specify a volumeAttributesClassName in their PersistentVolumeClaim (PVC) to indicate which class of attributes they desire. The magic happens through the Container Storage Interface (CSI): when a PVC referencing a VolumeAttributesClass is updated, the associated CSI driver interacts with the underlying storage system to apply the specified changes to the volume.
This means you can now:
- Dynamically scale performance: Increase IOPS or throughput for a busy database, or reduce it for a less critical application.
- Optimize costs: Adjust attributes on the fly to match your current needs, avoiding over-provisioning.
- Simplify operations: Manage volume modifications directly within the Kubernetes API, rather than relying on external tools or manual processes.
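As a minimal sketch of what this looks like in practice (the driver name and parameter keys are hypothetical; the mutable attributes a class can carry are entirely driver-specific):

apiVersion: storage.k8s.io/v1
kind: VolumeAttributesClass
metadata:
  name: gold
driverName: example.csi.vendor.com   # hypothetical CSI driver
parameters:                          # driver-specific mutable attributes
  iops: "4000"
  throughput: "250"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 100Gi
  volumeAttributesClassName: gold    # point at a different class to modify the volume in place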
What is new from Beta to GA
There are two major enhancements from beta.
Cancel support for infeasible errors
To improve resilience and user experience, the GA release introduces explicit cancel support when a requested volume modification becomes infeasible. If the underlying storage system or CSI driver indicates that the requested changes cannot be applied (e.g., due to invalid arguments), users can cancel the operation and revert the volume to its previous stable configuration, preventing the volume from being left in an inconsistent state.
Quota support based on scope
While VolumeAttributesClass doesn't add a new quota type, the Kubernetes control plane can be configured to enforce quotas on PersistentVolumeClaims that reference a specific VolumeAttributesClass.
This is achieved by using the scopeSelector field in a ResourceQuota to target PVCs that have .spec.volumeAttributesClassName set to a particular VolumeAttributesClass name; a sketch follows, and the resource quota documentation has more details.
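A rough sketch of such a quota, assuming a class named gold and the quota scope name described in the resource quota documentation:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: gold-volume-quota
  namespace: team-a
spec:
  hard:
    persistentvolumeclaims: "10"   # at most 10 PVCs referencing the selected class
  scopeSelector:
    matchExpressions:
    - scopeName: VolumeAttributesClass
      operator: In
      values: ["gold"]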
Drivers support VolumeAttributesClass
- Amazon EBS CSI Driver: The AWS EBS CSI driver has robust support for VolumeAttributesClass and allows you to modify parameters like volume type (e.g., gp2 to gp3, io1 to io2), IOPS, and throughput of EBS volumes dynamically.
- Google Compute Engine (GCE) Persistent Disk CSI Driver (pd.csi.storage.gke.io): This driver also supports dynamic modification of persistent disk attributes, including IOPS and throughput, via VolumeAttributesClass.
Contact
For any inquiries or specific questions related to VolumeAttributesClass, please reach out to the SIG Storage community.
08 Sep 2025 6:30pm GMT
05 Sep 2025
Kubernetes Blog
Kubernetes v1.34: Pod Replacement Policy for Jobs Goes GA
In Kubernetes v1.34, the Pod replacement policy feature has reached general availability (GA). This blog post describes the Pod replacement policy feature and how to use it in your Jobs.
About Pod Replacement Policy
By default, the Job controller immediately recreates Pods as soon as they fail or begin terminating (when they have a deletion timestamp).
As a result, while some Pods are terminating, the total number of running Pods for a Job can temporarily exceed the specified parallelism. For Indexed Jobs, this can even mean multiple Pods running for the same index at the same time.
This behavior works fine for many workloads, but it can cause problems in certain cases.
For example, popular machine learning frameworks like TensorFlow and JAX expect exactly one Pod per worker index. If two Pods run at the same time, you might encounter errors such as:
/job:worker/task:4: Duplicate task registration with task_name=/job:worker/replica:0/task:4
Additionally, starting replacement Pods before the old ones fully terminate can lead to:
- Scheduling delays by kube-scheduler as the nodes remain occupied.
- Unnecessary cluster scale-ups to accommodate the replacement Pods.
- Temporary bypassing of quota checks by workload orchestrators like Kueue.
With Pod replacement policy, Kubernetes gives you control over when the control plane replaces terminating Pods, helping you avoid these issues.
How Pod Replacement Policy works
This enhancement means that Jobs in Kubernetes have an optional field .spec.podReplacementPolicy.
You can choose one of two policies:
- TerminatingOrFailed (default): Replaces Pods as soon as they start terminating.
- Failed: Replaces Pods only after they fully terminate and transition to the Failed phase.
Setting the policy to Failed ensures that a new Pod is only created after the previous one has completely terminated.
For Jobs with a Pod Failure Policy, the default podReplacementPolicy is Failed, and no other value is allowed. See Pod Failure Policy to learn more about Pod Failure Policies for Jobs.
You can check how many Pods are currently terminating by inspecting the Job's .status.terminating field:
kubectl get job myjob -o=jsonpath='{.status.terminating}'
Example
Here's a Job example that executes a task two times (spec.completions: 2) in parallel (spec.parallelism: 2) and replaces Pods only after they fully terminate (spec.podReplacementPolicy: Failed):
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job
spec:
  completions: 2
  parallelism: 2
  podReplacementPolicy: Failed
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: your-image
If a Pod receives a SIGTERM signal (deletion, eviction, preemption...), it begins terminating. When the container handles termination gracefully, cleanup may take some time.
When the Job starts, we will see two Pods running:
kubectl get pods
NAME READY STATUS RESTARTS AGE
example-job-qr8kf 1/1 Running 0 2s
example-job-stvb4 1/1 Running 0 2s
Let's delete one of the Pods (example-job-qr8kf).
With the TerminatingOrFailed policy, as soon as one Pod (example-job-qr8kf) starts terminating, the Job controller immediately creates a new Pod (example-job-b59zk) to replace it.
kubectl get pods
NAME READY STATUS RESTARTS AGE
example-job-b59zk 1/1 Running 0 1s
example-job-qr8kf 1/1 Terminating 0 17s
example-job-stvb4 1/1 Running 0 17s
With the Failed policy, the new Pod (example-job-b59zk) is not created while the old Pod (example-job-qr8kf) is terminating.
kubectl get pods
NAME READY STATUS RESTARTS AGE
example-job-qr8kf 1/1 Terminating 0 17s
example-job-stvb4 1/1 Running 0 17s
When the terminating Pod has fully transitioned to the Failed phase, a new Pod is created:
kubectl get pods
NAME READY STATUS RESTARTS AGE
example-job-b59zk 1/1 Running 0 1s
example-job-stvb4 1/1 Running 0 25s
How can you learn more?
- Read the user-facing documentation for Pod Replacement Policy, Backoff Limit per Index, and Pod Failure Policy.
- Read the KEPs for Pod Replacement Policy, Backoff Limit per Index, and Pod Failure Policy.
Acknowledgments
As with any Kubernetes feature, multiple people contributed to getting this done, from testing and filing bugs to reviewing code.
As this feature moves to stable after 2 years, we would like to thank the following people:
- Kevin Hannon - for writing the KEP and the initial implementation.
- Michał Woźniak - for guidance, mentorship, and reviews.
- Aldo Culquicondor - for guidance, mentorship, and reviews.
- Maciej Szulik - for guidance, mentorship, and reviews.
- Dejan Zele Pejchev - for taking over the feature and promoting it from Alpha through Beta to GA.
Get involved
This work was sponsored by the Kubernetes batch working group in close collaboration with the SIG Apps community.
If you are interested in working on new features in the space we recommend subscribing to our Slack channel and attending the regular community meetings.
05 Sep 2025 6:30pm GMT
04 Sep 2025
Kubernetes Blog
Kubernetes v1.34: PSI Metrics for Kubernetes Graduates to Beta
As Kubernetes clusters grow in size and complexity, understanding the health and performance of individual nodes becomes increasingly critical. We are excited to announce that as of Kubernetes v1.34, Pressure Stall Information (PSI) Metrics has graduated to Beta.
What is Pressure Stall Information (PSI)?
Pressure Stall Information (PSI) is a feature of the Linux kernel (version 4.20 and later) that provides a canonical way to quantify pressure on infrastructure resources, in terms of whether demand for a resource exceeds current supply. It moves beyond simple resource utilization metrics and instead measures the amount of time that tasks are stalled due to resource contention. This is a powerful way to identify and diagnose resource bottlenecks that can impact application performance.
PSI exposes metrics for CPU, memory, and I/O, categorized as either some or full pressure:
- some: The percentage of time that at least one task is stalled on a resource. This indicates some level of resource contention.
- full: The percentage of time that all non-idle tasks are stalled on a resource simultaneously. This indicates a more severe resource bottleneck.
PSI: 'Some' vs. 'Full' Pressure
These metrics are aggregated over 10-second, 1-minute, and 5-minute rolling windows, providing a comprehensive view of resource pressure over time.
PSI metrics in Kubernetes
With the KubeletPSI feature gate enabled, the kubelet can now collect PSI metrics from the Linux kernel and expose them through two channels: the Summary API and the /metrics/cadvisor Prometheus endpoint. This allows you to monitor and alert on resource pressure at the node, pod, and container level.
The following new metrics are available in Prometheus exposition format via /metrics/cadvisor:
- container_pressure_cpu_stalled_seconds_total
- container_pressure_cpu_waiting_seconds_total
- container_pressure_memory_stalled_seconds_total
- container_pressure_memory_waiting_seconds_total
- container_pressure_io_stalled_seconds_total
- container_pressure_io_waiting_seconds_total
These metrics, along with the data from the Summary API, provide a granular view of resource pressure, enabling you to pinpoint the source of performance issues and take corrective action; a sample query follows the list below. For example, you can use these metrics to:
- Identify memory leaks: A steadily increasing some pressure for memory can indicate a memory leak in an application.
- Optimize resource requests and limits: By understanding the resource pressure of your workloads, you can more accurately tune their resource requests and limits.
- Autoscale workloads: You can use PSI metrics to trigger autoscaling events, ensuring that your workloads have the resources they need to perform optimally.
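As a rough sketch of how to work with these counters: they accumulate stalled seconds, so rate() yields the fraction of time spent stalled. The queries below assume the waiting variants track some pressure and the stalled variants track full pressure, and use illustrative label values.

# Approximate fraction of the last 5 minutes that any task in the container was stalled on memory
rate(container_pressure_memory_waiting_seconds_total{namespace="default", pod="my-app"}[5m])

# A more severe signal: time during which all non-idle tasks were stalled on I/O
rate(container_pressure_io_stalled_seconds_total{namespace="default", pod="my-app"}[5m])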
How to enable PSI metrics
To enable PSI metrics in your Kubernetes cluster, you need to:
- Ensure your nodes are running a Linux kernel version 4.20 or later and are using cgroup v2.
- Enable the KubeletPSI feature gate on the kubelet (a configuration sketch follows this list).
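One way to do that, assuming you manage the kubelet through its configuration file rather than command-line flags:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  KubeletPSI: true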
Once enabled, you can start scraping the /metrics/cadvisor endpoint with your Prometheus-compatible monitoring solution or query the Summary API to collect and visualize the new PSI metrics. Note that PSI is a Linux-kernel feature, so these metrics are not available on Windows nodes. Your cluster can contain a mix of Linux and Windows nodes; on the Windows nodes the kubelet does not expose PSI metrics.
What's next?
We are excited to bring PSI metrics to the Kubernetes community and look forward to your feedback. As a beta feature, we are actively working on improving and extending this functionality towards a stable GA release. We encourage you to try it out and share your experiences with us.
To learn more about PSI metrics, check out the official Kubernetes documentation. You can also get involved in the conversation on the #sig-node Slack channel.
04 Sep 2025 6:30pm GMT
03 Sep 2025
Kubernetes Blog
Kubernetes v1.34: Service Account Token Integration for Image Pulls Graduates to Beta
The Kubernetes community continues to advance security best practices by reducing reliance on long-lived credentials. Following the successful alpha release in Kubernetes v1.33, Service Account Token Integration for Kubelet Credential Providers has now graduated to beta in Kubernetes v1.34, bringing us closer to eliminating long-lived image pull secrets from Kubernetes clusters.
This enhancement allows credential providers to use workload-specific service account tokens to obtain registry credentials, providing a secure, ephemeral alternative to traditional image pull secrets.
What's new in beta?
The beta graduation brings several important changes that make the feature more robust and production-ready:
Required cacheType field
Breaking change from alpha: The cacheType field is required in the credential provider configuration when using service account tokens. This field is new in beta and must be specified to ensure proper caching behavior.
# CAUTION: this is not a complete configuration example, just a reference for the 'tokenAttributes.cacheType' field.
tokenAttributes:
  serviceAccountTokenAudience: "my-registry-audience"
  cacheType: "ServiceAccount" # Required field in beta
  requireServiceAccount: true
Choose between two caching strategies:
- Token: Cache credentials per service account token (use when credential lifetime is tied to the token). This is useful when the credential provider transforms the service account token into registry credentials with the same lifetime as the token, or when registries support Kubernetes service account tokens directly. Note: the kubelet cannot send service account tokens directly to registries; credential provider plugins are needed to transform tokens into the username/password format expected by registries.
- ServiceAccount: Cache credentials per service account identity (use when credentials are valid for all pods using the same service account).
Isolated image pull credentials
The beta release provides stronger security isolation for container images when using service account tokens for image pulls. It ensures that pods can only access images that were pulled using ServiceAccounts they're authorized to use. This prevents unauthorized access to sensitive container images and enables granular access control where different workloads can have different registry permissions based on their ServiceAccount.
When credential providers use service account tokens, the system tracks ServiceAccount identity (namespace, name, and UID) for each pulled image. When a pod attempts to use a cached image, the system verifies that the pod's ServiceAccount matches exactly with the ServiceAccount that was used to originally pull the image.
Administrators can revoke access to previously pulled images by deleting and recreating the ServiceAccount, which changes the UID and invalidates cached image access.
For more details about this capability, see the image pull credential verification documentation.
How it works
Configuration
Credential providers opt into using ServiceAccount tokens by configuring the tokenAttributes field:
#
# CAUTION: this is an example configuration.
# Do not use this for your own cluster!
#
apiVersion: kubelet.config.k8s.io/v1
kind: CredentialProviderConfig
providers:
- name: my-credential-provider
  matchImages:
  - "*.myregistry.io/*"
  defaultCacheDuration: "10m"
  apiVersion: credentialprovider.kubelet.k8s.io/v1
  tokenAttributes:
    serviceAccountTokenAudience: "my-registry-audience"
    cacheType: "ServiceAccount" # New in beta
    requireServiceAccount: true
    requiredServiceAccountAnnotationKeys:
    - "myregistry.io/identity-id"
    optionalServiceAccountAnnotationKeys:
    - "myregistry.io/optional-annotation"
Image pull flow
At a high level, kubelet coordinates with your credential provider and the container runtime as follows:
- When the image is not present locally:
  - kubelet checks its credential cache using the configured cacheType (Token or ServiceAccount)
  - If needed, kubelet requests a ServiceAccount token for the pod's ServiceAccount and passes it, plus any required annotations, to the credential provider
  - The provider exchanges that token for registry credentials and returns them to kubelet
  - kubelet caches credentials per the cacheType strategy and pulls the image with those credentials
  - kubelet records the ServiceAccount coordinates (namespace, name, UID) associated with the pulled image for later authorization checks
- When the image is already present locally:
  - kubelet verifies the pod's ServiceAccount coordinates match the coordinates recorded for the cached image
  - If they match exactly, the cached image can be used without pulling from the registry
  - If they differ, kubelet performs a fresh pull using credentials for the new ServiceAccount
- With image pull credential verification enabled:
  - Authorization is enforced using the recorded ServiceAccount coordinates, ensuring pods only use images pulled by a ServiceAccount they are authorized to use
  - Administrators can revoke access by deleting and recreating a ServiceAccount; the UID changes and previously recorded authorization no longer matches
Audience restriction
The beta release builds on service account node audience restriction (beta since v1.33) to ensure kubelet
can only request tokens for authorized audiences. Administrators configure allowed audiences using RBAC to enable kubelet to request service account tokens for image pulls:
#
# CAUTION: this is an example configuration.
# Do not use this for your own cluster!
#
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kubelet-credential-provider-audiences
rules:
- verbs: ["request-serviceaccounts-token-audience"]
  apiGroups: [""]
  resources: ["my-registry-audience"]
  resourceNames: ["registry-access-sa"] # Optional: specific SA
Getting started with beta
Prerequisites
- Kubernetes v1.34 or later
- Feature gate enabled: KubeletServiceAccountTokenForCredentialProviders=true (beta, enabled by default)
- Credential provider support: Update your credential provider to handle ServiceAccount tokens
Migration from alpha
If you're already using the alpha version, the migration to beta requires minimal changes:
- Add the cacheType field: Update your credential provider configuration to include the required cacheType field
- Review caching strategy: Choose between the Token and ServiceAccount cache types based on your provider's behavior
- Test audience restrictions: Ensure your RBAC configuration, or other cluster authorization rules, will properly restrict token audiences
Example setup
Here's a complete example for setting up a credential provider with service account tokens (this example assumes your cluster uses RBAC authorization):
#
# CAUTION: this is an example configuration.
# Do not use this for your own cluster!
#
# Service Account with registry annotations
apiVersion: v1
kind: ServiceAccount
metadata:
  name: registry-access-sa
  namespace: default
  annotations:
    myregistry.io/identity-id: "user123"
---
# RBAC for audience restriction
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: registry-audience-access
rules:
- verbs: ["request-serviceaccounts-token-audience"]
  apiGroups: [""]
  resources: ["my-registry-audience"]
  resourceNames: ["registry-access-sa"] # Optional: specific ServiceAccount
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kubelet-registry-audience
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: registry-audience-access
subjects:
- kind: Group
  name: system:nodes
  apiGroup: rbac.authorization.k8s.io
---
# Pod using the ServiceAccount
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  serviceAccountName: registry-access-sa
  containers:
  - name: my-app
    image: myregistry.example/my-app:latest
What's next?
For Kubernetes v1.35, we - Kubernetes SIG Auth - expect the feature to stay in beta, and we will continue to solicit feedback.
You can learn more about this feature on the service account token for image pulls page in the Kubernetes documentation.
You can also follow along on the KEP-4412 to track progress across the coming Kubernetes releases.
Call to action
In this blog post, I have covered the beta graduation of ServiceAccount token integration for kubelet credential providers in Kubernetes v1.34. I discussed the key improvements, including the required cacheType field and enhanced integration with Ensure Secret Pull Images.
We have been receiving positive feedback from the community during the alpha phase and would love to hear more as we stabilize this feature for GA. In particular, we would like feedback from credential provider implementors as they integrate with the new beta API and caching mechanisms. Please reach out to us on the #sig-auth-authenticators-dev channel on Kubernetes Slack.
How to get involved
If you are interested in getting involved in the development of this feature, share feedback, or participate in any other ongoing SIG Auth projects, please reach out on the #sig-auth channel on Kubernetes Slack.
You are also welcome to join the bi-weekly SIG Auth meetings, held every other Wednesday.
03 Sep 2025 6:30pm GMT
02 Sep 2025
Kubernetes Blog
Kubernetes v1.34: Introducing CPU Manager Static Policy Option for Uncore Cache Alignment
A new CPU Manager Static Policy Option called prefer-align-cpus-by-uncorecache was introduced in Kubernetes v1.32 as an alpha feature, and has graduated to beta in Kubernetes v1.34. This CPU Manager Policy Option is designed to optimize performance for specific workloads running on processors with a split uncore cache architecture. In this article, I'll explain what that means and why it's useful.
Understanding the feature
What is uncore cache?
Until relatively recently, nearly all mainstream computer processors had a monolithic last-level cache that was shared across every core in the CPU package. This monolithic cache is also referred to as uncore cache (because it is not linked to a specific core), or as Level 3 cache. As well as the Level 3 cache, there is other cache, commonly called Level 1 and Level 2 cache, that is associated with a specific CPU core.
In order to reduce access latency between the CPU cores and their cache, recent AMD64 and ARM architecture based processors have introduced a split uncore cache architecture, where the last-level-cache is divided into multiple physical caches, that are aligned to specific CPU groupings within the physical package. The shorter distances within the CPU package help to reduce latency.
Kubernetes is able to place workloads in a way that accounts for the cache topology within the CPU package(s).
Cache-aware workload placement
The matrix below shows the CPU-to-CPU latency measured in nanoseconds (lower is better) when passing a packet between CPUs, via its cache coherence protocol on a processor that uses split uncore cache. In this example, the processor package consists of 2 uncore caches. Each uncore cache serves 8 CPU cores. Blue entries in the matrix represent latency between CPUs sharing the same uncore cache, while grey entries indicate latency between CPUs corresponding to different uncore caches. Latency between CPUs that correspond to different caches are higher than the latency between CPUs that belong to the same cache.
With prefer-align-cpus-by-uncorecache enabled, the static CPU Manager attempts to allocate CPU resources for a container such that all CPUs assigned to the container share the same uncore cache. This policy operates on a best-effort basis, aiming to minimize the distribution of a container's CPU resources across uncore caches, based on the container's requirements and the allocatable resources on the node.
By running a workload, where possible, on a set of CPUs that use the smallest feasible number of uncore caches, applications benefit from reduced cache latency (as seen in the matrix above) and from reduced contention with other workloads, which can result in higher overall throughput. The benefit only shows up if your nodes' processors use a split uncore cache topology.
The diagram below illustrates uncore cache alignment when the feature is enabled.
By default, Kubernetes does not account for uncore cache topology; containers are assigned CPU resources using a packed methodology. As a result, Container 1 and Container 2 can experience a noisy neighbor impact due to cache access contention on Uncore Cache 0. Additionally, Container 2 will have CPUs distributed across both caches which can introduce a cross-cache latency.
With prefer-align-cpus-by-uncorecache enabled, each container is isolated on an individual cache. This resolves the cache contention between the containers and minimizes the cache latency for the CPUs being utilized.
Use cases
Common use cases include telco applications like vRAN, Mobile Packet Core, and firewalls. It's important to note that the optimization provided by prefer-align-cpus-by-uncorecache can be dependent on the workload. For example, applications that are memory-bandwidth bound may not benefit from uncore cache alignment, as utilizing more uncore caches can increase memory bandwidth access.
Enabling the feature
To enable this feature, set the CPU Manager Policy to static and enable the CPU Manager Policy Options with prefer-align-cpus-by-uncorecache.
For Kubernetes 1.34, the feature is in the beta stage and requires the CPUManagerPolicyBetaOptions feature gate to also be enabled.
Append the following to the kubelet configuration file:
kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
featureGates:
  ...
  CPUManagerPolicyBetaOptions: true
cpuManagerPolicy: "static"
cpuManagerPolicyOptions:
  prefer-align-cpus-by-uncorecache: "true"
reservedSystemCPUs: "0"
...
If you're making this change to an existing node, remove the cpu_manager_state file and then restart the kubelet.
prefer-align-cpus-by-uncorecache can also be enabled on nodes with a monolithic uncore cache processor. In that case, the feature mimics a best-effort socket alignment effect and packs CPU resources onto the socket, similar to the default static CPU Manager policy.
Further reading
See Node Resource Managers to learn more about the CPU Manager and the available policies.
Reference the documentation for prefer-align-cpus-by-uncorecache here.
Please see the Kubernetes Enhancement Proposal for more information on how prefer-align-cpus-by-uncorecache is implemented.
Getting involved
This feature is driven by SIG Node. If you are interested in helping develop this feature, sharing feedback, or participating in any other ongoing SIG Node projects, please attend the SIG Node meeting for more details.
02 Sep 2025 6:30pm GMT
01 Sep 2025
Kubernetes Blog
Kubernetes v1.34: DRA has graduated to GA
Kubernetes 1.34 is here, and it has brought a huge wave of enhancements for Dynamic Resource Allocation (DRA)! This release marks a major milestone, with many APIs in the resource.k8s.io group graduating to General Availability (GA), unlocking the full potential of how you manage devices on Kubernetes. On top of that, several key features have moved to beta, and a fresh batch of new alpha features promises even more expressiveness and flexibility.
Let's dive into what's new for DRA in Kubernetes 1.34!
The core of DRA is now GA
The headline feature of the v1.34 release is that the core of DRA has graduated to General Availability.
Kubernetes Dynamic Resource Allocation (DRA) provides a flexible framework for managing specialized hardware and infrastructure resources, such as GPUs or FPGAs. DRA provides APIs that enable each workload to specify the properties of the devices it needs, while leaving it to the scheduler to allocate actual devices, allowing increased reliability and improved utilization of expensive hardware.
With the graduation to GA, DRA is stable and will be part of Kubernetes for the long run. The community can still expect a steady stream of new features being added to DRA over the next several Kubernetes releases, but without breaking changes. So users and developers of DRA drivers can start adopting DRA with confidence.
Starting with Kubernetes 1.34, DRA is enabled by default; the DRA features that have reached beta are also enabled by default. That's because the default API version for DRA is now the stable v1 version, rather than the earlier versions (e.g. v1beta1 or v1beta2) that needed explicit opt-in.
Features promoted to beta
Several powerful features have been promoted to beta, adding more control, flexibility, and observability to resource management with DRA.
Admin access labelling has been updated. In v1.34, you can restrict device support to people (or software) authorized to use it. This is meant as a way to avoid privilege escalation if a DRA driver grants additional privileges when admin access is requested, and to avoid accessing devices which are in use by normal applications, potentially in another namespace. The restriction works by ensuring that only users with access to a namespace carrying the resource.k8s.io/admin-access: "true" label are authorized to create ResourceClaim or ResourceClaimTemplate objects with the adminAccess field set to true. This ensures that non-admin users cannot misuse the feature.
Prioritized list lets users specify a list of acceptable devices for their workloads, rather than just a single type of device. So while the workload might run best on a single high-performance GPU, it might also be able to run on 2 mid-level GPUs. The scheduler will attempt to satisfy the alternatives in the list in order, so the workload will be allocated the best set of devices available on the node.
The kubelet's API has been updated to report on Pod resources allocated through DRA. This allows node monitoring agents to know the allocated DRA resources for Pods on a node and makes it possible to use the DRA information in the PodResources API to develop new features and integrations.
New alpha features
Kubernetes 1.34 also introduces several new alpha features that give us a glimpse into the future of resource management with DRA.
Extended resource mapping support in DRA allows cluster administrators to advertise DRA-managed resources as extended resources, allowing developers to consume them using the familiar, simpler request syntax while still benefiting from dynamic allocation. This makes it possible for existing workloads to start using DRA without modifications, simplifying the transition to DRA for both application developers and cluster administrators.
Consumable capacity introduces a flexible device sharing model where multiple, independent resource claims from unrelated pods can each be allocated a share of the same underlying physical device. This new capability is managed through optional, administrator-defined sharing policies that govern how a device's total capacity is divided and enforced by the platform for each request. This allows for sharing of devices in scenarios where pre-defined partitions are not viable. A blog about this feature is coming soon.
Binding conditions improve scheduling reliability for certain classes of devices by allowing the Kubernetes scheduler to delay binding a pod to a node until its required external resources, such as attachable devices or FPGAs, are confirmed to be fully prepared. This prevents premature pod assignments that could lead to failures and ensures more robust, predictable scheduling by explicitly modeling resource readiness before the pod is committed to a node.
Resource health status for DRA improves observability by exposing the health status of devices allocated to a Pod via Pod Status. This works whether the device is allocated through DRA or Device Plugin. This makes it easier to understand the cause of an unhealthy device and respond properly. A blog about this feature is coming soon.
What's next?
While DRA got promoted to GA this cycle, the hard work on DRA doesn't stop. There are several features in alpha and beta that we plan to bring to GA in the next couple of releases and we are looking to continue to improve performance, scalability and reliability of DRA. So expect an equally ambitious set of features in DRA for the 1.35 release.
Getting involved
A good starting point is joining the WG Device Management Slack channel and meetings, which happen at US/EU and EU/APAC friendly time slots.
Not all enhancement ideas are tracked as issues yet, so come talk to us if you want to help or have some ideas yourself! We have work to do at all levels, from difficult core changes to usability enhancements in kubectl, which could be picked up by newcomers.
Acknowledgments
A huge thanks to the new contributors to DRA this cycle:
- Alay Patel (alaypatel07)
- Gaurav Kumar Ghildiyal (gauravkghildiyal)
- JP (Jpsassine)
- Kobayashi Daisuke (KobayashiD27)
- Laura Lorenz (lauralorenz)
- Sunyanan Choochotkaew (sunya-ch)
- Swati Gupta (guptaNswati)
- Yu Liao (yliaog)
01 Sep 2025 6:30pm GMT
29 Aug 2025
Kubernetes Blog
Kubernetes v1.34: Finer-Grained Control Over Container Restarts
With the release of Kubernetes 1.34, a new alpha feature is introduced that gives you more granular control over container restarts within a Pod. This feature, named Container Restart Policy and Rules, allows you to specify a restart policy for each container individually, overriding the Pod's global restart policy. In addition, it also allows you to conditionally restart individual containers based on their exit codes. This feature is available behind the alpha feature gate ContainerRestartRules.
This has been a long-requested feature. Let's dive into how it works and how you can use it.
The problem with a single restart policy
Before this feature, the restartPolicy was set at the Pod level. This meant that all containers in a Pod shared the same restart policy (Always, OnFailure, or Never). While this works for many use cases, it can be limiting in others.
For example, consider a Pod with a main application container and an init container that performs some initial setup. You might want the main container to always restart on failure, but the init container should only run once and never restart. With a single Pod-level restart policy, this wasn't possible.
Introducing per-container restart policies
With the new ContainerRestartRules feature gate, you can now specify a restartPolicy for each container in your Pod's spec. You can also define restartPolicyRules to control restarts based on exit codes. This gives you the fine-grained control you need to handle complex scenarios.
Use cases
Let's look at some real-life use cases where per-container restart policies can be beneficial.
In-place restarts for training jobs
In ML research, it's common to orchestrate a large number of long-running AI/ML training workloads. In these scenarios, workload failures are unavoidable. When a workload fails with a retriable exit code, you want the container to restart quickly without rescheduling the entire Pod, which consumes a significant amount of time and resources. Restarting the failed container "in-place" is critical for better utilization of compute resources. The container should only restart "in-place" if it failed due to a retriable error; otherwise, the container and Pod should terminate and possibly be rescheduled.
This can now be achieved with container-level restartPolicyRules. The workload can exit with different codes to represent retriable and non-retriable errors. With restartPolicyRules, the workload can be restarted in-place quickly, but only when the error is retriable.
Try-once init containers
Init containers are often used to perform initialization work for the main container, such as setting up environments and credentials. Sometimes, you want the main container to always be restarted, but you don't want to retry initialization if it fails.
With a container-level restartPolicy, this is now possible. The init container can be executed only once, and its failure is considered a Pod failure. If the initialization succeeds, the main container can always be restarted.
Pods with multiple containers
For Pods that run multiple containers, you might have different restart requirements for each container. Some containers might have a clear definition of success and should only be restarted on failure. Others might need to be always restarted.
This is now possible with a container-level restartPolicy, allowing individual containers to have different restart policies.
How to use it
To use this new feature, you need to enable the ContainerRestartRules feature gate on your Kubernetes cluster control plane and worker nodes running Kubernetes 1.34+. Once enabled, you can specify the restartPolicy and restartPolicyRules fields in your container definitions.
Here are some examples:
Example 1: Restarting on specific exit codes
In this example, the container should restart if and only if it fails with a retriable error, represented by exit code 42.
To achieve this, the container has restartPolicy: Never, and a restart policy rule that tells Kubernetes to restart the container in-place if it exits with code 42.
apiVersion: v1
kind: Pod
metadata:
  name: restart-on-exit-codes
  annotations:
    kubernetes.io/description: "This Pod restarts the container only when it exits with code 42."
spec:
  restartPolicy: Never
  containers:
  - name: restart-on-exit-codes
    image: docker.io/library/busybox:1.28
    command: ['sh', '-c', 'sleep 60 && exit 0']
    restartPolicy: Never # Container restart policy must be specified if rules are specified
    restartPolicyRules: # Only restart the container if it exits with code 42
    - action: Restart
      exitCodes:
        operator: In
        values: [42]
Example 2: A try-once init container
In this example, a Pod should always be restarted once the initialization succeeds. However, the initialization should only be tried once.
To achieve this, the Pod has an Always
restart policy. The init-once
init container will only try once. If it fails, the Pod will fail. This lets the Pod fail when initialization fails, while keeping the main container running once initialization succeeds.
apiVersion: v1
kind: Pod
metadata:
  name: fail-pod-if-init-fails
  annotations:
    kubernetes.io/description: "This Pod has an init container that runs only once. After initialization succeeds, the main container will always be restarted."
spec:
  restartPolicy: Always
  initContainers:
  - name: init-once # This init container will only try once. If it fails, the Pod will fail.
    image: docker.io/library/busybox:1.28
    command: ['sh', '-c', 'echo "Failing initialization" && sleep 10 && exit 1']
    restartPolicy: Never
  containers:
  - name: main-container # This container will always be restarted once initialization succeeds.
    image: docker.io/library/busybox:1.28
    command: ['sh', '-c', 'sleep 1800 && exit 0']
Example 3: Containers with different restart policies
In this example, there are two containers with different restart requirements. One should always be restarted, while the other should only be restarted on failure.
This is achieved by using a different container-level restartPolicy
on each of the two containers.
apiVersion: v1
kind: Pod
metadata:
  name: on-failure-pod
  annotations:
    kubernetes.io/description: "This Pod has two containers with different restart policies."
spec:
  containers:
  - name: restart-on-failure
    image: docker.io/library/busybox:1.28
    command: ['sh', '-c', 'echo "Not restarting after success" && sleep 10 && exit 0']
    restartPolicy: OnFailure
  - name: restart-always
    image: docker.io/library/busybox:1.28
    command: ['sh', '-c', 'echo "Always restarting" && sleep 1800 && exit 0']
    restartPolicy: Always
Learn more
- Read the documentation for container restart policy.
- Read the KEP for the Container Restart Rules
Roadmap
More actions and signals to restart Pods and containers are coming! Notably, there are plans to add support for restarting the entire Pod. Planning and discussions on these features are in progress. Feel free to share feedback or requests with the SIG Node community!
Your feedback is welcome!
This is an alpha feature, and the Kubernetes project would love to hear your feedback. Please try it out. This feature is driven by the SIG Node. If you are interested in helping develop this feature, sharing feedback, or participating in any other ongoing SIG Node projects, please reach out to the SIG Node community!
You can reach SIG Node by several means:
29 Aug 2025 6:30pm GMT
28 Aug 2025
Kubernetes Blog
Kubernetes v1.34: User preferences (kuberc) are available for testing in kubectl 1.34
Have you ever wished you could enable interactive delete, by default, in kubectl
? Or maybe, you'd like to have custom aliases defined, but not necessarily generate hundreds of them manually? Look no further. SIG-CLI has been working hard to add user preferences to kubectl, and we are happy to announce that this functionality is reaching beta as part of the Kubernetes v1.34 release.
How it works
A full description of this functionality is available in our official documentation, but this blog post will answer both of the questions from the beginning of this article.
Before we dive into details, let's quickly cover what the user preferences file looks like and where to place it. By default, kubectl
will look for a kuberc file in your default kubeconfig directory, which is $HOME/.kube. Alternatively, you can specify this location using the --kuberc option or the KUBERC environment variable.
Just like every Kubernetes manifest, a kuberc
file will start with an apiVersion
and kind
:
apiVersion: kubectl.config.k8s.io/v1beta1
kind: Preference
# the user preferences will follow here
Defaults
Let's start by setting default values for kubectl
command options. Our goal is to always use interactive delete, which means we want the --interactive
option for kubectl delete
to always be set to true
. This can be achieved with the following addition to our kuberc
file:
defaults:
- command: delete
  options:
  - name: interactive
    default: "true"
In the above example, I'm introducing the defaults
section, which allows users to define default values for kubectl
options. In this case, we're setting the interactive option for kubectl delete
to be true
by default. This default can be overridden if a user explicitly provides a different value such as kubectl delete --interactive=false
, in which case the explicit option takes precedence.
Another highly encouraged default from SIG-CLI is using Server-Side Apply. To do so, you can add the following snippet to your preferences:
# continuing defaults section
- command: apply
  options:
  - name: server-side
    default: "true"
Aliases
The ability to define aliases allows us to save precious seconds when typing commands. I bet that you most likely have one defined for kubectl
, because typing seven letters is definitely longer than just pressing k
.
For this reason, the ability to define aliases was a must-have when we decided to implement user preferences, alongside defaulting. To define an alias for any of the built-in commands, expand your kuberc
file with the following addition:
aliases:
- name: gns
  command: get
  prependArgs:
  - namespace
  options:
  - name: output
    default: json
There's a lot going on above, so let me break this down. First, we're introducing a new section: aliases
. Here, we're defining a new alias gns
, which is mapped to the get
command. Next, we're defining arguments (namespace
resource) that will be inserted right after the command name. Additionally, we're setting --output=json
option for this alias. The structure of options
block is identical to the one in the defaults
section.
You probably noticed that we've introduced a mechanism for prepending arguments, and you might wonder if there is a complementary setting for appending them (in other words, adding to the end of the command, after user-provided arguments). This can be achieved through appendArgs
block, which is presented below:
# continuing aliases section
- name: runx
  command: run
  options:
  - name: image
    default: busybox
  - name: namespace
    default: test-ns
  appendArgs:
  - --
  - custom-arg
Here, we're introducing another alias: runx
, which invokes kubectl run
command, passing --image
and --namespace
options with predefined values, and also appending --
and custom-arg
at the end of the invocation.
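Putting the pieces together, a complete kuberc file built from the snippets above could look like the following (the gns and runx aliases and the interactive-delete and Server-Side Apply defaults are just the illustrative values used earlier):
apiVersion: kubectl.config.k8s.io/v1beta1
kind: Preference
defaults:
- command: delete
  options:
  - name: interactive
    default: "true"
- command: apply
  options:
  - name: server-side
    default: "true"
aliases:
- name: gns
  command: get
  prependArgs:
  - namespace
  options:
  - name: output
    default: json
- name: runx
  command: run
  options:
  - name: image
    default: busybox
  - name: namespace
    default: test-ns
  appendArgs:
  - --
  - custom-arg
With this file in place, running kubectl runx my-pod should behave roughly like kubectl run my-pod --image=busybox --namespace=test-ns -- custom-arg.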
Debugging
We hope that kubectl
user preferences will open up new possibilities for our users. Whenever you're in doubt, feel free to run kubectl
with increased verbosity. At -v=5
, you should get all the possible debugging information from this feature, which will be crucial when reporting issues.
To learn more, I encourage you to read through our official documentation and the actual proposal.
Get involved
The kubectl user preferences feature has reached beta, and we are very interested in your feedback. We'd love to hear what you like about it and what problems you'd like to see it solve. Feel free to join the SIG-CLI Slack channel, or open an issue against the kubectl repository. You can also join us at our community meetings, which happen every other Wednesday, and share your stories with us.
28 Aug 2025 6:30pm GMT
27 Aug 2025
Kubernetes Blog
Kubernetes v1.34: Of Wind & Will (O' WaW)
Editors: Agustina Barbetta, Alejandro Josue Leon Bellido, Graziano Casto, Melony Qin, Dipesh Rawat
Similar to previous releases, the release of Kubernetes v1.34 introduces new stable, beta, and alpha features. The consistent delivery of high-quality releases underscores the strength of our development cycle and the vibrant support from our community.
This release consists of 58 enhancements. Of those enhancements, 23 have graduated to Stable, 22 have entered Beta, and 13 have entered Alpha.
There are also some deprecations and removals in this release; make sure to read about those.
Release theme and logo

A release powered by the wind around us - and the will within us.
Every release cycle, we inherit winds that we don't really control - the state of our tooling, documentation, and the historical quirks of our project. Sometimes these winds fill our sails, sometimes they push us sideways or die down.
What keeps Kubernetes moving isn't the perfect winds, but the will of our sailors who adjust the sails, man the helm, chart the courses and keep the ship steady. The release happens not because conditions are always ideal, but because of the people who build it, the people who release it, and the bears ^, cats, dogs, wizards, and curious minds who keep Kubernetes sailing strong - no matter which way the wind blows.
This release, Of Wind & Will (O' WaW), honors the winds that have shaped us, and the will that propels us forward.
^ Oh, and you wonder why bears? Keep wondering!
Spotlight on key updates
Kubernetes v1.34 is packed with new features and improvements. Here are a few select updates the Release Team would like to highlight!
Stable: The core of DRA is GA
Dynamic Resource Allocation (DRA) enables more powerful ways to select, allocate, share, and configure GPUs, TPUs, NICs and other devices.
Since the v1.30 release, DRA has been based around claiming devices using structured parameters that are opaque to the core of Kubernetes. This enhancement took inspiration from dynamic provisioning for storage volumes. DRA with structured parameters relies on a set of supporting API kinds (ResourceClaim, DeviceClass, ResourceClaimTemplate, and ResourceSlice) under the resource.k8s.io API group, while extending the Pod .spec with a new resourceClaims field.
The resource.k8s.io/v1
APIs have graduated to stable and are now available by default.
This work was done as part of KEP #4381 led by WG Device Management.
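As a rough sketch of what consuming a device through DRA looks like, the Pod below references a ResourceClaimTemplate; the single-gpu template name and the image are hypothetical and assumed to be created separately, typically by a cluster administrator alongside a DeviceClass:
apiVersion: v1
kind: Pod
metadata:
  name: dra-example
spec:
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu   # hypothetical template created separately
  containers:
  - name: trainer
    image: registry.example/trainer:latest  # hypothetical image
    resources:
      claims:
      - name: gpu                           # refers to the entry in spec.resourceClaims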
Beta: Projected ServiceAccount tokens for kubelet
image credential providers
The kubelet
credential providers, used for pulling private container images, traditionally relied on long-lived Secrets stored on the node or in the cluster. This approach increased security risks and management overhead, as these credentials were not tied to the specific workload and did not rotate automatically.
To solve this, the kubelet
can now request short-lived, audience-bound ServiceAccount tokens for authenticating to container registries. This allows image pulls to be authorized based on the Pod's own identity rather than a node-level credential.
The primary benefit is a significant security improvement. It eliminates the need for long-lived Secrets for image pulls, reducing the attack surface and simplifying credential management for both administrators and developers.
This work was done as part of KEP #4412 led by SIG Auth and SIG Node.
Alpha: Support for KYAML, a Kubernetes dialect of YAML
KYAML aims to be a safer and less ambiguous YAML subset, and was designed specifically for Kubernetes. Whatever version of Kubernetes you use, starting from Kubernetes v1.34 you are able to use KYAML as a new output format for kubectl.
KYAML addresses specific challenges with both YAML and JSON. YAML's significant whitespace requires careful attention to indentation and nesting, while its optional string-quoting can lead to unexpected type coercion (for example: "The Norway Bug"). Meanwhile, JSON lacks comment support and has strict requirements for trailing commas and quoted keys.
You can write KYAML and pass it as an input to any version of kubectl
, because all KYAML files are also valid as YAML. With kubectl
v1.34, you are also able to request KYAML output (as in kubectl get -o kyaml …) by setting environment variable KUBECTL_KYAML=true
. If you prefer, you can still request the output in JSON or YAML format.
This work was done as part of KEP #5295 led by SIG CLI.
Features graduating to Stable
This is a selection of some of the improvements that are now stable following the v1.34 release.
Delayed creation of Job's replacement Pods
By default, Job controllers create replacement Pods immediately when a Pod starts terminating, causing both Pods to run simultaneously. This can cause resource contention in constrained clusters, where the replacement Pod may struggle to find available nodes until the original Pod fully terminates. The situation can also trigger unwanted cluster autoscaler scale-ups. Additionally, some machine learning frameworks like TensorFlow and JAX require only one Pod per index to run at a time, making simultaneous Pod execution problematic. This feature introduces .spec.podReplacementPolicy
in Jobs. You may choose to create replacement Pods only when the Pod is fully terminated (has .status.phase: Failed
). To do this, set .spec.podReplacementPolicy: Failed
.
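For example, a minimal Job that opts into delayed replacement might look like this (the image and command are placeholders):
apiVersion: batch/v1
kind: Job
metadata:
  name: delayed-replacement
spec:
  podReplacementPolicy: Failed   # create replacement Pods only after the old Pod is fully terminated
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: docker.io/library/busybox:1.28
        command: ['sh', '-c', 'sleep 60']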
Introduced as alpha in v1.28, this feature has graduated to stable in v1.34.
This work was done as part of KEP #3939 led by SIG Apps.
Recovery from volume expansion failure
This feature allows users to cancel volume expansions that are unsupported by the underlying storage provider, and retry volume expansion with smaller values that may succeed.
Introduced as alpha in v1.23, this feature has graduated to stable in v1.34.
This work was done as part of KEP #1790 led by SIG Storage.
VolumeAttributesClass for volume modification
VolumeAttributesClass has graduated to stable in v1.34. VolumeAttributesClass is a generic, Kubernetes-native API for modifying volume parameters like provisioned IO. It allows workloads to vertically scale their volumes on-line to balance cost and performance, if supported by their provider.
Like all new volume features in Kubernetes, this API is implemented via the container storage interface (CSI). Your provisioner-specific CSI driver must support the new ModifyVolume API which is the CSI side of this feature.
This work was done as part of KEP #3751 led by SIG Storage.
Structured authentication configuration
Kubernetes v1.29 introduced a configuration file format to manage API server client authentication, moving away from the previous reliance on a large set of command-line options. The AuthenticationConfiguration kind allows administrators to support multiple JWT authenticators, CEL expression validation, and dynamic reloading. This change significantly improves the manageability and auditability of the cluster's authentication settings - and has graduated to stable in v1.34.
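A minimal sketch of such a configuration file, assuming a hypothetical OIDC issuer at https://auth.example.com and passed to the API server via the --authentication-config flag, might look like this (the apiVersion may differ depending on your cluster version):
apiVersion: apiserver.config.k8s.io/v1
kind: AuthenticationConfiguration
jwt:
- issuer:
    url: https://auth.example.com   # hypothetical OIDC issuer
    audiences:
    - kubernetes
  claimMappings:
    username:
      claim: sub
      prefix: "oidc:"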
This work was done as part of KEP #3331 led by SIG Auth.
Finer-grained authorization based on selectors
Kubernetes authorizers, including webhook authorizers and the built-in node authorizer, can now make authorization decisions based on field and label selectors in incoming requests. When you send list, watch or deletecollection requests with selectors, the authorization layer can now evaluate access with that additional context.
For example, you can write an authorization policy that only allows listing Pods bound to a specific .spec.nodeName
. The client (perhaps the kubelet on a particular node) must specify the field selector that the policy requires, otherwise the request is forbidden. This change makes it feasible to set up least privilege rules, provided that the client knows how to conform to the restrictions you set. Kubernetes v1.34 now supports more granular control in environments like per-node isolation or custom multi-tenant setups.
This work was done as part of KEP #4601 led by SIG Auth.
Restrict anonymous requests with fine-grained controls
Instead of fully enabling or disabling anonymous access, you can now configure a strict list of endpoints where unauthenticated requests are allowed. This provides a safer alternative for clusters that rely on anonymous access to health or bootstrap endpoints like /healthz
, /readyz
, or /livez
.
With this feature, accidental RBAC misconfigurations that grant broad access to anonymous users can be avoided without requiring changes to external probes or bootstrapping tools.
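Sketched out, the relevant part of an AuthenticationConfiguration that keeps anonymous access only for the health endpoints might look like this:
apiVersion: apiserver.config.k8s.io/v1
kind: AuthenticationConfiguration
anonymous:
  enabled: true
  conditions:        # anonymous requests are allowed only for these paths
  - path: /healthz
  - path: /readyz
  - path: /livez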
This work was done as part of KEP #4633 led by SIG Auth.
More efficient requeueing through plugin-specific callbacks
The kube-scheduler
can now make more accurate decisions about when to retry scheduling Pods that were previously unschedulable. Each scheduling plugin can now register callback functions that tell the scheduler whether an incoming cluster event is likely to make a rejected Pod schedulable again.
This reduces unnecessary retries and improves overall scheduling throughput - especially in clusters using dynamic resource allocation. The feature also lets certain plugins skip the usual backoff delay when it is safe to do so, making scheduling faster in specific cases.
This work was done as part of KEP #4247 led by SIG Scheduling.
Ordered Namespace deletion
Semi-random resource deletion order can create security gaps or unintended behavior, such as Pods persisting after their associated NetworkPolicies are deleted.
This improvement introduces a more structured deletion process for Kubernetes namespaces to ensure secure and deterministic resource removal. By enforcing a structured deletion sequence that respects logical and security dependencies, this approach ensures Pods are removed before other resources.
This feature was introduced in Kubernetes v1.33 and graduated to stable in v1.34. The graduation improves security and reliability by mitigating risks from non-deterministic deletions, including the vulnerability described in CVE-2024-7598.
This work was done as part of KEP #5080 led by SIG API Machinery.
Streaming list responses
Handling large list responses in Kubernetes previously posed a significant scalability challenge. When clients requested extensive resource lists, such as thousands of Pods or Custom Resources, the API server was required to serialize the entire collection of objects into a single, large memory buffer before sending it. This process created substantial memory pressure and could lead to performance degradation, impacting the overall stability of the cluster.
To address this limitation, a streaming encoding mechanism for collections (list responses) has been introduced. For the JSON and Kubernetes Protobuf response formats, that streaming mechanism is automatically active and the associated feature gate is stable. The primary benefit of this approach is the avoidance of large memory allocations on the API server, resulting in a much smaller and more predictable memory footprint. Consequently, the cluster becomes more resilient and performant, especially in large-scale environments where frequent requests for extensive resource lists are common.
This work was done as part of KEP #5116 led by SIG API Machinery.
Resilient watch cache initialization
Watch cache is a caching layer inside kube-apiserver
that maintains an eventually consistent cache of cluster state stored in etcd. In the past, issues could occur when the watch cache was not yet initialized during kube-apiserver
startup or when it required re-initialization.
To address these issues, the watch cache initialization process has been made more resilient to failures, improving control plane robustness and ensuring controllers and clients can reliably establish watches. This improvement was introduced as beta in v1.31 and is now stable.
This work was done as part of KEP #4568 led by SIG API Machinery and SIG Scalability.
Relaxing DNS search path validation
Previously, the strict validation of a Pod's DNS search
path in Kubernetes often created integration challenges in complex or legacy network environments. This restrictiveness could block configurations that were necessary for an organization's infrastructure, forcing administrators to implement difficult workarounds.
To address this, relaxed DNS validation was introduced as alpha in v1.32 and has now graduated to stable in v1.34. A common use case involves Pods that need to communicate with both internal Kubernetes services and external domains. By setting a single dot (.
) as the first entry in the searches
list of the Pod's .spec.dnsConfig
, administrators can prevent the system's resolver from appending the cluster's internal search domains to external queries. This avoids generating unnecessary DNS requests to the internal DNS server for external hostnames, improving efficiency and preventing potential resolution errors.
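As an illustrative sketch, a Pod that opts out of having cluster search domains appended to its lookups could set the following (the Pod name and busybox image are placeholders, and depending on your dnsPolicy you may need additional dnsConfig fields):
apiVersion: v1
kind: Pod
metadata:
  name: external-lookups
spec:
  dnsConfig:
    searches:
    - "."              # relaxed validation now permits a single dot as the first search entry
  containers:
  - name: app
    image: docker.io/library/busybox:1.28
    command: ['sh', '-c', 'sleep 3600']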
This work was done as part of KEP #4427 led by SIG Network.
Support for Direct Server Return (DSR) in Windows kube-proxy
DSR provides performance optimizations by allowing return traffic routed through load balancers to bypass the load balancer and respond directly to the client, reducing load on the load balancer and improving overall latency. For information on DSR on Windows, read Direct Server Return (DSR) in a nutshell.
Initially introduced in v1.14, this feature has graduated to stable in v1.34.
This work was done as part of KEP #5100 led by SIG Windows.
Sleep action for Container lifecycle hooks
A Sleep action for containers' PreStop and PostStart lifecycle hooks was introduced to provide a straightforward way to manage graceful shutdowns and improve overall container lifecycle management.
The Sleep action allows containers to pause for a specified duration after starting or before termination. Using a negative or zero sleep duration returns immediately, resulting in a no-op.
The Sleep action was introduced in Kubernetes v1.29, with zero value support added in v1.32. Both features graduated to stable in v1.34.
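For instance, a container could be given a short pre-stop pause like this (the 5 second value and busybox image are arbitrary illustrative choices):
containers:
- name: app
  image: docker.io/library/busybox:1.28
  command: ['sh', '-c', 'sleep 3600']
  lifecycle:
    preStop:
      sleep:
        seconds: 5   # pause for 5 seconds before termination proceeds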
This work was done as part of KEP #3960 and KEP #4818 led by SIG Node.
Linux node swap support
Historically, the lack of swap support in Kubernetes could lead to workload instability, as nodes under memory pressure often had to terminate processes abruptly. This particularly affected applications with large but infrequently accessed memory footprints and prevented more graceful resource management.
To address this, configurable per-node swap support was introduced in v1.22. It has progressed through alpha and beta stages and has graduated to stable in v1.34. The primary mode, LimitedSwap
, allows Pods to use swap within their existing memory limits, providing a direct solution to the problem. By default, the kubelet
is configured with NoSwap
mode, which means Kubernetes workloads cannot use swap.
This feature improves workload stability and allows for more efficient resource utilization. It enables clusters to support a wider variety of applications, especially in resource-constrained environments, though administrators must consider the potential performance impact of swapping.
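On a Linux node, swap use is controlled through the kubelet configuration; a minimal sketch that opts into LimitedSwap might look like this:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
failSwapOn: false        # allow the kubelet to start on a node that has swap enabled
memorySwap:
  swapBehavior: LimitedSwap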
This work was done as part of KEP #2400 led by SIG Node.
Allow special characters in environment variables
The environment variable validation rules in Kubernetes have been relaxed to allow nearly all printable ASCII characters in variable names, excluding =
. This change supports scenarios where workloads require nonstandard characters in variable names - for example, frameworks like .NET Core that use :
to represent nested configuration keys.
The relaxed validation applies to environment variables defined directly in Pod spec, as well as those injected using envFrom
references to ConfigMaps and Secrets.
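As a small illustration, a container for a .NET application could now declare a nested configuration key directly; the name, value, and image below are hypothetical:
containers:
- name: dotnet-app
  image: registry.example/dotnet-app:latest   # hypothetical image
  env:
  - name: "Logging:LogLevel:Default"          # ':' in the name is now accepted
    value: "Information"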
This work was done as part of KEP #4369 led by SIG Node.
Taint management is separated from Node lifecycle
Historically, the TaintManager
's logic for applying NoSchedule and NoExecute taints to nodes based on their condition (NotReady, Unreachable, etc.) was tightly coupled with the node lifecycle controller. This tight coupling made the code harder to maintain and test, and it also limited the flexibility of the taint-based eviction mechanism. This KEP refactors the TaintManager
into its own separate controller within the Kubernetes controller manager. It is an internal architectural improvement designed to increase code modularity and maintainability. This change allows the logic for taint-based evictions to be tested and evolved independently, but it has no direct user-facing impact on how taints are used.
This work was done as part of KEP #3902 led by SIG Scheduling and SIG Node.
New features in Beta
This is a selection of some of the improvements that are now beta following the v1.34 release.
Pod-level resource requests and limits
Defining resource needs for Pods with multiple containers has been challenging, as requests and limits could only be set on a per-container basis. This forced developers to either over-provision resources for each container or meticulously divide the total desired resources, making configuration complex and often leading to inefficient resource allocation. To simplify this, the ability to specify resource requests and limits at the Pod level was introduced. This allows developers to define an overall resource budget for a Pod, which is then shared among its constituent containers. This feature was introduced as alpha in v1.32 and has graduated to beta in v1.34, with HPA now supporting pod-level resource specifications.
The primary benefit is a more intuitive and straightforward way to manage resources for multi-container Pods. It ensures that the total resources used by all containers do not exceed the Pod's defined limits, leading to better resource planning, more accurate scheduling, and more efficient utilization of cluster resources.
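A minimal sketch of a Pod-level resource budget shared by two containers (assuming the PodLevelResources feature gate is enabled; names and values are illustrative) could look like this:
apiVersion: v1
kind: Pod
metadata:
  name: pod-level-budget
spec:
  resources:            # budget shared by all containers in the Pod
    requests:
      cpu: "1"
      memory: 1Gi
    limits:
      cpu: "2"
      memory: 2Gi
  containers:
  - name: app
    image: docker.io/library/busybox:1.28
    command: ['sh', '-c', 'sleep 3600']
  - name: helper
    image: docker.io/library/busybox:1.28
    command: ['sh', '-c', 'sleep 3600']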
This work was done as part of KEP #2837 led by SIG Scheduling and SIG Autoscaling.
.kuberc
file for kubectl
user preferences
A .kuberc
configuration file allows you to define preferences for kubectl
, such as default options and command aliases. Unlike the kubeconfig file, the .kuberc
configuration file does not contain cluster details, usernames or passwords.
This feature was introduced as alpha in v1.33, gated behind the environment variable KUBECTL_KUBERC
. It has graduated to beta in v1.34 and is enabled by default.
This work was done as part of KEP #3104 led by SIG CLI.
External ServiceAccount token signing
Traditionally, Kubernetes manages ServiceAccount tokens using static signing keys that are loaded from disk at kube-apiserver
startup. This feature introduces an ExternalJWTSigner
gRPC service for out-of-process signing, enabling Kubernetes distributions to integrate with external key management solutions (for example, HSMs, cloud KMSes) for ServiceAccount token signing instead of static disk-based keys.
Introduced as alpha in v1.32, this external JWT signing capability advances to beta and is enabled by default in v1.34.
This work was done as part of KEP #740 led by SIG Auth.
DRA features in beta
Admin access for secure resource monitoring
DRA supports controlled administrative access via the adminAccess
field in ResourceClaims or ResourceClaimTemplates, allowing cluster operators to access devices already in use by others for monitoring or diagnostics. This privileged mode is limited to users authorized to create such objects in namespaces labeled resource.k8s.io/admin-access: "true"
, ensuring regular workloads remain unaffected. Graduating to beta in v1.34, this feature provides secure introspection capabilities while preserving workload isolation through namespace-based authorization checks.
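For instance, an operator could reserve a hypothetical monitoring namespace for this privileged mode by labeling it as described above, and then create ResourceClaims with the adminAccess field set in that namespace:
apiVersion: v1
kind: Namespace
metadata:
  name: device-monitoring                  # hypothetical namespace for diagnostic tooling
  labels:
    resource.k8s.io/admin-access: "true"   # allows admin-access ResourceClaims in this namespace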
This work was done as part of KEP #5018 led by WG Device Management and SIG Auth.
Prioritized alternatives in ResourceClaims and ResourceClaimTemplates
While a workload might run best on a single high-performance GPU, it might also be able to run on two mid-level GPUs.
With the feature gate DRAPrioritizedList
(now enabled by default), ResourceClaims and ResourceClaimTemplates get a new field named firstAvailable
. This field is an ordered list that allows users to specify that a request may be satisfied in different ways, including allocating nothing at all if specific hardware is not available. The scheduler will attempt to satisfy the alternatives in the list in order, so the workload will be allocated the best set of devices available in the cluster.
This work was done as part of KEP #4816 led by WG Device Management.
The kubelet
reports allocated DRA resources
The kubelet
's API has been updated to report on Pod resources allocated through DRA. This allows node monitoring agents to discover the allocated DRA resources for Pods on a node. Additionally, it enables node components to use the PodResourcesAPI and leverage this DRA information when developing new features and integrations.
Starting from Kubernetes v1.34, this feature is enabled by default.
This work was done as part of KEP #3695 led by WG Device Management.
kube-scheduler
non-blocking API calls
The kube-scheduler
makes blocking API calls during scheduling cycles, creating performance bottlenecks. This feature introduces asynchronous API handling through a prioritized queue system with request deduplication, allowing the scheduler to continue processing Pods while API operations complete in the background. Key benefits include reduced scheduling latency, prevention of scheduler thread starvation during API delays, and immediate retry capability for unschedulable Pods. The implementation maintains backward compatibility and adds metrics for monitoring pending API operations.
This work was done as part of KEP #5229 led by SIG Scheduling.
Mutating admission policies
MutatingAdmissionPolicies offer a declarative, in-process alternative to mutating admission webhooks. This feature leverages CEL's object instantiation and JSON Patch strategies, combined with Server Side Apply's merge algorithms.
This significantly simplifies admission control by allowing administrators to define mutation rules directly in the API server.
Introduced as alpha in v1.32, mutating admission policies has graduated to beta in v1.34.
This work was done as part of KEP #3962 led by SIG API Machinery.
Snapshottable API server cache
The kube-apiserver
's caching mechanism (watch cache) efficiently serves requests for the latest observed state. However, list requests for previous states (for example, via pagination or by specifying a resourceVersion
) often bypass this cache and are served directly from etcd. This direct etcd access significantly increases performance costs and can lead to stability issues, particularly with large resources, due to memory pressure from transferring large data blobs.
With the ListFromCacheSnapshot
feature gate enabled by default, kube-apiserver
will attempt to serve the response from snapshots if one is available with resourceVersion
older than requested. The kube-apiserver
starts with no snapshots, creates a new snapshot on every watch event, and keeps snapshots until it detects that etcd has been compacted or until the cache fills with events older than 75 seconds. If the provided resourceVersion is unavailable, the server will fall back to etcd.
This work was done as part of KEP #4988 led by SIG API Machinery.
Tooling for declarative validation of Kubernetes-native types
Prior to this release, validation rules for the APIs built into Kubernetes were written entirely by hand, which makes them difficult for maintainers to discover, understand, improve or test. There was no single way to find all the validation rules that might apply to an API. Declarative validation benefits Kubernetes maintainers by making API development, maintenance, and review easier while enabling programmatic inspection for better tooling and documentation. For people using Kubernetes libraries to write their own code (for example: a controller), the new approach streamlines adding new fields through IDL tags, rather than complex validation functions. This change helps speed up API creation by automating validation boilerplate, and provides more relevant error messages by performing validation on versioned types.
This enhancement (which graduated to beta in v1.33 and continues as beta in v1.34) brings CEL-based validation rules to native Kubernetes types. It allows for more granular and declarative validation to be defined directly in the type definitions, improving API consistency and developer experience.
This work was done as part of KEP #5073 led by SIG API Machinery.
Streaming informers for list requests
The streaming informers feature, which has been in beta since v1.32, gains further beta refinements in v1.34. This capability allows list requests to return data as a continuous stream of objects from the API server's watch cache, rather than assembling paged results directly from etcd. By reusing the same mechanics used for watch operations, the API server can serve large datasets while keeping memory usage steady and avoiding allocation spikes that can affect stability.
In this release, the kube-apiserver
and kube-controller-manager
both take advantage of the new WatchList
mechanism by default. For the kube-apiserver
, this means list requests are streamed more efficiently, while the kube-controller-manager
benefits from a more memory-efficient and predictable way to work with informers. Together, these improvements reduce memory pressure during large list operations, and improve reliability under sustained load, making list streaming more predictable and efficient.
This work was done as part of KEP #3157 led by SIG API Machinery and SIG Scalability.
Graceful node shutdown handling for Windows nodes
The kubelet
on Windows nodes can now detect system shutdown events and begin graceful termination of running Pods. This mirrors existing behavior on Linux and helps ensure workloads exit cleanly during planned shutdowns or restarts.
When the system begins shutting down, the kubelet
reacts by using standard termination logic. It respects the configured lifecycle hooks and grace periods, giving Pods time to stop before the node powers off. The feature relies on Windows pre-shutdown notifications to coordinate this process. This enhancement improves workload reliability during maintenance, restarts, or system updates. It is now in beta and enabled by default.
This work was done as part of KEP #4802 led by SIG Windows.
In-place Pod resize improvements
Graduated to beta and enabled by default in v1.33, in-place Pod resizing receives further improvements in v1.34. These include support for decreasing memory usage and integration with Pod-level resources.
This feature remains in beta in v1.34. For detailed usage instructions and examples, refer to the documentation: Resize CPU and Memory Resources assigned to Containers.
This work was done as part of KEP #1287 led by SIG Node and SIG Autoscaling.
New features in Alpha
This is a selection of some of the improvements that are now alpha following the v1.34 release.
Pod certificates for mTLS authentication
Authenticating workloads within a cluster, especially for communication with the API server, has primarily relied on ServiceAccount tokens. While effective, these tokens aren't always ideal for establishing a strong, verifiable identity for mutual TLS (mTLS) and can present challenges when integrating with external systems that expect certificate-based authentication.
Kubernetes v1.34 introduces a built-in mechanism for Pods to obtain X.509 certificates via PodCertificateRequests. The kubelet
can request and manage certificates for Pods, which can then be used to authenticate to the Kubernetes API server and other services using mTLS. The primary benefit is a more robust and flexible identity mechanism for Pods. It provides a native way to implement strong mTLS authentication without relying solely on bearer tokens, aligning Kubernetes with standard security practices and simplifying integrations with certificate-aware observability and security tooling.
This work was done as part of KEP #4317 led by SIG Auth.
"Restricted" Pod security standard now forbids remote probes
The host
field within probes and lifecycle handlers allows users to specify an entity other than the podIP
for the kubelet
to probe. However, this opens up a route for misuse and for attacks that bypass security controls, since the host
field could be set to any value, including security sensitive external hosts, or localhost on the node. In Kubernetes v1.34, Pods only meet the Restricted Pod security standard if they either leave the host
field unset, or if they don't even use this kind of probe. You can use Pod security admission, or a third party solution, to enforce that Pods meet this standard. Because these are security controls, check the documentation to understand the limitations and behavior of the enforcement mechanism you choose.
This work was done as part of KEP #4940 led by SIG Auth.
Use .status.nominatedNodeName
to express Pod placement
When the kube-scheduler
takes time to bind Pods to Nodes, cluster autoscalers may not understand that a Pod will be bound to a specific Node. Consequently, they may mistakenly consider the Node as underutilized and delete it.
To address this issue, the kube-scheduler
can use .status.nominatedNodeName
not only to indicate ongoing preemption but also to express Pod placement intentions. By enabling the NominatedNodeNameForExpectation
feature gate, the scheduler uses this field to indicate where a Pod will be bound. This exposes internal reservations to help external components make informed decisions.
This work was done as part of KEP #5278 led by SIG Scheduling.
DRA features in alpha
Resource health status for DRA
It can be difficult to know when a Pod is using a device that has failed or is temporarily unhealthy, which makes troubleshooting Pod crashes challenging or impossible.
Resource Health Status for DRA improves observability by exposing the health status of devices allocated to a Pod in the Pod's status. This makes it easier to identify the cause of Pod issues related to unhealthy devices and respond appropriately.
To enable this functionality, the ResourceHealthStatus
feature gate must be enabled, and the DRA driver must implement the DRAResourceHealth
gRPC service.
This work was done as part of KEP #4680 led by WG Device Management.
Extended resource mapping
Extended resource mapping provides a simpler alternative to DRA's expressive and flexible approach by offering a straightforward way to describe resource capacity and consumption. This feature enables cluster administrators to advertise DRA-managed resources as extended resources, allowing application developers and operators to continue using the familiar container's .spec.resources
syntax to consume them.
This enables existing workloads to adopt DRA without modifications, simplifying the transition to DRA for both application developers and cluster administrators.
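Sketched out, a workload can then keep requesting such a device the traditional way; here example.com/gpu stands in for whatever extended resource name a DRA driver advertises, and the image is hypothetical:
containers:
- name: trainer
  image: registry.example/trainer:latest   # hypothetical image
  resources:
    limits:
      example.com/gpu: 1                   # hypothetical extended resource backed by DRA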
This work was done as part of KEP #5004 led by WG Device Management.
DRA consumable capacity
Kubernetes v1.33 added support for resource drivers to advertise slices of a device that are available, rather than exposing the entire device as an all-or-nothing resource. However, this approach couldn't handle scenarios where device drivers manage fine-grained, dynamic portions of a device resource based on user demand, or share those resources independently of ResourceClaims, which are restricted by their spec and namespace.
Enabling the DRAConsumableCapacity
feature gate (introduced as alpha in v1.34) allows resource drivers to share the same device, or even a slice of a device, across multiple ResourceClaims or across multiple DeviceRequests. The feature also extends the scheduler to support allocating portions of device resources, as defined in the capacity
field. This DRA feature improves device sharing across namespaces and claims, tailoring it to Pod needs. It enables drivers to enforce capacity limits, enhances scheduling, and supports new use cases like bandwidth-aware networking and multi-tenant sharing.
This work was done as part of KEP #5075 led by WG Device Management.
Device binding conditions
The Kubernetes scheduler gets more reliable by delaying binding a Pod to a Node until its required external resources, such as attachable devices or FPGAs, are confirmed to be ready.
This delay mechanism is implemented in the PreBind phase of the scheduling framework. During this phase, the scheduler checks whether all required device conditions are satisfied before proceeding with binding. This enables coordination with external device controllers, ensuring more robust, predictable scheduling.
This work was done as part of KEP #5007 led by WG Device Management.
Container restart rules
Currently, all containers within a Pod will follow the same .spec.restartPolicy
when exited or crashed. However, Pods that run multiple containers might have different restart requirements for each container. For example, for init containers used to perform initialization, you may not want to retry initialization if they fail. Similarly, in ML research environments with long-running training workloads, containers that fail with retriable exit codes should restart quickly in place, rather than triggering Pod recreation and losing progress.
Kubernetes v1.34 introduces the ContainerRestartRules
feature gate. When enabled, a restartPolicy
can be specified for each container within a Pod. A restartPolicyRules
list can also be defined to override restartPolicy
based on the last exit code. This provides the fine-grained control needed to handle complex scenarios and better utilization of compute resources.
This work was done as part of KEP #5307 led by SIG Node.
Load environment variables from files created in runtime
Application developers have long requested greater flexibility in declaring environment variables. Traditionally, environment variables are declared on the API server side via static values, ConfigMaps, or Secrets.
Behind the EnvFiles
feature gate, Kubernetes v1.34 introduces the ability to declare environment variables at runtime. One container (typically an init container) can generate the variable and store it in a file, and a subsequent container can start with the environment variable loaded from that file. This approach eliminates the need to "wrap" the target container's entry point, enabling more flexible in-Pod container orchestration.
This feature particularly benefits AI/ML training workloads, where each Pod in a training Job requires initialization with runtime-defined values.
This work was done as part of KEP #5307 led by SIG Node.
Graduations, deprecations, and removals in v1.34
Graduations to stable
This lists all the features that graduated to stable (also known as general availability). For a full list of updates including new features and graduations from alpha to beta, see the release notes.
This release includes a total of 23 enhancements promoted to stable:
- Allow almost all printable ASCII characters in environment variables
- Allow for recreation of pods once fully terminated in the job controller
- Allow zero value for Sleep Action of PreStop Hook
- API Server tracing
- AppArmor support
- Authorize with Field and Label Selectors
- Consistent Reads from Cache
- Decouple TaintManager from NodeLifecycleController
- Discover cgroup driver from CRI
- DRA: structured parameters
- Introducing Sleep Action for PreStop Hook
- Kubelet OpenTelemetry Tracing
- Kubernetes VolumeAttributesClass ModifyVolume
- Node memory swap support
- Only allow anonymous auth for configured endpoints
- Ordered namespace deletion
- Per-plugin callback functions for accurate requeueing in kube-scheduler
- Relaxed DNS search string validation
- Resilient Watchcache Initialization
- Streaming Encoding for LIST Responses
- Structured Authentication Config
- Support for Direct Server Return (DSR) and overlay networking in Windows kube-proxy
- Support recovery from volume expansion failure
Deprecations and removals
As Kubernetes develops and matures, features may be deprecated, removed, or replaced with better ones to improve the project's overall health. See the Kubernetes deprecation and removal policy for more details on this process. Kubernetes v1.34 includes a couple of deprecations.
Manual cgroup driver configuration is deprecated
Historically, configuring the correct cgroup driver has been a pain point for users running Kubernetes clusters. Kubernetes v1.28 added a way for the kubelet
to query the CRI implementation and find which cgroup driver to use. That automated detection is now strongly recommended and support for it has graduated to stable in v1.34. If your CRI container runtime does not support the ability to report the cgroup driver it needs, you should upgrade or change your container runtime. The cgroupDriver
configuration setting in the kubelet
configuration file is now deprecated. The corresponding command-line option --cgroup-driver
was previously deprecated, as Kubernetes recommends using the configuration file instead. Both the configuration setting and the command-line option will be removed in a future release; that removal will not happen before the v1.36 minor release.
This work was done as part of KEP #4033 led by SIG Node.
Kubernetes to end containerd 1.x support in v1.36
While Kubernetes v1.34 still supports containerd 1.7 and other LTS releases of containerd, as a consequence of automated cgroup driver detection, the Kubernetes SIG Node community has formally agreed upon a final support timeline for containerd v1.X. The last Kubernetes release to offer this support will be v1.35 (aligned with containerd 1.7 EOL). This is an early warning that if you are using containerd 1.X, consider switching to 2.0+ soon. You are able to monitor the kubelet_cri_losing_support
metric to determine if any nodes in your cluster are using a containerd version that will soon be outdated.
This work was done as part of KEP #4033 led by SIG Node.
PreferClose
traffic distribution is deprecated
The spec.trafficDistribution
field within a Kubernetes Service allows users to express preferences for how traffic should be routed to Service endpoints.
KEP-3015 deprecates PreferClose
and introduces two additional values: PreferSameZone
and PreferSameNode
. PreferSameZone
is an alias for the existing PreferClose
to clarify its semantics. PreferSameNode
allows connections to be delivered to a local endpoint when possible, falling back to a remote endpoint when not possible.
This feature was introduced in v1.33 behind the PreferSameTrafficDistribution
feature gate. It has graduated to beta in v1.34 and is enabled by default.
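A Service opting into node-local routing simply sets the new value; for example (the name, selector, and ports below are illustrative):
apiVersion: v1
kind: Service
metadata:
  name: node-local-preferred
spec:
  selector:
    app: example          # hypothetical selector
  ports:
  - port: 80
    targetPort: 8080
  trafficDistribution: PreferSameNode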
This work was done as part of KEP #3015 led by SIG Network.
Release notes
Check out the full details of the Kubernetes v1.34 release in our release notes.
Availability
Kubernetes v1.34 is available for download on GitHub or on the Kubernetes download page.
To get started with Kubernetes, check out these interactive tutorials or run local Kubernetes clusters using minikube. You can also easily install v1.34 using kubeadm.
Release Team
Kubernetes is only possible with the support, commitment, and hard work of its community. Each release team is made up of dedicated community volunteers who work together to build the many pieces that make up the Kubernetes releases you rely on. This requires the specialized skills of people from all corners of our community, from the code itself to its documentation and project management.
We honor the memory of Rodolfo "Rodo" Martínez Vega, a dedicated contributor whose passion for technology and community building left a mark on the Kubernetes community. Rodo served as a member of the Kubernetes Release Team across multiple releases, including v1.22-v1.23 and v1.25-v1.30, demonstrating unwavering commitment to the project's success and stability.
Beyond his Release Team contributions, Rodo was deeply involved in fostering the Cloud Native LATAM community, helping to bridge language and cultural barriers in the space. His work on the Spanish version of Kubernetes documentation and the CNCF Glossary exemplified his dedication to making knowledge accessible to Spanish-speaking developers worldwide. Rodo's legacy lives on through the countless community members he mentored, the releases he helped deliver, and the vibrant LATAM Kubernetes community he helped cultivate.
We would like to thank the entire Release Team for the hours spent hard at work to deliver the Kubernetes v1.34 release to our community. The Release Team's membership ranges from first-time shadows to returning team leads with experience forged over several release cycles. A very special thanks goes out to our release lead, Vyom Yadav, for guiding us through a successful release cycle, for his hands-on approach to solving challenges, and for bringing the energy and care that drives our community forward.
Project Velocity
The CNCF K8s DevStats project aggregates a number of interesting data points related to the velocity of Kubernetes and various sub-projects. This includes everything from individual contributions to the number of companies that are contributing and is an illustration of the depth and breadth of effort that goes into evolving this ecosystem.
During the v1.34 release cycle, which spanned 15 weeks from 19th May 2025 to 27th August 2025, Kubernetes received contributions from as many as 106 different companies and 491 individuals. In the wider cloud native ecosystem, the figure goes up to 370 companies, counting 2235 total contributors.
Note that "contribution" counts when someone makes a commit, code review, comment, creates an issue or PR, reviews a PR (including blogs and documentation) or comments on issues and PRs.
If you are interested in contributing, visit Getting Started on our contributor website.
Source for this data:
Event Update
Explore upcoming Kubernetes and cloud native events, including KubeCon + CloudNativeCon, KCD, and other notable conferences worldwide. Stay informed and get involved with the Kubernetes community!
August 2025
- KCD - Kubernetes Community Days: Colombia: Aug 28, 2025 | Bogotá, Colombia
September 2025
- CloudCon Sydney: Sep 9-10, 2025 | Sydney, Australia.
- KCD - Kubernetes Community Days: San Francisco Bay Area: Sep 9, 2025 | San Francisco, USA
- KCD - Kubernetes Community Days: Washington DC: Sep 16, 2025 | Washington, D.C., USA
- KCD - Kubernetes Community Days: Sofia: Sep 18, 2025 | Sofia, Bulgaria
- KCD - Kubernetes Community Days: El Salvador: Sep 20, 2025 | San Salvador, El Salvador
October 2025
- KCD - Kubernetes Community Days: Warsaw: Oct 9, 2025 | Warsaw, Poland
- KCD - Kubernetes Community Days: Edinburgh: Oct 21, 2025 | Edinburgh, United Kingdom
- KCD - Kubernetes Community Days: Sri Lanka: Oct 26, 2025 | Colombo, Sri Lanka
November 2025
- KCD - Kubernetes Community Days: Porto: Nov 3, 2025 | Porto, Portugal
- KubeCon + CloudNativeCon North America 2025: Nov 10-13, 2025 | Atlanta, USA
- KCD - Kubernetes Community Days: Hangzhou: Nov 15, 2025 | Hangzhou, China
December 2025
- KCD - Kubernetes Community Days: Suisse Romande: Dec 4, 2025 | Geneva, Switzerland
You can find the latest event details here.
Upcoming Release Webinar
Join members of the Kubernetes v1.34 Release Team on Wednesday, September 24th 2025 at 4:00 PM (UTC), to learn about the release highlights of this release. For more information and registration, visit the event page on the CNCF Online Programs site.
Get Involved
The simplest way to get involved with Kubernetes is by joining one of the many Special Interest Groups (SIGs) that align with your interests. Have something you'd like to broadcast to the Kubernetes community? Share your voice at our weekly community meeting, and through the channels below. Thank you for your continued feedback and support.
- Follow us on Bluesky @Kubernetesio for the latest updates
- Join the community discussion on Discuss
- Join the community on Slack
- Post questions (or answer questions) on Stack Overflow
- Share your Kubernetes story
- Read more about what's happening with Kubernetes on the blog
- Learn more about the Kubernetes Release Team
27 Aug 2025 6:30pm GMT
19 Aug 2025
Kubernetes Blog
Tuning Linux Swap for Kubernetes: A Deep Dive
The Kubernetes NodeSwap feature, likely to graduate to stable in the upcoming Kubernetes v1.34 release, allows swap usage: a significant shift from the conventional practice of disabling swap for performance predictability. This article focuses exclusively on tuning swap on Linux nodes, where this feature is available. By allowing Linux nodes to use secondary storage for additional virtual memory when physical RAM is exhausted, node swap support aims to improve resource utilization and reduce out-of-memory (OOM) kills.
However, enabling swap is not a "turn-key" solution. The performance and stability of your nodes under memory pressure are critically dependent on a set of Linux kernel parameters. Misconfiguration can lead to performance degradation and interfere with Kubelet's eviction logic.
In this blogpost, I'll dive into critical Linux kernel parameters that govern swap behavior. I will explore how these parameters influence Kubernetes workload performance, swap utilization, and crucial eviction mechanisms. I will present various test results showcasing the impact of different configurations, and share my findings on achieving optimal settings for stable and high-performing Kubernetes clusters.
Introduction to Linux swap
At a high level, the Linux kernel manages memory through pages, typically 4KiB in size. When physical memory becomes constrained, the kernel's page replacement algorithm decides which pages to move to swap space. While the exact logic is a sophisticated optimization, this decision-making process is influenced by certain key factors:
- Page access patterns (how recently pages are accessed)
- Page dirtiness (whether pages have been modified)
- Memory pressure (how urgently the system needs free memory)
Anonymous vs File-backed memory
It is important to understand that not all memory pages are the same. The kernel distinguishes between anonymous and file-backed memory.
Anonymous memory: This is memory that is not backed by a specific file on the disk, such as a program's heap and stack. From the application's perspective this is private memory, and when the kernel needs to reclaim these pages, it must write them to a dedicated swap device.
File-backed memory: This memory is backed by a file on a filesystem. This includes a program's executable code, shared libraries, and filesystem caches. When the kernel needs to reclaim these pages, it can simply discard them if they have not been modified ("clean"). If a page has been modified ("dirty"), the kernel must first write the changes back to the file before it can be discarded.
While a system without swap can still reclaim clean file-backed pages under memory pressure by dropping them, it has no way to offload anonymous memory. Enabling swap provides this capability, allowing the kernel to move less-frequently accessed memory pages to disk, conserving memory and avoiding system OOM kills.
Key kernel parameters for swap tuning
To effectively tune swap behavior, Linux provides several kernel parameters that can be managed via sysctl
.
- vm.swappiness: This is the most well-known parameter. It is a value from 0 to 200 (100 in older kernels) that controls the kernel's preference for swapping anonymous memory pages versus reclaiming file-backed memory pages (page cache).
  - High value (eg: 90+): The kernel will be aggressive in swapping out less-used anonymous memory to make room for file-cache.
  - Low value (eg: < 10): The kernel will strongly prefer dropping file cache pages over swapping anonymous memory.
- vm.min_free_kbytes: This parameter tells the kernel to keep a minimum amount of memory free as a buffer. When the amount of free memory drops below this safety buffer, the kernel starts more aggressively reclaiming pages (swapping, and eventually handling OOM kills).
  - Function: It acts as a safety lever to ensure the kernel has enough memory for critical allocation requests that cannot be deferred.
  - Impact on swap: Setting a higher min_free_kbytes effectively raises the floor for free memory, causing the kernel to initiate swap earlier under memory pressure.
- vm.watermark_scale_factor: This setting controls the gap between the different watermarks: min, low and high, which are calculated based on min_free_kbytes.
  - Watermarks explained:
    - low: When free memory is below this mark, the kswapd kernel process wakes up to reclaim pages in the background. This is when a swapping cycle begins.
    - min: When free memory hits this minimum level, aggressive page reclamation will block process allocation. Failing to reclaim pages will cause OOM kills.
    - high: Memory reclamation stops once the free memory reaches this level.
  - Impact: A higher watermark_scale_factor creates a larger buffer between the low and min watermarks. This gives kswapd more time to reclaim memory gradually before the system hits a critical state.
In a typical server workload, you might have a long-running process with some memory that becomes 'cold'. A higher swappiness value can free up RAM by swapping out that cold memory, leaving more room for other active processes that benefit from keeping their file cache.
Tuning the min_free_kbytes and watermark_scale_factor parameters to move the swapping window earlier will give kswapd more room to offload memory to disk and prevent OOM kills during sudden memory spikes.
Swap tests and results
To understand the real impact of these parameters, I designed a series of stress tests.
Test setup
- Environment: GKE on Google Cloud
- Kubernetes version: 1.33.2
- Node configuration: n2-standard-2 (8GiB RAM, 50GB swap on a pd-balanced disk, without encryption), Ubuntu 22.04
- Workload: A custom Go application designed to allocate memory at a configurable rate, generate file-cache pressure, and simulate different memory access patterns (random vs sequential).
- Monitoring: A sidecar container capturing system metrics every second.
- Protection: Critical system components (kubelet, container runtime, sshd) were prevented from swapping by setting memory.swap.max=0 in their respective cgroups.
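For reference, the sketch below shows one way such protection can be applied on a cgroup v2 node: writing 0 to each critical unit's memory.swap.max file. This is a minimal illustration rather than the exact tooling used in the tests; the unit names and cgroup paths are assumptions and will vary with your node image and cgroup layout.

```go
// disable_swap_for_critical_units.go
//
// Minimal sketch: disable swap for critical system services on a cgroup v2
// node by writing "0" to each unit's memory.swap.max file. The unit names
// and paths below are illustrative; adjust them to match your node.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	units := []string{
		"kubelet.service",
		"containerd.service",
		"ssh.service",
	}
	for _, unit := range units {
		path := filepath.Join("/sys/fs/cgroup/system.slice", unit, "memory.swap.max")
		// "0" means this cgroup (and its children) may not use any swap.
		if err := os.WriteFile(path, []byte("0"), 0o644); err != nil {
			fmt.Fprintf(os.Stderr, "failed to disable swap for %s: %v\n", unit, err)
			continue
		}
		fmt.Printf("swap disabled for %s\n", unit)
	}
}
```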
Test methodology
I ran a stress-test pod on nodes with different swappiness settings (0, 60, and 90) and varied the min_free_kbytes and watermark_scale_factor parameters to observe the outcomes under heavy memory allocation and I/O pressure.
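The stress application itself is not published with this post, but its core allocation loop looks roughly like the minimal Go sketch below: it allocates anonymous memory at a configurable rate and touches every page so the allocations are actually resident (and therefore candidates for swapping). The flag names and defaults are illustrative.

```go
// memstress.go: a minimal sketch of an anonymous-memory stress generator.
// It allocates `-rate` MiB every second and writes to each page so the
// memory is actually committed (and therefore a candidate for swapping).
package main

import (
	"flag"
	"fmt"
	"time"
)

func main() {
	rate := flag.Int("rate", 100, "allocation rate in MiB per second")
	limit := flag.Int("limit", 10*1024, "total MiB to allocate before holding")
	flag.Parse()

	const pageSize = 4096
	var chunks [][]byte // keep references so nothing is garbage collected

	for allocated := 0; allocated < *limit; allocated += *rate {
		chunk := make([]byte, *rate*1024*1024)
		for i := 0; i < len(chunk); i += pageSize {
			chunk[i] = 1 // touch one byte per page to force it to be resident
		}
		chunks = append(chunks, chunk)
		fmt.Printf("allocated %d MiB\n", allocated+*rate)
		time.Sleep(time.Second)
	}

	fmt.Printf("holding %d chunks\n", len(chunks))
	for {
		time.Sleep(time.Hour) // hold the memory until the pod is terminated
	}
}
```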
Visualizing swap in action
The graph below, from a 100MB/s stress test, shows swap in action. As free memory (in the "Memory Usage" plot) decreases, swap usage (Swap Used (GiB)) and swap-out activity (Swap Out (MiB/s)) increase. Critically, as the system relies more on swap, the I/O activity and the corresponding wait time (IO Wait % in the "CPU Usage" plot) also rise, indicating CPU stress.
Findings
My initial tests with default kernel parameters (swappiness=60, min_free_kbytes=68MB, watermark_scale_factor=10) quickly led to OOM kills and even unexpected node restarts under high memory pressure. By selecting appropriate kernel parameters, a good balance between node stability and performance can be achieved.
The impact of swappiness
The swappiness parameter directly influences the kernel's choice between reclaiming anonymous memory (swapping) and dropping page cache. To observe the kernel's reclaim preference, I ran a test where one pod generated and held file-cache pressure, followed by a second pod allocating anonymous memory at 100MB/s:
My findings reveal a clear trade-off:
- swappiness=90: The kernel proactively swapped out the inactive anonymous memory to keep the file cache. This resulted in high and sustained swap usage and significant I/O activity ("Blocks Out"), which in turn caused spikes in I/O wait on the CPU.
- swappiness=0: The kernel favored dropping file-cache pages, delaying swap consumption. However, it's critical to understand that this does not disable swapping. When memory pressure was high, the kernel still swapped anonymous memory to disk.
The choice is workload-dependent. For workloads sensitive to I/O latency, a lower swappiness is preferable. For workloads that rely on a large and frequently accessed file cache, a higher swappiness may be beneficial, provided the underlying disk is fast enough to handle the load.
Tuning watermarks to prevent eviction and OOM kills
The most critical challenge I encountered was the interaction between rapid memory allocation and the kubelet's eviction mechanism. When my test pod, which was deliberately configured to overcommit memory, allocated memory at a high rate (e.g., 300-500 MB/s), the system quickly ran out of free memory.
With default watermarks, the buffer for reclamation was too small. Before kswapd could free up enough memory by swapping, the node would hit a critical state, leading to two potential outcomes:
- Kubelet eviction: If the kubelet's eviction manager detected that memory.available was below its threshold, it would evict the pod.
- OOM killer: In some high-rate scenarios, the OOM killer would activate before eviction could complete, sometimes killing higher-priority pods that were not the source of the pressure.
To mitigate this, I tuned the watermarks:
- Increased min_free_kbytes to 512MiB: This forces the kernel to start reclaiming memory much earlier, providing a larger safety buffer.
- Increased watermark_scale_factor to 2000: This widened the gap between the low and high watermarks (from ≈337MB to ≈591MB in my test node's /proc/zoneinfo), effectively increasing the swapping window.
This combination gave kswapd a larger operational zone and more time to swap pages to disk during memory spikes, successfully preventing both premature evictions and OOM kills in my test runs.
The table below compares the watermark levels from /proc/zoneinfo (non-NUMA node):
| min_free_kbytes=67584KiB and watermark_scale_factor=10 | min_free_kbytes=524288KiB and watermark_scale_factor=2000 |
|---|---|
| Node 0, zone Normal: pages free 583273, boost 0, min 10504, low 13130, high 15756, spanned 1310720, present 1310720, managed 1265603 | Node 0, zone Normal: pages free 470539, min 82109, low 337017, high 591925, spanned 1310720, present 1310720, managed 1274542 |
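For readers who want to see how these numbers relate, the small Go sketch below reproduces the approximate per-zone calculation the kernel uses to derive the low and high watermarks from the min watermark, the zone's managed page count, and watermark_scale_factor. Feeding it the values from the table above yields the same low and high page counts; treat it as an illustration, since the exact formula can vary between kernel versions.

```go
// watermarks.go: approximate the kernel's per-zone watermark calculation.
//   gap  = max(min/4, managed * watermark_scale_factor / 10000)
//   low  = min + gap
//   high = min + 2*gap
// The inputs below are the page counts from the /proc/zoneinfo table above.
package main

import "fmt"

func watermarks(minPages, managedPages, scaleFactor uint64) (low, high uint64) {
	gap := minPages / 4
	if scaled := managedPages * scaleFactor / 10000; scaled > gap {
		gap = scaled
	}
	return minPages + gap, minPages + 2*gap
}

func main() {
	// Default tuning: min_free_kbytes=67584KiB, watermark_scale_factor=10
	low, high := watermarks(10504, 1265603, 10)
	fmt.Printf("default: low=%d high=%d\n", low, high) // low=13130 high=15756

	// Tuned: min_free_kbytes=524288KiB, watermark_scale_factor=2000
	low, high = watermarks(82109, 1274542, 2000)
	fmt.Printf("tuned:   low=%d high=%d\n", low, high) // low=337017 high=591925
}
```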
The graph below reveals that the kernel buffer size and scaling factor play a crucial role in determining how the system responds to memory load. With the right combination of these parameters, the system can effectively use swap space to avoid eviction and maintain stability.
Risks and recommendations
Enabling swap in Kubernetes is a powerful tool, but it comes with risks that must be managed through careful tuning.
- Risk of performance degradation: Swapping is orders of magnitude slower than accessing RAM. If an application's active working set is swapped out, its performance will suffer dramatically due to high I/O wait times (thrashing). Swap should preferably be provisioned on SSD-backed storage to improve performance.
- Risk of masking memory leaks: Swap can hide memory leaks in applications, which might otherwise lead to a quick OOM kill. With swap, a leaky application might slowly degrade node performance over time, making the root cause harder to diagnose.
- Risk of disabling evictions: The kubelet proactively monitors the node for memory pressure and terminates pods to reclaim resources. Improper tuning can lead to OOM kills before the kubelet has a chance to evict pods gracefully. A properly configured min_free_kbytes is essential to ensure the kubelet's eviction mechanism remains effective.
Kubernetes context
Together, the kernel watermarks and the kubelet eviction threshold create a series of memory pressure zones on a node. The eviction-threshold parameters need to be adjusted so that Kubernetes-managed evictions occur before OOM kills.
As the diagram shows, an ideal configuration creates a large enough 'swapping zone' (between the high and min watermarks) so that the kernel can handle memory pressure by swapping before available memory drops into the Eviction/Direct Reclaim zone.
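To check where those zone boundaries fall on your own nodes, you can read the watermarks straight from /proc/zoneinfo. The following is a minimal Go parsing sketch (assuming the zoneinfo layout shown in the table above); compare its output against your kubelet eviction thresholds, keeping in mind that the watermarks are reported in 4KiB pages.

```go
// readwatermarks.go: print the min/low/high watermarks (in pages) for each
// zone by parsing /proc/zoneinfo, so the configured swapping zone can be
// compared against the kubelet eviction threshold.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	f, err := os.Open("/proc/zoneinfo")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	var zone string
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) == 0 {
			continue
		}
		switch fields[0] {
		case "Node":
			// e.g. "Node 0, zone Normal"
			zone = strings.Join(fields, " ")
		case "min", "low", "high":
			if len(fields) >= 2 {
				fmt.Printf("%-24s %-5s %s pages\n", zone, fields[0], fields[1])
			}
		}
	}
}
```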
Recommended starting point
Based on these findings, I recommend the following as a starting point for Linux nodes with swap enabled. You should benchmark this with your own workloads.
- vm.swappiness=60: The Linux default is a good starting point for general-purpose workloads. However, the ideal value is workload-dependent, and swap-sensitive applications may need more careful tuning.
- vm.min_free_kbytes=500000 (500MB): Set this to a reasonably high value (e.g., 2-3% of total node memory) to give the node a sufficient safety buffer.
- vm.watermark_scale_factor=2000: This creates a larger window for kswapd to work with, preventing OOM kills during sudden memory allocation spikes.
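In practice these values are usually set through the node image, a boot-time sysctl configuration, or a privileged tuning DaemonSet. As a minimal sketch of the mechanism, the Go program below applies the suggested starting values by writing directly to /proc/sys, which has the same effect as sysctl -w; the values are the ones above and should be adjusted for your workloads.

```go
// applysysctls.go: a minimal sketch that applies the suggested starting
// values by writing directly to /proc/sys (equivalent to sysctl -w).
// Requires root on the node; adjust the values for your workloads.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	settings := map[string]string{
		"vm/swappiness":             "60",
		"vm/min_free_kbytes":        "500000",
		"vm/watermark_scale_factor": "2000",
	}
	for key, value := range settings {
		path := filepath.Join("/proc/sys", key)
		if err := os.WriteFile(path, []byte(value), 0o644); err != nil {
			fmt.Fprintf(os.Stderr, "failed to set %s=%s: %v\n", key, value, err)
			continue
		}
		fmt.Printf("set %s=%s\n", key, value)
	}
}
```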
I encourage running benchmark tests with your own workloads in test environments when setting up swap for the first time in your Kubernetes cluster. Swap performance is sensitive to environmental factors such as CPU load, disk type (SSD vs HDD), and I/O patterns.
19 Aug 2025 6:30pm GMT