20 Oct 2025
Kubernetes Blog
7 Common Kubernetes Pitfalls (and How I Learned to Avoid Them)
It's no secret that Kubernetes can be both powerful and frustrating at times. When I first started dabbling with container orchestration, I made more than my fair share of mistakes enough to compile a whole list of pitfalls. In this post, I want to walk through seven big gotchas I've encountered (or seen others run into) and share some tips on how to avoid them. Whether you're just kicking the tires on Kubernetes or already managing production clusters, I hope these insights help you steer clear of a little extra stress.
1. Skipping resource requests and limits
The pitfall: Not specifying CPU and memory requirements in Pod specifications. This typically happens because Kubernetes does not require these fields, and workloads can often start and run without them-making the omission easy to overlook in early configurations or during rapid deployment cycles.
Context: In Kubernetes, resource requests and limits are critical for efficient cluster management. Resource requests ensure that the scheduler reserves the appropriate amount of CPU and memory for each pod, guaranteeing that it has the necessary resources to operate. Resource limits cap the amount of CPU and memory a pod can use, preventing any single pod from consuming excessive resources and potentially starving other pods. When resource requests and limits are not set:
- Resource Starvation: Pods may get insufficient resources, leading to degraded performance or failures. This is because Kubernetes schedules pods based on these requests. Without them, the scheduler might place too many pods on a single node, leading to resource contention and performance bottlenecks.
- Resource Hoarding: Conversely, without limits, a pod might consume more than its fair share of resources, impacting the performance and stability of other pods on the same node. This can lead to issues such as other pods getting evicted or killed by the Out-Of-Memory (OOM) killer due to lack of available memory.
How to avoid it:
- Start with modest
requests(for example100mCPU,128Mimemory) and see how your app behaves. - Monitor real-world usage and refine your values; the HorizontalPodAutoscaler can help automate scaling based on metrics.
- Keep an eye on
kubectl top podsor your logging/monitoring tool to confirm you're not over- or under-provisioning.
My reality check: Early on, I never thought about memory limits. Things seemed fine on my local cluster. Then, on a larger environment, Pods got OOMKilled left and right. Lesson learned. For detailed instructions on configuring resource requests and limits for your containers, please refer to Assign Memory Resources to Containers and Pods (part of the official Kubernetes documentation).
2. Underestimating liveness and readiness probes
The pitfall: Deploying containers without explicitly defining how Kubernetes should check their health or readiness. This tends to happen because Kubernetes will consider a container "running" as long as the process inside hasn't exited. Without additional signals, Kubernetes assumes the workload is functioning-even if the application inside is unresponsive, initializing, or stuck.
Context:
Liveness, readiness, and startup probes are mechanisms Kubernetes uses to monitor container health and availability.
- Liveness probes determine if the application is still alive. If a liveness check fails, the container is restarted.
- Readiness probes control whether a container is ready to serve traffic. Until the readiness probe passes, the container is removed from Service endpoints.
- Startup probes help distinguish between long startup times and actual failures.
How to avoid it:
- Add a simple HTTP
livenessProbeto check a health endpoint (for example/healthz) so Kubernetes can restart a hung container. - Use a
readinessProbeto ensure traffic doesn't reach your app until it's warmed up. - Keep probes simple. Overly complex checks can create false alarms and unnecessary restarts.
My reality check: I once forgot a readiness probe for a web service that took a while to load. Users hit it prematurely, got weird timeouts, and I spent hours scratching my head. A 3-line readiness probe would have saved the day.
For comprehensive instructions on configuring liveness, readiness, and startup probes for containers, please refer to Configure Liveness, Readiness and Startup Probes in the official Kubernetes documentation.
3. "We'll just look at container logs" (famous last words)
The pitfall: Relying solely on container logs retrieved via kubectl logs. This often happens because the command is quick and convenient, and in many setups, logs appear accessible during development or early troubleshooting. However, kubectl logs only retrieves logs from currently running or recently terminated containers, and those logs are stored on the node's local disk. As soon as the container is deleted, evicted, or the node is restarted, the log files may be rotated out or permanently lost.
How to avoid it:
- Centralize logs using CNCF tools like Fluentd or Fluent Bit to aggregate output from all Pods.
- Adopt OpenTelemetry for a unified view of logs, metrics, and (if needed) traces. This lets you spot correlations between infrastructure events and app-level behavior.
- Pair logs with Prometheus metrics to track cluster-level data alongside application logs. If you need distributed tracing, consider CNCF projects like Jaeger.
My reality check: The first time I lost Pod logs to a quick restart, I realized how flimsy "kubectl logs" can be on its own. Since then, I've set up a proper pipeline for every cluster to avoid missing vital clues.
4. Treating dev and prod exactly the same
The pitfall: Deploying the same Kubernetes manifests with identical settings across development, staging, and production environments. This often occurs when teams aim for consistency and reuse, but overlook that environment-specific factors-such as traffic patterns, resource availability, scaling needs, or access control-can differ significantly. Without customization, configurations optimized for one environment may cause instability, poor performance, or security gaps in another.
How to avoid it:
- Use environment overlays or kustomize to maintain a shared base while customizing resource requests, replicas, or config for each environment.
- Extract environment-specific configuration into ConfigMaps and / or Secrets. You can use a specialized tool such as Sealed Secrets to manage confidential data.
- Plan for scale in production. Your dev cluster can probably get away with minimal CPU/memory, but prod might need significantly more.
My reality check: One time, I scaled up replicaCount from 2 to 10 in a tiny dev environment just to "test." I promptly ran out of resources and spent half a day cleaning up the aftermath. Oops.
5. Leaving old stuff floating around
The pitfall: Leaving unused or outdated resources-such as Deployments, Services, ConfigMaps, or PersistentVolumeClaims-running in the cluster. This often happens because Kubernetes does not automatically remove resources unless explicitly instructed, and there is no built-in mechanism to track ownership or expiration. Over time, these forgotten objects can accumulate, consuming cluster resources, increasing cloud costs, and creating operational confusion, especially when stale Services or LoadBalancers continue to route traffic.
How to avoid it:
- Label everything with a purpose or owner label. That way, you can easily query resources you no longer need.
- Regularly audit your cluster: run
kubectl get all -n <namespace>to see what's actually running, and confirm it's all legit. - Adopt Kubernetes' Garbage Collection: K8s docs show how to remove dependent objects automatically.
- Leverage policy automation: Tools like Kyverno can automatically delete or block stale resources after a certain period, or enforce lifecycle policies so you don't have to remember every single cleanup step.
My reality check: After a hackathon, I forgot to tear down a "test-svc" pinned to an external load balancer. Three weeks later, I realized I'd been paying for that load balancer the entire time. Facepalm.
6. Diving too deep into networking too soon
The pitfall: Introducing advanced networking solutions-such as service meshes, custom CNI plugins, or multi-cluster communication-before fully understanding Kubernetes' native networking primitives. This commonly occurs when teams implement features like traffic routing, observability, or mTLS using external tools without first mastering how core Kubernetes networking works: including Pod-to-Pod communication, ClusterIP Services, DNS resolution, and basic ingress traffic handling. As a result, network-related issues become harder to troubleshoot, especially when overlays introduce additional abstractions and failure points.
How to avoid it:
- Start small: a Deployment, a Service, and a basic ingress controller such as one based on NGINX (e.g., Ingress-NGINX).
- Make sure you understand how traffic flows within the cluster, how service discovery works, and how DNS is configured.
- Only move to a full-blown mesh or advanced CNI features when you actually need them, complex networking adds overhead.
My reality check: I tried Istio on a small internal app once, then spent more time debugging Istio itself than the actual app. Eventually, I stepped back, removed Istio, and everything worked fine.
7. Going too light on security and RBAC
The pitfall: Deploying workloads with insecure configurations, such as running containers as the root user, using the latest image tag, disabling security contexts, or assigning overly broad RBAC roles like cluster-admin. These practices persist because Kubernetes does not enforce strict security defaults out of the box, and the platform is designed to be flexible rather than opinionated. Without explicit security policies in place, clusters can remain exposed to risks like container escape, unauthorized privilege escalation, or accidental production changes due to unpinned images.
How to avoid it:
- Use RBAC to define roles and permissions within Kubernetes. While RBAC is the default and most widely supported authorization mechanism, Kubernetes also allows the use of alternative authorizers. For more advanced or external policy needs, consider solutions like OPA Gatekeeper (based on Rego), Kyverno, or custom webhooks using policy languages such as CEL or Cedar.
- Pin images to specific versions (no more
:latest!). This helps you know what's actually deployed. - Look into Pod Security Admission (or other solutions like Kyverno) to enforce non-root containers, read-only filesystems, etc.
My reality check: I never had a huge security breach, but I've heard plenty of cautionary tales. If you don't tighten things up, it's only a matter of time before something goes wrong.
Final thoughts
Kubernetes is amazing, but it's not psychic, it won't magically do the right thing if you don't tell it what you need. By keeping these pitfalls in mind, you'll avoid a lot of headaches and wasted time. Mistakes happen (trust me, I've made my share), but each one is a chance to learn more about how Kubernetes truly works under the hood. If you're curious to dive deeper, the official docs and the community Slack are excellent next steps. And of course, feel free to share your own horror stories or success tips, because at the end of the day, we're all in this cloud native adventure together.
Happy Shipping!
20 Oct 2025 3:30pm GMT
18 Oct 2025
Kubernetes Blog
Spotlight on Policy Working Group
(Note: The Policy Working Group has completed its mission and is no longer active. This article reflects its work, accomplishments, and insights into how a working group operates.)
In the complex world of Kubernetes, policies play a crucial role in managing and securing clusters. But have you ever wondered how these policies are developed, implemented, and standardized across the Kubernetes ecosystem? To answer that, let's take a look back at the work of the Policy Working Group.
The Policy Working Group was dedicated to a critical mission: providing an overall architecture that encompasses both current policy-related implementations and future policy proposals in Kubernetes. Their goal was both ambitious and essential: to develop a universal policy architecture that benefits developers and end-users alike.
Through collaborative methods, this working group strove to bring clarity and consistency to the often complex world of Kubernetes policies. By focusing on both existing implementations and future proposals, they ensured that the policy landscape in Kubernetes remains coherent and accessible as the technology evolves.
This blog post dives deeper into the work of the Policy Working Group, guided by insights from its former co-chairs:
Interviewed by Arujjwal Negi.
These co-chairs explained what the Policy Working Group was all about.
Introduction
Hello, thank you for the time! Let's start with some introductions, could you tell us a bit about yourself, your role, and how you got involved in Kubernetes?
Jim Bugwadia: My name is Jim Bugwadia, and I am a co-founder and the CEO at Nirmata which provides solutions that automate security and compliance for cloud-native workloads. At Nirmata, we have been working with Kubernetes since it started in 2014. We initially built a Kubernetes policy engine in our commercial platform and later donated it to CNCF as the Kyverno project. I joined the CNCF Kubernetes Policy Working Group to help build and standardize various aspects of policy management for Kubernetes and later became a co-chair.
Andy Suderman: My name is Andy Suderman and I am the CTO of Fairwinds, a managed Kubernetes-as-a-Service provider. I began working with Kubernetes in 2016 building a web conferencing platform. I am an author and/or maintainer of several Kubernetes-related open-source projects such as Goldilocks, Pluto, and Polaris. Polaris is a JSON-schema-based policy engine, which started Fairwinds' journey into the policy space and my involvement in the Policy Working Group.
Poonam Lamba: My name is Poonam Lamba, and I currently work as a Product Manager for Google Kubernetes Engine (GKE) at Google. My journey with Kubernetes began back in 2017 when I was building an SRE platform for a large enterprise, using a private cloud built on Kubernetes. Intrigued by its potential to revolutionize the way we deployed and managed applications at the time, I dove headfirst into learning everything I could about it. Since then, I've had the opportunity to build the policy and compliance products for GKE. I lead and contribute to GKE CIS benchmarks. I am involved with the Gatekeeper project as well as I have contributed to Policy-WG for over 2 years and served as a co-chair for the group.
Responses to the following questions represent an amalgamation of insights from the former co-chairs.
About Working Groups
One thing even I am not aware of is the difference between a working group and a SIG. Can you help us understand what a working group is and how it is different from a SIG?
Unlike SIGs, working groups are temporary and focused on tackling specific, cross-cutting issues or projects that may involve multiple SIGs. Their lifespan is defined, and they disband once they've achieved their objective. Generally, working groups don't own code or have long-term responsibility for managing a particular area of the Kubernetes project.
(To know more about SIGs, visit the list of Special Interest Groups)
You mentioned that Working Groups involve multiple SIGS. What SIGS was the Policy WG closely involved with, and how did you coordinate with them?
The group collaborated closely with Kubernetes SIG Auth throughout our existence, and more recently, the group also worked with SIG Security since its formation. Our collaboration occurred in a few ways. We provided periodic updates during the SIG meetings to keep them informed of our progress and activities. Additionally, we utilize other community forums to maintain open lines of communication and ensured our work aligned with the broader Kubernetes ecosystem. This collaborative approach helped the group stay coordinated with related efforts across the Kubernetes community.
Policy WG
Why was the Policy Working Group created?
To enable a broad set of use cases, we recognize that Kubernetes is powered by a highly declarative, fine-grained, and extensible configuration management system. We've observed that a Kubernetes configuration manifest may have different portions that are important to various stakeholders. For example, some parts may be crucial for developers, while others might be of particular interest to security teams or address operational concerns. Given this complexity, we believe that policies governing the usage of these intricate configurations are essential for success with Kubernetes.
Our Policy Working Group was created specifically to research the standardization of policy definitions and related artifacts. We saw a need to bring consistency and clarity to how policies are defined and implemented across the Kubernetes ecosystem, given the diverse requirements and stakeholders involved in Kubernetes deployments.
Can you give me an idea of the work you did in the group?
We worked on several Kubernetes policy-related projects. Our initiatives included:
- We worked on a Kubernetes Enhancement Proposal (KEP) for the Kubernetes Policy Reports API. This aims to standardize how policy reports are generated and consumed within the Kubernetes ecosystem.
- We conducted a CNCF survey to better understand policy usage in the Kubernetes space. This helped gauge the practices and needs across the community at the time.
- We wrote a paper that will guide users in achieving PCI-DSS compliance for containers. This is intended to help organizations meet important security standards in their Kubernetes environments.
- We also worked on a paper highlighting how shifting security down can benefit organizations. This focuses on the advantages of implementing security measures earlier in the development and deployment process.
Can you tell us what were the main objectives of the Policy Working Group and some of your key accomplishments?
The charter of the Policy WG was to help standardize policy management for Kubernetes and educate the community on best practices.
To accomplish this we updated the Kubernetes documentation (Policies | Kubernetes), produced several whitepapers (Kubernetes Policy Management, Kubernetes GRC), and created the Policy Reports API (API reference) which standardizes reporting across various tools. Several popular tools such as Falco, Trivy, Kyverno, kube-bench, and others support the Policy Report API. A major milestone for the Policy WG was promoting the Policy Reports API to a SIG-level API or finding it a stable home.
Beyond that, as ValidatingAdmissionPolicy and MutatingAdmissionPolicy approached GA in Kubernetes, a key goal of the WG was to guide and educate the community on the tradeoffs and appropriate usage patterns for these built-in API objects and other CNCF policy management solutions like OPA/Gatekeeper and Kyverno.
Challenges
What were some of the major challenges that the Policy Working Group worked on?
During our work in the Policy Working Group, we encountered several challenges:
-
One of the main issues we faced was finding time to consistently contribute. Given that many of us have other professional commitments, it can be difficult to dedicate regular time to the working group's initiatives.
-
Another challenge we experienced was related to our consensus-driven model. While this approach ensures that all voices are heard, it can sometimes lead to slower decision-making processes. We valued thorough discussion and agreement, but this can occasionally delay progress on our projects.
-
We've also encountered occasional differences of opinion among group members. These situations require careful navigation to ensure that we maintain a collaborative and productive environment while addressing diverse viewpoints.
-
Lastly, we've noticed that newcomers to the group may find it difficult to contribute effectively without consistent attendance at our meetings. The complex nature of our work often requires ongoing context, which can be challenging for those who aren't able to participate regularly.
Can you tell me more about those challenges? How did you discover each one? What has the impact been? What were some strategies you used to address them?
There are no easy answers, but having more contributors and maintainers greatly helps! Overall the CNCF community is great to work with and is very welcoming to beginners. So, if folks out there are hesitating to get involved, I highly encourage them to attend a WG or SIG meeting and just listen in.
It often takes a few meetings to fully understand the discussions, so don't feel discouraged if you don't grasp everything right away. We made a point to emphasize this and encouraged new members to review documentation as a starting point for getting involved.
Additionally, differences of opinion were valued and encouraged within the Policy-WG. We adhered to the CNCF core values and resolve disagreements by maintaining respect for one another. We also strove to timebox our decisions and assign clear responsibilities to keep things moving forward.
This is where our discussion about the Policy Working Group ends. The working group, and especially the people who took part in this article, hope this gave you some insights into the group's aims and workings. You can get more info about Working Groups here.
18 Oct 2025 12:00am GMT
06 Oct 2025
Kubernetes Blog
Introducing Headlamp Plugin for Karpenter - Scaling and Visibility
Headlamp is an open‑source, extensible Kubernetes SIG UI project designed to let you explore, manage, and debug cluster resources.
Karpenter is a Kubernetes Autoscaling SIG node provisioning project that helps clusters scale quickly and efficiently. It launches new nodes in seconds, selects appropriate instance types for workloads, and manages the full node lifecycle, including scale-down.
The new Headlamp Karpenter Plugin adds real-time visibility into Karpenter's activity directly from the Headlamp UI. It shows how Karpenter resources relate to Kubernetes objects, displays live metrics, and surfaces scaling events as they happen. You can inspect pending pods during provisioning, review scaling decisions, and edit Karpenter-managed resources with built-in validation. The Karpenter plugin was made as part of a LFX mentor project.
The Karpenter plugin for Headlamp aims to make it easier for Kubernetes users and operators to understand, debug, and fine-tune autoscaling behavior in their clusters. Now we will give a brief tour of the Headlamp plugin.
Map view of Karpenter Resources and how they relate to Kubernetes resources
Easily see how Karpenter Resources like NodeClasses, NodePool and NodeClaims connect with core Kubernetes resources like Pods, Nodes etc.

Visualization of Karpenter Metrics
Get instant insights of Resource Usage v/s Limits, Allowed disruptions, Pending Pods, Provisioning Latency and many more .


Scaling decisions
Shows which instances are being provisioned for your workloads and understand the reason behind why Karpenter made those choices. Helpful while debugging.


Config editor with validation support
Make live edits to Karpenter configurations. The editor includes diff previews and resource validation for safer adjustments.

Real time view of Karpenter resources
View and track Karpenter specific resources in real time such as "NodeClaims" as your cluster scales up and down.



Dashboard for Pending Pods
View all pending pods with unmet scheduling requirements/Failed Scheduling highlighting why they couldn't be scheduled.

Karpenter Providers
This plugin should work with most Karpenter providers, but has only so far been tested on the ones listed in the table. Additionally, each provider gives some extra information, and the ones in the table below are displayed by the plugin.
| Provider Name | Tested | Extra provider specific info supported |
|---|---|---|
| AWS | ✅ | ✅ |
| Azure | ✅ | ✅ |
| AlibabaCloud | ❌ | ❌ |
| Bizfly Cloud | ❌ | ❌ |
| Cluster API | ❌ | ❌ |
| GCP | ❌ | ❌ |
| Proxmox | ❌ | ❌ |
| Oracle Cloud Infrastructure (OCI) | ❌ | ❌ |
Please submit an issue if you test one of the untested providers or if you want support for this provider (PRs also gladly accepted).
How to use
Please see the plugins/karpenter/README.md for instructions on how to use.
Feedback and Questions
Please submit an issue if you use Karpenter and have any other ideas or feedback. Or come to the Kubernetes slack headlamp channel for a chat.
06 Oct 2025 12:00am GMT
25 Sep 2025
Kubernetes Blog
Announcing Changed Block Tracking API support (alpha)
We're excited to announce the alpha support for a changed block tracking mechanism. This enhances the Kubernetes storage ecosystem by providing an efficient way for CSI storage drivers to identify changed blocks in PersistentVolume snapshots. With a driver that can use the feature, you could benefit from faster and more resource-efficient backup operations.
If you're eager to try this feature, you can skip to the Getting Started section.
What is changed block tracking?
Changed block tracking enables storage systems to identify and track modifications at the block level between snapshots, eliminating the need to scan entire volumes during backup operations. The improvement is a change to the Container Storage Interface (CSI), and also to the storage support in Kubernetes itself. With the alpha feature enabled, your cluster can:
- Identify allocated blocks within a CSI volume snapshot
- Determine changed blocks between two snapshots of the same volume
- Streamline backup operations by focusing only on changed data blocks
For Kubernetes users managing large datasets, this API enables significantly more efficient backup processes. Backup applications can now focus only on the blocks that have changed, rather than processing entire volumes.
Note:
As of now, the Changed Block Tracking API is supported only for block volumes and not for file volumes. CSI drivers that manage file-based storage systems will not be able to implement this capability.Benefits of changed block tracking support in Kubernetes
As Kubernetes adoption grows for stateful workloads managing critical data, the need for efficient backup solutions becomes increasingly important. Traditional full backup approaches face challenges with:
- Long backup windows: Full volume backups can take hours for large datasets, making it difficult to complete within maintenance windows.
- High resource utilization: Backup operations consume substantial network bandwidth and I/O resources, especially for large data volumes and data-intensive applications.
- Increased storage costs: Repetitive full backups store redundant data, causing storage requirements to grow linearly even when only a small percentage of data actually changes between backups.
The Changed Block Tracking API addresses these challenges by providing native Kubernetes support for incremental backup capabilities through the CSI interface.
Key components
The implementation consists of three primary components:
- CSI SnapshotMetadata Service API: An API, offered by gRPC, that provides volume snapshot and changed block data.
- SnapshotMetadataService API: A Kubernetes CustomResourceDefinition (CRD) that advertises CSI driver metadata service availability and connection details to cluster clients.
- External Snapshot Metadata Sidecar: An intermediary component that connects CSI drivers to backup applications via a standardized gRPC interface.
Implementation requirements
Storage provider responsibilities
If you're an author of a storage integration with Kubernetes and want to support the changed block tracking feature, you must implement specific requirements:
-
Implement CSI RPCs: Storage providers need to implement the
SnapshotMetadataservice as defined in the CSI specifications protobuf. This service requires server-side streaming implementations for the following RPCs:GetMetadataAllocated: For identifying allocated blocks in a snapshotGetMetadataDelta: For determining changed blocks between two snapshots
-
Storage backend capabilities: Ensure the storage backend has the capability to track and report block-level changes.
-
Deploy external components: Integrate with the
external-snapshot-metadatasidecar to expose the snapshot metadata service. -
Register custom resource: Register the
SnapshotMetadataServiceresource using a CustomResourceDefinition and create aSnapshotMetadataServicecustom resource that advertises the availability of the metadata service and provides connection details. -
Support error handling: Implement proper error handling for these RPCs according to the CSI specification requirements.
Backup solution responsibilities
A backup solution looking to leverage this feature must:
-
Set up authentication: The backup application must provide a Kubernetes ServiceAccount token when using the Kubernetes SnapshotMetadataService API. Appropriate access grants, such as RBAC RoleBindings, must be established to authorize the backup application ServiceAccount to obtain such tokens.
-
Implement streaming client-side code: Develop clients that implement the streaming gRPC APIs defined in the schema.proto file. Specifically:
- Implement streaming client code for
GetMetadataAllocatedandGetMetadataDeltamethods - Handle server-side streaming responses efficiently as the metadata comes in chunks
- Process the
SnapshotMetadataResponsemessage format with proper error handling
The
external-snapshot-metadataGitHub repository provides a convenient iterator support package to simplify client implementation. - Implement streaming client code for
-
Handle large dataset streaming: Design clients to efficiently handle large streams of block metadata that could be returned for volumes with significant changes.
-
Optimize backup processes: Modify backup workflows to use the changed block metadata to identify and only transfer changed blocks to make backups more efficient, reducing both backup duration and resource consumption.
Getting started
To use changed block tracking in your cluster:
- Ensure your CSI driver supports volume snapshots and implements the snapshot metadata capabilities with the required
external-snapshot-metadatasidecar - Make sure the SnapshotMetadataService custom resource is registered using CRD
- Verify the presence of a SnapshotMetadataService custom resource for your CSI driver
- Create clients that can access the API using appropriate authentication (via Kubernetes ServiceAccount tokens)
The API provides two main functions:
GetMetadataAllocated: Lists blocks allocated in a single snapshotGetMetadataDelta: Lists blocks changed between two snapshots
What's next?
Depending on feedback and adoption, the Kubernetes developers hope to push the CSI Snapshot Metadata implementation to Beta in the future releases.
Where can I learn more?
For those interested in trying out this new feature:
- Official Kubernetes CSI Developer Documentation
- The enhancement proposal for the snapshot metadata feature.
- GitHub repository for implementation and release status of
external-snapshot-metadata - Complete gRPC protocol definitions for snapshot metadata API: schema.proto
- Example snapshot metadata client implementation: snapshot-metadata-lister
- End-to-end example with csi-hostpath-driver: example documentation
How do I get involved?
This project, like all of Kubernetes, is the result of hard work by many contributors from diverse backgrounds working together. On behalf of SIG Storage, I would like to offer a huge thank you to the contributors who helped review the design and implementation of the project, including but not limited to the following:
- Ben Swartzlander (bswartz)
- Carl Braganza (carlbraganza)
- Daniil Fedotov (hairyhum)
- Ivan Sim (ihcsim)
- Nikhil Ladha (Nikhil-Ladha)
- Prasad Ghangal (PrasadG193)
- Praveen M (iPraveenParihar)
- Rakshith R (Rakshith-R)
- Xing Yang (xing-yang)
Thank also to everyone who has contributed to the project, including others who helped review the KEP and the CSI spec PR
For those interested in getting involved with the design and development of CSI or any part of the Kubernetes Storage system, join the Kubernetes Storage Special Interest Group (SIG). We always welcome new contributors.
The SIG also holds regular Data Protection Working Group meetings. New attendees are welcome to join our discussions.
25 Sep 2025 1:00pm GMT
22 Sep 2025
Kubernetes Blog
Kubernetes v1.34: Pod Level Resources Graduated to Beta
On behalf of the Kubernetes community, I am thrilled to announce that the Pod Level Resources feature has graduated to Beta in the Kubernetes v1.34 release and is enabled by default! This significant milestone introduces a new layer of flexibility for defining and managing resource allocation for your Pods. This flexibility stems from the ability to specify CPU and memory resources for the Pod as a whole. Pod level resources can be combined with the container-level specifications to express the exact resource requirements and limits your application needs.
Pod-level specification for resources
Until recently, resource specifications that applied to Pods were primarily defined at the individual container level. While effective, this approach sometimes required duplicating or meticulously calculating resource needs across multiple containers within a single Pod. As a beta feature, Kubernetes allows you to specify the CPU, memory and hugepages resources at the Pod-level. This means you can now define resource requests and limits for an entire Pod, enabling easier resource sharing without requiring granular, per-container management of these resources where it's not needed.
Why does Pod-level specification matter?
This feature enhances resource management in Kubernetes by offering flexible resource management at both the Pod and container levels.
-
It provides a consolidated approach to resource declaration, reducing the need for meticulous, per-container management, especially for Pods with multiple containers.
-
Pod-level resources enable containers within a pod to share unused resoures amongst themselves, promoting efficient utilization within the pod. For example, it prevents sidecar containers from becoming performance bottlenecks. Previously, a sidecar (e.g., a logging agent or service mesh proxy) hitting its individual CPU limit could be throttled and slow down the entire Pod, even if the main application container had plenty of spare CPU. With pod-level resources, the sidecar and the main container can share Pod's resource budget, ensuring smooth operation during traffic spikes - either the whole Pod is throttled or all containers work.
-
When both pod-level and container-level resources are specified, pod-level requests and limits take precedence. This gives you - and cluster administrators - a powerful way to enforce overall resource boundaries for your Pods.
For scheduling, if a pod-level request is explicitly defined, the scheduler uses that specific value to find a suitable node, insteaf of the aggregated requests of the individual containers. At runtime, the pod-level limit acts as a hard ceiling for the combined resource usage of all containers. Crucially, this pod-level limit is the absolute enforcer; even if the sum of the individual container limits is higher, the total resource consumption can never exceed the pod-level limit.
-
Pod-level resources are prioritized in influencing the Quality of Service (QoS) class of the Pod.
-
For Pods running on Linux nodes, the Out-Of-Memory (OOM) score adjustment calculation considers both pod-level and container-level resources requests.
-
Pod-level resources are designed to be compatible with existing Kubernetes functionalities, ensuring a smooth integration into your workflows.
How to specify resources for an entire Pod
Using PodLevelResources feature gate requires Kubernetes v1.34 or newer for all cluster components, including the control plane and every node. This feature gate is in beta and enabled by default in v1.34.
Example manifest
You can specify CPU, memory and hugepages resources directly in the Pod spec manifest at the resources field for the entire Pod.
Here's an example demonstrating a Pod with both CPU and memory requests and limits defined at the Pod level:
apiVersion: v1
kind: Pod
metadata:
name: pod-resources-demo
namespace: pod-resources-example
spec:
# The 'resources' field at the Pod specification level defines the overall
# resource budget for all containers within this Pod combined.
resources: # Pod-level resources
# 'limits' specifies the maximum amount of resources the Pod is allowed to use.
# The sum of the limits of all containers in the Pod cannot exceed these values.
limits:
cpu: "1" # The entire Pod cannot use more than 1 CPU core.
memory: "200Mi" # The entire Pod cannot use more than 200 MiB of memory.
# 'requests' specifies the minimum amount of resources guaranteed to the Pod.
# This value is used by the Kubernetes scheduler to find a node with enough capacity.
requests:
cpu: "1" # The Pod is guaranteed 1 CPU core when scheduled.
memory: "100Mi" # The Pod is guaranteed 100 MiB of memory when scheduled.
containers:
- name: main-app-container
image: nginx
...
# This container has no resource requests or limits specified.
- name: auxiliary-container
image: fedora
command: ["sleep", "inf"]
...
# This container has no resource requests or limits specified.
In this example, the pod-resources-demo Pod as a whole requests 1 CPU and 100 MiB of memory, and is limited to 1 CPU and 200 MiB of memory. The containers within will operate under these overall Pod-level constraints, as explained in the next section.
Interaction with container-level resource requests or limits
When both pod-level and container-level resources are specified, pod-level requests and limits take precedence. This means the node allocates resources based on the pod-level specifications.
Consider a Pod with two containers where pod-level CPU and memory requests and limits are defined, and only one container has its own explicit resource definitions:
apiVersion: v1
kind: Pod
metadata:
name: pod-resources-demo
namespace: pod-resources-example
spec:
resources:
limits:
cpu: "1"
memory: "200Mi"
requests:
cpu: "1"
memory: "100Mi"
containers:
- name: main-app-container
image: nginx
resources:
requests:
cpu: "0.5"
memory: "50Mi"
- name: auxiliary-container
image: fedora
command: [ "sleep", "inf"]
# This container has no resource requests or limits specified.
-
Pod-Level Limits: The pod-level limits (cpu: "1", memory: "200Mi") establish an absolute boundary for the entire Pod. The sum of resources consumed by all its containers is enforced at this ceiling and cannot be surpassed.
-
Resource Sharing and Bursting: Containers can dynamically borrow any unused capacity, allowing them to burst as needed, so long as the Pod's aggregate usage stays within the overall limit.
-
Pod-Level Requests: The pod-level requests (cpu: "1", memory: "100Mi") serve as the foundational resource guarantee for the entire Pod. This value informs the scheduler's placement decision and represents the minimum resources the Pod can rely on during node-level contention.
-
Container-Level Requests: Container-level requests create a priority system within the Pod's guaranteed budget. Because main-app-container has an explicit request (cpu: "0.5", memory: "50Mi"), it is given precedence for its share of resources under resource pressure over the auxiliary-container, which has no such explicit claim.
Limitations
-
First of all, in-place resize of pod-level resources is not supported for Kubernetes v1.34 (or earlier). Attempting to modify the pod-level resource limits or requests on a running Pod results in an error: the resize is rejected. The v1.34 implementation of Pod level resources focuses on allowing initial declaration of an overall resource envelope, that applies to the entire Pod. That is distinct from in-place pod resize, which (despite what the name might suggest) allows you to make dynamic adjustments to container resource requests and limits, within a running Pod, and potentially without a container restart. In-place resizing is also not yet a stable feature; it graduated to Beta in the v1.33 release.
-
Only CPU, memory, and hugepages resources can be specified at pod-level.
-
Pod-level resources are not supported for Windows pods. If the Pod specification explicitly targets Windows (e.g., by setting spec.os.name: "windows"), the API server will reject the Pod during the validation step. If the Pod is not explicitly marked for Windows but is scheduled to a Windows node (e.g., via a nodeSelector), the Kubelet on that Windows node will reject the Pod during its admission process.
-
The Topology Manager, Memory Manager and CPU Manager do not align pods and containers based on pod-level resources as these resource managers don't currently support pod-level resources.
Getting started and providing feedback
Ready to explore Pod Level Resources feature? You'll need a Kubernetes cluster running version 1.34 or later. Remember to enable the PodLevelResources feature gate across your control plane and all nodes.
As this feature moves through Beta, your feedback is invaluable. Please report any issues or share your experiences via the standard Kubernetes communication channels:
22 Sep 2025 6:30pm GMT
19 Sep 2025
Kubernetes Blog
Kubernetes v1.34: Recovery From Volume Expansion Failure (GA)
Have you ever made a typo when expanding your persistent volumes in Kubernetes? Meant to specify 2TB but specified 20TiB? This seemingly innocuous problem was kinda hard to fix - and took the project almost 5 years to fix. Automated recovery from storage expansion has been around for a while in beta; however, with the v1.34 release, we have graduated this to general availability.
While it was always possible to recover from failing volume expansions manually, it usually required cluster-admin access and was tedious to do (See aformentioned link for more information).
What if you make a mistake and then realize immediately? With Kubernetes v1.34, you should be able to reduce the requested size of the PersistentVolumeClaim (PVC) and, as long as the expansion to previously requested size hadn't finished, you can amend the size requested. Kubernetes will automatically work to correct it. Any quota consumed by failed expansion will be returned to the user and the associated PersistentVolume should be resized to the latest size you specified.
I'll walk through an example of how all of this works.
Reducing PVC size to recover from failed expansion
Imagine that you are running out of disk space for one of your database servers, and you want to expand the PVC from previously specified 10TB to 100TB - but you make a typo and specify 1000TB.
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
name: myclaim
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 1000TB # newly specified size - but incorrect!
Now, you may be out of disk space on your disk array or simply ran out of allocated quota on your cloud-provider. But, assume that expansion to 1000TB is never going to succeed.
In Kubernetes v1.34, you can simply correct your mistake and request a new PVC size, that is smaller than the mistake, provided it is still larger than the original size of the actual PersistentVolume.
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
name: myclaim
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 100TB # Corrected size; has to be greater than 10TB.
# You cannot shrink the volume below its actual size.
This requires no admin intervention. Even better, any surplus Kubernetes quota that you temporarily consumed will be automatically returned.
This fault recovery mechanism does have a caveat: whatever new size you specify for the PVC, it must be still higher than the original size in .status.capacity. Since Kubernetes doesn't support shrinking your PV objects, you can never go below the size that was originally allocated for your PVC request.
Improved error handling and observability of volume expansion
Implementing what might look like a relatively minor change also required us to almost fully redo how volume expansion works under the hood in Kubernetes. There are new API fields available in PVC objects which you can monitor to observe progress of volume expansion.
Improved observability of in-progress expansion
You can query .status.allocatedResourceStatus['storage'] of a PVC to monitor progress of a volume expansion operation. For a typical block volume, this should transition between ControllerResizeInProgress, NodeResizePending and NodeResizeInProgress and become nil/empty when volume expansion has finished.
If for some reason, volume expansion to requested size is not feasible it should accordingly be in states like - ControllerResizeInfeasible or NodeResizeInfeasible.
You can also observe size towards which Kubernetes is working by watching pvc.status.allocatedResources.
Improved error handling and reporting
Kubernetes should now retry your failed volume expansions at slower rate, it should make fewer requests to both storage system and Kubernetes apiserver.
Errors observerd during volume expansion are now reported as condition on PVC objects and should persist unlike events. Kubernetes will now populate pvc.status.conditions with error keys ControllerResizeError or NodeResizeError when volume expansion fails.
Fixes long standing bugs in resizing workflows
This feature also has allowed us to fix long standing bugs in resizing workflow such as Kubernetes issue #115294. If you observe anything broken, please report your bugs to https://github.com/kubernetes/kubernetes/issues, along with details about how to reproduce the problem.
Working on this feature through its lifecycle was challenging and it wouldn't have been possible to reach GA without feedback from @msau42, @jsafrane and @xing-yang.
All of the contributors who worked on this also appreciate the input provided by @thockin and @liggitt at various Kubernetes contributor summits.
19 Sep 2025 6:30pm GMT
18 Sep 2025
Kubernetes Blog
Kubernetes v1.34: DRA Consumable Capacity
Dynamic Resource Allocation (DRA) is a Kubernetes API for managing scarce resources across Pods and containers. It enables flexible resource requests, going beyond simply allocating N number of devices to support more granular usage scenarios. With DRA, users can request specific types of devices based on their attributes, define custom configurations tailored to their workloads, and even share the same resource among multiple containers or Pods.
In this blog, we focus on the device sharing feature and dive into a new capability introduced in Kubernetes 1.34: DRA consumable capacity, which extends DRA to support finer-grained device sharing.
Background: device sharing via ResourceClaims
From the beginning, DRA introduced the ability for multiple Pods to share a device by referencing the same ResourceClaim. This design decouples resource allocation from specific hardware, allowing for more dynamic and reusable provisioning of devices.
In Kubernetes 1.33, the new support for partitionable devices allowed resource drivers to advertise slices of a device that are available, rather than exposing the entire device as an all-or-nothing resource. This enabled Kubernetes to model shareable hardware more accurately.
But there was still a missing piece: it didn't yet support scenarios where the device driver manages fine-grained, dynamic portions of a device resource - like network bandwidth - based on user demand, or to share those resources independently of ResourceClaims, which are restricted by their spec and namespace.
That's where consumable capacity for DRA comes in.
Benefits of DRA consumable capacity support
Here's a taste of what you get in a cluster with the DRAConsumableCapacity feature gate enabled.
Device sharing across multiple ResourceClaims or DeviceRequests
Resource drivers can now support sharing the same device - or even a slice of a device - across multiple ResourceClaims or across multiple DeviceRequests.
This means that Pods from different namespaces can simultaneously share the same device, if permitted and supported by the specific DRA driver.
Device resource allocation
Kubernetes extends the allocation algorithm in the scheduler to support allocating a portion of a device's resources, as defined in the capacity field. The scheduler ensures that the total allocated capacity across all consumers never exceeds the device's total capacity, even when shared across multiple ResourceClaims or DeviceRequests. This is very similar to the way the scheduler allows Pods and containers to share allocatable resources on Nodes; in this case, it allows them to share allocatable (consumable) resources on Devices.
This feature expands support for scenarios where the device driver is able to manage resources within a device and on a per-process basis - for example, allocating a specific amount of memory (e.g., 8 GiB) from a virtual GPU, or setting bandwidth limits on virtual network interfaces allocated to specific Pods. This aims to provide safe and efficient resource sharing.
DistinctAttribute constraint
This feature also introduces a new constraint: DistinctAttribute, which is the complement of the existing MatchAttribute constraint.
The primary goal of DistinctAttribute is to prevent the same underlying device from being allocated multiple times within a single ResourceClaim, which could happen since we are allocating shares (or subsets) of devices. This constraint ensures that each allocation refers to a distinct resource, even if they belong to the same device class.
It is useful for use cases such as allocating network devices connecting to different subnets to expand coverage or provide redundancy across failure domains.
How to use consumable capacity?
DRAConsumableCapacity is introduced as an alpha feature in Kubernetes 1.34. The feature gate DRAConsumableCapacity must be enabled in kubelet, kube-apiserver, kube-scheduler and kube-controller-manager.
--feature-gates=...,DRAConsumableCapacity=true
As a DRA driver developer
As a DRA driver developer writing in Golang, you can make a device within a ResourceSlice allocatable to multiple ResourceClaims (or devices.requests) by setting AllowMultipleAllocations to true.
Device {
...
AllowMultipleAllocations: ptr.To(true),
...
}
Additionally, you can define a policy to restrict how each device's Capacity should be consumed by each DeviceRequest by defining RequestPolicy field in the DeviceCapacity. The example below shows how to define a policy that requires a GPU with 40 GiB of memory to allocate at least 5 GiB per request, with each allocation in multiples of 5 GiB.
DeviceCapacity{
Value: resource.MustParse("40Gi"),
RequestPolicy: &CapacityRequestPolicy{
Default: ptr.To(resource.MustParse("5Gi")),
ValidRange: &CapacityRequestPolicyRange {
Min: ptr.To(resource.MustParse("5Gi")),
Step: ptr.To(resource.MustParse("5Gi")),
}
}
}
This will be published to the ResourceSlice, as partially shown below:
apiVersion: resource.k8s.io/v1
kind: ResourceSlice
...
spec:
devices:
- name: gpu0
allowMultipleAllocations: true
capacity:
memory:
value: 40Gi
requestPolicy:
default: 5Gi
validRange:
min: 5Gi
step: 5Gi
An allocated device with a specified portion of consumed capacity will have a ShareID field set in the allocation status.
claim.Status.Allocation.Devices.Results[i].ShareID
This ShareID allows the driver to distinguish between different allocations that refer to the same device or same statically-partitioned slice but come from different ResourceClaim requests.
It acts as a unique identifier for each shared slice, enabling the driver to manage and enforce resource limits independently across multiple consumers.
As a consumer
As a consumer (or user), the device resource can be requested with a ResourceClaim like this:
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
...
spec:
devices:
requests: # for devices
- name: req0
exactly:
deviceClassName: resource.example.com
capacity:
requests: # for resources which must be provided by those devices
memory: 10Gi
This configuration ensures that the requested device can provide at least 10GiB of memory.
Notably that any resource.example.com device that has at least 10GiB of memory can be allocated. If a device that does not support multiple allocations is chosen, the allocation would consume the entire device. To filter only devices that support multiple allocations, you can define a selector like this:
selectors:
- cel:
expression: |-
device.allowMultipleAllocations == true
Integration with DRA device status
In device sharing, general device information is provided through the resource slice. However, some details are set dynamically after allocation. These can be conveyed using the .status.devices field of a ResourceClaim. That field is only published in clusters where the DRAResourceClaimDeviceStatus feature gate is enabled.
If you do have device status support available, a driver can expose additional device-specific information beyond the ShareID. One particularly useful use case is for virtual networks, where a driver can include the assigned IP address(es) in the status. This is valuable for both network service operations and troubleshooting.
You can find more information by watching our recording at: KubeCon Japan 2025 - Reimagining Cloud Native Networks: The Critical Role of DRA.
What can you do next?
-
Check out the CNI DRA Driver project for an example of DRA integration in Kubernetes networking. Try integrating with network resources like
macvlan,ipvlan, or smart NICs. -
Start enabling the
DRAConsumableCapacityfeature gate and experimenting with virtualized or partitionable devices. Specify your workloads with consumable capacity (for example: fractional bandwidth or memory). -
Let us know your feedback:
- ✅ What worked well?
- ⚠️ What didn't?
If you encountered issues to fix or opportunities to enhance, please file a new issue and reference KEP-5075 there, or reach out via Slack (#wg-device-management).
Conclusion
Consumable capacity support enhances the device sharing capability of DRA by allowing effective device sharing across namespaces, across claims, and tailored to each Pod's actual needs. It also empowers drivers to enforce capacity limits, improves scheduling accuracy, and unlocks new use cases like bandwidth-aware networking and multi-tenant device sharing.
Try it out, experiment with consumable resources, and help shape the future of dynamic resource allocation in Kubernetes!
Further Reading
- DRA in the Kubernetes documentation
- KEP for DRA Partitionable Devices
- KEP for DRA Device Status
- KEP for DRA Consumable Capacity
- Kubernetes 1.34 Release Notes
18 Sep 2025 6:30pm GMT
17 Sep 2025
Kubernetes Blog
Kubernetes v1.34: Pods Report DRA Resource Health
The rise of AI/ML and other high-performance workloads has made specialized hardware like GPUs, TPUs, and FPGAs a critical component of many Kubernetes clusters. However, as discussed in a previous blog post about navigating failures in Pods with devices, when this hardware fails, it can be difficult to diagnose, leading to significant downtime. With the release of Kubernetes v1.34, we are excited to announce a new alpha feature that brings much-needed visibility into the health of these devices.
This work extends the functionality of KEP-4680, which first introduced a mechanism for reporting the health of devices managed by Device Plugins. Now, this capability is being extended to Dynamic Resource Allocation (DRA). Controlled by the ResourceHealthStatus feature gate, this enhancement allows DRA drivers to report device health directly into a Pod's .status field, providing crucial insights for operators and developers.
Why expose device health in Pod status?
For stateful applications or long-running jobs, a device failure can be disruptive and costly. By exposing device health in the .status field for a Pod, Kubernetes provides a standardized way for users and automation tools to quickly diagnose issues. If a Pod is failing, you can now check its status to see if an unhealthy device is the root cause, saving valuable time that might otherwise be spent debugging application code.
How it works
This feature introduces a new, optional communication channel between the Kubelet and DRA drivers, built on three core components.
A new gRPC health service
A new gRPC service, DRAResourceHealth, is defined in the dra-health/v1alpha1 API group. DRA drivers can implement this service to stream device health updates to the Kubelet. The service includes a NodeWatchResources server-streaming RPC that sends the health status (Healthy, Unhealthy, or Unknown) for the devices it manages.
Kubelet integration
The Kubelet's DRAPluginManager discovers which drivers implement the health service. For each compatible driver, it starts a long-lived NodeWatchResources stream to receive health updates. The DRA Manager then consumes these updates and stores them in a persistent healthInfoCache that can survive Kubelet restarts.
Populating the Pod status
When a device's health changes, the DRA manager identifies all Pods affected by the change and triggers a Pod status update. A new field, allocatedResourcesStatus, is now part of the v1.ContainerStatus API object. The Kubelet populates this field with the current health of each device allocated to the container.
A practical example
If a Pod is in a CrashLoopBackOff state, you can use kubectl describe pod <pod-name> to inspect its status. If an allocated device has failed, the output will now include the allocatedResourcesStatus field, clearly indicating the problem:
status:
containerStatuses:
- name: my-gpu-intensive-container
# ... other container statuses
allocatedResourcesStatus:
- name: "claim:my-gpu-claim"
resources:
- resourceID: "example.com/gpu-a1b2-c3d4"
health: "Unhealthy"
This explicit status makes it clear that the issue is with the underlying hardware, not the application.
Now you can improve the failure detection logic to react on the unhealthy devices associated with the Pod by de-scheduling a Pod.
How to use this feature
As this is an alpha feature in Kubernetes v1.34, you must take the following steps to use it:
- Enable the
ResourceHealthStatusfeature gate on your kube-apiserver and kubelets. - Ensure you are using a DRA driver that implements the
v1alpha1 DRAResourceHealthgRPC service.
DRA drivers
If you are developing a DRA driver, make sure to think about device failure detection strategy and ensure that your driver is integrated with this feature. This way, your driver will improve the user experience and simplify debuggability of hardware issues.
What's next?
This is the first step in a broader effort to improve how Kubernetes handles device failures. As we gather feedback on this alpha feature, the community is planning several key enhancements before graduating to Beta:
- Detailed health messages: To improve the troubleshooting experience, we plan to add a human-readable message field to the gRPC API. This will allow DRA drivers to provide specific context for a health status, such as "GPU temperature exceeds threshold" or "NVLink connection lost".
- Configurable health timeouts: The timeout for marking a device's health as "Unknown" is currently hardcoded. We plan to make this configurable, likely on a per-driver basis, to better accommodate the different health-reporting characteristics of various hardware.
- Improved post-mortem troubleshooting: We will address a known limitation where health updates may not be applied to pods that have already terminated. This fix will ensure that the health status of a device at the time of failure is preserved, which is crucial for troubleshooting batch jobs and other "run-to-completion" workloads.
This feature was developed as part of KEP-4680, and community feedback is crucial as we work toward graduating it to Beta. We have more improvements of device failure handling in k8s and encourage you to try it out and share your experiences with the SIG Node community!
17 Sep 2025 6:30pm GMT
16 Sep 2025
Kubernetes Blog
Kubernetes v1.34: Moving Volume Group Snapshots to v1beta2
Volume group snapshots were introduced as an Alpha feature with the Kubernetes 1.27 release and moved to Beta in the Kubernetes 1.32 release. The recent release of Kubernetes v1.34 moved that support to a second beta. The support for volume group snapshots relies on a set of extension APIs for group snapshots. These APIs allow users to take crash consistent snapshots for a set of volumes. Behind the scenes, Kubernetes uses a label selector to group multiple PersistentVolumeClaims for snapshotting. A key aim is to allow you restore that set of snapshots to new volumes and recover your workload based on a crash consistent recovery point.
This new feature is only supported for CSI volume drivers.
What's new in Beta 2?
While testing the beta version, we encountered an issue where the restoreSize field is not set for individual VolumeSnapshotContents and VolumeSnapshots if CSI driver does not implement the ListSnapshots RPC call. We evaluated various options here and decided to make this change releasing a new beta for the API.
Specifically, a VolumeSnapshotInfo struct is added in v1beta2, it contains information for an individual volume snapshot that is a member of a volume group snapshot. VolumeSnapshotInfoList, a list of VolumeSnapshotInfo, is added to VolumeGroupSnapshotContentStatus, replacing VolumeSnapshotHandlePairList. VolumeSnapshotInfoList is a list of snapshot information returned by the CSI driver to identify snapshots on the storage system. VolumeSnapshotInfoList is populated by the csi-snapshotter sidecar based on the CSI CreateVolumeGroupSnapshotResponse returned by the CSI driver's CreateVolumeGroupSnapshot call.
The existing v1beta1 API objects will be converted to the new v1beta2 API objects by a conversion webhook.
What's next?
Depending on feedback and adoption, the Kubernetes project plans to push the volume group snapshot implementation to general availability (GA) in a future release.
How can I learn more?
- The design spec for the volume group snapshot feature.
- The code repository for volume group snapshot APIs and controller.
- CSI documentation on the group snapshot feature.
How do I get involved?
This project, like all of Kubernetes, is the result of hard work by many contributors from diverse backgrounds working together. On behalf of SIG Storage, I would like to offer a huge thank you to the contributors who stepped up these last few quarters to help the project reach beta:
- Ben Swartzlander (bswartz)
- Hemant Kumar (gnufied)
- Jan Šafránek (jsafrane)
- Madhu Rajanna (Madhu-1)
- Michelle Au (msau42)
- Niels de Vos (nixpanic)
- Leonardo Cecchi (leonardoce)
- Saad Ali (saad-ali)
- Xing Yang (xing-yang)
- Yati Padia (yati1998)
For those interested in getting involved with the design and development of CSI or any part of the Kubernetes Storage system, join the Kubernetes Storage Special Interest Group (SIG). We always welcome new contributors.
We also hold regular Data Protection Working Group meetings. New attendees are welcome to join our discussions.
16 Sep 2025 6:30pm GMT
15 Sep 2025
Kubernetes Blog
Kubernetes v1.34: Decoupled Taint Manager Is Now Stable
This enhancement separates the responsibility of managing node lifecycle and pod eviction into two distinct components. Previously, the node lifecycle controller handled both marking nodes as unhealthy with NoExecute taints and evicting pods from them. Now, a dedicated taint eviction controller manages the eviction process, while the node lifecycle controller focuses solely on applying taints. This separation not only improves code organization but also makes it easier to improve taint eviction controller or build custom implementations of the taint based eviction.
What's new?
The feature gate SeparateTaintEvictionController has been promoted to GA in this release. Users can optionally disable taint-based eviction by setting --controllers=-taint-eviction-controller in kube-controller-manager.
How can I learn more?
For more details, refer to the KEP and to the beta announcement article: Kubernetes 1.29: Decoupling taint manager from node lifecycle controller.
How to get involved?
We offer a huge thank you to all the contributors who helped with design, implementation, and review of this feature and helped move it from beta to stable:
- Ed Bartosh (@bart0sh)
- Yuan Chen (@yuanchen8911)
- Aldo Culquicondor (@alculquicondor)
- Baofa Fan (@carlory)
- Sergey Kanzhelev (@SergeyKanzhelev)
- Tim Bannister (@lmktfy)
- Maciej Skoczeń (@macsko)
- Maciej Szulik (@soltysh)
- Wojciech Tyczynski (@wojtek-t)
15 Sep 2025 6:30pm GMT
12 Sep 2025
Kubernetes Blog
Kubernetes v1.34: Autoconfiguration for Node Cgroup Driver Goes GA
Historically, configuring the correct cgroup driver has been a pain point for users running new Kubernetes clusters. On Linux systems, there are two different cgroup drivers: cgroupfs and systemd. In the past, both the kubelet and CRI implementation (like CRI-O or containerd) needed to be configured to use the same cgroup driver, or else the kubelet would misbehave without any explicit error message. This was a source of headaches for many cluster admins. Now, we've (almost) arrived at the end of that headache.
Automated cgroup driver detection
In v1.28.0, the SIG Node community introduced the feature gate KubeletCgroupDriverFromCRI, which instructs the kubelet to ask the CRI implementation which cgroup driver to use. You can read more here. After many releases of waiting for each CRI implementation to have major versions released and packaged in major operating systems, this feature has gone GA as of Kubernetes 1.34.0.
In addition to setting the feature gate, a cluster admin needs to ensure their CRI implementation is new enough:
- containerd: Support was added in v2.0.0
- CRI-O: Support was added in v1.28.0
Announcement: Kubernetes is deprecating containerd v1.y support
While CRI-O releases versions that match Kubernetes versions, and thus CRI-O versions without this behavior are no longer supported, containerd maintains its own release cycle. containerd support for this feature is only in v2.0 and later, but Kubernetes 1.34 still supports containerd 1.7 and other LTS releases of containerd.
The Kubernetes SIG Node community has formally agreed upon a final support timeline for containerd v1.y. The last Kubernetes release to offer this support will be the last released version of v1.35, and support will be dropped in v1.36.0. To assist administrators in managing this future transition, a new detection mechanism is available. You are able to monitor the kubelet_cri_losing_support metric to determine if any nodes in your cluster are using a containerd version that will soon be outdated. The presence of this metric with a version label of 1.36.0 will indicate that the node's containerd runtime is not new enough for the upcoming requirements. Consequently, an administrator will need to upgrade containerd to v2.0 or a later version before, or at the same time as, upgrading the kubelet to v1.36.0.
12 Sep 2025 6:30pm GMT
11 Sep 2025
Kubernetes Blog
Kubernetes v1.34: Mutable CSI Node Allocatable Graduates to Beta
The functionality for CSI drivers to update information about attachable volume count on the nodes, first introduced as Alpha in Kubernetes v1.33, has graduated to Beta in the Kubernetes v1.34 release! This marks a significant milestone in enhancing the accuracy of stateful pod scheduling by reducing failures due to outdated attachable volume capacity information.
Background
Traditionally, Kubernetes CSI drivers report a static maximum volume attachment limit when initializing. However, actual attachment capacities can change during a node's lifecycle for various reasons, such as:
- Manual or external operations attaching/detaching volumes outside of Kubernetes control.
- Dynamically attached network interfaces or specialized hardware (GPUs, NICs, etc.) consuming available slots.
- Multi-driver scenarios, where one CSI driver's operations affect available capacity reported by another.
Static reporting can cause Kubernetes to schedule pods onto nodes that appear to have capacity but don't, leading to pods stuck in a ContainerCreating state.
Dynamically adapting CSI volume limits
With this new feature, Kubernetes enables CSI drivers to dynamically adjust and report node attachment capacities at runtime. This ensures that the scheduler, as well as other components relying on this information, have the most accurate, up-to-date view of node capacity.
How it works
Kubernetes supports two mechanisms for updating the reported node volume limits:
- Periodic Updates: CSI drivers specify an interval to periodically refresh the node's allocatable capacity.
- Reactive Updates: An immediate update triggered when a volume attachment fails due to exhausted resources (
ResourceExhaustederror).
Enabling the feature
To use this beta feature, the MutableCSINodeAllocatableCount feature gate must be enabled in these components:
kube-apiserverkubelet
Example CSI driver configuration
Below is an example of configuring a CSI driver to enable periodic updates every 60 seconds:
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
name: example.csi.k8s.io
spec:
nodeAllocatableUpdatePeriodSeconds: 60
This configuration directs kubelet to periodically call the CSI driver's NodeGetInfo method every 60 seconds, updating the node's allocatable volume count. Kubernetes enforces a minimum update interval of 10 seconds to balance accuracy and resource usage.
Immediate updates on attachment failures
When a volume attachment operation fails due to a ResourceExhausted error (gRPC code 8), Kubernetes immediately updates the allocatable count instead of waiting for the next periodic update. The Kubelet then marks the affected pods as Failed, enabling their controllers to recreate them. This prevents pods from getting permanently stuck in the ContainerCreating state.
Getting started
To enable this feature in your Kubernetes v1.34 cluster:
- Enable the feature gate
MutableCSINodeAllocatableCounton thekube-apiserverandkubeletcomponents. - Update your CSI driver configuration by setting
nodeAllocatableUpdatePeriodSeconds. - Monitor and observe improvements in scheduling accuracy and pod placement reliability.
Next steps
This feature is currently in beta and the Kubernetes community welcomes your feedback. Test it, share your experiences, and help guide its evolution to GA stability.
Join discussions in the Kubernetes Storage Special Interest Group (SIG-Storage) to shape the future of Kubernetes storage capabilities.
11 Sep 2025 6:30pm GMT
10 Sep 2025
Kubernetes Blog
Kubernetes v1.34: Use An Init Container To Define App Environment Variables
Kubernetes typically uses ConfigMaps and Secrets to set environment variables, which introduces additional API calls and complexity, For example, you need to separately manage the Pods of your workloads and their configurations, while ensuring orderly updates for both the configurations and the workload Pods.
Alternatively, you might be using a vendor-supplied container that requires environment variables (such as a license key or a one-time token), but you don't want to hard-code them or mount volumes just to get the job done.
If that's the situation you are in, you now have a new (alpha) way to achieve that. Provided you have the EnvFiles feature gate enabled across your cluster, you can tell the kubelet to load a container's environment variables from a volume (the volume must be part of the Pod that the container belongs to). this feature gate allows you to load environment variables directly from a file in an emptyDir volume without actually mounting that file into the container. It's a simple yet elegant solution to some surprisingly common problems.
What's this all about?
At its core, this feature allows you to point your container to a file, one generated by an initContainer, and have Kubernetes parse that file to set your environment variables. The file lives in an emptyDir volume (a temporary storage space that lasts as long as the pod does), Your main container doesn't need to mount the volume. The kubelet will read the file and inject these variables when the container starts.
How It Works
Here's a simple example:
apiVersion: v1
kind: Pod
spec:
initContainers:
- name: generate-config
image: busybox
command: ['sh', '-c', 'echo "CONFIG_VAR=HELLO" > /config/config.env']
volumeMounts:
- name: config-volume
mountPath: /config
containers:
- name: app-container
image: gcr.io/distroless/static
env:
- name: CONFIG_VAR
valueFrom:
fileKeyRef:
path: config.env
volumeName: config-volume
key: CONFIG_VAR
volumes:
- name: config-volume
emptyDir: {}
Using this approach is a breeze. You define your environment variables in the pod spec using the fileKeyRef field, which tells Kubernetes where to find the file and which key to pull. The file itself resembles the standard for .env syntax (think KEY=VALUE), and (for this alpha stage at least) you must ensure that it is written into an emptyDir volume. Other volume types aren't supported for this feature. At least one init container must mount that emptyDir volume (to write the file), but the main container doesn't need to-it just gets the variables handed to it at startup.
A word on security
While this feature supports handling sensitive data such as keys or tokens, note that its implementation relies on emptyDir volumes mounted into pod. Operators with node filesystem access could therefore easily retrieve this sensitive data through pod directory paths.
If storing sensitive data like keys or tokens using this feature, ensure your cluster security policies effectively protect nodes against unauthorized access to prevent exposure of confidential information.
Summary
This feature will eliminate a number of complex workarounds used today, simplifying apps authoring, and opening doors for more use cases. Kubernetes stays flexible and open for feedback. Tell us how you use this feature or what is missing.
10 Sep 2025 6:30pm GMT
09 Sep 2025
Kubernetes Blog
Kubernetes v1.34: Snapshottable API server cache
For years, the Kubernetes community has been on a mission to improve the stability and performance predictability of the API server. A major focus of this effort has been taming list requests, which have historically been a primary source of high memory usage and heavy load on the etcd datastore. With each release, we've chipped away at the problem, and today, we're thrilled to announce the final major piece of this puzzle.
The snapshottable API server cache feature has graduated to Beta in Kubernetes v1.34, culminating a multi-release effort to allow virtually all read requests to be served directly from the API server's cache.
Evolving the cache for performance and stability
The path to the current state involved several key enhancements over recent releases that paved the way for today's announcement.
Consistent reads from cache (Beta in v1.31)
While the API server has long used a cache for performance, a key milestone was guaranteeing consistent reads of the latest data from it. This v1.31 enhancement allowed the watch cache to be used for strongly-consistent read requests for the first time, a huge win as it enabled filtered collections (e.g. "a list of pods bound to this node") to be safely served from the cache instead of etcd, dramatically reducing its load for common workloads.
Taming large responses with streaming (Beta in v1.33)
Another key improvement was tackling the problem of memory spikes when transmitting large responses. The streaming encoder, introduced in v1.33, allowed the API server to send list items one by one, rather than buffering the entire multi-gigabyte response in memory. This made the memory cost of sending a response predictable and minimal, regardless of its size.
The missing piece
Despite these huge improvements, a critical gap remained. Any request for a historical LIST-most commonly used for paginating through large result sets-still had to bypass the cache and query etcd directly. This meant that the cost of retrieving the data was still unpredictable and could put significant memory pressure on the API server.
Kubernetes 1.34: snapshots complete the picture
The snapshottable API server cache solves this final piece of the puzzle. This feature enhances the watch cache, enabling it to generate efficient, point-in-time snapshots of its state.
Here's how it works: for each update, the cache creates a lightweight snapshot. These snapshots are "lazy copies," meaning they don't duplicate objects but simply store pointers, making them incredibly memory-efficient.
When a list request for a historical resourceVersion arrives, the API server now finds the corresponding snapshot and serves the response directly from its memory. This closes the final major gap, allowing paginated requests to be served entirely from the cache.
A new era of API Server performance 🚀
With this final piece in place, the synergy of these three features ushers in a new era of API server predictability and performance:
- Get Data from Cache: Consistent reads and snapshottable cache work together to ensure nearly all read requests-whether for the latest data or a historical snapshot-are served from the API server's memory.
- Send data via stream: Streaming list responses ensure that sending this data to the client has a minimal and constant memory footprint.
The result is a system where the resource cost of read operations is almost fully predictable and much more resiliant to spikes in request load. This means dramatically reduced memory pressure, a lighter load on etcd, and a more stable, scalable, and reliable control plane for all Kubernetes clusters.
How to get started
With its graduation to Beta, the SnapshottableCache feature gate is enabled by default in Kubernetes v1.34. There are no actions required to start benefiting from these performance and stability improvements.
Acknowledgements
Special thanks for designing, implementing, and reviewing these critical features go to:
- Ahmad Zolfaghari (@ah8ad3)
- Ben Luddy (@benluddy) - Red Hat
- Chen Chen (@z1cheng) - Microsoft
- Davanum Srinivas (@dims) - Nvidia
- David Eads (@deads2k) - Red Hat
- Han Kang (@logicalhan) - CoreWeave
- haosdent (@haosdent) - Shopee
- Joe Betz (@jpbetz) - Google
- Jordan Liggitt (@liggitt) - Google
- Łukasz Szaszkiewicz (@p0lyn0mial) - Red Hat
- Maciej Borsz (@mborsz) - Google
- Madhav Jivrajani (@MadhavJivrajani) - UIUC
- Marek Siarkowicz (@serathius) - Google
- NKeert (@NKeert)
- Tim Bannister (@lmktfy)
- Wei Fu (@fuweid) - Microsoft
- Wojtek Tyczyński (@wojtek-t) - Google
...and many others in SIG API Machinery. This milestone is a testament to the community's dedication to building a more scalable and robust Kubernetes.
09 Sep 2025 6:30pm GMT
08 Sep 2025
Kubernetes Blog
Kubernetes v1.34: VolumeAttributesClass for Volume Modification GA
The VolumeAttributesClass API, which empowers users to dynamically modify volume attributes, has officially graduated to General Availability (GA) in Kubernetes v1.34. This marks a significant milestone, providing a robust and stable way to tune your persistent storage directly within Kubernetes.
What is VolumeAttributesClass?
At its core, VolumeAttributesClass is a cluster-scoped resource that defines a set of mutable parameters for a volume. Think of it as a "profile" for your storage, allowing cluster administrators to expose different quality-of-service (QoS) levels or performance tiers.
Users can then specify a volumeAttributesClassName in their PersistentVolumeClaim (PVC) to indicate which class of attributes they desire. The magic happens through the Container Storage Interface (CSI): when a PVC referencing a VolumeAttributesClass is updated, the associated CSI driver interacts with the underlying storage system to apply the specified changes to the volume.
This means you can now:
- Dynamically scale performance: Increase IOPS or throughput for a busy database, or reduce it for a less critical application.
- Optimize costs: Adjust attributes on the fly to match your current needs, avoiding over-provisioning.
- Simplify operations: Manage volume modifications directly within the Kubernetes API, rather than relying on external tools or manual processes.
What is new from Beta to GA
There are two major enhancements from beta.
Cancellation support when errors occur
To improve resilience and user experience, the GA release introduces explicit cancel support when a requested volume modification encounters an error. If the underlying storage system or CSI driver indicates that the requested changes cannot be applied (e.g., due to invalid arguments), users can cancel the operation and revert the volume to its previous stable configuration, preventing the volume from being left in an inconsistent state.
Quota support based on scope
While VolumeAttributesClass doesn't add a new quota type, the Kubernetes control plane can be configured to enforce quotas on PersistentVolumeClaims that reference a specific VolumeAttributesClass.
This is achieved by using the scopeSelector field in a ResourceQuota to target PVCs that have .spec.volumeAttributesClassName set to a particular VolumeAttributesClass name. Please see more details here.
Drivers support VolumeAttributesClass
- Amazon EBS CSI Driver: The AWS EBS CSI driver has robust support for VolumeAttributesClass and allows you to modify parameters like volume type (e.g., gp2 to gp3, io1 to io2), IOPS, and throughput of EBS volumes dynamically.
- Google Compute Engine (GCE) Persistent Disk CSI Driver (pd.csi.storage.gke.io): This driver also supports dynamic modification of persistent disk attributes, including IOPS and throughput, via VolumeAttributesClass.
Contact
For any inquiries or specific questions related to VolumeAttributesClass, please reach out to the SIG Storage community.
08 Sep 2025 6:30pm GMT
05 Sep 2025
Kubernetes Blog
Kubernetes v1.34: Pod Replacement Policy for Jobs Goes GA
In Kubernetes v1.34, the Pod replacement policy feature has reached general availability (GA). This blog post describes the Pod replacement policy feature and how to use it in your Jobs.
About Pod Replacement Policy
By default, the Job controller immediately recreates Pods as soon as they fail or begin terminating (when they have a deletion timestamp).
As a result, while some Pods are terminating, the total number of running Pods for a Job can temporarily exceed the specified parallelism. For Indexed Jobs, this can even mean multiple Pods running for the same index at the same time.
This behavior works fine for many workloads, but it can cause problems in certain cases.
For example, popular machine learning frameworks like TensorFlow and JAX expect exactly one Pod per worker index. If two Pods run at the same time, you might encounter errors such as:
/job:worker/task:4: Duplicate task registration with task_name=/job:worker/replica:0/task:4
Additionally, starting replacement Pods before the old ones fully terminate can lead to:
- Scheduling delays by kube-scheduler as the nodes remain occupied.
- Unnecessary cluster scale-ups to accommodate the replacement Pods.
- Temporary bypassing of quota checks by workload orchestrators like Kueue.
With Pod replacement policy, Kubernetes gives you control over when the control plane replaces terminating Pods, helping you avoid these issues.
How Pod Replacement Policy works
This enhancement means that Jobs in Kubernetes have an optional field .spec.podReplacementPolicy.
You can choose one of two policies:
TerminatingOrFailed(default): Replaces Pods as soon as they start terminating.Failed: Replaces Pods only after they fully terminate and transition to theFailedphase.
Setting the policy to Failed ensures that a new Pod is only created after the previous one has completely terminated.
For Jobs with a Pod Failure Policy, the default podReplacementPolicy is Failed, and no other value is allowed. See Pod Failure Policy to learn more about Pod Failure Policies for Jobs.
You can check how many Pods are currently terminating by inspecting the Job's .status.terminating field:
kubectl get job myjob -o=jsonpath='{.status.terminating}'
Example
Here's a Job example that executes a task two times (spec.completions: 2) in parallel (spec.parallelism: 2) and replaces Pods only after they fully terminate (spec.podReplacementPolicy: Failed):
apiVersion: batch/v1
kind: Job
metadata:
name: example-job
spec:
completions: 2
parallelism: 2
podReplacementPolicy: Failed
template:
spec:
restartPolicy: Never
containers:
- name: worker
image: your-image
If a Pod receives a SIGTERM signal (deletion, eviction, preemption...), it begins terminating. When the container handles termination gracefully, cleanup may take some time.
When the Job starts, we will see two Pods running:
kubectl get pods
NAME READY STATUS RESTARTS AGE
example-job-qr8kf 1/1 Running 0 2s
example-job-stvb4 1/1 Running 0 2s
Let's delete one of the Pods (example-job-qr8kf).
With the TerminatingOrFailed policy, as soon as one Pod (example-job-qr8kf) starts terminating, the Job controller immediately creates a new Pod (example-job-b59zk) to replace it.
kubectl get pods
NAME READY STATUS RESTARTS AGE
example-job-b59zk 1/1 Running 0 1s
example-job-qr8kf 1/1 Terminating 0 17s
example-job-stvb4 1/1 Running 0 17s
With the Failed policy, the new Pod (example-job-b59zk) is not created while the old Pod (example-job-qr8kf) is terminating.
kubectl get pods
NAME READY STATUS RESTARTS AGE
example-job-qr8kf 1/1 Terminating 0 17s
example-job-stvb4 1/1 Running 0 17s
When the terminating Pod has fully transitioned to the Failed phase, a new Pod is created:
kubectl get pods
NAME READY STATUS RESTARTS AGE
example-job-b59zk 1/1 Running 0 1s
example-job-stvb4 1/1 Running 0 25s
How can you learn more?
- Read the user-facing documentation for Pod Replacement Policy, Backoff Limit per Index, and Pod Failure Policy.
- Read the KEPs for Pod Replacement Policy, Backoff Limit per Index, and Pod Failure Policy.
Acknowledgments
As with any Kubernetes feature, multiple people contributed to getting this done, from testing and filing bugs to reviewing code.
As this feature moves to stable after 2 years, we would like to thank the following people:
- Kevin Hannon - for writing the KEP and the initial implementation.
- Michał Woźniak - for guidance, mentorship, and reviews.
- Aldo Culquicondor - for guidance, mentorship, and reviews.
- Maciej Szulik - for guidance, mentorship, and reviews.
- Dejan Zele Pejchev - for taking over the feature and promoting it from Alpha through Beta to GA.
Get involved
This work was sponsored by the Kubernetes batch working group in close collaboration with the SIG Apps community.
If you are interested in working on new features in the space we recommend subscribing to our Slack channel and attending the regular community meetings.
05 Sep 2025 6:30pm GMT