17 Sep 2025
Kubernetes Blog
Kubernetes v1.34: Pods Report DRA Resource Health
The rise of AI/ML and other high-performance workloads has made specialized hardware like GPUs, TPUs, and FPGAs a critical component of many Kubernetes clusters. However, as discussed in a previous blog post about navigating failures in Pods with devices, when this hardware fails, it can be difficult to diagnose, leading to significant downtime. With the release of Kubernetes v1.34, we are excited to announce a new alpha feature that brings much-needed visibility into the health of these devices.
This work extends the functionality of KEP-4680, which first introduced a mechanism for reporting the health of devices managed by Device Plugins. Now, this capability is being extended to Dynamic Resource Allocation (DRA). Controlled by the ResourceHealthStatus feature gate, this enhancement allows DRA drivers to report device health directly into a Pod's .status field, providing crucial insights for operators and developers.
Why expose device health in Pod status?
For stateful applications or long-running jobs, a device failure can be disruptive and costly. By exposing device health in the .status field of a Pod, Kubernetes provides a standardized way for users and automation tools to quickly diagnose issues. If a Pod is failing, you can now check its status to see whether an unhealthy device is the root cause, saving valuable time that might otherwise be spent debugging application code.
How it works
This feature introduces a new, optional communication channel between the Kubelet and DRA drivers, built on three core components.
A new gRPC health service
A new gRPC service, DRAResourceHealth, is defined in the dra-health/v1alpha1 API group. DRA drivers can implement this service to stream device health updates to the Kubelet. The service includes a NodeWatchResources server-streaming RPC that sends the health status (Healthy, Unhealthy, or Unknown) for the devices the driver manages.
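To give a sense of what this looks like on the driver side, here is a minimal Go sketch of such a server-streaming handler. The request, response, and device types used here (NodeWatchResourcesRequest, NodeWatchResourcesResponse, DeviceHealth) are illustrative stand-ins rather than the generated dra-health/v1alpha1 bindings; a real driver implements the server interface generated from the KEP-4680 protobuf definitions.

```go
// Illustrative sketch of a DRA driver's health service. All type names are
// hypothetical stand-ins for the generated dra-health/v1alpha1 bindings.
package driver

import (
	"time"

	"google.golang.org/grpc"
)

// DeviceHealth describes the health of one device managed by the driver.
type DeviceHealth struct {
	PoolName   string
	DeviceName string
	Health     string // "Healthy", "Unhealthy", or "Unknown"
}

// NodeWatchResourcesRequest and NodeWatchResourcesResponse mimic the shape
// of the streaming RPC's messages.
type NodeWatchResourcesRequest struct{}

type NodeWatchResourcesResponse struct {
	Devices []DeviceHealth
}

// healthServer streams device health updates to the kubelet.
type healthServer struct {
	probe func() []DeviceHealth // driver-specific hardware probing
}

// NodeWatchResources keeps sending the current health of the devices this
// driver manages until the kubelet closes the stream.
func (s *healthServer) NodeWatchResources(_ *NodeWatchResourcesRequest, stream grpc.ServerStream) error {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-stream.Context().Done():
			return stream.Context().Err()
		case <-ticker.C:
			resp := &NodeWatchResourcesResponse{Devices: s.probe()}
			if err := stream.SendMsg(resp); err != nil {
				return err
			}
		}
	}
}
```

A real driver would typically push an update as soon as it detects a change rather than relying only on a poll interval; the sketch is only meant to show the shape of the server-streaming RPC.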
Kubelet integration
The Kubelet's DRAPluginManager discovers which drivers implement the health service. For each compatible driver, it starts a long-lived NodeWatchResources stream to receive health updates. The DRA Manager then consumes these updates and stores them in a persistent healthInfoCache that can survive Kubelet restarts.
Populating the Pod status
When a device's health changes, the DRA manager identifies all Pods affected by the change and triggers a Pod status update. A new field, allocatedResourcesStatus, is now part of the v1.ContainerStatus API object. The Kubelet populates this field with the current health of each device allocated to the container.
A practical example
If a Pod is in a CrashLoopBackOff state, you can use kubectl describe pod <pod-name> to inspect its status. If an allocated device has failed, the output will now include the allocatedResourcesStatus field, clearly indicating the problem:
```yaml
status:
  containerStatuses:
  - name: my-gpu-intensive-container
    # ... other container statuses
    allocatedResourcesStatus:
    - name: "claim:my-gpu-claim"
      resources:
      - resourceID: "example.com/gpu-a1b2-c3d4"
        health: "Unhealthy"
```
This explicit status makes it clear that the issue is with the underlying hardware, not the application.
Now you can improve your failure detection logic to react to unhealthy devices associated with a Pod, for example by de-scheduling that Pod, as in the sketch below.
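As a starting point, a small remediation loop could use client-go to find Pods whose allocated devices are reported unhealthy and delete them so that their owning controller recreates them. This is a minimal sketch assuming the allocatedResourcesStatus field shown above; a production policy would add safeguards such as backoff, rate limiting, and alerting.

```go
// Minimal sketch of reacting to unhealthy devices reported in Pod status.
package remediation

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// hasUnhealthyDevice reports whether any device allocated to the Pod's
// containers is currently marked "Unhealthy" in its status.
func hasUnhealthyDevice(pod *corev1.Pod) bool {
	for _, cs := range pod.Status.ContainerStatuses {
		for _, rs := range cs.AllocatedResourcesStatus {
			for _, dev := range rs.Resources {
				if dev.Health == "Unhealthy" {
					return true
				}
			}
		}
	}
	return false
}

// deschedulePodsWithUnhealthyDevices deletes affected Pods in a namespace so
// that their workload controllers (Deployment, Job, ...) can recreate them.
func deschedulePodsWithUnhealthyDevices(ctx context.Context, client kubernetes.Interface, namespace string) error {
	pods, err := client.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for i := range pods.Items {
		pod := &pods.Items[i]
		if hasUnhealthyDevice(pod) {
			fmt.Printf("de-scheduling %s/%s: allocated device reported unhealthy\n", pod.Namespace, pod.Name)
			if err := client.CoreV1().Pods(namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{}); err != nil {
				return err
			}
		}
	}
	return nil
}
```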
How to use this feature
As this is an alpha feature in Kubernetes v1.34, you must take the following steps to use it:
- Enable the ResourceHealthStatus feature gate on your kube-apiserver and kubelets.
- Ensure you are using a DRA driver that implements the v1alpha1 DRAResourceHealth gRPC service.
DRA drivers
If you are developing a DRA driver, think about your device failure detection strategy and integrate your driver with this feature. Doing so improves the user experience and makes hardware issues easier to debug.
What's next?
This is the first step in a broader effort to improve how Kubernetes handles device failures. As we gather feedback on this alpha feature, the community is planning several key enhancements before graduating to Beta:
- Detailed health messages: To improve the troubleshooting experience, we plan to add a human-readable message field to the gRPC API. This will allow DRA drivers to provide specific context for a health status, such as "GPU temperature exceeds threshold" or "NVLink connection lost".
- Configurable health timeouts: The timeout for marking a device's health as "Unknown" is currently hardcoded. We plan to make this configurable, likely on a per-driver basis, to better accommodate the different health-reporting characteristics of various hardware.
- Improved post-mortem troubleshooting: We will address a known limitation where health updates may not be applied to pods that have already terminated. This fix will ensure that the health status of a device at the time of failure is preserved, which is crucial for troubleshooting batch jobs and other "run-to-completion" workloads.
This feature was developed as part of KEP-4680, and community feedback is crucial as we work toward graduating it to Beta. More improvements to device failure handling in Kubernetes are on the way; we encourage you to try this feature out and share your experiences with the SIG Node community!
16 Sep 2025
Kubernetes Blog
Kubernetes v1.34: Moving Volume Group Snapshots to v1beta2
Volume group snapshots were introduced as an Alpha feature in the Kubernetes 1.27 release and moved to Beta in the Kubernetes 1.32 release. The recent release of Kubernetes v1.34 moved that support to a second beta. The support for volume group snapshots relies on a set of extension APIs for group snapshots. These APIs allow users to take crash-consistent snapshots of a set of volumes. Behind the scenes, Kubernetes uses a label selector to group multiple PersistentVolumeClaims for snapshotting, as shown in the sketch below. A key aim is to allow you to restore that set of snapshots to new volumes and recover your workload from a crash-consistent recovery point.
This new feature is only supported for CSI volume drivers.
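As a rough illustration of how the grouping works, the following sketch creates a VolumeGroupSnapshot whose label selector picks the PersistentVolumeClaims to snapshot together. It uses the dynamic client with unstructured objects; treat the group/version (groupsnapshot.storage.k8s.io/v1beta2), the class name, and the exact spec layout as assumptions to check against the external-snapshotter v1beta2 API definitions.

```go
// Illustrative sketch of requesting a volume group snapshot via the dynamic client.
package groupsnapshotexample

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// createGroupSnapshot requests a crash-consistent group snapshot of every
// PVC in the namespace labeled app=my-db.
func createGroupSnapshot(ctx context.Context, client dynamic.Interface, namespace string) error {
	gvr := schema.GroupVersionResource{
		Group:    "groupsnapshot.storage.k8s.io",
		Version:  "v1beta2", // assumed API version for the second beta
		Resource: "volumegroupsnapshots",
	}
	vgs := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "groupsnapshot.storage.k8s.io/v1beta2",
		"kind":       "VolumeGroupSnapshot",
		"metadata": map[string]interface{}{
			"name": "my-db-group-snapshot",
		},
		"spec": map[string]interface{}{
			// Hypothetical class name; it must reference a
			// VolumeGroupSnapshotClass provided by your CSI driver.
			"volumeGroupSnapshotClassName": "csi-group-snap-class",
			"source": map[string]interface{}{
				// The label selector is what groups the PVCs.
				"selector": map[string]interface{}{
					"matchLabels": map[string]interface{}{"app": "my-db"},
				},
			},
		},
	}}
	_, err := client.Resource(gvr).Namespace(namespace).Create(ctx, vgs, metav1.CreateOptions{})
	return err
}
```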
What's new in Beta 2?
While testing the beta version, we encountered an issue where the restoreSize field is not set for individual VolumeSnapshotContents and VolumeSnapshots if the CSI driver does not implement the ListSnapshots RPC call. We evaluated various options and decided to address this by releasing a new beta version of the API.
Specifically, a VolumeSnapshotInfo struct has been added in v1beta2; it contains information about an individual volume snapshot that is a member of a volume group snapshot. VolumeSnapshotInfoList, a list of VolumeSnapshotInfo, has been added to VolumeGroupSnapshotContentStatus, replacing VolumeSnapshotHandlePairList. VolumeSnapshotInfoList holds the snapshot information returned by the CSI driver to identify snapshots on the storage system, and it is populated by the csi-snapshotter sidecar based on the CSI CreateVolumeGroupSnapshotResponse returned by the CSI driver's CreateVolumeGroupSnapshot call.
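To make the shape of this change more concrete, here is a rough Go sketch of the new status types. The field set shown (the volume/snapshot handle pair that the old list carried, plus the per-snapshot restoreSize that motivated the change) is inferred from the description above; the authoritative definitions live in the external-snapshotter v1beta2 API.

```go
// Sketch only: field names and types are assumptions based on the prose
// above, not the authoritative v1beta2 definitions.
package groupsnapshotexample

import "k8s.io/apimachinery/pkg/api/resource"

// VolumeSnapshotInfo carries information about one member snapshot of a
// volume group snapshot, as reported by the CSI driver.
type VolumeSnapshotInfo struct {
	// VolumeHandle identifies the source volume on the storage system.
	VolumeHandle string
	// SnapshotHandle identifies the snapshot on the storage system.
	SnapshotHandle string
	// RestoreSize is the size needed to restore this snapshot, now
	// populated even when the CSI driver does not implement the
	// ListSnapshots RPC.
	RestoreSize *resource.Quantity
}

// VolumeGroupSnapshotContentStatus (excerpt): volumeSnapshotInfoList
// replaces the v1beta1 volumeSnapshotHandlePairList.
type VolumeGroupSnapshotContentStatus struct {
	VolumeSnapshotInfoList []VolumeSnapshotInfo
}
```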
The existing v1beta1 API objects will be converted to the new v1beta2 API objects by a conversion webhook.
What's next?
Depending on feedback and adoption, the Kubernetes project plans to push the volume group snapshot implementation to general availability (GA) in a future release.
How can I learn more?
- The design spec for the volume group snapshot feature.
- The code repository for volume group snapshot APIs and controller.
- CSI documentation on the group snapshot feature.
How do I get involved?
This project, like all of Kubernetes, is the result of hard work by many contributors from diverse backgrounds working together. On behalf of SIG Storage, I would like to offer a huge thank you to the contributors who stepped up these last few quarters to help the project reach beta:
- Ben Swartzlander (bswartz)
- Hemant Kumar (gnufied)
- Jan Šafránek (jsafrane)
- Madhu Rajanna (Madhu-1)
- Michelle Au (msau42)
- Niels de Vos (nixpanic)
- Leonardo Cecchi (leonardoce)
- Saad Ali (saad-ali)
- Xing Yang (xing-yang)
- Yati Padia (yati1998)
For those interested in getting involved with the design and development of CSI or any part of the Kubernetes Storage system, join the Kubernetes Storage Special Interest Group (SIG). We always welcome new contributors.
We also hold regular Data Protection Working Group meetings. New attendees are welcome to join our discussions.
15 Sep 2025
Kubernetes Blog
Kubernetes v1.34: Decoupled Taint Manager Is Now Stable
This enhancement separates the responsibility of managing node lifecycle and pod eviction into two distinct components. Previously, the node lifecycle controller handled both marking nodes as unhealthy with NoExecute taints and evicting pods from them. Now, a dedicated taint eviction controller manages the eviction process, while the node lifecycle controller focuses solely on applying taints. This separation not only improves code organization but also makes it easier to improve the taint eviction controller or to build custom implementations of taint-based eviction.
What's new?
The feature gate SeparateTaintEvictionController has been promoted to GA in this release. Users can optionally disable taint-based eviction by setting --controllers=-taint-eviction-controller in kube-controller-manager.
How can I learn more?
For more details, refer to the KEP and to the beta announcement article: Kubernetes 1.29: Decoupling taint manager from node lifecycle controller.
How to get involved?
We offer a huge thank you to all the contributors who helped with design, implementation, and review of this feature and helped move it from beta to stable:
- Ed Bartosh (@bart0sh)
- Yuan Chen (@yuanchen8911)
- Aldo Culquicondor (@alculquicondor)
- Baofa Fan (@carlory)
- Sergey Kanzhelev (@SergeyKanzhelev)
- Tim Bannister (@lmktfy)
- Maciej Skoczeń (@macsko)
- Maciej Szulik (@soltysh)
- Wojciech Tyczynski (@wojtek-t)