26 Jun 2026

Kubernetes Blog

Open source maintainership in the age of AI

AI has changed how software gets written. More people than ever are using AI to contribute patches to the projects they use. To me, this is a good thing: more folks will contribute fixes upstream rather than fork the project or leave the problem unfixed. The main problem is that AI has made generating code fast, but maintaining code bases has seen very little improvement. In this post, we highlight the ways the Kubernetes community is adapting to the world of AI-assisted coding.

The first step of this journey was to develop an AI policy. This seems mundane and bureaucratic, but many PRs had derailed into discussions about AI usage. The AI policy anchors those conversations in the project's stance on AI and gives contributors a clear signal on how to use these tools responsibly.

Kubernetes AI policy

The Kubernetes project has established clear guidelines for AI-assisted contributions that balance innovation with accountability. These policies are designed to maintain code quality and ensure human oversight while acknowledging that AI tools can be valuable aids in the development process.

Transparency first

Contributors must disclose when AI tools have been used to assist with a pull request. A simple statement in the PR description such as "This PR was written in part with the assistance of generative AI" is sufficient. This transparency helps reviewers understand the context and apply appropriate scrutiny.

Human accountability

While AI tools can assist, the human contributor remains fully responsible for every change. The policy explicitly prohibits:

This isn't about diminishing AI's role as a tool; it's about maintaining clear accountability. If something breaks, there needs to be a human who understands why and can fix it.

CLA enforcement for co-authors

The CNCF provides a tool for verifying contributor license agreements (CLAs) on each pull request. AI agents cannot sign a CLA, so one enforcement step the project took was to enable the CLA check for co-authors. A failing check signals to reviewers that the PR is not ready to merge.

Human engagement required

Perhaps the most critical aspect of the policy: reviewers expect to engage with humans, not with AI. Contributors cannot rely on AI to respond to review comments. If you cannot personally explain changes that AI helped generate, your PR will be closed. This requirement ensures that knowledge transfer happens and that contributors genuinely understand the code they're submitting.

Verification obligations

Contributors must verify AI-generated changes through code review, testing, and personal understanding. It's not enough for the code to work; you need to know why it works and be able to maintain it.

These policies reflect a mature approach to AI: embrace it as a tool, but never let it replace human judgment, understanding, or responsibility.

Automated AI reviews

Many tools exist to aid in reviewing code. AI pull request tools introduce governance challenges, so one of the first tasks the community took on was to document the process for bringing in new AI tools. One of the major evaluation criteria is finding maintainers willing to test-drive a tool in kubernetes-sigs repositories. Kueue, JobSet, and Agent-Sandbox have been experimenting with these tools to provide more support for maintainers.

Copilot

One tool that many maintainers started with was GitHub Copilot. The CNCF provides access for maintainers, so it ended up being the first tool many tried. It provided good experience in tuning reviews, but there were some growing pains. The biggest blocker for community adoption was the reliance on contributors having a Copilot license. Only maintainers were able to request Copilot reviews, and automated reviews of pull requests were out of reach for the community. One of the goals of AI review tools is to provide automated reviews that maintainers don't need to request. This demonstrated the need for organization-level control rather than relying on contributors having access.

CodeRabbit

In mid-2026, the Kubernetes community rolled out CodeRabbit to a few projects. As with Copilot, some tuning has been required to get better reviews, but the overall feedback has been positive. The tool offers a lot of configuration, and one of its most interesting uses comes from agent-sandbox.

AI pull request tools can be a quality gate. Contributors can at least get a quick spot check review without waiting for a maintainer. Agent-sandbox has added a label on PRs to reflect that there is still a need to resolve some of the comments from AI tools.

Next steps

The reality is that leveraging AI in open source projects is an area of active exploration. The community could use your help in tuning review tools, evaluating new tools, and assessing emerging technologies in the AI space.

Some areas we are exploring more:

26 Jun 2026 6:00pm GMT

25 Jun 2026

Introducing the Cluster API plugin for Headlamp

Headlamp is an open-source, extensible Kubernetes SIG UI project designed to let you explore, manage, and debug cluster resources directly from a browser.

Cluster API (CAPI) is a Kubernetes sub-project that brings declarative, Kubernetes-style APIs to cluster lifecycle management. It lets platform teams provision, upgrade, and manage the lifecycle of Kubernetes clusters using standard Kubernetes objects stored and reconciled in a management cluster.

Managing Cluster API resources has historically required raw kubectl commands and deep familiarity with ownership hierarchies. The Headlamp Cluster API plugin brings visual clarity, faster debugging, and simplified operations for platform teams, directly inside Headlamp.

What this plugin provides

The Cluster API plugin adds a dedicated Cluster API section to Headlamp and brings full visibility into core CAPI resources through consistent list and detail views.

Feature: Description
Cluster overview: View clusters with live control plane and worker replica status.
Machine visibility: Inspect MachineDeployments, MachineSets, Machines, and MachinePools with status and conditions.
Cluster API dashboard: Get a centralized view of Cluster API resource health, active condition issues, provider information, and remediation guidance.
Control plane monitoring: Track KubeadmControlPlane replicas, versions, and associated Machines.
Scale from the UI: Scale MachineDeployments and MachineSets directly from Headlamp.
Owned resource hierarchy: Trace relationships between clusters, deployments, sets, and machines.
KubeadmConfig inspection: View bootstrap configs, files, kubelet args, and join/init settings.
Topology awareness: Automatically detect and label ClusterClass-managed resources.
Map view: Visualize Cluster, Control Plane, and Worker relationships.
Dynamic API versioning: Supports both v1beta1 and v1beta2 Cluster API versions.
Prometheus metrics: View live metrics from the Headlamp Prometheus plugin inline on Cluster API resource detail pages.

A tour of the plugin

The Headlamp Cluster API plugin brings core Cluster API resources into a consistent, visual interface inside Headlamp. Here are some of the key views included in the first release.

Cluster API dashboard

The dashboard provides a centralized view of Cluster API resources and their health across a management cluster.

Cluster API dashboard showing overall resource health

The overview summarizes the status of clusters, Machines, MachineDeployments, MachinePools, MachineSets, and control planes. It also highlights active condition issues, provider information, and configuration template counts to help operators quickly identify degraded or unhealthy resources.

Cluster details and remediation guidance

Selecting a cluster opens a detailed health view showing control plane and worker status, machine information, infrastructure details, and resource conditions. When issues are detected, the dashboard provides remediation guidance and diagnostic commands to assist with troubleshooting.

Bring full Cluster API visibility into Headlamp

The cluster list view shows all Cluster resources in the management cluster, including control plane and worker replica status. This gives you an at-a-glance understanding of overall cluster health.

Cluster list view showing control plane and worker replica status

The cluster detail view provides resource status, conditions, infrastructure references, control plane references, and related Machines on a single page.

Cluster detail view showing resource status and conditions

Cluster detail view showing related machines

Explore Cluster API resources in a visual interface

Dedicated views are available for MachineDeployments, MachineSets, Machines, and MachinePools. These pages surface replica counts, ownership relationships, provider IDs, versions, and conditions to support day-to-day operations and debugging.

MachineDeployment list view showing replica counts, ownership, and conditions

Scale workloads directly from Headlamp

MachineDeployments and MachineSets include a built-in Scale action, allowing you to adjust replica counts directly from Headlamp without using terminal commands.

For topology-managed clusters, the plugin also indicates when scaling should be performed at the Cluster level.

Scale dialog for a MachineDeployment

Topology-managed cluster showing scaling guidance at the Cluster level
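The post does not say how the plugin decides where scaling belongs, so the following is only a minimal sketch of one plausible check. It assumes that topology-managed objects carry the Cluster API label `topology.cluster.x-k8s.io/owned`; the function name `scale_target` and the sample objects are hypothetical.

```python
# Label that Cluster API sets on objects managed through a Cluster topology.
# Treating its presence as the signal is an assumption of this sketch.
TOPOLOGY_OWNED_LABEL = "topology.cluster.x-k8s.io/owned"


def scale_target(machine_deployment: dict) -> str:
    """Return the kind of object where a replica change should be made."""
    labels = machine_deployment.get("metadata", {}).get("labels", {})
    if TOPOLOGY_OWNED_LABEL in labels:
        # The Cluster's topology is the source of truth, so a direct edit
        # to the MachineDeployment would likely be reverted.
        return "Cluster"
    return "MachineDeployment"


standalone = {"metadata": {"name": "md-0", "labels": {}}}
managed = {"metadata": {"name": "md-1", "labels": {TOPOLOGY_OWNED_LABEL: ""}}}

print(scale_target(standalone))
print(scale_target(managed))
```

A UI built on a check like this can disable the Scale action for managed objects and point the user at the owning Cluster instead.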

Inspect bootstrap configuration without raw YAML

Bootstrap configurations can be viewed in a structured format, including inline files, kubelet arguments, extra volumes, and join or init settings. This removes the need to inspect raw YAML or secrets manually.

KubeadmConfig detail view showing bootstrap configuration in structured format

Visualize cluster relationships with map view

A visual map view displays the relationships between Cluster, control plane, and worker resources. It offers a faster way to understand ownership hierarchies and overall cluster structure.

Map view showing Cluster, Control Plane, and Worker resource relationships
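To illustrate the kind of ownership hierarchy the map view draws, here is a small sketch that rebuilds a tree from `metadata.ownerReferences`. It is not the plugin's implementation: the helper names and sample objects are hypothetical, and for brevity the sketch matches owners by kind and name only, ignoring namespaces, API versions, and UIDs.

```python
from collections import defaultdict


def build_owner_tree(objects):
    """Map each owner (kind, name) to the list of objects it owns."""
    children = defaultdict(list)
    for obj in objects:
        meta = obj["metadata"]
        for ref in meta.get("ownerReferences", []):
            children[(ref["kind"], ref["name"])].append((obj["kind"], meta["name"]))
    return children


def render(children, node, depth=0):
    """Return indented lines for the subtree rooted at node."""
    lines = ["  " * depth + f"{node[0]}/{node[1]}"]
    for child in sorted(children.get(node, [])):
        lines.extend(render(children, child, depth + 1))
    return lines


# Hypothetical objects, reduced to the fields the sketch needs.
objects = [
    {"kind": "Cluster", "metadata": {"name": "prod"}},
    {"kind": "MachineDeployment", "metadata": {
        "name": "prod-md-0",
        "ownerReferences": [{"kind": "Cluster", "name": "prod"}]}},
    {"kind": "MachineSet", "metadata": {
        "name": "prod-md-0-abc",
        "ownerReferences": [{"kind": "MachineDeployment", "name": "prod-md-0"}]}},
    {"kind": "Machine", "metadata": {
        "name": "prod-md-0-abc-1",
        "ownerReferences": [{"kind": "MachineSet", "name": "prod-md-0-abc"}]}},
]

tree_lines = render(build_owner_tree(objects), ("Cluster", "prod"))
print("\n".join(tree_lines))
```

Doing this by hand with kubectl means one lookup per level of the hierarchy, which is the work a visual map saves.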

Prometheus metrics integration

The Cluster API plugin integrates with the Headlamp Prometheus plugin to surface metrics directly inside Cluster API resource detail pages.

When the Prometheus plugin is installed and configured, metrics are embedded inline on the detail pages for Clusters, MachineDeployments, MachineSets, and Machines. You can view resource health and performance data alongside status conditions and ownership relationships, without switching to a separate dashboard.

This makes it easier to correlate infrastructure state with live metrics during debugging or day-to-day cluster operations, all from within Headlamp.

Prometheus metrics embedded inline on a Cluster detail page

How to use

See the plugins/cluster-api/README.md for installation and usage instructions.

Developed during LFX Mentorship

This plugin was developed as part of the CNCF LFX Mentorship program under the Headlamp project. The mentorship provided an opportunity to work closely with the Headlamp community while building features to improve the Cluster API management experience.

The focus was not only on implementing features but also on understanding real-world usability challenges around Cluster API operations. Discussions with mentors and community members helped shape the plugin's direction, improve the user experience, and prioritize features most useful to platform teams.

The mentorship also provided valuable experience contributing to large open-source projects: collaborating with maintainers, participating in design discussions, handling release feedback, and iterating on features based on community input.

Work on the plugin is ongoing, with additional improvements and features planned beyond the initial Alpha release.

Feedback and questions

This is an Alpha release, and community feedback directly shapes what comes next.

25 Jun 2026 10:00pm GMT

Inspect Volcano workloads faster with Headlamp

Volcano is a cloud native batch scheduler for Kubernetes, built for high-performance computing, AI/ML, and other batch workloads.

Headlamp is an extensible Kubernetes web UI. With its plugin system, Headlamp can surface APIs and workflows beyond the built-in Kubernetes resources. The Volcano plugin brings core Volcano resources into Headlamp so you can inspect workload state, queue behavior, and gang scheduling details in one place.

Kubernetes was originally designed around long-running services, where applications are expected to start and remain available over time. Batch, AI/ML, and HPC workloads often behave differently: jobs arrive dynamically, compete for limited resources, and may need multiple workers to start together before useful work can begin.

Volcano extends Kubernetes with concepts such as queues, priorities, quotas, and gang scheduling. Instead of treating every Pod independently, Volcano schedules workloads with awareness of the job as a whole and the resources it needs to make progress.

To make these workloads easier to operate and troubleshoot, the Volcano plugin brings that scheduling context directly into Headlamp.

Watch this short walkthrough to see the Volcano plugin in Headlamp:

Visual context helps teams understand Volcano jobs, queues, and PodGroups faster

Working with Volcano often means moving across several related resources while trying to understand a batch workload. You might start with a Job, then look at the related PodGroup, inspect the Pods behind it, check the Queue, and finally return to the Job again. All of that is possible with CLI tools like kubectl and the Volcano CLI, but it can become fragmented very quickly.

The Volcano plugin for Headlamp makes that workflow easier by bringing the key resources together in a single UI. Instead of reconstructing relationships manually, you can move directly between Jobs, Queues, PodGroups, Pods, and events from the same interface.

Volcano introduces its own resources on top of core Kubernetes objects:

Job
Describes a batch workload as a set of tasks and the Pods they create.
Queue
Divides cluster capacity between teams or workloads using quotas and priorities.
PodGroup
Ties a group of Pods together so the scheduler can treat them as a single unit for gang scheduling.

The plugin surfaces all three resource types directly in Headlamp, providing dedicated list and detail views for each of them under a Volcano section in the sidebar.

Jobs: workload status, actions, and logs

The Job view is the center of the plugin experience. In the list view, you can quickly understand the basics of a workload, including its status, queue, running versus minimum-available values, task count, and age.

Volcano Jobs list in Headlamp

The detail view goes further by surfacing the information you usually need while debugging a Job: task details, Pod status, related Queue and PodGroup links, conditions, events, and more. Instead of forcing you to jump between several CLI commands, the plugin keeps that context together in a single page.

The Job page also adds supported lifecycle actions for appropriate states, including Suspend and Resume, so you can act on a Job directly from the UI.

Another useful addition is direct Job logs access. You can open logs for Pods created by a Volcano Job without leaving the Job detail page. The logs viewer supports both single-Pod and all-Pods views, along with container selection and common log controls such as line count, previous logs, timestamps, and follow.

Volcano Job logs in Headlamp

Queues: scheduling capacity and resource context

The Queue view provides much more than a small set of top-level fields. It helps you understand how resources are being allocated and constrained by surfacing capacity, allocated resources, deserved and guaranteed resources, reservation details, child queues, and more.

This makes the Queue page much more useful when trying to understand how resources are being shared and limited across queues.

Volcano Queue details in Headlamp

PodGroups: gang scheduling state and blockers

PodGroups are central to understanding gang scheduling in Volcano, and the plugin makes that state easier to inspect. The PodGroup view highlights progress, conditions, minimum resource requirements, and more.

This also gives you a clearer picture of whether a workload is blocked because it has not yet met the scheduling conditions required to run as a group.

Volcano PodGroup details in Headlamp
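The all-or-nothing rule behind gang scheduling can be sketched in a few lines. This is a toy model, not Volcano's scheduler: `min_member` mirrors the PodGroup's `minMember` field, while the function name and the idea of counting "schedulable" Pods as a single number are simplifying assumptions.

```python
def gang_status(min_member: int, schedulable_pods: int) -> str:
    """Describe whether a gang of Pods may start.

    No Pod in the group starts until at least min_member of them
    can be placed at the same time.
    """
    missing = max(0, min_member - schedulable_pods)
    if missing == 0:
        return "ready: all required Pods can be placed together"
    return f"blocked: {missing} more Pod(s) must be schedulable before any start"


print(gang_status(min_member=4, schedulable_pods=4))
print(gang_status(min_member=4, schedulable_pods=3))
```

In the second case three Pods could run on their own, but the group still waits. That is the state a PodGroup view helps you recognize.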

Map view: jobs, queues, PodGroups, and pods in one place

The map view shows how Volcano resources are connected. Instead of inspecting each resource separately, you can see how Jobs, PodGroups, Queues, and Pods relate to one another.

This is especially useful when a workload is pending or not progressing as expected. The map can show the Job, its related PodGroup, the Pods created for the workload, and the Queue context around it. Warning and error states also make it easier to spot resources that need attention.

Volcano resources in the Headlamp map view

Why use this alongside CLI tools

The plugin is not trying to replace kubectl or the Volcano CLI. Those remain important for automation, scripting, and raw object inspection. What the plugin improves is the interactive troubleshooting experience: discovering related resources more quickly, understanding structured detail pages, and moving from scheduling state to runtime output without switching tools constantly.

What's next

This work brings the main Volcano workflow into Headlamp, including Jobs, Queues, PodGroups, and the map view. Possible future work includes Prometheus integration, richer scheduling insights, and more workflow-oriented visibility across Volcano workloads.

Try it and share feedback

To try the plugin:

  1. Install Headlamp.
  2. Open the Plugin Catalog from the Headlamp UI.
  3. Search for Volcano.
  4. Install the Volcano plugin.
  5. Connect Headlamp to a Kubernetes cluster where Volcano is already installed.

Volcano plugin in the Headlamp Plugin Catalog

If you have ideas, feature requests, or bug reports, open an issue in the Headlamp plugins repository. Feedback from real Volcano users will help shape what comes next.

25 Jun 2026 8:00pm GMT

See your serverless: introducing the Headlamp plugin for Knative

Headlamp is an open-source, extensible Kubernetes SIG UI project designed to let you explore, manage, and debug cluster resources.

Knative brings serverless workloads to Kubernetes, handling traffic routing, autoscaling, and revision management so teams can deploy and iterate without fighting infrastructure. But operating Knative workloads day-to-day can be difficult: there's still a lot of jumping between the kn CLI, kubectl, and the Kubernetes UI to get a full picture of what's running.

We built the Headlamp Knative plugin to bridge that gap, allowing operators to inspect, understand, and act on their workloads from a single place. This plugin was built as part of the LFX Mentorship program. Here's a tour of what we shipped.

Here is a short walkthrough of the Knative plugin for Headlamp:

Integrating Knative resources with Headlamp's map view

Headlamp's resource mapping works for Knative CRDs too. You can see how KServices, Revisions, and DomainMappings relate to each other in a single graph view.

Knative resources in Headlamp Map View

KService management: edit traffic splits, restart pods, and view logs

A KService is the top-level resource in Knative: it manages the lifecycle of Routes, Configurations, Revisions, and everything needed to run and expose your application.

The plugin gives KServices a full detail view with an Edit Mode toggle for making live changes to traffic splits, autoscaling annotations, and more. Common actions like viewing the YAML, opening logs, triggering a redeploy, or restarting backing pods are surfaced in the header, gated by your current RBAC permissions.

Knative Service Detail View

Traffic splitting: route across revisions for gradual rollouts and testing

Knative makes it possible to route traffic across multiple Revisions of the same service. This is useful for canary releases, gradual rollouts, tagged preview URLs, and A/B testing.

The plugin shows the traffic assigned to each Revision, the latest ready Revision, readiness status, age, and configured tags. In edit mode, you can adjust percentages and tags inline. The plugin validates that traffic sums to 100% and that tags are unique before saving. Tagged routes with a reported URL render as clickable links.

Traffic Splitting between Revisions
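The two checks described above (percentages sum to 100, tags are unique) are simple to express. The following is a minimal sketch, not the plugin's code: the `validate_traffic` function is hypothetical, and each target is modeled as a dict with the `revisionName`, `percent`, and `tag` fields used in a Knative traffic block.

```python
def validate_traffic(targets):
    """Return a list of validation errors; an empty list means the split is valid."""
    errors = []

    total = sum(t.get("percent", 0) for t in targets)
    if total != 100:
        errors.append(f"traffic percentages sum to {total}, expected 100")

    tags = [t["tag"] for t in targets if t.get("tag")]
    duplicates = sorted({tag for tag in tags if tags.count(tag) > 1})
    if duplicates:
        errors.append("duplicate tags: " + ", ".join(duplicates))

    return errors


canary = [
    {"revisionName": "hello-00001", "percent": 90, "tag": "stable"},
    {"revisionName": "hello-00002", "percent": 10, "tag": "canary"},
]
broken = [
    {"revisionName": "hello-00001", "percent": 80, "tag": "preview"},
    {"revisionName": "hello-00002", "percent": 10, "tag": "preview"},
]

print(validate_traffic(canary))
print(validate_traffic(broken))
```

Returning every error at once, rather than stopping at the first, suits an inline editor where the user fixes several fields before saving.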

Autoscaling configuration: view effective settings and cluster defaults

Knative's autoscaler supports a range of settings: concurrency targets, target utilization, RPS targets, min/max scale, initial scale, stable window, scale-down delay, and more. The effective value for any workload is a combination of KService-level annotations and cluster-wide ConfigMaps.

The plugin reads config-autoscaler and config-defaults and shows the effective configuration per KService in context, so you can see at a glance whether a setting is explicitly configured or falling back to the cluster default.

Autoscaling and Concurrency View
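Resolving an effective setting amounts to a lookup with a fallback. The sketch below shows the idea under stated assumptions: the function `effective_setting` is hypothetical, the inputs are plain dicts rather than live cluster objects, and the key pairing (`autoscaling.knative.dev/min-scale` on the workload, `min-scale` in config-autoscaler) is used for illustration only.

```python
def effective_setting(annotations, cluster_defaults, annotation_key, default_key):
    """Return (value, source) for one autoscaling setting.

    A workload annotation wins; otherwise the cluster-wide ConfigMap
    value applies; otherwise the setting is reported as unset.
    """
    if annotation_key in annotations:
        return annotations[annotation_key], "KService annotation"
    if default_key in cluster_defaults:
        return cluster_defaults[default_key], "cluster default"
    return None, "unset"


# Illustrative inputs: one explicit annotation, two cluster defaults.
annotations = {"autoscaling.knative.dev/min-scale": "2"}
config_autoscaler = {"min-scale": "0", "max-scale": "0"}

min_scale = effective_setting(
    annotations, config_autoscaler,
    "autoscaling.knative.dev/min-scale", "min-scale")
max_scale = effective_setting(
    annotations, config_autoscaler,
    "autoscaling.knative.dev/max-scale", "max-scale")

print(min_scale)
print(max_scale)
```

Returning the source alongside the value is what lets a UI mark a setting as explicitly configured versus inherited.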

Prometheus metrics: monitor request rates, latency, and resource utilization

When paired with the Prometheus plugin for Headlamp, the plugin renders request rate, latency, and resource utilization graphs on KService and Revision detail pages. The per-revision request rate breakdown is particularly useful when validating a traffic split in progress.

Knative metrics filtered by revision

Dashboard for other CRDs

The plugin also includes list and detail views for Revisions, DomainMappings, ClusterDomainClaims, and a cluster-level Networking overview (reading config-network and config-gateway to surface the effective ingress class, gateway settings, and backing services). These give operators a complete picture of Knative's state without leaving Headlamp.

Knative Revision List View

Knative Domain Mapping List View

Knative Cluster Domain Claim List View

How to install the Knative plugin in Headlamp

  1. Make sure Knative is installed in your cluster.
  2. In Headlamp Desktop, open the Plugin Catalog, search for Knative, and click Install.
  3. Reload Headlamp. A new Knative entry will appear in the sidebar.

For development or source-level setup, see the Knative plugin README. The current release is 0.3.0-beta.

Share your feedback

We'd love feedback from Knative operators and users. If you hit a bug or want support for a workflow we haven't covered, please open an issue. You can also find us in the Kubernetes Slack #headlamp channel.

25 Jun 2026 6:00pm GMT

24 Jun 2026

Spotlight on WG Device Management

The rising popularity of AI, Edge, and Telecommunications workloads on Kubernetes has led to new requirements for hardware management. We now need hardware specification beyond CPU time and memory allocations. This includes allocating GPUs, TPUs, network interfaces, and other hardware, sometimes after pod start and occasionally through time-sharing.

Efficiently managing this specialized hardware is the mission of the Device Management Working Group. Their cornerstone project, Dynamic Resource Allocation (DRA), recently graduated to GA, marking a fundamental shift in how the project handles hardware-intensive workloads at scale.

In this spotlight, we sit down with working group chairs Kevin Klues, Patrick Ohly, and John Belamaric to discuss the limitations of the legacy device model, the NP-hard challenges of scheduling, and how they're building a more programmable, hardware-aware future for Kubernetes.

Introducing Device Management

Natalie Fisher: Can you introduce yourself, your role, and how you got involved in the Device Management Working Group?

Kevin Klues: My name is Kevin Klues. I am a Distinguished Engineer at NVIDIA. I have been a co-chair of the device management working group since its inception at KubeCon EU 2024. I have also been involved with DRA (the working group's primary deliverable) since its inception in 2019 / 2020. I have also been a kubelet maintainer since 2019, with a focus on its device manager, CPU manager, and topology manager subcomponents. The challenges we saw with using these components for workloads that relied on external accelerators (e.g., GPUs) are what triggered us to start working on DRA in the first place.

Patrick Ohly: I am a Principal Engineer at Intel. In Kubernetes, I am a Tech Lead for SIG Testing and SIG Instrumentation and co-chair of the Device Management WG. I was co-chair of the WG Structured Logging and a member of the Steering Committee. Some of my early contributions to Kubernetes include ephemeral CSI volumes and storage capacity tracking, so I had some experience with API design, implementation, and scheduling. We knew that introducing a major new API for accelerators would be hard. Somewhat foolishly, I accepted that challenge in 2020, wrote the initial DRA KEP (now known as "classic DRA") and implemented most of it, then started over with a second KEP for today's "structured parameters DRA". Initially, it was an uphill battle to convince maintainers that this work was necessary. It was only around 2023 that interest in DRA picked up, leading to the formation of the working group.

John Belamaric: I am a Senior Staff SWE at Google, and the third co-chair of WG Device Management, also since its inception. I have also been a co-chair of SIG Architecture since 2019. As Patrick mentioned, in late 2023, interest in DRA really picked up. The initial implementation made autoscaling very challenging, and so there was some concern in the community about advancing it to beta. I got involved to try to help address some of those concerns, and the three of us, along with Tim Hockin, worked hard over the next few months to build a consensus around a new design. To facilitate this collaboration, we formed the working group after discussion at KubeCon in Paris in 2024.

The problem and the solution

The working group emerged from a fundamental rethink of how Kubernetes interacts with specialized hardware. At the heart of this evolution is Dynamic Resource Allocation (DRA). Rather than treating devices as simple integers, DRA provides a structured framework that breaks device management into four distinct stages:

NF: For readers who may not be familiar, what is the Device Management Working Group, and what problems is it trying to solve?

KK: The Device Management Working Group was chartered to enable simple and efficient configuration, sharing, and allocation of accelerators and other specialized hardware across Kubernetes workloads. Think GPUs, TPUs, FPGAs, and similar devices that don't fit neatly into Kubernetes' traditional resource model.

The problem we set out to solve is that the legacy Device Plugin API (which has been the primary mechanism for exposing hardware accelerators in Kubernetes) is fundamentally limited. It treats devices as opaque integers: you can request "2 GPUs," but you can't say anything meaningful about which GPUs you need, how they should be connected to each other, whether they can be shared, or how they should be partitioned. That was fine for simple cases, but modern AI/ML workloads are anything but simple. They span multiple nodes, require specific interconnect topologies, and increasingly need to share or partition hardware dynamically.

The working group's primary deliverable is Dynamic Resource Allocation (DRA), a new framework that replaces the rigid device plugin model with a flexible, declarative API. With DRA, workloads can describe their hardware requirements (e.g., GPU type, memory capacity, interconnect topology, desired partitioning) and drivers can publish fine-grained device attributes that the scheduler can act on. DRA graduated to GA in Kubernetes 1.34, and the ecosystem around it (e.g., drivers, tooling, and new API extensions) is growing rapidly.
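The difference between requesting a count and requesting by attributes can be shown with a toy model. This is not the DRA API: real claims express selection as CEL expressions that the scheduler evaluates against attributes published by drivers. The `Device` class, the `allocate` function, and the attribute names `model` and `memory_gib` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Device:
    """A device as a driver might publish it: a name plus attributes."""
    name: str
    attributes: dict = field(default_factory=dict)


def allocate(devices: list, count: int, selector: Callable) -> list:
    """Return `count` devices matching `selector`, or [] if too few match."""
    matches = [d for d in devices if selector(d)]
    return matches[:count] if len(matches) >= count else []


published = [
    Device("gpu-0", {"model": "a100", "memory_gib": 40}),
    Device("gpu-1", {"model": "a100", "memory_gib": 80}),
    Device("gpu-2", {"model": "t4", "memory_gib": 16}),
]

# "One GPU with at least 80 GiB" cannot be said with an opaque integer.
big = allocate(published, 1, lambda d: d.attributes["memory_gib"] >= 80)
# "Two GPUs of the same model" is likewise an attribute-based request.
pair = allocate(published, 2, lambda d: d.attributes["model"] == "a100")

print([d.name for d in big])
print([d.name for d in pair])
```

Under the integer model, a request for one GPU could land on any of the three devices. The attribute-based request above can only be satisfied by `gpu-1`.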

PO: As Kevin said, the working group was formed around the existing effort to develop DRA. The initial work was done with only a handful of people actively involved, and perhaps also could only be done successfully in such a setup. But because it touches on so many different areas of Kubernetes, we also needed a place to discuss that and get the broader community of Kubernetes maintainers, device vendors, and, to a lesser extent, also end-users involved. The working group provides that place, with regular meetings online (one slot for Americas/EMEA, one for EMEA/Asia) and at KubeCon.

JB: DRA is the first problem the WG has addressed. It is focused on selection, allocation, and configuration of the devices. We broke the problem down into four parts: how does the vendor model the device and advertise capacity, how does the user request it, how do we schedule that request on top of the advertised capacity, and how do we actuate that result (that is, how do we make the device ready and available to the Pod).

One thing that is fundamental to the approach we took is an awareness of the incredible diversity of hardware and the rapid rate of change in the hardware industry. We knew that we couldn't keep up with the change if the Kubernetes APIs had to change for every type of hardware. Instead, we created a general approach where we address the hardware aspects that are important to Kubernetes. What we have done so far is focus on the scheduling and configuration aspects of devices. We built a device modeling API (the ResourceSlice API) that vendors use to model the scheduling characteristics of their devices, and we allow users to pass through arbitrary configurations to those devices. By doing this, Kubernetes can be "programmed" to understand these aspects of the devices, without needing to be modified.

But DRA, as it stands right now, is very focused on scheduling. There are other aspects of Device Management that are in scope for the WG. In particular, we are looking into device failure detection and mitigation, and whether there is some better support we can build into Kubernetes to help.

Also, as Kevin alluded to, devices are often allocated and used in groups, rather than individually. Choosing the right devices to work together in a group depends on how they are interconnected; for example, NVIDIA GPUs may be in an any-to-any fabric arrangement in an NVLink domain, whereas TPUs may have a 3D torus interconnect. This affects the "selection, allocation and configuration" of devices, and we have a lot more work to do to address these use cases.

A cross-SIG effort

Because device management touches scheduling, node operations, autoscaling, networking, and API design, the work naturally spans multiple SIGs across the Kubernetes project.

NF: How does collaboration across these SIGs work in practice, and why is it necessary?

KK: Device management touches nearly every layer of the Kubernetes stack, which is why the working group was chartered as a cross-SIG effort from the start. We have five stakeholder SIGs: sig-node, sig-scheduling, sig-autoscaling, sig-network, and sig-architecture.

In practice, the working group serves as a coordination layer. We don't own code directly; instead, our deliverables take the form of KEPs and implementations that live in the respective SIGs. What we provide is a unified forum where the people building the scheduler, the kubelet, the autoscaler, and the network plane can design together rather than in isolation.

Why is this necessary? Consider a simple example: a user requests a set of GPUs that need to communicate via NVLink. That requirement involves the scheduler (place the pods on the right nodes), the kubelet (configure the devices and expose them to the container), and potentially autoscaling (provision the right node type if none exists).

If those three groups design independently, you end up with inconsistent abstractions, duplicated logic, and integration bugs that only surface in production. The working group ensures that a single coherent API and data model flows through all of these components.

The cross-SIG model also means that design decisions are reviewed from multiple angles. Someone from sig-scheduling will catch scheduler complexity that a sig-node contributor might overlook, and vice versa. It slows down individual decisions slightly, but produces much more robust outcomes.

Current focus areas

With DRA now generally available, the working group's focus has expanded to enable more advanced scheduling models, shared semantics, operational visibility, and support for increasingly complex hardware topologies.

NF: What are some of the key initiatives or deliverables the working group is currently focused on?

KK: We maintain a project board at Kubernetes Project Board with real-time tracking of our initiatives and their progress.

PO: The scope and feature set of core DRA were intentionally limited to enable graduation to GA within a reasonable time. Additional KEPs add more features, on their own schedule. Those fall roughly into three categories:

  1. Extend the expressiveness of DRA to support more complex devices and scheduling scenarios.
  2. Support day two operations like health monitoring.
  3. Improve multi-node support, primarily by integrating with workload-aware scheduling.

In addition to the project board, we also maintain a table which summarizes all the KEPs which are currently in flight. This is the status for 1.36; more are likely to be added for 1.37:

KEP | Description | Release stages (columns 1.32, 1.33, 1.34, 1.35, 1.36)
4381 | DRA: Structured Parameters | Beta, Beta, Stable
5004 | DRA: Extended Resource Requests via DRA | Alpha, Alpha, Beta
4817 | DRA: Resource Claim Status | Alpha, Beta, Beta, Beta, Beta
5018 | DRA: Namespace Controlled Admin Access | Alpha, Beta, Beta, Stable
5055 | DRA: Device Taints and Tolerations | Alpha, Alpha, Alpha, Beta
4816 | DRA: Prioritized Alternatives in Device Requests | Alpha, Beta, Beta, Stable
5075 | DRA: Consumable Capacity | Alpha, Alpha, Beta
4815 | DRA: Partitionable Devices | Alpha, Alpha, Alpha, Beta
5304 | DRA: Attributes Downward API | Alpha
5729 | DRA: ResourceClaim Support for Workloads | Alpha
4680 | Resource Health Status in Pod Status | Alpha, Alpha, Alpha, Alpha, Beta
5517 | DRA: Native Resource Requests | Alpha
5677 | DRA: Resource Availability Visibility | Alpha
5007 | DRA: Device Binding Conditions | Alpha, Alpha, Beta
5491 | DRA: List Types for Attributes | Alpha

NF: One of the core challenges is efficient device utilization and sharing. What progress is being made in this area?

JB: Good question. One way to think about it is what we are doing in the two primary APIs: ResourceClaim and ResourceSlice.

The ResourceClaim API is how the user asks for devices. We have built some features that allow the user to be more flexible in their requests. For example, instead of asking for a specific model of GPU, they can ask for a GPU with at least a certain amount of memory. Or they can ask for a list of alternatives: "I'd like one A100 (80 GB) GPU, but if you don't have it, I'll take two A100 (40 GB) GPUs." This gives the scheduler some options to satisfy the request, which can improve obtainability and make use of hardware that would otherwise not be selected.
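As a rough illustration of the "list of alternatives" idea, here is a toy Python model. It is not the scheduler's code or the ResourceClaim API; the `Device` and `Alternative` classes, the `pick_first_available` helper, and the inventory are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    name: str
    model: str
    memory_gb: int


@dataclass(frozen=True)
class Alternative:
    """One acceptable way to satisfy a request: `count` devices with enough memory."""
    label: str
    min_memory_gb: int
    count: int


def pick_first_available(alternatives, free_devices):
    """Return (label, devices) for the first alternative that fits, else None."""
    for alt in alternatives:
        matches = [d for d in free_devices if d.memory_gb >= alt.min_memory_gb]
        if len(matches) >= alt.count:
            return alt.label, matches[:alt.count]
    return None


# Hypothetical inventory: no 80 GB card is free, but two 40 GB cards are.
free = [Device("gpu-0", "A100", 40), Device("gpu-1", "A100", 40)]
request = [
    Alternative("one-80gb", min_memory_gb=80, count=1),
    Alternative("two-40gb", min_memory_gb=40, count=2),
]
choice = pick_first_available(request, free)
print(choice[0], [d.name for d in choice[1]])
```

With this inventory the first alternative cannot be met, so the sketch falls back to the second one and prints `two-40gb ['gpu-0', 'gpu-1']`.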

The ResourceClaim API allows users to explicitly share devices. You can point multiple containers (in the same or different Pods) at a ResourceClaim; this allows the devices allocated by that claim to be used in all of those containers, if the device supports it.

The ResourceSlice API is how vendors model and advertise their devices. This is where we implement support for other sharing models. For example, we have a way to represent "overlapping partitions", enabling the scheduler to dynamically select a MIG partition, and make any overlapping MIG partitions unavailable automatically. This works well in combination with a request like "give me any GPU with 20GB or more of memory" - the scheduler can satisfy that with a MIG or a real GPU.
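The "overlapping partitions" behavior can be pictured with a small Python sketch. This is only a toy model under stated assumptions: the `PartitionableGPU` class, the slice layout, and the partition names are hypothetical and do not mirror the ResourceSlice schema.

```python
class PartitionableGPU:
    """Toy model: each partition occupies a set of memory slices on one GPU."""

    def __init__(self, partitions):
        # partitions maps a partition name to the set of slice indices it occupies.
        self.partitions = partitions
        self.used_slices = set()

    def available(self):
        """Partitions whose slices do not overlap anything already allocated."""
        return sorted(
            name for name, slices in self.partitions.items()
            if not (slices & self.used_slices)
        )

    def allocate(self, name):
        slices = self.partitions[name]
        if slices & self.used_slices:
            raise ValueError(f"{name} overlaps an allocated partition")
        self.used_slices |= slices


# Hypothetical layout: a 40 GB GPU split into four 10 GB slices (0..3).
gpu = PartitionableGPU({
    "full-40gb": {0, 1, 2, 3},
    "half-a-20gb": {0, 1},
    "half-b-20gb": {2, 3},
    "quarter-10gb": {0},
})
gpu.allocate("half-a-20gb")
print(gpu.available())
```

After one 20 GB half is allocated, only the other half remains selectable; the full GPU and the overlapping quarter drop out without any explicit bookkeeping by the user.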

Some features require changes in both. We have another sharing method we call "consumable capacity". In the explicit sharing case described above, a user needs to point containers at the same ResourceClaim; there is one ResourceClaim shared amongst several containers and Pods. With consumable capacity, the device sharing works more like how Pods share a Node. The user creates a ResourceClaim that asks for a certain amount of resources, for example, "I need a NIC with 2Gbps of bandwidth". The scheduler knows that there is a NIC with 40Gbps of bandwidth available, and so it allocates 2Gbps out of that 40Gbps and gives it to that ResourceClaim. In this case, each Pod has its own ResourceClaim, but the underlying device is shared between those claims. It's up to the on-node DRA driver to properly set up the device for this sort of sharing (in the NIC case, likely by creating a subinterface). We call this "platform-mediated sharing" to differentiate it from the explicit "user-mediated sharing".
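The NIC example can be sketched in a few lines of Python. This is a simplified model of platform-mediated sharing, not the DRA implementation; the `SharedDevice` class and the claim names are hypothetical.

```python
class SharedDevice:
    """Toy model of a device whose capacity is divided among several claims."""

    def __init__(self, name, capacity_gbps):
        self.name = name
        self.capacity_gbps = capacity_gbps
        self.allocations = {}  # claim name -> Gbps reserved for that claim

    def remaining(self):
        return self.capacity_gbps - sum(self.allocations.values())

    def allocate(self, claim, gbps):
        """Reserve bandwidth for a claim; return False if it does not fit."""
        if gbps > self.remaining():
            return False
        self.allocations[claim] = gbps
        return True


nic = SharedDevice("nic-0", capacity_gbps=40)
# Each Pod brings its own claim; all of them land on the same NIC.
for claim in ("pod-a-claim", "pod-b-claim", "pod-c-claim"):
    nic.allocate(claim, 2)
print(nic.remaining())
```

Three claims of 2 Gbps each leave 34 Gbps on the 40 Gbps NIC, much like Pods drawing CPU and memory from a Node's allocatable resources.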

Real-world impact

While much of the work is deeply technical, the underlying goal is practical: enabling Kubernetes to better support real-world AI/ML and hardware-intensive workloads at scale.

NF: What are the biggest challenges users face today when running hardware-intensive workloads (like AI/ML) on Kubernetes?

PO: Such workloads depart from traditional container workloads in several ways: they may consist of multiple communicating pods which all need to run at the same time ("gang scheduling"). They are often long-running and expensive to initialize, and their performance is sensitive to where they run (topology within a node and interconnects between nodes for multiple pods). The Kubernetes scheduler traditionally has not supported either of these well because it schedules one pod at a time and is unaware of the topology within a node. Several external schedulers try to fill this gap, which often isn't ideal, in particular when the Kubernetes scheduler schedules other pods to the same cluster.
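To make the "gang scheduling" requirement concrete, here is a minimal all-or-nothing placement sketch in Python. It is an illustration only, not how any Kubernetes scheduler is implemented; the `gang_schedule` function, the pod names, and the node inventory are hypothetical, and each pod is assumed to need exactly one device slot.

```python
def gang_schedule(pods, free_slots):
    """Place every pod in the gang or none of them.

    pods: list of pod names, each needing one device slot.
    free_slots: dict of node name -> number of free device slots.
    Returns a dict of pod -> node, or None if the whole gang does not fit.
    """
    remaining = dict(free_slots)  # work on a copy so failure leaves no trace
    placement = {}
    for pod in pods:
        node = next((n for n, free in remaining.items() if free > 0), None)
        if node is None:
            return None  # one pod cannot be placed, so nothing is placed
        remaining[node] -= 1
        placement[pod] = node
    return placement


nodes = {"node-a": 2, "node-b": 1}
print(gang_schedule(["w0", "w1", "w2"], nodes))
print(gang_schedule(["w0", "w1", "w2", "w3"], nodes))
```

A scheduler that places one pod at a time would start three of the four workers in the second case and leave them holding devices while waiting for the fourth; the all-or-nothing version returns `None` instead and leaves the inventory untouched.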

NF: How should platform engineers think about device management when designing their Kubernetes platforms?

JB: We're still learning here, but one idea of DRA is to enable a shift to more "requirements driven" specifications. This can allow less coupling between end users that write the workload specification and the cluster administrators that set up the clusters. Instead of agreeing on labeling conventions and requiring users to understand the cluster topology, the users can specify what their workload needs, and the scheduler can figure out how to satisfy it. If we can make this work, it can make even complex workloads more portable across clusters.

Challenges and trade-offs

As with many areas of Kubernetes, increasing flexibility and expressiveness also introduces new layers of complexity, particularly around scheduling and optimization.

NF: What are some of the hardest technical challenges the working group is tackling today?

PO: There's an inherent conflict between flexibility and scheduling complexity. The current implementation is focused on finding some solution that satisfies the requested resources, but it's not necessarily the best one, whatever "best" means, which is also not always clear. The other big challenge is exposing node-allocatable resources (RAM, CPU) as devices with additional metadata; this is necessary to fine-tune scheduling of workloads which need perfect alignment on a node for optimal performance.

JB: Patrick's list is good. Complex device modeling is hard, and making sure that we build the right semantics such that they apply to lots of different hardware is always tricky.

On top of that, scheduling in general is very complex and is an NP-hard problem. All the metadata and flexibility DRA adds gives the scheduler more options, which has pros and cons. More options are helpful if you are constrained in your choices, as it means you can schedule something that you otherwise could not. But it also means it is even harder to find an optimal solution when there are many possibilities in a given cluster. DRA works well in our common use cases so far, but we have a lot of work to do to improve the optimality of the chosen scheduling solution and ensure the performance of making that choice.

Looking ahead

Despite the challenges, contributors across the working group remain excited about the pace of innovation and the growing community forming around device management in Kubernetes.

NF: Looking ahead, what are you most excited about in the future of device management in Kubernetes?

KK: NVIDIA recently donated its DRA driver for GPUs to the Kubernetes project. I'm personally excited for more community members to start contributing to the project and defining its future direction.

PO: For me, it's primarily the number of new contributors and people stepping up to help out. This poses new challenges around reviewing proposals and helping developers get those implemented and merged. It's nice and rewarding to see others succeed, and it bodes well for the future because more people are familiar with the topic.

JB: I am excited about a lot of things. The community really has grown and has so many interesting features in the works to enable modeling of more complex devices, and to better model multi-node devices.

I am really excited to see the creative ways people will use these APIs. They were primarily designed to address "devices", but just like how "everything is a file" in Unix/Linux, the APIs themselves are quite flexible as to what they model. They really build out a more programmable scheduler, which can have interesting applications. For example, I recently prototyped using DRA to schedule pods to nodes where a large AI model is already locally cached. It's really quite flexible, and I have great confidence in the creativity of our community, so I think we'll see some unexpected solutions in the ecosystem.

Getting involved

NF: How can contributors get involved with the Device Management Working Group?

KK: The easiest first step is to join our mailing list at wg-device-management@kubernetes.io. Subscribing will automatically add calendar invites for our biweekly meetings to your calendar.

We have two meeting slots to accommodate different time zones:

Meeting notes, agendas, and recordings are all publicly accessible (links are available from the Device Management page). You can get a feel for the work in progress before attending your first meeting.

On Slack, find us in #wg-device-management on the Kubernetes Slack workspace. That's the best place for quick questions or to introduce yourself.

For more hands-on contributions, the DRA Driver for NVIDIA GPUs is now a community project and a great place to start. It's a real-world, production-grade implementation that the broader community is now shaping together.

We welcome contributors at all levels - whether you're interested in the API design, the scheduler internals, driver development, or documentation. Come say hello.

Summary

As Kubernetes evolves to support the AI/ML revolution and high-performance computing, the work happening within WG Device Management is becoming the foundation for how modern workloads are scheduled and operated at scale.

From the graduation of Dynamic Resource Allocation (DRA) to the next frontiers of health monitoring and topology-aware scheduling, this group is effectively rewriting the "handshake" between software and hardware.

If you're interested in shaping the future of hardware-aware orchestration, now is the perfect time to get involved. Whether you want to help refine the API, build out drivers, or improve documentation, the working group welcomes all levels of experience and perspectives from across the community.

24 Jun 2026 6:00pm GMT

15 Jun 2026

Spotlight on SIG Storage

In our ongoing SIG Spotlight series, we shine a light on the groups that keep the Kubernetes project moving forward. This time, we catch up with SIG Storage, the group responsible for persistent data, volume management, and the interfaces that connect Kubernetes workloads to the storage systems beneath them.

We spoke with Xing Yang, Co-Chair of SIG Storage and Software Engineer at VMware by Broadcom, about the SIG's history, the features shipping in recent Kubernetes releases, and where storage in Kubernetes is headed as AI workloads become the norm.

Introductions

Could you introduce yourself and share your role(s) within SIG Storage?

My name is Xing Yang, a software engineer at VMware by Broadcom. I'm a co-chair in SIG Storage, alongside another co-chair Saad Ali from Google. There are also two Tech Leads in SIG Storage: Michelle Au from Google and Jan Šafránek from Red Hat.

What first drew you to storage in Kubernetes, and how did you start contributing?

I have always been working in the storage domain, so SIG Storage was a natural place for me to get started when I began to learn Kubernetes. I started attending SIG Storage meetings, trying to figure out what I could do to help. This was before the first Container Storage Interface (CSI) release - lots of things were still evolving. It was a very exciting time.

What subprojects or areas do you actively maintain or review today?

I'm a maintainer in Kubernetes CSI. There are multiple CSI sidecars - such as csi-provisioner, csi-attacher, csi-resizer, and csi-snapshotter - that we need to release following every Kubernetes release. I'm also a co-chair for a Data Protection Working Group co-sponsored by SIG Storage and SIG Apps. Several features have come out of that WG aimed at filling gaps in data protection support within Kubernetes. One is Volume Group Snapshot, which provides crash-consistent group snapshots for multiple volumes used by an application. Changed Block Tracking (CBT) is another critical feature from the DP WG designed to support efficient backups.

About SIG Storage

For folks who are new: what is SIG Storage, in your own words? What problems in Kubernetes are you trying to solve?

SIG Storage is a Special Interest Group focused on how to provide storage to containers running in your Kubernetes cluster. We define standard interfaces so that a storage vendor can write a driver and have its underlying storage system consumed by containers in Kubernetes.

Why does Kubernetes need a dedicated storage SIG? What makes storage hard in a distributed system?

When Kubernetes was first introduced, it was meant for stateless workloads only. Container applications were regarded as ephemeral and therefore did not need to persist data. However, that changed drastically. Stateful workloads started running in Kubernetes, and we needed a dedicated SIG to tackle the associated storage challenges. PersistentVolumeClaims, PersistentVolumes, and StorageClasses were all introduced to provision data volumes for applications running in Kubernetes.

How did SIG Storage originally form, and how has its mission changed over time?

SIG Storage was formed to address the challenges of handling persistent data within Kubernetes. Initially, PersistentVolumes were implemented as in-tree plugins, and the SIG managed those plugins while developing core storage primitives like PersistentVolumes and PersistentVolumeClaims.

Container Storage Interface (CSI) was introduced later and played a crucial role in simplifying storage integration, enabling third-party storage providers to develop and maintain their own out-of-tree plugins without modifying Kubernetes core code.

With basic integration addressed by CSI, the SIG's mission expanded to include advanced storage features that leverage the new interface. The SIG has also expanded its scope to support object storage through the Container Object Storage Interface (COSI).

Current work and roadmap

What are the top features SIG Storage is actively working on right now?

The Data Protection WG has been working on a couple of exciting features:

Another feature worth highlighting is Container Object Storage Interface (COSI). COSI provides a standard interface for provisioning and consuming object storage buckets in Kubernetes - standardizing object storage for containerized applications much like CSI did for block and file storage. COSI is now transitioning to v1alpha2, with plans for promotion to Beta in a future release.

What recent work from SIG Storage do you consider a "win" for users?

The graduation of VolumeAttributesClass to GA in Kubernetes v1.34 is a major win for users managing stateful workloads. Previously, changing volume attributes like IOPS or throughput required out-of-band actions or disruptive operations. Now, users can dynamically tune storage properties such as IOPS or throughput directly through the Kubernetes API - scaling up for peak loads or down to optimize costs - without external processes or downtime.

VolumeAttributesClass enables dynamic modification of storage characteristics without recreating the volume. This completes the picture by allowing users to tune both capacity and other storage properties dynamically, just as they can now tune both CPU and memory for compute.
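As a sketch of what this looks like in practice, the Python snippet below builds a VolumeAttributesClass manifest and the patch that points an existing PersistentVolumeClaim at it. The class name, driver name, and parameter keys are hypothetical; each CSI driver defines the parameters it actually accepts.

```python
import json

# A class describing a faster tier. The driver name and the parameter keys
# are hypothetical: each CSI driver defines which parameters it accepts.
gold_class = {
    "apiVersion": "storage.k8s.io/v1",
    "kind": "VolumeAttributesClass",
    "metadata": {"name": "gold"},
    "driverName": "csi.example.com",
    "parameters": {"iops": "16000", "throughput": "600"},
}


def retune_patch(class_name):
    """Merge patch that points an existing PVC at another VolumeAttributesClass."""
    return {"spec": {"volumeAttributesClassName": class_name}}


print(json.dumps(gold_class, indent=2))
print(json.dumps(retune_patch("gold")))
```

The second document is the kind of payload one might pass to `kubectl patch pvc <name> --type merge -p '...'`; the volume itself is not recreated.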

Looking ahead one or two releases, what's on the roadmap that people should watch for?

I'd like to draw attention to the Volume Health feature. This feature is designed to offer critical visibility into the operational status and integrity of persistent volumes. By enabling storage drivers and the Kubernetes control plane to report issues, it allows for proactive monitoring and identification of volume-related problems.

Currently, volume health information is reported via non-persistent events. We are actively investigating enhancements to this feature with the goal of supporting automated remediation capabilities in the future.

Are there areas where you'd really like more discussion or help from the community?

We always need help from the community to fix bugs, add tests, and help with reviews.

We'd also like to get feedback on the Alpha feature Mutable PV Affinity, which was introduced in Kubernetes v1.35. Use cases include migrating volumes from zonal to regional storage or migrating from one disk type to another.

Another topic is volume replication. It was raised at KubeCon Atlanta and has been discussed in the Data Protection WG. Community members interested in this topic are encouraged to join the DP WG meetings.

What are the biggest challenges users face today when running stateful workloads on Kubernetes?

While Kubernetes has moved stateful workloads - like databases and AI pipelines - into the mainstream, managing "state" in a system designed for ephemerality remains difficult:

Storage and AI

How do you see storage evolving in Kubernetes over the next few years, especially as AI/ML workloads grow?

I see several trends shaping storage in Kubernetes as it evolves from a container orchestrator into the "Operating System" for AI:


SIG Storage continues to tackle some of the hardest problems in Kubernetes: keeping stateful applications running reliably, making storage operations transparent and composable, and now scaling up to meet the demands of AI-era workloads. Whether you're a user managing databases in production or a developer curious about storage internals, there's a place for you in SIG Storage.

If you'd like to get involved, check out the SIG Storage community page and join the bi-weekly meetings. You can also find the SIG on Slack at #sig-storage.

15 Jun 2026 12:00am GMT

01 Jun 2026

From Kubernetes Dashboard to Headlamp: Understanding the Transition

For many people, Kubernetes Dashboard was their first window into Kubernetes. It offered a simple visual way to see what was running in a cluster, inspect resources, and build confidence without relying on the command line. For years, it helped developers, students, and operators make sense of Kubernetes, and it served as an important onramp into the ecosystem.

The Kubernetes Dashboard project has now been archived. We deeply respect the work the team did and the role Dashboard played in making Kubernetes more approachable for so many users.

Headlamp builds on that foundation and carries it forward. It keeps the clarity of a visual interface while adding capabilities that match how Kubernetes is used today. This includes multi-cluster visibility, application-centric views, extensibility through plugins, and flexible deployment options that work both in-cluster and on the desktop.

This guide is meant to help you navigate that transition with confidence. Before diving into the mechanics of migration, we start with familiar ground by looking at how common Kubernetes Dashboard workflows map to Headlamp. We also cover what stays the same and what improves after the switch. The goal is not just to replace a tool, but to honor a user-centered legacy and help you land in a UI that can grow with you as your Kubernetes usage evolves.

Mapping Kubernetes Dashboard workflows to Headlamp

If you have used Kubernetes Dashboard before, many workflows in Headlamp will feel familiar. Headlamp does not introduce a new way of thinking. Instead, it builds on workflows users already know and extends them in practical ways. The focus is continuity. What worked before still works, with more room to grow.

Viewing workloads and resources

In Kubernetes Dashboard, most users started by browsing workloads like pods, deployments, services, and namespaces. Headlamp keeps this same starting point. Workloads are easy to find and inspect, and moving between namespaces and clusters is simpler. Resources are still organized in familiar ways, and navigation feels smoother, especially when you work across multiple environments.

[Figure: Viewing Kubernetes workloads and resources in the Headlamp interface]

Editing and interacting with resources

Like Kubernetes Dashboard, Headlamp lets you view and edit manifests directly in the UI based on your permissions. You can delete resources, scale workloads, or update configurations from the interface. All actions follow standard Kubernetes RBAC. If you could perform an action in Dashboard, you will find the same capability in Headlamp, with the same respect for access controls.

[Figure: Editing and interacting with Kubernetes resources in the Headlamp user interface]

Understanding relationships

Where Headlamp begins to expand the experience is in how it presents relationships between resources. In addition to list views, Headlamp offers visual ways to see how workloads, services, and configurations connect. This helps provide context without changing the underlying workflows users already rely on.

[Figure: Visualizing relationships between Kubernetes workloads and services in Headlamp]

At a high level, the tasks you performed in Kubernetes Dashboard are still there. Headlamp keeps familiar workflows while making it easier to scale as clusters, teams, and applications grow.

Where Headlamp goes beyond Kubernetes Dashboard

Expanding from single cluster to multi-cluster workflows

Kubernetes Dashboard was designed to work with one cluster at a time. That model worked well for simple setups, but it became limiting as teams adopted multiple environments. Headlamp expands this view by letting you work with multiple clusters from a single interface without switching tools or losing context. This makes it easier to manage development, staging, and production environments side by side.

[Figure: Expanding from single cluster to multi-cluster workflows using Headlamp]

For teams running Kubernetes in more than one place, this shift reduces friction. You can stay oriented and move between clusters with confidence.

From resource lists to application context with Projects

Projects give you an application-centered way to view Kubernetes. Instead of jumping between lists, you can group related workloads, services, and supporting resources in one place. This makes applications easier to understand. You can see what belongs together, track changes in context, and troubleshoot without scanning the cluster piece by piece.

Projects are built on native Kubernetes concepts. Namespaces, labels, and RBAC continue to work the same way they always have. Headlamp adds a visual layer that brings related resources together.

Projects are optional. You can still work at the individual resource level when that fits your task. When you need more context, Projects help you step back and see the bigger picture.

[Figure: Application Projects view in Headlamp grouping related Kubernetes resources]

Extend the Headlamp UI with plugins

Headlamp can be extended through plugins that bring common workflows directly into the UI. Instead of switching tools, you work in one place with the same context.

[Figure: Adding plugins from the plugin catalog in the Headlamp interface]

For example, the Flux plugin brings GitOps workflows into Headlamp. It allows teams to view application state alongside the Kubernetes resources that Flux manages, making it easier to understand how changes in Git relate to what is running in the cluster.

[Figure: Viewing and managing GitOps resources in Headlamp using the Flux plugin]

The AI Assistant follows a similar pattern. It adds a conversational layer to the UI that helps users understand what they are seeing, troubleshoot issues, or take action. All of this happens in the same screen where the problem appears.

[Figure: Using the AI assistant in Headlamp to understand and troubleshoot Kubernetes resources]

Building your own plugins

Plugins are optional and not limited to community-built extensions. Platform and project teams can also create their own plugins. This allows organizations to add custom integrations that match their specific workflows and internal tooling, while keeping the user experience consistent.

Choosing how and where Headlamp runs

Headlamp gives teams flexibility in how they use a Kubernetes UI. You can run it directly in a cluster, use it as a desktop application, or combine both approaches based on your needs.

Running Headlamp in-cluster works well for shared environments. It provides a centrally managed UI with controlled access and fits naturally into Kubernetes setups, following the same authentication and RBAC rules as other in-cluster components.

[Figure: Running Headlamp as an in-cluster browser-based application]

The desktop application is often a better fit for local development and onboarding. It also works well when you need to manage multiple clusters from one place. Users can connect using their existing kubeconfig without deploying anything into the cluster.

[Figure: Using Headlamp as a desktop application to manage Kubernetes clusters locally]

These options are not mutually exclusive. Many teams use the desktop app for day-to-day work, while relying on an in-cluster deployment for shared or production environments.

Preparing for the Migration

Before moving from Kubernetes Dashboard to Headlamp, it can be helpful to pause and take stock of how you use the Dashboard today. A little reflection up front can go a long way toward making the transition feel smooth and familiar.

Start by noting which clusters and namespaces you access and how authentication works. Headlamp relies on standard Kubernetes authentication and RBAC. In most cases, existing access models carry over without change. If users already connect using kubeconfig files or service accounts, they will be able to access the same resources in Headlamp.

It is also useful to think about the workflows that matter most to your team. Some users rely on Dashboard for quick inspection or troubleshooting, while others use it for lightweight edits or validation. Headlamp supports these same workflows and adds optional capabilities on top. Knowing what you rely on today makes the transition more predictable and builds confidence.

If you would like to explore Headlamp or try it out before migrating, you can learn more at headlamp.dev.

This post focused on understanding the transition and what to expect. A step-by-step migration guide is coming soon and will walk through installation and migration in detail.

01 Jun 2026 6:00pm GMT

26 May 2026

Reconciling the Past: Correcting Records for Unfixed Kubernetes CVEs

The Kubernetes project relies on transparency to empower cluster administrators and security researchers. One important way we do that is by publishing CVE records into the Common Vulnerabilities and Exposures database. As part of our ongoing effort to mature the official Kubernetes CVE Feed, we have identified some discrepancies. CVE records for a few older, unfixed issues incorrectly include a fixed version field.

The Kubernetes Security Response Committee (SRC) will correct the affected CVE records on June 1, 2026. This may result in vulnerability scanners identifying these vulnerabilities in places where they were previously not detected.

To help reduce confusion, this post provides a technical update on three vulnerabilities that were disclosed in previous years but remain unfixed: CVE-2020-8561, CVE-2020-8562, and CVE-2021-25740.

Why we are updating these records now

While these vulnerabilities have been public for several years, the recent work to generate official Open Source Vulnerabilities (OSV) files revealed that their corresponding CVE records did not accurately reflect their status. Specifically, some records suggested a fixed version existed, when in reality, these issues are architectural design trade-offs that cannot be fully remediated through code without breaking fundamental Kubernetes functionality.

Correcting these records is vital for the community for:

For completeness, we should also mention that CVE-2020-8554 is an unfixed CVE with a correct CVE record stating that it affects all versions. That record will also be updated to use a more standardized version number format.

Technical analysis of unfixed architectural risks

The following vulnerabilities will not be fixed by the Kubernetes project. GitHub issues remain the best reference for the technical mechanics of these flaws.

CVE-2020-8561: Webhook redirect in kube-apiserver

CVE-2020-8562: Proxy bypass via DNS TOCTOU

CVE-2021-25740: Cross-namespace forwarding via Endpoints

Note:

On June 1, 2026, these CVE records will be updated to correctly reflect the fact that all versions are affected. You may see them begin to appear in vulnerability scanner results.
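To see why scanner results can change, consider how a range with and without a "fixed" event is evaluated. The Python sketch below uses a simplified, OSV-style list of `introduced` and `fixed` events; the `is_affected` helper is hypothetical, the version numbers are made up for illustration, and the events are assumed to be listed in ascending order.

```python
def parse(version):
    """Turn '1.30.2' into (1, 30, 2) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def is_affected(version, events):
    """Evaluate a simplified OSV-style list of introduced/fixed events."""
    v = parse(version)
    affected = False
    for event in events:
        if "introduced" in event:
            start = event["introduced"]
            # "0" means the issue has been present since the first release.
            if start == "0" or v >= parse(start):
                affected = True
        elif "fixed" in event:
            if v >= parse(event["fixed"]):
                affected = False
    return affected


# Hypothetical before/after records; the version numbers are made up.
with_fixed = [{"introduced": "0"}, {"fixed": "1.20.0"}]
all_versions = [{"introduced": "0"}]

print(is_affected("1.33.1", with_fixed))
print(is_affected("1.33.1", all_versions))
```

With a "fixed" event present, a recent version evaluates as unaffected; once that event is removed, the same version evaluates as affected, which is what a scanner would then report.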

Required actions for administrators

The Kubernetes project recommends a secure-by-configuration approach to manage these persistent risks:

Vulnerability | Action item | Severity score (Rating) | Command / configuration
CVE-2020-8561 | Restrict Log Verbosity | 4.1 (Medium) | Ensure --v is set to < 10 and --profiling=false.
CVE-2020-8562 | Enforce DNS Consistency | 3.1 (Low) | Deploy dnsmasq or a similar caching resolver on control plane nodes.
CVE-2021-25740 | Hardened RBAC | 3.1 (Low) | kubectl auth reconcile to remove Endpoints write access from broad roles.

The RBAC action for CVE-2021-25740 applies when your cluster uses RBAC authorization mode, which is the default for clusters created with standard Kubernetes tooling. Administrators should independently test and validate these configurations in a non-production environment, assessing the architectural risks against their specific threat model and risk tolerance.
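As one way to audit the CVE-2020-8561 guidance, the Python sketch below checks a kube-apiserver argument list for the two flags in the table. The `check_apiserver_flags` helper is hypothetical, it only understands the `--flag=value` form, and it assumes that an absent `--v` means verbosity 0 and an absent `--profiling` means profiling is enabled.

```python
def check_apiserver_flags(args):
    """Return a list of findings for the CVE-2020-8561 guidance."""
    flags = {}
    for arg in args:
        if arg.startswith("--") and "=" in arg:
            key, value = arg[2:].split("=", 1)
            flags[key] = value

    findings = []
    # Flag absent: treat verbosity as 0 (an assumption of this sketch).
    verbosity = int(flags.get("v", "0"))
    if verbosity >= 10:
        findings.append(f"--v={verbosity} should be below 10")
    # Profiling must be switched off explicitly; absent counts as a finding here.
    if flags.get("profiling", "true").lower() != "false":
        findings.append("--profiling should be set to false")
    return findings


print(check_apiserver_flags(["kube-apiserver", "--v=10", "--profiling=true"]))
print(check_apiserver_flags(["kube-apiserver", "--v=2", "--profiling=false"]))
```

The first call reports both settings; the second returns an empty list. The argument list would typically come from the API server's static Pod manifest or process command line, depending on how the control plane is deployed.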

Conclusion: maturity through transparency

The effort to reconcile these records is a sign of a maturing security ecosystem. By moving away from the "patch-only" mindset and accurately documenting architectural debt, the Kubernetes project provides the community with the high-fidelity data needed to secure modern cloud native infrastructure.

We would like to thank the security researchers (QiQi Xu, Javier Provecho, and others) who identified these risks, and the SIG Security Tooling contributors who continue to refine our official feeds. Special shoutout to Rory McCune for sharing information around these CVEs through his blog posts.

Update 2026/06/01: Today, the Kubernetes SRC has updated the CVE records for CVE-2020-8554, CVE-2020-8561, CVE-2020-8562, and CVE-2021-25740.

26 May 2026 5:30pm GMT

20 May 2026

Announcing etcd 3.7.0-beta.0

SIG-Etcd announces the availability of the first beta release of etcd v3.7.0. This new version of the popular distributed database and key Kubernetes component includes the long-requested RangeStream feature, as well as a refactoring and cleanup of multiple legacy components and interfaces. v3.7 will deliver improved security, better operational reliability, and an improved experience for working with large result sets.

First, however, the project needs users to test the beta. You can find v3.7.0-beta.0 here:

Please try it out and report issues in the etcd repo.

This beta also determines the EOL of version 3.4.

RangeStream

In etcd v3.6 and earlier, it is challenging to work with requests that return large result sets. The client or requesting application is forced to wait for the full result set, leading to unpredictable latency and memory usage. The RangeStream RPC lets calling applications accept result sets in chunks, reducing latency and making buffering memory usage more predictable.
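The difference between the two access patterns can be sketched in plain Python. This is a conceptual model only, not the etcd client API: the in-memory store, the function names, and the chunk_size parameter are all assumptions made for illustration. The half-open key interval [start, end) mirrors how etcd range requests are expressed.

```python
def range_buffered(store, start, end):
    """Classic range: build the complete result list before returning anything."""
    return [(k, store[k]) for k in sorted(store) if start <= k < end]


def range_streamed(store, start, end, chunk_size=2):
    """Streaming range: yield results in bounded chunks as they are read."""
    chunk = []
    for k in sorted(store):
        if start <= k < end:
            chunk.append((k, store[k]))
            if len(chunk) == chunk_size:
                yield chunk
                chunk = []
    if chunk:
        # Final, possibly shorter, chunk.
        yield chunk


store = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5}
for chunk in range_streamed(store, "a", "f"):
    # The caller can start processing after the first chunk arrives and
    # never holds more than chunk_size entries from the stream at once.
    print(chunk)
```

The streamed variant returns the same key-value pairs as the buffered one, but the caller's peak buffer is bounded by the chunk size rather than by the size of the whole result.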

Much of the work on RangeStream was done by a relatively new contributor to etcd, Jeffrey Ying, a software engineer at Google. New contributors can have a substantial impact on etcd development.

"I've always been fascinated by database internals, and building RangeStream was a great opportunity to solve a bottleneck we were hitting in production with Kubernetes. It was the perfect opportunity to collaborate across projects and improve the ecosystem as a whole. Jumping into etcd as a new contributor had a bit of a learning curve, but the community is incredibly welcoming. The leads were very receptive to my ideas and helped me iterate quickly, while maintaining the project's high bar for reliability and code quality," said Jeffrey.

Instructions on how to use RangeStream in gRPC calls and in etcdctl can be found in the etcd documentation. Users should try it out for their own applications.

Removal of v2store

The last vestiges of etcd v2store have been removed in v3.7, making this the first release that is 100% on v3store. This includes discovery, bootstrap, v2 requests, and the v2 client. Our team has also removed multiple deprecated experimental flags.

All of these changes may create some breakage for users, particularly those who have not already updated to v3.6.11. We are interested in hearing about blockers encountered by users and dependent applications; please report anything you find that can't be remedied or needs better upgrade documentation.

etcd v3.7.0-beta.0 also includes bbolt v1.5.0 and raft v3.7.0.

3.4 EOL

According to our community support policy, we typically maintain only the latest two minor versions, currently v3.6 and v3.5. etcd v3.5 will be supported for one year after the final v3.7.0 release.

As mentioned under "extended support for v3.4" in the etcd v3.6.0 release announcement, etcd v3.4 has been EOL since May 15, 2026. SIG-etcd may release one more security patch for that version at the end of May, if warranted by patched vulnerabilities. In any case, it will cease being updated after the end of May. Users on v3.4 should be planning to upgrade their clusters.

Feedback and Future Betas

Reach the etcd contributors with your feedback about v3.7.0-beta.0 in any of the following places:

SIG-etcd may release additional betas of version v3.7.0 with additional refactoring, particularly of our use of protobuf libraries. Release candidates and the final release will probably happen through June, possibly into early July.

20 May 2026 12:00am GMT

15 May 2026

Kubernetes v1.36: New Metric for Route Sync in the Cloud Controller Manager

This article was originally published with the wrong date. It was later republished, dated the 15th of May 2026.

Kubernetes v1.36 introduces a new alpha counter metric route_controller_route_sync_total to the Cloud Controller Manager (CCM) route controller implementation at k8s.io/cloud-provider. This metric increments each time routes are synced with the cloud provider.

A/B testing watch-based route reconciliation

This metric was added to help operators validate the CloudControllerManagerWatchBasedRoutesReconciliation feature gate introduced in Kubernetes v1.35. That feature gate switches the route controller from a fixed-interval loop to a watch-based approach that only reconciles when nodes actually change. This reduces unnecessary API calls to the infrastructure provider, lowering pressure on rate-limited APIs and allowing operators to make more efficient use of their available quota.

To A/B test this, compare route_controller_route_sync_total with the feature gate disabled (default) versus enabled. In clusters where node changes are infrequent, you should see a significant drop in the sync rate with the feature gate turned on.

Example: expected behavior

With the feature gate disabled (the default fixed-interval loop), the counter increments steadily regardless of whether any node changes occurred:

# After 10 minutes with no node changes
route_controller_route_sync_total 60
# After 20 minutes, still no node changes
route_controller_route_sync_total 120

With the feature gate enabled (watch-based reconciliation), the counter only increments when nodes are actually added, removed, or updated:

# After 10 minutes with no node changes
route_controller_route_sync_total 1
# After 20 minutes, still no node changes - counter unchanged
route_controller_route_sync_total 1
# A new node joins the cluster - counter increments
route_controller_route_sync_total 2

The difference is especially visible in stable clusters where nodes rarely change.
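To turn the counter samples above into a number you can compare across the two configurations, a small Python sketch like the following computes the average sync rate between two scrapes. The function name and the (seconds, value) sample format are assumptions for illustration; the sketch also ignores counter resets caused by a controller restart.

```python
def sync_rate_per_minute(first, second):
    """Average counter increase per minute between two (seconds, value) samples."""
    (t1, v1), (t2, v2) = first, second
    if t2 <= t1:
        raise ValueError("samples must be in increasing time order")
    return (v2 - v1) * 60.0 / (t2 - t1)


# Samples from the example output: (seconds since start, counter value).
gate_disabled = sync_rate_per_minute((600, 60), (1200, 120))
gate_enabled = sync_rate_per_minute((600, 1), (1200, 1))
print(gate_disabled, gate_enabled)
```

With the example values, the fixed-interval loop averages six syncs per minute, while the watch-based controller performs none during the quiet period.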

Where can I give feedback?

If you have feedback, feel free to reach out through any of the following channels:

How can I learn more?

For more details, refer to KEP-5237.

15 May 2026 6:35pm GMT

Kubernetes v1.36: Mixed Version Proxy Graduates to Beta

Back in Kubernetes 1.28, the Mixed Version Proxy (MVP) was introduced as an Alpha feature (under the feature gate UnknownVersionInteroperabilityProxy) and described in a previous blog post. The goal was simple but critical: make cluster upgrades safer by ensuring that requests for resources not yet known to an older API server are correctly routed to a newer peer API server, instead of returning an incorrect 404 Not Found.

We are excited to announce that the Mixed Version Proxy is moving to Beta in Kubernetes 1.36 and will be enabled by default! The feature has evolved significantly since its initial release, addressing key gaps and modernizing its architecture.

Here is a look at how the feature has evolved and what you need to know to leverage it in your clusters.

What problem are we solving?

In a highly available control plane undergoing an upgrade, you often have API servers running different versions. These servers might serve different sets of APIs (Groups, Versions, Resources). Without MVP, if a client request lands on an API server that does not serve the requested resource (e.g., a new API version introduced in the upgrade), that server returns a 404 Not Found. This is technically incorrect because the resource is available in the cluster, just not on that specific server. This can lead to serious side effects, such as mistaken garbage collection or blocked namespace deletions. MVP solves this by proxying the request to a peer API server that can serve it.

sequenceDiagram
participant Client
participant API_Server_A as API Server A (Older/Different)
participant API_Server_B as API Server B (Newer/Capable)
Client->>API_Server_A: 1. Request for Resource (e.g., v2)
Note over API_Server_A: Determines it cannot serve locally
API_Server_A->>API_Server_A: 2. Looks up capable peer in Discovery Cache
API_Server_A->>API_Server_B: 3. Proxies request (adds x-kubernetes-peer-proxied header)
API_Server_B->>API_Server_B: 4. Processes request locally
API_Server_B-->>API_Server_A: 5. Returns Response
API_Server_A-->>Client: 6. Forwards Response
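The flow in the diagram can be modelled with a few lines of Python. This is a toy simulation, not kube-apiserver code: the ApiServer class, the resource strings, and the rule that an already proxied request is never proxied a second time are assumptions made for the sketch. Only the x-kubernetes-peer-proxied header name comes from the diagram.

```python
PEER_PROXIED_HEADER = "x-kubernetes-peer-proxied"


class ApiServer:
    """Toy API server that knows which resources it and its peers serve."""

    def __init__(self, name, served):
        self.name = name
        self.served = set(served)  # e.g. {"example.io/v2/widgets"}
        self.peers = []            # other ApiServer objects in the control plane

    def handle(self, resource, headers=None):
        headers = dict(headers or {})
        if resource in self.served:
            return 200, self.name  # served locally
        # Assumption: a request that was proxied once is not proxied again.
        if headers.get(PEER_PROXIED_HEADER) == "true":
            return 404, self.name
        for peer in self.peers:
            if resource in peer.served:
                headers[PEER_PROXIED_HEADER] = "true"
                return peer.handle(resource, headers)
        # No server in the cluster serves the resource, so 404 is correct here.
        return 404, self.name


old = ApiServer("old", {"example.io/v1/widgets"})
new = ApiServer("new", {"example.io/v1/widgets", "example.io/v2/widgets"})
old.peers, new.peers = [new], [old]
print(old.handle("example.io/v2/widgets"))
```

In this model the older server answers the v2 request through its newer peer instead of returning a misleading 404.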

How has it evolved since 1.28?

The initial Alpha implementation was a great proof of concept, but it had some limitations and relied on older mechanisms. Here is how we have modernized it for Beta:

  1. From StorageVersion API to Aggregated Discovery: In the Alpha version, API servers relied on the StorageVersion API to figure out which peers served which resources. While functional, this approach had a significant limitation: the StorageVersion API is not yet supported for CRDs and aggregated APIs. For Beta, we have replaced the reliance on StorageVersion API calls with the use of Aggregated Discovery. API servers now use the aggregated discovery data to dynamically understand the capabilities of their peers.

  2. The Missing Piece, Peer-Aggregated Discovery: The 1.28 blog post noted a significant gap: while we could proxy resource requests, discovery requests still only showed what the local API server knew about. In 1.36, we have added Peer-Aggregated Discovery support! Now, when a client performs discovery (e.g., listing available APIs), the API server merges its local view with the discovery data from all active peers. This provides clients with a complete, unified view of all APIs available across the entire cluster, regardless of which API server they connected to.

sequenceDiagram
participant Client
participant API_Server_A as API Server A
participant API_Server_B as API Server B
Client->>API_Server_A: 1. Request Discovery Document
API_Server_A->>API_Server_A: 2. Gets Local APIs
API_Server_A->>API_Server_B: 3. Gets Peer APIs (Cached or Direct)
API_Server_A->>API_Server_A: 4. Merges and sorts lists deterministically
API_Server_A-->>Client: 5. Returns Unified Discovery Document
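The merge step in the diagram boils down to a deterministic union. The Python sketch below shows that idea with group/version strings standing in for full discovery documents; the function name and the include_peers switch are assumptions for illustration, not the API server's implementation.

```python
def merge_discovery(local, peers, include_peers=True):
    """Union the local API list with every peer's list and sort the result."""
    merged = set(local)
    if include_peers:
        for peer_view in peers:
            merged.update(peer_view)
    # Sorting makes the answer identical no matter which server the client hit.
    return sorted(merged)


server_a = ["apps/v1", "batch/v1"]
server_b = ["apps/v1", "batch/v1", "example.io/v2"]

print(merge_discovery(server_a, [server_b]))
print(merge_discovery(server_a, [server_b], include_peers=False))
```

Because the union is sorted, asking either server yields the same document, while switching peers off returns only the local view.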

Peer-aggregated discovery is the default behavior, but only when the --peer-ca-file flag is set; otherwise the server falls back to showing only its local APIs. There may still be cases where you need to inspect only the resources served by the specific API server you are connected to. You can request this non-aggregated view by including the profile=nopeer parameter in your request's Accept header (e.g., Accept: application/json;g=apidiscovery.k8s.io;v=v2;as=APIGroupDiscoveryList;profile=nopeer).
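For illustration, here is how a client could build such a request with Python's standard library. The server address is a placeholder, authentication and TLS setup are left out, and the request is only constructed, not sent; the helper name discovery_request is hypothetical.

```python
import urllib.request

ACCEPT_AGGREGATED = (
    "application/json;g=apidiscovery.k8s.io;v=v2;as=APIGroupDiscoveryList"
)


def discovery_request(server, local_only=False):
    """Build (but do not send) an aggregated discovery request for /apis."""
    accept = ACCEPT_AGGREGATED
    if local_only:
        # Ask for the view of this API server only, without peer data.
        accept += ";profile=nopeer"
    return urllib.request.Request(server + "/apis", headers={"Accept": accept})


# Placeholder address; replace with your API server endpoint.
req = discovery_request("https://203.0.113.10:6443", local_only=True)
print(req.full_url)
print(req.get_header("Accept"))
```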

Required configuration

While the feature gate will be enabled by default, it requires certain flags to be set to allow for secure communication between peer API servers. To function correctly, make sure your API server is configured with the following flags:

Configuring with kubeadm

If you manage your cluster with kubeadm, you can configure these flags in your ClusterConfiguration file:

apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
apiServer:
  # In v1beta4, extraArgs is a list of name/value pairs
  extraArgs:
    - name: peer-ca-file
      value: "/etc/kubernetes/pki/ca.crt"
    # peer-advertise-ip and peer-advertise-port if needed

Call to action

If you are running multi-master clusters and upgrading them regularly, the Mixed Version Proxy is a major safety improvement. With it becoming default in 1.36, we encourage you to:

  1. Review your API server flags to ensure --peer-ca-file is set properly.
  2. Test the feature in your staging environments as you prepare for the 1.36 upgrade.
  3. Provide feedback to SIG API Machinery (Slack, mailing list, or by attending SIG API Machinery meetings) on your experience.

15 May 2026 6:00pm GMT

14 May 2026

Kubernetes v1.36: Deprecation and removal of Service ExternalIPs

The .spec.externalIPs field for Service was an early attempt to provide cloud-load-balancer-like functionality for non-cloud clusters. Unfortunately, the API assumes that every user in the cluster is fully trusted, and in any situation where that is not the case, it enables various security exploits, as described in CVE-2020-8554.

Since Kubernetes 1.21, the Kubernetes project has recommended that all users disable .spec.externalIPs. To make that easier, Kubernetes also added an admission controller (DenyServiceExternalIPs) that can be enabled to do this. At the time, SIG Network felt that blocking the functionality by default was too large a breaking change to consider.

However, the security problems are still there, and as a project we're increasingly unhappy with the "insecure by default" state of the feature. Additionally, there are now several better alternatives for non-cloud clusters wanting load-balancer-like functionality.

As a result, the .spec.externalIPs field for Service is now formally deprecated in Kubernetes 1.36. We expect that a future minor release of Kubernetes will drop implementation of the behavior from kube-proxy, and will update the Kubernetes conformance criteria to require that conforming implementations do not provide support.

A note on terminology, and what hasn't been deprecated

The phrase external IP is somewhat overloaded in Kubernetes:

This deprecation is about the first of those. If you are not setting the field externalIPs in any of your Services, then it does not apply to you.

That said, as a precaution, you may still want to enable the DenyServiceExternalIPs admission controller to block any future use of the externalIPs field.
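To find out whether the deprecation affects you, you can scan your Services for the field. The Python sketch below works on already parsed objects, for example the items list from kubectl get services --all-namespaces -o json; the function name is hypothetical and the sample data is made up.

```python
def services_using_external_ips(objects):
    """Return (namespace, name, ips) for every Service that sets spec.externalIPs."""
    hits = []
    for obj in objects:
        if obj.get("kind") != "Service":
            continue
        ips = obj.get("spec", {}).get("externalIPs") or []
        if ips:
            meta = obj.get("metadata", {})
            hits.append((meta.get("namespace", "default"), meta.get("name"), list(ips)))
    return hits


items = [
    {"kind": "Service", "metadata": {"name": "plain", "namespace": "web"},
     "spec": {"type": "ClusterIP"}},
    {"kind": "Service", "metadata": {"name": "my-example-service", "namespace": "web"},
     "spec": {"type": "ClusterIP", "externalIPs": ["192.0.2.4"]}},
]
print(services_using_external_ips(items))
```

An empty result suggests the deprecation does not apply to the objects you scanned.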

Alternatives to externalIPs

If you are using .spec.externalIPs, then there are several alternatives.

Consider a Service like the following:

apiVersion: v1
kind: Service
metadata:
  name: my-example-service
spec:
  type: ClusterIP
  selector:
    app.kubernetes.io/name: my-example-app
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
  externalIPs:
    - "192.0.2.4"

Using manually-managed LoadBalancer Services instead of externalIPs

The easiest (but also worst) option is to just switch from using externalIPs to using a type: LoadBalancer service, and assigning a load balancer IP by hand. This is, essentially, exactly the same as externalIPs, with one important difference: the load balancer IP is part of the Service's .status, not its .spec, and in a cluster with RBAC enabled, it can't be edited by ordinary users by default. Thus, this replacement for externalIPs would only be available to users who were given permission by the admins (although those users would then be fully empowered to replicate CVE-2020-8554; there would still not be any further checks to ensure that one user wasn't stealing another user's IPs, etc.)

Because of the way that .status works in Kubernetes, you must create the Service without a load balancer IP, and then add the IP as a second step:

$ cat loadbalancer-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: my-example-service
spec:
  # prevent any real load balancer controllers from managing this service
  # by using a non-existent loadBalancerClass
  loadBalancerClass: non-existent-class
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: my-example-app
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
$ kubectl apply -f loadbalancer-service.yaml
service/my-example-service created
$ kubectl patch service my-example-service --subresource=status --type=merge -p '{"status":{"loadBalancer":{"ingress":[{"ip":"192.0.2.4"}]}}}'

Using a non-cloud based load balancer controller

Although LoadBalancer services were originally designed to be backed by cloud load balancers, Kubernetes can also support them on non-cloud platforms by using a third-party load balancer controller such as MetalLB. This solves the security problems associated with externalIPs because the administrator can configure what ranges of IP addresses the controller will assign to services, and the controller will ensure that two services can't both use the same IP.

So, for example, after installing and configuring MetalLB, a cluster administrator could configure a pool of IP addresses for use in the cluster:

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: production
  namespace: metallb-system
spec:
  addresses:
    - 192.0.2.0/24
  autoAssign: true
  avoidBuggyIPs: false

After that, a user can create a type: LoadBalancer Service and MetalLB will handle the assignment of the IP address. MetalLB even supports the deprecated loadBalancerIP field in Service, so the end user can request a specific IP (assuming it is available) for backward-compatibility with the externalIPs approach, rather than being assigned one at random:

apiVersion: v1
kind: Service
metadata:
  name: my-example-service
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: my-example-app
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
  loadBalancerIP: "192.0.2.4"

Similar approaches would work with other load balancer controllers. This approach can allow cluster administrators to have control over which IP addresses are assigned, rather than users.
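The two guarantees described in this section (addresses come only from an administrator-defined range, and no address is handed to two Services) can be illustrated with a toy allocator in Python. This is not how MetalLB is implemented; the IPPool class and its methods are assumptions made purely to show the idea.

```python
import ipaddress


class IPPool:
    """Toy address pool: hands out IPs from one CIDR, never the same IP twice."""

    def __init__(self, cidr):
        self.network = ipaddress.ip_network(cidr)
        self.assigned = {}  # ip address -> name of the Service holding it

    def assign(self, service, requested=None):
        if requested is not None:
            candidates = [ipaddress.ip_address(requested)]
        else:
            # Iterating the network includes .0 and .255, like avoidBuggyIPs: false.
            candidates = self.network
        for ip in candidates:
            if ip not in self.network:
                raise ValueError("%s is outside the pool %s" % (ip, self.network))
            if ip not in self.assigned:
                self.assigned[ip] = service
                return str(ip)
        raise ValueError("no free address for %s" % service)


pool = IPPool("192.0.2.0/24")
print(pool.assign("my-example-service", requested="192.0.2.4"))
```

A second Service asking for 192.0.2.4, or any Service asking for an address outside 192.0.2.0/24, gets an error instead of an address, which is exactly the check that externalIPs lacks.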

Using Gateway API

Another potential solution is to use an implementation of the Gateway API.

Gateway API allows cluster administrators to define a Gateway resource, which can have an IP address attached to it via the .spec.addresses field. Since Gateway resources are designed to be managed by cluster administrators, RBAC rules can be put in place to only allow privileged users to manage them.

An example of how this could look is:

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: example-gateway
spec:
  gatewayClassName: example-gateway-class
  addresses:
    - type: IPAddress
      value: "192.0.2.4"
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: example-route
spec:
  parentRefs:
    - name: example-gateway
  rules:
    - backendRefs:
        - name: example-svc
          port: 80
---
apiVersion: v1
kind: Service
metadata:
  name: example-svc
spec:
  type: ClusterIP
  selector:
    app.kubernetes.io/name: example-app
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080

Gateway API is the next generation of Ingress, load balancing, and service mesh APIs within Kubernetes. It was designed to fix the shortcomings of the Service and Ingress resources, making it a robust solution that is under active development.

Timeline for externalIPs deprecation

The rough timeline for this deprecation is as follows:

  1. With the release of Kubernetes 1.36, the field was deprecated; Kubernetes now emits warnings when a user uses this field.
  2. About a year later (v1.40 at the earliest), support for .spec.externalIPs will be disabled in kube-proxy, but users will have a way to opt back in should they require more time to migrate away.
  3. About another year later (v1.43 at the earliest), support will be disabled completely; users won't have a way to opt back in.

14 May 2026 6:35pm GMT

13 May 2026

Kubernetes v1.36: Advancing Workload-Aware Scheduling

AI/ML and batch workloads introduce unique scheduling challenges that go beyond simple Pod-by-Pod scheduling. In Kubernetes v1.35, we introduced the first tranche of workload-aware scheduling improvements, featuring the foundational Workload API alongside basic gang scheduling support built on a Pod-based framework, and an opportunistic batching feature to efficiently process identical Pods.

Kubernetes v1.36 introduces a significant architectural evolution by cleanly separating API concerns: the Workload API acts as a static template, while the new PodGroup API handles the runtime state. To support this, the kube-scheduler features a new PodGroup scheduling cycle that enables atomic workload processing and paves the way for future enhancements. This release also debuts the first iterations of topology-aware scheduling and workload-aware preemption to advance scheduling capabilities. Additionally, ResourceClaim support for workloads unlocks Dynamic Resource Allocation (DRA) for PodGroups. Finally, to demonstrate real-world readiness, v1.36 delivers the first phase of integration between the Job controller and the new API.

Workload and PodGroup API updates

The Workload API now serves as a static template, while the new PodGroup API describes the runtime object. Kubernetes v1.36 introduces the Workload and PodGroup APIs as part of the scheduling.k8s.io/v1alpha2 API group, completely replacing the previous v1alpha1 API version.

In v1.35, Pod groups and their runtime states were embedded within the Workload resource. The new model decouples these concepts: the Workload now serves as a static template object, while the PodGroup manages the runtime state. This separation also improves performance and scalability as the PodGroup API allows per-replica sharding of status updates.

Because the Workload API acts merely as a template, the kube-scheduler's logic is streamlined. The scheduler can directly read the PodGroup, which contains all the information required by the scheduler, without needing to watch or parse the Workload object itself.

Here is what the updated configuration looks like. Workload controllers (such as the Job controller) define the Workload object, which now acts as a static template for your Pod groups:

apiVersion: scheduling.k8s.io/v1alpha2
kind: Workload
metadata:
  name: training-job-workload
  namespace: some-ns
spec:
  # Pod groups are now defined as templates,
  # which contain the PodGroup objects' spec fields.
  podGroupTemplates:
    - name: workers
      schedulingPolicy:
        gang:
          # The gang is schedulable only if 4 pods can run at once
          minCount: 4

Controllers then stamp out runtime PodGroup instances based on those templates. The PodGroup runtime object holds the actual scheduling policy and references the template from which it was created. It also has a status containing conditions that mirror the states of individual Pods, reflecting the overall scheduling state of the group:

apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup
metadata:
  name: training-job-workers-pg
  namespace: some-ns
spec:
  # The PodGroup references the Workload template it originated from.
  # In comparison, .metadata.ownerReferences points to the "true" workload object,
  # e.g., a Job.
  podGroupTemplateRef:
    workload:
      workloadName: training-job-workload
      podGroupTemplateName: workers
  # The actual scheduling policy is placed inside the runtime PodGroup
  schedulingPolicy:
    gang:
      minCount: 4
status:
  # The status contains conditions mirroring individual Pod conditions.
  conditions:
    - type: PodGroupScheduled
      status: "True"
      lastTransitionTime: 2026-04-03T00:00:00Z

Finally, to bridge this new architecture with individual Pods, the workloadRef field in the Pod API has been replaced with the schedulingGroup field. When creating Pods, you link them directly to the runtime PodGroup:

apiVersion: v1
kind: Pod
metadata:
  name: worker-0
  namespace: some-ns
spec:
  # The workloadRef field has been replaced by schedulingGroup
  schedulingGroup:
    podGroupName: training-job-workers-pg
  ...

By keeping the Workload as a static template and elevating the PodGroup to a first-class, standalone API, we establish a robust foundation for building advanced workload scheduling capabilities in future Kubernetes releases.

PodGroup scheduling cycle and gang scheduling

To efficiently manage these workloads, the kube-scheduler now features a dedicated PodGroup scheduling cycle. Instead of evaluating and reserving resources sequentially Pod-by-Pod, which risks scheduling deadlocks, the scheduler evaluates the group as a unified operation.

When the scheduler pops a PodGroup member from the scheduling queue, regardless of the group's specific policy, it fetches the rest of the queued Pods for that group, sorts them deterministically, and executes an atomic scheduling cycle as follows:

  1. The scheduler takes a single snapshot of the cluster state to prevent race conditions and ensure consistency while evaluating the entire group.

  2. It then attempts to find valid Node placements for all Pods in the group using a PodGroup scheduling algorithm, which leverages the standard Pod-based filtering and scoring phases.

  3. Based on the algorithm's outcome, the scheduling decision is applied atomically for the entire PodGroup.

    • Success: If the placement is found and group constraints are met, the schedulable member Pods are moved directly to the binding phase together. Any remaining unschedulable Pods are returned to the scheduling queue to wait for available resources so they can join the already scheduled Pods.

      (Note: If new Pods are added to a PodGroup after others are already scheduled, the cycle evaluates the new Pods while accounting for the existing ones. Crucially, Pods already assigned to Nodes remain running. The scheduler will not unassign or evict them, even if the group fails to meet its requirements in subsequent cycles.)

    • Failure: If the group fails to meet its requirements, the entire group is considered unschedulable. None of the Pods are bound, and they are returned to the scheduling queue to retry later after a backoff period.

This cycle acts as the foundation for gang scheduling. When your workload requires strict all-or-nothing placement, the gang policy leverages this cycle to prevent partial deployments that lead to resource wastage and potential deadlocks.

While the scheduler still holds the Pods at the PreEnqueue stage until the minCount requirement is met, the actual scheduling phase now relies entirely on the new PodGroup cycle. Specifically, during the algorithm's execution, the scheduler verifies that the number of schedulable Pods satisfies the minCount. If the cluster cannot accommodate the required minimum, none of the Pods are bound. The group fails and waits for sufficient resources to free up.
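The snapshot, place, and commit-or-fail structure of the cycle described above can be sketched in Python. This is a deliberately simplified model, not kube-scheduler code: it tracks a single CPU number per node, uses first-fit instead of real filtering and scoring, and the function name schedule_pod_group is hypothetical.

```python
def schedule_pod_group(pods, free_cpu, min_count):
    """pods: list of (name, cpu); free_cpu: {node: free CPU}.

    Returns (bindings, remaining_free_cpu). Bindings are empty on failure.
    """
    snapshot = dict(free_cpu)          # 1. one snapshot for the whole group
    placements = {}
    for name, cpu in sorted(pods):     # deterministic processing order
        for node in sorted(snapshot):  # 2. first node with room (scoring omitted)
            if snapshot[node] >= cpu:
                snapshot[node] -= cpu
                placements[name] = node
                break
    if len(placements) < min_count:    # 3. failure: nothing is bound
        return {}, dict(free_cpu)
    return placements, snapshot        # 3. success: bind the placed Pods together


pods = [("worker-%d" % i, 2) for i in range(4)]
print(schedule_pod_group(pods, {"node-a": 4, "node-b": 4}, min_count=4))
print(schedule_pod_group(pods, {"node-a": 4, "node-b": 2}, min_count=4))
```

In the second call only three of the four Pods fit, so the gang requirement fails and the cluster state is left untouched. With a lower min_count the three placeable Pods would be bound and the fourth would go back to the queue.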

Limitations

The first version of the PodGroup scheduling cycle comes with certain limitations:

In addition to the above, for cases involving intra-group dependencies (e.g., when the schedulability of one Pod depends on another group member via inter-Pod affinity), this algorithm may fail to find a placement regardless of cluster state due to its deterministic processing order.

Topology-aware scheduling

For complex distributed workloads like AI/ML training or batch processing, placing Pods randomly across a cluster can introduce significant network latency and bottleneck overall performance.

Topology-aware scheduling addresses this problem by allowing you to define topology constraints directly on a PodGroup, ensuring its Pods are co-located within specific physical or logical domains:

apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup
metadata:
  name: topology-aware-workers-pg
spec:
  schedulingPolicy:
    gang:
      minCount: 4
  # Enforce that the pods are co-located based on the rack topology
  schedulingConstraints:
    topology:
      - key: topology.kubernetes.io/rack

In this example, the kube-scheduler attempts to schedule the Pods across various combinations of Nodes that match the rack topology constraint. It then selects the optimal placement based on how efficiently the PodGroup utilizes resources and how many Pods can successfully be scheduled within that domain.

To achieve this, the scheduler extends the PodGroup scheduling cycle with a dedicated placement-based algorithm consisting of three phases:

  1. Generate candidate placements (subsets of Nodes that are theoretically feasible for the PodGroup's assignment) based on the group's scheduling constraints. The topology-aware scheduling plugin uses the new PlacementGenerate extension point to create these placements.

  2. Evaluate each proposed placement to confirm whether the entire PodGroup can actually fit there.

  3. Score all feasible placements to select the best fit for the PodGroup. The topology-aware scheduling plugins use the new PlacementScore extension point to score these placements.
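The three phases can be mirrored in a compact Python sketch. Everything here is an assumption made for illustration: nodes are plain dictionaries, all Pods request the same amount of CPU, and the score (more Pods placed first, then less leftover CPU) is only one plausible reading of "best fit". It does not reflect the actual plugin code.

```python
def pick_domain(nodes, pod_cpu, pod_count, min_count, topology_key):
    """Return the topology domain chosen for the group, or None if none is feasible."""
    # Phase 1: generate candidate placements, one per topology label value.
    domains = {}
    for node in nodes:
        value = node["labels"].get(topology_key)
        if value is not None:
            domains.setdefault(value, []).append(node)

    best = None
    for value in sorted(domains):
        members = domains[value]
        # Phase 2: evaluate how many Pods actually fit in this domain.
        fits = sum(n["free_cpu"] // pod_cpu for n in members)
        placed = min(pod_count, fits)
        if placed < min_count:
            continue
        # Phase 3: score feasible placements (assumed scoring, see lead-in).
        leftover = sum(n["free_cpu"] for n in members) - placed * pod_cpu
        score = (placed, -leftover)
        if best is None or score > best[0]:
            best = (score, value)
    return None if best is None else best[1]


RACK = "topology.kubernetes.io/rack"
nodes = [
    {"name": "n1", "labels": {RACK: "rack-a"}, "free_cpu": 4},
    {"name": "n2", "labels": {RACK: "rack-a"}, "free_cpu": 4},
    {"name": "n3", "labels": {RACK: "rack-b"}, "free_cpu": 2},
    {"name": "n4", "labels": {RACK: "rack-b"}, "free_cpu": 8},
    {"name": "n5", "labels": {}, "free_cpu": 16},
]
print(pick_domain(nodes, pod_cpu=2, pod_count=4, min_count=4, topology_key=RACK))
```

Both racks can hold the four Pods, but rack-a is filled exactly, so the assumed score prefers it. The unlabelled node is never considered because it belongs to no rack.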

Currently, topology-aware scheduling does not trigger Pod preemption to satisfy constraints. However, we plan to integrate workload-aware preemption with topology constraints in the upcoming release.

While Kubernetes v1.36 delivers this foundational topology-aware scheduling, the Kubernetes project is planning to expand its capabilities soon. Future updates will introduce support for multiple topology levels, soft constraints (preferences), deeper integration with Dynamic Resource Allocation (DRA), and more robust behavior when paired with the basic scheduling policy.

Workload-aware preemption

To support the new PodGroup scheduling cycle, Kubernetes v1.36 introduces a new type of preemption mechanism called workload-aware preemption. When a PodGroup cannot be scheduled, the scheduler uses this mechanism to try to free enough resources for the PodGroup to be scheduled.

Compared to the default preemption used in the standard Pod-by-Pod scheduling cycle, this new mechanism treats the entire PodGroup as a single preemptor unit. Instead of evaluating preemption victims on each Node separately, it searches across the entire cluster. This allows the scheduler to preempt Pods from multiple Nodes simultaneously, making enough space to schedule the whole PodGroup afterwards.

Workload-aware preemption also introduces two additional concepts directly to the PodGroup API:

In Kubernetes v1.36, these fields are only respected by the workload-aware preemption mechanism. The people working on this set of features are hoping to extend support for these fields to other disruption sources, including default preemption used in the Pod-by-Pod scheduling cycle, in future releases.

apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup
metadata:
  name: victim-pg
spec:
  priorityClassName: high-priority
  priority: 1000
  disruptionMode: PodGroup

In this example, when the scheduler evaluates victim-pg as a potential preemption victim during a workload-aware preemption cycle, it will use 1000 as its priority and preempt the PodGroup in a strictly all-or-nothing fashion.
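To make the all-or-nothing idea concrete, the Python sketch below picks victims across the whole cluster, treating a PodGroup with disruptionMode: PodGroup as a single unit that is either preempted entirely or left alone. The greedy lowest-priority-first selection, the data layout, and the function name are assumptions for illustration; the real mechanism is more involved.

```python
def select_victims(units, needed_cpu, preemptor_priority):
    """units: list of {"name", "priority", "cpu"} collected cluster-wide.

    A PodGroup with disruptionMode PodGroup appears as one unit whose cpu
    is the total of all its Pods. Returns the chosen unit names, or [].
    """
    candidates = [u for u in units if u["priority"] < preemptor_priority]
    chosen, freed = [], 0
    # Assumption: prefer the lowest-priority units first, name as tie-breaker.
    for unit in sorted(candidates, key=lambda u: (u["priority"], u["name"])):
        if freed >= needed_cpu:
            break
        chosen.append(unit["name"])
        freed += unit["cpu"]
    # Preempt nothing unless the whole preemptor group would fit afterwards.
    return chosen if freed >= needed_cpu else []


units = [
    {"name": "pod-x", "priority": 100, "cpu": 2},
    {"name": "victim-pg", "priority": 1000, "cpu": 8},
    {"name": "pod-y", "priority": 5000, "cpu": 4},
]
print(select_victims(units, needed_cpu=6, preemptor_priority=2000))
```

Here victim-pg is evicted as a whole even though only part of its CPU is needed, and pod-y is never considered because its priority exceeds the preemptor's.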

DRA ResourceClaim support for workloads

Since its general availability in Kubernetes v1.34, DRA has enabled Pods to make detailed requests for devices like GPUs, TPUs, and NICs. Requested devices can be shared by multiple Pods requesting the same ResourceClaim by name. Other requests can be replicated through a ResourceClaimTemplate, in which Kubernetes generates one ResourceClaim with a non-deterministic name for each Pod referencing the template. However, large-scale workloads that require certain Pods to share certain devices are currently left to manage creating individual ResourceClaims themselves.

Now, in addition to Pods, PodGroups can represent the replicable unit for a ResourceClaimTemplate. For ResourceClaimTemplates referenced by one of a PodGroup's spec.resourceClaims, Kubernetes generates one ResourceClaim for the entire PodGroup, no matter how many Pods are in the group. When one of a Pod's spec.resourceClaims for a ResourceClaimTemplate matches one of its PodGroup's spec.resourceClaims, the Pod's claim resolves to the ResourceClaim generated for the PodGroup and a ResourceClaim will not be generated for that individual Pod. A single PodGroupTemplate in a Workload object can express resource requests which are both copied for each distinct PodGroup and shareable by the Pods within each group.

The following example shows two Pods requesting the same ResourceClaim generated from a ResourceClaimTemplate for their PodGroup:

apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup
metadata:
  name: training-job-workers-pg
spec:
  ...
  resourceClaims:
    - name: pg-claim
      resourceClaimTemplateName: my-claim-template
---
apiVersion: v1
kind: Pod
metadata:
  name: topology-aware-workers-pg-pod-1
spec:
  ...
  schedulingGroup:
    podGroupName: training-job-workers-pg
  resourceClaims:
    - name: pg-claim
      resourceClaimTemplateName: my-claim-template
---
apiVersion: v1
kind: Pod
metadata:
  name: topology-aware-workers-pg-pod-2
spec:
  ...
  schedulingGroup:
    podGroupName: training-job-workers-pg
  resourceClaims:
    - name: pg-claim
      resourceClaimTemplateName: my-claim-template
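The resolution rule can be summarised with a short Python sketch. It assumes that a Pod claim "matches" a PodGroup claim when both the claim name and the template name are equal; that matching rule, the flattened data layout, and the function name are illustrative assumptions rather than the controller's actual logic.

```python
def resolve_claims(pod_group, pods):
    """Map (pod name, claim name) to the owner of the generated ResourceClaim."""
    group_claims = {
        (c["name"], c["resourceClaimTemplateName"])
        for c in pod_group.get("resourceClaims", [])
    }
    resolved = {}
    for pod in pods:
        for claim in pod.get("resourceClaims", []):
            key = (claim["name"], claim["resourceClaimTemplateName"])
            if key in group_claims:
                # Shared: one ResourceClaim for the whole PodGroup.
                owner = "PodGroup/" + pod_group["name"]
            else:
                # Not matched: a ResourceClaim is generated for this Pod alone.
                owner = "Pod/" + pod["name"]
            resolved[(pod["name"], claim["name"])] = owner
    return resolved


shared = {"name": "pg-claim", "resourceClaimTemplateName": "my-claim-template"}
group = {"name": "training-job-workers-pg", "resourceClaims": [shared]}
pods = [
    {"name": "pod-1", "resourceClaims": [shared]},
    {"name": "pod-2", "resourceClaims": [shared]},
]
print(resolve_claims(group, pods))
```

Both Pods resolve to the same PodGroup-owned claim, so only one ResourceClaim is generated regardless of how many Pods join the group.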

In addition, ResourceClaims referenced by PodGroups, either through resourceClaimName or the claim generated from resourceClaimTemplateName, become reserved for the entire PodGroup. Previously, kube-scheduler could only list individual Pods in a ResourceClaim's status.reservedFor field which is limited to 256 items. Now, a single PodGroup reference in status.reservedFor can represent many more than 256 Pods, allowing high-cardinality sharing of devices.

Together, these changes enable massive workloads with complex topologies to utilize DRA for scalable device management.

Integration with the Job controller

In Kubernetes v1.36, the Job controller can create and manage Workload and PodGroup objects on your behalf, so that Jobs representing a tightly coupled parallel application, such as distributed AI training, are gang-scheduled without any additional tooling. Without this integration, you would have to create the Workload and PodGroup yourself and wire their references into the Pod template. Now, the Job controller automates this process natively.

When the WorkloadWithJob feature gate is enabled, the Job controller automatically:

When does the integration kick in?

To keep the first feature iteration predictable, the Job controller only creates a Workload and PodGroup when the Job has a well-defined, fixed shape:

These conditions describe the class of Jobs that gang scheduling can reason about: each Pod has a stable identity (Indexed), the gang size is known and fixed at admission time (parallelism == completions), and no other controller has already claimed scheduling responsibility (schedulingGroup field is unset). Jobs that do not meet these conditions are scheduled Pod-by-Pod, exactly as before.

If you set schedulingGroup on the Pod template yourself (for example, because a higher-level controller is managing the workload), the Job controller leaves the Pod template alone and does not create its own Workload or PodGroup. This makes the feature safe to enable in clusters that already use an external batch system.

Here is an example of a Job that qualifies for gang scheduling:

apiVersion: batch/v1
kind: Job
metadata:
  name: training-job
  namespace: job-ns
spec:
  completionMode: Indexed
  parallelism: 4
  completions: 4
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: registry.example/trainer:latest

The Job controller creates a Workload and a PodGroup owned by this Job, and every Pod it creates carries a .spec.schedulingGroup that points at the generated PodGroup. The Pods are then scheduled together once all four can be placed at the same time using the PodGroup scheduling cycle described earlier in this post.
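
The eligibility conditions summarized earlier (Indexed completion mode, parallelism equal to completions, and no schedulingGroup on the Pod template) can be expressed as a small predicate. This is a sketch in Python over a plain dictionary, assuming those three conditions are the whole check; the function name is hypothetical and the real Job controller may apply further validation.

```python
def qualifies_for_gang_scheduling(job):
    """Return True if a Job manifest (as a dict) has the fixed shape described above."""
    spec = job.get("spec", {})
    pod_spec = spec.get("template", {}).get("spec", {})

    indexed = spec.get("completionMode") == "Indexed"
    parallelism = spec.get("parallelism")
    fixed_size = parallelism is not None and parallelism == spec.get("completions")
    # A schedulingGroup set by someone else means another controller owns scheduling.
    unclaimed = "schedulingGroup" not in pod_spec

    return indexed and fixed_size and unclaimed


training_job = {
    "spec": {
        "completionMode": "Indexed",
        "parallelism": 4,
        "completions": 4,
        "template": {"spec": {"restartPolicy": "Never"}},
    }
}
print(qualifies_for_gang_scheduling(training_job))
```

A Job with parallelism 2 and completions 4, or one whose Pod template already sets schedulingGroup, would fail the predicate and be scheduled Pod-by-Pod.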

What's not covered yet

The current constraints limit this integration to static, indexed, fully-parallel Jobs. Support for additional workload shapes, including elastic Jobs and other built-in controllers, is tracked in KEP-5547.

In future Kubernetes releases, this integration will expand to support additional workload controllers, and the current constraints for Jobs may be relaxed.

What's next?

The journey for workload-aware scheduling doesn't stop here. For v1.37, the community is actively working on:

The priority and implementation order of these focus areas are subject to change. Stay tuned for further updates.

Getting started

All of the workload-aware scheduling improvements listed below are available as Alpha features in v1.36. To try them out, you must configure the following:

Once the prerequisite is met, you can enable specific features:

We encourage you to try out workload-aware scheduling in your test clusters and share your experiences to help shape the future of Kubernetes scheduling. You can send your feedback by:

Learn more

To dive deeper into the architecture and design of these features, read the KEPs:

13 May 2026 6:35pm GMT

12 May 2026

Kubernetes v1.36: PSI Metrics for Kubernetes Graduates to GA

Since its original implementation in the Linux kernel in 2018, Pressure Stall Information (PSI) has provided users with the high-fidelity signals needed to identify resource saturation before it becomes an outage. Unlike traditional utilization metrics, PSI reports how long tasks were stalled waiting for a resource, expressed as a percentage of time for CPU, memory, and I/O.
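
To make the shape of the signal concrete, here is a minimal Python sketch that parses the text format the Linux kernel exposes in /proc/pressure/cpu, /proc/pressure/memory, and /proc/pressure/io. The parse_psi function is a hypothetical helper and the sample numbers are illustrative, not measurements.

```python
def parse_psi(text):
    """Parse kernel PSI output into {"some": {...}, "full": {...}}."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        values = dict(f.split("=", 1) for f in fields)
        result[kind] = {
            # Percentage of time stalled over the last 10, 60, and 300 seconds.
            "avg10": float(values["avg10"]),
            "avg60": float(values["avg60"]),
            "avg300": float(values["avg300"]),
            # Cumulative stall time in microseconds.
            "total_us": int(values["total"]),
        }
    return result


SAMPLE = """\
some avg10=1.53 avg60=0.87 avg300=0.22 total=4512345
full avg10=0.40 avg60=0.11 avg300=0.03 total=982113
"""

psi = parse_psi(SAMPLE)
print(psi["some"]["avg10"], psi["full"]["total_us"])
```

The "some" line counts time in which at least one task was stalled; the "full" line counts time in which all non-idle tasks were stalled at once.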

With the recent release of Kubernetes v1.36, users across the ecosystem have a stable, reliable interface to observe resource contention at the node, pod, and container levels. In this post, we will dive into the improvements and performance testing that proved its readiness for production.

Beyond utilization: why PSI?

Monitoring CPU or memory usage alone can be misleading. A node may report CPU utilization well below 100% while certain tasks are experiencing severe latency due to scheduling delays. PSI fills this gap by providing:

Proving stability: performance testing at scale

A common concern when graduating telemetry features is the resource overhead required to collect and serve the metrics. To address this, SIG Node conducted extensive performance validation on high-density workloads (80+ pods) across various machine types.

Our testing focused on two primary scenarios to isolate the impact of the Kubelet and kernel-level collection respectively:

  1. Kernel PSI ON / Kubelet Feature OFF vs Kernel PSI ON / Kubelet Feature ON (Kubelet overhead)
  2. Kernel PSI OFF / Kubelet Feature ON vs Kernel PSI ON / Kubelet Feature ON (Kernel overhead)

Scenario 1: The Kubelet Overhead

First, we looked at kubelet CPU usage on 4-core machines (Case 1). On both clusters the Linux kernel was already tracking pressure by default (psi=1), and we toggled the KubeletPSI feature gate to see whether the kubelet actively querying and exposing these metrics affected resource usage. The synchronized bursts seen in the graph are practically identical in both magnitude and frequency, confirming that the kubelet's collection logic is lightweight and blends into standard housekeeping cycles. Enabling the feature did not measurably change the pre-existing resource use: kubelet CPU usage stayed within the normal 0.1 cores, or 2.5% of total node capacity, so the feature is safe for production-scale deployments.

Figure 1: (Case 1) Kubelet CPU usage rate comparison. A line graph comparing the kubelet CPU usage rate over elapsed time with the Kubelet PSI feature turned off versus on, with kernel PSI always on.

Next, we evaluated the system overhead in the same run. As seen in the following graph, the system CPU usage of the cluster with Kubelet PSI enabled (red) follows the same pattern as that of the cluster with Kubelet PSI disabled (blue), with a slight, expected increase over the baseline. This shows that once the OS is already tracking PSI, with system CPU usage at around 2.5 cores, the cost of Kubernetes reading those cgroup metrics is negligible.

Figure 2: (Case 1) Node system CPU usage rate comparison. A line graph comparing the system CPU usage rate over elapsed time with the Kubelet PSI feature turned off versus on, with kernel PSI on by default.

Scenario 2: The Kernel Overhead

Shifting gears, we evaluated the underlying overhead of enabling PSI in the Linux kernel, again on a 4-core machine. By comparing a cluster booted with psi=1 (the COS default) against a cluster booted with psi=0, we isolated the cost of the OS-level bookkeeping. Even under heavy I/O and CPU load at an 80-pod density, the system CPU delta between the kernel-enabled and kernel-disabled clusters remained consistently between 0.037 and 0.125 cores, or 0.925% to 3.125% of total node capacity. There was a single spike to 0.225 cores, or 5.6%, but usage returned to the normal range within a few seconds. This confirms that the internal kernel tracking is highly efficient under load.
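
The percentages quoted above follow directly from dividing each core count by the 4 cores of the test machines. The short Python check below reproduces that arithmetic; the helper name is hypothetical.

```python
NODE_CORES = 4  # the test machines described above have 4 cores


def share_of_node(cores, node_cores=NODE_CORES):
    """Express a CPU usage figure in cores as a percentage of node capacity."""
    return cores / node_cores * 100


for label, cores in [
    ("delta, low end", 0.037),
    ("delta, high end", 0.125),
    ("single spike", 0.225),
]:
    print(f"{label}: {cores} cores = {share_of_node(cores):.3f}% of the node")
```

The spike works out to 5.625%, which the text rounds to 5.6%.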

Figure 3: (Case 2) Node system CPU usage rate comparison. A line graph comparing the node system (kernel) CPU usage rate with kernel PSI on and off over elapsed time.

Figure 4 zooms in on the kubelet process itself, which serves as the primary collector for these metrics. The results show that even while the kubelet performs periodic sweeps to aggregate data from the cgroup hierarchy, its CPU usage remains low: the spikes are indistinguishable between the two configurations, and nothing exceeds 0.25 cores, or 6.25% of total capacity, for longer than a second.

Figure 4: (Case 2) Kubelet CPU usage rate comparison. A line graph comparing the kubelet CPU usage rate over elapsed time with kernel PSI turned off versus on.

Improvements between beta (1.34) and stable (1.36)

Getting started

To use PSI metrics in your Kubernetes cluster, your nodes must meet the following requirements:

  1. Ensure your nodes are running a Linux kernel version 4.20 or later and are using cgroup v2.
  2. Ensure PSI is enabled at the OS level (your kernel must be compiled with CONFIG_PSI=y and must not be booted with the psi=0 parameter).

As of v1.36, Kubelet PSI metrics are generally available and you do not need to opt in to any feature gate.

Once the OS prerequisites are met, you can start scraping the /metrics/cadvisor endpoint with your Prometheus-compatible monitoring solution or query the Summary API to collect and visualize the new PSI metrics. Note that PSI is a Linux-kernel feature, so these metrics are not available on Windows nodes. Your cluster can contain a mix of Linux and Windows nodes, and on the Windows nodes, the kubelet will simply omit the PSI metrics.

If your cluster is running a recent enough version of Kubernetes and you are a privileged node administrator, you can also proxy to the kubelet's HTTP API via the control plane's API server to see real-time pressure data from the Summary API.

Caution: Proxying to the kubelet is a privileged operation. Granting access to it is a security risk, so ensure you have the appropriate administrative permissions before executing these commands.

CONTAINER_NAME="example-container"
kubectl get --raw "/api/v1/nodes/$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')/proxy/stats/summary" | jq '.pods[].containers[] | select(.name=="'"$CONTAINER_NAME"'") | {name, cpu: .cpu.psi, memory: .memory.psi, io: .io.psi}'
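
If you would rather post-process the response in a script than in jq, the same selection can be sketched in Python. The container_psi function is a hypothetical helper that mirrors the jq filter above; the sample payload is trimmed and illustrative, and the inner layout of the psi object is an assumption rather than a reference for the Summary API.

```python
import json


def container_psi(summary, container_name):
    """Collect the PSI blocks of every container with the given name."""
    matches = []
    for pod in summary.get("pods", []):
        for container in pod.get("containers", []):
            if container.get("name") == container_name:
                matches.append({
                    "name": container["name"],
                    # Missing sections become None, like null in the jq output.
                    "cpu": container.get("cpu", {}).get("psi"),
                    "memory": container.get("memory", {}).get("psi"),
                    "io": container.get("io", {}).get("psi"),
                })
    return matches


# Illustrative payload; real responses contain many more fields.
SAMPLE = json.loads("""
{"pods": [{"containers": [
  {"name": "example-container",
   "cpu": {"psi": {"some": {"avg10": 0.5}}},
   "memory": {"psi": {"some": {"avg10": 0.0}}}},
  {"name": "sidecar", "cpu": {}}
]}]}
""")

print(json.dumps(container_psi(SAMPLE, "example-container"), indent=2))
```

In practice you would feed the function the JSON returned by the kubectl get --raw command shown above instead of SAMPLE.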

Further reading

If you want to dive deeper into how these metrics are calculated and exposed, check out these resources:

  1. The official Kernel documentation
  2. Understanding PSI in the Kubernetes documentation
  3. cAdvisor Metrics Implementation

Acknowledgements

Support for PSI metrics was developed through the collaborative efforts of SIG Node. Special thanks to all contributors who helped design, implement, test, review, and document this feature across its journey from alpha in v1.33, through beta in v1.34, to GA in v1.36.

To provide feedback on this feature, join the Kubernetes Node Special Interest Group, participate in discussions on the public Slack channel (#sig-node), or file an issue on GitHub.

Feedback

If you have feedback and want to share your experience using this feature, join the discussion:

SIG Node would love to hear about your experiences using this feature in production!

12 May 2026 6:35pm GMT

08 May 2026

Kubernetes v1.36: Moving Volume Group Snapshots to GA

Volume group snapshots were introduced as an Alpha feature with the Kubernetes v1.27 release, moved to Beta in v1.32, and to a second Beta in v1.34. We are excited to announce that in the Kubernetes v1.36 release, support for volume group snapshots has reached General Availability (GA).

The support for volume group snapshots relies on a set of extension APIs for group snapshots. These APIs allow users to take crash-consistent snapshots for a set of volumes. Behind the scenes, Kubernetes uses a label selector to group multiple PersistentVolumeClaim objects for snapshotting. A key aim is to allow you to restore that set of snapshots to new volumes and recover your workload based on a crash-consistent recovery point.

This feature is only supported for CSI volume drivers.

An overview of volume group snapshots

Some storage systems provide the ability to create a crash-consistent snapshot of multiple volumes. A group snapshot represents copies made from multiple volumes that are taken at the same point-in-time. A group snapshot can be used either to rehydrate new volumes (pre-populated with the snapshot data) or to restore existing volumes to a previous state (represented by the snapshots).

Why add volume group snapshots to Kubernetes?

The Kubernetes volume plugin system already provides a powerful abstraction that automates the provisioning, attaching, mounting, resizing, and snapshotting of block and file storage. Underpinning all these features is the Kubernetes goal of workload portability.

There was already a VolumeSnapshot API that provides the ability to take a snapshot of a persistent volume to protect against data loss or data corruption. However, some storage systems support consistent group snapshots that allow a snapshot to be taken from multiple volumes at the same point-in-time to achieve write order consistency. This is extremely useful for applications that contain multiple volumes. For example, an application may have data stored in one volume and logs stored in another. If snapshots for these volumes are taken at different times, the application will not be consistent and will not function properly if restored from those snapshots.

While you can quiesce the application first and take individual snapshots sequentially, this process can be time-consuming or sometimes impossible. Consistent group support provides crash consistency across all volumes in the group without the need for application quiescence.

Kubernetes APIs for volume group snapshots

Kubernetes' support for volume group snapshots relies on three API kinds that are used for managing snapshots:

VolumeGroupSnapshot
Created by a Kubernetes user (or automation) to request creation of a volume group snapshot for multiple persistent volume claims.
VolumeGroupSnapshotContent
Created by the snapshot controller for a dynamically created VolumeGroupSnapshot. It contains information about the provisioned cluster resource (a group snapshot). The object binds to the VolumeGroupSnapshot for which it was created with a one-to-one mapping.
VolumeGroupSnapshotClass
Created by cluster administrators to describe how volume group snapshots should be created, including the driver information, the deletion policy, etc.

These three API kinds are defined as CustomResourceDefinitions (CRDs). For the GA release, the API version has been promoted to v1.

What's new in GA?

How do I use Kubernetes volume group snapshots?

Creating a new group snapshot with Kubernetes

Once a VolumeGroupSnapshotClass object is defined and you have volumes you want to snapshot together, you may request a new group snapshot by creating a VolumeGroupSnapshot object.

Label the PVCs you wish to group:

% kubectl label pvc pvc-0 group=myGroup
persistentvolumeclaim/pvc-0 labeled

% kubectl label pvc pvc-1 group=myGroup
persistentvolumeclaim/pvc-1 labeled

For dynamic provisioning, a selector must be set so that the snapshot controller can find PVCs with the matching labels to be snapshotted together.

apiVersion: groupsnapshot.storage.k8s.io/v1
kind: VolumeGroupSnapshot
metadata:
  name: snapshot-daily-20260422
  namespace: demo-namespace
spec:
  volumeGroupSnapshotClassName: csi-groupSnapclass
  source:
    selector:
      matchLabels:
        group: myGroup
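
To illustrate how a matchLabels selector picks the members of a group, here is a minimal Python sketch. It models only exact key/value matching; the select_pvcs helper and the third PVC are hypothetical, and the real snapshot controller performs additional checks.

```python
def select_pvcs(pvcs, match_labels):
    """Return the names of PVCs whose labels contain every matchLabels pair."""
    return [
        pvc["name"]
        for pvc in pvcs
        if all(pvc.get("labels", {}).get(k) == v for k, v in match_labels.items())
    ]


pvcs = [
    {"name": "pvc-0", "labels": {"group": "myGroup"}},
    {"name": "pvc-1", "labels": {"group": "myGroup"}},
    {"name": "pvc-scratch", "labels": {}},  # unlabeled, so not part of the group
]

print(select_pvcs(pvcs, {"group": "myGroup"}))
```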

The VolumeGroupSnapshotClass is required for dynamic provisioning:

apiVersion: groupsnapshot.storage.k8s.io/v1
kind: VolumeGroupSnapshotClass
metadata:
 name: csi-groupSnapclass
driver: example.csi.k8s.io
deletionPolicy: Delete

How to use group snapshot for restore

At restore time, request a new PersistentVolumeClaim to be created from a VolumeSnapshot object that is part of a VolumeGroupSnapshot. Repeat this for all volumes that are part of the group snapshot.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: examplepvc-restored-2026-04-22
  namespace: demo-namespace
spec:
  storageClassName: example-sc
  dataSource:
    name: snapshot-0962a745b2bf930bb385b7b50c9b08af471f1a16780726de19429dd9c94eaca0
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOncePod
  resources:
    requests:
      storage: 100Mi
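
Because the restore step is repeated once per volume, it lends itself to a small generator. The Python sketch below builds one PersistentVolumeClaim manifest per VolumeSnapshot name, following the example above; the restore_pvc helper, the snapshot names, and the PVC names are hypothetical placeholders, and how you obtain the list of member snapshots is left out.

```python
import json


def restore_pvc(snapshot_name, pvc_name, namespace, storage_class, size):
    """Build a PVC manifest that restores from one VolumeSnapshot."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": pvc_name, "namespace": namespace},
        "spec": {
            "storageClassName": storage_class,
            "dataSource": {
                "name": snapshot_name,
                "kind": "VolumeSnapshot",
                "apiGroup": "snapshot.storage.k8s.io",
            },
            "accessModes": ["ReadWriteOncePod"],
            "resources": {"requests": {"storage": size}},
        },
    }


# Placeholder names standing in for the snapshots in one group snapshot.
member_snapshots = ["snapshot-aaa", "snapshot-bbb"]

manifests = [
    restore_pvc(name, f"restored-{i}", "demo-namespace", "example-sc", "100Mi")
    for i, name in enumerate(member_snapshots)
]
print(json.dumps(manifests, indent=2))
```

Since JSON is valid YAML, each generated manifest can be written to a file and applied with kubectl.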

As a storage vendor, how do I add support for group snapshots?

To implement the volume group snapshot feature, a CSI driver must:

See the CSI spec and the Kubernetes-CSI Driver Developer Guide for more details.

How can I learn more?

How do I get involved?

This project, like all of Kubernetes, is the result of hard work by many contributors from diverse backgrounds working together. On behalf of SIG Storage, I would like to offer a huge thank you to all the contributors who stepped up over the years to help the project reach GA:

For those interested in getting involved with the design and development of CSI or any part of the Kubernetes Storage system, join the Kubernetes Storage Special Interest Group (SIG). We always welcome new contributors.

We also hold regular Data Protection Working Group meetings. New attendees are welcome to join our discussions.

08 May 2026 6:35pm GMT

07 May 2026

Kubernetes v1.36: More Drivers, New Features, and the Next Era of DRA

Dynamic Resource Allocation (DRA) has fundamentally changed how platform administrators handle hardware accelerators and specialized resources in Kubernetes. In the v1.36 release, DRA continues to mature, bringing a wave of feature graduations, critical usability improvements, and new capabilities, including extending DRA to native resources like memory and CPU and supporting ResourceClaims in PodGroups.

Driver availability continues to expand. Beyond specialized compute accelerators, the ecosystem includes support for networking and other hardware types, reflecting a move toward a more robust, hardware-agnostic infrastructure.

Whether you are managing massive fleets of GPUs, need better handling of failures, or are simply looking for better ways to define resource fallback options, the upgrades to DRA in 1.36 have something for you. Let's dive into the new features and graduations!

Feature graduations

The community has been hard at work stabilizing core DRA concepts. In Kubernetes 1.36, several highly anticipated features have graduated to Beta and Stable.

Prioritized list (stable)

Hardware heterogeneity is a reality in most clusters. With the Prioritized list feature, you can confidently define fallback preferences when requesting devices. Instead of hardcoding a request for a specific device model, you can specify an ordered list of preferences (e.g., "Give me an H100, but if none are available, fall back to an A100"). The scheduler will evaluate these requests in order, drastically improving scheduling flexibility and cluster utilization.
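
The fallback behavior can be pictured as a first-match search over the ordered preferences. The Python sketch below is a toy model of that idea, not the scheduler's allocation logic; the pick_device helper and the availability counts are hypothetical.

```python
def pick_device(preferences, available):
    """Return the first preferred device model that still has free capacity."""
    for model in preferences:
        if available.get(model, 0) > 0:
            return model
    return None  # nothing in the list can be satisfied


preferences = ["H100", "A100"]  # ordered: most preferred first

print(pick_device(preferences, {"H100": 0, "A100": 3}))
print(pick_device(preferences, {"H100": 2, "A100": 3}))
```

With no H100 free, the first call falls back to the A100; when both are free, the H100 wins because it appears earlier in the list.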

Extended resource support (beta)

As DRA becomes the standard for resource allocation, bridging the gap with legacy systems is crucial. The DRA Extended resource feature allows users to request resources via traditional extended resources on a Pod. This allows for a gradual transition to DRA, meaning cluster operators can migrate clusters to DRA but let application developers adopt the ResourceClaim API on their own schedule.

Partitionable devices (beta)

Hardware accelerators are powerful, and sometimes a single workload doesn't need an entire device. The Partitionable devices feature provides native DRA support for dynamically carving physical hardware into smaller, logical instances (such as Multi-Instance GPUs) based on workload demands. This allows administrators to safely and efficiently share expensive accelerators across multiple Pods.

Device taints (beta)

Just as you can taint a Kubernetes Node, you can apply taints directly to specific DRA devices. Device taints and tolerations empower cluster administrators to manage hardware more effectively. You can taint faulty devices to prevent them from being allocated to standard claims, or reserve specific hardware for dedicated teams, specialized workloads, and experiments. Ultimately, only Pods with matching tolerations are permitted to claim these tainted devices.
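
The matching rule can be sketched with a simplified model borrowed from node taints and tolerations. In this Python sketch the dictionary fields (key, value, effect, operator), the helper names, and the example taint are assumptions made for illustration; consult the DRA API reference for the actual schema.

```python
def tolerates(toleration, taint):
    """True if a single toleration covers a single taint."""
    effect = toleration.get("effect")
    if effect and effect != taint["effect"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        # Exists matches any value; an absent key matches any taint key.
        return toleration.get("key") in (None, taint["key"])
    return (
        toleration.get("key") == taint["key"]
        and toleration.get("value") == taint.get("value")
    )


def can_claim(device_taints, tolerations):
    """A device is claimable only if every one of its taints is tolerated."""
    return all(
        any(tolerates(tol, taint) for tol in tolerations)
        for taint in device_taints
    )


faulty = [{"key": "example.com/unhealthy", "value": "true", "effect": "NoSchedule"}]
debug_tolerations = [{"key": "example.com/unhealthy", "operator": "Exists"}]

print(can_claim(faulty, []))
print(can_claim(faulty, debug_tolerations))
```

A standard claim with no tolerations is kept away from the tainted device, while a claim that explicitly tolerates the taint can still use it, for example to run diagnostics.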

Device binding conditions (beta)

To improve scheduling reliability, the Kubernetes scheduler can use the Binding conditions feature to delay committing a Pod to a Node until its required external resources (such as attachable devices or FPGAs) are fully prepared. By explicitly modeling resource readiness, this prevents premature assignments that can lead to Pod failures, ensuring a much more robust and predictable deployment process.

Resource health status (beta)

Knowing when a device has failed or become unhealthy is critical for workloads running on specialized hardware. With Resource health status, Kubernetes exposes device health information directly in the Pod status, giving users and controllers crucial visibility to quickly identify and react to hardware failures. The feature includes support for human-readable health status messages, making it significantly easier to diagnose issues without the need to dive into complex driver logs.

New Features

Beyond stabilizing existing capabilities, v1.36 introduces foundational new features that expand what DRA can do. These are alpha features, so they are behind feature gates that are disabled by default.

ResourceClaim support for workloads

To optimize large-scale AI/ML workloads that rely on strict topological scheduling, the ResourceClaim support for workloads feature enables Kubernetes to seamlessly manage shared resources across massive sets of Pods. By associating ResourceClaims or ResourceClaimTemplates with PodGroups, this feature eliminates previous scaling bottlenecks, such as the limit on the number of pods that can share a claim, and removes the burden of manual claim management from specialized orchestrators.

Node allocatable resources

Why should DRA only be for external accelerators? In v1.36, we are introducing the first iteration of using the DRA APIs to manage node allocatable infrastructure resources (like CPU and memory). By bringing CPU and memory allocation under the DRA umbrella with the DRA Node allocatable resources feature, users can leverage DRA's advanced placement, NUMA-awareness, and prioritization semantics for standard compute resources, paving the way for incredibly fine-grained performance tuning.

DRA resource availability visibility

One of the most requested features from cluster administrators has been better visibility into hardware capacity. The new Resource pool status feature allows you to query the availability of devices in DRA resource pools. By creating a ResourcePoolStatusRequest object, you get a point-in-time snapshot of device counts (total, allocated, available, and unavailable) for each pool managed by a given driver. This enables better integration with dashboards and capacity planning tools.

List types for attributes

ResourceClaim constraint evaluation has changed to work better with scalar and list values: matchAttribute now checks for a non-empty intersection, and distinctAttribute checks for pairwise disjoint values.

An includes() function has also been introduced in CEL. It lets device selectors keep working when an attribute changes between scalar and list representations. (The includes() function is only available in DRA contexts for expression evaluation.)
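
The new semantics are easiest to see by treating a scalar as a one-element set. The Python sketch below models the two constraints and a scalar-or-list membership test under that assumption; the function names are Python stand-ins, and the includes helper only imitates the idea behind the CEL function rather than reproducing its exact signature.

```python
from itertools import combinations


def as_set(value):
    """Treat a scalar attribute as a single-element set."""
    if isinstance(value, (list, tuple, set, frozenset)):
        return set(value)
    return {value}


def match_attribute(values):
    """matchAttribute: the values of all devices share at least one element."""
    sets = [as_set(v) for v in values]
    return bool(sets) and bool(set.intersection(*sets))


def distinct_attribute(values):
    """distinctAttribute: no two devices have any element in common."""
    sets = [as_set(v) for v in values]
    return all(a.isdisjoint(b) for a, b in combinations(sets, 2))


def includes(attribute, wanted):
    """Membership test that works for scalar and list attributes alike."""
    return wanted in as_set(attribute)


print(match_attribute(["numa0", ["numa0", "numa1"]]))
print(distinct_attribute([["a", "b"], "b"]))
print(includes("numa0", "numa0"), includes(["numa0", "numa1"], "numa1"))
```

In the first call a scalar and a list overlap on numa0, so the match succeeds; in the second, the shared element b violates the pairwise-disjoint requirement.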

Deterministic device selection

The Kubernetes scheduler has been updated to evaluate devices using lexicographical ordering based on resource pool and ResourceSlice names. This change empowers drivers to proactively influence the scheduling process, leading to improved throughput and more optimal scheduling decisions. The ResourceSlice controller toolkit automatically generates names that reflect the exact device ordering specified by the driver author.
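
A short Python sketch shows what lexicographical ordering implies for naming. It assumes the sort key is the pool name followed by the ResourceSlice name, which is one reading of the description above; the evaluation_order helper and all names are hypothetical.

```python
def evaluation_order(slices):
    """Order slices lexicographically by (pool name, slice name)."""
    return sorted(slices, key=lambda s: (s["pool"], s["name"]))


slices = [
    {"pool": "pool-b", "name": "slice-1"},
    {"pool": "pool-a", "name": "slice-2"},
    {"pool": "pool-a", "name": "slice-10"},
]

for s in evaluation_order(slices):
    print(s["pool"], s["name"])

# Zero-padded names keep numeric and lexicographical order aligned.
print(sorted(["slice-02", "slice-10"]))
```

Note that slice-10 sorts before slice-2 because the comparison is character by character, which is why generated names that encode an intended order typically need fixed-width numbering.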

Discoverable device metadata in containers

Workloads running on nodes with DRA devices often need to discover details about their allocated devices, such as PCI bus addresses or network interface configuration, without querying the Kubernetes API. With Device metadata, Kubernetes defines a standard protocol for how DRA drivers expose device attributes to containers as versioned JSON files at well-known paths. Drivers built with the DRA kubelet plugin library get this behavior transparently; they just provide the metadata and the library handles file layout, CDI bind-mounts, versioning, and lifecycle. This gives applications a consistent, driver-independent way to discover and consume device metadata, eliminating the need for custom controllers or looking up ResourceSlice objects to get metadata via attributes.

What's next?

This release introduced a wealth of new Dynamic Resource Allocation (DRA) features, and the momentum is only building. As we look ahead, our roadmap focuses on maturing existing features toward beta and stable releases while hardening DRA's performance, scalability, and reliability. A key priority over the coming cycles will be deep integration with workload-aware and topology-aware scheduling.

A big goal for us is to migrate users from Device Plugin to DRA, and we want you involved. Whether you are currently maintaining a driver or are just beginning to explore the possibilities, your input is vital. Partner with us to shape the next generation of resource management. Reach out today to collaborate on development, share feedback, or start building your first DRA driver.

Getting involved

A good starting point is joining the WG Device Management Slack channel and meetings, which happen at Americas/EMEA and EMEA/APAC friendly time slots.

Not all enhancement ideas are tracked as issues yet, so come talk to us if you want to help or have some ideas yourself! We have work to do at all levels, from difficult core changes to usability enhancements in kubectl, which could be picked up by newcomers.

07 May 2026 6:35pm GMT