24 Jun 2026
Kubernetes Blog
Spotlight on WG Device Management
The rising popularity of AI, Edge, and Telecommunications workloads on Kubernetes has led to new requirements for hardware management. We now need hardware specification beyond CPU time and memory allocations. This includes allocating GPUs, TPUs, network interfaces, and other hardware, sometimes after pod start and occasionally through time-sharing.
Efficiently managing this specialized hardware is the mission of the Device Management Working Group. Their cornerstone project, Dynamic Resource Allocation (DRA), recently graduated to GA, marking a fundamental shift in how the project handles hardware-intensive workloads at scale.
In this spotlight, we sit down with working group chairs Kevin Klues, Patrick Ohly, and John Belamaric to discuss the limitations of the legacy device model, the NP-hard challenges of scheduling, and how they're building a more programmable, hardware-aware future for Kubernetes.
Introducing Device Management
Natalie Fisher: Can you introduce yourself, your role, and how you got involved in the Device Management Working Group?
Kevin Klues: My name is Kevin Klues. I am a Distinguished Engineer at NVIDIA. I have been a co-chair of the device management working group since its inception at KubeCon EU 2024. I have also been involved with DRA (the working group's primary deliverable) since its inception in 2019/2020. I have also been a kubelet maintainer since 2019, with a focus on its device manager, CPU manager, and topology manager subcomponents. The challenges we saw with using these components for workloads that relied on external accelerators (e.g., GPUs) are what triggered us to start working on DRA in the first place.
Patrick Ohly: I am a Principal Engineer at Intel. In Kubernetes, I am a Tech Lead for SIG Testing and SIG Instrumentation and co-chair of the Device Management WG. I was co-chair of the WG Structured Logging and a member of the Steering Committee. Some of my early contributions to Kubernetes include ephemeral CSI volumes and storage capacity tracking, so I had some experience with API design, implementation, and scheduling. We knew that introducing a major new API for accelerators would be hard. Somewhat foolishly, I accepted that challenge in 2020, wrote the initial DRA KEP (now known as "classic DRA") and implemented most of it, then started over with a second KEP for today's "structured parameters DRA". Initially, it was an uphill battle to convince maintainers that this work was necessary. It was only around 2023 that interest in DRA picked up, leading to the formation of the working group.
John Belamaric: I am a Senior Staff SWE at Google, and the third co-chair of WG Device Management, also since its inception. I have also been a co-chair of SIG Architecture since 2019. As Patrick mentioned, in late 2023, interest in DRA really picked up. The initial implementation made autoscaling very challenging, and so there was some concern in the community about advancing it to beta. I got involved to try to help address some of those concerns, and the three of us, along with Tim Hockin, worked hard over the next few months to build a consensus around a new design. To facilitate this collaboration, we formed the working group after discussion at KubeCon in Paris in 2024.
The problem and the solution
The working group emerged from a fundamental rethink of how Kubernetes interacts with specialized hardware. At the heart of this evolution is Dynamic Resource Allocation (DRA). Rather than treating devices as simple integers, DRA provides a structured framework that breaks device management into four distinct stages:
- Modeling: Vendors use the ResourceSlice API to advertise the granular capabilities and capacity of their hardware.
- Requesting: Users define their specific hardware needs, such as GPU memory or interconnect requirements, through the ResourceClaim API.
- Scheduling: The Kubernetes scheduler uses these APIs to match workload requirements against available hardware intelligently.
- Actuation: Once a match is made, the system handles the "handshake" that prepares and secures the device for the Pod's use.
NF: For readers who may not be familiar, what is the Device Management Working Group, and what problems is it trying to solve?
KK: The Device Management Working Group was chartered to enable simple and efficient configuration, sharing, and allocation of accelerators and other specialized hardware across Kubernetes workloads. Think GPUs, TPUs, FPGAs, and similar devices that don't fit neatly into Kubernetes' traditional resource model.
The problem we set out to solve is that the legacy Device Plugin API (which has been the primary mechanism for exposing hardware accelerators in Kubernetes) is fundamentally limited. It treats devices as opaque integers: you can request "2 GPUs," but you can't say anything meaningful about which GPUs you need, how they should be connected to each other, whether they can be shared, or how they should be partitioned. That was fine for simple cases, but modern AI/ML workloads are anything but simple. They span multiple nodes, require specific interconnect topologies, and increasingly need to share or partition hardware dynamically.
The working group's primary deliverable is Dynamic Resource Allocation (DRA), a new framework that replaces the rigid device plugin model with a flexible, declarative API. With DRA, workloads can describe their hardware requirements (e.g., GPU type, memory capacity, interconnect topology, desired partitioning) and drivers can publish fine-grained device attributes that the scheduler can act on. DRA graduated to GA in Kubernetes 1.34, and the ecosystem around it (e.g., drivers, tooling, and new API extensions) is growing rapidly.
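To make the contrast between "opaque integers" and attribute-based requests concrete, here is a minimal Python sketch. It is a toy model only: the `Device` class, the inventory values, and both request functions are hypothetical names invented for illustration, and none of them correspond to the Device Plugin API or the DRA API types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """Hypothetical stand-in for a device that a driver advertises with attributes."""
    name: str
    model: str
    memory_gb: int
    node: str


# Illustrative inventory; the values are made up for this sketch.
INVENTORY = [
    Device("gpu-0", "small", 16, "node-a"),
    Device("gpu-1", "large", 80, "node-a"),
    Device("gpu-2", "large", 80, "node-b"),
]


def request_by_count(devices, count):
    """Count-only request: take the first `count` devices, whatever they are."""
    return devices[:count] if len(devices) >= count else []


def request_by_attributes(devices, count, predicate):
    """Attribute-aware request: only devices that satisfy `predicate` qualify."""
    matching = [d for d in devices if predicate(d)]
    return matching[:count] if len(matching) >= count else []


by_count = request_by_count(INVENTORY, 2)
by_attrs = request_by_attributes(INVENTORY, 2, lambda d: d.memory_gb >= 40)
print([d.name for d in by_count])  # ['gpu-0', 'gpu-1']
print([d.name for d in by_attrs])  # ['gpu-1', 'gpu-2']
```

The count-only request happily returns the 16GB device, because nothing in the request can say otherwise; the attribute-aware request can rule it out.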
PO: As Kevin said, the working group was formed around the existing effort to develop DRA. The initial work was done with only a handful of people actively involved, and perhaps also could only be done successfully in such a setup. But because it touches on so many different areas of Kubernetes, we also needed a place to discuss that and get the broader community of Kubernetes maintainers, device vendors, and, to a lesser extent, also end-users involved. The working group provides that place, with regular meetings online (one slot for Americas/EMEA, one for EMEA/Asia) and at KubeCon.
JB: DRA is the first problem the WG has addressed. It is focused on selection, allocation, and configuration of the devices. We broke the problem down into four parts: how does the vendor model the device and advertise capacity, how does the user request it, how do we schedule that request on top of the advertised capacity, and how do we actuate that result (that is, how do we make the device ready and available to the Pod).
One thing that is fundamental to the approach we took is an awareness of the incredible diversity of hardware and the rapid rate of change in the hardware industry. We knew that we couldn't keep up with the change if the Kubernetes APIs had to change for every type of hardware. Instead, we created a general approach where we address the hardware aspects that are important to Kubernetes. What we have done so far is focus on the scheduling and configuration aspects of devices. We built a device modeling API (the ResourceSlice API) that vendors use to model the scheduling characteristics of their devices, and allow users to pass through arbitrary configurations to those devices. By doing this, Kubernetes can be "programmed" to understand these aspects of the devices, without needing to be modified.
But DRA, as it stands right now, is very focused on scheduling. There are other aspects of Device Management that are in scope for the WG. In particular, we are looking into device failure detection and mitigation, and whether there is some better support we can build into Kubernetes to help.
Also, as Kevin alluded to, devices are often allocated and used in groups, rather than individually. Choosing the right devices to work together in a group depends on how they are interconnected; for example, NVIDIA GPUs may be in an any-to-any fabric arrangement in an NVLink domain, whereas TPUs may have a 3D torus interconnect. This affects the "selection, allocation and configuration" of devices, and we have a lot more work to do to address these use cases.
A cross-SIG effort
Because device management touches scheduling, node operations, autoscaling, networking, and API design, the work naturally spans multiple SIGs across the Kubernetes project.
NF: How does collaboration across these SIGs work in practice, and why is it necessary?
KK: Device management touches nearly every layer of the Kubernetes stack, which is why the working group was chartered as a cross-SIG effort from the start. We have five stakeholder SIGs: sig-node, sig-scheduling, sig-autoscaling, sig-network, and sig-architecture.
In practice, the working group serves as a coordination layer. We don't own code directly; instead, our deliverables take the form of KEPs and implementations that live in the respective SIGs. What we provide is a unified forum where the people building the scheduler, the kubelet, the autoscaler, and the network plane can design together rather than in isolation.
Why is this necessary? Consider a simple example: a user requests a set of GPUs that need to communicate via NVLink. That requirement involves the scheduler (place the pods on the right nodes), the kubelet (configure the devices and expose them to the container), and potentially autoscaling (provision the right node type if none exists).
If those three groups design independently, you end up with inconsistent abstractions, duplicated logic, and integration bugs that only surface in production. The working group ensures that a single coherent API and data model flows through all of these components.
The cross-SIG model also means that design decisions are reviewed from multiple angles. Someone from sig-scheduling will catch scheduler complexity that a sig-node contributor might overlook, and vice versa. It slows down individual decisions slightly, but produces much more robust outcomes.
Current focus areas
With DRA now generally available, the working group's focus has expanded to enable more advanced scheduling models, shared semantics, operational visibility, and support for increasingly complex hardware topologies.
NF: What are some of the key initiatives or deliverables the working group is currently focused on?
KK: We maintain a Kubernetes project board with real-time tracking of our initiatives and their progress.
PO: The scope and feature set of core DRA were intentionally limited to enable graduation to GA within a reasonable time. Additional KEPs add more features, on their own schedule. Those fall roughly into three categories:
- Extend the expressiveness of DRA to support more complex devices and scheduling scenarios.
- Support day two operations like health monitoring.
- Improve multi-node support, primarily by integrating with workload-aware scheduling.
In addition to the project board, we also maintain a table which summarizes all the KEPs which are currently in flight. This is the status for 1.36; more are likely to be added for 1.37:
| KEP | Description | Stages in release order (releases 1.32 to 1.36) |
|---|---|---|
| 4381 | DRA: Structured Parameters | Beta, Beta, Stable |
| 5004 | DRA: Extended Resource Requests via DRA | Alpha, Alpha, Beta |
| 4817 | DRA: Resource Claim Status | Alpha, Beta, Beta, Beta, Beta |
| 5018 | DRA: Namespace Controlled Admin Access | Alpha, Beta, Beta, Stable |
| 5055 | DRA: Device Taints and Tolerations | Alpha, Alpha, Alpha, Beta |
| 4816 | DRA: Prioritized Alternatives in Device Requests | Alpha, Beta, Beta, Stable |
| 5075 | DRA: Consumable Capacity | Alpha, Alpha, Beta |
| 4815 | DRA: Partitionable Devices | Alpha, Alpha, Alpha, Beta |
| 5304 | DRA: Attributes Downward API | Alpha |
| 5729 | DRA: ResourceClaim Support for Workloads | Alpha |
| 4680 | Resource Health Status in Pod Status | Alpha, Alpha, Alpha, Alpha, Beta |
| 5517 | DRA: Native Resource Requests | Alpha |
| 5677 | DRA: Resource Availability Visibility | Alpha |
| 5007 | DRA: Device Binding Conditions | Alpha, Alpha, Beta |
| 5491 | DRA: List Types for Attributes | Alpha |
NF: One of the core challenges is efficient device utilization and sharing. What progress is being made in this area?
JB: Good question. One way to think about it is what we are doing in the two primary APIs: ResourceClaim and ResourceSlice.
The ResourceClaim API is how the user asks for devices. We have built some features that allow the user to be more flexible in their requests. For example, instead of asking for a specific model of GPU, they can ask for a GPU with at least a certain amount of memory. Or they can ask for a list of alternatives: "I'd like one A100 (80GB) GPU, but if you don't have it, I'll take 2 A100 (40GB) GPUs." This gives the scheduler some options to satisfy the request, which can lead to better obtainability and utilization of hardware that otherwise would not be selected.
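The "list of alternatives" idea can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the scheduler's implementation: `Alternative` and `pick_alternative` are hypothetical names, and free devices are reduced to a simple count per model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alternative:
    """One acceptable way to satisfy a request (hypothetical type)."""
    model: str
    count: int


def pick_alternative(alternatives, free_by_model):
    """Return the first alternative, in preference order, that free devices can satisfy."""
    for alt in alternatives:
        if free_by_model.get(alt.model, 0) >= alt.count:
            return alt
    return None


# Preference order from the example: one 80GB GPU, otherwise two 40GB GPUs.
preferences = [Alternative("a100-80gb", 1), Alternative("a100-40gb", 2)]
chosen = pick_alternative(preferences, {"a100-80gb": 0, "a100-40gb": 3})
print(chosen)  # Alternative(model='a100-40gb', count=2)
```

Because the request carries its own fallback, the workload can still be placed when the preferred device is unavailable.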
The ResourceClaim API allows users to explicitly share devices. You can point multiple containers (in the same or different Pods) at a ResourceClaim; this allows the devices allocated by that claim to be used in all of those containers, if the device supports it.
The ResourceSlice API is how vendors model and advertise their devices. This is where we implement support for other sharing models. For example, we have a way to represent "overlapping partitions", enabling the scheduler to dynamically select a MIG partition, and make any overlapping MIG partitions unavailable automatically. This works well in combination with a request like "give me any GPU with 20GB or more of memory" - the scheduler can satisfy that with a MIG or a real GPU.
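One way to picture "overlapping partitions" is a toy model in which each partition occupies a set of hardware slices, and allocating a partition makes every partition that shares a slice unavailable. The Python sketch below assumes exactly that; the `PartitionedGpu` class, the partition names, and the slice layout are all hypothetical and are not how ResourceSlice represents devices.

```python
class PartitionedGpu:
    """Toy model of one GPU whose partitions may overlap on hardware slices."""

    def __init__(self, partitions):
        # partitions maps a partition name to (memory in GB, slices it occupies).
        self.partitions = partitions
        self.used_slices = set()

    def _is_free(self, name):
        # A partition is free only if none of its slices are already in use.
        return not (self.partitions[name][1] & self.used_slices)

    def available(self):
        """Names of partitions that can still be allocated, sorted."""
        return sorted(name for name in self.partitions if self._is_free(name))

    def allocate(self, min_memory_gb):
        """Allocate the smallest free partition with enough memory, or return None."""
        candidates = [
            (mem, name)
            for name, (mem, _) in self.partitions.items()
            if mem >= min_memory_gb and self._is_free(name)
        ]
        if not candidates:
            return None
        _, name = min(candidates)
        self.used_slices |= self.partitions[name][1]
        return name


gpu = PartitionedGpu({
    "whole": (40, frozenset({0, 1})),   # the full device
    "half-a": (20, frozenset({0})),     # overlaps "whole"
    "half-b": (20, frozenset({1})),     # overlaps "whole"
})
first = gpu.allocate(20)
print(first, gpu.available())  # half-a ['half-b']
```

After `half-a` is allocated, `whole` disappears from the available set automatically, which is the behavior the "20GB or more" request relies on.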
Some features require changes in both. We have another sharing method we call "consumable capacity". In the explicit sharing case described above, a user needs to point containers at the same ResourceClaim; there is one ResourceClaim shared amongst several containers and Pods. With consumable capacity, the device sharing works more like how Pods share a Node. The user creates a ResourceClaim that asks for a certain amount of resources, for example, "I need a NIC with 2Gbps of bandwidth". The scheduler knows that there is a NIC with 40Gbps of bandwidth available, and so it allocates 2Gbps out of that 40Gbps and gives it to that ResourceClaim. In this case, each Pod has its own ResourceClaim, but the underlying device is shared between those claims. It's up to the on-node DRA driver to properly set up the device for this sort of sharing (in the NIC case, likely by creating a subinterface). We call this "platform-mediated sharing" to differentiate it from the explicit "user-mediated sharing".
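The NIC example amounts to simple bookkeeping, sketched below in Python. The `SharedDevice` class and the claim names are hypothetical and exist only to illustrate the arithmetic; a real driver would also have to configure the device on the node.

```python
class SharedDevice:
    """Toy model of a device whose capacity is handed out to several claims."""

    def __init__(self, name, capacity_gbps):
        self.name = name
        self.capacity_gbps = capacity_gbps
        self.allocations = {}  # claim name -> Gbps granted

    def remaining(self):
        """Capacity not yet granted to any claim."""
        return self.capacity_gbps - sum(self.allocations.values())

    def allocate(self, claim, gbps):
        """Grant `gbps` to `claim` if enough capacity is left; return success."""
        if claim in self.allocations:
            raise ValueError(f"{claim} already has an allocation")
        if gbps > self.remaining():
            return False
        self.allocations[claim] = gbps
        return True


nic = SharedDevice("nic-0", capacity_gbps=40)
ok = nic.allocate("claim-a", 2)
print(ok, nic.remaining())  # True 38
```

Each claim is separate, yet all of them draw from the same 40Gbps pool, much like Pods drawing CPU and memory from one Node.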
Real-world impact
While much of the work is deeply technical, the underlying goal is practical: enabling Kubernetes to better support real-world AI/ML and hardware-intensive workloads at scale.
NF: What are the biggest challenges users face today when running hardware-intensive workloads (like AI/ML) on Kubernetes?
PO: Such workloads depart from traditional container workloads in several ways: they may consist of multiple communicating pods which all need to run at the same time ("gang scheduling"). They are often long-running and expensive to initialize, and their performance is sensitive to where they run (topology within a node and interconnects between nodes for multiple pods). The Kubernetes scheduler traditionally has not supported either of these well because it schedules one pod at a time and is unaware of the topology within a node. Several external schedulers try to fill this gap, which often isn't ideal, in particular when the Kubernetes scheduler schedules other pods to the same cluster.
NF: How should platform engineers think about device management when designing their Kubernetes platforms?
JB: We're still learning here, but one idea of DRA is to enable a shift to more "requirements driven" specifications. This can allow less coupling between end users that write the workload specification and the cluster administrators that set up the clusters. Instead of agreeing on labeling conventions and requiring users to understand the cluster topology, the users can specify what their workload needs, and the scheduler can figure out how to satisfy it. If we can make this work, it can make even complex workloads more portable across clusters.
Challenges and trade-offs
As with many areas of Kubernetes, increasing flexibility and expressiveness also introduces new layers of complexity, particularly around scheduling and optimization.
NF: What are some of the hardest technical challenges the working group is tackling today?
PO: There's an inherent conflict between flexibility and scheduling complexity. The current implementation is focused on finding some solution that satisfies the requested resources, but it's not necessarily the best one, whatever "best" means, which is also not always clear. The other big challenge is exposing node-allocatable resources (RAM, CPU) as devices with additional metadata; this is necessary to fine-tune scheduling of workloads which need perfect alignment on a node for optimal performance.
JB: Patrick's list is good. Complex device modeling is hard, and making sure that we build the right semantics such that they apply to lots of different hardware is always tricky.
On top of that, scheduling in general is very complex and is an NP-hard problem. All the metadata and flexibility DRA adds gives the scheduler more options, which has pros and cons. More options are helpful if you are constrained in your choices, as it means you can schedule something that you otherwise could not. But it also means it is even harder to find an optimal solution when there are many possibilities in a given cluster. DRA works well in our common use cases so far, but we have a lot of work to do to improve the optimality of the chosen scheduling solution and ensure the performance of making that choice.
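A toy example shows why more options make the search harder. The Python sketch below compares a greedy first-fit assignment with an exhaustive search over every ordered choice of devices. It is not the algorithm the Kubernetes scheduler uses; the function names and the "unused capacity" cost are assumptions made for illustration.

```python
import math
from itertools import permutations


def first_fit(requests, devices):
    """Greedy: give each request the first unused device that is large enough."""
    used = set()
    assignment = []
    for need in requests:
        for i, capacity in enumerate(devices):
            if i not in used and capacity >= need:
                used.add(i)
                assignment.append(i)
                break
        else:
            return None  # no unused device fits this request
    return assignment


def exhaustive(requests, devices):
    """Try every ordered choice of devices; return (least-waste assignment, number tried)."""
    best, best_waste, tried = None, None, 0
    for combo in permutations(range(len(devices)), len(requests)):
        tried += 1
        if all(devices[i] >= need for i, need in zip(combo, requests)):
            waste = sum(devices[i] - need for i, need in zip(combo, requests))
            if best is None or waste < best_waste:
                best, best_waste = list(combo), waste
    return best, tried


requests = [20, 80]   # memory each request needs
devices = [80, 40]    # memory each device offers

print(first_fit(requests, devices))   # None
print(exhaustive(requests, devices))  # ([1, 0], 2)
print(math.perm(10, 5))               # 30240
```

Greedy placement puts the small request on the large device and then fails, while the exhaustive search finds a valid assignment. The price is the search space: with only 10 devices and 5 requests there are already 30240 ordered choices to consider.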
Looking ahead
Despite the challenges, contributors across the working group remain excited about the pace of innovation and the growing community forming around device management in Kubernetes.
NF: Looking ahead, what are you most excited about in the future of device management in Kubernetes?
KK: NVIDIA recently donated its DRA driver for GPUs to the Kubernetes project. I'm personally excited for more community members to start contributing to the project and defining its future direction.
PO: For me, it's primarily the number of new contributors and people stepping up to help out. This poses new challenges around reviewing proposals and helping developers get those implemented and merged. It's nice and rewarding to see others succeed, and it bodes well for the future because more people are familiar with the topic.
JB: I am excited about a lot of things. The community really has grown and has so many interesting features in the works to enable modeling of more complex devices, and to better model multi-node devices.
I am really excited to see the creative ways people will use these APIs. They were primarily designed to address "devices", but just like how "everything is a file" in Unix/Linux, the APIs themselves are quite flexible as to what they model. They really build out a more programmable scheduler, which can have interesting applications. For example, I recently prototyped using DRA to schedule pods to nodes where a large AI model is already locally cached. It's really quite flexible, and I have great confidence in the creativity of our community, so I think we'll see some unexpected solutions in the ecosystem.
Getting involved
NF: How can contributors get involved with the Device Management Working Group?
KK: The easiest first step is to join our mailing list at wg-device-management@kubernetes.io. Subscribing will automatically add calendar invites for our biweekly meetings to your calendar.
We have two meeting slots to accommodate different time zones:
- Europe/Americas: Tuesdays at 8:30 AM PT (biweekly)
- Asia/Europe: Wednesdays at 9:00 AM CET (biweekly)
Meeting notes, agendas, and recordings are all publicly accessible (links are available from the Device Management page). You can get a feel for the work in progress before attending your first meeting.
On Slack, find us in #wg-device-management on the Kubernetes Slack workspace. That's the best place for quick questions or to introduce yourself.
For more hands-on contributions, the DRA Driver for NVIDIA GPUs is now a community project and a great place to start. It's a real-world, production-grade implementation that the broader community is now shaping together.
We welcome contributors at all levels - whether you're interested in the API design, the scheduler internals, driver development, or documentation. Come say hello.
Summary
As Kubernetes evolves to support the AI/ML revolution and high-performance computing, the work happening within WG Device Management is becoming the foundation for how modern workloads are scheduled and operated at scale.
From the graduation of Dynamic Resource Allocation (DRA) to the next frontiers of health monitoring and topology-aware scheduling, this group is effectively rewriting the "handshake" between software and hardware.
If you're interested in shaping the future of hardware-aware orchestration, now is the perfect time to get involved. Whether you want to help refine the API, build out drivers, or improve documentation, the working group welcomes all levels of experience and perspectives from across the community.
24 Jun 2026 6:00pm GMT
15 Jun 2026
Spotlight on SIG Storage
In our ongoing SIG Spotlight series, we shine a light on the groups that keep the Kubernetes project moving forward. This time, we catch up with SIG Storage, the group responsible for persistent data, volume management, and the interfaces that connect Kubernetes workloads to the storage systems beneath them.
We spoke with Xing Yang, Co-Chair of SIG Storage and Software Engineer at VMware by Broadcom, about the SIG's history, the features shipping in recent Kubernetes releases, and where storage in Kubernetes is headed as AI workloads become the norm.
Introductions
Could you introduce yourself and share your role(s) within SIG Storage?
My name is Xing Yang, a software engineer at VMware by Broadcom. I'm a co-chair in SIG Storage, alongside another co-chair Saad Ali from Google. There are also two Tech Leads in SIG Storage: Michelle Au from Google and Jan Šafránek from Red Hat.
What first drew you to storage in Kubernetes, and how did you start contributing?
I have always been working in the storage domain, so SIG Storage was a natural place for me to get started when I began to learn Kubernetes. I started attending SIG Storage meetings, trying to figure out what I could do to help. This was before the first Container Storage Interface (CSI) release - lots of things were still evolving. It was a very exciting time.
What subprojects or areas do you actively maintain or review today?
I'm a maintainer in Kubernetes CSI. There are multiple CSI sidecars - such as csi-provisioner, csi-attacher, csi-resizer, and csi-snapshotter - that we need to release following every Kubernetes release. I'm also a co-chair for a Data Protection Working Group co-sponsored by SIG Storage and SIG Apps. Several features have come out of that WG aimed at filling gaps in data protection support within Kubernetes. One is Volume Group Snapshot, which provides crash-consistent group snapshots for multiple volumes used by an application. Changed Block Tracking (CBT) is another critical feature from the DP WG designed to support efficient backups.
About SIG Storage
For folks who are new: what is SIG Storage, in your own words? What problems in Kubernetes are you trying to solve?
SIG Storage is a Special Interest Group focused on how to provide storage to containers running in your Kubernetes cluster. We define standard interfaces so that a storage vendor can write a driver and have its underlying storage system consumed by containers in Kubernetes.
Why does Kubernetes need a dedicated storage SIG? What makes storage hard in a distributed system?
When Kubernetes was first introduced, it was meant for stateless workloads only. Container applications were regarded as ephemeral and therefore did not need to persist data. However, that changed drastically. Stateful workloads started running in Kubernetes, and we needed a dedicated SIG to tackle the associated storage challenges. PersistentVolumeClaims, PersistentVolumes, and StorageClasses were all introduced to provision data volumes for applications running in Kubernetes.
How did SIG Storage originally form, and how has its mission changed over time?
SIG Storage was formed to address the challenges of handling persistent data within Kubernetes. Initially, PersistentVolumes were implemented as in-tree plugins, and the SIG managed those plugins while developing core storage primitives like PersistentVolumes and PersistentVolumeClaims.
Container Storage Interface (CSI) was introduced later and played a crucial role in simplifying storage integration, enabling third-party storage providers to develop and maintain their own out-of-tree plugins without modifying Kubernetes core code.
With basic integration addressed by CSI, the SIG's mission expanded to include advanced storage features that leverage the new interface. The SIG has also expanded its scope to support object storage through the Container Object Storage Interface (COSI).
Current work and roadmap
What are the top features SIG Storage is actively working on right now?
The Data Protection WG has been working on a couple of exciting features:
- VolumeGroupSnapshot is a Kubernetes feature enabling a crash-consistent, point-in-time snapshot of multiple PersistentVolumes simultaneously. This ensures data integrity for applications - like databases - that rely on multiple volumes by capturing all volumes in the group atomically, at the exact same point in time. It just moved to GA in Kubernetes v1.36.
- CSI Changed Block Tracking (CBT) enables efficient, incremental backups. By allowing storage systems to report only the blocks that have changed since the last snapshot, it significantly reduces the amount of data that needs to be transferred. It just moved to Beta in Kubernetes v1.36.
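The saving from incremental backups can be illustrated with a small Python sketch. It is a toy model: with CBT the storage system reports which blocks changed, whereas this sketch stands in for that report by comparing two in-memory snapshots directly. The function names and the block contents are hypothetical.

```python
def changed_blocks(previous, current):
    """Indices of blocks that differ between two equally sized snapshots."""
    if len(previous) != len(current):
        raise ValueError("snapshots must have the same number of blocks")
    return [i for i, (old, new) in enumerate(zip(previous, current)) if old != new]


def apply_incremental(base, current, indices):
    """Rebuild the newer snapshot from the base plus only the changed blocks."""
    restored = list(base)
    for i in indices:
        restored[i] = current[i]
    return restored


snap1 = [b"aaaa", b"bbbb", b"cccc", b"dddd"]
snap2 = [b"aaaa", b"BBBB", b"cccc", b"DDDD"]

delta = changed_blocks(snap1, snap2)
print(delta)  # [1, 3]
print(apply_incremental(snap1, snap2, delta) == snap2)  # True
```

Only two of the four blocks need to be transferred here; on a large, mostly unchanged volume the ratio is what makes incremental backups efficient.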
Another feature worth highlighting is Container Object Storage Interface (COSI). COSI provides a standard interface for provisioning and consuming object storage buckets in Kubernetes - standardizing object storage for containerized applications much like CSI did for block and file storage. COSI is now transitioning to v1alpha2, with plans for promotion to Beta in a future release.
What recent work from SIG Storage do you consider a "win" for users?
The graduation of VolumeAttributesClass to GA in Kubernetes v1.34 is a major win for users managing stateful workloads. Previously, changing volume attributes like IOPS or throughput required out-of-band actions or disruptive operations. Now, users can dynamically tune storage properties such as IOPS or throughput directly through the Kubernetes API - scaling up for peak loads or down to optimize costs - without external processes or downtime.
VolumeAttributesClass enables dynamic modification of storage characteristics without recreating the volume. This completes the picture by allowing users to tune both capacity and other storage properties dynamically, just as they can now tune both CPU and memory for compute.
Looking ahead one or two releases, what's on the roadmap that people should watch for?
I'd like to draw attention to the Volume Health feature. This feature is designed to offer critical visibility into the operational status and integrity of persistent volumes. By enabling storage drivers and the Kubernetes control plane to report issues, it allows for proactive monitoring and identification of volume-related problems.
Currently, volume health information is reported via non-persistent events. We are actively investigating enhancements to this feature with the goal of supporting automated remediation capabilities in the future.
Are there areas where you'd really like more discussion or help from the community?
We always need help from the community to fix bugs, add tests, and help with reviews.
We'd also like to get feedback on the Alpha feature Mutable PV Affinity, which was introduced in Kubernetes v1.35. Use cases include migrating volumes from zonal to regional storage or migrating from one disk type to another.
Another topic is volume replication. It was raised at KubeCon Atlanta and has been discussed in the Data Protection WG. Community members interested in this topic are encouraged to join the DP WG meetings.
What are the biggest challenges users face today when running stateful workloads on Kubernetes?
While Kubernetes has moved stateful workloads - like databases and AI pipelines - into the mainstream, managing "state" in a system designed for ephemerality remains difficult:
- Data Gravity and Storage Locality: Pods move in seconds, but data has gravity. If a node fails, a pod using local storage is stuck. Operators must decide whether the failure is transient or permanent - a high-stakes call. This is why we are enhancing the Volume Health feature to provide the visibility needed to automate recovery choices.
- Day 2 Complexity: Setting up a database is easy; maintaining its health over time is the real challenge. Standard Kubernetes objects like StatefulSets offer a baseline, but they lack the operational logic needed for tasks such as schema upgrades, engine patching, or cluster-wide Kubernetes upgrades.
- Data Mobility: Moving persistent data remains a significant hurdle - whether migrating between storage tiers, shifting workloads across availability zones, or moving to a different cluster. This challenge includes ongoing synchronization and replication for high availability and disaster recovery across a distributed system.
Storage and AI
How do you see storage evolving in Kubernetes over the next few years, especially as AI/ML workloads grow?
I see several trends shaping storage in Kubernetes as it evolves from a container orchestrator into the "Operating System" for AI:
- More Intelligent Data Management: We'll see a shift toward smarter CSI drivers and data management tools offering advanced features like automatic tiering, snapshots, migration, and replication - optimized specifically for high-performance AI/ML workflows and large data platforms.
- Object Storage as a First-Class Citizen: AI datasets now frequently reach exabyte scale, making object storage the preferred choice for AI workloads. COSI is standardizing bucket management just as CSI did for disks, allowing data scientists to use a BucketClaim to provision S3-compatible storage natively and unifying object, file, and block storage into a single workflow.
- Performance and Low Latency: For AI/ML, storage needs to keep up with GPU processing speeds. This will accelerate adoption of high-performance parallel file systems and NVMe-over-Fabrics (NVMe-oF) technologies managed natively via Kubernetes. The line between traditional block/file and memory-speed storage will continue to blur.
- Data-Aware Scheduling: Instead of just considering CPU and RAM, the Kubernetes scheduler will increasingly prioritize placing Pods based on data locality - calculating the cost of moving data versus moving compute to keep massive data platforms performant.
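The "moving data versus moving compute" trade-off from the last item can be made concrete with some back-of-the-envelope arithmetic. The Python sketch below is purely illustrative and does not describe any existing scheduler feature: the node names, bandwidth figures, and the `pick_node` function are all assumptions.

```python
def transfer_seconds(dataset_gb, bandwidth_gb_per_s):
    """Time to copy a dataset at a given sustained bandwidth."""
    return dataset_gb / bandwidth_gb_per_s


def pick_node(nodes, dataset_gb):
    """Pick the node with the lowest data movement time.

    `nodes` maps a node name to (has_local_copy, bandwidth to the data in GB/s).
    """
    def cost(item):
        _, (has_local_copy, bandwidth) = item
        return 0.0 if has_local_copy else transfer_seconds(dataset_gb, bandwidth)

    return min(nodes.items(), key=cost)[0]


nodes = {
    "node-a": (False, 1.0),   # would need 500 s to fetch 500 GB
    "node-b": (True, 0.0),    # already holds the data
    "node-c": (False, 10.0),  # would need 50 s
}
print(pick_node(nodes, 500))  # node-b
```

If `node-b` is removed from the candidates, the same function picks `node-c`, since fetching 500 GB at 10 GB/s takes 50 seconds instead of 500.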
SIG Storage continues to tackle some of the hardest problems in Kubernetes: keeping stateful applications running reliably, making storage operations transparent and composable, and now scaling up to meet the demands of AI-era workloads. Whether you're a user managing databases in production or a developer curious about storage internals, there's a place for you in SIG Storage.
If you'd like to get involved, check out the SIG Storage community page and join the bi-weekly meetings. You can also find the SIG on Slack at #sig-storage.
15 Jun 2026 12:00am GMT
01 Jun 2026
Kubernetes Blog
From Kubernetes Dashboard to Headlamp: Understanding the Transition
For many people, Kubernetes Dashboard was their first window into Kubernetes. It offered a simple visual way to see what was running in a cluster, inspect resources, and build confidence without relying on the command line. For years, it helped developers, students, and operators make sense of Kubernetes, and it served as an important onramp into the ecosystem.
The Kubernetes Dashboard project has now been archived. We deeply respect the work the team did and the role Dashboard played in making Kubernetes more approachable for so many users.
Headlamp builds on that foundation and carries it forward. It keeps the clarity of a visual interface while adding capabilities that match how Kubernetes is used today. This includes multi-cluster visibility, application-centric views, extensibility through plugins, and flexible deployment options that work both in-cluster and on the desktop.
This guide is meant to help you navigate that transition with confidence. Before diving into the mechanics of migration, we start with familiar ground by looking at how common Kubernetes Dashboard workflows map to Headlamp. We also cover what stays the same and what improves after the switch. The goal is not just to replace a tool, but to honor a user-centered legacy and help you land in a UI that can grow with you as your Kubernetes usage evolves.
Mapping Kubernetes Dashboard workflows to Headlamp
If you have used Kubernetes Dashboard before, many workflows in Headlamp will feel familiar. Headlamp does not introduce a new way of thinking. Instead, it builds on workflows users already know and extends them in practical ways. The focus is continuity. What worked before still works, with more room to grow.
Viewing workloads and resources
In Kubernetes Dashboard, most users started by browsing workloads like pods, deployments, services, and namespaces. Headlamp keeps this same starting point. Workloads are easy to find and inspect, and moving between namespaces and clusters is simpler. Resources are still organized in familiar ways, and navigation feels smoother, especially when you work across multiple environments.

Editing and interacting with resources
Like Kubernetes Dashboard, Headlamp lets you view and edit manifests directly in the UI based on your permissions. You can delete resources, scale workloads, or update configurations from the interface. All actions follow standard Kubernetes RBAC. If you could perform an action in Dashboard, you will find the same capability in Headlamp, with the same respect for access controls.
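If you want to confirm ahead of time which actions your account is allowed to perform, Kubernetes exposes this through the SelfSubjectAccessReview API. The sketch below only builds the request body; the namespace `demo` and the chosen verb and resource are placeholder assumptions, and nothing here is meant to describe how Headlamp itself performs its checks.

```python
import json


def access_review(verb: str, resource: str, namespace: str, group: str = "") -> dict:
    """Build a SelfSubjectAccessReview asking: may I <verb> <resource> in <namespace>?"""
    return {
        "apiVersion": "authorization.k8s.io/v1",
        "kind": "SelfSubjectAccessReview",
        "spec": {
            "resourceAttributes": {
                "group": group,        # "" is the core API group (pods, services, ...)
                "resource": resource,
                "verb": verb,
                "namespace": namespace,
            }
        },
    }


# Placeholder example: can the current user patch Deployments in "demo"?
review = access_review("patch", "deployments", "demo", group="apps")
print(json.dumps(review, indent=2))
```

Submitting this object to the API server returns a `status.allowed` field. For a quick check from a terminal, `kubectl auth can-i patch deployments -n demo` asks the same question.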

Understanding relationships
Where Headlamp begins to expand the experience is in how it presents relationships between resources. In addition to list views, Headlamp offers visual ways to see how workloads, services, and configurations connect. This helps provide context without changing the underlying workflows users already rely on.

At a high level, the tasks you performed in Kubernetes Dashboard are still there. Headlamp keeps familiar workflows while making it easier to scale as clusters, teams, and applications grow.
Where Headlamp goes beyond Kubernetes Dashboard
Expanding from single cluster to multi-cluster workflows
Kubernetes Dashboard was designed to work with one cluster at a time. That model worked well for simple setups, but it became limiting as teams adopted multiple environments. Headlamp expands this view by letting you work with multiple clusters from a single interface without switching tools or losing context. This makes it easier to manage development, staging, and production environments side by side.

For teams running Kubernetes in more than one place, this shift reduces friction. You can stay oriented and move between clusters with confidence.
From resource lists to application context with Projects
Projects give you an application-centered way to view Kubernetes. Instead of jumping between lists, you can group related workloads, services, and supporting resources in one place. This makes applications easier to understand. You can see what belongs together, track changes in context, and troubleshoot without scanning the cluster piece by piece.
Projects are built on native Kubernetes concepts. Namespaces, labels, and RBAC continue to work the same way they always have. Headlamp adds a visual layer that brings related resources together.
Projects are optional. You can still work at the individual resource level when that fits your task. When you need more context, Projects help you step back and see the bigger picture.
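To illustrate the general idea of grouping resources by namespace and label, here is a small sketch. It is not Headlamp's implementation of Projects. It assumes resources carry the Kubernetes recommended label `app.kubernetes.io/part-of`, and the sample resources are invented for the example.

```python
from collections import defaultdict

PART_OF = "app.kubernetes.io/part-of"  # Kubernetes recommended label


def group_by_app(resources):
    """Group resources by (namespace, application label) for an app-centric view."""
    groups = defaultdict(list)
    for res in resources:
        meta = res["metadata"]
        app = meta.get("labels", {}).get(PART_OF, "(unlabeled)")
        groups[(meta["namespace"], app)].append(f'{res["kind"]}/{meta["name"]}')
    return dict(groups)


# Invented sample data standing in for objects listed from a cluster.
resources = [
    {"kind": "Deployment", "metadata": {"name": "web", "namespace": "shop", "labels": {PART_OF: "checkout"}}},
    {"kind": "Service", "metadata": {"name": "web", "namespace": "shop", "labels": {PART_OF: "checkout"}}},
    {"kind": "ConfigMap", "metadata": {"name": "misc", "namespace": "shop"}},
]

for (namespace, app), members in group_by_app(resources).items():
    print(namespace, app, members)
```

Because the grouping relies only on existing namespaces and labels, nothing in the cluster has to change for such a view to work.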

Extend the Headlamp UI with plugins
Headlamp can be extended through plugins that bring common workflows directly into the UI. Instead of switching tools, you work in one place with the same context.

For example, the Flux plugin brings GitOps workflows into Headlamp. It allows teams to view application state alongside the Kubernetes resources that Flux manages, making it easier to understand how changes in Git relate to what is running in the cluster.

The AI Assistant follows a similar pattern. It adds a conversational layer to the UI that helps users understand what they are seeing, troubleshoot issues, or take action. All of this happens in the same screen where the problem appears.

Building your own plugins
Plugins are optional and not limited to community-built extensions. Platform and project teams can also create their own plugins. This allows organizations to add custom integrations that match their specific workflows and internal tooling, while keeping the user experience consistent.
Choosing how and where Headlamp runs
Headlamp gives teams flexibility in how they use a Kubernetes UI. You can run it directly in a cluster, use it as a desktop application, or combine both approaches based on your needs.
Running Headlamp in-cluster works well for shared environments. It provides a centrally managed UI with controlled access and fits naturally into Kubernetes setups, following the same authentication and RBAC rules as other in-cluster components.

The desktop application is often a better fit for local development and onboarding. It also works well when you need to manage multiple clusters from one place. Users can connect using their existing kubeconfig without deploying anything into the cluster.

These options are not mutually exclusive. Many teams use the desktop app for day-to-day work, while relying on an in-cluster deployment for shared or production environments.
Preparing for the Migration
Before moving from Kubernetes Dashboard to Headlamp, it can be helpful to pause and take stock of how you use the Dashboard today. A little reflection up front can go a long way toward making the transition feel smooth and familiar.
Start by noting which clusters and namespaces you access and how authentication works. Headlamp relies on standard Kubernetes authentication and RBAC. In most cases, existing access models carry over without change. If users already connect using kubeconfig files or service accounts, they will be able to access the same resources in Headlamp.
It is also useful to think about the workflows that matter most to your team. Some users rely on Dashboard for quick inspection or troubleshooting, while others use it for lightweight edits or validation. Headlamp supports these same workflows and adds optional capabilities on top. Knowing what you rely on today makes the transition more predictable and helps your team approach it with confidence.
If you would like to explore Headlamp or try it out before migrating, you can learn more at headlamp.dev.
This post focused on understanding the transition and what to expect. A step-by-step migration guide is coming soon and will walk through installation and migration in detail.
01 Jun 2026 6:00pm GMT