15 May 2026

Kubernetes v1.36: New Metric for Route Sync in the Cloud Controller Manager

This article was originally published with the wrong date, and was later republished on 15 May 2026.

Kubernetes v1.36 introduces a new alpha counter metric route_controller_route_sync_total to the Cloud Controller Manager (CCM) route controller implementation at k8s.io/cloud-provider. This metric increments each time routes are synced with the cloud provider.

A/B testing watch-based route reconciliation

This metric was added to help operators validate the CloudControllerManagerWatchBasedRoutesReconciliation feature gate introduced in Kubernetes v1.35. That feature gate switches the route controller from a fixed-interval loop to a watch-based approach that only reconciles when nodes actually change. This reduces unnecessary API calls to the infrastructure provider, lowering pressure on rate-limited APIs and allowing operators to make more efficient use of their available quota.

To A/B test this, compare route_controller_route_sync_total with the feature gate disabled (default) versus enabled. In clusters where node changes are infrequent, you should see a significant drop in the sync rate with the feature gate turned on.
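
For example, you can compare the two configurations by charting the per-minute sync rate; a simple PromQL query for this:

# average number of route syncs per minute over the past hour
rate(route_controller_route_sync_total[1h]) * 60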

Example: expected behavior

With the feature gate disabled (the default fixed-interval loop), the counter increments steadily regardless of whether any node changes occurred:

# After 10 minutes with no node changes
route_controller_route_sync_total 60
# After 20 minutes, still no node changes
route_controller_route_sync_total 120

With the feature gate enabled (watch-based reconciliation), the counter only increments when nodes are actually added, removed, or updated:

# After 10 minutes with no node changes
route_controller_route_sync_total 1
# After 20 minutes, still no node changes - counter unchanged
route_controller_route_sync_total 1
# A new node joins the cluster - counter increments
route_controller_route_sync_total 2

The difference is especially visible in stable clusters where nodes rarely change.

Where can I give feedback?

If you have feedback, feel free to reach out through any of the following channels:

How can I learn more?

For more details, refer to KEP-5237.

15 May 2026 6:35pm GMT

Kubernetes v1.36: Mixed Version Proxy Graduates to Beta

Back in Kubernetes 1.28, we introduced the Mixed Version Proxy (MVP) as an Alpha feature (under the feature gate UnknownVersionInteroperabilityProxy), described in a previous blog post. The goal was simple but critical: make cluster upgrades safer by ensuring that requests for resources not yet known to an older API server are correctly routed to a newer peer API server, instead of returning an incorrect 404 Not Found.

We are excited to announce that the Mixed Version Proxy is moving to Beta in Kubernetes 1.36 and will be enabled by default! The feature has evolved significantly since its initial release, addressing key gaps and modernizing its architecture.

Here is a look at how the feature has evolved and what you need to know to leverage it in your clusters.

What problem are we solving?

In a highly available control plane undergoing an upgrade, you often have API servers running different versions. These servers might serve different sets of APIs (Groups, Versions, Resources). Without MVP, if a client request lands on an API server that does not serve the requested resource (e.g., a new API version introduced in the upgrade), that server returns a 404 Not Found. This is technically incorrect because the resource is available in the cluster, just not on that specific server. This can lead to serious side effects, such as mistaken garbage collection or blocked namespace deletions. MVP solves this by proxying the request to a peer API server that can serve it.

sequenceDiagram
  participant Client
  participant API_Server_A as API Server A (Older/Different)
  participant API_Server_B as API Server B (Newer/Capable)
  Client->>API_Server_A: 1. Request for Resource (e.g., v2)
  Note over API_Server_A: Determines it cannot serve locally
  API_Server_A->>API_Server_A: 2. Looks up capable peer in Discovery Cache
  API_Server_A->>API_Server_B: 3. Proxies request (adds x-kubernetes-peer-proxied header)
  API_Server_B->>API_Server_B: 4. Processes request locally
  API_Server_B-->>API_Server_A: 5. Returns Response
  API_Server_A-->>Client: 6. Forwards Response

How has it evolved since 1.28?

The initial Alpha implementation was a great proof of concept, but it had some limitations and relied on older mechanisms. Here is how we have modernized it for Beta:

  1. From StorageVersion API to Aggregated Discovery. In the Alpha version, API servers relied on the StorageVersion API to figure out which peers served which resources. While functional, this approach had a significant limitation: the StorageVersion API is not yet supported for CRDs and aggregated APIs. For Beta, we have replaced the reliance on StorageVersion API calls with the use of Aggregated Discovery. API servers now use the aggregated discovery data to dynamically understand the capabilities of their peers.

  2. The Missing Piece: Peer-Aggregated Discovery. The 1.28 blog post noted a significant gap: while we could proxy resource requests, discovery requests still only showed what the local API server knew about. In 1.36, we have added Peer-Aggregated Discovery support! Now, when a client performs discovery (e.g., listing available APIs), the API server merges its local view with the discovery data from all active peers. This provides clients with a complete, unified view of all APIs available across the entire cluster, regardless of which API server they connected to.

sequenceDiagram
  participant Client
  participant API_Server_A as API Server A
  participant API_Server_B as API Server B
  Client->>API_Server_A: 1. Request Discovery Document
  API_Server_A->>API_Server_A: 2. Gets Local APIs
  API_Server_A->>API_Server_B: 3. Gets Peer APIs (Cached or Direct)
  API_Server_A->>API_Server_A: 4. Merges and sorts lists deterministically
  API_Server_A-->>Client: 5. Returns Unified Discovery Document

While peer-aggregated discovery is the default behavior (it is enabled when the --peer-ca-file flag is set; otherwise the server falls back to showing only its local APIs), there may be cases where you need to inspect only the resources served by the specific API server you are connected to. You can request this non-aggregated view by including the profile=nopeer parameter in your request's Accept header (e.g., Accept: application/json;g=apidiscovery.k8s.io;v=v2;as=APIGroupDiscoveryList;profile=nopeer).
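
One way to fetch this non-aggregated view is via kubectl proxy, which forwards the Accept header unchanged; a sketch:

kubectl proxy --port=8001 &
curl -s http://127.0.0.1:8001/apis \
 -H 'Accept: application/json;g=apidiscovery.k8s.io;v=v2;as=APIGroupDiscoveryList;profile=nopeer'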

Required configuration

While the feature gate will be enabled by default, it requires certain flags to be set to allow for secure communication between peer API servers. To function correctly, make sure your API server is configured with the following flags:
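
At a minimum, this means configuring the peer trust and routing flags on kube-apiserver; the values below are illustrative:

# CA bundle used to verify the serving certificates of peer API servers
--peer-ca-file=/etc/kubernetes/pki/ca.crt
# address and port that peers should use to reach this API server,
# if they cannot be derived automatically (illustrative values)
--peer-advertise-ip=192.0.2.10
--peer-advertise-port=6443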

Configuring with kubeadm

If you manage your cluster with kubeadm, you can configure these flags in your ClusterConfiguration file:

apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
apiServer:
  extraArgs:
  # v1beta4 expects extraArgs as a list of name/value pairs
  - name: peer-ca-file
    value: "/etc/kubernetes/pki/ca.crt"
  # add peer-advertise-ip and peer-advertise-port here if needed

Call to action

If you are running multi-master clusters and upgrading them regularly, the Mixed Version Proxy is a major safety improvement. With it becoming default in 1.36, we encourage you to:

  1. Review your API server flags to ensure --peer-ca-file is set properly.
  2. Test the feature in your staging environments as you prepare for the 1.36 upgrade.
  3. Provide feedback to SIG API Machinery (Slack, mailing list, or by attending SIG API Machinery meetings) on your experience.

15 May 2026 6:00pm GMT

14 May 2026

Kubernetes v1.36: Deprecation and removal of Service ExternalIPs

The .spec.externalIPs field for Service was an early attempt to provide cloud-load-balancer-like functionality for non-cloud clusters. Unfortunately, the API assumes that every user in the cluster is fully trusted, and in any situation where that is not the case, it enables various security exploits, as described in CVE-2020-8554.

Since Kubernetes 1.21, the Kubernetes project has recommended that all users disable .spec.externalIPs. To make that easier, Kubernetes also added an admission controller (DenyServiceExternalIPs) that can be enabled to do this. At the time, SIG Network felt that blocking the functionality by default was too large a breaking change to consider.

However, the security problems are still there, and as a project we're increasingly unhappy with the "insecure by default" state of the feature. Additionally, there are now several better alternatives for non-cloud clusters wanting load-balancer-like functionality.

As a result, the .spec.externalIPs field for Service is now formally deprecated in Kubernetes 1.36. We expect that a future minor release of Kubernetes will drop implementation of the behavior from kube-proxy, and will update the Kubernetes conformance criteria to require that conforming implementations do not provide support.

A note on terminology, and what hasn't been deprecated

The phrase external IP is somewhat overloaded in Kubernetes. It can refer to the .spec.externalIPs field of a Service, but it can also refer to, for example, the external IP addresses assigned to type: LoadBalancer Services and reported in their .status.

This deprecation is about the first of those: the .spec.externalIPs field. If you are not setting the field externalIPs in any of your Services, then it does not apply to you.

That said, as a precaution, you may still want to enable the DenyServiceExternalIPs admission controller to block any future use of the externalIPs field.
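
If you manage the API server flags directly, this is a one-line change (combine it with any admission plugins you already enable):

kube-apiserver --enable-admission-plugins=DenyServiceExternalIPs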

Alternatives to externalIPs

If you are using .spec.externalIPs, then there are several alternatives.

Consider a Service like the following:

apiVersion: v1
kind: Service
metadata:
  name: my-example-service
spec:
  type: ClusterIP
  selector:
    app.kubernetes.io/name: my-example-app
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080
  externalIPs:
  - "192.0.2.4"

Using manually-managed LoadBalancer Services instead of externalIPs

The easiest (but also worst) option is to just switch from using externalIPs to using a type: LoadBalancer service, and assigning a load balancer IP by hand. This is essentially the same as externalIPs, with one important difference: the load balancer IP is part of the Service's .status, not its .spec, and in a cluster with RBAC enabled, it can't be edited by ordinary users by default. Thus, this replacement for externalIPs would only be available to users who were given permission by the admins (although those users would then be fully empowered to replicate CVE-2020-8554; there would still not be any further checks to ensure that one user wasn't stealing another user's IPs, etc.)

Because of the way that .status works in Kubernetes, you must create the Service without a load balancer IP, and then add the IP as a second step:

$ cat loadbalancer-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: my-example-service
spec:
  # prevent any real load balancer controllers from managing this service
  # by using a non-existent loadBalancerClass
  loadBalancerClass: non-existent-class
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: my-example-app
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080
$ kubectl apply -f loadbalancer-service.yaml
service/my-example-service created
$ kubectl patch service my-example-service --subresource=status --type=merge -p '{"status":{"loadBalancer":{"ingress":[{"ip":"192.0.2.4"}]}}}'

Using a non-cloud based load balancer controller

Although LoadBalancer services were originally designed to be backed by cloud load balancers, Kubernetes can also support them on non-cloud platforms by using a third-party load balancer controller such as MetalLB. This solves the security problems associated with externalIPs because the administrator can configure what ranges of IP addresses the controller will assign to services, and the controller will ensure that two services can't both use the same IP.

So, for example, after installing and configuring MetalLB, a cluster administrator could configure a pool of IP addresses for use in the cluster:

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: production
  namespace: metallb-system
spec:
  addresses:
  - 192.0.2.0/24
  autoAssign: true
  avoidBuggyIPs: false

After which a user can create a type: LoadBalancer Service and MetalLB will handle the assignment of the IP address. MetalLB even supports the deprecated loadBalancerIP field in Service, so the end user can request a specific IP (assuming it is available) for backward-compatibility with the externalIPs approach, rather than being assigned one at random:

apiVersion: v1
kind: Service
metadata:
  name: my-example-service
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: my-example-app
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080
  loadBalancerIP: "192.0.2.4"

Similar approaches would work with other load balancer controllers. This approach gives cluster administrators, rather than users, control over which IP addresses are assigned.

Using Gateway API

Another potential solution is to use an implementation of the Gateway API.

Gateway API allows cluster administrators to define a Gateway resource, which can have an IP address attached to it via the .spec.addresses field. Since Gateway resources are designed to be managed by cluster administrators, RBAC rules can be put in place to only allow privileged users to manage them.

An example of how this could look is:

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: example-gateway
spec:
  gatewayClassName: example-gateway-class
  addresses:
  - type: IPAddress
    value: "192.0.2.4"
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: example-route
spec:
  parentRefs:
  - name: example-gateway
  rules:
  - backendRefs:
    - name: example-svc
      port: 80
---
apiVersion: v1
kind: Service
metadata:
  name: example-svc
spec:
  type: ClusterIP
  selector:
    app.kubernetes.io/name: example-app
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080

The Gateway API project is the next generation of Kubernetes Ingress, Load Balancing, and Service Mesh APIs within Kubernetes. Gateway API was designed to fix the shortcomings of the Service and Ingress resources, making it a reliable, robust solution that is under active development.

Timeline for externalIPs deprecation

The rough timeline for this deprecation is as follows:

  1. With the release of Kubernetes 1.36, the field was deprecated; Kubernetes now emits warnings when a user uses this field.
  2. About a year later (v1.40 at the earliest), support for .spec.externalIPs will be disabled in kube-proxy, but users will have a way to opt back in should they require more time to migrate away.
  3. About another year later (v1.43 at the earliest), support will be disabled completely; users won't have a way to opt back in.

14 May 2026 6:35pm GMT

13 May 2026

Kubernetes v1.36: Advancing Workload-Aware Scheduling

AI/ML and batch workloads introduce unique scheduling challenges that go beyond simple Pod-by-Pod scheduling. In Kubernetes v1.35, we introduced the first tranche of workload-aware scheduling improvements, featuring the foundational Workload API alongside basic gang scheduling support built on a Pod-based framework, and an opportunistic batching feature to efficiently process identical Pods.

Kubernetes v1.36 introduces a significant architectural evolution by cleanly separating API concerns: the Workload API acts as a static template, while the new PodGroup API handles the runtime state. To support this, the kube-scheduler features a new PodGroup scheduling cycle that enables atomic workload processing and paves the way for future enhancements. This release also debuts the first iterations of topology-aware scheduling and workload-aware preemption to advance scheduling capabilities. Additionally, ResourceClaim support for workloads unlocks Dynamic Resource Allocation (DRA) for PodGroups. Finally, to demonstrate real-world readiness, v1.36 delivers the first phase of integration between the Job controller and the new API.

Workload and PodGroup API updates

The Workload API now serves as a static template, while the new PodGroup API describes the runtime object. Kubernetes v1.36 introduces the Workload and PodGroup APIs as part of the scheduling.k8s.io/v1alpha2 API group, completely replacing the previous v1alpha1 API version.

In v1.35, Pod groups and their runtime states were embedded within the Workload resource. The new model decouples these concepts: the Workload now serves as a static template object, while the PodGroup manages the runtime state. This separation also improves performance and scalability as the PodGroup API allows per-replica sharding of status updates.

Because the Workload API acts merely as a template, the kube-scheduler's logic is streamlined. The scheduler can directly read the PodGroup, which contains all the information required by the scheduler, without needing to watch or parse the Workload object itself.

Here is what the updated configuration looks like. Workload controllers (such as the Job controller) define the Workload object, which now acts as a static template for your Pod groups:

apiVersion: scheduling.k8s.io/v1alpha2
kind: Workload
metadata:
  name: training-job-workload
  namespace: some-ns
spec:
  # Pod groups are now defined as templates, which contain
  # the spec fields of the PodGroup objects.
  podGroupTemplates:
  - name: workers
    schedulingPolicy:
      gang:
        # The gang is schedulable only if 4 pods can run at once
        minCount: 4

Controllers then stamp out runtime PodGroup instances based on those templates. The PodGroup runtime object holds the actual scheduling policy and references the template from which it was created. It also has a status containing conditions that mirror the states of individual Pods, reflecting the overall scheduling state of the group:

apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup
metadata:
  name: training-job-workers-pg
  namespace: some-ns
spec:
  # The PodGroup references the Workload template it originated from.
  # In comparison, .metadata.ownerReferences points to the "true" workload
  # object, e.g., a Job.
  podGroupTemplateRef:
    workload:
      workloadName: training-job-workload
      podGroupTemplateName: workers
  # The actual scheduling policy is placed inside the runtime PodGroup
  schedulingPolicy:
    gang:
      minCount: 4
status:
  # The status contains conditions mirroring individual Pod conditions.
  conditions:
  - type: PodGroupScheduled
    status: "True"
    lastTransitionTime: "2026-04-03T00:00:00Z"

Finally, to bridge this new architecture with individual Pods, the workloadRef field in the Pod API has been replaced with the schedulingGroup field. When creating Pods, you link them directly to the runtime PodGroup:

apiVersion: v1
kind: Pod
metadata:
  name: worker-0
  namespace: some-ns
spec:
  # The workloadRef field has been replaced by schedulingGroup
  schedulingGroup:
    podGroupName: training-job-workers-pg
  ...

By keeping the Workload as a static template and elevating the PodGroup to a first-class, standalone API, we establish a robust foundation for building advanced workload scheduling capabilities in future Kubernetes releases.

PodGroup scheduling cycle and gang scheduling

To efficiently manage these workloads, the kube-scheduler now features a dedicated PodGroup scheduling cycle. Instead of evaluating and reserving resources sequentially Pod-by-Pod, which risks scheduling deadlocks, the scheduler evaluates the group as a unified operation.

When the scheduler pops a PodGroup member from the scheduling queue, regardless of the group's specific policy, it fetches the rest of the queued Pods for that group, sorts them deterministically, and executes an atomic scheduling cycle as follows:

  1. The scheduler takes a single snapshot of the cluster state to prevent race conditions and ensure consistency while evaluating the entire group.

  2. It then attempts to find valid Node placements for all Pods in the group using a PodGroup scheduling algorithm, which leverages the standard Pod-based filtering and scoring phases.

  3. Based on the algorithm's outcome, the scheduling decision is applied atomically for the entire PodGroup.

    • Success: If the placement is found and group constraints are met, the schedulable member Pods are moved directly to the binding phase together. Any remaining unschedulable Pods are returned to the scheduling queue to wait for available resources so they can join the already scheduled Pods.

      (Note: If new Pods are added to a PodGroup after others are already scheduled, the cycle evaluates the new Pods while accounting for the existing ones. Crucially, Pods already assigned to Nodes remain running. The scheduler will not unassign or evict them, even if the group fails to meet its requirements in subsequent cycles.)

    • Failure: If the group fails to meet its requirements, the entire group is considered unschedulable. None of the Pods are bound, and they are returned to the scheduling queue to retry later after a backoff period.

This cycle acts as the foundation for gang scheduling. When your workload requires strict all-or-nothing placement, the gang policy leverages this cycle to prevent partial deployments that lead to resource wastage and potential deadlocks.

While the scheduler still holds the Pods at PreEnqueue until the minCount requirement is met, the actual scheduling phase now relies entirely on the new PodGroup cycle. Specifically, during the algorithm's execution, the scheduler verifies that the number of schedulable Pods satisfies minCount. If the cluster cannot accommodate the required minimum, none of the Pods are bound; the group fails and waits for sufficient resources to free up.

Limitations

The first version of the PodGroup scheduling cycle comes with certain limitations:

In addition to the above, for cases involving intra-group dependencies (e.g., when the schedulability of one Pod depends on another group member via inter-Pod affinity), this algorithm may fail to find a placement regardless of cluster state due to its deterministic processing order.

Topology-aware scheduling

For complex distributed workloads like AI/ML training or batch processing, placing Pods randomly across a cluster can introduce significant network latency and bottleneck overall performance.

Topology-aware scheduling addresses this problem by allowing you to define topology constraints directly on a PodGroup, ensuring its Pods are co-located within specific physical or logical domains:

apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup
metadata:
  name: topology-aware-workers-pg
spec:
  schedulingPolicy:
    gang:
      minCount: 4
  # Enforce that the pods are co-located based on the rack topology
  schedulingConstraints:
    topology:
    - key: topology.kubernetes.io/rack

In this example, the kube-scheduler attempts to schedule the Pods across various combinations of Nodes that match the rack topology constraint. It then selects the optimal placement based on how efficiently the PodGroup utilizes resources and how many Pods can successfully be scheduled within that domain.

To achieve this, the scheduler extends the PodGroup scheduling cycle with a dedicated placement-based algorithm consisting of three phases:

  1. Generate candidate placements (subsets of Nodes that are theoretically feasible for the PodGroup's assignment) based on the group's scheduling constraints. The topology-aware scheduling plugin uses the new PlacementGenerate extension point to create these placements.

  2. Evaluate each proposed placement to confirm whether the entire PodGroup can actually fit there.

  3. Score all feasible placements to select the best fit for the PodGroup. The topology-aware scheduling plugins use the new PlacementScore extension point to score these placements.

Currently, topology-aware scheduling does not trigger Pod preemption to satisfy constraints. However, we plan to integrate workload-aware preemption with topology constraints in the upcoming release.

While Kubernetes v1.36 delivers this foundational topology-aware scheduling, the Kubernetes project is planning to expand its capabilities soon. Future updates will introduce support for multiple topology levels, soft constraints (preferences), deeper integration with Dynamic Resource Allocation (DRA), and more robust behavior when paired with the basic scheduling policy.

Workload-aware preemption

To support the new PodGroup scheduling cycle, Kubernetes v1.36 introduces a new preemption mechanism called workload-aware preemption. When a PodGroup cannot be scheduled, the scheduler uses this mechanism to try to make scheduling that PodGroup possible.

Compared to the default preemption used in the standard Pod-by-Pod scheduling cycle, this new mechanism treats the entire PodGroup as a single preemptor unit. Instead of evaluating preemption victims on each Node separately, it searches across the entire cluster. This allows the scheduler to preempt Pods from multiple Nodes simultaneously, making enough space to schedule the whole PodGroup afterwards.

Workload-aware preemption also introduces two additional concepts directly to the PodGroup API: a group-level priority, expressed through the priority and priorityClassName fields, and a disruptionMode field that controls whether the group may only be preempted as a whole.

In Kubernetes v1.36, these fields are only respected by the workload-aware preemption mechanism. The contributors working on this set of features hope to extend support for these fields to other disruption sources, including the default preemption used in the Pod-by-Pod scheduling cycle, in future releases.

apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup
metadata:
  name: victim-pg
spec:
  priorityClassName: high-priority
  priority: 1000
  disruptionMode: PodGroup

In this example, when the scheduler evaluates victim-pg as a potential preemption victim during a workload-aware preemption cycle, it will use 1000 as its priority and preempt the PodGroup in a strictly all-or-nothing fashion.

DRA ResourceClaim support for workloads

Since its general availability in Kubernetes v1.34, DRA has enabled Pods to make detailed requests for devices like GPUs, TPUs, and NICs. Requested devices can be shared by multiple Pods requesting the same ResourceClaim by name. Other requests can be replicated through a ResourceClaimTemplate, in which case Kubernetes generates one ResourceClaim with a non-deterministic name for each Pod referencing the template. However, large-scale workloads that require certain Pods to share certain devices have so far been left to create and manage individual ResourceClaims themselves.

Now, in addition to Pods, PodGroups can represent the replicable unit for a ResourceClaimTemplate. For ResourceClaimTemplates referenced by one of a PodGroup's spec.resourceClaims, Kubernetes generates one ResourceClaim for the entire PodGroup, no matter how many Pods are in the group. When one of a Pod's spec.resourceClaims for a ResourceClaimTemplate matches one of its PodGroup's spec.resourceClaims, the Pod's claim resolves to the ResourceClaim generated for the PodGroup and a ResourceClaim will not be generated for that individual Pod. A single PodGroupTemplate in a Workload object can express resource requests which are both copied for each distinct PodGroup and shareable by the Pods within each group.

The following example shows two Pods requesting the same ResourceClaim generated from a ResourceClaimTemplate for their PodGroup:

apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup
metadata:
  name: training-job-workers-pg
spec:
  ...
  resourceClaims:
  - name: pg-claim
    resourceClaimTemplateName: my-claim-template
---
apiVersion: v1
kind: Pod
metadata:
  name: training-job-workers-pg-pod-1
spec:
  ...
  schedulingGroup:
    podGroupName: training-job-workers-pg
  resourceClaims:
  - name: pg-claim
    resourceClaimTemplateName: my-claim-template
---
apiVersion: v1
kind: Pod
metadata:
  name: training-job-workers-pg-pod-2
spec:
  ...
  schedulingGroup:
    podGroupName: training-job-workers-pg
  resourceClaims:
  - name: pg-claim
    resourceClaimTemplateName: my-claim-template

In addition, ResourceClaims referenced by PodGroups, either through resourceClaimName or the claim generated from resourceClaimTemplateName, become reserved for the entire PodGroup. Previously, the kube-scheduler could only list individual Pods in a ResourceClaim's status.reservedFor field, which is limited to 256 items. Now, a single PodGroup reference in status.reservedFor can represent many more than 256 Pods, allowing high-cardinality sharing of devices.
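
For illustration, a group reservation recorded in a claim's status might look roughly like this (a sketch; the uid is a placeholder):

status:
  reservedFor:
  - apiGroup: scheduling.k8s.io
    resource: podgroups
    name: training-job-workers-pg
    uid: 00000000-0000-0000-0000-000000000000 # placeholder UID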

Together, these changes enable massive workloads with complex topologies to utilize DRA for scalable device management.

Integration with the Job controller

In Kubernetes v1.36, the Job controller can create and manage Workload and PodGroup objects on your behalf, so that Jobs representing a tightly coupled parallel application, such as distributed AI training, are gang-scheduled without any additional tooling. Without this integration, you would have to create the Workload and PodGroup yourself and wire their references into the Pod template. Now, the Job controller automates this process natively.

When the WorkloadWithJob feature gate is enabled, the Job controller automatically creates a Workload and a PodGroup owned by the Job, and wires each Pod it creates to the generated PodGroup through the Pod's schedulingGroup field.

When does the integration kick in?

To keep the first feature iteration predictable, the Job controller only creates a Workload and PodGroup when the Job has a well-defined, fixed shape:

  1. The Job uses completionMode: Indexed.
  2. The Job's parallelism is equal to its completions.
  3. The Pod template does not already set the schedulingGroup field.

These conditions describe the class of Jobs that gang scheduling can reason about: each Pod has a stable identity (Indexed), the gang size is known and fixed at admission time (parallelism == completions), and no other controller has already claimed scheduling responsibility (schedulingGroup field is unset). Jobs that do not meet these conditions are scheduled Pod-by-Pod, exactly as before.

If you set schedulingGroup on the Pod template yourself (for example, because a higher-level controller is managing the workload), the Job controller leaves the Pod template alone and does not create its own Workload or PodGroup. This makes the feature safe to enable in clusters that already use an external batch system.

Here is an example of a Job that qualifies for gang scheduling:

apiVersion: batch/v1
kind: Job
metadata:
  name: training-job
  namespace: job-ns
spec:
  completionMode: Indexed
  parallelism: 4
  completions: 4
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: registry.example/trainer:latest

The Job controller creates a Workload and a PodGroup owned by this Job, and every Pod it creates carries a .spec.schedulingGroup that points at the generated PodGroup. The Pods are then scheduled together once all four can be placed at the same time using the PodGroup scheduling cycle described earlier in this post.

What's not covered yet

The current constraints limit this integration to static, indexed, fully-parallel Jobs. Support for additional workload shapes, including elastic Jobs and other built-in controllers, is tracked in KEP-5547.

In future Kubernetes releases, this integration will expand to support additional workload controllers, and the current constraints for Jobs may be relaxed.

What's next?

The journey for workload-aware scheduling doesn't stop here. For v1.37, the community is actively working on:

The priority and implementation order of these focus areas are subject to change. Stay tuned for further updates.

Getting started

All below workload-aware scheduling improvements are available as Alpha features in v1.36. To try them out, you must configure the following:

Once the prerequisite is met, you can enable specific features:

We encourage you to try out workload-aware scheduling in your test clusters and share your experiences to help shape the future of Kubernetes scheduling. You can send your feedback by:

Learn more

To dive deeper into the architecture and design of these features, read the KEPs:

13 May 2026 6:35pm GMT

12 May 2026

Kubernetes v1.36: PSI Metrics for Kubernetes Graduates to GA

Since its original implementation in the Linux kernel in 2018, Pressure Stall Information (PSI) has provided users with the high-fidelity signals needed to identify resource saturation before it becomes an outage. Unlike traditional utilization metrics, PSI tells the story of tasks stalled and time lost, all in nicely-packaged percentages of time across the CPU, memory, and I/O.

With the recent release of Kubernetes v1.36, users across the ecosystem have a stable, reliable interface to observe resource contention at the node, pod, and container levels. In this post, we will dive into the improvements and performance testing that proved its readiness for production.

Beyond utilization: why PSI?

Monitoring CPU or memory usage alone can be misleading. A node may report CPU utilization below 100% while certain tasks are experiencing severe latency due to scheduling delays. PSI fills this gap by providing:

Proving stability: performance testing at scale

A common concern when graduating telemetry features is the resource overhead required to collect and serve the metrics. To address this, SIG Node conducted extensive performance validation on high-density workloads (80+ pods) across various machine types.

Our testing focused on two primary scenarios to isolate the impact of the Kubelet and kernel-level collection respectively:

  1. Kernel PSI ON / Kubelet Feature OFF vs Kernel PSI ON / Kubelet Feature ON (Kubelet overhead)
  2. Kernel PSI OFF / Kubelet Feature ON vs Kernel PSI ON / Kubelet Feature ON (Kernel overhead)

Scenario 1: The Kubelet Overhead

First, we looked at kubelet usage on 4-core machines (Case 1). For these, the Linux kernel was already tracking pressure on both clusters by default (psi=1), but we toggled the KubeletPSI feature gate to see whether the kubelet actively querying and exposing these metrics impacted resource usage. The synchronized bursts seen in the graph are practically identical in both magnitude and frequency, confirming that the kubelet's collection logic is highly lightweight and blends seamlessly into standard housekeeping cycles. The feature does not noticeably change pre-existing resource use, which stays within the normal 0.1 cores (2.5% of total node capacity), so it is safe for production-scale deployments.

Figure 2: (Case 1) Kubelet CPU usage rate over elapsed time, Kubelet PSI feature off versus on (kernel PSI always on).

Next, we evaluated the system overhead in the same run. As seen in the following graph, the System CPU usage line for the Kubelet PSI-enabled cluster (red) follows the same pattern as the Kubelet PSI-disabled cluster (blue), with a slight expected increase from the baseline. This shows that once the OS is already tracking PSI, at around 2.5 cores, the act of Kubernetes reading those cgroup metrics is negligible to performance.

Figure 1: (Case 1) Node system CPU usage rate over elapsed time, Kubelet PSI feature off versus on (kernel PSI default on).

Scenario 2: The Kernel Overhead

Shifting gears, we evaluated the underlying overhead of enabling PSI in the Linux kernel, also on a 4-core machine. By comparing a cluster booted with psi=1 (the COS default) against a cluster with psi=0, we isolated the exact cost of the OS-level bookkeeping. Even under heavy I/O and CPU load at an 80-pod density, the System CPU delta between the kernel-enabled and kernel-disabled clusters remained consistently between 0.037 and 0.125 cores (0.925% to 3.125% of total node capacity). There was a single spike to 0.225 cores (5.6%), but it was brought back down within a few seconds. This confirms that the internal kernel tracking is highly efficient under load.

Figure 3: (Case 2) Node system CPU usage rate over elapsed time, kernel PSI on versus off.

Figure 4 zooms in on the kubelet process itself, which serves as the primary collector for these metrics. The results show that even while the kubelet performs periodic sweeps to aggregate data from the cgroup hierarchy, its CPU usage remains remarkably low, with intermittent spikes and nothing exceeding 0.25 cores (6.25% of total capacity) for longer than a second.

Figure 4: (Case 2) Kubelet CPU usage rate over elapsed time, kernel PSI on versus off.

Improvements between beta (1.34) and stable (1.36)

Getting started

To use PSI metrics in your Kubernetes cluster, your nodes must meet the following requirements:

  1. Ensure your nodes are running a Linux kernel version 4.20 or later and are using cgroup v2.
  2. Ensure PSI is enabled at the OS level (your kernel must be compiled with CONFIG_PSI=y and must not be booted with the psi=0 parameter).
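
You can verify the second requirement directly on a node; if PSI is enabled, the pressure files exist and are readable (output values are illustrative):

cat /proc/pressure/memory
some avg10=0.00 avg60=0.00 avg300=0.00 total=11220
full avg10=0.00 avg60=0.00 avg300=0.00 total=9895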

As of v1.36, Kubelet PSI metrics are generally available and you do not need to opt in to any feature gate.

Once the OS prerequisites are met, you can start scraping the /metrics/cadvisor endpoint with your Prometheus-compatible monitoring solution or query the Summary API to collect and visualize the new PSI metrics. Note that PSI is a Linux-kernel feature, so these metrics are not available on Windows nodes. Your cluster can contain a mix of Linux and Windows nodes, and on the Windows nodes, the kubelet will simply omit the PSI metrics.

If your cluster is running a recent enough version of Kubernetes and you are a privileged node administrator, you can also proxy to the kubelet's HTTP API via the control plane's API server to see real-time pressure data from the Summary API.

Caution: Proxying to the kubelet is a privileged operation. Granting access to it is a security risk, so ensure you have the appropriate administrative permissions before executing these commands.

CONTAINER_NAME="example-container"
kubectl get --raw "/api/v1/nodes/$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')/proxy/stats/summary" | jq '.pods[].containers[] | select(.name=="'"$CONTAINER_NAME"'") | {name, cpu: .cpu.psi, memory: .memory.psi, io: .io.psi}'

Further reading

If you want to dive deeper into how these metrics are calculated and exposed, check out these resources:

  1. The official Kernel documentation
  2. Understanding PSI in the Kubernetes documentation
  3. cAdvisor Metrics Implementation

Acknowledgements

Support for PSI metrics was developed through the collaborative efforts of SIG Node. Special thanks to all contributors who helped design, implement, test, review, and document this feature across its journey from alpha in v1.33, through beta in v1.34, to GA in v1.36.

To provide feedback on this feature, join the Kubernetes Node Special Interest Group, participate in discussions on the public Slack channel (#sig-node), or file an issue on GitHub.

Feedback

If you have feedback and want to share your experience using this feature, join the discussion:

SIG Node would love to hear about your experiences using this feature in production!

12 May 2026 6:35pm GMT

08 May 2026

Kubernetes v1.36: Moving Volume Group Snapshots to GA

Volume group snapshots were introduced as an Alpha feature with the Kubernetes v1.27 release, moved to Beta in v1.32, and to a second Beta in v1.34. We are excited to announce that in the Kubernetes v1.36 release, support for volume group snapshots has reached General Availability (GA).

The support for volume group snapshots relies on a set of extension APIs for group snapshots. These APIs allow users to take crash-consistent snapshots for a set of volumes. Behind the scenes, Kubernetes uses a label selector to group multiple PersistentVolumeClaim objects for snapshotting. A key aim is to allow you to restore that set of snapshots to new volumes and recover your workload based on a crash-consistent recovery point.

This feature is only supported for CSI volume drivers.

An overview of volume group snapshots

Some storage systems provide the ability to create a crash-consistent snapshot of multiple volumes. A group snapshot represents copies made from multiple volumes that are taken at the same point-in-time. A group snapshot can be used either to rehydrate new volumes (pre-populated with the snapshot data) or to restore existing volumes to a previous state (represented by the snapshots).

Why add volume group snapshots to Kubernetes?

The Kubernetes volume plugin system already provides a powerful abstraction that automates the provisioning, attaching, mounting, resizing, and snapshotting of block and file storage. Underpinning all these features is the Kubernetes goal of workload portability.

There was already a VolumeSnapshot API that provides the ability to take a snapshot of a persistent volume to protect against data loss or data corruption. However, some storage systems support consistent group snapshots that allow a snapshot to be taken from multiple volumes at the same point-in-time to achieve write order consistency. This is extremely useful for applications that contain multiple volumes. For example, an application may have data stored in one volume and logs stored in another. If snapshots for these volumes are taken at different times, the application will not be consistent and will not function properly if restored from those snapshots.

While you can quiesce the application first and take individual snapshots sequentially, this process can be time-consuming or sometimes impossible. Consistent group support provides crash consistency across all volumes in the group without the need for application quiescence.

Kubernetes APIs for volume group snapshots

Kubernetes' support for volume group snapshots relies on three API kinds that are used for managing snapshots:

VolumeGroupSnapshot
Created by a Kubernetes user (or automation) to request creation of a volume group snapshot for multiple persistent volume claims.
VolumeGroupSnapshotContent
Created by the snapshot controller for a dynamically created VolumeGroupSnapshot. It contains information about the provisioned cluster resource (a group snapshot). The object binds to the VolumeGroupSnapshot for which it was created with a one-to-one mapping.
VolumeGroupSnapshotClass
Created by cluster administrators to describe how volume group snapshots should be created, including the driver information, the deletion policy, etc.

These three API kinds are defined as CustomResourceDefinitions (CRDs). For the GA release, the API version has been promoted to v1.
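
If you are upgrading from the Beta CRDs, you can confirm which API versions your cluster serves, for example:

kubectl get crd volumegroupsnapshots.groupsnapshot.storage.k8s.io \
 -o jsonpath='{.spec.versions[*].name}'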

What's new in GA?

How do I use Kubernetes volume group snapshots

Creating a new group snapshot with Kubernetes

Once a VolumeGroupSnapshotClass object is defined and you have volumes you want to snapshot together, you may request a new group snapshot by creating a VolumeGroupSnapshot object.

Label the PVCs you wish to group:

% kubectl label pvc pvc-0 group=myGroup
persistentvolumeclaim/pvc-0 labeled

% kubectl label pvc pvc-1 group=myGroup
persistentvolumeclaim/pvc-1 labeled

For dynamic provisioning, a selector must be set so that the snapshot controller can find PVCs with the matching labels to be snapshotted together.

apiVersion: groupsnapshot.storage.k8s.io/v1
kind: VolumeGroupSnapshot
metadata:
  name: snapshot-daily-20260422
  namespace: demo-namespace
spec:
  volumeGroupSnapshotClassName: csi-groupSnapclass
  source:
    selector:
      matchLabels:
        group: myGroup

The VolumeGroupSnapshotClass is required for dynamic provisioning:

apiVersion: groupsnapshot.storage.k8s.io/v1
kind: VolumeGroupSnapshotClass
metadata:
  name: csi-groupSnapclass
driver: example.csi.k8s.io
deletionPolicy: Delete

How to use group snapshot for restore

At restore time, request a new PersistentVolumeClaim to be created from a VolumeSnapshot object that is part of a VolumeGroupSnapshot. Repeat this for all volumes that are part of the group snapshot.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: examplepvc-restored-2026-04-22
  namespace: demo-namespace
spec:
  storageClassName: example-sc
  dataSource:
    name: snapshot-0962a745b2bf930bb385b7b50c9b08af471f1a16780726de19429dd9c94eaca0
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
  - ReadWriteOncePod
  resources:
    requests:
      storage: 100Mi

As a storage vendor, how do I add support for group snapshots?

To implement the volume group snapshot feature, a CSI driver must:

See the CSI spec and the Kubernetes-CSI Driver Developer Guide for more details.

How can I learn more?

How do I get involved?

This project, like all of Kubernetes, is the result of hard work by many contributors from diverse backgrounds working together. On behalf of SIG Storage, I would like to offer a huge thank you to all the contributors who stepped up over the years to help the project reach GA:

For those interested in getting involved with the design and development of CSI or any part of the Kubernetes Storage system, join the Kubernetes Storage Special Interest Group (SIG). We always welcome new contributors.

We also hold regular Data Protection Working Group meetings. New attendees are welcome to join our discussions.

08 May 2026 6:35pm GMT

07 May 2026

Kubernetes v1.36: More Drivers, New Features, and the Next Era of DRA

Dynamic Resource Allocation (DRA) has fundamentally changed how platform administrators handle hardware accelerators and specialized resources in Kubernetes. In the v1.36 release, DRA continues to mature, bringing a wave of feature graduations, critical usability improvements, and new capabilities that extend the flexibility of DRA to native resources like memory and CPU, and support for ResourceClaims in PodGroups.

Driver availability continues to expand. Beyond specialized compute accelerators, the ecosystem includes support for networking and other hardware types, reflecting a move toward a more robust, hardware-agnostic infrastructure.

Whether you are managing massive fleets of GPUs, need better handling of failures, or are simply looking for better ways to define resource fallback options, the upgrades to DRA in 1.36 have something for you. Let's dive into the new features and graduations!

Feature graduations

The community has been hard at work stabilizing core DRA concepts. In Kubernetes 1.36, several highly anticipated features have graduated to Beta and Stable.

Prioritized list (stable)

Hardware heterogeneity is a reality in most clusters. With the Prioritized list feature, you can confidently define fallback preferences when requesting devices. Instead of hardcoding a request for a specific device model, you can specify an ordered list of preferences (e.g., "Give me an H100, but if none are available, fall back to an A100"). The scheduler will evaluate these requests in order, drastically improving scheduling flexibility and cluster utilization.
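
As a sketch, a claim with such a fallback might look like the following (the device class name and attribute names are illustrative, not from a real driver):

apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: training-gpu
spec:
  devices:
    requests:
    - name: gpu
      firstAvailable:
      # evaluated in order; the first subrequest that can be satisfied wins
      - name: h100
        deviceClassName: gpu.example.com
        selectors:
        - cel:
            expression: device.attributes["gpu.example.com"].model == "h100"
      - name: a100
        deviceClassName: gpu.example.com
        selectors:
        - cel:
            expression: device.attributes["gpu.example.com"].model == "a100"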

Extended resource support (beta)

As DRA becomes the standard for resource allocation, bridging the gap with legacy systems is crucial. The DRA Extended resource feature allows users to request resources via traditional extended resources on a Pod. This allows for a gradual transition to DRA, meaning cluster operators can migrate clusters to DRA but let application developers adopt the ResourceClaim API on their own schedule.
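
For example, a DeviceClass can be mapped to an extended resource name so that Pods keep using the familiar resources syntax; a sketch with illustrative names:

apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: example-gpus
spec:
  # requests for this extended resource are satisfied via DRA
  extendedResourceName: example.com/gpu
  selectors:
  - cel:
      expression: device.driver == "gpu.example.com"
---
apiVersion: v1
kind: Pod
metadata:
  name: legacy-style-pod
spec:
  containers:
  - name: app
    image: registry.example/app:latest
    resources:
      limits:
        example.com/gpu: 1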

Partitionable devices (beta)

Hardware accelerators are powerful, and sometimes a single workload doesn't need an entire device. The Partitionable devices feature provides native DRA support for dynamically carving physical hardware into smaller, logical instances (such as Multi-Instance GPUs) based on workload demands. This allows administrators to safely and efficiently share expensive accelerators across multiple Pods.

Device taints (beta)

Just as you can taint a Kubernetes Node, you can apply taints directly to specific DRA devices. Device taints and tolerations empower cluster administrators to manage hardware more effectively. You can taint faulty devices to prevent them from being allocated to standard claims, or reserve specific hardware for dedicated teams, specialized workloads, and experiments. Ultimately, only Pods with matching tolerations are permitted to claim these tainted devices.
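
A taint can be applied out-of-band through a taint rule object; a sketch following the shape of the alpha API (names are illustrative, and the API version served in your cluster may differ now that the feature is beta):

apiVersion: resource.k8s.io/v1alpha3
kind: DeviceTaintRule
metadata:
  name: quarantine-gpu-0
spec:
  deviceSelector:
    driver: gpu.example.com
    pool: worker-1
    device: gpu-0
  taint:
    key: example.com/unhealthy
    effect: NoSchedule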

Device binding conditions (beta)

To improve scheduling reliability, the Kubernetes scheduler can use the Binding conditions feature to delay committing a Pod to a Node until its required external resources, such as attachable devices or FPGAs, are fully prepared. By explicitly modeling resource readiness, this prevents premature assignments that can lead to Pod failures, ensuring a much more robust and predictable deployment process.

Resource health status (beta)

Knowing when a device has failed or become unhealthy is critical for workloads running on specialized hardware. With Resource health status, Kubernetes exposes device health information directly in the Pod status, giving users and controllers crucial visibility to quickly identify and react to hardware failures. The feature includes support for human-readable health status messages, making it significantly easier to diagnose issues without the need to dive into complex driver logs.
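
In the Pod status, device health surfaces under the container's allocatedResourcesStatus, roughly like this (a sketch with illustrative names):

status:
  containerStatuses:
  - name: worker
    allocatedResourcesStatus:
    - name: claim:gpu-claim
      resources:
      - resourceID: gpu-0
        health: Unhealthy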

New Features

Beyond stabilizing existing capabilities, v1.36 introduces foundational new features that expand what DRA can do. These are alpha features, so they are behind feature gates that are disabled by default.

ResourceClaim support for workloads

To optimize large-scale AI/ML workloads that rely on strict topological scheduling, the ResourceClaim support for workloads feature enables Kubernetes to seamlessly manage shared resources across massive sets of Pods. By associating ResourceClaims or ResourceClaimTemplates with PodGroups, this feature eliminates previous scaling bottlenecks, such as the limit on the number of pods that can share a claim, and removes the burden of manual claim management from specialized orchestrators.

Node allocatable resources

Why should DRA only be for external accelerators? In v1.36, we are introducing the first iteration of using the DRA APIs to manage node allocatable infrastructure resources (like CPU and memory). By bringing CPU and memory allocation under the DRA umbrella with the DRA Node allocatable resources feature, users can leverage DRA's advanced placement, NUMA-awareness, and prioritization semantics for standard compute resources, paving the way for incredibly fine-grained performance tuning.

DRA resource availability visibility

One of the most requested features from cluster administrators has been better visibility into hardware capacity. The new Resource pool status feature allows you to query the availability of devices in DRA resource pools. By creating a ResourcePoolStatusRequest object, you get a point-in-time snapshot of device counts (total, allocated, available, and unavailable) for each pool managed by a given driver. This enables better integration with dashboards and capacity planning tools.

List types for attributes

ResourceClaim constraint evaluation has changed to work better with scalar and list values: matchAttribute now checks for a non-empty intersection, and distinctAttribute checks for pairwise disjoint values.

An includes() function has also been introduced in CEL that lets device selectors keep working when an attribute changes between scalar and list representations. (The includes() function is only available in DRA contexts for expression evaluation.)
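
For example, a selector can be written once and keep matching whether the (hypothetical) example.com/features attribute is reported as a single value or as a list:

device.attributes["example.com/features"].includes("p2p")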

Deterministic device selection

The Kubernetes scheduler has been updated to evaluate devices using lexicographical ordering based on resource pool and ResourceSlice names. This change empowers drivers to proactively influence the scheduling process, leading to improved throughput and more optimal scheduling decisions. The ResourceSlice controller toolkit automatically generates names that reflect the exact device ordering specified by the driver author.

Discoverable device metadata in containers

Workloads running on nodes with DRA devices often need to discover details about their allocated devices, such as PCI bus addresses or network interface configuration, without querying the Kubernetes API. With Device metadata, Kubernetes defines a standard protocol for how DRA drivers expose device attributes to containers as versioned JSON files at well-known paths. Drivers built with the DRA kubelet plugin library get this behavior transparently; they just provide the metadata and the library handles file layout, CDI bind-mounts, versioning, and lifecycle. This gives applications a consistent, driver-independent way to discover and consume device metadata, eliminating the need for custom controllers or looking up ResourceSlice objects to get metadata via attributes.

What's next?

This release introduced a wealth of new Dynamic Resource Allocation (DRA) features, and the momentum is only building. As we look ahead, our roadmap focuses on maturing existing features toward beta and stable releases while hardening DRA's performance, scalability, and reliability. A key priority over the coming cycles will be deep integration with workload-aware and topology-aware scheduling.

A big goal for us is to migrate users from Device Plugin to DRA, and we want you involved. Whether you are currently maintaining a driver or are just beginning to explore the possibilities, your input is vital. Partner with us to shape the next generation of resource management. Reach out today to collaborate on development, share feedback, or start building your first DRA driver.

Getting involved

A good starting point is joining the WG Device Management Slack channel and meetings, which happen at Americas/EMEA and EMEA/APAC friendly time slots.

Not all enhancement ideas are tracked as issues yet, so come talk to us if you want to help or have some ideas yourself! We have work to do at all levels, from difficult core changes to usability enhancements in kubectl, which could be picked up by newcomers.

07 May 2026 6:35pm GMT

06 May 2026

Kubernetes v1.36: Server-Side Sharded List and Watch

As Kubernetes clusters grow to tens of thousands of nodes, controllers that watch high-cardinality resources like Pods face a scaling wall. Every replica of a horizontally scaled controller receives the full stream of events from the API server, paying the CPU, memory, and network cost to deserialize everything, only to discard the objects it is not responsible for. Scaling out the controller does not reduce per-replica cost; it multiplies it.

Kubernetes v1.36 introduces server-side sharded list and watch as an alpha feature (KEP-5866). With this feature enabled, the API server filters events at the source so that each controller replica receives only the slice of the resource collection it owns.

The problem with client-side sharding

Some controllers, such as kube-state-metrics, already support horizontal sharding. Each replica is assigned a portion of the keyspace and discards objects that do not belong to it. While this works functionally, it does not reduce the volume of data flowing from the API server:

Server-side sharded list and watch solves this by moving the filtering upstream into the API server. Each replica tells the API server which hash range it owns, and the API server only sends matching events.

How it works

The feature adds a shardSelector field to ListOptions. Clients specify a hash range using the shardRange() function:

shardRange(object.metadata.uid, '0x0000000000000000', '0x8000000000000000')

The API server computes a deterministic 64-bit FNV-1a hash of the specified field and returns only objects whose hash falls within the range [start, end). This applies to both list responses and watch event streams. The hash function produces the same result across all API server instances, so the feature is safe to use with multiple API server replicas.

Currently supported field paths are object.metadata.uid and object.metadata.namespace.
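
Because the hash is a plain 64-bit FNV-1a, a client can compute shard membership itself; a minimal Go sketch, assuming the server hashes the raw bytes of the field value:

package main

import (
	"fmt"
	"hash/fnv"
)

// shardHash computes the 64-bit FNV-1a hash used for shard placement.
func shardHash(fieldValue string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(fieldValue)) // hash the raw field bytes
	return h.Sum64()
}

func main() {
	uid := "0f6c9f27-56c1-4b56-9f5e-2a2f0e2e8a9d" // example object UID
	if shardHash(uid) < 0x8000000000000000 {
		fmt.Println("owned by replica 0 (lower half of the hash space)")
	} else {
		fmt.Println("owned by replica 1 (upper half of the hash space)")
	}
}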

Using sharded watches in controllers

Controllers typically use informers to list and watch resources. To shard the workload, each replica injects the shardSelector into the ListOptions used by its informers via WithTweakListOptions:

import (
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/informers"
)

shardSelector := "shardRange(object.metadata.uid, '0x0000000000000000', '0x8000000000000000')"

factory := informers.NewSharedInformerFactoryWithOptions(client, resyncPeriod,
    informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
        opts.ShardSelector = shardSelector
    }),
)

For a 2-replica deployment, the selectors split the hash space in half:

// Replica 0: lower half of the hash space
"shardRange(object.metadata.uid, '0x0000000000000000', '0x8000000000000000')"

// Replica 1: upper half of the hash space
"shardRange(object.metadata.uid, '0x8000000000000000', '0x10000000000000000')"

A single replica can also cover non-contiguous ranges using ||:

"shardRange(object.metadata.uid, '0x0000000000000000', '0x4000000000000000') || " +
 "shardRange(object.metadata.uid, '0x8000000000000000', '0xc000000000000000')"

Verifying server support

When the API server honors a shard selector, the list response includes a shardInfo field in the response metadata that echoes back the applied selector:

{
  "kind": "PodList",
  "apiVersion": "v1",
  "metadata": {
    "resourceVersion": "10245",
    "shardInfo": {
      "selector": "shardRange(object.metadata.uid, '0x0000000000000000', '0x8000000000000000')"
    }
  },
  "items": [...]
}

If shardInfo is absent, the server did not honor the shard selector and the client received the complete, unfiltered collection. In this case, the client should be prepared to handle the full result set, for example by applying client-side filtering to discard objects outside its assigned shard range.
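
As a rough sketch of what that fallback filtering could look like, here is a Go helper that mirrors the server-side check, assuming the hash is a plain 64-bit FNV-1a over the field's string bytes (the exact byte encoding used by the API server is an assumption here, not something this post specifies):

package main

import (
	"fmt"
	"hash/fnv"
)

// inShard reports whether a field value (here, an object's UID) falls in
// the half-open hash range [start, end) owned by this replica.
func inShard(value string, start, end uint64) bool {
	h := fnv.New64a() // 64-bit FNV-1a, matching the hash described above
	h.Write([]byte(value))
	sum := h.Sum64()
	return sum >= start && sum < end
}

func main() {
	uid := "3f1c9a7e-5b2d-4c8a-9e6f-1a2b3c4d5e6f" // example UID
	// Replica 0 owns the lower half of the hash space.
	fmt.Println(inShard(uid, 0x0, 0x8000000000000000))
}

A replica that finds no shardInfo in a list response could run every object through a check like this before admitting it to its cache.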

Getting involved

This feature is in alpha and requires enabling the ShardedListAndWatch feature gate on the API server. We are looking for feedback from controller authors and operators running large clusters.

If you have questions or feedback, join the #sig-api-machinery channel on Kubernetes Slack.

06 May 2026 6:35pm GMT

05 May 2026

feedKubernetes Blog

Kubernetes v1.36: Declarative Validation Graduates to GA

In Kubernetes v1.36, Declarative Validation for Kubernetes native types has reached General Availability (GA).

For users, this means more reliable, predictable, and better-documented APIs. By moving to a declarative model, the project also unlocks the future ability to publish validation rules via OpenAPI and integrate with ecosystem tools like Kubebuilder. For contributors and ecosystem developers, this replaces thousands of lines of handwritten validation code with a unified, maintainable framework.

This post covers why this migration was necessary, how the declarative validation framework works, and what new capabilities come with this GA release.

The Motivation: Escaping the "Handwritten" Technical Debt

For years, the validation of Kubernetes native APIs relied almost entirely on handwritten Go code. If a field needed to be bounded by a minimum value, or if two fields needed to be mutually exclusive, developers had to write explicit Go functions to enforce those constraints.

As the Kubernetes API surface expanded, this approach led to several systemic issues:

  1. Technical Debt: The project accumulated roughly 18,000 lines of boilerplate validation code. This code was difficult to maintain, error-prone, and required intense scrutiny during code reviews.
  2. Inconsistency: Without a centralized framework, validation rules were sometimes applied inconsistently across different resources.
  3. Opaque APIs: Handwritten validation logic was difficult to discover or analyze programmatically. This meant clients and tooling couldn't predictably know validation rules without consulting the source code or encountering errors at runtime.

The solution proposed by SIG API Machinery was Declarative Validation: using Interface Definition Language (IDL) tags (specifically +k8s: marker tags) directly within types.go files to define validation rules.

Enter validation-gen

At the core of the declarative validation feature is a new code generator called validation-gen. Just as Kubernetes uses generators for deep copies, conversions, and defaulting, validation-gen parses +k8s: tags and automatically generates the corresponding Go validation functions.

These generated functions are then registered seamlessly with the API scheme. The generator is designed as an extensible framework, allowing developers to plug in new "Validators" by describing the tags they parse and the Go logic they should produce.

A Comprehensive Suite of +k8s: Tags

The declarative validation framework introduces a comprehensive suite of marker tags that provide rich validation capabilities optimized for Go types. For a full list of supported tags, check out the official documentation. Tags such as +k8s:optional and +k8s:minimum, shown in the example below, are among the most common you will now see in the Kubernetes codebase.

Example Usage:

type ReplicationControllerSpec struct {
	// +k8s:optional
	// +k8s:minimum=0
	Replicas *int32 `json:"replicas,omitempty"`
}

By placing these tags directly above the field definitions, the constraints are self-documenting and immediately visible to anyone reading the type definitions.

Advanced Capabilities: "Ambient Ratcheting"

One of the most substantial outcomes of this work is that validation ratcheting is now a standard, ambient part of the API. In the past, if we needed to tighten validation, we had to first add handwritten ratcheting code, wait a release, and then tighten the validation to avoid breaking existing objects.

With declarative validation, this safety mechanism is built-in. If a user updates an existing object, the validation framework compares the incoming object with the oldObject. If a specific field's value is semantically equivalent to its prior state (i.e., the user didn't change it), the new validation rule is bypassed. This "ambient ratcheting" means we can loosen or tighten validation immediately and in the least disruptive way possible.
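
To make the mechanism concrete, here is a minimal hand-written sketch of a ratcheted check for the Replicas field from the earlier example (illustrative only; the real logic is generated by validation-gen and is more general):

package main

import "fmt"

// validateReplicas sketches ambient ratcheting for one field: the rule is
// enforced only when the update actually changes the field's value.
func validateReplicas(newVal, oldVal *int32) error {
	// Ratcheting: the value is semantically unchanged, so skip the
	// (possibly newly tightened) rule to avoid breaking existing objects.
	if oldVal != nil && newVal != nil && *oldVal == *newVal {
		return nil
	}
	// The +k8s:minimum=0 constraint, applied only to changed values.
	if newVal != nil && *newVal < 0 {
		return fmt.Errorf("replicas: must be greater than or equal to 0")
	}
	return nil
}

func main() {
	old, unchanged := int32(-1), int32(-1)
	fmt.Println(validateReplicas(&unchanged, &old)) // nil: ratcheted through
	fmt.Println(validateReplicas(&unchanged, nil))  // error: new object must comply
}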

Scaling API Reviews with kube-api-linter

Reaching GA required absolute confidence in the generated code, but our vision extends beyond just validation. Declarative validation is a key part of a comprehensive approach to making API review easier, more consistent, and highly scalable.

By moving validation rules out of opaque Go functions and into structured markers, we are empowering tools like kube-api-linter. This linter can now statically analyze API types and enforce API conventions automatically, significantly reducing the manual burden on SIG API Machinery reviewers and providing immediate feedback to contributors.

What's next?

With the release of Kubernetes v1.36, Declarative Validation graduates to General Availability (GA). As a stable feature, the associated DeclarativeValidation feature gate is now enabled by default. It has become the primary mechanism for adding new validation rules to Kubernetes native types.

Looking forward, the project is committed to adopting declarative validation even more extensively. This includes migrating the remaining legacy handwritten validation code for established APIs and requiring its use for all new APIs and new fields. This ongoing transition will continue to shrink the codebase's complexity while enhancing the consistency and reliability of the entire Kubernetes API surface.

Beyond the core migration, declarative validation also unlocks an exciting future for the broader ecosystem. Because validation rules are now defined as structured markers rather than opaque Go code, they can be parsed and reflected in the OpenAPI schemas published by the Kubernetes API server. This paves the way for tools like kubectl, client libraries, and IDEs to perform rich client-side validation before a request is ever sent to the cluster. The same declarative framework can also be consumed by ecosystem tools like Kubebuilder, enabling a more consistent developer experience for authors of Custom Resource Definitions (CRDs).

Getting involved

The migration to declarative validation is an ongoing effort. While the framework itself is GA, there is still work to be done migrating older APIs to the new declarative format.

If you are interested in contributing to the core of Kubernetes API Machinery, this is a fantastic place to start. Check out the validation-gen documentation, look for issues tagged with sig/api-machinery, and join the conversation in the #sig-api-machinery and #sig-api-machinery-dev-tools channels on Kubernetes Slack (for an invitation, visit https://slack.k8s.io/). You can also attend the SIG API Machinery meetings to get involved directly.

Acknowledgments

A huge thank you to everyone who helped bring this feature to GA, and to the many others across the Kubernetes community who contributed along the way.

Welcome to the declarative future of Kubernetes validation!

05 May 2026 6:35pm GMT

04 May 2026

feedKubernetes Blog

Kubernetes v1.36: Admission Policies That Can't Be Deleted

If you've ever tried to enforce a security policy across a fleet of Kubernetes clusters, you've probably run into a frustrating chicken-and-egg problem. Your admission policies are API objects, which means they don't exist until someone creates them, and they can be deleted by anyone with the right permissions. There's always a window during cluster bootstrap where your policies aren't active yet, and there's no way to prevent a privileged user from removing them.

Kubernetes v1.36 introduces an alpha feature that addresses this: manifest-based admission control. It lets you define admission webhooks and CEL-based policies as files on disk, loaded by the API server at startup, before it serves any requests.

The gap we're closing

Most Kubernetes policy enforcement today works through the API. You create a ValidatingAdmissionPolicy or a webhook configuration as an API object, and the admission controller picks it up. This works well in steady state, but it has some fundamental limitations.

During cluster bootstrap, there's a gap between when the API server starts serving requests and when your policies are created and active. If you're restoring from a backup or recovering from an etcd failure, that gap can be significant.

There's also a self-protection problem. Admission webhooks and policies can't intercept operations on their own configuration resources. Kubernetes skips invoking webhooks on types like ValidatingWebhookConfiguration to avoid circular dependencies. That means a sufficiently privileged user can delete your critical admission policies, and there's nothing in the admission chain to stop them.

We - Kubernetes SIG API Machinery - wanted a way to say "these policies are always on, full stop."

How it works

You add a staticManifestsDir field to the AdmissionConfiguration file that you already pass to the API server via --admission-control-config-file. Point it at a directory, drop your policy YAML files in there, and the API server loads them before it starts serving.

apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
- name: ValidatingAdmissionPolicy
  configuration:
    apiVersion: apiserver.config.k8s.io/v1
    kind: ValidatingAdmissionPolicyConfiguration
    staticManifestsDir: "/etc/kubernetes/admission/validating-policies/"

The manifest files are standard Kubernetes resource definitions. The only requirement is that all the objects that these manifests define must have names ending in .static.k8s.io. This reserved suffix prevents collisions with API-based configurations and makes it easy to tell where an admission decision came from when you're looking at metrics or audit logs.

Here's a complete example that denies privileged containers outside kube-system:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: "deny-privileged.static.k8s.io"
  annotations:
    kubernetes.io/description: "Deny launching privileged pods, anywhere this policy is applied"
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups: [""]
      apiVersions: ["v1"]
      operations: ["CREATE", "UPDATE"]
      resources: ["pods"]
  variables:
  - name: allContainers
    expression: >-
      object.spec.containers +
      (has(object.spec.initContainers) ? object.spec.initContainers : []) +
      (has(object.spec.ephemeralContainers) ? object.spec.ephemeralContainers : [])
  validations:
  - expression: >-
      !variables.allContainers.exists(c,
      has(c.securityContext) && has(c.securityContext.privileged) &&
      c.securityContext.privileged == true)
    message: "Privileged containers are not allowed"
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: "deny-privileged-binding.static.k8s.io"
  annotations:
    kubernetes.io/description: "Bind deny-privileged policy to all namespaces except kube-system"
spec:
  policyName: "deny-privileged.static.k8s.io"
  validationActions:
  - Deny
  matchResources:
    namespaceSelector:
      matchExpressions:
      - key: "kubernetes.io/metadata.name"
        operator: NotIn
        values: ["kube-system"]

Protecting what couldn't be protected before

The part we're most excited about is the ability to intercept operations on admission configuration resources themselves.

With API-based admission, webhooks and policies are never invoked on types like ValidatingAdmissionPolicy or ValidatingWebhookConfiguration. That restriction exists for good reason: if a webhook could reject changes to its own configuration, you could end up locked out with no way to fix it through the API.

Manifest-based policies don't have that problem. If a bad policy is blocking something it shouldn't, you fix the file on disk and the API server picks up the change. There's no circular dependency because the recovery path doesn't go through the API.

This means you can write a manifest-based policy that prevents deletion of your critical API-based admission policies. For platform teams managing shared clusters, this is a significant improvement. You can now guarantee that your baseline security policies can't be removed by a cluster admin, accidentally or otherwise.

Here's what that looks like in practice. This policy prevents any modification or deletion of admission resources that carry the platform.example.com/protected: "true" label:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: "protect-policies.static.k8s.io"
  annotations:
    kubernetes.io/description: "Prevent modification or deletion of protected admission resources"
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups: ["admissionregistration.k8s.io"]
      apiVersions: ["*"]
      operations: ["DELETE", "UPDATE"]
      resources:
      - "validatingadmissionpolicies"
      - "validatingadmissionpolicybindings"
      - "validatingwebhookconfigurations"
      - "mutatingwebhookconfigurations"
  validations:
  - expression: >-
      !has(oldObject.metadata.labels) ||
      !('platform.example.com/protected' in oldObject.metadata.labels) ||
      oldObject.metadata.labels['platform.example.com/protected'] != 'true'
    message: "Protected admission resources cannot be modified or deleted"
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: "protect-policies-binding.static.k8s.io"
  annotations:
    kubernetes.io/description: "Bind protect-policies policy to all admission resources"
spec:
  policyName: "protect-policies.static.k8s.io"
  validationActions:
  - Deny

With this in place, any API-based admission policy or webhook configuration labeled platform.example.com/protected: "true" is shielded from tampering. The protection itself lives on disk and can't be removed through the API.

A few things to know

Manifest-based configurations are intentionally self-contained. They can't reference API resources, which means no paramKind for policies, no Service references for admission webhooks (instead they are URL-only), and bindings may only reference policies in the same manifest set. These restrictions exist because the configurations need to work without any cluster state, including at startup before etcd is available.

If you run multiple API server instances, each one loads its own manifest files independently. There's no cross-server synchronization built in. This is the same model as other file-based API server configurations like encryption at rest. When this feature is enabled, Kubernetes exposes a configuration hash as a label on relevant metrics, so you can detect drift.

Files are watched for changes at runtime, so you don't need to restart the API server to update policies. If you update a manifest file, the API server validates the new configuration and swaps it in atomically. If validation fails, it keeps the previous good configuration and logs the error. This means you can roll out policy changes across your fleet using standard configuration management tools (Ansible, Puppet, or even mounted ConfigMaps) without any API server downtime.

The initial load at startup is stricter: if any manifest is invalid, the API server won't start. This is intentional. At startup, failing fast is safer than running without your expected policies.

Try it out

To try this in Kubernetes v1.36:

  1. Enable the ManifestBasedAdmissionControlConfig feature gate for each kube-apiserver.
  2. Create a directory with your static manifest files. If you need to mount that into the Pod where the API server runs, do that too. Read-only is fine.
  3. Configure staticManifestsDir in your AdmissionConfiguration with the directory path.
  4. Start the API server with --admission-control-config-file pointing to your AdmissionConfiguration file.

The full documentation is at Manifest-Based Admission Control, and you can follow KEP-5793 for ongoing progress.

We'd love to hear your feedback. Reach out on the #sig-api-machinery channel on Kubernetes Slack (for an invitation, visit https://slack.k8s.io/).

How to get involved

If you're interested in contributing to this feature or other SIG API Machinery projects, join us on #sig-api-machinery on Kubernetes Slack. You're also welcome to attend the SIG API Machinery meetings, held every other Wednesday.

04 May 2026 6:35pm GMT

01 May 2026

feedKubernetes Blog

Kubernetes v1.36: Pod-Level Resource Managers (Alpha)

Kubernetes v1.36 introduces Pod-Level Resource Managers as an alpha feature, bringing a more flexible and powerful resource management model to performance-sensitive workloads. This enhancement extends the kubelet's Topology, CPU, and Memory Managers to support pod-level resource specifications (.spec.resources), evolving them from a strictly per-container allocation model to a pod-centric one.

Why do we need pod-level resource managers?

When running performance-critical workloads such as machine learning (ML) training, high-frequency trading applications, or low-latency databases, you often need exclusive, NUMA-aligned resources for your primary application containers to ensure predictable performance.

However, modern Kubernetes pods rarely consist of just one container. They frequently include sidecar containers for logging, monitoring, service meshes, or data ingestion.

Before this feature, this created a trade-off: to get NUMA-aligned, exclusive resources for your main application, you had to allocate exclusive, integer-based CPU resources to every container in the pod, which could be wasteful for lightweight sidecars. If you didn't do this, you forfeited the pod's Guaranteed Quality of Service (QoS) class entirely, losing the performance benefits.

Introducing pod-level resource managers

Enabling pod-level resources support for the resource managers (via the PodLevelResourceManagers and PodLevelResources feature gates) allows the kubelet to create hybrid resource allocation models. This brings flexibility and efficiency to high-performance workloads without sacrificing NUMA alignment.

Real-world use cases

Here are a few practical scenarios demonstrating how this feature can be applied, depending on the configured Topology Manager scope:

1. Tightly-coupled database (Topology manager's pod scope)

Consider a latency-sensitive database pod that includes a main database container, a local metrics exporter, and a backup agent sidecar.

When configured with the pod Topology Manager scope, the kubelet performs a single NUMA alignment based on the entire pod's budget. The database container gets its exclusive CPU and memory slices from that NUMA node. The remaining resources from the pod's budget form a new pod shared pool. The metrics exporter and backup agent run in this pod shared pool. They share resources with each other, but they are strictly isolated from the database's exclusive slices and the rest of the node.

This allows you to safely co-locate auxiliary containers on the same NUMA node as your primary workload without wasting dedicated cores on them.

apiVersion: v1
kind: Pod
metadata:
  name: tightly-coupled-database
spec:
  # Pod-level resources establish the overall budget and NUMA alignment size.
  resources:
    requests:
      cpu: "8"
      memory: "16Gi"
    limits:
      cpu: "8"
      memory: "16Gi"
  initContainers:
  - name: metrics-exporter
    image: metrics-exporter:v1
    restartPolicy: Always
  - name: backup-agent
    image: backup-agent:v1
    restartPolicy: Always
  containers:
  - name: database
    image: database:v1
    # This Guaranteed container gets an exclusive 6 CPU slice from the pod's budget.
    # The remaining 2 CPUs and 4Gi memory form the pod shared pool for the sidecars.
    resources:
      requests:
        cpu: "6"
        memory: "12Gi"
      limits:
        cpu: "6"
        memory: "12Gi"

2. ML workload with infrastructure sidecars (Topology manager's container scope)

Imagine a pod running a GPU-accelerated ML training workload alongside a generic service mesh sidecar.

Under the container Topology Manager scope, the kubelet evaluates each container individually. You can grant the ML container exclusive, NUMA-aligned CPUs and Memory for maximum performance. Meanwhile, the service mesh sidecar doesn't need to be NUMA-aligned; it can run in the general node-wide shared pool. The collective resource consumption is still safely bounded by the overall pod limits, but you only allocate NUMA-aligned, exclusive resources to the specific containers that actually require them.

apiVersion: v1
kind: Pod
metadata:
  name: ml-workload
spec:
  # Pod-level resources establish the overall budget constraint.
  resources:
    requests:
      cpu: "4"
      memory: "8Gi"
    limits:
      cpu: "4"
      memory: "8Gi"
  initContainers:
  - name: service-mesh-sidecar
    image: service-mesh:v1
    restartPolicy: Always
  containers:
  - name: ml-training
    image: ml-training:v1
    # Under the 'container' scope, this Guaranteed container receives exclusive,
    # NUMA-aligned resources, while the sidecar runs in the node's shared pool.
    resources:
      requests:
        cpu: "3"
        memory: "6Gi"
      limits:
        cpu: "3"
        memory: "6Gi"

CPU quotas (CFS) and isolation

When running these mixed workloads within a pod, isolation is enforced differently depending on whether a container was allocated exclusive resources or runs in a shared pool.

How to enable Pod-Level Resource Managers

Using this feature requires Kubernetes v1.36 or newer. To enable it, you must configure the kubelet with the appropriate feature gates and policies (a combined configuration example follows this list):

  1. Enable the PodLevelResources and PodLevelResourceManagers feature gates.
  2. Configure the Topology Manager with a policy other than none (i.e., best-effort, restricted, or single-numa-node).
  3. Set the Topology Manager scope to either pod or container using the topologyManagerScope field in the KubeletConfiguration.
  4. Configure the CPU Manager with the static policy.
  5. Configure the Memory Manager with the Static policy.
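
Putting those steps together, the relevant KubeletConfiguration fragment could look roughly like this (a sketch; note that the Memory Manager's Static policy additionally requires reservedMemory and related settings, omitted here):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  PodLevelResources: true
  PodLevelResourceManagers: true
topologyManagerPolicy: single-numa-node # or best-effort / restricted
topologyManagerScope: pod # or container
cpuManagerPolicy: static
memoryManagerPolicy: Static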

Observability

To help cluster administrators monitor and debug these new allocation models, we have introduced several new kubelet metrics that are exposed when the feature gate is enabled.

Current limitations and caveats

While this feature opens up new possibilities, there are a few things to keep in mind during its alpha phase. Be sure to review the Limitations and caveats in the official documentation for full details on compatibility, requirements, and downgrade instructions.

Getting started and providing feedback

For a deep dive into the technical details and configuration of this feature, check out the official concept documentation.

To learn more about the overall pod-level resources feature and how to assign resources to pods, see the Kubernetes documentation on pod-level resources.

As this feature moves through Alpha, your feedback is invaluable. Please report any issues or share your experiences via the standard Kubernetes communication channels.

01 May 2026 6:35pm GMT

30 Apr 2026

feedKubernetes Blog

Kubernetes v1.36: In-Place Vertical Scaling for Pod-Level Resources Graduates to Beta

Following the graduation of Pod-Level Resources to Beta in v1.34 and the General Availability (GA) of In-Place Pod Vertical Scaling in v1.35, the Kubernetes community is thrilled to announce that In-Place Pod-Level Resources Vertical Scaling has graduated to Beta in v1.36!

This feature is now enabled by default via the InPlacePodLevelResourcesVerticalScaling feature gate. It allows users to update the aggregate Pod resource budget (.spec.resources) for a running Pod, often without requiring a container restart.

Why Pod-level in-place resize?

The Pod-level resource model simplified management for complex Pods (such as those with sidecars) by allowing containers to share a collective pool of resources. In v1.36, you can now adjust this aggregate boundary on-the-fly.

This is particularly useful for Pods where containers do not have individual limits defined. These containers automatically scale their effective boundaries to fit the newly resized Pod-level dimensions, allowing you to expand the shared pool during peak demand without manual per-container recalculations.

Resource inheritance and the resizePolicy

When a Pod-level resize is initiated, the Kubelet treats the change as a resize event for every container that inherits its limits from the Pod-level budget. To determine whether a restart is required, the Kubelet consults the resizePolicy defined within individual containers: a container whose policy for the resized resource is NotRequired is updated in place, while one that specifies RestartContainer is restarted with the new values.

Note: Currently, resizePolicy is not supported at the Pod level. The Kubelet always defers to individual container settings to decide if an update can be applied in-place or requires a restart.

Example: Scaling a shared resource pool

In this scenario, a Pod is defined with a 2 CPU pod-level limit. Because the individual containers do not have their own limits defined, they share this total pool.

1. Initial Pod specification

apiVersion: v1
kind: Pod
metadata:
  name: shared-pool-app
spec:
  resources: # Pod-level limits
    limits:
      cpu: "2"
      memory: "4Gi"
  containers:
  - name: main-app
    image: my-app:v1
    resizePolicy: [{resourceName: "cpu", restartPolicy: "NotRequired"}]
  - name: sidecar
    image: logger:v1
    resizePolicy: [{resourceName: "cpu", restartPolicy: "NotRequired"}]

2. The resize operation

To double the CPU capacity to 4 CPUs, apply a patch using the resize subresource:

kubectl patch pod shared-pool-app --subresource resize --patch \
 '{"spec":{"resources":{"limits":{"cpu":"4"}}}}'

Node-Level reality: feasibility and safety

Applying a resize patch is only the first step. The Kubelet performs several checks and follows a specific sequence to ensure node stability:

1. The feasibility check

Before admitting a resize, the Kubelet verifies if the new aggregate request fits within the Node's allocatable capacity. If the Node is overcommitted, the resize is not ignored; instead, the PodResizePending condition will reflect a Deferred or Infeasible status, providing immediate feedback on why the "envelope" hasn't grown.
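
To see why a pending resize has not been applied, you can inspect that condition directly. For example, reusing the Pod from the earlier example:

kubectl get pod shared-pool-app \
  -o jsonpath='{.status.conditions[?(@.type=="PodResizePending")].message}'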

2. Update sequencing

To prevent resource "overshoot", the Kubelet coordinates the cgroup updates in a specific order: when the pod's budget grows, the pod-level cgroup is enlarged before the container cgroups; when it shrinks, the container cgroups are reduced first.

Observability: tracking resize status

With the move to Beta, Kubernetes uses Pod Conditions to track the lifecycle of a resize:

status:
  allocatedResources:
    cpu: "4"
  resources:
    limits:
      cpu: "4"
  conditions:
  - type: PodResizeInProgress
    status: "True"

Constraints and requirements

What's next?

As we move toward General Availability (GA), the community is focusing on Vertical Pod Autoscaler (VPA) Integration, enabling VPA to issue Pod-level resource recommendations and trigger in-place actuation automatically.

Getting started and providing feedback

We encourage you to test this feature and provide feedback via the standard Kubernetes communication channels.

30 Apr 2026 6:35pm GMT

29 Apr 2026

feedKubernetes Blog

Kubernetes v1.36: Tiered Memory Protection with Memory QoS

On behalf of SIG Node, we are pleased to announce updates to the Memory QoS feature (alpha) in Kubernetes v1.36. Memory QoS uses the cgroup v2 memory controller to give the kernel better guidance on how to treat container memory. It was first introduced in v1.22 and updated in v1.27. In Kubernetes v1.36, we're introducing opt-in memory reservation, tiered protection by QoS class, observability metrics, and a kernel-version warning for memory.high.

What's new in v1.36

Opt-in memory reservation with memoryReservationPolicy

v1.36 separates throttling from reservation. Enabling the feature gate turns on memory.high throttling (the kubelet sets memory.high based on memoryThrottlingFactor, default 0.9), but memory reservation is now controlled by a separate kubelet configuration field, memoryReservationPolicy. With TieredReservation, protection is tiered by QoS class:

Guaranteed Pods get hard protection via memory.min. For example, a Guaranteed Pod requesting 512 MiB of memory results in:

$ cat /sys/fs/cgroup/kubepods.slice/kubepods-pod6a4f2e3b_1c9d_4a5e_8f7b_2d3e4f5a6b7c.slice/memory.min
536870912

The kernel will not reclaim this memory under any circumstances. If it cannot honor the guarantee, it invokes the OOM killer on other processes to free pages.

Burstable Pods get soft protection via memory.low. For the same 512 MiB request on a Burstable Pod:

$ cat /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod8b3c7d2e_4f5a_6b7c_9d1e_3f4a5b6c7d8e.slice/memory.low
536870912

The kernel avoids reclaiming this memory under normal pressure, but may reclaim it if the alternative is a system-wide OOM.

BestEffort Pods get neither memory.min nor memory.low. Their memory remains fully reclaimable.

Comparison with v1.27 behavior

In earlier versions, enabling the MemoryQoS feature gate immediately set memory.min for every container with a memory request. memory.min is a hard reservation that the kernel will not reclaim, regardless of memory pressure.

Consider a node with 8 GiB of RAM where Burstable Pod requests total 7 GiB. In earlier versions, that 7 GiB would be locked as memory.min, leaving little headroom for the kernel, system daemons, or BestEffort workloads and increasing the risk of OOM kills.

With v1.36 tiered reservation, those Burstable requests map to memory.low instead of memory.min. Under normal pressure, the kernel still protects that memory, but under extreme pressure it can reclaim part of it to avoid system-wide OOM. Only Guaranteed Pods use memory.min, which keeps hard reservation lower.

With memoryReservationPolicy in v1.36, you can enable throttling first, observe workload behavior, and opt into reservation when your node has enough headroom.

Observability metrics

Two alpha-stability metrics are exposed on the kubelet /metrics endpoint:

Metric Description
kubelet_memory_qos_node_memory_min_bytes Total memory.min across Guaranteed Pods
kubelet_memory_qos_node_memory_low_bytes Total memory.low across Burstable Pods

These are useful for capacity planning. If kubelet_memory_qos_node_memory_min_bytes is creeping toward your node's physical memory, you know hard reservation is getting tight.

$ curl -sk https://localhost:10250/metrics | grep memory_qos
# HELP kubelet_memory_qos_node_memory_min_bytes [ALPHA] Total memory.min in bytes for Guaranteed pods
kubelet_memory_qos_node_memory_min_bytes 5.36870912e+08
# HELP kubelet_memory_qos_node_memory_low_bytes [ALPHA] Total memory.low in bytes for Burstable pods
kubelet_memory_qos_node_memory_low_bytes 2.147483648e+09

Kernel version check

On kernels older than 5.9, memory.high throttling can trigger the kernel livelock issue. The bug was fixed in kernel 5.9. In v1.36, when the feature gate is enabled, the kubelet checks the kernel version at startup and logs a warning if it is below 5.9. The feature continues to work - this is informational, not a hard block.

How Kubernetes maps Memory QoS to cgroup v2

Memory QoS uses four cgroup v2 memory controller interfaces: memory.min, memory.low, memory.high, and memory.max.

The following table shows how Kubernetes container resources map to cgroup v2 interfaces when memoryReservationPolicy: TieredReservation is configured. With the default memoryReservationPolicy: None, no memory.min or memory.low values are set.

QoS Class | memory.min | memory.low | memory.high | memory.max
Guaranteed | Set to requests.memory (hard protection) | Not set | Not set (requests == limits, so throttling is not useful) | Set to limits.memory
Burstable | Not set | Set to requests.memory (soft protection) | Calculated from the throttling-factor formula | Set to limits.memory (if specified)
BestEffort | Not set | Not set | Calculated based on node allocatable memory | Not set
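
For intuition about the "calculated" memory.high entries, here is a small worked example. It assumes the formula used by earlier iterations of Memory QoS, memory.high = requests + memoryThrottlingFactor * (limits - requests), rounded down to the page size; treat the exact rounding as an assumption:

package main

import "fmt"

func main() {
	request := uint64(1 << 30) // 1 GiB memory request
	limit := uint64(2 << 30)   // 2 GiB memory limit
	factor := 0.9              // memoryThrottlingFactor (default)

	// Throttling starts 90% of the way from the request to the limit.
	high := uint64(float64(request) + factor*float64(limit-request))
	high = high / 4096 * 4096 // align down to the 4 KiB page size

	fmt.Printf("memory.high = %d bytes (~%.2f GiB)\n", high, float64(high)/float64(1<<30))
}

For this container, throttling would begin at roughly 1.9 GiB, leaving headroom before the 2 GiB memory.max limit is reached.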

Cgroup hierarchy

cgroup v2 requires that a parent cgroup's memory protection is at least as large as the sum of its children's. The kubelet maintains this by setting memory.min on the kubepods root cgroup to the sum of all Guaranteed and Burstable Pod memory requests, and memory.low on the Burstable QoS cgroup to the sum of all Burstable Pod memory requests. This way the kernel can enforce the per-container and per-pod protection values correctly.

The kubelet manages pod-level and QoS-class cgroups directly using the runc libcontainer library, while container-level cgroups are managed by the container runtime (containerd or CRI-O).

How do I use it?

Prerequisites

  1. Kubernetes v1.36 or later
  2. Linux with cgroup v2. Kernel 5.9 or higher is recommended - earlier kernels work but may experience the livelock issue. You can verify cgroup v2 is active by running mount | grep cgroup2.
  3. A container runtime that supports cgroup v2 (containerd 1.6+, CRI-O 1.22+)

Configuration

To enable Memory QoS with tiered protection:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
 MemoryQoS: true
memoryReservationPolicy: TieredReservation # Options: None (default), TieredReservation
memoryThrottlingFactor: 0.9 # Optional: default is 0.9

If you want memory.high throttling without memory protection, omit memoryReservationPolicy or set it to None:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
 MemoryQoS: true
memoryReservationPolicy: None  # This is the default

How can I learn more?

Getting involved

This feature is driven by SIG Node. If you are interested in contributing or have feedback, you can find us on Slack (#sig-node), the mailing list, or at the regular SIG Node meetings. Please file bugs at kubernetes/kubernetes and enhancement proposals at kubernetes/enhancements.

29 Apr 2026 6:35pm GMT

28 Apr 2026

feedKubernetes Blog

Kubernetes v1.36: Staleness Mitigation and Observability for Controllers

Staleness is a problem that affects many Kubernetes controllers, and it can influence controller behavior in subtle ways. It usually goes unnoticed until it is too late, when a controller in production has already taken an incorrect action because of an assumption the controller author made about cache freshness. Issues caused by staleness include controllers taking incorrect actions, controllers not acting when they should, and controllers taking too long to act. I am excited to announce that Kubernetes v1.36 includes new features that help mitigate staleness in controllers and provide better observability into controller behavior.

What is staleness?

Staleness in controllers comes from an outdated view of the world inside the controller's cache. To provide a fast user experience, controllers typically maintain a local cache of cluster state, populated by watching the Kubernetes API server for changes to the objects the controller cares about. When the controller needs to take action, it reads from this cache rather than querying the API server directly; acting on the observed state to drive the cluster toward the desired state is the process known as reconciliation.

However, there are cases where the controller's cache can be outdated. For example, when a controller restarts, it must rebuild its cache from the API server, and until that finishes its view of the cluster is incomplete. Similarly, if the API server is unavailable, the cache stops receiving updates entirely. These are just a few of the ways a controller can end up acting on stale data.

Improvements in 1.36

Kubernetes v1.36 includes improvements in both client-go as well as implementations of highly contended controllers in kube-controller-manager, using those client-go improvements.

client-go improvements

In client-go, the project added atomic FIFO processing (feature gate AtomicFIFO) on top of the existing FIFO queue implementation. The new approach allows the queue to atomically handle operations that are received in batches, such as the initial set of objects from the list operation an informer uses to populate its cache. This keeps the queue in a consistent state even when events arrive out of order. Previously, events were added to the queue strictly in arrival order, which could leave the cache in a state that did not accurately reflect the cluster.

Building on this, clients using client-go can now introspect the cache to determine the latest resource version it has seen, using the newly added LastStoreSyncResourceVersion() function on the Store interface. This function is the basis for the staleness mitigation features in kube-controller-manager.

kube-controller-manager improvements

In kube-controller-manager, the v1.36 release adds this new capability to four controllers:

  1. DaemonSet controller
  2. StatefulSet controller
  3. ReplicaSet controller
  4. Job controller

These controllers all act on pods, which in most cases are under the highest amount of contention in a cluster. The changes are on by default for these controllers and can be disabled per controller by setting the corresponding StaleControllerConsistency<API type> feature gate to false. For example, to disable the feature for the DaemonSet controller, set StaleControllerConsistencyDaemonSet to false.

When the relevant feature gate is enabled, the controller will first check the latest resource version of the cache before taking action. If the latest resource version of the cache is lower than what the controller has written to the API server for the object it is trying to reconcile, the controller will not take action. This is because the controller's cache is outdated, and it does not have the latest information about the state of the cluster.

Use for informer authors

Informer authors using client-go can also take immediate advantage of these improvements; see the ReplicaSet controller's adoption PR for an example of checking whether the informer's cache is stale before taking action. The client-go library provides a ConsistencyStore data structure that queries the store and compares the latest resource version of the cache with the resource version the controller last wrote for the object.

The ReplicaSet controller tracks both the ReplicaSet's resource version and the resource versions of the pods that the ReplicaSet manages: for a specific ReplicaSet, it records the latest written resource version of the pods it owns as well as any writes to the ReplicaSet itself. If the cache's latest resource version is lower than the most recent version the controller wrote for the object being reconciled, the sync is skipped until the cache catches up.

An informer author can use the ConsistencyStore to track the latest resource version of the objects that the informer cares about. It provides 3 main functions:

type ConsistencyStore interface {
	// WroteAt records that the given object was written at the given resource version.
	WroteAt(owningObj runtime.Object, uid types.UID, groupResource schema.GroupResource, resourceVersion string)

	// EnsureReady returns true if the cache is up to date for the given object.
	// It is used prior to reconciliation to decide whether to reconcile or not.
	EnsureReady(namespacedName types.NamespacedName) bool

	// Clear removes the given object from the consistency store.
	// It is used when an object is deleted.
	Clear(namespacedName types.NamespacedName, uid types.UID)
}
  1. WroteAt: This function is called by the controller when it writes to the API server for an object. It is used to record the latest resource version of the object that the controller has written to the API server. The owningObj is the object that the controller is reconciling, and the uid is the UID of that object. The resource version and GroupResource are the resource version and GroupResource of the object that the controller has written to the API server. The object is not explicitly tracked, since the controller only cares about waiting to catch up to the latest resource version of the written object.
  2. EnsureReady: This function is called by the controller to ensure that the cache is up to date for the object. It is used prior to reconciliation to decide whether to reconcile or not. It returns true if the cache is up to date for the object, and false otherwise. It will use the information provided by WroteAt to determine if the cache is up to date.
  3. Clear: This function is called by the controller when an object is deleted. It is used to remove the object from the consistency store. This is mostly used for cleanup when an object is deleted to prevent the consistency store from growing indefinitely.

The UID is used to distinguish between different objects that have the same name, such as when an object is deleted and then recreated. It is not needed for EnsureReady because the consistency store is only concerned with catching up to the latest resource version of the object, not the specific object. It is primarily used to ensure that the controller doesn't delete the entry for an object when it is recreated with a new UID.

With these 3 functions, an informer author can implement staleness mitigation in their controller.
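
For example, a hypothetical controller reconciling ConfigMaps might wire the three calls together like this (the reconciler type, its fields, and the requeue handling are illustrative, not part of client-go):

package staleness

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	listersv1 "k8s.io/client-go/listers/core/v1"
)

// ConsistencyStore abbreviates the interface shown above.
type ConsistencyStore interface {
	WroteAt(owningObj runtime.Object, uid types.UID, groupResource schema.GroupResource, resourceVersion string)
	EnsureReady(namespacedName types.NamespacedName) bool
	Clear(namespacedName types.NamespacedName, uid types.UID)
}

type reconciler struct {
	consistency ConsistencyStore
	lister      listersv1.ConfigMapLister
	client      kubernetes.Interface
	gr          schema.GroupResource
}

// reconcile returns requeue=true when the sync should be retried later
// because the cache has not yet observed this controller's latest write.
func (r *reconciler) reconcile(ctx context.Context, key types.NamespacedName) (requeue bool, err error) {
	if !r.consistency.EnsureReady(key) {
		return true, nil // cache is stale relative to our own writes
	}

	cm, err := r.lister.ConfigMaps(key.Namespace).Get(key.Name)
	if err != nil {
		return false, err
	}
	cm = cm.DeepCopy() // never mutate cache-owned objects

	// ... make the desired changes to cm here ...

	updated, err := r.client.CoreV1().ConfigMaps(key.Namespace).Update(ctx, cm, metav1.UpdateOptions{})
	if err != nil {
		return false, err
	}
	// Record the write so EnsureReady gates future syncs until the
	// informer cache catches up to this resource version.
	r.consistency.WroteAt(updated, updated.UID, r.gr, updated.ResourceVersion)
	return false, nil
}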

Observability

In addition to the staleness mitigation features, the Kubernetes project has also added related instrumentation to kube-controller-manager in 1.36. These metrics are also enabled by default, and are controlled using the same set of feature gates.

Metrics

The following alpha metrics have been added to kube-controller-manager in 1.36:

stale_sync_skips_total: The number of times the controller has skipped a sync due to stale cache. This metric is exposed for each controller that uses the staleness mitigation feature with the subsystem of the controller.

This metric is exposed by the kube-controller-manager metrics endpoint, and can be used to monitor the health of the controller.

Along with this metric, client-go now emits a metric exposing the latest resource version of every shared informer, labeled with the subsystem of the informer. This lets you see how far along each informer's cache is and determine whether a controller's cache is stale, which is especially useful when compared against the resource version reported by the API server.

This metric is named store_resource_version and has the Group, Version, and Resource as labels.

What's next?

Kubernetes SIG API Machinery is excited to continue working on this feature and hopes to bring it to more controllers in the future. We are also interested in hearing your feedback. Please let us know what you think in the comments below or by opening an issue on the Kubernetes GitHub repository.

We are also working with controller-runtime to enable these semantics for all controllers built on it, allowing them to gain read-your-own-writes behavior without implementing the logic themselves.

28 Apr 2026 6:35pm GMT

27 Apr 2026

feedKubernetes Blog

Kubernetes v1.36: Mutable Pod Resources for Suspended Jobs (beta)

Kubernetes v1.36 promotes the ability to modify container resource requests and limits in the pod template of a suspended Job to beta. First introduced as alpha in v1.35, this feature allows queue controllers and cluster administrators to adjust CPU, memory, GPU, and extended resource specifications on a Job while it is suspended, before it starts or resumes running.

Why mutable pod resources for suspended Jobs?

Batch and machine learning workloads often have resource requirements that are not precisely known at Job creation time. The optimal resource allocation depends on current cluster capacity, queue priorities, and the availability of specialized hardware like GPUs.

Before this feature, resource requirements in a Job's pod template were immutable once set. If a queue controller like Kueue determined that a suspended Job should run with different resources, the only option was to delete and recreate the Job, losing any associated metadata, status, or history. This feature also provides a way to let a specific Job instance for a CronJob progress slowly with reduced resources, rather than outright failing to run if the cluster is heavily loaded.

Consider a machine learning training Job initially requesting 4 GPUs:

apiVersion: batch/v1
kind: Job
metadata:
  name: training-job-example-abcd123
  labels:
    app.kubernetes.io/name: trainer
spec:
  suspend: true
  template:
    metadata:
      annotations:
        kubernetes.io/description: "ML training, ID abcd123"
    spec:
      containers:
      - name: trainer
        image: example-registry.example.com/training:2026-04-23T150405.678
        resources:
          requests:
            cpu: "8"
            memory: "32Gi"
            example-hardware-vendor.com/gpu: "4"
          limits:
            cpu: "8"
            memory: "32Gi"
            example-hardware-vendor.com/gpu: "4"
      restartPolicy: Never

A queue controller managing cluster resources might determine that only 2 GPUs are available. With this feature, the controller can update the Job's resource requests before resuming it:

apiVersion: batch/v1
kind: Job
metadata:
  name: training-job-example-abcd123
  labels:
    app.kubernetes.io/name: trainer
spec:
  suspend: true
  template:
    metadata:
      annotations:
        kubernetes.io/description: "ML training, ID abcd123"
    spec:
      containers:
      - name: trainer
        image: example-registry.example.com/training:2026-04-23T150405.678
        resources:
          requests:
            cpu: "4"
            memory: "16Gi"
            example-hardware-vendor.com/gpu: "2"
          limits:
            cpu: "4"
            memory: "16Gi"
            example-hardware-vendor.com/gpu: "2"
      restartPolicy: Never

Once the resources are updated, the controller resumes the Job by setting spec.suspend to false, and the new Pods are created with the adjusted resource specifications.

How it works

The Kubernetes API server relaxes the immutability constraint on pod template resource fields specifically for suspended Jobs. No new API types have been introduced; the existing Job and pod template structures accommodate the change through relaxed validation.

The mutable fields are the container resource requests and limits in the Job's pod template (spec.template.spec.containers[*].resources).

Resource updates are permitted when the following conditions are met:

  1. The Job has spec.suspend set to true.
  2. For a Job that was previously running and then suspended, all active Pods must have terminated (status.active equals 0) before resource mutations are accepted.

Standard resource validation still applies. For example, resource limits must be greater than or equal to requests, and extended resources must be specified as whole numbers where required.

What's new in beta

With the promotion to beta in Kubernetes v1.36, the MutablePodResourcesForSuspendedJobs feature gate is enabled by default. This means clusters running v1.36 can use this feature without any additional configuration on the API server.

Try it out

If your cluster is running Kubernetes v1.36 or later, this feature is available by default. For v1.35 clusters, enable the MutablePodResourcesForSuspendedJobs feature gate on the kube-apiserver.

You can test it by creating a suspended Job, updating its container resources using kubectl edit or a controller, and then resuming the Job:

# Create a suspended Job
kubectl apply -f my-job.yaml --server-side

# Edit the resource requests
kubectl edit job training-job-example-abcd123

# Resume the Job
kubectl patch job training-job-example-abcd123 -p '{"spec":{"suspend":false}}'

Considerations

Running Jobs that are suspended

If you suspend a Job that was already running, you must wait for all of that Job's active Pods to terminate before modifying resources. The API server rejects resource mutations while status.active is greater than zero. This prevents inconsistency between running Pods and the updated pod template.

Pod replacement policy

When using this feature with Jobs that may have failed Pods, consider setting podReplacementPolicy: Failed. This ensures that replacement Pods are only created after the previous Pods have fully terminated, preventing resource contention from overlapping Pods.
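
For reference, this is roughly where that field sits on the Job (a fragment only, reusing the example Job from above):

apiVersion: batch/v1
kind: Job
metadata:
  name: training-job-example-abcd123
spec:
  suspend: true
  podReplacementPolicy: Failed # create replacements only after full termination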

ResourceClaims

Dynamic Resource Allocation (DRA) resourceClaimTemplates remain immutable. If your workload uses DRA, you must recreate the claim templates separately to match any resource changes.

Getting involved

This feature was developed by SIG Apps with input from WG Batch. Both groups welcome feedback as the feature progresses toward stable.

You can reach out through the usual SIG Apps and WG Batch communication channels.

27 Apr 2026 6:35pm GMT

24 Apr 2026

feedKubernetes Blog

Kubernetes v1.36: Fine-Grained Kubelet API Authorization Graduates to GA

On behalf of Kubernetes SIG Auth and SIG Node, we are pleased to announce the graduation of fine-grained kubelet API authorization to General Availability (GA) in Kubernetes v1.36!

The KubeletFineGrainedAuthz feature gate was introduced as an opt-in alpha feature in Kubernetes v1.32, then graduated to beta (enabled by default) in v1.33. Now, the feature is generally available and the feature gate is locked to enabled. This feature enables more precise, least-privilege access control over the kubelet's HTTPS API, replacing the need to grant the overly broad nodes/proxy permission for common monitoring and observability use cases.

Motivation: the nodes/proxy problem

The kubelet exposes an HTTPS endpoint with several APIs that give access to data of varying sensitivity, including pod listings, node metrics, container logs, and, critically, the ability to execute commands inside running containers.

Prior to this feature, kubelet authorization used a coarse-grained model. When webhook authorization was enabled, almost all kubelet API paths were mapped to a single nodes/proxy subresource. This meant that any workload needing to read metrics or health status from the kubelet required nodes/proxy permission, the same permission that also grants the ability to execute arbitrary commands in any container running on the node.

What's wrong with that?

Granting nodes/proxy to monitoring agents, log collectors, or health-checking tools violates the principle of least privilege. If any of those workloads were compromised, an attacker would gain the ability to run commands in every container on the node. The nodes/proxy permission is effectively a node-level superuser capability, and granting it broadly dramatically increases the blast radius of a security incident.

This problem has been well understood in the community for years (see kubernetes/kubernetes#83465), and was the driving motivation behind this enhancement (KEP-2862).

The nodes/proxy GET WebSocket RCE risk

The situation is more severe than it might appear at first glance. Security researchers demonstrated in early 2026 that nodes/proxy GET alone, which is the minimal read-only permission routinely granted to monitoring tools, can be abused to execute commands in any pod on reachable nodes.

The root cause is a mismatch between how WebSocket connections work and how the kubelet maps HTTP methods to RBAC verbs. The WebSocket protocol (RFC 6455) requires an HTTP GET request for the initial connection handshake. The kubelet maps this GET to the RBAC get verb and authorizes the request without performing a secondary check to confirm that CREATE permission is also present for the write operation that follows. Using a WebSocket client like websocat, an attacker can reach the kubelet's /exec endpoint directly on port 10250 and execute arbitrary commands:

websocat --insecure \
 --header "Authorization: Bearer $TOKEN" \
 --protocol v4.channel.k8s.io \
 "wss://$NODE_IP:10250/exec/default/nginx/nginx?output=1&error=1&command=id"

uid=0(root) gid=0(root) groups=0(root)

Fine-grained kubelet authorization: how it works

With KubeletFineGrainedAuthz, the kubelet now performs an additional, more specific authorization check before falling back to the nodes/proxy subresource. Several commonly used kubelet API paths are mapped to their own dedicated subresources:

kubelet API Resource Subresource
/stats/* nodes stats
/metrics/* nodes metrics
/logs/* nodes log
/pods nodes pods, proxy
/runningPods/ nodes pods, proxy
/healthz nodes healthz, proxy
/configz nodes configz, proxy
/spec/* nodes spec
/checkpoint/* nodes checkpoint
all others nodes proxy

For the endpoints that now have fine-grained subresources (/pods, /runningPods/, /healthz, /configz), the kubelet first sends a SubjectAccessReview for the specific subresource. If that check succeeds, the request is authorized. If it fails, the kubelet retries with the coarse-grained nodes/proxy subresource for backward compatibility.

This dual-check approach ensures a smooth migration path. Existing workloads with nodes/proxy permissions continue to work, while new deployments can adopt least-privilege access from day one.
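
As an illustration, when a client issues a GET against /pods, the first (fine-grained) check the kubelet sends corresponds roughly to a SubjectAccessReview like this (the user and node name are hypothetical):

apiVersion: authorization.k8s.io/v1
kind: SubjectAccessReview
spec:
  user: system:serviceaccount:monitoring:metrics-agent
  resourceAttributes:
    group: ""
    verb: get
    resource: nodes
    subresource: pods
    name: node-1

Only if this check is denied does the kubelet retry the authorization with the proxy subresource.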

What this means in practice

Consider a Prometheus node exporter or a monitoring DaemonSet that needs to scrape /metrics from the kubelet. Previously, you would need an RBAC ClusterRole like this:

# Old approach: overly broad
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: monitoring-agent
rules:
- apiGroups: [""]
  resources: ["nodes/proxy"]
  verbs: ["get"]

This grants the monitoring agent far more access than it needs. With fine-grained authorization, you can now scope the permissions precisely:

# New approach: least privilege
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: monitoring-agent
rules:
- apiGroups: [""]
  resources: ["nodes/metrics", "nodes/stats"]
  verbs: ["get"]

The monitoring agent can now read metrics and stats from the kubelet without ever being able to execute commands in containers.

Updated system:kubelet-api-admin ClusterRole

When RBAC authorization is enabled, the built-in system:kubelet-api-admin ClusterRole is automatically updated to include permissions for all the new fine-grained subresources. This ensures that cluster administrators who already use this role, including the API server's kubelet client, continue to have full access without any manual configuration changes.

The role now includes permissions for the fine-grained subresources listed in the table above, including nodes/pods, nodes/healthz, and nodes/configz.

Upgrade considerations

Because the kubelet performs a dual authorization check (fine-grained first, then falling back to nodes/proxy), upgrading to v1.36 should be seamless for most clusters.

Verifying the feature is enabled

You can confirm that the feature is active on a given node by checking the kubelet metrics endpoint. Since the metrics endpoint on port 10250 requires authorization, you'll first need to create appropriate RBAC bindings for the pod or ServiceAccount making the request.

Step 1: Create a ServiceAccount and ClusterRole

apiVersion: v1
kind: ServiceAccount
metadata:
  name: kubelet-metrics-checker
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kubelet-metrics-reader
rules:
- apiGroups: [""]
  resources: ["nodes/metrics"]
  verbs: ["get"]

Step 2: Bind the ClusterRole to the ServiceAccount

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kubelet-metrics-checker
subjects:
- kind: ServiceAccount
  name: kubelet-metrics-checker
  namespace: default
roleRef:
  kind: ClusterRole
  name: kubelet-metrics-reader
  apiGroup: rbac.authorization.k8s.io

Apply the manifests:

kubectl apply -f serviceaccount.yaml
kubectl apply -f clusterrole.yaml
kubectl apply -f clusterrolebinding.yaml

Step 3: Run a pod with the ServiceAccount and check the feature flag

kubectl run kubelet-check \
  --image=curlimages/curl \
  --overrides='{"spec":{"serviceAccountName":"kubelet-metrics-checker"}}' \
  --restart=Never \
  --rm -it \
  -- sh

Then, from within the pod, read the ServiceAccount token and query the metrics endpoint (substituting a node's IP for $NODE_IP; see the note below):

# Get the token
TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)

# Query the kubelet metrics and filter for the feature gate
curl -sk \
 --header "Authorization: Bearer $TOKEN" \
 https://$NODE_IP:10250/metrics \
 | grep kubernetes_feature_enabled \
 | grep KubeletFineGrainedAuthz

If the feature is enabled, you should see output like:

kubernetes_feature_enabled{name="KubeletFineGrainedAuthz",stage="GA"} 1

Note: Replace $NODE_IP with the IP address of the node you want to check. You can retrieve node IPs with kubectl get nodes -o wide.

The journey from alpha to GA

Release Stage Details
v1.32 Alpha Feature gate KubeletFineGrainedAuthz introduced, disabled by default
v1.33 Beta Enabled by default; fine-grained checks for /pods, /runningPods/, /healthz, /configz
v1.36 GA Feature gate locked to enabled; fine-grained kubelet authorization is always active

What's next?

With fine-grained kubelet authorization now GA, the Kubernetes community can begin recommending and eventually enforcing the use of specific subresources instead of nodes/proxy for monitoring and observability workloads. The urgency of this migration is underscored by research showing that nodes/proxy GET can be abused for unlogged remote code execution via the WebSocket protocol, a risk present in the default RBAC configurations of dozens of widely deployed Helm charts. Over time, we expect the ecosystem to migrate to these fine-grained permissions by default.

Getting involved

This enhancement was driven by SIG Auth and SIG Node. If you are interested in contributing to the security and authorization features of Kubernetes, please join us in the SIG Auth and SIG Node communities.

We look forward to hearing your feedback and experiences with this feature!

24 Apr 2026 6:35pm GMT