30 Dec 2025
Kubernetes Blog
Kubernetes v1.35: Watch Based Route Reconciliation in the Cloud Controller Manager
Up to and including Kubernetes v1.34, the route controller in Cloud Controller Manager (CCM) implementations built using the k8s.io/cloud-provider library reconciles routes at a fixed interval. This causes unnecessary API requests to the cloud provider when there are no changes to routes. Other controllers implemented through the same library already use watch-based mechanisms, leveraging informers to avoid unnecessary API calls. A new feature gate is being introduced in v1.35 to allow changing the behavior of the route controller to use watch-based informers.
What's new?
The feature gate CloudControllerManagerWatchBasedRoutesReconciliation has been introduced to k8s.io/cloud-provider in alpha stage by SIG Cloud Provider. To enable this feature, set --feature-gates=CloudControllerManagerWatchBasedRoutesReconciliation=true on the CCM implementation you are using.
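As an illustration, here is a minimal, hypothetical excerpt of a cloud-controller-manager Deployment with the feature gate enabled; the Deployment name, image, and the other flags are placeholders that will differ for each provider's CCM distribution:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-cloud-controller-manager   # placeholder name
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: example-cloud-controller-manager
  template:
    metadata:
      labels:
        app: example-cloud-controller-manager
    spec:
      containers:
      - name: cloud-controller-manager
        # placeholder image; use your cloud provider's CCM image
        image: registry.example.com/cloud-controller-manager:v1.35.0
        args:
        - --cloud-provider=example   # provider-specific value
        - --feature-gates=CloudControllerManagerWatchBasedRoutesReconciliation=true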
About the feature gate
This feature gate makes the route reconciliation loop run whenever a node is added or deleted, or whenever a node's .spec.podCIDRs or .status.addresses fields are updated.
An additional reconcile is performed at a random interval between 12h and 24h, chosen at the controller's start time.
This feature gate does not modify the logic within the reconciliation loop. Therefore, users of a CCM implementation should not experience significant changes to their existing route configurations.
How can I learn more?
For more details, refer to KEP-5237.
30 Dec 2025 6:30pm GMT
29 Dec 2025
Kubernetes Blog
Kubernetes v1.35: Introducing Workload Aware Scheduling
Scheduling large workloads is a much more complex and fragile operation than scheduling a single Pod, as it often requires considering all Pods together instead of scheduling each one independently. For example, when scheduling a machine learning batch job, you often need to place each worker strategically, such as on the same rack, to make the entire process as efficient as possible. At the same time, the Pods that are part of such a workload are very often identical from the scheduling perspective, which fundamentally changes how this process should look.
There are many custom schedulers adapted to perform workload scheduling efficiently, but considering how common and important workload scheduling is to Kubernetes users, especially in the AI era with its growing number of use cases, it is high time to make workloads a first-class citizen of kube-scheduler and support them natively.
Workload aware scheduling
The recent 1.35 release of Kubernetes delivered the first tranche of workload aware scheduling improvements. These are part of a wider effort aiming to improve the scheduling and management of workloads. The effort will span many SIGs and releases, and will gradually expand the capabilities of the system toward the north star goal: seamless workload scheduling and management in Kubernetes, including, but not limited to, preemption and autoscaling.
Kubernetes v1.35 introduces the Workload API, which you can use to describe the desired shape as well as the scheduling-oriented requirements of a workload. It comes with an initial implementation of gang scheduling that instructs kube-scheduler to schedule gang Pods in an all-or-nothing fashion. Finally, we improved scheduling of identical Pods (which typically make up a gang) to speed up the process, thanks to the opportunistic batching feature.
Workload API
The new Workload API resource is part of the scheduling.k8s.io/v1alpha1 API group. This resource acts as a structured, machine-readable definition of the scheduling requirements of a multi-Pod application. While user-facing workloads like Jobs define what to run, the Workload resource determines how a group of Pods should be scheduled and how its placement should be managed throughout its lifecycle.
A Workload allows you to define a group of Pods and apply a scheduling policy to them. Here is what a gang scheduling configuration looks like. You can define a podGroup named workers and apply the gang policy with a minCount of 4.
apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
  name: training-job-workload
  namespace: some-ns
spec:
  podGroups:
  - name: workers
    policy:
      gang:
        # The gang is schedulable only if 4 pods can run at once
        minCount: 4
When you create your Pods, you link them to this Workload using the new workloadRef field:
apiVersion: v1
kind: Pod
metadata:
  name: worker-0
  namespace: some-ns
spec:
  workloadRef:
    name: training-job-workload
    podGroup: workers
  ...
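If you manage the gang with a higher-level workload such as a Job, one possible setup is sketched below; this assumes, purely for illustration, that you set workloadRef in the Pod template yourself and create at least minCount Pods up front. The Job name and container details are placeholders:
apiVersion: batch/v1
kind: Job
metadata:
  name: training-job            # hypothetical name
  namespace: some-ns
spec:
  completions: 4
  parallelism: 4                # create all 4 gang members up front
  template:
    spec:
      restartPolicy: Never
      workloadRef:              # assumed to be copied into every Pod the Job creates
        name: training-job-workload
        podGroup: workers
      containers:
      - name: worker
        # placeholder image and command
        image: registry.k8s.io/e2e-test-images/agnhost:2.45
        command: ["sh", "-c", "sleep 1h"]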
How gang scheduling works
The gang policy enforces all-or-nothing placement. Without gang scheduling, a Job might be partially scheduled, consuming resources without being able to run, leading to resource wastage and potential deadlocks.
When you create Pods that are part of a gang-scheduled pod group, the scheduler's GangScheduling plugin manages the lifecycle independently for each pod group (or replica key):
- When you create your Pods (or a controller makes them for you), the scheduler blocks them from scheduling until:
  - The referenced Workload object is created.
  - The referenced pod group exists in a Workload.
  - The number of pending Pods in that group meets your minCount.
- Once enough Pods arrive, the scheduler tries to place them. However, instead of binding them to nodes immediately, the Pods wait at a Permit gate.
- The scheduler checks if it has found valid assignments for the entire group (at least the minCount).
  - If there is room for the group, the gate opens, and all Pods are bound to nodes.
  - If only a subset of the group's Pods was successfully scheduled within a timeout (set to 5 minutes), the scheduler rejects all of the Pods in the group. They go back to the queue, freeing up the reserved resources for other workloads.
We'd like to point out that while this is a first implementation, the Kubernetes project firmly intends to improve and expand the gang scheduling algorithm in future releases. Benefits we hope to deliver include a single-cycle scheduling phase for a whole gang, workload-level preemption, and more, moving towards the north star goal.
Opportunistic batching
In addition to explicit gang scheduling, v1.35 introduces opportunistic batching. This is a Beta feature that improves scheduling latency for identical Pods.
Unlike gang scheduling, this feature does not require the Workload API or any explicit opt-in on the user's part. It works opportunistically within the scheduler by identifying Pods that have identical scheduling requirements (container images, resource requests, affinities, etc.). When the scheduler processes a Pod, it can reuse the feasibility calculations for subsequent identical Pods in the queue, significantly speeding up the process.
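As an illustration, replicas created from a single controller template, like the hypothetical Deployment below, are typically identical in all scheduling-relevant fields and are therefore natural candidates for this batching (the name, image, and resource requests are placeholders):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: identical-workers        # placeholder name
spec:
  replicas: 20
  selector:
    matchLabels:
      app: identical-workers
  template:
    metadata:
      labels:
        app: identical-workers
    spec:
      containers:
      - name: worker
        # placeholder image; every replica has identical scheduling requirements
        image: registry.k8s.io/e2e-test-images/agnhost:2.45
        resources:
          requests:
            cpu: "2"
            memory: 4Gi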
Most users will benefit from this optimization automatically, without taking any special steps, provided their Pods meet the following criteria.
Restrictions
Opportunistic batching works under specific conditions. All fields used by the kube-scheduler to find a placement must be identical between Pods. Additionally, using some features disables the batching mechanism for those Pods to ensure correctness.
Note that you may need to review your kube-scheduler configuration to ensure it is not implicitly disabling batching for your workloads.
See the docs for more details about restrictions.
The north star vision
The project has a broad ambition to deliver workload aware scheduling. These new APIs and scheduling enhancements are just the first steps. In the near future, the effort aims to tackle:
- Introducing a workload scheduling phase
- Improved support for multi-node DRA and topology aware scheduling
- Workload-level preemption
- Improved integration between scheduling and autoscaling
- Improved interaction with external workload schedulers
- Managing placement of workloads throughout their entire lifecycle
- Multi-workload scheduling simulations
And more. The priority and implementation order of these focus areas are subject to change. Stay tuned for further updates.
Getting started
To try the workload aware scheduling improvements:
- Workload API: Enable the GenericWorkload feature gate on both kube-apiserver and kube-scheduler, and ensure the scheduling.k8s.io/v1alpha1 API group is enabled (see the flag sketch after this list).
- Gang scheduling: Enable the GangScheduling feature gate on kube-scheduler (requires the Workload API to be enabled).
- Opportunistic batching: As a Beta feature, it is enabled by default in v1.35. You can disable it using the OpportunisticBatching feature gate on kube-scheduler if needed.
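For example, on a cluster where you control the control plane manifests directly, the relevant flags might look like the following excerpts (a sketch only; all other flags and fields are omitted):
# Excerpt from a kube-apiserver manifest: enable the v1alpha1 API group and the feature gate
spec:
  containers:
  - name: kube-apiserver
    command:
    - kube-apiserver
    - --runtime-config=scheduling.k8s.io/v1alpha1=true
    - --feature-gates=GenericWorkload=true
    # ...
# Excerpt from a kube-scheduler manifest: enable the Workload API and gang scheduling gates
spec:
  containers:
  - name: kube-scheduler
    command:
    - kube-scheduler
    - --feature-gates=GenericWorkload=true,GangScheduling=true
    # ...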
We encourage you to try out workload aware scheduling in your test clusters and share your experiences to help shape the future of Kubernetes scheduling. You can send your feedback by:
- Reaching out via Slack (#sig-scheduling).
- Commenting on the workload aware scheduling tracking issue.
- Filing a new issue in the Kubernetes repository.
Learn more
- Read the KEPs for Workload API and gang scheduling, and for Opportunistic batching.
- Track the Workload aware scheduling issue for recent updates.
29 Dec 2025 6:30pm GMT
23 Dec 2025
Kubernetes Blog
Kubernetes v1.35: Fine-grained Supplemental Groups Control Graduates to GA
On behalf of Kubernetes SIG Node, we are pleased to announce the graduation of fine-grained supplemental groups control to General Availability (GA) in Kubernetes v1.35!
The new Pod field, supplementalGroupsPolicy, was introduced as an opt-in alpha feature in Kubernetes v1.31 and graduated to beta in v1.33. Now, the feature is generally available. It allows you to implement more precise control over supplemental groups in Linux containers, which can strengthen your security posture, particularly when accessing volumes. Moreover, it also enhances the transparency of UID/GID details in containers, offering improved security oversight.
If you are planning to upgrade your cluster from v1.32 or an earlier version, please be aware that a breaking behavioral change was introduced when the feature reached beta (v1.33). For more details, see the behavioral changes introduced in beta and the upgrade considerations sections of the previous blog post about the graduation to beta.
Motivation: Implicit group memberships defined in /etc/group in the container image
Even though the majority of Kubernetes cluster admins/users may not be aware of this, by default Kubernetes merges group information from the Pod with information defined in /etc/group in the container image.
Here's an example: a Pod manifest that specifies spec.securityContext.runAsUser: 1000, spec.securityContext.runAsGroup: 3000, and spec.securityContext.supplementalGroups: 4000 as part of the Pod's security context.
apiVersion: v1
kind: Pod
metadata:
  name: implicit-groups-example
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 3000
    supplementalGroups: [4000]
  containers:
  - name: example-container
    image: registry.k8s.io/e2e-test-images/agnhost:2.45
    command: [ "sh", "-c", "sleep 1h" ]
    securityContext:
      allowPrivilegeEscalation: false
What is the result of the id command in the example-container container? The output should be similar to this:
uid=1000 gid=3000 groups=3000,4000,50000
Where does the group ID 50000 in the supplementary groups (the groups field) come from, even though 50000 is not defined in the Pod's manifest at all? The answer is the /etc/group file in the container image.
The /etc/group file in the container image contains something like the following:
user-defined-in-image:x:1000:
group-defined-in-image:x:50000:user-defined-in-image
The last entry shows that the container's primary user 1000 belongs to the group 50000.
Thus, the group membership defined in /etc/group in the container image for the container's primary user is implicitly merged to the information from the Pod. Please note that this was a design decision the current CRI implementations inherited from Docker, and the community never really reconsidered it until now.
What's wrong with it?
The implicitly merged group information from /etc/group in the container image poses a security risk. These implicit GIDs can't be detected or validated by policy engines because there's no record of them in the Pod manifest. This can lead to unexpected access control issues, particularly when accessing volumes (see kubernetes/kubernetes#112879 for details) because file permission is controlled by UID/GIDs in Linux.
Fine-grained supplemental groups control in a Pod: supplementalGroupsPolicy
To tackle this problem, a Pod's .spec.securityContext now includes the supplementalGroupsPolicy field.
This field lets you control how Kubernetes calculates the supplementary groups for container processes within a Pod. The available policies are:
- Merge: The group membership defined in /etc/group for the container's primary user will be merged. If not specified, this policy will be applied (i.e. as-is behavior for backward compatibility).
- Strict: Only the group IDs specified in fsGroup, supplementalGroups, or runAsGroup are attached as supplementary groups to the container processes. Group memberships defined in /etc/group for the container's primary user are ignored.
I'll explain how the Strict policy works. The following Pod manifest specifies supplementalGroupsPolicy: Strict:
apiVersion: v1
kind: Pod
metadata:
  name: strict-supplementalgroups-policy-example
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 3000
    supplementalGroups: [4000]
    supplementalGroupsPolicy: Strict
  containers:
  - name: example-container
    image: registry.k8s.io/e2e-test-images/agnhost:2.45
    command: [ "sh", "-c", "sleep 1h" ]
    securityContext:
      allowPrivilegeEscalation: false
The result of the id command in the example-container container should be similar to this:
uid=1000 gid=3000 groups=3000,4000
You can see that the Strict policy excludes group 50000 from groups!
Thus, ensuring supplementalGroupsPolicy: Strict (enforced by some policy mechanism) helps prevent implicit supplementary groups in a Pod.
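As one example of such a policy mechanism, the sketch below uses a ValidatingAdmissionPolicy to reject Pods that do not explicitly set supplementalGroupsPolicy: Strict. The names are hypothetical, and in practice you would likely scope the binding more narrowly, for example to selected namespaces:
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: require-strict-supplemental-groups    # hypothetical name
spec:
  matchConstraints:
    resourceRules:
    - apiGroups: [""]
      apiVersions: ["v1"]
      operations: ["CREATE", "UPDATE"]
      resources: ["pods"]
  validations:
  - expression: >-
      has(object.spec.securityContext) &&
      has(object.spec.securityContext.supplementalGroupsPolicy) &&
      object.spec.securityContext.supplementalGroupsPolicy == 'Strict'
    message: "spec.securityContext.supplementalGroupsPolicy must be set to Strict"
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: require-strict-supplemental-groups    # hypothetical name
spec:
  policyName: require-strict-supplemental-groups
  validationActions: ["Deny"]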
Note:
A container with sufficient privileges can change its process identity. The supplementalGroupsPolicy only affects the initial process identity.
Read on for more details.
Attached process identity in Pod status
This feature also exposes the process identity attached to the first container process of each container via the .status.containerStatuses[].user.linux field. This is helpful for checking whether implicit group IDs are attached.
...
status:
  containerStatuses:
  - name: ctr
    user:
      linux:
        gid: 3000
        supplementalGroups:
        - 3000
        - 4000
        uid: 1000
...
Note:
Please note that the values in the status.containerStatuses[].user.linux field reflect the process identity initially attached to the first container process in the container. If the container has sufficient privilege to call system calls related to process identity (e.g. setuid(2), setgid(2), setgroups(2), etc.), the container process can change its identity. Thus, the actual process identity can be dynamic.
There are several ways to restrict these permissions in containers. We suggest the following as simple solutions (see the snippet below):
- setting privileged: false and allowPrivilegeEscalation: false in your container's securityContext, or
- conforming your pod to the Restricted policy of the Pod Security Standards.
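For instance, the container-level securityContext from the manifests above could be tightened like this (a minimal sketch; adjust to your workload's needs):
containers:
- name: example-container
  image: registry.k8s.io/e2e-test-images/agnhost:2.45
  securityContext:
    privileged: false
    allowPrivilegeEscalation: false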
Also, the kubelet has no visibility into NRI plugins or the container runtime's internal workings. A cluster administrator configuring nodes, or a highly privileged workload with local administrator permissions, may change the supplemental groups of any pod. However, this is outside the scope of Kubernetes control and should not be a concern for security-hardened nodes.
Strict policy requires up-to-date container runtimes
The high-level container runtime (e.g. containerd, CRI-O) plays a key role in calculating the supplementary group IDs that will be attached to the containers. Thus, supplementalGroupsPolicy: Strict requires a CRI runtime that supports this feature. The old behavior (supplementalGroupsPolicy: Merge) works with a CRI runtime that does not support this feature, because this policy is fully backward compatible.
Here are some CRI runtimes that support this feature, and the versions you need to be running:
- containerd: v2.0 or later
- CRI-O: v1.31 or later
You can check whether the feature is supported by looking at the Node's .status.features.supplementalGroupsPolicy field. Please note that this field is different from status.declaredFeatures, introduced in KEP-5328: Node Declared Features (formerly Node Capabilities).
apiVersion: v1
kind: Node
...
status:
  features:
    supplementalGroupsPolicy: true
As container runtime support for this feature becomes universal, various security policies may start enforcing the more secure Strict behavior. It is best practice to ensure that your Pods are ready for this enforcement and that all supplemental groups are declared transparently in the Pod spec, rather than in images.
Getting involved
This enhancement was driven by the SIG Node community. Please join us to connect with the community and share your ideas and feedback around the above feature and beyond. We look forward to hearing from you!
How can I learn more?
- Configure a Security Context for a Pod or Container for further details of supplementalGroupsPolicy
- KEP-3619: Fine-grained SupplementalGroups control
23 Dec 2025 6:30pm GMT