09 Jan 2026
Kubernetes Blog
Kubernetes v1.35: Restricting executables invoked by kubeconfigs via exec plugin allowList added to kuberc
Did you know that kubectl can run arbitrary executables, including shell scripts, with the full privileges of the invoking user, and without your knowledge? Whenever you download or auto-generate a kubeconfig, the users[n].exec.command field can specify an executable to fetch credentials on your behalf. Don't get me wrong, this is an incredible feature that allows you to authenticate to the cluster with external identity providers. Nevertheless, you probably see the problem: Do you know exactly what executables your kubeconfig is running on your system? Do you trust the pipeline that generated your kubeconfig? If there has been a supply-chain attack on the code that generates the kubeconfig, or if the generating pipeline has been compromised, an attacker might well be doing unsavory things to your machine by tricking your kubeconfig into running arbitrary code.
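To see what this looks like in practice, here is a sketch of an exec credential plugin stanza in a kubeconfig; the cloudco-login command and its argument are made up for illustration, but the field layout follows the client.authentication.k8s.io/v1 exec plugin API:
apiVersion: v1
kind: Config
users:
- name: cloudco-user
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1
      # kubectl runs this executable, with your privileges, to obtain a token
      command: cloudco-login
      args:
      - get-token
      interactiveMode: Never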
To give the user more control over what gets run on their system, SIG-Auth and SIG-CLI added the credential plugin policy and allowlist as a beta feature to Kubernetes 1.35. This is available to all clients using the client-go library, by filling out the ExecProvider.PluginPolicy struct on a REST config. To broaden the impact of this change, Kubernetes v1.35 also lets you manage this without writing a line of application code. You can configure kubectl to enforce the policy and allowlist by adding two fields to the kuberc configuration file: credentialPluginPolicy and credentialPluginAllowlist. Adding one or both of these fields restricts which credential plugins kubectl is allowed to execute.
How it works
A full description of this functionality is available in our official documentation for kuberc, but this blog post will give a brief overview of the new security knobs. The new features are in beta and available without using any feature gates.
The following example is the simplest one: simply don't specify the new fields.
apiVersion: kubectl.config.k8s.io/v1beta1
kind: Preference
This will keep kubectl acting as it always has, and all plugins will be allowed.
The next example is functionally identical, but it is more explicit and therefore preferred if it's actually what you want:
apiVersion: kubectl.config.k8s.io/v1beta1
kind: Preference
credentialPluginPolicy: AllowAll
If you don't know whether or not you're using exec credential plugins, try setting your policy to DenyAll:
apiVersion: kubectl.config.k8s.io/v1beta1
kind: Preference
credentialPluginPolicy: DenyAll
If you are using credential plugins, you'll quickly find out what kubectl is trying to execute. You'll get an error like the following.
Unable to connect to the server: getting credentials: plugin "cloudco-login" not allowed: policy set to "DenyAll"
If there is insufficient information for you to debug the issue, increase the logging verbosity when you run your next command. For example:
# increase or decrease verbosity if the issue is still unclear
kubectl get pods -v=5
Selectively allowing plugins
What if you need the cloudco-login plugin to do your daily work? That is why there's a third option for your policy, Allowlist. To allow a specific plugin, set the policy and add the credentialPluginAllowlist:
apiVersion: kubectl.config.k8s.io/v1beta1
kind: Preference
credentialPluginPolicy: Allowlist
credentialPluginAllowlist:
- name: /usr/local/bin/cloudco-login
- name: get-identity
You'll notice that there are two entries in the allowlist. One of them is specified by full path, and the other, get-identity, is just a basename. When you specify just the basename, the full path is looked up using exec.LookPath; globbing and wildcards are not supported. Both forms (basename and full path) are acceptable, but the full path is preferable because it narrows the scope of allowed binaries even further.
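If you are curious how the basename lookup behaves, here is a minimal Go sketch using the standard library call mentioned above (os/exec.LookPath); it is illustrative only and is not kubectl's actual code:
package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// Resolve a bare command name against the directories in $PATH.
	// No glob or wildcard expansion is performed.
	path, err := exec.LookPath("get-identity")
	if err != nil {
		fmt.Println("not found on PATH:", err)
		return
	}
	fmt.Println("resolved to:", path) // e.g. /usr/local/bin/get-identity
}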
Future enhancements
Currently, an allowlist entry has only one field, name. In the future, we (Kubernetes SIG CLI) want to see other requirements added. One idea that seems useful is checksum verification whereby, for example, a binary would only be allowed to run if it has the sha256 sum b9a3fad00d848ff31960c44ebb5f8b92032dc085020f857c98e32a5d5900ff9c and exists at the path /usr/bin/cloudco-login.
Another possibility is only allowing binaries that have been signed by one of a set of trusted signing keys.
Get involved
The credential plugin policy is still under development and we are very interested in your feedback. We'd love to hear what you like about it and what problems you'd like to see it solve. Or, if you have the cycles, contributing one of the enhancements above would be a great way to get started contributing to Kubernetes. Feel free to join the discussion on Slack.
09 Jan 2026 6:30pm GMT
08 Jan 2026
Kubernetes Blog
Kubernetes v1.35: Mutable PersistentVolume Node Affinity (alpha)
The PersistentVolume node affinity API dates back to Kubernetes v1.10. It is widely used to express that volumes may not be equally accessible by all nodes in the cluster. This field was previously immutable, and it is now mutable in Kubernetes v1.35 (alpha). This change opens a door to more flexible online volume management.
Why make node affinity mutable?
This raises an obvious question: why make node affinity mutable now? While stateless workloads like Deployments can be changed freely and the changes will be rolled out automatically by re-creating every Pod, PersistentVolumes (PVs) are stateful and cannot be re-created easily without losing data.
However, storage providers evolve and storage requirements change. Most notably, multiple providers are offering regional disks now. Some of them even support live migration from zonal to regional disks, without disrupting the workloads. This change can be expressed through the VolumeAttributesClass API, which recently graduated to GA in v1.34. However, even if the volume is migrated to regional storage, Kubernetes still prevents scheduling Pods to other zones because of the node affinity recorded in the PV object. In this case, you may want to change the PV node affinity from:
spec:
nodeAffinity:
required:
nodeSelectorTerms:
- matchExpressions:
- key: topology.kubernetes.io/zone
operator: In
values:
- us-east1-b
to:
spec:
nodeAffinity:
required:
nodeSelectorTerms:
- matchExpressions:
- key: topology.kubernetes.io/region
operator: In
values:
- us-east1
As another example, providers sometimes offer new generations of disks. New disks cannot always be attached to older nodes in the cluster. This accessibility can also be expressed through PV node affinity and ensures the Pods can be scheduled to the right nodes. But when the disk is upgraded, new Pods using this disk can still be scheduled to older nodes. To prevent this, you may want to change the PV node affinity from:
spec:
nodeAffinity:
required:
nodeSelectorTerms:
- matchExpressions:
- key: provider.com/disktype.gen1
operator: In
values:
- available
to:
spec:
nodeAffinity:
required:
nodeSelectorTerms:
- matchExpressions:
- key: provider.com/disktype.gen2
operator: In
values:
- available
So, PV node affinity is now mutable: a first step towards more flexible online volume management. While it is a simple change that removes one validation from the API server, we still have a long way to go to integrate it well with the Kubernetes ecosystem.
Try it out
This feature is for you if you are a Kubernetes cluster administrator and your storage provider supports online updates that you want to take advantage of, but those updates can affect the accessibility of the volume.
Note that changing PV node affinity alone will not actually change the accessibility of the underlying volume. Before using this feature, you must first update the underlying volume in the storage provider, and understand which nodes can access the volume after the update. You can then enable this feature and keep the PV node affinity in sync.
Currently, this feature is in alpha. It is disabled by default and may be subject to change. To try it out, enable the MutablePVNodeAffinity feature gate on the kube-apiserver; you can then edit the PV spec.nodeAffinity field. Typically only administrators can edit PVs, so make sure you have the right RBAC permissions.
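For example, assuming a PV named pv-data whose underlying disk you have already migrated to regional storage, a merge patch along these lines could bring the node affinity in sync (the PV name and topology values are hypothetical):
kubectl patch pv pv-data --type=merge -p '
spec:
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.kubernetes.io/region
          operator: In
          values:
          - us-east1
'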
Race condition between updating and scheduling
There are only a few factors outside of a Pod that can affect the scheduling decision, and PV node affinity is one of them. It is fine to allow more nodes to access the volume by relaxing node affinity, but there is a race condition when you try to tighten it: the scheduler may not yet see the modified PV in its cache, so there is a small window where it may place a Pod on an old node that can no longer access the volume. In this case, the Pod will be stuck in the ContainerCreating state.
One mitigation currently under discussion is for the kubelet to fail Pod startup if the PersistentVolume's node affinity is violated. This has not landed yet. So if you are trying this out now, please watch subsequent Pods that use the updated PV and make sure they are scheduled onto nodes that can access the volume. If you update the PV and immediately start new Pods in a script, it may not work as intended.
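One simple way to do that check by hand (the Pod name is hypothetical):
# See which node the new Pod landed on
kubectl get pod my-app-0 -o wide
# Confirm that node's topology labels satisfy the updated PV node affinity
kubectl get node <node-name> --show-labels | grep topology.kubernetes.io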
Future integration with CSI (Container Storage Interface)
Currently, it is up to the cluster administrator to modify both the PV's node affinity and the underlying volume in the storage provider. But manual operations are error-prone and time-consuming. We would prefer to eventually integrate this with VolumeAttributesClass, so that an unprivileged user can modify their PersistentVolumeClaim (PVC) to trigger storage-side updates, and the PV node affinity is updated automatically when appropriate, without requiring a cluster administrator's intervention.
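For context, a storage-side update via VolumeAttributesClass already looks roughly like the following today; the class name, driver name and parameter are illustrative, and the automatic node affinity update described above does not exist yet:
apiVersion: storage.k8s.io/v1
kind: VolumeAttributesClass
metadata:
  name: regional-fast
driverName: pd.csi.example.com
parameters:
  replication-type: regional
The user would then set spec.volumeAttributesClassName: regional-fast on their PVC to trigger the update in the storage provider.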
We welcome feedback from users and storage driver developers
As noted earlier, this is only a first step.
If you are a Kubernetes user, we would like to learn how you use (or will use) PV node affinity. Is it beneficial to update it online in your case?
If you are a CSI driver developer, would you be willing to implement this feature? How would you like the API to look?
Please provide your feedback via:
- Slack channel #sig-storage.
- Mailing list kubernetes-sig-storage.
- The KEP issue Mutable PersistentVolume Node Affinity.
For any inquiries or specific questions related to this feature, please reach out to the SIG Storage community.
08 Jan 2026 6:30pm GMT
07 Jan 2026
Kubernetes Blog
Kubernetes v1.35: A Better Way to Pass Service Account Tokens to CSI Drivers
If you maintain a CSI driver that uses service account tokens, Kubernetes v1.35 brings a refinement you'll want to know about. Since the introduction of the TokenRequests feature, service account tokens requested by CSI drivers have been passed to them through the volume_context field. While this has worked, it's not the ideal place for sensitive information, and we've seen instances where tokens were accidentally logged in CSI drivers.
Kubernetes v1.35 introduces a beta solution to address this: CSI Driver Opt-in for Service Account Tokens via Secrets Field. This allows CSI drivers to receive service account tokens through the secrets field in NodePublishVolumeRequest, which is the appropriate place for sensitive data in the CSI specification.
Understanding the existing approach
When CSI drivers use the TokenRequests feature, they can request service account tokens for workload identity by configuring the TokenRequests field in the CSIDriver spec. These tokens are passed to drivers as part of the volume attributes map, using the key csi.storage.k8s.io/serviceAccount.tokens.
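For illustration, the value stored under that key is a JSON-encoded map from audience to token; a made-up example entry in the volume attributes map (with the token redacted) looks roughly like this:
"csi.storage.k8s.io/serviceAccount.tokens": "{\"example.com\":{\"token\":\"<REDACTED>\",\"expirationTimestamp\":\"2026-01-07T12:00:00Z\"}}"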
The volume_context field works, but it's not designed for sensitive data. Because of this, there are a few challenges:
First, the protosanitizer tool that CSI drivers use doesn't treat volume context as sensitive, so service account tokens can end up in logs when gRPC requests are logged. This happened with CVE-2023-2878 in the Secrets Store CSI Driver and CVE-2024-3744 in the Azure File CSI Driver.
Second, each CSI driver that wants to avoid this issue needs to implement its own sanitization logic, which leads to inconsistency across drivers.
The CSI specification already has a secrets field in NodePublishVolumeRequest that's designed exactly for this kind of sensitive information. The challenge is that we can't just change where we put the tokens without breaking existing CSI drivers that expect them in volume context.
How the opt-in mechanism works
Kubernetes v1.35 introduces an opt-in mechanism that lets CSI drivers choose how they receive service account tokens. This way, existing drivers continue working as they do today, and drivers can move to the more appropriate secrets field when they're ready.
CSI drivers can set a new field in their CSIDriver spec:
#
# CAUTION: this is an example configuration.
# Do not use this for your own cluster!
#
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
name: example-csi-driver
spec:
# ... existing fields ...
tokenRequests:
- audience: "example.com"
expirationSeconds: 3600
# New field for opting into secrets delivery
serviceAccountTokenInSecrets: true # defaults to false
The behavior depends on the serviceAccountTokenInSecrets field:
When set to false (the default), tokens are placed in VolumeContext with the key csi.storage.k8s.io/serviceAccount.tokens, just like today. When set to true, tokens are placed only in the Secrets field with the same key.
About the beta release
The CSIServiceAccountTokenSecrets feature gate is enabled by default on both kubelet and kube-apiserver. Since the serviceAccountTokenInSecrets field defaults to false, enabling the feature gate doesn't change any existing behavior. All drivers continue receiving tokens via volume context unless they explicitly opt in. This is why we felt comfortable starting at beta rather than alpha.
Guide for CSI driver authors
If you maintain a CSI driver that uses service account tokens, here's how to adopt this feature.
Adding fallback logic
First, update your driver code to check both locations for tokens. This makes your driver compatible with both the old and new approaches:
const serviceAccountTokenKey = "csi.storage.k8s.io/serviceAccount.tokens"
func getServiceAccountTokens(req *csi.NodePublishVolumeRequest) (string, error) {
// Check secrets field first (new behavior when driver opts in)
if tokens, ok := req.Secrets[serviceAccountTokenKey]; ok {
return tokens, nil
}
// Fall back to volume context (existing behavior)
if tokens, ok := req.VolumeContext[serviceAccountTokenKey]; ok {
return tokens, nil
}
return "", fmt.Errorf("service account tokens not found")
}
This fallback logic is backward compatible and safe to ship in any driver version, even before clusters upgrade to v1.35.
Rollout sequence
CSI driver authors need to follow a specific sequence when adopting this feature to avoid breaking existing volumes.
Driver preparation (can happen anytime)
You can start preparing your driver right away by adding fallback logic that checks both the secrets field and volume context for tokens. This code change is backward compatible and safe to ship in any driver version, even before clusters upgrade to v1.35. We encourage you to add this fallback logic early, cut releases, and even backport to maintenance branches where feasible.
Cluster upgrade and feature enablement
Once your driver has the fallback logic deployed, here's the safe rollout order for enabling the feature in a cluster:
- Complete the kube-apiserver upgrade to 1.35 or later
- Complete kubelet upgrade to 1.35 or later on all nodes
- Ensure CSI driver version with fallback logic is deployed (if not already done in preparation phase)
- Fully complete CSI driver DaemonSet rollout across all nodes
- Update your CSIDriver manifest to set serviceAccountTokenInSecrets: true
Important constraints
The most important thing to remember is timing. If your CSI driver DaemonSet and CSIDriver object are in the same manifest or Helm chart, you need two separate updates. Deploy the new driver version with fallback logic first, wait for the DaemonSet rollout to complete, then update the CSIDriver spec to set serviceAccountTokenInSecrets: true.
Also, don't update the CSIDriver before all driver pods have rolled out. If you do, volume mounts will fail on nodes still running the old driver version, since those pods only check volume context.
Why this matters
Adopting this feature helps in a few ways:
- It eliminates the risk of accidentally logging service account tokens as part of volume context in gRPC requests
- It uses the CSI specification's designated field for sensitive data, which feels right
- The protosanitizer tool automatically handles the secrets field correctly, so you don't need driver-specific workarounds
- It's opt-in, so you can migrate at your own pace without breaking existing deployments
Call to action
We (Kubernetes SIG Storage) encourage CSI driver authors to adopt this feature and provide feedback on the migration experience. If you have thoughts on the API design or run into any issues during adoption, please reach out to us on the #csi channel on Kubernetes Slack (for an invitation, visit https://slack.k8s.io/).
You can follow along on KEP-5538 to track progress across the coming Kubernetes releases.
07 Jan 2026 6:30pm GMT
05 Jan 2026
Kubernetes Blog
Kubernetes v1.35: Extended Toleration Operators to Support Numeric Comparisons (Alpha)
Many production Kubernetes clusters blend on-demand (higher-SLA) and spot/preemptible (lower-SLA) nodes to optimize costs while maintaining reliability for critical workloads. Platform teams need a safe default that keeps most workloads away from risky capacity, while allowing specific workloads to opt-in with explicit thresholds like "I can tolerate nodes with failure probability up to 5%".
Today, Kubernetes taints and tolerations can match exact values or check for existence, but they can't compare numeric thresholds. You'd need to create discrete taint categories, use external admission controllers, or accept less-than-optimal placement decisions.
In Kubernetes v1.35, we're introducing Extended Toleration Operators as an alpha feature. This enhancement adds Gt (Greater Than) and Lt (Less Than) operators to spec.tolerations, enabling threshold-based scheduling decisions that unlock new possibilities for SLA-based placement, cost optimization, and performance-aware workload distribution.
The evolution of tolerations
Historically, Kubernetes supported two primary toleration operators:
- Equal: The toleration matches a taint if the key and value are exactly equal
- Exists: The toleration matches a taint if the key exists, regardless of value
While these worked well for categorical scenarios, they fell short for numeric comparisons. Starting with v1.35, we are closing this gap.
Consider these real-world scenarios:
- SLA requirements: Schedule high-availability workloads only on nodes with failure probability below a certain threshold
- Cost optimization: Allow cost-sensitive batch jobs to run on cheaper nodes that exceed a specific cost-per-hour value
- Performance guarantees: Ensure latency-sensitive applications run only on nodes with disk IOPS or network bandwidth above minimum thresholds
Without numeric comparison operators, cluster operators have had to resort to workarounds like creating multiple discrete taint values or using external admission controllers, neither of which scale well or provide the flexibility needed for dynamic threshold-based scheduling.
Why extend tolerations instead of using NodeAffinity?
You might wonder: NodeAffinity already supports numeric comparison operators, so why extend tolerations? While NodeAffinity is powerful for expressing pod preferences, taints and tolerations provide critical operational benefits:
- Policy orientation: NodeAffinity is per-pod, requiring every workload to explicitly opt out of risky nodes. Taints invert that control: nodes declare their risk level, and only pods with matching tolerations may land there. This provides a safer default; most pods stay away from spot/preemptible nodes unless they explicitly opt in.
- Eviction semantics: NodeAffinity has no eviction capability. Taints support the NoExecute effect with tolerationSeconds, enabling operators to drain and evict pods when a node's SLA degrades or spot instances receive termination notices.
- Operational ergonomics: Centralized, node-side policy is consistent with other safety taints like disk-pressure and memory-pressure, making cluster management more intuitive.
This enhancement preserves the well-understood safety model of taints and tolerations while enabling threshold-based placement for SLA-aware scheduling.
Introducing Gt and Lt operators
Kubernetes v1.35 introduces two new operators for tolerations:
- Gt (Greater Than): The toleration matches if the taint's numeric value is greater than the toleration's value
- Lt (Less Than): The toleration matches if the taint's numeric value is less than the toleration's value
When a pod tolerates a taint with Lt, it's saying "I can tolerate nodes where this metric is less than my threshold", so it can only land on nodes whose taint value is below that threshold. Conversely, a Gt toleration says "I tolerate nodes that are above my minimum requirements", so the pod can run on nodes where the taint value exceeds the toleration's value.
These operators work with numeric taint values and enable the scheduler to make sophisticated placement decisions based on continuous metrics rather than discrete categories.
Note:
Numeric values for Gt and Lt operators must be positive 64-bit integers without leading zeros. For example, "100" is valid, but "0100" (with leading zero) and "0" (zero value) are not permitted.
The Gt and Lt operators work with all taint effects: NoSchedule, NoExecute, and PreferNoSchedule.
Use cases and examples
Let's explore how Extended Toleration Operators solve real-world scheduling challenges.
Example 1: Spot instance protection with SLA thresholds
Many clusters mix on-demand and spot/preemptible nodes to optimize costs. Spot nodes offer significant savings but have higher failure rates. You want most workloads to avoid spot nodes by default, while allowing specific workloads to opt-in with clear SLA boundaries.
First, taint spot nodes with their failure probability (for example, 15% annual failure rate):
apiVersion: v1
kind: Node
metadata:
name: spot-node-1
spec:
taints:
- key: "failure-probability"
value: "15"
effect: "NoExecute"
On-demand nodes have much lower failure rates:
apiVersion: v1
kind: Node
metadata:
name: ondemand-node-1
spec:
taints:
- key: "failure-probability"
value: "2"
effect: "NoExecute"
Critical workloads can specify strict SLA requirements:
apiVersion: v1
kind: Pod
metadata:
name: payment-processor
spec:
tolerations:
- key: "failure-probability"
operator: "Lt"
value: "5"
effect: "NoExecute"
tolerationSeconds: 30
containers:
- name: app
image: payment-app:v1
This pod will only schedule on nodes with failure-probability less than 5 (meaning ondemand-node-1 with 2% but not spot-node-1 with 15%). The NoExecute effect with tolerationSeconds: 30 means if a node's SLA degrades (for example, cloud provider changes the taint value), the pod gets 30 seconds to gracefully terminate before forced eviction.
Meanwhile, a fault-tolerant batch job can explicitly opt-in to spot instances:
apiVersion: v1
kind: Pod
metadata:
name: batch-job
spec:
tolerations:
- key: "failure-probability"
operator: "Lt"
value: "20"
effect: "NoExecute"
containers:
- name: worker
image: batch-worker:v1
This batch job tolerates nodes with failure probability up to 20%, so it can run on both on-demand and spot nodes, maximizing cost savings while accepting higher risk.
Example 2: AI workload placement with GPU tiers
AI and machine learning workloads often have specific hardware requirements. With Extended Toleration Operators, you can create GPU node tiers and ensure workloads land on appropriately powered hardware.
Taint GPU nodes with their compute capability score:
apiVersion: v1
kind: Node
metadata:
name: gpu-node-a100
spec:
taints:
- key: "gpu-compute-score"
value: "1000"
effect: "NoSchedule"
---
apiVersion: v1
kind: Node
metadata:
name: gpu-node-t4
spec:
taints:
- key: "gpu-compute-score"
value: "500"
effect: "NoSchedule"
A heavy training workload can require high-performance GPUs:
apiVersion: v1
kind: Pod
metadata:
name: model-training
spec:
tolerations:
- key: "gpu-compute-score"
operator: "Gt"
value: "800"
effect: "NoSchedule"
containers:
- name: trainer
image: ml-trainer:v1
resources:
limits:
nvidia.com/gpu: 1
This ensures the training pod only schedules on nodes with compute scores greater than 800 (like the A100 node), preventing placement on lower-tier GPUs that would slow down training.
Meanwhile, inference workloads with less demanding requirements can use any available GPU:
apiVersion: v1
kind: Pod
metadata:
name: model-inference
spec:
tolerations:
- key: "gpu-compute-score"
operator: "Gt"
value: "400"
effect: "NoSchedule"
containers:
- name: inference
image: ml-inference:v1
resources:
limits:
nvidia.com/gpu: 1
Example 3: Cost-optimized workload placement
For batch processing or non-critical workloads, you might want to minimize costs by running on cheaper nodes, even if they have lower performance characteristics.
Nodes can be tainted with their cost rating:
spec:
taints:
- key: "cost-per-hour"
value: "50"
effect: "NoSchedule"
A cost-sensitive batch job can express its tolerance for expensive nodes:
tolerations:
- key: "cost-per-hour"
operator: "Lt"
value: "100"
effect: "NoSchedule"
This batch job will schedule on nodes costing less than $100/hour but avoid more expensive nodes. Combined with Kubernetes scheduling priorities, this enables sophisticated cost-tiering strategies where critical workloads get premium nodes while batch workloads efficiently use budget-friendly resources.
Example 4: Performance-based placement
Storage-intensive applications often require minimum disk performance guarantees. With Extended Toleration Operators, you can enforce these requirements at the scheduling level.
tolerations:
- key: "disk-iops"
operator: "Gt"
value: "3000"
effect: "NoSchedule"
This toleration ensures the pod only schedules on nodes where disk-iops exceeds 3000. The Gt operator means "I need nodes that are greater than this minimum".
How to use this feature
Extended Toleration Operators is an alpha feature in Kubernetes v1.35. To try it out:
1. Enable the feature gate on both your API server and scheduler:
--feature-gates=TaintTolerationComparisonOperators=true
2. Taint your nodes with numeric values representing the metrics relevant to your scheduling needs:
kubectl taint nodes node-1 failure-probability=5:NoSchedule
kubectl taint nodes node-2 disk-iops=5000:NoSchedule
3. Use the new operators in your pod specifications:
spec:
  tolerations:
  - key: "failure-probability"
    operator: "Lt"
    value: "1"
    effect: "NoSchedule"
Note:
As an alpha feature, Extended Toleration Operators may change in future releases and should be used with caution in production environments. Always test thoroughly in non-production clusters first.
What's next?
This alpha release is just the beginning. As we gather feedback from the community, we plan to:
- Add support for CEL (Common Expression Language) expressions in tolerations and node affinity for even more flexible scheduling logic, including semantic versioning comparisons
- Improve integration with cluster autoscaling for threshold-aware capacity planning
- Graduate the feature to beta and eventually GA with production-ready stability
We're particularly interested in hearing about your use cases! Do you have scenarios where threshold-based scheduling would solve problems? Are there additional operators or capabilities you'd like to see?
Getting involved
This feature is driven by the SIG Scheduling community. Please join us to connect with the community and share your ideas and feedback around this feature and beyond.
You can reach the maintainers of this feature at:
- Slack: #sig-scheduling on Kubernetes Slack
- Mailing list: kubernetes-sig-scheduling@googlegroups.com
For questions or specific inquiries related to Extended Toleration Operators, please reach out to the SIG Scheduling community. We look forward to hearing from you!
How can I learn more?
- Taints and Tolerations for understanding the fundamentals
- Numeric comparison operators for details on using the Gt and Lt operators
- KEP-5471: Extended Toleration Operators for Threshold-Based Placement
05 Jan 2026 6:30pm GMT
02 Jan 2026
Kubernetes Blog
Kubernetes v1.35: New level of efficiency with in-place Pod restart
The release of Kubernetes 1.35 introduces a powerful new feature that provides a much-requested capability: the ability to trigger a full, in-place restart of a Pod. This feature, Restart All Containers (alpha in 1.35), offers an efficient way to reset a Pod's state compared to the resource-intensive approach of deleting and recreating the entire Pod. It is especially useful for AI/ML workloads, allowing application developers to concentrate on their core training logic while offloading complex failure-handling and recovery mechanisms to sidecars and declarative Kubernetes configuration. With RestartAllContainers and other planned enhancements, Kubernetes continues to add building blocks for creating the most flexible, robust, and efficient platforms for AI/ML workloads.
This new functionality is available by enabling the RestartAllContainersOnContainerExits feature gate. This alpha feature extends the Container Restart Rules feature, which graduated to beta in Kubernetes 1.35.
The problem: when a single container restart isn't enough and recreating pods is too costly
Kubernetes has long supported restart policies at the Pod level (restartPolicy) and, more recently, at the individual container level. These policies are great for handling crashes in a single, isolated process. However, many modern applications have more complex inter-container dependencies. For instance:
- An init container prepares the environment by mounting a volume or generating a configuration file. If the main application container corrupts this environment, simply restarting that one container is not enough. The entire initialization process needs to run again.
- A watcher sidecar monitors system health. If it detects an unrecoverable but retriable error state, it must trigger a restart of the main application container from a clean slate.
- A sidecar that manages a remote resource fails. Even if the sidecar restarts on its own, the main container may be stuck trying to access an outdated or broken connection.
In all these cases, the desired action is not to restart a single container, but all of them. Previously, the only way to achieve this was to delete the Pod and have a controller (like a Job or ReplicaSet) create a new one. This process is slow and expensive, involving the scheduler, node resource allocation and re-initialization of networking and storage.
This inefficiency becomes even worse when handling large-scale AI/ML workloads (>= 1,000 Nodes with one Pod per Node). A common requirement for these synchronous workloads is that when a failure occurs (such as a Node crash), all Pods in the fleet must be recreated to reset the state before training can resume, even if the other Pods were not directly affected by the failure. Deleting, creating and scheduling thousands of Pods simultaneously creates a massive bottleneck. The estimated overhead of such failures can reach $100,000 per month in wasted resources.
Handling these failures for AI/ML training jobs requires complex integrations touching both the training framework and Kubernetes, which are often fragile and toilsome. This feature introduces a Kubernetes-native solution, improving system robustness and allowing application developers to concentrate on their core training logic.
Another major benefit of restarting Pods in place is that keeping Pods on their assigned Nodes allows for further optimizations. For example, one can implement node-level caching tied to a specific Pod identity, something that is impossible when Pods are unnecessarily being recreated on different Nodes.
Introducing the RestartAllContainers action
To address this, Kubernetes v1.35 adds a new action to the container restart rules: RestartAllContainers. When a container exits in a way that matches a rule with this action, the kubelet initiates a fast, in-place restart of the Pod.
This in-place restart is highly efficient because it preserves the Pod's most important resources:
- The Pod's UID, IP address and network namespace.
- The Pod's sandbox and any attached devices.
- All volumes, including emptyDir and mounted volumes from PVCs.
After terminating all running containers, the Pod's startup sequence is re-executed from the very beginning. This means all init containers are run again in order, followed by the sidecar and regular containers, ensuring a completely fresh start in a known-good environment. With the exception of ephemeral containers (which are terminated), all other containers, including those that previously succeeded or failed, will be restarted, regardless of their individual restart policies.
Use cases
1. Efficient restarts for ML/Batch jobs
For ML training jobs, rescheduling a worker Pod on failure is a costly operation that wastes valuable compute resources. On a 1,000-node training cluster, rescheduling overhead can waste over $100,000 in compute resources monthly.
With the RestartAllContainers action, you can address this by enabling a much faster, hybrid recovery strategy: recreate only the "bad" Pods (e.g., those on unhealthy Nodes) while triggering RestartAllContainers for the remaining healthy Pods. Benchmarks show this reduces the recovery overhead from minutes to a few seconds.
With in-place restarts, a watcher sidecar can monitor the main training process. If it encounters a specific, retriable error, the watcher can exit with a designated code to trigger a fast reset of the worker Pod, allowing it to restart from the last checkpoint without involving the Job controller. This capability is now natively supported by Kubernetes.
Read more details about future development and JobSet features at KEP-467 JobSet in-place restart.
apiVersion: v1
kind: Pod
metadata:
name: ml-worker-pod
spec:
restartPolicy: Never
initContainers:
# This init container will re-run on every in-place restart
- name: setup-environment
image: my-repo/setup-worker:1.0
- name: watcher-sidecar
image: my-repo/watcher:1.0
restartPolicy: Always
restartPolicyRules:
- action: RestartAllContainers
onExit:
exitCodes:
operator: In
# A specific exit code from the watcher triggers a full pod restart
values: [88]
containers:
- name: main-application
image: my-repo/training-app:1.0
2. Re-running init containers for a clean state
Imagine a scenario where an init container is responsible for fetching credentials or setting up a shared volume. If the main application fails in a way that corrupts this shared state, you need the init container to rerun.
By configuring the main application to exit with a specific code upon detecting such a corruption, you can trigger the RestartAllContainers action, guaranteeing that the init container provides a clean setup before the application restarts.
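A minimal sketch of that setup, modeled on the earlier example (the image names and the exit code 42 are hypothetical, and the exact schema requirements follow the Container Restart Rules feature):
apiVersion: v1
kind: Pod
metadata:
  name: clean-state-example
spec:
  restartPolicy: Never
  initContainers:
  # Re-runs on every in-place restart, recreating the shared state
  - name: prepare-shared-state
    image: my-repo/setup:1.0
  containers:
  - name: main-application
    image: my-repo/app:1.0
    restartPolicy: Never
    restartPolicyRules:
    - action: RestartAllContainers
      onExit:
        exitCodes:
          operator: In
          # The app exits with 42 when it detects corrupted shared state
          values: [42]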
3. Handling high rate of similar tasks execution
There are cases where tasks are best represented as a Pod execution, and each task requires a clean execution environment. The task may be a game-session backend or the processing of a queue item. If the rate of tasks is high, running the whole cycle of Pod creation, scheduling and initialization is simply too expensive, especially when tasks can be short. The ability to restart all containers from scratch enables a Kubernetes-native way to handle this scenario without custom solutions or frameworks.
How to use it
To try this feature, you must enable the RestartAllContainersOnContainerExits feature gate on your Kubernetes cluster components (API server and kubelet) running Kubernetes v1.35+. This alpha feature extends the ContainerRestartRules feature, which graduated to beta in v1.35 and is enabled by default.
Once enabled, you can add restartPolicyRules to any container (init, sidecar, or regular) and use the RestartAllContainers action.
The feature is designed to be easily usable on existing apps. However, if an application does not follow some best practices, it may cause issues for the application or for observability tooling. When enabling the feature, make sure that all containers are reentrant and that external tooling is prepared for init containers to re-run. Also, when restarting all containers, the kubelet does not run preStop hooks. This means containers must be designed to handle abrupt termination without relying on preStop hooks for graceful shutdown.
Observing the restart
To make this process observable, a new Pod condition, AllContainersRestarting, is added to the Pod's status. When a restart is triggered, this condition becomes True and it reverts to False once all containers have terminated and the Pod is ready to start its lifecycle anew. This provides a clear signal to users and other cluster components about the Pod's state.
All containers restarted by this action will have their restart count incremented in the container status.
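For example, using the ml-worker-pod from the manifest above, you can inspect both signals with kubectl:
# Status of the new Pod condition ("True" while the in-place restart is in progress)
kubectl get pod ml-worker-pod \
  -o jsonpath='{.status.conditions[?(@.type=="AllContainersRestarting")].status}'
# Per-container restart counts, which increase after every in-place restart
kubectl get pod ml-worker-pod \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}={.restartCount}{"\n"}{end}'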
Learn more
- Read the official documentation on Pod Lifecycle.
- Read the detailed proposal in the KEP-5532: Restart All Containers on Container Exits.
- Read the proposal for JobSet in-place restart in JobSet issue #467.
We want your feedback!
As an alpha feature, RestartAllContainers is ready for you to experiment with and any use cases and feedback are welcome. This feature is driven by the SIG Node community. If you are interested in getting involved, sharing your thoughts, or contributing, please join us!
You can reach SIG Node through:
- Slack: #sig-node
- Mailing list
02 Jan 2026 6:30pm GMT
31 Dec 2025
Kubernetes Blog
Kubernetes 1.35: Enhanced Debugging with Versioned z-pages APIs
Debugging Kubernetes control plane components can be challenging, especially when you need to quickly understand the runtime state of a component or verify its configuration. With Kubernetes 1.35, we're enhancing the z-pages debugging endpoints with structured, machine-parseable responses that make it easier to build tooling and automate troubleshooting workflows.
What are z-pages?
z-pages are special debugging endpoints exposed by Kubernetes control plane components. Introduced as an alpha feature in Kubernetes 1.32, these endpoints provide runtime diagnostics for components like kube-apiserver, kube-controller-manager, kube-scheduler, kubelet and kube-proxy. The name "z-pages" comes from the convention of using /*z paths for debugging endpoints.
Currently, Kubernetes supports two primary z-page endpoints:
- /statusz - Displays high-level component information, including version, start time, uptime, and available debug paths
- /flagz - Shows all command-line arguments and their values used to start the component (with confidential values redacted for security)
These endpoints are valuable for human operators who need to quickly inspect component state, but until now, they only returned plain text output that was difficult to parse programmatically.
What's new in Kubernetes 1.35?
Kubernetes 1.35 introduces structured, versioned responses for both /statusz and /flagz endpoints. This enhancement maintains backward compatibility with the existing plain text format while adding support for machine-readable JSON responses.
Backward compatible design
The new structured responses are opt-in. Without specifying an Accept header, the endpoints continue to return the familiar plain text format:
$ curl --cert /etc/kubernetes/pki/apiserver-kubelet-client.crt \
--key /etc/kubernetes/pki/apiserver-kubelet-client.key \
--cacert /etc/kubernetes/pki/ca.crt \
https://localhost:6443/statusz
kube-apiserver statusz
Warning: This endpoint is not meant to be machine parseable, has no formatting compatibility guarantees and is for debugging purposes only.
Started: Wed Oct 16 21:03:43 UTC 2024
Up: 0 hr 00 min 16 sec
Go version: go1.23.2
Binary version: 1.35.0-alpha.0.1595
Emulation version: 1.35
Paths: /healthz /livez /metrics /readyz /statusz /version
Structured JSON responses
To receive a structured response, include the appropriate Accept header:
Accept: application/json;v=v1alpha1;g=config.k8s.io;as=Statusz
This returns a versioned JSON response:
{
"kind": "Statusz",
"apiVersion": "config.k8s.io/v1alpha1",
"metadata": {
"name": "kube-apiserver"
},
"startTime": "2025-10-29T00:30:01Z",
"uptimeSeconds": 856,
"goVersion": "go1.23.2",
"binaryVersion": "1.35.0",
"emulationVersion": "1.35",
"paths": [
"/healthz",
"/livez",
"/metrics",
"/readyz",
"/statusz",
"/version"
]
}
Similarly, /flagz supports structured responses with the header:
Accept: application/json;v=v1alpha1;g=config.k8s.io;as=Flagz
Example response:
{
"kind": "Flagz",
"apiVersion": "config.k8s.io/v1alpha1",
"metadata": {
"name": "kube-apiserver"
},
"flags": {
"advertise-address": "192.168.8.4",
"allow-privileged": "true",
"authorization-mode": "[Node,RBAC]",
"enable-priority-and-fairness": "true",
"profiling": "true"
}
}
Why structured responses matter
The addition of structured responses opens up several new possibilities:
1. Automated health checks and monitoring
Instead of parsing plain text, monitoring tools can now easily extract specific fields. For example, you can programmatically check if a component has been running with an unexpected emulated version or verify that critical flags are set correctly.
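As a hedged sketch, a check like the following could flag a component running with an unexpected emulation version (certificate paths shortened; field names taken from the sample response shown earlier):
want="1.35"
got=$(curl -s --cert client.crt --key client.key --cacert ca.crt \
  -H "Accept: application/json;v=v1alpha1;g=config.k8s.io;as=Statusz" \
  https://localhost:6443/statusz | jq -r '.emulationVersion')
[ "$got" = "$want" ] || echo "unexpected emulation version: $got"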
2. Better debugging tools
Developers can build sophisticated debugging tools that compare configurations across multiple components or track configuration drift over time. The structured format makes it trivial to diff configurations or validate that components are running with expected settings.
3. API versioning and stability
By introducing versioned APIs (starting with v1alpha1), we provide a clear path to stability. As the feature matures, we'll introduce v1beta1 and eventually v1, giving you confidence that your tooling won't break with future Kubernetes releases.
How to use structured z-pages
Prerequisites
Both endpoints require feature gates to be enabled:
- /statusz: Enable the ComponentStatusz feature gate
- /flagz: Enable the ComponentFlagz feature gate
Example: Getting structured responses
Here's an example using curl to retrieve structured JSON responses from the kube-apiserver:
# Get structured statusz response
curl \
--cert /etc/kubernetes/pki/apiserver-kubelet-client.crt \
--key /etc/kubernetes/pki/apiserver-kubelet-client.key \
--cacert /etc/kubernetes/pki/ca.crt \
-H "Accept: application/json;v=v1alpha1;g=config.k8s.io;as=Statusz" \
https://localhost:6443/statusz | jq .
# Get structured flagz response
curl \
--cert /etc/kubernetes/pki/apiserver-kubelet-client.crt \
--key /etc/kubernetes/pki/apiserver-kubelet-client.key \
--cacert /etc/kubernetes/pki/ca.crt \
-H "Accept: application/json;v=v1alpha1;g=config.k8s.io;as=Flagz" \
https://localhost:6443/flagz | jq .
Note:
The examples above use client certificate authentication and verify the server's certificate using --cacert. If you need to bypass certificate verification in a test environment, you can use --insecure (or -k), but this should never be done in production as it makes you vulnerable to man-in-the-middle attacks.
Important considerations
Alpha feature status
The structured z-page responses are an alpha feature in Kubernetes 1.35. This means:
- The API format may change in future releases
- These endpoints are intended for debugging, not production automation
- You should avoid relying on them for critical monitoring workflows until they reach beta or stable status
Security and access control
z-pages expose internal component information and require proper access controls. Here are the key security considerations:
Authorization: Access to z-page endpoints is restricted to members of the system:monitoring group, which follows the same authorization model as other debugging endpoints like /healthz, /livez, and /readyz. This ensures that only authorized users and service accounts can access debugging information. If your cluster uses RBAC, you can manage access by granting appropriate permissions to this group.
Authentication: The authentication requirements for these endpoints depend on your cluster's configuration. Unless anonymous authentication is enabled for your cluster, you typically need to use authentication mechanisms (such as client certificates) to access these endpoints.
Information disclosure: These endpoints reveal configuration details about your cluster components, including:
- Component versions and build information
- All command-line arguments and their values (with confidential values redacted)
- Available debug endpoints
Only grant access to trusted operators and debugging tools. Avoid exposing these endpoints to unauthorized users or automated systems that don't require this level of access.
Future evolution
As the feature matures, we (Kubernetes SIG Instrumentation) expect to:
- Introduce v1beta1 and eventually v1 versions of the API
- Gather community feedback on the response schema
- Potentially add additional z-page endpoints based on user needs
Try it out
We encourage you to experiment with structured z-pages in a test environment:
- Enable the ComponentStatusz and ComponentFlagz feature gates on your control plane components
- Try querying the endpoints with both plain text and structured formats
- Build a simple tool or script that uses the structured data
- Share your feedback with the community
Learn more
- z-pages documentation
- KEP-4827: Component Statusz
- KEP-4828: Component Flagz
- Join the discussion in the #sig-instrumentation channel on Kubernetes Slack
Get involved
We'd love to hear your feedback! The structured z-pages feature is designed to make Kubernetes easier to debug and monitor. Whether you're building internal tooling, contributing to open source projects, or just exploring the feature, your input helps shape the future of Kubernetes observability.
If you have questions, suggestions, or run into issues, please reach out to SIG Instrumentation. You can find us on Slack or at our regular community meetings.
Happy debugging!
31 Dec 2025 6:30pm GMT
30 Dec 2025
Kubernetes Blog
Kubernetes v1.35: Watch Based Route Reconciliation in the Cloud Controller Manager
Up to and including Kubernetes v1.34, the route controller in Cloud Controller Manager (CCM) implementations built using the k8s.io/cloud-provider library reconciles routes at a fixed interval. This causes unnecessary API requests to the cloud provider when there are no changes to routes. Other controllers implemented through the same library already use watch-based mechanisms, leveraging informers to avoid unnecessary API calls. A new feature gate is being introduced in v1.35 to allow changing the behavior of the route controller to use watch-based informers.
What's new?
The feature gate CloudControllerManagerWatchBasedRoutesReconciliation has been introduced to k8s.io/cloud-provider in alpha stage by SIG Cloud Provider. To enable this feature, pass --feature-gates=CloudControllerManagerWatchBasedRoutesReconciliation=true to the CCM implementation you are using.
About the feature gate
This feature gate makes the route reconciliation loop trigger whenever a Node is added or deleted, or whenever its .spec.podCIDRs or .status.addresses fields are updated.
An additional reconcile is performed at a random interval between 12h and 24h, chosen at the controller's start time.
This feature gate does not modify the logic within the reconciliation loop. Therefore, users of a CCM implementation should not experience significant changes to their existing route configurations.
How can I learn more?
For more details, refer to the KEP-5237.
30 Dec 2025 6:30pm GMT
29 Dec 2025
Kubernetes Blog
Kubernetes v1.35: Introducing Workload Aware Scheduling
Scheduling large workloads is a much more complex and fragile operation than scheduling a single Pod, as it often requires considering all Pods together instead of scheduling each one independently. For example, when scheduling a machine learning batch job, you often need to place each worker strategically, such as on the same rack, to make the entire process as efficient as possible. At the same time, the Pods that are part of such a workload are very often identical from the scheduling perspective, which fundamentally changes how this process should look.
There are many custom schedulers adapted to perform workload scheduling efficiently, but considering how common and important workload scheduling is to Kubernetes users, especially in the AI era with the growing number of use cases, it is high time to make workloads a first-class citizen for kube-scheduler and support them natively.
Workload aware scheduling
The recent 1.35 release of Kubernetes delivered the first tranche of workload aware scheduling improvements. These are part of a wider effort aiming to improve the scheduling and management of workloads. The effort will span many SIGs and releases, and is intended to gradually expand the capabilities of the system toward the north star goal: seamless workload scheduling and management in Kubernetes including, but not limited to, preemption and autoscaling.
Kubernetes v1.35 introduces the Workload API that you can use to describe the desired shape as well as scheduling-oriented requirements of the workload. It comes with an initial implementation of gang scheduling that instructs the kube-scheduler to schedule gang Pods in the all-or-nothing fashion. Finally, we improved scheduling of identical Pods (that typically make a gang) to speed up the process thanks to the opportunistic batching feature.
Workload API
The new Workload API resource is part of the scheduling.k8s.io/v1alpha1 API group. This resource acts as a structured, machine-readable definition of the scheduling requirements of a multi-Pod application. While user-facing workloads like Jobs define what to run, the Workload resource determines how a group of Pods should be scheduled and how its placement should be managed throughout its lifecycle.
A Workload allows you to define a group of Pods and apply a scheduling policy to them. Here is what a gang scheduling configuration looks like. You can define a podGroup named workers and apply the gang policy with a minCount of 4.
apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
name: training-job-workload
namespace: some-ns
spec:
podGroups:
- name: workers
policy:
gang:
# The gang is schedulable only if 4 pods can run at once
minCount: 4
When you create your Pods, you link them to this Workload using the new workloadRef field:
apiVersion: v1
kind: Pod
metadata:
name: worker-0
namespace: some-ns
spec:
workloadRef:
name: training-job-workload
podGroup: workers
...
How gang scheduling works
The gang policy enforces all-or-nothing placement. Without gang scheduling, a Job might be partially scheduled, consuming resources without being able to run, leading to resource wastage and potential deadlocks.
When you create Pods that are part of a gang-scheduled pod group, the scheduler's GangScheduling plugin manages the lifecycle independently for each pod group (or replica key):
1. When you create your Pods (or a controller makes them for you), the scheduler blocks them from scheduling until:
   - The referenced Workload object is created.
   - The referenced pod group exists in a Workload.
   - The number of pending Pods in that group meets your minCount.
2. Once enough Pods arrive, the scheduler tries to place them. However, instead of binding them to nodes immediately, the Pods wait at a Permit gate.
3. The scheduler checks if it has found valid assignments for the entire group (at least the minCount).
   - If there is room for the group, the gate opens, and all Pods are bound to nodes.
   - If only a subset of the group's Pods was successfully scheduled within a timeout (set to 5 minutes), the scheduler rejects all of the Pods in the group. They go back to the queue, freeing up the reserved resources for other workloads.
We'd like to point out that while this is a first implementation, the Kubernetes project firmly intends to improve and expand the gang scheduling algorithm in future releases. Benefits we hope to deliver include a single-cycle scheduling phase for a whole gang, workload-level preemption, and more, moving towards the north star goal.
Opportunistic batching
In addition to explicit gang scheduling, v1.35 introduces opportunistic batching. This is a Beta feature that improves scheduling latency for identical Pods.
Unlike gang scheduling, this feature does not require the Workload API or any explicit opt-in on the user's part. It works opportunistically within the scheduler by identifying Pods that have identical scheduling requirements (container images, resource requests, affinities, etc.). When the scheduler processes a Pod, it can reuse the feasibility calculations for subsequent identical Pods in the queue, significantly speeding up the process.
Most users will benefit from this optimization automatically, without taking any special steps, provided their Pods meet the following criteria.
Restrictions
Opportunistic batching works under specific conditions. All fields used by the kube-scheduler to find a placement must be identical between Pods. Additionally, using some features disables the batching mechanism for those Pods to ensure correctness.
Note that you may need to review your kube-scheduler configuration to ensure it is not implicitly disabling batching for your workloads.
See the docs for more details about restrictions.
The north star vision
The project has a broad ambition to deliver workload aware scheduling. These new APIs and scheduling enhancements are just the first steps. In the near future, the effort aims to tackle:
- Introducing a workload scheduling phase
- Improved support for multi-node DRA and topology aware scheduling
- Workload-level preemption
- Improved integration between scheduling and autoscaling
- Improved interaction with external workload schedulers
- Managing placement of workloads throughout their entire lifecycle
- Multi-workload scheduling simulations
And more. The priority and implementation order of these focus areas are subject to change. Stay tuned for further updates.
Getting started
To try the workload aware scheduling improvements:
- Workload API: Enable the GenericWorkload feature gate on both kube-apiserver and kube-scheduler, and ensure the scheduling.k8s.io/v1alpha1 API group is enabled (see the example flags after this list).
- Gang scheduling: Enable the GangScheduling feature gate on kube-scheduler (requires the Workload API to be enabled).
- Opportunistic batching: As a Beta feature, it is enabled by default in v1.35. You can disable it using the OpportunisticBatching feature gate on kube-scheduler if needed.
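A sketch of the corresponding component flags; the feature gate names come from the list above, and enabling the alpha API group on the kube-apiserver uses the standard --runtime-config mechanism (adjust to however you run your control plane):
# kube-apiserver
--feature-gates=GenericWorkload=true
--runtime-config=scheduling.k8s.io/v1alpha1=true

# kube-scheduler
--feature-gates=GenericWorkload=true,GangScheduling=true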
We encourage you to try out workload aware scheduling in your test clusters and share your experiences to help shape the future of Kubernetes scheduling. You can send your feedback by:
- Reaching out via Slack (#sig-scheduling).
- Commenting on the workload aware scheduling tracking issue.
- Filing a new issue in the Kubernetes repository.
Learn more
- Read the KEPs for the Workload API and gang scheduling, and for opportunistic batching.
- Track the Workload aware scheduling issue for recent updates.
29 Dec 2025 6:30pm GMT
23 Dec 2025
Kubernetes Blog
Kubernetes v1.35: Fine-grained Supplemental Groups Control Graduates to GA
On behalf of Kubernetes SIG Node, we are pleased to announce the graduation of fine-grained supplemental groups control to General Availability (GA) in Kubernetes v1.35!
The new Pod field, supplementalGroupsPolicy, was introduced as an opt-in alpha feature in Kubernetes v1.31 and graduated to beta in v1.33. Now, the feature is generally available. It allows you to implement more precise control over supplemental groups in Linux containers, which can strengthen your security posture, particularly when accessing volumes. Moreover, it also enhances the transparency of UID/GID details in containers, offering improved security oversight.
If you are planning to upgrade your cluster from v1.32 or an earlier version, please be aware that some breaking behavioral changes were introduced in beta (v1.33). For more details, see the behavioral changes introduced in beta and the upgrade considerations sections of the previous blog post announcing the graduation to beta.
Motivation: Implicit group memberships defined in /etc/group in the container image
Even though the majority of Kubernetes cluster admins/users may not be aware of this, by default Kubernetes merges group information from the Pod with information defined in /etc/group in the container image.
Here's an example; a Pod manifest that specifies spec.securityContext.runAsUser: 1000, spec.securityContext.runAsGroup: 3000 and spec.securityContext.supplementalGroups: 4000 as part of the Pod's security context.
apiVersion: v1
kind: Pod
metadata:
  name: implicit-groups-example
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 3000
    supplementalGroups: [4000]
  containers:
  - name: example-container
    image: registry.k8s.io/e2e-test-images/agnhost:2.45
    command: [ "sh", "-c", "sleep 1h" ]
    securityContext:
      allowPrivilegeEscalation: false
What is the result of the id command in the example-container container? The output should be similar to this:
uid=1000 gid=3000 groups=3000,4000,50000
Where does group ID 50000 in the supplementary groups (the groups field) come from, even though 50000 is not defined in the Pod's manifest at all? The answer is the /etc/group file in the container image.
Checking the contents of /etc/group in the container image reveals something like the following:
user-defined-in-image:x:1000:
group-defined-in-image:x:50000:user-defined-in-image
This shows that the container's primary user 1000 belongs to the group 50000 in the last entry.
Thus, the group membership defined in /etc/group in the container image for the container's primary user is implicitly merged with the information from the Pod. Please note that this was a design decision the current CRI implementations inherited from Docker, and the community never really reconsidered it until now.
What's wrong with it?
The implicitly merged group information from /etc/group in the container image poses a security risk. These implicit GIDs can't be detected or validated by policy engines because there's no record of them in the Pod manifest. This can lead to unexpected access control issues, particularly when accessing volumes (see kubernetes/kubernetes#112879 for details) because file permission is controlled by UID/GIDs in Linux.
Fine-grained supplemental groups control in a Pod: supplementalGroupsPolicy
To tackle this problem, a Pod's .spec.securityContext now includes the supplementalGroupsPolicy field.
This field lets you control how Kubernetes calculates the supplementary groups for container processes within a Pod. The available policies are:
- Merge: The group membership defined in /etc/group for the container's primary user is merged. If not specified, this policy is applied (i.e. the as-is behavior, for backward compatibility).
- Strict: Only the group IDs specified in fsGroup, supplementalGroups, or runAsGroup are attached as supplementary groups to the container processes. Group memberships defined in /etc/group for the container's primary user are ignored.
I'll explain how the Strict policy works. The following Pod manifest specifies supplementalGroupsPolicy: Strict:
apiVersion: v1
kind: Pod
metadata:
  name: strict-supplementalgroups-policy-example
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 3000
    supplementalGroups: [4000]
    supplementalGroupsPolicy: Strict
  containers:
  - name: example-container
    image: registry.k8s.io/e2e-test-images/agnhost:2.45
    command: [ "sh", "-c", "sleep 1h" ]
    securityContext:
      allowPrivilegeEscalation: false
The result of the id command in the example-container container should be similar to this:
uid=1000 gid=3000 groups=3000,4000
You can see that the Strict policy excludes group 50000 from groups!
Thus, ensuring supplementalGroupsPolicy: Strict (enforced by some policy mechanism) helps prevent implicit supplementary groups from being attached in a Pod.
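As an illustration of such a policy mechanism, here is a minimal sketch of a ValidatingAdmissionPolicy (with its binding) that rejects Pods that do not request the Strict policy. The resource names are illustrative, and in practice you would likely scope the policy more narrowly or exempt system namespaces.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: require-strict-supplemental-groups  # illustrative name
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups: [""]
      apiVersions: ["v1"]
      operations: ["CREATE", "UPDATE"]
      resources: ["pods"]
  validations:
  - expression: >-
      has(object.spec.securityContext) &&
      has(object.spec.securityContext.supplementalGroupsPolicy) &&
      object.spec.securityContext.supplementalGroupsPolicy == 'Strict'
    message: "Pods must set spec.securityContext.supplementalGroupsPolicy to Strict"
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: require-strict-supplemental-groups-binding  # illustrative name
spec:
  policyName: require-strict-supplemental-groups
  validationActions: ["Deny"]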
Note:
A container with sufficient privileges can change its process identity. The supplementalGroupsPolicy field only affects the initial process identity.
Read on for more details.
Attached process identity in Pod status
This feature also exposes the process identity attached to the first container process of each container via the .status.containerStatuses[].user.linux field. This is helpful for checking whether implicit group IDs are attached.
...
status:
  containerStatuses:
  - name: ctr
    user:
      linux:
        gid: 3000
        supplementalGroups:
        - 3000
        - 4000
        uid: 1000
...
Note:
Please note that the values in the status.containerStatuses[].user.linux field reflect the process identity initially attached to the first container process in the container. If the container has sufficient privilege to call process-identity-related system calls (e.g. setuid(2), setgid(2), or setgroups(2)), the container process can change its identity. Thus, the actual process identity is dynamic.
There are several ways to restrict these permissions in containers. We suggest the following as simple solutions (see the sketch after this list):
- setting privileged: false and allowPrivilegeEscalation: false in your container's securityContext, or
- conforming your Pod to the Restricted policy of the Pod Security Standards.
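For example, a Pod that follows both suggestions while using the Strict policy might look like this sketch (the name is illustrative, and the capability drop is additional hardening rather than a requirement of this feature):
apiVersion: v1
kind: Pod
metadata:
  name: restricted-identity-example  # illustrative name
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 3000
    supplementalGroupsPolicy: Strict
  containers:
  - name: example-container
    image: registry.k8s.io/e2e-test-images/agnhost:2.45
    command: [ "sh", "-c", "sleep 1h" ]
    securityContext:
      privileged: false
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]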
Also, the kubelet has no visibility into NRI plugins or the container runtime's internal workings. A cluster administrator configuring nodes, or highly privileged workloads running with local administrator permissions, may change the supplemental groups for any Pod. However, this is outside the scope of Kubernetes control and should not be a concern for security-hardened nodes.
Strict policy requires up-to-date container runtimes
The high-level container runtime (e.g. containerd, CRI-O) plays a key role in calculating the supplementary group IDs that are attached to containers. Thus, supplementalGroupsPolicy: Strict requires a CRI runtime that supports this feature. The old behavior (supplementalGroupsPolicy: Merge) works with CRI runtimes that do not support this feature, because this policy is fully backward compatible.
Here are some CRI runtimes that support this feature, and the versions you need to be running:
- containerd: v2.0 or later
- CRI-O: v1.31 or later
You can see whether the feature is supported in the Node's .status.features.supplementalGroupsPolicy field. Please note that this field is different from status.declaredFeatures, introduced in KEP-5328: Node Declared Features (formerly Node Capabilities).
apiVersion: v1
kind: Node
...
status:
  features:
    supplementalGroupsPolicy: true
As container runtimes come to support this feature universally, various security policies may start enforcing the Strict behavior as the more secure option. It is best practice to ensure that your Pods are ready for this enforcement and that all supplemental groups are transparently declared in the Pod spec, rather than in images.
Getting involved
This enhancement was driven by the SIG Node community. Please join us to connect with the community and share your ideas and feedback around the above feature and beyond. We look forward to hearing from you!
How can I learn more?
- Configure a Security Context for a Pod or Container, for further details of supplementalGroupsPolicy
- KEP-3619: Fine-grained SupplementalGroups control
23 Dec 2025 6:30pm GMT
22 Dec 2025
Kubernetes Blog
Kubernetes v1.35: Kubelet Configuration Drop-in Directory Graduates to GA
With the recent v1.35 release of Kubernetes, support for a kubelet configuration drop-in directory is generally available. The newly stable feature simplifies the management of kubelet configuration across large, heterogeneous clusters.
With v1.35, the kubelet command line argument --config-dir is production-ready and fully supported, allowing you to specify a directory containing kubelet configuration drop-in files. All files in that directory will be automatically merged with your main kubelet configuration. This allows cluster administrators to maintain a cohesive base configuration for kubelets while enabling targeted customizations for different node groups or use cases, and without complex tooling or manual configuration management.
The problem: managing kubelet configuration at scale
As Kubernetes clusters grow larger and more complex, they often include heterogeneous node pools with different hardware capabilities, workload requirements, and operational constraints. This diversity necessitates different kubelet configurations across node groups, yet managing these varied configurations at scale becomes increasingly challenging. Several pain points emerge:
- Configuration drift: Different nodes may have slightly different configurations, leading to inconsistent behavior
- Node group customization: GPU nodes, edge nodes, and standard compute nodes often require different kubelet settings
- Operational overhead: Maintaining separate, complete configuration files for each node type is error-prone and difficult to audit
- Change management: Rolling out configuration changes across heterogeneous node pools requires careful coordination
Before this support was added to Kubernetes, cluster administrators had to choose between using a single monolithic configuration file for all nodes, manually maintaining multiple complete configuration files, or relying on separate tooling. Each approach had its own drawbacks. This graduation to stable gives cluster administrators a fully supported fourth way to solve that challenge.
Example use cases
Managing heterogeneous node pools
Consider a cluster with multiple node types: standard compute nodes, high-capacity nodes (such as those with GPUs or large amounts of memory), and edge nodes with specialized requirements.
Base configuration
File: 00-base.conf
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
clusterDNS:
- "10.96.0.10"
clusterDomain: cluster.local
High-capacity node override
File: 50-high-capacity-nodes.conf
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
maxPods: 50
systemReserved:
  memory: "4Gi"
  cpu: "1000m"
Edge node override
File: 50-edge-nodes.conf (edge compute typically has lower capacity)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "500Mi"
  nodefs.available: "5%"
With this structure, high-capacity nodes apply both the base configuration and the capacity-specific overrides, while edge nodes apply the base configuration with edge-specific settings.
Gradual configuration rollouts
When rolling out configuration changes, you can:
- Add a new drop-in file with a high numeric prefix (e.g., 99-new-feature.conf)
- Test the changes on a subset of nodes
- Gradually roll out to more nodes
- Once stable, merge the changes into the base configuration
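As a sketch, the drop-in added in the first step could look like the following; the file name and the setting being rolled out are purely illustrative:
File: 99-new-feature.conf
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
serializeImagePulls: false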
Viewing the merged configuration
Since configuration is now spread across multiple files, you can inspect the final merged configuration using the kubelet's /configz endpoint:
# Start kubectl proxy
kubectl proxy
# In another terminal, fetch the merged configuration
# Change the '<node-name>' placeholder before running the curl command
curl -X GET http://127.0.0.1:8001/api/v1/nodes/<node-name>/proxy/configz | jq .
This shows the actual configuration the kubelet is using after all merging has been applied. The merged configuration also includes any configuration settings that were specified via kubelet command-line arguments.
For detailed setup instructions, configuration examples, and merging behavior, see the official documentation.
Good practices
When using the kubelet configuration drop-in directory:
- Test configurations incrementally: Always test new drop-in configurations on a subset of nodes before rolling out cluster-wide, to minimize risk
- Version control your drop-ins: Store your drop-in configuration files in version control (or the configuration source from which they are generated) alongside your infrastructure as code, to track changes and enable easy rollbacks
- Use numeric prefixes for predictable ordering: Name files with numeric prefixes (e.g., 00-, 50-, 90-) to explicitly control merge order and make the configuration layering obvious to other administrators
- Be mindful of temporary files: Some text editors automatically create backup files (such as .bak, .swp, or files with a ~ suffix) in the same directory when editing. Ensure these temporary or backup files are not left in the configuration directory, as they may be processed by the kubelet
Acknowledgments
This feature was developed through the collaborative efforts of SIG Node. Special thanks to all contributors who helped design, implement, test, and document this feature across its journey from alpha in v1.28, through beta in v1.30, to GA in v1.35.
To provide feedback on this feature, join the Kubernetes Node Special Interest Group, participate in discussions on the public Slack channel (#sig-node), or file an issue on GitHub.
Get involved
If you have feedback or questions about kubelet configuration management, or want to share your experience using this feature, join the discussion:
- SIG Node community page
- Kubernetes Slack in the #sig-node channel
- SIG Node mailing list
SIG Node would love to hear about your experiences using this feature in production!
22 Dec 2025 6:30pm GMT
21 Dec 2025
Kubernetes Blog
Avoiding Zombie Cluster Members When Upgrading to etcd v3.6
This article is a mirror of an original that was recently published to the official etcd blog. The key takeaway? Always upgrade to etcd v3.5.26 or later before moving to v3.6. This ensures your cluster is automatically repaired, and avoids zombie members.
Issue summary
Recently, the etcd community addressed an issue that may appear when users upgrade from v3.5 to v3.6. This bug can cause the cluster to report "zombie members", which are etcd nodes that were removed from the database cluster some time ago, and are re-appearing and joining database consensus. The etcd cluster is then inoperable until these zombie members are removed.
In etcd v3.5 and earlier, the v2store was the source of truth for membership data, even though the v3store was also present. As a part of our v2store deprecation plan, in v3.6 the v3store is the source of truth for cluster membership. Through a bug report we found out that, in some older clusters, v2store and v3store could become inconsistent. This inconsistency manifests after upgrading as seeing old, removed "zombie" cluster members re-appearing in the cluster.
The fix and upgrade path
We've added a mechanism in etcd v3.5.26 to automatically sync v3store from v2store, ensuring that affected clusters are repaired before upgrading to 3.6.x.
To support the many users currently upgrading to 3.6, we have provided the following safe upgrade path:
- Upgrade your cluster to v3.5.26 or later.
- Wait and confirm that all members are healthy post-update.
- Upgrade to v3.6.
We are unable to provide a safe workaround path for users who have some obstacle preventing updating to v3.5.26. As such, if v3.5.26 is not available from your packaging source or vendor, you should delay upgrading to v3.6 until it is.
Additional technical detail
Information below is offered for reference only. Users can follow the safe upgrade path without knowledge of the following details.
This issue is encountered with clusters that have been running in production on etcd v3.5.25 or earlier. It is a side effect of adding and removing members from the cluster, or recovering the cluster from failure. This means that the issue is more likely the older the etcd cluster is, but it cannot be ruled out for any user regardless of the age of the cluster.
etcd maintainers, working with issue reporters, have found three possible triggers for the issue based on symptoms and an analysis of etcd code and logs:
- Bug in etcdctl snapshot restore (v3.4 and older versions): When restoring a snapshot using etcdctl snapshot restore, etcdctl was supposed to remove existing members before adding the new ones. In v3.4, due to a bug, old members were not removed, resulting in zombie members. Refer to the comment on etcdctl.
- --force-new-cluster in v3.5 and earlier versions: In rare cases, forcibly creating a new single-member cluster did not fully remove old members, leaving zombies. The issue was resolved in v3.5.22. Please refer to this PR in the Raft project for detailed technical information.
- --unsafe-no-sync enabled: If --unsafe-no-sync is enabled, in rare cases etcd might persist a membership change to v3store but crash before writing it to the WAL, causing inconsistency between v2store and v3store. This is a problem for single-member clusters. For multi-member clusters, forcibly creating a new single-member cluster from the crashed node's data may lead to zombie members.
Note:
--unsafe-no-sync is generally not recommended, as it may break the guarantees given by the consensus protocol.
Importantly, there may be other triggers for v2store and v3store membership data becoming inconsistent that we have not yet found. This means that you cannot assume that you are safe just because you have not performed any of the three actions above. Once users are upgraded to etcd v3.6, v3store becomes the source of membership data, and further inconsistency is not possible.
Advanced users who want to verify the consistency between v2store and v3store can follow the steps described in this comment. This check is not required to fix the issue, nor does SIG etcd recommend bypassing the v3.5.26 update regardless of the results of the check.
Key takeaway
Always upgrade to v3.5.26 or later before moving to v3.6. This ensures your cluster is automatically repaired and avoids zombie members.
Acknowledgements
We would like to thank Christian Baumann for reporting this long-standing upgrade issue. His report and follow-up work helped bring the issue to our attention so that we could investigate and resolve it upstream.
21 Dec 2025 12:00am GMT
19 Dec 2025
Kubernetes Blog
Kubernetes 1.35: In-Place Pod Resize Graduates to Stable
This release marks a major step: more than 6 years after its initial conception, the In-Place Pod Resize feature (also known as In-Place Pod Vertical Scaling), first introduced as alpha in Kubernetes v1.27 and graduated to beta in v1.33, is now stable (GA) in Kubernetes v1.35!
This graduation is a major milestone for improving resource efficiency and flexibility for workloads running on Kubernetes.
What is in-place Pod Resize?
In the past, the CPU and memory resources allocated to a container in a Pod were immutable. This meant changing them required deleting and recreating the entire Pod. For stateful services, batch jobs, or latency-sensitive workloads, this was an incredibly disruptive operation.
In-Place Pod Resize makes CPU and memory requests and limits mutable, allowing you to adjust these resources within a running Pod, often without requiring a container restart.
Key Concepts:
- Desired Resources: A container's spec.containers[*].resources field now represents the desired resources. For CPU and memory, these fields are now mutable.
- Actual Resources: The status.containerStatuses[*].resources field reflects the resources currently configured for a running container.
- Triggering a Resize: You can request a resize by updating the desired requests and limits in the Pod's specification, using the new resize subresource.
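To make these concepts concrete, here is a minimal sketch of a Pod whose CPU and memory can later be adjusted through the resize subresource. The optional resizePolicy entries control whether resizing each resource restarts the container; the names and values are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: resize-example  # illustrative name
spec:
  containers:
  - name: app
    image: registry.k8s.io/e2e-test-images/agnhost:2.45
    command: [ "sh", "-c", "sleep 1h" ]
    resources:
      requests:
        cpu: "500m"
        memory: "256Mi"
      limits:
        cpu: "1"
        memory: "512Mi"
    # Optional: choose whether resizing each resource requires a container restart.
    resizePolicy:
    - resourceName: cpu
      restartPolicy: NotRequired
    - resourceName: memory
      restartPolicy: RestartContainer
A later change to these requests and limits would then be submitted through the resize subresource (for example, kubectl patch with --subresource resize) rather than by editing the Pod spec directly.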
How can I start using in-place Pod Resize?
Detailed usage instructions and examples are provided in the official documentation: Resize CPU and Memory Resources assigned to Containers.
How does this help me?
In-place Pod Resize is a foundational building block that unlocks seamless, vertical autoscaling and improvements to workload efficiency.
- Resources adjusted without disruption: Workloads sensitive to latency or restarts can have their resources modified in place without downtime or loss of state.
- More powerful autoscaling: Autoscalers can now adjust resources with less impact. For example, the Vertical Pod Autoscaler (VPA)'s InPlaceOrRecreate update mode, which leverages this feature, has graduated to beta. This allows resources to be adjusted automatically and seamlessly based on usage, with minimal disruption. See AEP-4016 for more details.
- Address transient resource needs: Workloads that temporarily need more resources can be adjusted quickly. This enables features like CPU Startup Boost (AEP-7862), where applications can request more CPU during startup and then automatically scale back down.
Here are a few examples of some use cases:
- A game server that needs to adjust its size with shifting player count.
- A pre-warmed worker that can be shrunk while unused but inflated with the first request.
- Dynamically scale with load for efficient bin-packing.
- Increased resources for JIT compilation on startup.
Changes between beta (1.33) and stable (1.35)
Since the initial beta in v1.33, development effort has primarily been around stabilizing the feature and improving its usability based on community feedback. Here are the primary changes for the stable release:
- Memory limit decrease: Decreasing memory limits was previously prohibited. This restriction has been lifted, and memory limit decreases are now permitted. The Kubelet attempts to prevent OOM-kills by allowing the resize only if the current memory usage is below the new desired limit. However, this check is best-effort and not guaranteed.
- Prioritized resizes: If a node doesn't have enough room to accept all resize requests, deferred resizes are reattempted based on the following priority:
  - PriorityClass
  - QoS class
  - Duration deferred, with older requests prioritized first.
- Pod Level Resources (Alpha): Support for in-place Pod Resize with Pod Level Resources has been introduced behind its own feature gate, which is alpha in v1.35.
- Increased observability: There are now new Kubelet metrics and Pod events specifically associated with In-Place Pod Resize to help users track and debug resource changes.
What's next?
The graduation of In-Place Pod Resize to stable opens the door for powerful integrations across the Kubernetes ecosystem. There are several areas for further improvement that are currently planned.
Integration with autoscalers and other projects
There are planned integrations with several autoscalers and other projects to improve workload efficiency at a larger scale. Some projects under discussion:
- VPA CPU startup boost (AEP-7862): Allows applications to request more CPU at startup and scale back down after a specific period of time.
- VPA support for in-place updates (AEP-4016): VPA support for InPlaceOrRecreate has recently graduated to beta, with the eventual goal of graduating the feature to stable. Support for InPlace mode is still being worked on; see this pull request.
- Ray autoscaler: Plans to leverage In-Place Pod Resize to improve workload efficiency. See this Google Cloud blog post for more details.
- Agent-sandbox "Soft-Pause": Investigating leveraging In-Place Pod Resize for improved latency. See the GitHub issue for more details.
- Runtime support: Java and Python runtimes do not support resizing memory without a restart. There is an open conversation with the Java developers; see the bug.
If you have a project that could benefit from integration with in-place pod resize, please reach out using the channels listed in the feedback section!
Feature expansion
Today, In-Place Pod Resize is prohibited when used in combination with: swap, the static CPU Manager, and the static Memory Manager. Additionally, resources other than CPU and memory are still immutable. Expanding the set of supported features and resources is under consideration as more feedback about community needs comes in.
There are also plans to support workload preemption; if there is not enough room on the node for the resize of a high priority pod, the goal is to enable policies to automatically evict a lower-priority pod or upsize the node.
Improved stability
- Resolve kubelet-scheduler race conditions: There are known race conditions between the kubelet and the scheduler with regard to in-place Pod resize. Work is underway to resolve these issues over the next few releases. See the issue for more details.
- Safer memory limit decrease: The Kubelet's best-effort check for OOM-kill prevention can be made even safer by moving the memory usage check into the container runtime itself. See the issue for more details.
Providing feedback
As we look to build further on this foundational feature, please share your feedback on how to improve and extend it. You can do so through GitHub issues, mailing lists, or the Slack channels of the Kubernetes #sig-node and #sig-autoscaling communities.
Thank you to everyone who contributed to making this long-awaited feature a reality!
19 Dec 2025 6:30pm GMT
18 Dec 2025
Kubernetes Blog
Kubernetes v1.35: Job Managed By Goes GA
In Kubernetes v1.35, the ability to specify an external Job controller (through .spec.managedBy) graduates to General Availability.
This feature allows external controllers to take full responsibility for Job reconciliation, unlocking powerful scheduling patterns like multi-cluster dispatching with MultiKueue.
Why delegate Job reconciliation?
The primary motivation for this feature is to support multi-cluster batch scheduling architectures, such as MultiKueue.
The MultiKueue architecture distinguishes between a Management Cluster and a pool of Worker Clusters:
- The Management Cluster is responsible for dispatching Jobs but not executing them. It needs to accept Job objects to track status, but it skips the creation and execution of Pods.
- The Worker Clusters receive the dispatched Jobs and execute the actual Pods.
- Users usually interact with the Management Cluster. Because the status is automatically propagated back, they can observe the Job's progress "live" without accessing the Worker Clusters.
- In the Worker Clusters, the dispatched Jobs run as regular Jobs managed by the built-in Job controller, with no .spec.managedBy set.
By using .spec.managedBy, the MultiKueue controller on the Management Cluster can take over the reconciliation of a Job. It copies the status from the "mirror" Job running on the Worker Cluster back to the Management Cluster.
Why not just disable the Job controller? While one could theoretically achieve this by disabling the built-in Job controller entirely, this is often impossible or impractical for two reasons:
- Managed Control Planes: In many cloud environments, the Kubernetes control plane is locked, and users cannot modify controller manager flags.
- Hybrid Cluster Role: Users often need a "hybrid" mode where the Management Cluster dispatches some heavy workloads to remote clusters but still executes smaller or control-plane-related Jobs in the Management Cluster.
.spec.managedBy allows this granularity on a per-Job basis.
How .spec.managedBy works
The .spec.managedBy field indicates which controller is responsible for the Job. Specifically, there are two modes of operation:
- Standard: If unset or set to the reserved value kubernetes.io/job-controller, the built-in Job controller reconciles the Job as usual (standard behavior).
- Delegation: If set to any other value, the built-in Job controller skips reconciliation entirely for that Job.
To prevent orphaned Pods or resource leaks, this field is immutable. You cannot transfer a running Job from one controller to another.
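For illustration, a Job delegated to an external controller might look like the following sketch; the managedBy value here is a hypothetical controller name, not a value defined by Kubernetes:
apiVersion: batch/v1
kind: Job
metadata:
  name: delegated-job-example  # illustrative name
spec:
  # Any value other than kubernetes.io/job-controller tells the built-in
  # Job controller to leave this Job alone; the named controller owns it.
  managedBy: example.com/external-job-controller
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: registry.k8s.io/e2e-test-images/agnhost:2.45
        command: ["sh", "-c", "echo done"]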
If you are looking into implementing an external controller, be aware that your controller needs to conform to the definitions of the Job API. To enforce this conformance, a significant part of the effort was introducing extensive Job status validation rules. See the How can you learn more? section for more details.
Ecosystem Adoption
The .spec.managedBy field is rapidly becoming the standard interface for delegating control in the Kubernetes batch ecosystem.
Various custom workload controllers are adding this field (or an equivalent) to allow MultiKueue to take over their reconciliation and orchestrate them across clusters.
While it is possible to use .spec.managedBy to implement a custom Job controller from scratch, we haven't observed that yet. The feature is specifically designed to support delegation patterns, like MultiKueue, without reinventing the wheel.
How can you learn more?
If you want to dig deeper:
Read the user-facing documentation for the Job API and the managedBy field.
Deep dive into the design history:
- The Kubernetes Enhancement Proposal (KEP) Job's managed-by mechanism including introduction of the extensive Job status validation rules.
- The Kueue KEP for MultiKueue.
Explore how MultiKueue uses .spec.managedBy in practice in the task guide for running Jobs across clusters.
Acknowledgments
As with any Kubernetes feature, a lot of people helped shape this one through design discussions, reviews, test runs, and bug reports.
We would like to thank, in particular:
- Maciej Szulik - for guidance, mentorship, and reviews.
- Filip Křepinský - for guidance, mentorship, and reviews.
Get involved
This work was sponsored by the Kubernetes Batch Working Group in close collaboration with the SIG Apps, and with strong input from the SIG Scheduling community.
If you are interested in batch scheduling, multi-cluster solutions, or further improving the Job API:
- Join us in the Batch WG and SIG Apps meetings.
- Subscribe to the WG Batch Slack channel.
18 Dec 2025 6:30pm GMT
17 Dec 2025
Kubernetes Blog
Kubernetes v1.35: Timbernetes (The World Tree Release)
Editors: Aakanksha Bhende, Arujjwal Negi, Chad M. Crowell, Graziano Casto, Swathi Rao
Similar to previous releases, the release of Kubernetes v1.35 introduces new stable, beta, and alpha features. The consistent delivery of high-quality releases underscores the strength of our development cycle and the vibrant support from our community.
This release consists of 60 enhancements, including 17 stable, 19 beta, and 22 alpha features.
There are also some deprecations and removals in this release; make sure to read about those.
Release theme and logo

2025 began in the shimmer of Octarine: The Color of Magic (v1.33) and rode the gusts Of Wind & Will (v1.34). We close the year with our hands on the World Tree, inspired by Yggdrasil, the tree of life that binds many realms. Like any great tree, Kubernetes grows ring by ring and release by release, shaped by the care of a global community.
At its center sits the Kubernetes wheel wrapped around the Earth, grounded by the resilient maintainers, contributors and users who keep showing up. Between day jobs, life changes, and steady open-source stewardship, they prune old APIs, graft new features and keep one of the world's largest open source projects healthy.
Three squirrels guard the tree: a wizard holding the LGTM scroll for reviewers, a warrior with an axe and Kubernetes shield for the release crews who cut new branches, and a rogue with a lantern for the triagers who bring light to dark issue queues.
Together, they stand in for a much larger adventuring party. Kubernetes v1.35 adds another growth ring to the World Tree, a fresh cut shaped by many hands, many paths and a community whose branches reach higher as its roots grow deeper.
Spotlight on key updates
Kubernetes v1.35 is packed with new features and improvements. Here are a few select updates the Release Team would like to highlight!
Stable: In-place update of Pod resources
Kubernetes has graduated in-place updates for Pod resources to General Availability (GA). This feature allows users to adjust CPU and memory resources without restarting Pods or Containers. Previously, such modifications required recreating Pods, which could disrupt workloads, particularly for stateful or batch applications. Earlier Kubernetes releases allowed you to change only infrastructure resource settings (requests and limits) for existing Pods. The new in-place functionality allows for smoother, nondisruptive vertical scaling, improves efficiency, and can also simplify development.
This work was done as part of KEP #1287 led by SIG Node.
Beta: Pod certificates for workload identity and security
Previously, delivering certificates to pods required external controllers (cert-manager, SPIFFE/SPIRE), CRD orchestration, and Secret management, with rotation handled by sidecars or init containers. Kubernetes v1.35 enables native workload identity with automated certificate rotation, drastically simplifying service mesh and zero-trust architectures.
Now, the kubelet generates keys, requests certificates via PodCertificateRequest, and writes credential bundles directly to the Pod's filesystem. The kube-apiserver enforces node restriction at admission time, eliminating the most common pitfall for third-party signers: accidentally violating node isolation boundaries. This enables pure mTLS flows with no bearer tokens in the issuance path.
This work was done as part of KEP #4317 led by SIG Auth.
Alpha: Node declared features before scheduling
When control planes enable new features but nodes lag behind (permitted by the Kubernetes skew policy), the scheduler can place pods requiring those features onto incompatible older nodes. The node declared features framework allows nodes to declare the Kubernetes features they support. With the new alpha feature enabled, a Node reports the features it supports, publishing this information to the control plane via a new .status.declaredFeatures field. Then, the kube-scheduler, admission controllers, and third-party components can use these declarations. For example, you can enforce scheduling and API validation constraints to ensure that Pods run only on compatible nodes.
This work was done as part of KEP #5328 led by SIG Node.
Features graduating to Stable
This is a selection of some of the improvements that are now stable following the v1.35 release.
PreferSameNode traffic distribution
The trafficDistribution field for Services has been updated to provide more explicit control over traffic routing. A new option, PreferSameNode, has been introduced to let services strictly prioritize endpoints on the local node if available, falling back to remote endpoints otherwise.
Simultaneously, the existing PreferClose option has been renamed to PreferSameZone. This change makes the API self-explanatory by explicitly indicating that traffic is preferred within the current availability zone. While PreferClose is preserved for backward compatibility, PreferSameZone is now the standard for zonal routing, ensuring that both node-level and zone-level preferences are clearly distinguished.
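As a brief sketch, opting a Service into node-local routing looks like this (the Service name and selector are illustrative):
apiVersion: v1
kind: Service
metadata:
  name: local-first  # illustrative name
spec:
  selector:
    app: web
  ports:
  - port: 80
    targetPort: 8080
  # Prefer endpoints on the same node, falling back to other endpoints if none exist locally.
  trafficDistribution: PreferSameNode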
This work was done as part of KEP #3015 led by SIG Network.
Job API managed-by mechanism
The Job API now includes a managedBy field that allows an external controller to handle Job status synchronization. This feature, which graduates to stable in Kubernetes v1.35, is primarily driven by MultiKueue, a multi-cluster dispatching system where a Job created in a management cluster is mirrored and executed in a worker cluster, with status updates propagated back. To enable this workflow, the built-in Job controller must not act on a particular Job resource so that the Kueue controller can manage status updates instead.
The goal is to allow clean delegation of Job synchronization to another controller. It does not aim to pass custom parameters to that controller or modify CronJob concurrency policies.
This work was done as part of KEP #4368 led by SIG Apps.
Reliable Pod update tracking with .metadata.generation
Historically, the Pod API lacked the metadata.generation field found in other Kubernetes objects such as Deployments. Because of this omission, controllers and users had no reliable way to verify whether the kubelet had actually processed the latest changes to a Pod's specification. This ambiguity was particularly problematic for features like In-Place Pod Vertical Scaling, where it was difficult to know exactly when a resource resize request had been enacted.
Kubernetes v1.33 added .metadata.generation fields for Pods, as an alpha feature. That field is now stable in the v1.35 Pod API, which means that every time a Pod's spec is updated, the .metadata.generation value is incremented. As part of this improvement, the Pod API also gained a .status.observedGeneration field, which reports the generation that the kubelet has successfully seen and processed. Pod conditions also each contain their own individual observedGeneration field that clients can report and/or observe.
Because this feature has graduated to stable in v1.35, it is available for all workloads.
This work was done as part of KEP #5067 led by SIG Node.
Configurable NUMA node limit for topology manager
The topology manager historically used a hard-coded limit of 8 for the maximum number of NUMA nodes it can support, preventing state explosion during affinity calculation. (There's an important detail here; a NUMA node is not the same as a Node in the Kubernetes API.) This limit on the number of NUMA nodes prevented Kubernetes from fully utilizing modern high-end servers, which increasingly feature CPU architectures with more than 8 NUMA nodes.
Kubernetes v1.31 introduced a new, beta max-allowable-numa-nodes option to the topology manager policy configuration. In Kubernetes v1.35, that option is stable. Cluster administrators who enable it can use servers with more than 8 NUMA nodes.
Although the configuration option is stable, the Kubernetes community is aware of the poor performance for large NUMA hosts, and there is a proposed enhancement (KEP-5726) that aims to improve on it. You can learn more about this by reading Control Topology Management Policies on a node.
This work was done as part of KEP #4622 led by SIG Node.
New features in Beta
This is a selection of some of the improvements that are now beta following the v1.35 release.
Expose node topology labels via Downward API
Accessing node topology information, such as region and zone, from within a Pod has typically required querying the Kubernetes API server. While functional, this approach creates complexity and security risks by necessitating broad RBAC permissions or sidecar containers just to retrieve infrastructure metadata. Kubernetes v1.35 promotes the capability to expose node topology labels directly via the Downward API to beta.
The kubelet can now inject standard topology labels, such as topology.kubernetes.io/zone and topology.kubernetes.io/region, into Pods as environment variables or projected volume files. The primary benefit is a safer and more efficient way for workloads to be topology-aware. This allows applications to natively adapt to their availability zone or region without dependencies on the API server, strengthening security by upholding the principle of least privilege and simplifying cluster configuration.
Note: Kubernetes now injects available topology labels into every Pod so that they can be used as inputs to the downward API. With the v1.35 upgrade, most cluster administrators will see several new labels added to each Pod; this is expected as part of the design.
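As a sketch of how a workload might consume one of these injected labels via the downward API (assuming the zone label has been propagated onto the Pod; the names are illustrative):
apiVersion: v1
kind: Pod
metadata:
  name: topology-aware-example  # illustrative name
spec:
  containers:
  - name: app
    image: registry.k8s.io/e2e-test-images/agnhost:2.45
    command: [ "sh", "-c", "echo running in zone ${NODE_ZONE}; sleep 1h" ]
    env:
    - name: NODE_ZONE
      valueFrom:
        fieldRef:
          # Reads the topology label injected onto this Pod's metadata.
          fieldPath: metadata.labels['topology.kubernetes.io/zone']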
This work was done as part of KEP #4742 led by SIG Node.
Native support for storage version migration
In Kubernetes v1.35, the native support for storage version migration graduates to beta and is enabled by default. This move integrates the migration logic directly into the core Kubernetes control plane ("in-tree"), eliminating the dependency on external tools.
Historically, administrators relied on manual "read/write loops" (often piping kubectl get into kubectl replace) to update schemas or re-encrypt data at rest. This method was inefficient and prone to conflicts, especially for large resources like Secrets. With this release, the built-in controller automatically handles update conflicts and consistency tokens, providing a safe, streamlined, and reliable way to ensure stored data remains current with minimal operational overhead.
This work was done as part of KEP #4192 led by SIG API Machinery.
Mutable Volume attach limits
A CSI (Container Storage Interface) driver is a Kubernetes plugin that provides a consistent way for storage systems to be exposed to containerized workloads. The CSINode object records details about all CSI drivers installed on a node. However, a mismatch can arise between the reported and actual attachment capacity on nodes. When volume slots are consumed after a CSI driver starts up, the kube-scheduler may assign stateful pods to nodes without sufficient capacity, ultimately getting stuck in a ContainerCreating state.
Kubernetes v1.35 makes CSINode.spec.drivers[*].allocatable.count mutable so that a node's available volume attachment capacity can be updated dynamically. It also allows CSI drivers to control how frequently the allocatable.count value is updated on all nodes by introducing a configurable refresh interval, defined through the CSIDriver object. Additionally, it automatically updates CSINode.spec.drivers[*].allocatable.count on detecting a failure in volume attachment due to insufficient capacity. Although this feature graduated to beta in v1.34 with the MutableCSINodeAllocatableCount feature gate disabled by default, it remains in beta for v1.35 to allow time for feedback, with the feature gate now enabled by default.
This work was done as part of KEP #4876 led by SIG Storage.
Opportunistic batching
Historically, the Kubernetes scheduler processes pods sequentially with time complexity of O(num pods × num nodes), which can result in redundant computation for compatible pods. This KEP introduces an opportunistic batching mechanism that aims to improve performance by identifying such compatible Pods via Pod scheduling signature and batching them together, allowing shared filtering and scoring results across them.
The pod scheduling signature ensures that two pods with the same signature are "the same" from a scheduling perspective. It takes into account not only the pod and node attributes, but also the other pods in the system and global data about the pod placement. This means that any pod with the given signature will get the same scores/feasibility results from any arbitrary set of nodes.
The batching mechanism consists of two operations that can be invoked whenever needed: create and nominate. Create builds a new set of batch information from the scheduling results of Pods that have a valid signature. Nominate uses the batch information from create to set the nominated node name for a new Pod whose signature matches the canonical Pod's signature.
This work was done as part of KEP #5598 led by SIG Scheduling.
maxUnavailable for StatefulSets
A StatefulSet runs a group of Pods and maintains a sticky identity for each of those Pods. This is critical for stateful workloads requiring stable network identifiers or persistent storage. When a StatefulSet's .spec.updateStrategy.type is set to RollingUpdate, the StatefulSet controller will delete and recreate each Pod in the StatefulSet. It will proceed in the same order as Pod termination (from the largest ordinal to the smallest), updating each Pod one at a time.
Kubernetes v1.24 added a new alpha field to a StatefulSet's rollingUpdate configuration settings, called maxUnavailable. That field wasn't part of the Kubernetes API unless your cluster administrator explicitly opted in. In Kubernetes v1.35 that field is beta and is available by default. You can use it to define the maximum number of pods that can be unavailable during an update. This setting is most effective in combination with .spec.podManagementPolicy set to Parallel. You can set maxUnavailable as either a positive number (example: 2) or a percentage of the desired number of Pods (example: 10%). If this field is not specified, it will default to 1, to maintain the previous behavior of only updating one Pod at a time. This improvement allows stateful applications (that can tolerate more than one Pod being down) to finish updating faster.
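A minimal sketch of a StatefulSet using this field might look as follows; the names, image, and replica counts are illustrative:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web  # illustrative name
spec:
  serviceName: web
  replicas: 10
  podManagementPolicy: Parallel
  selector:
    matchLabels:
      app: web
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      # Allow up to 2 Pods to be unavailable at once during a rolling update.
      maxUnavailable: 2
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: registry.k8s.io/e2e-test-images/agnhost:2.45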
This work was done as part of KEP #961 led by SIG Apps.
Configurable credential plugin policy in kuberc
The optional kuberc file is a way to separate server configurations and cluster credentials from user preferences without disrupting already running CI pipelines with unexpected outputs.
As part of the v1.35 release, kuberc gains additional functionality which allows users to configure a credential plugin policy. This change introduces two fields: credentialPluginPolicy, which controls whether plugins are allowed, denied, or restricted to an allowlist, and credentialPluginAllowlist, which specifies the list of allowed plugins.
This work was done as part of KEP #3104 as a cooperation between SIG Auth and SIG CLI.
KYAML
YAML is a human-readable format of data serialization. In Kubernetes, YAML files are used to define and configure resources, such as Pods, Services, and Deployments. However, complex YAML is difficult to read. YAML's significant whitespace requires careful attention to indentation and nesting, while its optional string-quoting can lead to unexpected type coercion (see: The Norway Bug). While JSON is an alternative, it lacks support for comments and has strict requirements for trailing commas and quoted keys.
KYAML is a safer and less ambiguous subset of YAML designed specifically for Kubernetes. Introduced as an opt-in alpha feature in v1.34, this feature graduated to beta in Kubernetes v1.35 and has been enabled by default. It can be disabled by setting the environment variable KUBECTL_KYAML=false.
KYAML addresses challenges pertaining to both YAML and JSON. All KYAML files are also valid YAML files. This means you can write KYAML and pass it as an input to any version of kubectl. This also means that you don't need to write in strict KYAML for the input to be parsed.
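To give a rough idea of its shape, a ConfigMap in KYAML might look something like the following, with unquoted keys, quoted string values, and explicit braces instead of significant indentation; the exact rendering produced by kubectl may differ, so treat this as an approximation:
{
  apiVersion: "v1",
  kind: "ConfigMap",
  metadata: {
    name: "example",
  },
  data: {
    environment: "production",
  },
}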
This work was done as part of KEP #5295 led by SIG CLI.
Configurable tolerance for HorizontalPodAutoscalers
The Horizontal Pod Autoscaler (HPA) has historically relied on a fixed, global 10% tolerance for scaling actions. A drawback of this hardcoded value was that workloads requiring high sensitivity, such as those needing to scale on a 5% load increase, were often blocked from scaling, while others might oscillate unnecessarily.
With Kubernetes v1.35, the configurable tolerance feature graduates to beta and is enabled by default. This enhancement allows users to define a custom tolerance window on a per-resource basis within the HPA behavior field. By setting a specific tolerance (e.g., lowering it to 0.05 for 5%), operators gain precise control over autoscaling sensitivity, ensuring that critical workloads react quickly to small metric changes, without requiring cluster-wide configuration adjustments.
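A sketch of what this could look like, assuming the tolerance field sits under the HPA's behavior scaling rules as described in KEP-4951 (the placement and value format may evolve, so check the current documentation):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa  # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleUp:
      tolerance: "0.05"  # react to metric increases larger than 5%
    scaleDown:
      tolerance: "0.1"   # only scale down on decreases larger than 10%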
This work was done as part of KEP #4951 led by SIG Autoscaling.
Support for user namespaces in Pods
Kubernetes is adding support for user namespaces, allowing pods to run with isolated user and group ID mappings instead of sharing host IDs. This means containers can operate as root internally while actually being mapped to an unprivileged user on the host, reducing the risk of privilege escalation in the event of a compromise. The feature improves pod-level security and makes it safer to run workloads that need root inside the container. Over time, support has expanded to both stateless and stateful Pods through id-mapped mounts.
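Opting a Pod into a user namespace is a single field; a minimal sketch (assuming a node, runtime, and kernel that support user namespaces):
apiVersion: v1
kind: Pod
metadata:
  name: userns-example  # illustrative name
spec:
  # false means: run this Pod in its own user namespace instead of sharing the host's.
  hostUsers: false
  containers:
  - name: app
    image: registry.k8s.io/e2e-test-images/agnhost:2.45
    command: [ "sh", "-c", "id && sleep 1h" ]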
This work was done as part of KEP #127 led by SIG Node.
VolumeSource: OCI artifact and/or image
When creating a Pod, you often need to provide data, binaries, or configuration files for your containers. This meant including the content into the main container image or using a custom init container to download and unpack files into an emptyDir. Both these approaches are still valid. Kubernetes v1.31 added support for the image volume type allowing Pods to declaratively pull and unpack OCI container image artifacts into a volume. This lets you package and deliver data-only artifacts such as configs, binaries, or machine learning models using standard OCI registry tools.
With this feature, you can fully separate your data from your container image and remove the need for extra init containers or startup scripts. The image volume type has been in beta since v1.33 and is enabled by default in v1.35. Please note that using this feature requires a compatible container runtime, such as containerd v2.1 or later.
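For example, mounting an OCI artifact as a volume might look like the following sketch; the artifact reference is hypothetical:
apiVersion: v1
kind: Pod
metadata:
  name: image-volume-example  # illustrative name
spec:
  containers:
  - name: app
    image: registry.k8s.io/e2e-test-images/agnhost:2.45
    command: [ "sh", "-c", "ls /models && sleep 1h" ]
    volumeMounts:
    - name: model-data
      mountPath: /models
  volumes:
  - name: model-data
    image:
      # Hypothetical OCI artifact containing model files.
      reference: registry.example.com/models/sentiment:v1
      pullPolicy: IfNotPresent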
This work was done as part of KEP #4639 led by SIG Node.
Enforced kubelet credential verification for cached images
The imagePullPolicy: IfNotPresent setting currently allows a Pod to use a container image that is already cached on a node, even if the Pod itself does not possess the credentials to pull that image. A drawback of this behavior is that it creates a security vulnerability in multi-tenant clusters: if a Pod with valid credentials pulls a sensitive private image to a node, a subsequent unauthorized Pod on the same node can access that image simply by relying on the local cache.
This KEP introduces a mechanism where the kubelet enforces credential verification for cached images. Before allowing a Pod to use a locally cached image, the kubelet checks if the Pod has the valid credentials to pull it. This ensures that only authorized workloads can use private images, regardless of whether they are already present on the node, significantly hardening the security posture for shared clusters.
In Kubernetes v1.35, this feature has graduated to beta and is enabled by default. Users can still disable it by setting the KubeletEnsureSecretPulledImages feature gate to false. Additionally, the imagePullCredentialsVerificationPolicy flag allows operators to configure the desired security level, ranging from a mode that prioritizes backward compatibility to a strict enforcement mode that offers maximum security.
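A sketch of how this might be configured through the kubelet configuration file; the policy value shown is one of the stricter options described in KEP-2535, and the exact enum names are documented with the feature:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  KubeletEnsureSecretPulledImages: true  # beta, on by default in v1.35
imagePullCredentialsVerificationPolicy: AlwaysVerify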
This work was done as part of KEP #2535 led by SIG Node.
Fine-grained Container restart rules
Historically, the restartPolicy field was defined strictly at the Pod level, forcing the same behavior on all containers within a Pod. A drawback of this global setting was the lack of granularity for complex workloads, such as AI/ML training jobs. These often required restartPolicy: Never for the Pod to manage job completion, yet individual containers would benefit from in-place restarts for specific, retriable errors (like network glitches or GPU init failures).
Kubernetes v1.35 addresses this by enabling restartPolicy and restartPolicyRules within the container API itself. This allows users to define restart strategies for individual regular and init containers that operate independently of the Pod's overall policy. For example, a container can now be configured to restart automatically only if it exits with a specific error code, avoiding the expensive overhead of rescheduling the entire Pod for a transient failure.
In this release, the feature has graduated to beta and is enabled by default. Users can immediately leverage restartPolicyRules in their container specifications to optimize recovery times and resource utilization for long-running workloads, without altering the broader lifecycle logic of their Pods.
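A sketch of a container-level restart rule, following the shape described in KEP-5307 (the exit code and names are illustrative; verify the exact field names against the current API reference while the feature is in beta):
apiVersion: v1
kind: Pod
metadata:
  name: restart-rules-example  # illustrative name
spec:
  restartPolicy: Never           # the Pod as a whole is not retried
  containers:
  - name: trainer
    image: registry.k8s.io/e2e-test-images/agnhost:2.45
    command: [ "sh", "-c", "exit 42" ]
    restartPolicy: Never         # container-level policy, overridden by the rule below
    restartPolicyRules:
    - action: Restart
      exitCodes:
        operator: In
        values: [42]             # restart in place only for this retriable exit code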
This work was done as part of KEP #5307 led by SIG Node.
CSI driver opt-in for service account tokens via secrets field
Providing ServiceAccount tokens to Container Storage Interface (CSI) drivers has traditionally relied on injecting them into the volume_context field. This approach presents a significant security risk because volume_context is intended for non-sensitive configuration data and is frequently logged in plain text by drivers and debugging tools, potentially leaking credentials.
Kubernetes v1.35 introduces an opt-in mechanism for CSI drivers to receive ServiceAccount tokens via the dedicated secrets field in the NodePublishVolume request. Drivers can now enable this behavior by setting the serviceAccountTokenInSecrets field to true in their CSIDriver object, instructing the kubelet to populate the token securely.
The primary benefit is the prevention of accidental credential exposure in logs and error messages. This change ensures that sensitive workload identities are handled via the appropriate secure channels, aligning with best practices for secret management while maintaining backward compatibility for existing drivers.
This work was done as part of KEP #5538 led by SIG Auth in cooperation with SIG Storage.
Deployment status: count of terminating replicas
Historically, the Deployment status provided details on available and updated replicas but lacked explicit visibility into Pods that were in the process of shutting down. A drawback of this omission was that users and controllers could not easily distinguish between a stable Deployment and one that still had Pods executing cleanup tasks or adhering to long grace periods.
Kubernetes v1.35 promotes the terminatingReplicas field within the Deployment status to beta. This field provides a count of Pods that have a deletion timestamp set but have not yet been removed from the system. This feature is a foundational step in a larger initiative to improve how Deployments handle Pod replacement, laying the groundwork for future policies regarding when to create new Pods during a rollout.
The primary benefit is improved observability for lifecycle management tools and operators. By exposing the number of terminating Pods, external systems can now make more informed decisions such as waiting for a complete shutdown before proceeding with subsequent tasks without needing to manually query and filter individual Pod lists.
This work was done as part of KEP #3973 led by SIG Apps.
New features in Alpha
This is a selection of some of the improvements that are now alpha following the v1.35 release.
Gang scheduling support in Kubernetes
Scheduling interdependent workloads, such as AI/ML training jobs or HPC simulations, has traditionally been challenging because the default Kubernetes scheduler places Pods individually. This often leads to partial scheduling where some Pods start while others wait indefinitely for resources, resulting in deadlocks and wasted cluster capacity.
Kubernetes v1.35 introduces native support for so-called "gang scheduling" via the new Workload API and PodGroup concept. This feature implements an "all-or-nothing" scheduling strategy, ensuring that a defined group of Pods is scheduled only if the cluster has sufficient resources to accommodate the entire group simultaneously.
The primary benefit is improved reliability and efficiency for batch and parallel workloads. By preventing partial deployments, it eliminates resource deadlocks and ensures that expensive cluster capacity is utilized only when a complete job can run, significantly optimizing the orchestration of large-scale data processing tasks.
This work was done as part of KEP #4671 led by SIG Scheduling.
Constrained impersonation
Historically, the impersonate verb in Kubernetes RBAC functioned on an all-or-nothing basis: once a user was authorized to impersonate a target identity, they gained all associated permissions. A drawback of this broad authorization was that it violated the principle of least privilege, preventing administrators from restricting impersonators to specific actions or resources.
Kubernetes v1.35 introduces a new alpha feature, constrained impersonation, which adds a secondary authorization check to the impersonation flow. When enabled via the ConstrainedImpersonation feature gate, the API server verifies not only the basic impersonate permission but also checks whether the impersonator is authorized for the specific action using new verb prefixes (e.g., impersonate-on:<mode>:<verb>). This allows administrators to define fine-grained policies, such as permitting a support engineer to impersonate a cluster admin solely to view logs, without granting full administrative access.
This work was done as part of KEP #5284 led by SIG Auth.
Flagz for Kubernetes components
Verifying the runtime configuration of Kubernetes components, such as the API server or kubelet, has traditionally required privileged access to the host node or process arguments. To address this, the /flagz endpoint was introduced to expose command-line options via HTTP. However, its output was initially limited to plain text, making it difficult for automated tools to parse and validate configurations reliably.
In Kubernetes v1.35, the /flagz endpoint has been enhanced to support structured, machine-readable JSON output. Authorized users can now request a versioned JSON response using standard HTTP content negotiation, while the original plain text format remains available for human inspection. This update significantly improves observability and compliance workflows, allowing external systems to programmatically audit component configurations without fragile text parsing or direct infrastructure access.
This work was done as part of KEP #4828 led by SIG Instrumentation.
Statusz for Kubernetes components
Troubleshooting Kubernetes components like the kube-apiserver or kubelet has traditionally involved parsing unstructured logs or text output, which is brittle and difficult to automate. While a basic /statusz endpoint existed previously, it lacked a standardized, machine-readable format, limiting its utility for external monitoring systems.
In Kubernetes v1.35, the /statusz endpoint has been enhanced to support structured, machine-readable JSON output. Authorized users can now request this format using standard HTTP content negotiation to retrieve precise status data (such as version information and health indicators) without relying on fragile text parsing. This improvement provides a reliable, consistent interface for automated debugging and observability tools across all core components.
This work was done as part of KEP #4827 led by SIG Instrumentation.
CCM: watch-based route controller reconciliation using informers
Managing network routes within cloud environments has traditionally relied on the Cloud Controller Manager (CCM) periodically polling the cloud provider's API to verify and update route tables. This fixed-interval reconciliation approach can be inefficient, often generating a high volume of unnecessary API calls and introducing latency between a node state change and the corresponding route update.
For the Kubernetes v1.35 release, the cloud-controller-manager library introduces a watch-based reconciliation strategy for the route controller. Instead of relying on a timer, the controller now uses informers to watch for specific Node events, such as additions, deletions, or relevant field updates, and triggers route synchronization only when a change actually occurs.
The primary benefit is a significant reduction in cloud provider API usage, which lowers the risk of hitting rate limits and reduces operational overhead. Additionally, this event-driven model improves the responsiveness of the cluster's networking layer by ensuring that route tables are updated immediately following changes in cluster topology.
This work was done as part of KEP #5237 led by SIG Cloud Provider.
Extended toleration operators for threshold-based placement
Kubernetes v1.35 introduces SLA-aware scheduling by enabling workloads to express reliability requirements. The feature adds numeric comparison operators to tolerations, allowing pods to match or avoid nodes based on SLA-oriented taints such as service guarantees or fault-domain quality.
The primary benefit is more precise placement by the scheduler. Critical workloads can demand higher-SLA nodes, while lower-priority workloads can opt into lower-SLA ones. This improves utilization and reduces cost without compromising reliability.
This work was done as part of KEP #5471 led by SIG Scheduling.
Mutable container resources when Job is suspended
Running batch workloads often involves trial and error with resource limits. Currently, the Job specification is immutable, meaning that if a Job fails due to an Out of Memory (OOM) error or insufficient CPU, the user cannot simply adjust the resources; they must delete the Job and create a new one, losing the execution history and status.
Kubernetes v1.35 introduces the capability to update resource requests and limits for Jobs that are in a suspended state. Enabled via the MutableJobPodResourcesForSuspendedJobs feature gate, this enhancement allows users to pause a failing Job, modify its Pod template with appropriate resource values, and then resume execution with the corrected configuration.
The primary benefit is a smoother recovery workflow for misconfigured jobs. By allowing in-place corrections during suspension, users can resolve resource bottlenecks without disrupting the Job's lifecycle identity or losing track of its completion status, significantly improving the developer experience for batch processing.
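A hedged sketch of that workflow with kubectl follows; the Job name, container index, and memory value are illustrative, and the MutableJobPodResourcesForSuspendedJobs feature gate must be enabled.
# pause the Job
kubectl patch job my-batch-job --type=merge -p '{"spec":{"suspend":true}}'
# raise the memory request of the first container in the Pod template
kubectl patch job my-batch-job --type=json \
  -p '[{"op":"replace","path":"/spec/template/spec/containers/0/resources/requests/memory","value":"2Gi"}]'
# resume with the corrected configuration
kubectl patch job my-batch-job --type=merge -p '{"spec":{"suspend":false}}'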
This work was done as part of KEP #5440 led by SIG Apps.
Other notable changes
Continued innovation in Dynamic Resource Allocation (DRA)
The core DRA functionality graduated to stable in v1.34, though it could still be switched off; in v1.35 it is always enabled. Several alpha features have also been significantly improved and are ready for testing. We encourage users to provide feedback on these capabilities to help clear the path for their promotion to beta in upcoming releases.
Extended Resource Requests via DRA
Several functional gaps compared to Extended Resource requests via Device Plugins were addressed, for example scoring and reuse of devices in init containers.
Device Taints and Tolerations
The new "None" effect can be used to report a problem without immediately affecting scheduling or running pod. DeviceTaintRule now provides status information about an ongoing eviction. The "None" effect can be used for a "dry run" before actually evicting pods:
- Create DeviceTaintRule with "effect: None".
- Check the status to see how many pods would be evicted.
- Replace "effect: None" with "effect: NoExecute".
Partitionable Devices
Devices belonging to the same partitionable device may now be defined in different ResourceSlices. You can read more in the official documentation.
Consumable Capacity, Device Binding Conditions
Several bugs were fixed and/or more tests added. You can learn more about Consumable Capacity and Binding Conditions in the official documentation.
Comparable resource version semantics
Kubernetes v1.35 changes the way that clients are allowed to interpret resource versions.
Before v1.35, the only supported comparison clients could make was a check for string equality: if two resource version strings were equal, they referred to the same version of the object. Clients could also provide a resource version to the API server and ask the control plane to do internal comparisons, such as streaming all events since a particular resource version.
In v1.35, all in-tree resource versions meet a new, stricter definition: the values are a special form of decimal number. Because they can be compared, clients can now do their own comparisons between two different resource versions. For example, a client reconnecting after a crash can detect that it has missed updates, as distinct from the case where an update happened but no changes were lost in the meantime.
This change in semantics enables other important use cases such as storage version migration, performance improvements to informers (a client helper concept), and controller reliability. All of those cases require knowing whether one resource version is newer than another.
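A minimal illustration of a client-side comparison, assuming a hypothetical ConfigMap named my-config; under the new semantics the two values can be compared as decimal numbers.
RV1=$(kubectl get configmap my-config -o jsonpath='{.metadata.resourceVersion}')
# ... reconnect later ...
RV2=$(kubectl get configmap my-config -o jsonpath='{.metadata.resourceVersion}')
# a larger value means the object changed in the meantime
[ "$RV2" -gt "$RV1" ] && echo "missed at least one update"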
This work was done as part of KEP #5504 led by SIG API Machinery.
Graduations, deprecations, and removals in v1.35
Graduations to stable
This lists all the features that graduated to stable (also known as general availability). For a full list of updates including new features and graduations from alpha to beta, see the release notes.
This release includes a total of 15 enhancements promoted to stable:
- Add CPUManager policy option to restrict reservedSystemCPUs to system daemons and interrupt processing
- Pod Generation
- Invariant Testing
- In-Place Update of Pod Resources
- Fine-grained SupplementalGroups control
- Add support for a drop-in kubelet configuration directory
- Remove gogo protobuf dependency for Kubernetes API types
- kubelet image GC after a maximum age
- Kubelet limit of Parallel Image Pulls
- Add a TopologyManager policy option for MaxAllowableNUMANodes
- Include kubectl command metadata in http request headers
- PreferSameNode Traffic Distribution (formerly PreferLocal traffic policy / Node-level topology)
- Job API managed-by mechanism
- Transition from SPDY to WebSockets
Deprecations, removals and community updates
As Kubernetes develops and matures, features may be deprecated, removed, or replaced with better ones to improve the project's overall health. See the Kubernetes deprecation and removal policy for more details on this process. Kubernetes v1.35 includes a couple of deprecations.
Ingress NGINX retirement
For years, the Ingress NGINX controller has been a popular choice for routing traffic into Kubernetes clusters. It was flexible, widely adopted, and served as the standard entry point for countless applications.
However, maintaining the project has become unsustainable. With a severe shortage of maintainers and mounting technical debt, the community recently made the difficult decision to retire it. This isn't strictly part of the v1.35 release, but it's such an important change that we wanted to highlight it here.
Consequently, the Kubernetes project announced that Ingress NGINX will receive only best-effort maintenance until March 2026. After this date, it will be archived with no further updates. The recommended path forward is to migrate to the Gateway API, which offers a more modern, secure, and extensible standard for traffic management.
You can find more in the official blog post.
Removal of cgroup v1 support
When it comes to managing resources on Linux nodes, Kubernetes has historically relied on cgroups (control groups). While the original cgroup v1 was functional, it was often inconsistent and limited. That is why Kubernetes introduced support for cgroup v2 back in v1.25, offering a much cleaner, unified hierarchy and better resource isolation.
Because cgroup v2 is now the modern standard, Kubernetes is ready to retire the legacy cgroup v1 support in v1.35. This is an important notice for cluster administrators: if you are still running nodes on older Linux distributions that don't support cgroup v2, your kubelet will fail to start. To avoid downtime, you will need to migrate those nodes to systems where cgroup v2 is enabled.
To learn more, read about cgroup v2;
you can also track the switchover work via KEP-5573: Remove cgroup v1 support.
Deprecation of ipvs mode in kube-proxy
Years ago, Kubernetes adopted the ipvs mode in kube-proxy to provide faster load balancing than the standard iptables mode. While it offered a performance boost, keeping it in sync with evolving networking requirements created too much technical debt and complexity.
Because of this maintenance burden, Kubernetes v1.35 deprecates ipvs mode. Although the mode remains available in this release, kube-proxy will now emit a warning on startup when configured to use it. The goal is to streamline the codebase and focus on modern standards. For Linux nodes, you should begin transitioning to nftables, which is now the recommended replacement.
You can find more in KEP-5495: Deprecate ipvs mode in kube-proxy.
Final call for containerd v1.X
While Kubernetes v1.35 still supports containerd 1.7 and other LTS releases, this is the final version with such support. The SIG Node community has designated v1.35 as the last release to support the containerd v1.X series.
This serves as an important reminder: before upgrading to the next Kubernetes version, you must switch to containerd 2.0 or later. To help identify which nodes need attention, you can monitor the kubelet_cri_losing_support metric within your cluster.
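One quick way to see which runtime and version each node reports, using standard Node status fields:
kubectl get nodes -o custom-columns=NAME:.metadata.name,RUNTIME:.status.nodeInfo.containerRuntimeVersion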
You can find more in the official blog post or in KEP-4033: Discover cgroup driver from CRI.
Improved Pod stability during kubelet restarts
Previously, restarting the kubelet service often caused a temporary disruption in Pod status. During a restart, the kubelet would reset container states, causing healthy Pods to be marked as NotReady and removed from load balancers, even if the application itself was still running correctly.
To address this reliability issue, this behavior has been corrected to ensure seamless node maintenance. The kubelet now properly restores the state of existing containers from the runtime upon startup. This ensures that your workloads remain Ready and traffic continues to flow uninterrupted during kubelet restarts or upgrades.
You can find more in KEP-4781: Fix inconsistent container ready state after kubelet restart.
Release notes
Check out the full details of the Kubernetes v1.35 release in our release notes.
Availability
Kubernetes v1.35 is available for download on GitHub or on the Kubernetes download page.
To get started with Kubernetes, check out these interactive tutorials or run local Kubernetes clusters using minikube. You can also easily install v1.35 using kubeadm.
Release team
Kubernetes is only possible with the support, commitment, and hard work of its community. Each release team is made up of dedicated community volunteers who work together to build the many pieces that make up the Kubernetes releases you rely on. This requires the specialized skills of people from all corners of our community, from the code itself to its documentation and project management.
We honor the memory of Han Kang, a long-time contributor and respected engineer whose technical excellence and infectious enthusiasm left a lasting impact on the Kubernetes community. Han was a significant force within SIG Instrumentation and SIG API Machinery, earning a 2021 Kubernetes Contributor Award for his critical work and sustained commitment to the project's core stability. Beyond his technical contributions, Han was deeply admired for his generosity as a mentor and his passion for building connections among people. He was known for "opening doors" for others, whether guiding new contributors through their first pull requests or supporting colleagues with patience and kindness. Han's legacy lives on through the engineers he inspired, the robust systems he helped build, and the warm, collaborative spirit he fostered within the cloud native ecosystem.
We would like to thank the entire Release Team for the hours spent hard at work to deliver the Kubernetes v1.35 release to our community. The Release Team's membership ranges from first-time shadows to returning team leads with experience forged over several release cycles. We are incredibly grateful to our Release Lead, Drew Hagen, whose hands-on guidance and vibrant energy not only navigated us through complex challenges but also fueled the community spirit behind this successful release.
Project velocity
The CNCF K8s DevStats project aggregates a number of interesting data points related to the velocity of Kubernetes and various sub-projects. This includes everything from individual contributions to the number of companies that are contributing and is an illustration of the depth and breadth of effort that goes into evolving this ecosystem.
During the v1.35 release cycle, which spanned 14 weeks from 15th September 2025 to 17th December 2025, Kubernetes received contributions from as many as 85 different companies and 419 individuals. In the wider cloud native ecosystem, the figure goes up to 281 companies, counting 1769 total contributors.
Note that "contribution" counts when someone makes a commit, code review, comment, creates an issue or PR, reviews a PR (including blogs and documentation) or comments on issues and PRs.
If you are interested in contributing, visit Getting Started on our contributor website.
Events update
Explore upcoming Kubernetes and cloud native events, including KubeCon + CloudNativeCon, KCD, and other notable conferences worldwide. Stay informed and get involved with the Kubernetes community!
February 2026
- KCD - Kubernetes Community Days: New Delhi: Feb 21, 2026 | New Delhi, India
- KCD - Kubernetes Community Days: Guadalajara: Feb 23, 2026 | Guadalajara, Mexico
March 2026
- KubeCon + CloudNativeCon Europe 2026: Mar 23-26, 2026 | Amsterdam, Netherlands
May 2026
- KCD - Kubernetes Community Days: Toronto: May 13, 2026 | Toronto, Canada
- KCD - Kubernetes Community Days: Helsinki: May 20, 2026 | Helsinki, Finland
June 2026
- KubeCon + CloudNativeCon India 2026: Jun 18-19, 2026 | Mumbai, India
- KCD - Kubernetes Community Days: Kuala Lumpur: Jun 27, 2026 | Kuala Lumpur, Malaysia
July 2026
- KubeCon + CloudNativeCon Japan 2026: Jul 29-30, 2026 | Yokohama, Japan
You can find the latest event details here.
Upcoming release webinar
Join members of the Kubernetes v1.35 Release Team on Wednesday, January 14, 2026, at 5:00 PM (UTC) to learn about the highlights of this release. For more information and registration, visit the event page on the CNCF Online Programs site.
Get involved
The simplest way to get involved with Kubernetes is by joining one of the many Special Interest Groups (SIGs) that align with your interests. Have something you'd like to broadcast to the Kubernetes community? Share your voice at our weekly community meeting, and through the channels below. Thank you for your continued feedback and support.
- Follow us on Bluesky @Kubernetesio for the latest updates
- Join the community discussion on Discuss
- Join the community on Slack
- Post questions (or answer questions) on Stack Overflow
- Share your Kubernetes story
- Read more about what's happening with Kubernetes on the blog
- Learn more about the Kubernetes Release Team
17 Dec 2025 6:30pm GMT
26 Nov 2025
Kubernetes Blog
Kubernetes v1.35 Sneak Peek
As the release of Kubernetes v1.35 approaches, the Kubernetes project continues to evolve. Features may be deprecated, removed, or replaced to improve the project's overall health. This blog post outlines planned changes for the v1.35 release that the release team believes you should be aware of to ensure the continued smooth operation of your Kubernetes cluster(s), and to keep you up to date with the latest developments. The information below is based on the current status of the v1.35 release and is subject to change before the final release date.
Deprecations and removals for Kubernetes v1.35
cgroup v1 support
On Linux nodes, container runtimes typically rely on cgroups (short for "control groups"). Support for using cgroup v2 has been stable in Kubernetes since v1.25, providing an alternative to the original v1 cgroup support. While cgroup v1 provided the initial resource control mechanism, it suffered from well-known inconsistencies and limitations. Adding support for cgroup v2 allowed use of a unified control group hierarchy, improved resource isolation, and served as the foundation for modern features, making legacy cgroup v1 support ready for removal. The removal of cgroup v1 support will only impact cluster administrators running nodes on older Linux distributions that do not support cgroup v2; on those nodes, the kubelet will fail to start. Administrators must migrate their nodes to systems with cgroup v2 enabled. More details on compatibility requirements will be available in a blog post soon after the v1.35 release.
To learn more, read about cgroup v2;
you can also track the switchover work via KEP-5573: Remove cgroup v1 support.
Deprecation of ipvs mode in kube-proxy
Many releases ago, the Kubernetes project implemented an ipvs mode in kube-proxy. It was adopted as a way to provide high-performance service load balancing, with better performance than the existing iptables mode. However, maintaining feature parity between ipvs and other kube-proxy modes became difficult, due to technical complexity and diverging requirements. This created significant technical debt and made the ipvs backend impractical to support alongside newer networking capabilities.
The Kubernetes project intends to deprecate kube-proxy ipvs mode in the v1.35 release, to streamline the kube-proxy codebase. For Linux nodes, the recommended kube-proxy mode is already nftables.
You can find more in KEP-5495: Deprecate ipvs mode in kube-proxy
Kubernetes is deprecating containerd v1.y support
While Kubernetes v1.35 still supports containerd 1.7 and other LTS releases of containerd, as a consequence of automated cgroup driver detection, the Kubernetes SIG Node community has formally agreed upon a final support timeline for containerd v1.X. Kubernetes v1.35 is the last release to offer this support (aligned with containerd 1.7 EOL).
This is a final warning that if you are using containerd 1.X, you must switch to 2.0 or later before upgrading Kubernetes to the next version. You are able to monitor the kubelet_cri_losing_support metric to determine if any nodes in your cluster are using a containerd version that will soon be unsupported.
You can find more in the official blog post or in KEP-4033: Discover cgroup driver from CRI
Featured enhancements of Kubernetes v1.35
The following enhancements are some of those likely to be included in the v1.35 release. This is not a commitment, and the release content is subject to change.
Node declared features
When scheduling Pods, Kubernetes uses node labels, taints, and tolerations to match workload requirements with node capabilities. However, managing feature compatibility becomes challenging during cluster upgrades due to version skew between the control plane and nodes. This can lead to Pods being scheduled on nodes that lack required features, resulting in runtime failures.
The node declared features framework will introduce a standard mechanism for nodes to declare their supported Kubernetes features. With the new alpha feature enabled, a Node reports the features it can support, publishing this information to the control plane through a new .status.declaredFeatures field. Then, the kube-scheduler, admission controllers and third-party components can use these declarations. For example, you can enforce scheduling and API validation constraints, ensuring that Pods run only on compatible nodes.
This approach reduces manual node labeling, improves scheduling accuracy, and prevents incompatible pod placements proactively. It also integrates with the Cluster Autoscaler for informed scale-up decisions. Feature declarations are temporary and tied to Kubernetes feature gates, enabling safe rollout and cleanup.
Targeting alpha in v1.35, node declared features aims to solve version skew scheduling issues by making node capabilities explicit, enhancing reliability and cluster stability in heterogeneous version environments.
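With the alpha feature enabled, the declarations could be inspected roughly like this; the field name comes from the description above, the node name is illustrative, and the exact output shape may differ in the alpha API.
kubectl get node my-node -o jsonpath='{.status.declaredFeatures}'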
To learn more about this before the official documentation is published, you can read KEP-5328.
In-place update of Pod resources
Kubernetes is graduating in-place updates of Pod resources to General Availability (GA). This feature allows users to adjust cpu and memory resources without restarting Pods or containers. Before this capability existed, such modifications required recreating Pods, which could disrupt workloads, particularly for stateful or batch applications; recent Kubernetes releases already offered in-place changes to resource settings (requests and limits) as a beta feature. This allows for smoother vertical scaling, improves efficiency, and can also simplify solution development.
The Container Runtime Interface (CRI) has also been improved, extending the UpdateContainerResources API for Windows and future runtimes while allowing ContainerStatus to report real-time resource configurations. Together, these changes make scaling in Kubernetes faster, more flexible, and disruption-free. The feature was introduced as alpha in v1.27, graduated to beta in v1.33, and is targeting graduation to stable in v1.35.
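One way to try an in-place resize is via the Pod's resize subresource; the Pod name, container name, and values below are illustrative.
kubectl patch pod my-app --subresource resize --patch \
  '{"spec":{"containers":[{"name":"app","resources":{"requests":{"cpu":"800m"},"limits":{"cpu":"800m"}}}]}}'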
You can find more in KEP-1287: In-place Update of Pod Resources
Pod certificates
When running microservices, Pods often require a strong cryptographic identity to authenticate with each other using mutual TLS (mTLS). While Kubernetes provides Service Account tokens, these are designed for authenticating to the API server, not for general-purpose workload identity.
Before this enhancement, operators had to rely on complex, external projects like SPIFFE/SPIRE or cert-manager to provision and rotate certificates for their workloads. But what if you could issue a unique, short-lived certificate to your Pods natively and automatically? KEP-4317 is designed to enable such native workload identity. It opens up various possibilities for securing pod-to-pod communication by allowing the kubelet to request and mount certificates for a Pod via a projected volume.
This provides a built-in mechanism for workload identity, complete with automated certificate rotation, significantly simplifying the setup of service meshes and other zero-trust network policies. This feature was introduced as alpha in v1.34 and is targeting beta in v1.35.
You can find more in KEP-4317: Pod Certificates
Numeric values for taints
Kubernetes is enhancing taints and tolerations by adding numeric comparison operators, such as Gt (Greater Than) and Lt (Less Than).
Previously, tolerations supported only exact (Equal) or existence (Exists) matches, which were not suitable for numeric properties such as reliability SLAs.
With this change, a Pod can use a toleration to "opt-in" to nodes that meet a specific numeric threshold. For example, a Pod can require a Node with an SLA taint value greater than 950 (operator: Gt, value: "950").
This approach is more powerful than Node Affinity because it supports the NoExecute effect, allowing Pods to be automatically evicted if a node's numeric value drops below the tolerated threshold.
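Expressed as a toleration in a Pod spec, that example might look like the following; the taint key is illustrative, while the operator and value come from the description above.
tolerations:
- key: example.com/sla
  operator: Gt
  value: "950"
  effect: NoExecute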
You can find more in KEP-5471: Enable SLA-based Scheduling
User namespaces
When running Pods, you can use securityContext to drop privileges, but containers inside the pod often still run as root (UID 0). This simplicity poses a significant challenge, as that container UID 0 maps directly to the host's root user.
Before this enhancement, a container breakout vulnerability could grant an attacker full root access to the node. But what if you could dynamically remap the container's root user to a safe, unprivileged user on the host? KEP-127 specifically allows such native support for Linux User Namespaces. It opens up various possibilities for pod security by isolating container and host user/group IDs. This allows a process to have root privileges (UID 0) within its namespace, while running as a non-privileged, high-numbered UID on the host.
Released as alpha in v1.25 and beta in v1.30, this feature continues to progress through beta maturity, paving the way for truly "rootless" containers that drastically reduce the attack surface for a whole class of security vulnerabilities.
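Opting a Pod into a user namespace is a one-line change to the spec; everything else in this sketch is an ordinary, illustrative Pod definition.
apiVersion: v1
kind: Pod
metadata:
  name: userns-demo              # illustrative name
spec:
  hostUsers: false               # run the Pod's containers in a user namespace
  containers:
  - name: app
    image: registry.k8s.io/pause:3.10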
You can find more in KEP-127: User Namespaces
Support for mounting OCI images as volumes
When provisioning a Pod, you often need to bundle data, binaries, or configuration files for your containers. Before this enhancement, people often included that kind of data directly into the main container image, or required a custom init container to download and unpack files into an emptyDir. You can still take either of those approaches, of course.
But what if you could populate a volume directly from a data-only artifact in an OCI registry, just like pulling a container image? Kubernetes v1.31 added support for the image volume type, allowing Pods to pull and unpack OCI container image artifacts into a volume declaratively.
This allows for seamless distribution of data, binaries, or ML models using standard registry tooling, completely decoupling data from the container image and eliminating the need for complex init containers or startup scripts. This volume type has been in beta since v1.33 and will likely be enabled by default in v1.35.
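A sketch of what that could look like; the artifact reference and names are hypothetical, and the volume uses the image volume source with its reference and pullPolicy fields.
apiVersion: v1
kind: Pod
metadata:
  name: image-volume-demo        # illustrative name
spec:
  containers:
  - name: app
    image: busybox:1.36
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: models
      mountPath: /models
      readOnly: true
  volumes:
  - name: models
    image:
      reference: registry.example.com/ml/models:v1   # hypothetical OCI artifact
      pullPolicy: IfNotPresent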
You can try out the beta version of image volumes, or you can learn more about the plans from KEP-4639: OCI Volume Source.
Want to know more?
New features and deprecations are also announced in the Kubernetes release notes. We will formally announce what's new in Kubernetes v1.35 as part of the CHANGELOG for that release.
The Kubernetes v1.35 release is planned for December 17, 2025. Stay tuned for updates!
You can also see the announcements of changes in the release notes for earlier Kubernetes releases.
Get involved
The simplest way to get involved with Kubernetes is by joining one of the many Special Interest Groups (SIGs) that align with your interests. Have something you'd like to broadcast to the Kubernetes community? Share your voice at our weekly community meeting, and through the channels below. Thank you for your continued feedback and support.
- Follow us on Bluesky @kubernetes.io for the latest updates
- Join the community discussion on Discuss
- Join the community on Slack
- Post questions (or answer questions) on Server Fault or Stack Overflow
- Share your Kubernetes story
- Read more about what's happening with Kubernetes on the blog
- Learn more about the Kubernetes Release Team
26 Nov 2025 12:00am GMT
25 Nov 2025
Kubernetes Blog
Kubernetes Configuration Good Practices
Configuration is one of those things in Kubernetes that seems small until it's not. It's at the heart of every Kubernetes workload, and a missing quote, a wrong API version, or a misplaced YAML indent can ruin your entire deployment.
This blog brings together tried-and-tested configuration best practices: the small habits that make your Kubernetes setup clean, consistent, and easier to manage. Whether you are just starting out or already deploying apps daily, these are the little things that keep your cluster stable and your future self sane.
This blog is inspired by the original Configuration Best Practices page, which has evolved through contributions from many members of the Kubernetes community.
General configuration practices
Use the latest stable API version
Kubernetes evolves fast. Older APIs eventually get deprecated and stop working. So, whenever you are defining resources, make sure you are using the latest stable API version. You can always check with
kubectl api-resources
This simple step saves you from future compatibility issues.
Store configuration in version control
Never apply manifest files directly from your desktop. Always keep them in a version control system like Git; it's your safety net. If something breaks, you can instantly roll back to a previous commit, compare changes, or recreate your cluster setup without panic.
Write configs in YAML, not JSON
Write your configuration files using YAML rather than JSON. Both work technically, but YAML is just easier for humans: it's cleaner to read, less noisy, and more widely used in the community.
YAML has some sneaky gotchas with boolean values: use only true or false. Don't write yes, no, on, or off; they might work in one version of YAML but break in another. To be safe, quote anything that looks like a boolean (for example "yes").
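For example, with illustrative keys:
enabled: true        # good
# enabled: yes       # avoid: parsed inconsistently across YAML versions
environment: "yes"   # quote values that merely look like booleans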
Keep configuration simple and minimal
Avoid setting default values that are already handled by Kubernetes. Minimal manifests are easier to debug, cleaner to review and less likely to break things later.
Group related objects together
If your Deployment, Service and ConfigMap all belong to one app, put them in a single manifest file.
It's easier to track changes and apply them as a unit. See the Guestbook all-in-one.yaml file for an example of this syntax.
You can even apply entire directories with:
kubectl apply -f configs/
One command and, boom, everything in that folder gets deployed.
Add helpful annotations
Manifest files are not just for machines, they are for humans too. Use annotations to describe why something exists or what it does. A quick one-liner can save hours when debugging later and also allows better collaboration.
The most helpful annotation to set is kubernetes.io/description. It's like using a comment, except that it gets copied into the API so that everyone else can see it even after you deploy.
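For example, on any object's metadata (the name and description text are illustrative):
metadata:
  name: web-config
  annotations:
    kubernetes.io/description: "Feature flags for the web frontend; ask team-web before changing"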
Managing Workloads: Pods, Deployments, and Jobs
A common early mistake in Kubernetes is creating Pods directly. Pods work, but they don't reschedule themselves if something goes wrong.
Naked Pods (Pods not managed by a controller such as a Deployment or a StatefulSet) are fine for testing, but in real setups they are risky.
Why? Because if the node hosting that Pod dies, the Pod dies with it and Kubernetes won't bring it back automatically.
Use Deployments for apps that should always be running
A Deployment, which both creates a ReplicaSet to ensure that the desired number of Pods is always available, and specifies a strategy to replace Pods (such as RollingUpdate), is almost always preferable to creating Pods directly. You can roll out a new version, and if something breaks, roll back instantly.
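A minimal Deployment for illustration; the names and image are placeholders:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app.kubernetes.io/name: web
  template:
    metadata:
      labels:
        app.kubernetes.io/name: web
    spec:
      containers:
      - name: web
        image: nginx:1.27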
Use Jobs for tasks that should finish
A Job is perfect when you need something to run once and then stop, like a database migration or a batch processing task. It will retry if the Pod fails and report success when it's done.
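A minimal Job for illustration; the name and command are placeholders:
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migrate
spec:
  backoffLimit: 3
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: migrate
        image: busybox:1.36
        command: ["sh", "-c", "echo running migration && sleep 5"]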
Service Configuration and Networking
Services are how your workloads talk to each other inside (and sometimes outside) your cluster. Without them, your pods exist but can't reach anyone. Let's make sure that doesn't happen.
Create Services before workloads that use them
When Kubernetes starts a Pod, it automatically injects environment variables for existing Services. So, if a Pod depends on a Service, create a Service before its corresponding backend workloads (Deployments or StatefulSets), and before any workloads that need to access it.
For example, if a Service named foo exists, all containers will get the following variables in their initial environment:
FOO_SERVICE_HOST=<the host the Service runs on>
FOO_SERVICE_PORT=<the port the Service runs on>
DNS-based discovery doesn't have this problem, but it's a good habit to follow anyway.
Use DNS for Service discovery
If your cluster has the DNS add-on (most do), every Service automatically gets a DNS entry. That means you can access it by name instead of IP:
curl http://my-service.default.svc.cluster.local
It's one of those features that makes Kubernetes networking feel magical.
Avoid hostPort and hostNetwork unless absolutely necessary
You'll sometimes see these options in manifests:
hostPort: 8080
hostNetwork: true
But here's the thing: they tie your Pods to specific nodes, making them harder to schedule and scale, because each <hostIP, hostPort, protocol> combination must be unique. If you don't specify the hostIP and protocol explicitly, Kubernetes will use 0.0.0.0 as the default hostIP and TCP as the default protocol. Unless you're debugging or building something like a network plugin, avoid them.
If you just need local access for testing, try kubectl port-forward:
kubectl port-forward deployment/web 8080:80
See Use Port Forwarding to access applications in a cluster to learn more. Or if you really need external access, use a type: NodePort Service. That's the safer, Kubernetes-native way.
Use headless Services for internal discovery
Sometimes, you don't want Kubernetes to load balance traffic. You want to talk directly to each Pod. That's where headless Services come in.
You create one by setting clusterIP: None. Instead of a single IP, DNS gives you the list of all Pod IPs, which is perfect for apps that manage connections themselves.
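A minimal headless Service for illustration; the names and port are placeholders:
apiVersion: v1
kind: Service
metadata:
  name: my-headless-service
spec:
  clusterIP: None
  selector:
    app.kubernetes.io/name: myapp
  ports:
  - port: 5432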
Working with labels effectively
Labels are key/value pairs that are attached to objects such as Pods. Labels help you organize, query and group your resources. They don't do anything by themselves, but they make everything else from Services to Deployments work together smoothly.
Use semantic labels
Good labels help you understand what's what, even months later. Define and use labels that identify semantic attributes of your application or Deployment. For example:
labels:
  app.kubernetes.io/name: myapp
  app.kubernetes.io/component: web
  tier: frontend
  phase: test
- app.kubernetes.io/name: what the app is
- tier: which layer it belongs to (frontend/backend)
- phase: which stage it's in (test/prod)
You can then use these labels to make powerful selectors. For example:
kubectl get pods -l tier=frontend
This will list all frontend Pods across your cluster, no matter which Deployment they came from. Basically you are not manually listing Pod names; you are just describing what you want. See the guestbook app for examples of this approach.
Use common Kubernetes labels
Kubernetes actually recommends a set of common labels. It's a standardized way to name things across your different workloads or projects. Following this convention makes your manifests cleaner, and it means that tools such as Headlamp, dashboard, or third-party monitoring systems can all automatically understand what's running.
Manipulate labels for debugging
Since controllers (like ReplicaSets or Deployments) use labels to manage Pods, you can remove a label to "detach" a Pod temporarily.
Example:
kubectl label pod mypod app-
The app- part removes the label key app. Once that happens, the controller won't manage that Pod anymore. It's like isolating it for inspection, a "quarantine mode" for debugging. To interactively remove or add labels, use kubectl label.
You can then check logs, exec into it and once done, delete it manually. That's a super underrated trick every Kubernetes engineer should know.
Handy kubectl tips
These small tips make life much easier when you are working with multiple manifest files or clusters.
Apply entire directories
Instead of applying one file at a time, apply the whole folder:
# Using server-side apply is also a good practice
kubectl apply -f configs/ --server-side
This command looks for .yaml, .yml and .json files in that folder and applies them all together. It's faster, cleaner and helps keep things grouped by app.
Use label selectors to get or delete resources
You don't always need to type out resource names one by one. Instead, use selectors to act on entire groups at once:
kubectl get pods -l app=myapp
kubectl delete pod -l phase=test
It's especially useful in CI/CD pipelines, where you want to clean up test resources dynamically.
Quickly create Deployments and Services
For quick experiments, you don't always need to write a manifest. You can spin up a Deployment right from the CLI:
kubectl create deployment webapp --image=nginx
Then expose it as a Service:
kubectl expose deployment webapp --port=80
This is great when you just want to test something before writing full manifests. Also, see Use a Service to Access an Application in a cluster for an example.
Conclusion
Cleaner configuration leads to calmer cluster administrators. If you stick to a few simple habits: keep configuration simple and minimal, version-control everything, use consistent labels, and avoid relying on naked Pods, you'll save yourself hours of debugging down the road.
The best part? Clean configurations stay readable. Even after months, you or anyone on your team can glance at them and know exactly what's happening.
25 Nov 2025 12:00am GMT