16 May 2025
Kubernetes Blog
Kubernetes v1.33: In-Place Pod Resize Graduated to Beta
On behalf of the Kubernetes project, I am excited to announce that the in-place Pod resize feature (also known as In-Place Pod Vertical Scaling), first introduced as alpha in Kubernetes v1.27, has graduated to Beta and will be enabled by default in the Kubernetes v1.33 release! This marks a significant milestone in making resource management for Kubernetes workloads more flexible and less disruptive.
What is in-place Pod resize?
Traditionally, changing the CPU or memory resources allocated to a container required restarting the Pod. While acceptable for many stateless applications, this could be disruptive for stateful services, batch jobs, or any workloads sensitive to restarts.
In-place Pod resizing allows you to change the CPU and memory requests and limits assigned to containers within a running Pod, often without requiring a container restart.
Here's the core idea:
- The spec.containers[*].resources field in a Pod specification now represents the desired resources and is mutable for CPU and memory.
- The status.containerStatuses[*].resources field reflects the actual resources currently configured on a running container.
- You can trigger a resize by updating the desired resources in the Pod spec via the new resize subresource.
You can try it out on a v1.33 Kubernetes cluster by using kubectl to edit a Pod (requires kubectl v1.32+):
kubectl edit pod <pod-name> --subresource resize
For detailed usage instructions and examples, please refer to the official Kubernetes documentation: Resize CPU and Memory Resources assigned to Containers.
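As an illustration, you can also apply a resize declaratively with kubectl patch against the resize subresource. The Pod and container names below are placeholders, and the resource values should be adapted to your workload:
kubectl patch pod resize-demo --subresource resize --patch \
  '{"spec":{"containers":[{"name":"app","resources":{"requests":{"cpu":"800m"},"limits":{"cpu":"800m"}}}]}}'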
Why does in-place Pod resize matter?
Kubernetes still excels at scaling workloads horizontally (adding or removing replicas), but in-place Pod resizing unlocks several key benefits for vertical scaling:
- Reduced Disruption: Stateful applications, long-running batch jobs, and sensitive workloads can have their resources adjusted without suffering the downtime or state loss associated with a Pod restart.
- Improved Resource Utilization: Scale down over-provisioned Pods without disruption, freeing up resources in the cluster. Conversely, provide more resources to Pods under heavy load without needing a restart.
- Faster Scaling: Address transient resource needs more quickly. For example, Java applications often need more CPU during startup than during steady-state operation. Start with higher CPU and resize down later.
What's changed between Alpha and Beta?
Since the alpha release in v1.27, significant work has gone into maturing the feature, improving its stability, and refining the user experience based on feedback and further development. Here are the key changes:
Notable user-facing changes
- resize Subresource: Modifying Pod resources must now be done via the Pod's resize subresource (kubectl patch pod <name> --subresource resize ...). kubectl versions v1.32+ support this argument.
- Resize Status via Conditions: The old status.resize field is deprecated. The status of a resize operation is now exposed via two Pod conditions (see the example after this list):
  - PodResizePending: Indicates the Kubelet cannot grant the resize immediately (e.g., reason: Deferred if temporarily unable, reason: Infeasible if impossible on the node).
  - PodResizeInProgress: Indicates the resize is accepted and being applied. Errors encountered during this phase are now reported in this condition's message with reason: Error.
- Sidecar Support: Resizing sidecar containers in-place is now supported.
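For example, you can watch these conditions while a resize is in flight; the Pod name is a placeholder, and either condition may be absent if no resize is pending or in progress:
kubectl get pod resize-demo -o jsonpath='{.status.conditions[?(@.type=="PodResizePending")]}{"\n"}{.status.conditions[?(@.type=="PodResizeInProgress")]}{"\n"}'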
Stability and reliability enhancements
- Refined Allocated Resources Management: The allocation management logic within the Kubelet was significantly reworked, making it more consistent and robust. The changes eliminated whole classes of bugs and greatly improved the reliability of in-place Pod resize.
- Improved Checkpointing & State Tracking: A more robust system for tracking "allocated" and "actuated" resources was implemented, using new checkpoint files (allocated_pods_state, actuated_pods_state) to reliably manage resize state across Kubelet restarts and handle edge cases where runtime-reported resources differ from requested ones. Several bugs related to checkpointing and state restoration were fixed. Checkpointing efficiency was also improved.
- Faster Resize Detection: Enhancements to the Kubelet's Pod Lifecycle Event Generator (PLEG) allow the Kubelet to respond to and complete resizes much more quickly.
- Enhanced CRI Integration: A new UpdatePodSandboxResources CRI call was added to better inform runtimes and plugins (like NRI) about Pod-level resource changes.
- Numerous Bug Fixes: Addressed issues related to systemd cgroup drivers, handling of containers without limits, CPU minimum share calculations, container restart backoffs, error propagation, test stability, and more.
What's next?
Graduating to Beta means the feature is ready for broader adoption, but development doesn't stop here! Here's what the community is focusing on next:
- Stability and Productionization: Continued focus on hardening the feature, improving performance, and ensuring it is robust for production environments.
- Addressing Limitations: Working towards relaxing some of the current limitations noted in the documentation, such as allowing memory limit decreases.
- VerticalPodAutoscaler (VPA) Integration: Work to enable VPA to leverage in-place Pod resize is already underway. A new InPlaceOrRecreate update mode will allow it to attempt non-disruptive resizes first, or fall back to recreation if needed. This will allow users to benefit from VPA's recommendations with significantly less disruption.
- User Feedback: Gathering feedback from users adopting the beta feature is crucial for prioritizing further enhancements and addressing any uncovered issues or bugs.
Getting started and providing feedback
With the InPlacePodVerticalScaling feature gate enabled by default in v1.33, you can start experimenting with in-place Pod resizing right away!
Refer to the documentation for detailed guides and examples.
As this feature moves through Beta, your feedback is invaluable. Please report any issues or share your experiences via the standard Kubernetes communication channels (GitHub issues, mailing lists, Slack). You can also review the KEP-1287: In-place Update of Pod Resources for the full in-depth design details.
We look forward to seeing how the community leverages in-place Pod resize to build more efficient and resilient applications on Kubernetes!
16 May 2025 6:30pm GMT
Announcing etcd v3.6.0
This announcement originally appeared on the etcd blog.
Today, we are releasing etcd v3.6.0, the first minor release since etcd v3.5.0 on June 15, 2021. This release introduces several new features, makes significant progress on long-standing efforts like downgrade support and migration to v3store, and addresses numerous critical & major issues. It also includes major optimizations in memory usage, improving efficiency and performance.
In addition to the features of v3.6.0, etcd has joined Kubernetes as a SIG (sig-etcd), enabling us to improve project sustainability. We've introduced systematic robustness testing to ensure correctness and reliability. Through the etcd-operator Working Group, we plan to improve usability as well.
What follows are the most significant changes introduced in etcd v3.6.0, along with the discussion of the roadmap for future development. For a detailed list of changes, please refer to the CHANGELOG-3.6.
A heartfelt thank you to all the contributors who made this release possible!
Security
etcd takes security seriously. To enhance software security in v3.6.0, we have improved our workflow checks by integrating govulncheck to scan the source code and trivy to scan container images. These improvements have also been backported to supported stable releases.
etcd continues to follow the Security Release Process to ensure vulnerabilities are properly managed and addressed.
Features
Migration to v3store
The v2store has been deprecated since etcd v3.4 but could still be enabled via --enable-v2. It remained the source of truth for membership data. In etcd v3.6.0, v2store can no longer be enabled as the --enable-v2 flag has been removed, and v3store has become the sole source of truth for membership data.
While v2store still exists in v3.6.0, etcd will fail to start if it contains any data other than membership information. To assist with migration, etcd v3.5.18+ provides the etcdutl check v2store command, which verifies that v2store contains only membership data (see PR 19113).
Compared to v2store, v3store offers better performance and transactional support. It is also the actively maintained storage engine moving forward.
The removal of v2store is still ongoing and is tracked in issues/12913.
Downgrade
etcd v3.6.0 is the first version to fully support downgrade. The effort for this downgrade task spans both versions 3.5 and 3.6, and all related work is tracked in issues/11716.
At a high level, the process involves migrating the data schema to the target version (e.g., v3.5), followed by a rolling downgrade.
Ensure the cluster is healthy, and take a snapshot backup. Validate whether the downgrade is valid:
$ etcdctl downgrade validate 3.5
Downgrade validate success, cluster version 3.6
If the downgrade is valid, enable downgrade mode:
$ etcdctl downgrade enable 3.5
Downgrade enable success, cluster version 3.6
etcd will then migrate the data schema in the background. Once complete, proceed with the rolling downgrade.
For details, refer to the Downgrade-3.6 guide.
Feature gates
In etcd v3.6.0, we introduced Kubernetes-style feature gates for managing new features. Previously, we indicated unstable features through the --experimental prefix in feature flag names. The prefix was removed once the feature was stable, causing a breaking change. Now, features will start in Alpha, progress to Beta, then GA, or get deprecated. This ensures a much smoother upgrade and downgrade experience for users.
See feature-gates for details.
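As an illustration only, enabling or disabling a gate follows the familiar Kubernetes-style syntax; the gate names below are placeholders, so substitute real gates from the feature-gates documentation for your release:
$ etcd --feature-gates=SomeAlphaFeature=true,SomeBetaFeature=false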
livez / readyz checks
etcd now supports /livez and /readyz endpoints, aligning with Kubernetes' liveness and readiness probes. /livez indicates whether the etcd instance is alive, while /readyz indicates when it is ready to serve requests. This feature has also been backported to release-3.5 (starting from v3.5.11) and release-3.4 (starting from v3.4.29). See livez/readyz for details.
The existing /health endpoint remains functional. /livez is similar to /health?serializable=true, while /readyz is similar to /health or /health?serializable=false. The /livez and /readyz endpoints provide clearer semantics and are easier to understand.
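For example, assuming etcd is serving client traffic on the default port on localhost, you can probe both endpoints directly (an HTTP 200 response indicates success):
$ curl http://localhost:2379/livez
$ curl http://localhost:2379/readyz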
v3discovery
In etcd v3.6.0, the new discovery protocol v3discovery was introduced, based on clientv3. It facilitates the discovery of all cluster members during the bootstrap phase.
The previous v2discovery protocol, based on clientv2, has been deprecated. Additionally, the public discovery service at https://discovery.etcd.io/, which relied on v2discovery, is no longer maintained.
Performance
Memory
In this release, we reduced average memory consumption by at least 50% (see Figure 1). This improvement is primarily due to two changes:
- The default value of --snapshot-count has been reduced from 100,000 in v3.5 to 10,000 in v3.6. As a result, etcd v3.6 now retains only about 10% of the history records compared to v3.5.
- Raft history is compacted more frequently, as introduced in PR/18825.
Figure 1: Memory usage comparison between etcd v3.5.20 and v3.6.0-rc.2 under different read/write ratios. Each subplot shows the memory usage over time with a specific read/write ratio. The red line represents etcd v3.5.20, while the teal line represents v3.6.0-rc.2. Across all tested ratios, v3.6.0-rc.2 exhibits lower and more stable memory usage.
Throughput
Compared to v3.5, etcd v3.6 delivers an average performance improvement of approximately 10% in both read and write throughput (see Figure 2, 3, 4 and 5). This improvement is not attributed to any single major change, but rather the cumulative effect of multiple minor enhancements. One such example is the optimization of the free page queries introduced in PR/419.
Figure 2: Read throughput comparison between etcd v3.5.20 and v3.6.0-rc.2 under a high write ratio. The read/write ratio is 0.0078, meaning 1 read per 128 writes. The right bar shows the percentage improvement in read throughput of v3.6.0-rc.2 over v3.5.20, ranging from 3.21% to 25.59%.
Figure 3: Read throughput comparison between etcd v3.5.20 and v3.6.0-rc.2 under a high read ratio. The read/write ratio is 8, meaning 8 reads per write. The right bar shows the percentage improvement in read throughput of v3.6.0-rc.2 over v3.5.20, ranging from 4.38% to 27.20%.
Figure 4: Write throughput comparison between etcd v3.5.20 and v3.6.0-rc.2 under a high write ratio. The read/write ratio is 0.0078, meaning 1 read per 128 writes. The right bar shows the percentage improvement in write throughput of v3.6.0-rc.2 over v3.5.20, ranging from 2.95% to 24.24%.
Figure 5: Write throughput comparison between etcd v3.5.20 and v3.6.0-rc.2 under a high read ratio. The read/write ratio is 8, meaning 8 reads per write. The right bar shows the percentage improvement in write throughput of v3.6.0-rc.2 over v3.5.20, ranging from 3.86% to 28.37%.
Breaking changes
This section highlights a few notable breaking changes. For a complete list, please refer to the Upgrade etcd from v3.5 to v3.6 and the CHANGELOG-3.6.
Old binaries are incompatible with new schema versions
Old etcd binaries are not compatible with newer data schema versions. For example, etcd 3.5 cannot start with data created by etcd 3.6, and etcd 3.4 cannot start with data created by either 3.5 or 3.6.
When downgrading etcd, it's important to follow the documented downgrade procedure. Simply replacing the binary or image will result in incompatibility issues.
Peer endpoints no longer serve client requests
Client endpoints (--advertise-client-urls) are intended to serve client requests only, while peer endpoints (--initial-advertise-peer-urls) are intended solely for peer communication. However, due to an implementation oversight, the peer endpoints were also able to handle client requests in etcd 3.4 and 3.5. This behavior was misleading and encouraged incorrect usage patterns. In etcd 3.6, this misleading behavior was corrected via PR/13565; peer endpoints no longer serve client requests.
Clear boundary between etcdctl and etcdutl
Both etcdctl and etcdutl are command line tools. etcdutl is an offline utility designed to operate directly on etcd data files, while etcdctl is an online tool that interacts with etcd over a network. Previously, there were some overlapping functionalities between the two, but these overlaps were removed in 3.6.0.
- Removed etcdctl defrag --data-dir: The etcdctl defrag command only supports online defragmentation and no longer supports offline defragmentation. To perform offline defragmentation, use the etcdutl defrag --data-dir command instead.
- Removed etcdctl snapshot status: etcdctl no longer supports retrieving the status of a snapshot. Use the etcdutl snapshot status command instead.
- Removed etcdctl snapshot restore: etcdctl no longer supports restoring from a snapshot. Use the etcdutl snapshot restore command instead (example invocations follow this list).
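For reference, the replacement etcdutl commands operate directly on files on disk; the data directory and snapshot paths below are placeholders:
$ etcdutl defrag --data-dir /var/lib/etcd
$ etcdutl snapshot status /backup/snapshot.db
$ etcdutl snapshot restore /backup/snapshot.db --data-dir /var/lib/etcd-restored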
Critical bug fixes
Correctness has always been a top priority for the etcd project. In the process of developing 3.6.0, we found and fixed a few notable bugs that could lead to data inconsistency in specific cases. These fixes have been backported to previous releases, but we believe they deserve special mention here.
- Data Inconsistency when Crashing Under Load
Previously, when etcd was applying data, it would update the consistent-index first, followed by committing the data. However, these operations were not atomic. If etcd crashed in between, it could lead to data inconsistency (see issue/13766). The issue was introduced in v3.5.0, and fixed in v3.5.3 with PR/13854.
- Durability API guarantee broken in single node cluster
When a client writes data and receives a success response, the data is expected to be persisted. However, the data might be lost if etcd crashes immediately after sending the success response to the client. This was a legacy issue (see issue/14370) affecting all previous releases. It was addressed in v3.4.21 and v3.5.5 with PR/14400, and fixed on the raft side in the main branch (now release-3.6) with PR/14413.
- Revision Inconsistency when Crashing During Defragmentation
If etcd crashed during the defragmentation operation, upon restart it might reapply some entries which had already been applied, leading to a revision inconsistency issue (see the discussions in PR/14685). The issue was introduced in v3.5.0, and fixed in v3.5.6 with PR/14730.
Upgrade issue
This section highlights a common issue (issues/19557) in the etcd v3.5 to v3.6 upgrade that may cause the upgrade process to fail. For a complete upgrade guide, refer to Upgrade etcd from v3.5 to v3.6.
The issue was introduced in etcd v3.5.1, and resolved in v3.5.20.
Key takeaway: users are required to first upgrade to etcd v3.5.20 (or a higher patch version) before upgrading to etcd v3.6.0; otherwise, the upgrade may fail.
For more background and technical context, see upgrade_from_3.5_to_3.6_issue.
Testing
We introduced robustness testing to verify correctness, which has always been our top priority. It plays traffic of various types and volumes against an etcd cluster, concurrently injects a random failpoint, records all operations (including both requests and responses), and finally performs a linearizability check. It also verifies that the Watch API guarantees have not been violated. The robustness test increases our confidence in ensuring the quality of each etcd release.
We have migrated most of the etcd workflow tests to Kubernetes' Prow testing infrastructure to take advantage of its benefits, such as nice dashboards for viewing test results and the ability for contributors to rerun failed tests themselves.
Platforms
While retaining all existing supported platforms, we have promoted Linux/ARM64 to Tier 1 support. For more details, please refer to issues/15951. For the complete list of supported platforms, see supported-platform.
Dependencies
Dependency bumping guide
We have published an official guide on how to bump dependencies for etcd's main branch and stable releases. It also covers how to update the Go version. For more details, please refer to dependency_management. With this guide available, any contributor can now help with dependency upgrades.
Core Dependency Updates
bbolt and raft are two core dependencies of etcd.
Both etcd v3.4 and v3.5 depend on bbolt v1.3, while etcd v3.6 depends on bbolt v1.4.
For the release-3.4 and release-3.5 branches, raft is included in the etcd repository itself, so etcd v3.4 and v3.5 do not depend on an external raft module. Starting from etcd v3.6, raft was moved to a separate repository (raft), and the first standalone raft release is v3.6.0. As a result, etcd v3.6.0 depends on raft v3.6.0.
Please see the table below for a summary:
| etcd versions | bbolt versions | raft versions |
| --- | --- | --- |
| 3.4.x | v1.3.x | N/A |
| 3.5.x | v1.3.x | N/A |
| 3.6.x | v1.4.x | v3.6.x |
grpc-gateway@v2
We upgraded grpc-gateway from v1 to v2 via PR/16595 in etcd v3.6.0. This is a major step toward migrating to protobuf-go, the second major version of the Go protocol buffer API implementation.
grpc-gateway@v2 is designed to work with protobuf-go. However, etcd v3.6 still depends on the deprecated gogo/protobuf, which is actually a protocol buffer v1 implementation. To resolve this incompatibility, we applied a patch to the generated *.pb.gw.go files to convert v1 messages to v2 messages.
grpc-ecosystem/go-grpc-middleware/providers/prometheus
We switched from the deprecated (and archived) grpc-ecosystem/go-grpc-prometheus to grpc-ecosystem/go-grpc-middleware/providers/prometheus via PR/19195. This change ensures continued support and access to the latest features and improvements in the gRPC Prometheus integration.
Community
There are exciting developments in the etcd community that reflect our ongoing commitment to strengthening collaboration, improving maintainability, and evolving the project's governance.
etcd Becomes a Kubernetes SIG
etcd has officially become a Kubernetes Special Interest Group: SIG-etcd. This change reflects etcd's critical role as the primary datastore for Kubernetes and establishes a more structured and transparent home for long-term stewardship and cross-project collaboration. The new SIG designation will help streamline decision-making, align roadmaps with Kubernetes needs, and attract broader community involvement.
New contributors, maintainers, and reviewers
We've seen increasing engagement from contributors, which has resulted in the addition of three new maintainers:
Their continued contributions have been instrumental in driving the project forward.
We also welcome two new reviewers to the project:
We appreciate their dedication to code quality and their willingness to take on broader review responsibilities within the community.
New release team
We've formed a new release team led by ivanvc and jmhbnz, streamlining the release process by automating many previously manual steps. Inspired by Kubernetes SIG Release, we've adopted several best practices, including clearly defined release team roles and the introduction of release shadows to support knowledge sharing and team sustainability. These changes have made our releases smoother and more reliable, allowing us to approach each release with greater confidence and consistency.
Introducing the etcd Operator Working Group
To further advance etcd's operational excellence, we have formed a new working group: WG-etcd-operator. The working group is dedicated to enabling the automatic and efficient operation of etcd clusters that run in the Kubernetes environment using an etcd-operator.
Future Development
The legacy v2store has been deprecated since etcd v3.4, and the flag --enable-v2 was removed entirely in v3.6. This means that starting from v3.6, there is no longer a way to enable or use the v2store. However, etcd still bootstraps internally from the legacy v2 snapshots. To address this inconsistency, we plan to change etcd to bootstrap from the v3store and replay the WAL entries based on the consistent-index. The work is being tracked in issues/12913.
One of the most persistent challenges remains large range queries from the kube-apiserver, which can lead to process crashes due to their unpredictable nature. The range stream feature, originally outlined in the v3.5 release blog/Future roadmaps, remains an idea worth revisiting to address the challenges of large range queries.
For more details and upcoming plans, please refer to the etcd roadmap.
16 May 2025 12:00am GMT
15 May 2025
Kubernetes Blog
Kubernetes 1.33: Job's SuccessPolicy Goes GA
On behalf of the Kubernetes project, I'm pleased to announce that Job success policy has graduated to General Availability (GA) as part of the v1.33 release.
About Job's Success Policy
In batch workloads, you might want to use leader-follower patterns like MPI, in which the leader controls the execution, including the followers' lifecycle.
In this case, you might want to mark the Job as succeeded even if some of the indexes failed. Unfortunately, a leader-follower Kubernetes Job that didn't use a success policy would, in most cases, require all Pods to finish successfully for that Job to reach an overall succeeded state.
For Kubernetes Jobs, the API allows you to specify the early exit criteria using the .spec.successPolicy field (you can only use the .spec.successPolicy field for an Indexed Job). The field describes a set of rules, either using a list of succeeded indexes for a Job, or defining a minimal required count of succeeded indexes.
This newly stable field is especially valuable for scientific simulation, AI/ML and High-Performance Computing (HPC) batch workloads. Users in these areas often run numerous experiments and may only need a specific number to complete successfully, rather than requiring all of them to succeed. In this case, the leader index failure is the only relevant Job exit criterion, and the outcomes for individual follower Pods are handled only indirectly via the status of the leader index. Moreover, followers do not know when they can terminate themselves.
After a Job meets any success policy rule, the Job is marked as succeeded, and all Pods are terminated, including the running ones.
How it works
The following excerpt from a Job manifest, using .successPolicy.rules[0].succeededCount, shows an example of using a custom success policy:
parallelism: 10
completions: 10
completionMode: Indexed
successPolicy:
rules:
- succeededCount: 1
Here, the Job is marked as succeeded when one index succeeded regardless of its number. Additionally, you can constrain which index numbers count towards succeededCount by also setting .successPolicy.rules[0].succeededIndexes, as shown below:
parallelism: 10
completions: 10
completionMode: Indexed
successPolicy:
rules:
- succeededIndexes: 0 # index of the leader Pod
succeededCount: 1
This example shows that the Job will be marked as succeeded once a Pod with a specific index (Pod index 0) has succeeded.
Once the Job either reaches one of the successPolicy rules, or achieves its Complete criteria based on .spec.completions, the Job controller within kube-controller-manager adds the SuccessCriteriaMet condition to the Job status. After that, the job-controller initiates cleanup and termination of Pods for Jobs with the SuccessCriteriaMet condition. Eventually, Jobs obtain the Complete condition when the job-controller has finished cleanup and termination.
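As a quick check, you can list the conditions recorded on a Job's status; the Job name here is a placeholder:
kubectl get job my-job -o jsonpath='{range .status.conditions[*]}{.type}{"="}{.status}{"\n"}{end}'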
Learn more
- Read the documentation for success policy.
- Read the KEP for the Job success/completion policy
Get involved
This work was led by the Kubernetes batch working group in close collaboration with the SIG Apps community.
If you are interested in working on new features in the space, I recommend subscribing to our Slack channel and attending the regular community meetings.
15 May 2025 6:30pm GMT
14 May 2025
Kubernetes Blog
Kubernetes v1.33: Updates to Container Lifecycle
Kubernetes v1.33 introduces a few updates to the lifecycle of containers. The Sleep action for container lifecycle hooks now supports a zero sleep duration (feature enabled by default). There is also alpha support for customizing the stop signal sent to containers when they are being terminated.
This blog post goes into the details of these new aspects of the container lifecycle, and how you can use them.
Zero value for Sleep action
Kubernetes v1.29 introduced the Sleep action for container PreStop and PostStart lifecycle hooks. The Sleep action lets your containers pause for a specified duration after the container is started or before it is terminated. This was needed to provide a straightforward way to manage graceful shutdowns. Before the Sleep action, folks used to run the sleep command using the exec action in their container lifecycle hooks. If you wanted to do this, you'd need to have the binary for the sleep command in your container image. This is difficult if you're using third party images.
When the Sleep action was initially added, it didn't support a sleep duration of zero seconds. The time.Sleep function, which the Sleep action uses under the hood, supports a duration of zero seconds: using a negative or a zero value for the sleep returns immediately, resulting in a no-op. We wanted the same behaviour for the Sleep action. Support for the zero duration was later added in v1.32, with the PodLifecycleSleepActionAllowZero feature gate.
The PodLifecycleSleepActionAllowZero feature gate has graduated to beta in v1.33, and is now enabled by default. The original Sleep action for preStop and postStart hooks has been enabled by default starting from Kubernetes v1.30. With a cluster running Kubernetes v1.33, you are able to set a zero duration for sleep lifecycle hooks. For a cluster with default configuration, you don't need to enable any feature gate to make that possible.
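A minimal sketch of a Pod that uses a zero-second sleep in its preStop hook (effectively a no-op hook) is shown below; the Pod name and image are placeholders:
apiVersion: v1
kind: Pod
metadata:
  name: zero-sleep-demo
spec:
  containers:
  - name: app
    image: registry.k8s.io/e2e-test-images/agnhost:2.45
    lifecycle:
      preStop:
        sleep:
          seconds: 0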
Container stop signals
Container runtimes such as containerd and CRI-O honor a StopSignal instruction in the container image definition. This can be used to specify a custom stop signal that the runtime will use to terminate containers based on that image. Stop signal configuration was not originally part of the Pod API in Kubernetes. Until Kubernetes v1.33, the only way to override the stop signal for containers was by rebuilding your container image with the new custom stop signal (for example, specifying STOPSIGNAL in a Containerfile or Dockerfile).
The ContainerStopSignals feature gate, newly added in Kubernetes v1.33, adds stop signals to the Kubernetes API. This allows users to specify a custom stop signal in the container spec. Stop signals are added to the API as a new lifecycle along with the existing PreStop and PostStart lifecycle handlers. In order to use this feature, we expect the Pod to have the operating system specified with spec.os.name. This is enforced so that we can cross-validate the stop signal against the operating system and make sure that the containers in the Pod are created with a valid stop signal for the operating system the Pod is being scheduled to. For Pods scheduled on Windows nodes, only SIGTERM and SIGKILL are allowed as valid stop signals. Find the full list of signals supported on Linux nodes here.
Default behaviour
If a container has a custom stop signal defined in its lifecycle, the container runtime would use the signal defined in the lifecycle to kill the container, given that the container runtime also supports custom stop signals. If there is no custom stop signal defined in the container lifecycle, the runtime would fall back to the stop signal defined in the container image. If there is no stop signal defined in the container image, the default stop signal of the runtime would be used. The default signal is SIGTERM for both containerd and CRI-O.
Version skew
For the feature to work as intended, both the Kubernetes version and the container runtime version should support container stop signals. The changes to the Kubernetes API and kubelet are available in alpha stage from v1.33, which can be enabled with the ContainerStopSignals feature gate. The container runtime implementations for containerd and CRI-O are still a work in progress and will be rolled out soon.
Using container stop signals
To enable this feature, you need to turn on the ContainerStopSignals feature gate in both the kube-apiserver and the kubelet. Once you have nodes where the feature gate is turned on, you can create Pods with a StopSignal lifecycle and a valid OS name like so:
apiVersion: v1
kind: Pod
metadata:
name: nginx
spec:
os:
name: linux
containers:
- name: nginx
image: nginx:latest
lifecycle:
stopSignal: SIGUSR1
Do note that the SIGUSR1 signal in this example can only be used if the container's Pod is scheduled to a Linux node. Hence we need to specify spec.os.name as linux to be able to use the signal. You will only be able to configure SIGTERM and SIGKILL signals if the Pod is being scheduled to a Windows node. You cannot specify containers[*].lifecycle.stopSignal if the spec.os.name field is nil or unset.
How do I get involved?
This feature is driven by the SIG Node. If you are interested in helping develop this feature, sharing feedback, or participating in any other ongoing SIG Node projects, please reach out to us!
You can reach SIG Node by several means:
You can also contact me directly:
- GitHub: @sreeram-venkitesh
- Slack: @sreeram.venkitesh
14 May 2025 6:30pm GMT
13 May 2025
Kubernetes Blog
Kubernetes v1.33: Job's Backoff Limit Per Index Goes GA
In Kubernetes v1.33, the Backoff Limit Per Index feature reaches general availability (GA). This blog describes the Backoff Limit Per Index feature and its benefits.
About backoff limit per index
When you run workloads on Kubernetes, you must consider scenarios where Pod failures can affect the completion of your workloads. Ideally, your workload should tolerate transient failures and continue running.
To achieve failure tolerance in a Kubernetes Job, you can set the spec.backoffLimit field. This field specifies the total number of tolerated failures.
However, for workloads where every index is considered independent, like embarrassingly parallel workloads, the spec.backoffLimit field is often not flexible enough. For example, you may choose to run multiple suites of integration tests by representing each suite as an index within an Indexed Job. In that setup, a fast-failing index (test suite) is likely to consume your entire budget for tolerating Pod failures, and you might not be able to run the other indexes.
In order to address this limitation, Kubernetes introduced backoff limit per index, which allows you to control the number of retries per index.
How backoff limit per index works
To use Backoff Limit Per Index for Indexed Jobs, specify the number of tolerated Pod failures per index with the spec.backoffLimitPerIndex field. When you set this field, the Job executes all indexes by default.
Additionally, to fine-tune the error handling:
- Specify the cap on the total number of failed indexes by setting the spec.maxFailedIndexes field. When the limit is exceeded, the entire Job is terminated.
- Define a short-circuit to detect a failed index by using the FailIndex action in the Pod Failure Policy mechanism.
When the number of tolerated failures is exceeded, the Job marks that index as failed and lists it in the Job's status.failedIndexes field.
Example
The following Job spec snippet is an example of how to combine backoff limit per index with the Pod Failure Policy feature:
completions: 10
parallelism: 10
completionMode: Indexed
backoffLimitPerIndex: 1
maxFailedIndexes: 5
podFailurePolicy:
rules:
- action: Ignore
onPodConditions:
- type: DisruptionTarget
- action: FailIndex
onExitCodes:
operator: In
values: [ 42 ]
In this example, the Job handles Pod failures as follows:
- Ignores any failed Pods that have the built-in disruption condition, called DisruptionTarget. These Pods don't count towards Job backoff limits.
- Fails the index corresponding to the failed Pod if any of the failed Pod's containers finished with the exit code 42, based on the matching "FailIndex" rule.
- Retries the first failure of any index, unless the index failed due to the matching FailIndex rule.
- Fails the entire Job if the number of failed indexes exceeded 5 (set by the spec.maxFailedIndexes field). You can inspect the recorded failed indexes with the command shown after this list.
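Once indexes start failing, you can check which ones the Job has recorded as failed; the Job name below is a placeholder:
kubectl get job my-indexed-job -o jsonpath='{.status.failedIndexes}{"\n"}'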
Learn more
- Read the blog post on the closely related feature of Pod Failure Policy Kubernetes 1.31: Pod Failure Policy for Jobs Goes GA
- For a hands-on guide to using Pod failure policy, including the use of FailIndex, see Handling retriable and non-retriable pod failures with Pod failure policy
- Read the documentation for Backoff limit per index and Pod failure policy
- Read the KEP for the Backoff Limits Per Index For Indexed Jobs
Get involved
This work was sponsored by the Kubernetes batch working group in close collaboration with the SIG Apps community.
If you are interested in working on new features in the space, we recommend subscribing to our Slack channel and attending the regular community meetings.
13 May 2025 6:30pm GMT
12 May 2025
Kubernetes Blog
Kubernetes v1.33: Image Pull Policy the way you always thought it worked!
Image Pull Policy the way you always thought it worked!
Some things in Kubernetes are surprising, and the way imagePullPolicy behaves might be one of them. Given Kubernetes is all about running pods, it may be peculiar to learn that there has been a caveat to restricting pod access to authenticated images for over 10 years in the form of issue 18787! It is an exciting release when you can resolve a ten-year-old issue.
Note:
Throughout this blog post, the term "pod credentials" will be used often. In this context, the term generally encapsulates the authentication material that is available to a pod to authenticate a container image pull.
IfNotPresent, even if I'm not supposed to have it
The gist of the problem is that the imagePullPolicy: IfNotPresent strategy has done precisely what it says, and nothing more. Let's set up a scenario. To begin, Pod A in Namespace X is scheduled to Node 1 and requires image Foo from a private repository. For its image pull authentication material, the pod references Secret 1 in its imagePullSecrets. Secret 1 contains the necessary credentials to pull from the private repository. The Kubelet will utilize the credentials from Secret 1 as supplied by Pod A and it will pull container image Foo from the registry. This is the intended (and secure) behavior.
But now things get curious. If Pod B in Namespace Y happens to also be scheduled to Node 1, unexpected (and potentially insecure) things happen. Pod B may reference the same private image, specifying the IfNotPresent image pull policy. Pod B does not reference Secret 1 (or in our case, any secret) in its imagePullSecrets. When the Kubelet tries to run the pod, it honors the IfNotPresent policy. The Kubelet sees that the image Foo is already present locally, and will provide image Foo to Pod B. Pod B gets to run the image even though it did not provide credentials authorizing it to pull the image in the first place.
Using a private image pulled by a different pod
While IfNotPresent should not pull image Foo if it is already present on the node, it is an incorrect security posture to allow all pods scheduled to a node to have access to a previously pulled private image. These pods were never authorized to pull the image in the first place.
IfNotPresent, but only if I am supposed to have it
In Kubernetes v1.33, we - SIG Auth and SIG Node - have finally started to address this (really old) problem and to get the verification right! The basic expected behavior is not changed. If an image is not present, the Kubelet will attempt to pull the image. The credentials each pod supplies will be utilized for this task. This matches behavior prior to 1.33.
If the image is present, then the behavior of the Kubelet changes. The Kubelet will now verify the pod's credentials before allowing the pod to use the image.
Performance and service stability have been a consideration while revising the feature. Pods utilizing the same credential will not be required to re-authenticate. This is also true when pods source credentials from the same Kubernetes Secret object, even when the credentials are rotated.
Never pull, but use if authorized
The imagePullPolicy: Never option does not fetch images. However, if the container image is already present on the node, any pod attempting to use the private image will be required to provide credentials, and those credentials require verification.
Pods utilizing the same credential will not be required to re-authenticate. Pods that do not supply credentials previously used to successfully pull an image will not be allowed to use the private image.
Always pull, if authorized
The imagePullPolicy: Always has always worked as intended. Each time an image is requested, the request goes to the registry and the registry will perform an authentication check.
In the past, forcing the Always image pull policy via pod admission was the only way to ensure that your private container images didn't get reused by other pods on nodes which already pulled the images.
Fortunately, this was somewhat performant. Only the image manifest was pulled, not the image. However, there was still a cost and a risk. During a new rollout, scale up, or pod restart, the image registry that provided the image MUST be available for the auth check, putting the image registry in the critical path for stability of services running inside of the cluster.
How it all works
The feature is based on persistent, file-based caches that are present on each of the nodes. The following is a simplified description of how the feature works. For the complete version, please see KEP-2535.
The process of requesting an image for the first time goes like this:
- A pod requesting an image from a private registry is scheduled to a node.
- The image is not present on the node.
- The Kubelet makes a record of the intention to pull the image.
- The Kubelet extracts credentials from the Kubernetes Secret referenced by the pod as an image pull secret, and uses them to pull the image from the private registry.
- After the image has been successfully pulled, the Kubelet makes a record of the successful pull. This record includes details about credentials used (in the form of a hash) as well as the Secret from which they originated.
- The Kubelet removes the original record of intent.
- The Kubelet retains the record of successful pull for later use.
When future pods scheduled to the same node request the previously pulled private image:
- The Kubelet checks the credentials that the new pod provides for the pull.
- If the hash of these credentials, or the source Secret of the credentials match the hash or source Secret which were recorded for a previous successful pull, the pod is allowed to use the previously pulled image.
- If the credentials or their source Secret are not found in the records of successful pulls for that image, the Kubelet will attempt to use these new credentials to request a pull from the remote registry, triggering the authorization flow.
Try it out
In Kubernetes v1.33 we shipped the alpha version of this feature. To give it a spin, enable the KubeletEnsureSecretPulledImages feature gate for your 1.33 Kubelets.
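If you manage kubelets through a configuration file, a minimal sketch of enabling the gate looks like this; only the featureGates stanza is relevant here, and the rest of your existing configuration stays unchanged:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  KubeletEnsureSecretPulledImages: true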
You can learn more about the feature and additional optional configuration on the concept page for Images in the official Kubernetes documentation.
What's next?
In future releases we are going to:
- Make this feature work together with Projected service account tokens for Kubelet image credential providers which adds a new, workload-specific source of image pull credentials.
- Write a benchmarking suite to measure the performance of this feature and assess the impact of any future changes.
- Implement an in-memory caching layer so that we don't need to read files for each image pull request.
- Add support for credential expirations, thus forcing previously validated credentials to be re-authenticated.
How to get involved
Reading KEP-2535 is a great way to understand these changes in depth.
If you are interested in further involvement, reach out to us on the #sig-auth-authenticators-dev channel on Kubernetes Slack (for an invitation, visit https://slack.k8s.io/). You are also welcome to join the bi-weekly SIG Auth meetings, held every other Wednesday.
12 May 2025 6:30pm GMT
09 May 2025
Kubernetes Blog
Kubernetes v1.33: Streaming List responses
Managing Kubernetes cluster stability becomes increasingly critical as your infrastructure grows. One of the most challenging aspects of operating large-scale clusters has been handling List requests that fetch substantial datasets - a common operation that could unexpectedly impact your cluster's stability.
Today, the Kubernetes community is excited to announce a significant architectural improvement: streaming encoding for List responses.
The problem: unnecessary memory consumption with large resources
Current API response encoders just serialize an entire response into a single contiguous block of memory and perform one ResponseWriter.Write call to transmit data to the client. Despite HTTP/2's capability to split responses into smaller frames for transmission, the underlying HTTP server continues to hold the complete response data as a single buffer. Even as individual frames are transmitted to the client, the memory associated with these frames cannot be freed incrementally.
When cluster size grows, the single response body can be substantial - like hundreds of megabytes in size. At large scale, the current approach becomes particularly inefficient, as it prevents incremental memory release during transmission. Imagine that when network congestion occurs, that large response body's memory block stays active for tens of seconds or even minutes. This limitation leads to unnecessarily high and prolonged memory consumption in the kube-apiserver process. If multiple large List requests occur simultaneously, the cumulative memory consumption can escalate rapidly, potentially leading to an Out-of-Memory (OOM) situation that compromises cluster stability.
The encoding/json package uses sync.Pool to reuse memory buffers during serialization. While efficient for consistent workloads, this mechanism creates challenges with sporadic large List responses. When processing these large responses, memory pools expand significantly. But due to sync.Pool's design, these oversized buffers remain reserved after use. Subsequent small List requests continue utilizing these large memory allocations, preventing garbage collection and maintaining persistently high memory consumption in the kube-apiserver even after the initial large responses complete.
Additionally, Protocol Buffers are not designed to handle large datasets, but they are great for handling individual messages within a large data set. This highlights the need for streaming-based approaches that can process and transmit large collections incrementally rather than as monolithic blocks.
As a general rule of thumb, if you are dealing in messages larger than a megabyte each, it may be time to consider an alternate strategy.
Streaming encoder for List responses
The streaming encoding mechanism is specifically designed for List responses, leveraging their common well-defined collection structures. The core idea focuses exclusively on the Items field within collection structures, which represents the bulk of memory consumption in large responses. Rather than encoding the entire Items array as one contiguous memory block, the new streaming encoder processes and transmits each item individually, allowing memory to be freed progressively as each frame or chunk is transmitted. As a result, encoding items one by one significantly reduces the memory footprint required by the API server.
With Kubernetes objects typically limited to 1.5 MiB (from etcd), streaming encoding keeps memory consumption predictable and manageable regardless of how many objects are in a List response. The result is significantly improved API server stability, reduced memory spikes, and better overall cluster performance - especially in environments where multiple large List operations might occur simultaneously.
To ensure perfect backward compatibility, the streaming encoder validates Go struct tags rigorously before activation, guaranteeing byte-for-byte consistency with the original encoder. Standard encoding mechanisms process all fields except Items, maintaining identical output formatting throughout. This approach seamlessly supports all Kubernetes List types, from built-in *List objects to Custom Resource UnstructuredList objects, requiring zero client-side modifications or awareness that the underlying encoding method has changed.
Performance gains you'll notice
- Reduced Memory Consumption: Significantly lowers the memory footprint of the API server when handling large list requests, especially when dealing with large resources.
- Improved Scalability: Enables the API server to handle more concurrent requests and larger datasets without running out of memory.
- Increased Stability: Reduces the risk of OOM kills and service disruptions.
- Efficient Resource Utilization: Optimizes memory usage and improves overall resource efficiency.
Benchmark results
To validate the results, Kubernetes has introduced a new list benchmark which concurrently executes 10 list requests, each returning 1 GB of data.
The benchmark showed a 20x improvement, reducing memory usage from 70-80 GB to 3 GB.

List benchmark memory usage
09 May 2025 6:30pm GMT
08 May 2025
Kubernetes Blog
Kubernetes 1.33: Volume Populators Graduate to GA
Kubernetes volume populators are now generally available (GA)! The AnyVolumeDataSource feature gate is treated as always enabled for Kubernetes v1.33, which means that users can specify any appropriate custom resource as the data source of a PersistentVolumeClaim (PVC).
An example of how to use dataSourceRef in PVC:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: pvc1
spec:
...
dataSourceRef:
apiGroup: provider.example.com
kind: Provider
name: provider1
What is new
There are four major enhancements from beta.
Populator Pod is optional
During the beta phase, contributors to Kubernetes identified potential resource leaks with PersistentVolumeClaim (PVC) deletion while volume population was in progress; these leaks happened due to limitations in finalizer handling. Ahead of the graduation to general availability, the Kubernetes project added support to delete temporary resources (PVC prime, etc.) if the original PVC is deleted.
To accommodate this, we've introduced three new plugin-based functions:
- PopulateFn(): Executes the provider-specific data population logic.
- PopulateCompleteFn(): Checks if the data population operation has finished successfully.
- PopulateCleanupFn(): Cleans up temporary resources created by the provider-specific functions after data population is completed.
A provider example is added in lib-volume-populator/example.
Mutator functions to modify the Kubernetes resources
For GA, the CSI volume populator controller code gained a MutatorConfig, allowing the specification of mutator functions to modify Kubernetes resources. For example, if the PVC prime is not an exact copy of the PVC and you need provider-specific information for the driver, you can include this information in the optional MutatorConfig. This allows you to customize the Kubernetes objects in the volume populator.
Flexible metric handling for providers
Our beta phase highlighted a new requirement: the need to aggregate metrics not just from lib-volume-populator, but also from other components within the provider's codebase.
To address this, SIG Storage introduced a provider metric manager. This enhancement delegates the implementation of metrics logic to the provider itself, rather than relying solely on lib-volume-populator. This shift provides greater flexibility and control over metrics collection and aggregation, enabling a more comprehensive view of provider performance.
Clean up for temporary resources
During the beta phase, we identified potential resource leaks with PersistentVolumeClaim (PVC) deletion while volume population was in progress, due to limitations in finalizer handling. We have improved the populator to support the deletion of temporary resources (PVC prime, etc.) if the original PVC is deleted in this GA release.
How to use it
To try it out, please follow the steps in the previous beta blog.
Future directions and potential feature requests
As a next step, there are several potential feature requests for the volume populator:
- Multi sync: the current implementation is a one-time unidirectional sync from source to destination. This can be extended to support multiple syncs, enabling periodic syncs or allowing users to sync on demand
- Bidirectional sync: an extension of multi sync above, but making it bidirectional between source and destination
- Populate data with priorities: with a list of different dataSourceRef, populate based on priorities
- Populate data from multiple sources of the same provider: populate multiple different sources to one destination
- Populate data from multiple sources of the different providers: populate multiple different sources to one destination, pipelining different resources' population
To ensure we're building something truly valuable, Kubernetes SIG Storage would love to hear about any specific use cases you have in mind for this feature. For any inquiries or specific questions related to volume populator, please reach out to the SIG Storage community.
08 May 2025 6:30pm GMT
07 May 2025
Kubernetes Blog
Kubernetes v1.33: From Secrets to Service Accounts: Kubernetes Image Pulls Evolved
Kubernetes has steadily evolved to reduce reliance on long-lived credentials stored in the API. A prime example of this shift is the transition of Kubernetes Service Account (KSA) tokens from long-lived, static tokens to ephemeral, automatically rotated tokens with OpenID Connect (OIDC)-compliant semantics. This advancement enables workloads to securely authenticate with external services without needing persistent secrets.
However, one major gap remains: image pull authentication. Today, Kubernetes clusters rely on image pull secrets stored in the API, which are long-lived and difficult to rotate, or on node-level kubelet credential providers, which allow any pod running on a node to access the same credentials. This presents security and operational challenges.
To address this, Kubernetes is introducing Service Account Token Integration for Kubelet Credential Providers, now available in alpha. This enhancement allows credential providers to use pod-specific service account tokens to obtain registry credentials, which kubelet can then use for image pulls - eliminating the need for long-lived image pull secrets.
The problem with image pull secrets
Currently, Kubernetes administrators have two primary options for handling private container image pulls:
- Image pull secrets stored in the Kubernetes API
  - These secrets are often long-lived because they are hard to rotate.
  - They must be explicitly attached to a service account or pod.
  - Compromise of a pull secret can lead to unauthorized image access.
- Kubelet credential providers
  - These providers fetch credentials dynamically at the node level.
  - Any pod running on the node can access the same credentials.
  - There's no per-workload isolation, increasing security risks.
Neither approach aligns with the principles of least privilege or ephemeral authentication, leaving Kubernetes with a security gap.
The solution: Service Account token integration for Kubelet credential providers
This new enhancement enables kubelet credential providers to use workload identity when fetching image registry credentials. Instead of relying on long-lived secrets, credential providers can use service account tokens to request short-lived credentials tied to a specific pod's identity.
This approach provides:
- Workload-specific authentication: Image pull credentials are scoped to a particular workload.
- Ephemeral credentials: Tokens are automatically rotated, eliminating the risks of long-lived secrets.
- Seamless integration: Works with existing Kubernetes authentication mechanisms, aligning with cloud-native security best practices.
How it works
1. Service Account tokens for credential providers
Kubelet generates short-lived, automatically rotated tokens for service accounts if the credential provider it communicates with has opted into receiving a service account token for image pulls. These tokens conform to OIDC ID token semantics and are provided to the credential provider as part of the CredentialProviderRequest. The credential provider can then use this token to authenticate with an external service.
2. Image registry authentication flow
- When a pod starts, the kubelet requests credentials from a credential provider.
- If the credential provider has opted in, the kubelet generates a service account token for the pod.
- The service account token is included in the CredentialProviderRequest, allowing the credential provider to authenticate and exchange it for temporary image pull credentials from a registry (e.g. AWS ECR, GCP Artifact Registry, Azure ACR).
- The kubelet then uses these credentials to pull images on behalf of the pod.
Benefits of this approach
- Security: Eliminates long-lived image pull secrets, reducing attack surfaces.
- Granular Access Control: Credentials are tied to individual workloads rather than entire nodes or clusters.
- Operational Simplicity: No need for administrators to manage and rotate image pull secrets manually.
- Improved Compliance: Helps organizations meet security policies that prohibit persistent credentials in the cluster.
What's next?
For Kubernetes v1.34, we expect to ship this feature in beta while continuing to gather feedback from users.
In the coming releases, we will focus on:
- Implementing caching mechanisms to improve performance for token generation.
- Giving more flexibility to credential providers to decide how the registry credentials returned to the kubelet are cached.
- Making the feature work with Ensure Secret Pulled Images to ensure pods that use an image are authorized to access that image when service account tokens are used for authentication.
You can learn more about this feature on the service account token for image pulls page in the Kubernetes documentation.
You can also follow along on the KEP-4412 to track progress across the coming Kubernetes releases.
Try it out
To try out this feature:
- Ensure you are running Kubernetes v1.33 or later.
- Enable the ServiceAccountTokenForKubeletCredentialProviders feature gate on the kubelet.
- Ensure credential provider support: modify or update your credential provider to use service account tokens for authentication.
- Update the credential provider configuration to opt into receiving service account tokens by configuring the tokenAttributes field (a configuration sketch follows this list).
- Deploy a pod that uses the credential provider to pull images from a private registry.
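For reference, a kubelet credential provider configuration that opts into service account tokens might look roughly like the sketch below. The provider name, image pattern, audience, and the exact tokenAttributes sub-fields are assumptions; check the Kubernetes documentation for the schema in your version:

apiVersion: kubelet.config.k8s.io/v1
kind: CredentialProviderConfig
providers:
- name: example-credential-provider          # placeholder provider binary name
  apiVersion: credentialprovider.kubelet.k8s.io/v1
  matchImages:
  - "*.registry.example.com"                 # placeholder registry pattern
  defaultCacheDuration: "0s"
  tokenAttributes:
    serviceAccountTokenAudience: registry.example.com   # audience the provider expects (assumed field)
    requireServiceAccount: true                          # only pods with a service account get credentials (assumed field)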
We would love to hear your feedback on this feature. Please reach out to us on the #sig-auth-authenticators-dev channel on Kubernetes Slack (for an invitation, visit https://slack.k8s.io/).
How to get involved
If you are interested in getting involved in the development of this feature, sharing feedback, or participating in any other ongoing SIG Auth projects, please reach out on the #sig-auth channel on Kubernetes Slack.
You are also welcome to join the bi-weekly SIG Auth meetings, held every other Wednesday.
07 May 2025 6:30pm GMT
06 May 2025
Kubernetes Blog
Kubernetes v1.33: Fine-grained SupplementalGroups Control Graduates to Beta
The new field, supplementalGroupsPolicy, was introduced as an opt-in alpha feature for Kubernetes v1.31 and has graduated to beta in v1.33; the corresponding feature gate (SupplementalGroupsPolicy) is now enabled by default. This feature provides more precise control over supplemental groups in containers, which can strengthen your security posture, particularly when accessing volumes. It also improves the transparency of UID/GID details in containers, offering better security oversight.
Please be aware that this beta release contains a behavioral breaking change. See the Behavioral Changes Introduced In Beta and Upgrade Considerations sections for details.
Motivation: Implicit group memberships defined in /etc/group in the container image
Although many Kubernetes cluster admins and users may not be aware of it, Kubernetes by default merges group information from the Pod with information defined in /etc/group in the container image.
Let's look at an example. The Pod manifest below specifies runAsUser=1000, runAsGroup=3000, and supplementalGroups=[4000] in the Pod's security context.
apiVersion: v1
kind: Pod
metadata:
name: implicit-groups
spec:
securityContext:
runAsUser: 1000
runAsGroup: 3000
supplementalGroups: [4000]
containers:
- name: ctr
image: registry.k8s.io/e2e-test-images/agnhost:2.45
command: [ "sh", "-c", "sleep 1h" ]
securityContext:
allowPrivilegeEscalation: false
What is the result of the id command in the ctr container? The output should be similar to this:
uid=1000 gid=3000 groups=3000,4000,50000
Where does group ID 50000 in the supplementary groups (the groups field) come from, even though 50000 is not defined in the Pod's manifest at all? The answer is the /etc/group file in the container image.
Checking the contents of /etc/group in the container image reveals the following:
user-defined-in-image:x:1000:
group-defined-in-image:x:50000:user-defined-in-image
The last entry shows that the container's primary user 1000 belongs to the group 50000.
Thus, the group membership defined in /etc/group in the container image for the container's primary user is implicitly merged with the information from the Pod. Please note that this was a design decision the current CRI implementations inherited from Docker, and the community never really reconsidered it until now.
What's wrong with it?
The implicitly merged group information from /etc/group in the container image poses a security risk. These implicit GIDs can't be detected or validated by policy engines because there's no record of them in the Pod manifest. This can lead to unexpected access control issues, particularly when accessing volumes (see kubernetes/kubernetes#112879 for details), because file permissions are controlled by UIDs/GIDs in Linux.
Fine-grained supplemental groups control in a Pod: supplementalGroupsPolicy
To tackle the above problem, a Pod's .spec.securityContext now includes the supplementalGroupsPolicy field.
This field lets you control how Kubernetes calculates the supplementary groups for container processes within a Pod. The available policies are:
- Merge: The group membership defined in /etc/group for the container's primary user is merged. If not specified, this policy is applied (i.e. the as-is behavior, for backward compatibility).
- Strict: Only the group IDs specified in fsGroup, supplementalGroups, or runAsGroup are attached as supplementary groups to the container processes. Group memberships defined in /etc/group for the container's primary user are ignored.
Let's see how the Strict policy works. The Pod manifest below specifies supplementalGroupsPolicy: Strict:
apiVersion: v1
kind: Pod
metadata:
name: strict-supplementalgroups-policy
spec:
securityContext:
runAsUser: 1000
runAsGroup: 3000
supplementalGroups: [4000]
supplementalGroupsPolicy: Strict
containers:
- name: ctr
image: registry.k8s.io/e2e-test-images/agnhost:2.45
command: [ "sh", "-c", "sleep 1h" ]
securityContext:
allowPrivilegeEscalation: false
The result of the id command in the ctr container should be similar to this:
uid=1000 gid=3000 groups=3000,4000
You can see that the Strict policy excludes group 50000 from groups!
Thus, ensuring supplementalGroupsPolicy: Strict (enforced by some policy mechanism) helps prevent implicit supplementary groups in a Pod.
Note:
A container with sufficient privileges can change its process identity. The supplementalGroupsPolicy only affects the initial process identity. See the following section for details.
Attached process identity in Pod status
This feature also exposes the process identity attached to the first container process of the container via the .status.containerStatuses[].user.linux field. This is helpful for checking whether implicit group IDs are attached.
...
status:
containerStatuses:
- name: ctr
user:
linux:
gid: 3000
supplementalGroups:
- 3000
- 4000
uid: 1000
...
Note:
Please note that the values in the status.containerStatuses[].user.linux field reflect the process identity first attached to the first container process in the container. If the container has sufficient privilege to call system calls related to process identity (e.g. setuid(2), setgid(2), setgroups(2), etc.), the container process can change its identity. Thus, the actual process identity can be dynamic.
Strict policy requires newer CRI versions
The CRI runtime (e.g. containerd, CRI-O) plays a central role in calculating the supplementary group IDs to be attached to containers. Thus, supplementalGroupsPolicy: Strict requires a CRI runtime that supports this feature (supplementalGroupsPolicy: Merge works even with CRI runtimes that do not support the feature, because this policy is fully backward compatible).
Here are some CRI runtimes that support this feature, and the versions you need to be running:
- containerd: v2.0 or later
- CRI-O: v1.31 or later
You can check whether the feature is supported in the Node's .status.features.supplementalGroupsPolicy field.
apiVersion: v1
kind: Node
...
status:
features:
supplementalGroupsPolicy: true
The behavioral changes introduced in beta
In the alpha release, when a Pod with supplementalGroupsPolicy: Strict was scheduled to a node that did not support the feature (i.e., .status.features.supplementalGroupsPolicy=false), the Pod's supplemental groups policy silently fell back to Merge.
In v1.33, now that the feature is beta, the policy is enforced more strictly: the kubelet rejects pods whose nodes cannot ensure the specified policy. If your pod is rejected, you will see warning events with reason=SupplementalGroupsPolicyNotSupported like the one below:
apiVersion: v1
kind: Event
...
type: Warning
reason: SupplementalGroupsPolicyNotSupported
message: "SupplementalGroupsPolicy=Strict is not supported in this node"
involvedObject:
apiVersion: v1
kind: Pod
...
Upgrade considerations
If you're already using this feature, especially the supplementalGroupsPolicy: Strict policy, we assume that your cluster's CRI runtimes already support it. In that case, you don't need to worry about the pod rejections described above.
However, if your cluster:
- uses the supplementalGroupsPolicy: Strict policy, but
- its CRI runtimes do NOT yet support the feature (i.e., .status.features.supplementalGroupsPolicy=false),
you need to prepare for the behavioral change (pod rejection) when upgrading your cluster.
We recommend several ways to avoid unexpected pod rejections:
- Upgrade your cluster's CRI runtimes together with Kubernetes, or before upgrading Kubernetes.
- Label your nodes according to whether their CRI runtime supports this feature, and add a node selector to pods that use the Strict policy so they only land on supporting nodes (in this case, you will need to monitor the number of Pending pods instead of pod rejections); see the sketch after this list.
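A minimal sketch of the second approach is shown below. The node label key is hypothetical and would be applied and maintained by your own tooling; the securityContext and image come from the examples above:

apiVersion: v1
kind: Pod
metadata:
  name: strict-policy-pod
spec:
  nodeSelector:
    example.com/supports-supplementalgroupspolicy: "true"   # hypothetical label you manage yourself
  securityContext:
    runAsUser: 1000
    runAsGroup: 3000
    supplementalGroups: [4000]
    supplementalGroupsPolicy: Strict
  containers:
  - name: ctr
    image: registry.k8s.io/e2e-test-images/agnhost:2.45
    command: [ "sh", "-c", "sleep 1h" ]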
Getting involved
This feature is driven by the SIG Node community. Please join us to connect with the community and share your ideas and feedback around the above feature and beyond. We look forward to hearing from you!
How can I learn more?
- Configure a Security Context for a Pod or Container for the further details of
supplementalGroupsPolicy
- KEP-3619: Fine-grained SupplementalGroups control
06 May 2025 6:30pm GMT
05 May 2025
Kubernetes Blog
Kubernetes v1.33: Prevent PersistentVolume Leaks When Deleting out of Order graduates to GA
I am thrilled to announce that the feature to prevent PersistentVolume (or PVs for short) leaks when deleting out of order has graduated to General Availability (GA) in Kubernetes v1.33! This improvement, initially introduced as a beta feature in Kubernetes v1.31, ensures that your storage resources are properly reclaimed, preventing unwanted leaks.
How did reclaim work in previous Kubernetes releases?
PersistentVolumeClaim (or PVC for short) is a user's request for storage. A PVC and a PV are considered Bound when a newly created PV or an existing matching PV is bound to the PVC. The PVs themselves are backed by volumes allocated by the storage backend.
Normally, if the volume is to be deleted, then the expectation is to delete the PVC for a bound PV-PVC pair. However, there are no restrictions on deleting a PV before deleting a PVC.
For a Bound PV-PVC pair, the ordering of PV-PVC deletion determines whether the PV reclaim policy is honored. The reclaim policy is honored if the PVC is deleted first; however, if the PV is deleted prior to deleting the PVC, then the reclaim policy is not exercised. As a result of this behavior, the associated storage asset in the external infrastructure is not removed.
PV reclaim policy with Kubernetes v1.33
With the graduation to GA in Kubernetes v1.33, this issue is now resolved. Kubernetes now reliably honors the configured Delete reclaim policy, even when PVs are deleted before their bound PVCs. This is achieved through the use of finalizers, ensuring that the storage backend releases the allocated storage resource as intended.
How does it work?
For CSI volumes, the new behavior is achieved by adding a finalizer, external-provisioner.volume.kubernetes.io/finalizer, on new and existing PVs. The finalizer is only removed after the storage from the backend is deleted. Addition and removal of the finalizer are handled by the external-provisioner.
Here is an example of a PV with the finalizer; notice the new finalizer in the finalizers list:
kubectl get pv pvc-a7b7e3ba-f837-45ba-b243-dec7d8aaed53 -o yaml
apiVersion: v1
kind: PersistentVolume
metadata:
annotations:
pv.kubernetes.io/provisioned-by: csi.example.driver.com
creationTimestamp: "2021-11-17T19:28:56Z"
finalizers:
- kubernetes.io/pv-protection
- external-provisioner.volume.kubernetes.io/finalizer
name: pvc-a7b7e3ba-f837-45ba-b243-dec7d8aaed53
resourceVersion: "194711"
uid: 087f14f2-4157-4e95-8a70-8294b039d30e
spec:
accessModes:
- ReadWriteOnce
capacity:
storage: 1Gi
claimRef:
apiVersion: v1
kind: PersistentVolumeClaim
name: example-vanilla-block-pvc
namespace: default
resourceVersion: "194677"
uid: a7b7e3ba-f837-45ba-b243-dec7d8aaed53
csi:
driver: csi.example.driver.com
fsType: ext4
volumeAttributes:
storage.kubernetes.io/csiProvisionerIdentity: 1637110610497-8081-csi.example.driver.com
type: CNS Block Volume
volumeHandle: 2dacf297-803f-4ccc-afc7-3d3c3f02051e
persistentVolumeReclaimPolicy: Delete
storageClassName: example-vanilla-block-sc
volumeMode: Filesystem
status:
phase: Bound
The finalizer prevents this PersistentVolume from being removed from the cluster. As stated previously, the finalizer is only removed from the PV object after it is successfully deleted from the storage backend. To learn more about finalizers, please refer to Using Finalizers to Control Deletion.
Similarly, the finalizer kubernetes.io/pv-controller is added to dynamically provisioned in-tree plugin volumes.
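For a dynamically provisioned in-tree volume, the relevant part of the PV might look like this truncated sketch (the PV name is made up; other fields are omitted):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvc-0e38cba3-0000-0000-0000-000000000000   # hypothetical PV name
  finalizers:
  - kubernetes.io/pv-protection
  - kubernetes.io/pv-controller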
Important note
The fix does not apply to statically provisioned in-tree plugin volumes.
How to enable the new behavior?
To take advantage of the new behavior, you must have upgraded your cluster to the v1.33 release of Kubernetes and be running the CSI external-provisioner version 5.0.1 or later. The feature was released as beta in the v1.31 release of Kubernetes, where it was enabled by default.
References
How do I get involved?
The SIG Storage communication channels, including the Kubernetes Slack channel, are great mediums for reaching out to the SIG Storage and migration working group teams.
Special thanks to the following people for the insightful reviews, thorough consideration and valuable contribution:
- Fan Baofa (carlory)
- Jan Šafránek (jsafrane)
- Xing Yang (xing-yang)
- Matthew Wong (wongma7)
Join the Kubernetes Storage Special Interest Group (SIG) if you're interested in getting involved with the design and development of CSI or any part of the Kubernetes Storage system. We're rapidly growing and always welcome new contributors.
05 May 2025 6:30pm GMT
02 May 2025
Kubernetes Blog
Kubernetes v1.33: Mutable CSI Node Allocatable Count
Scheduling stateful applications reliably depends heavily on accurate information about resource availability on nodes. Kubernetes v1.33 introduces an alpha feature called mutable CSI node allocatable count, allowing Container Storage Interface (CSI) drivers to dynamically update the reported maximum number of volumes that a node can handle. This capability significantly enhances the accuracy of pod scheduling decisions and reduces scheduling failures caused by outdated volume capacity information.
Background
Traditionally, Kubernetes CSI drivers report a static maximum volume attachment limit when initializing. However, actual attachment capacities can change during a node's lifecycle for various reasons, such as:
- Manual or external operations attaching/detaching volumes outside of Kubernetes control.
- Dynamically attached network interfaces or specialized hardware (GPUs, NICs, etc.) consuming available slots.
- Multi-driver scenarios, where one CSI driver's operations affect available capacity reported by another.
Static reporting can cause Kubernetes to schedule pods onto nodes that appear to have capacity but don't, leading to pods stuck in a ContainerCreating state.
Dynamically adapting CSI volume limits
With the new feature gate MutableCSINodeAllocatableCount, Kubernetes enables CSI drivers to dynamically adjust and report node attachment capacities at runtime. This ensures that the scheduler has the most accurate, up-to-date view of node capacity.
How it works
When this feature is enabled, Kubernetes supports two mechanisms for updating the reported node volume limits:
- Periodic Updates: CSI drivers specify an interval to periodically refresh the node's allocatable capacity.
- Reactive Updates: An immediate update triggered when a volume attachment fails due to exhausted resources (a ResourceExhausted error).
Enabling the feature
To use this alpha feature, you must enable the MutableCSINodeAllocatableCount feature gate in these components:
- kube-apiserver
- kubelet
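For the kubelet, the gate can be set either with --feature-gates on the command line or in the kubelet configuration file; the kube-apiserver needs --feature-gates=MutableCSINodeAllocatableCount=true on its command line. A minimal kubelet configuration sketch (merge this into your existing configuration rather than using it verbatim):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  MutableCSINodeAllocatableCount: true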
Example CSI driver configuration
Below is an example of configuring a CSI driver to enable periodic updates every 60 seconds:
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
name: example.csi.k8s.io
spec:
nodeAllocatableUpdatePeriodSeconds: 60
This configuration directs the kubelet to call the CSI driver's NodeGetInfo method every 60 seconds, updating the node's allocatable volume count. Kubernetes enforces a minimum update interval of 10 seconds to balance accuracy and resource usage.
Immediate updates on attachment failures
In addition to periodic updates, Kubernetes now reacts to attachment failures. Specifically, if a volume attachment fails with a ResourceExhausted error (gRPC code 8), an immediate update is triggered to correct the allocatable count promptly.
This proactive correction prevents repeated scheduling errors and helps maintain cluster health.
Getting started
To experiment with mutable CSI node allocatable count in your Kubernetes v1.33 cluster:
- Enable the MutableCSINodeAllocatableCount feature gate on the kube-apiserver and kubelet components.
- Update your CSI driver configuration by setting nodeAllocatableUpdatePeriodSeconds.
- Monitor and observe improvements in scheduling accuracy and pod placement reliability.
Next steps
This feature is currently in alpha and the Kubernetes community welcomes your feedback. Test it, share your experiences, and help guide its evolution toward beta and GA stability.
Join discussions in the Kubernetes Storage Special Interest Group (SIG-Storage) to shape the future of Kubernetes storage capabilities.
02 May 2025 6:30pm GMT
01 May 2025
Kubernetes Blog
Kubernetes v1.33: New features in DRA
Kubernetes Dynamic Resource Allocation (DRA) was originally introduced as an alpha feature in the v1.26 release, and then went through a significant redesign for Kubernetes v1.31. The main DRA feature went to beta in v1.32, and the project hopes it will be generally available in Kubernetes v1.34.
The basic feature set of DRA provides a far more powerful and flexible API for requesting devices than Device Plugin. And while DRA remains a beta feature for v1.33, the DRA team has been hard at work implementing a number of new features and UX improvements. One feature has been promoted to beta, while a number of new features have been added in alpha. The team has also made progress towards getting DRA ready for GA.
Features promoted to beta
Driver-owned Resource Claim Status was promoted to beta. This allows the driver to report driver-specific device status data for each allocated device in a resource claim, which is particularly useful for supporting network devices.
New alpha features
Partitionable Devices lets a driver advertise several overlapping logical devices ("partitions"), and the driver can reconfigure the physical device dynamically based on the actual devices allocated. This makes it possible to partition devices on-demand to meet the needs of the workloads and therefore increase the utilization.
Device Taints and Tolerations allow devices to be tainted and for workloads to tolerate those taints. This makes it possible for drivers or cluster administrators to mark devices as unavailable. Depending on the effect of the taint, this can prevent devices from being allocated or cause eviction of pods that are using the device.
Prioritized List lets users specify a list of acceptable devices for their workloads, rather than just a single type of device. So while the workload might run best on a single high-performance GPU, it might also be able to run on 2 mid-level GPUs. The scheduler will attempt to satisfy the alternatives in the list in order, so the workload will be allocated the best set of devices available in the cluster.
Admin Access has been updated so that only users with access to a namespace carrying the resource.k8s.io/admin-access: "true" label are authorized to create ResourceClaim or ResourceClaimTemplate objects with the adminAccess field within that namespace. This grants administrators access to in-use devices and may enable additional permissions when making the device available in a container. This ensures that non-admin users cannot misuse the feature.
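For example, a cluster administrator could mark a dedicated namespace as follows; the namespace name is illustrative, while the label key is the one described above:

apiVersion: v1
kind: Namespace
metadata:
  name: dra-admin                          # illustrative namespace name
  labels:
    resource.k8s.io/admin-access: "true"   # allows adminAccess claims to be created in this namespace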
Preparing for general availability
A new v1beta2 API has been added to simplify the user experience and to prepare for additional features being added in the future. The RBAC rules for DRA have been improved and support has been added for seamless upgrades of DRA drivers.
What's next?
The plan for v1.34 is even more ambitious than for v1.33. Most importantly, we (the Kubernetes device management working group) hope to bring DRA to general availability, which will make it available by default on all v1.34 Kubernetes clusters. This also means that many, perhaps all, of the DRA features that are still beta in v1.34 will become enabled by default, making it much easier to use them.
The alpha features that were added in v1.33 will be brought to beta in v1.34.
Getting involved
A good starting point is joining the WG Device Management Slack channel and meetings, which happen at US/EU and EU/APAC friendly time slots.
Not all enhancement ideas are tracked as issues yet, so come talk to us if you want to help or have some ideas yourself! We have work to do at all levels, from difficult core changes to usability enhancements in kubectl, which could be picked up by newcomers.
Acknowledgments
A huge thanks to everyone who has contributed:
- Cici Huang (cici37)
- Ed Bartosh (bart0sh)
- John Belamaric (johnbelamaric)
- Jon Huhn (nojnhuh)
- Kevin Klues (klueska)
- Morten Torkildsen (mortent)
- Patrick Ohly (pohly)
- Rita Zhang (ritazh)
- Shingo Omura (everpeace)
01 May 2025 6:30pm GMT
30 Apr 2025
Kubernetes Blog
Kubernetes v1.33: Storage Capacity Scoring of Nodes for Dynamic Provisioning (alpha)
Kubernetes v1.33 introduces a new alpha feature called StorageCapacityScoring. This feature adds a scoring method for pod scheduling with topology-aware volume provisioning, making it easier to schedule pods on nodes with either the most or the least available storage capacity.
About this feature
This feature extends the kube-scheduler's VolumeBinding plugin to perform scoring using node storage capacity information obtained from Storage Capacity. Without it, you can only filter out nodes with insufficient storage capacity, so you have to use a scheduler extender to achieve storage-capacity-based pod scheduling.
This feature is useful for provisioning node-local PVs, which have size limits based on the node's storage capacity. By using this feature, you can assign the PVs to the nodes with the most available storage space so that you can expand the PVs later as much as possible.
In another use case, you might want to reduce the number of nodes as much as possible to lower operating costs in cloud environments by choosing the node with the least available storage capacity. This feature helps maximize resource utilization by filling up nodes more sequentially, starting with the most utilized nodes that still have enough storage capacity for the requested volume size.
How to use
Enabling the feature
In the alpha phase, StorageCapacityScoring is disabled by default. To use this feature, add StorageCapacityScoring=true to the kube-scheduler command line option --feature-gates.
Configuration changes
You can configure node priorities based on storage utilization using the shape parameter in the VolumeBinding plugin configuration. This allows you to prioritize nodes with higher available storage capacity (the default) or, conversely, nodes with lower available storage capacity. For example, to prioritize lower available storage capacity, configure KubeSchedulerConfiguration as follows:
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
...
pluginConfig:
- name: VolumeBinding
args:
...
shape:
- utilization: 0
score: 0
- utilization: 100
score: 10
For more details, please refer to the documentation.
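Conversely, a shape that keeps the default preference for nodes with the most available storage capacity would simply reverse the scores. This sketch mirrors the example above and assumes the same scoring range:

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  ...
  pluginConfig:
  - name: VolumeBinding
    args:
      ...
      shape:
      - utilization: 0
        score: 10   # nodes with plenty of free capacity score highest
      - utilization: 100
        score: 0    # nearly full nodes score lowest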
Further reading
Additional note: Relationship with VolumeCapacityPriority
The alpha feature gate VolumeCapacityPriority, which performs node scoring based on available storage capacity during static provisioning, will be deprecated and replaced by StorageCapacityScoring.
Please note that while VolumeCapacityPriority prioritizes nodes with lower available storage capacity by default, StorageCapacityScoring prioritizes nodes with higher available storage capacity by default.
30 Apr 2025 6:30pm GMT
29 Apr 2025
Kubernetes Blog
Kubernetes v1.33: Image Volumes graduate to beta!
Image Volumes were introduced as an Alpha feature with the Kubernetes v1.31 release as part of KEP-4639. In Kubernetes v1.33, this feature graduates to beta.
Please note that the feature is still disabled by default, because not all container runtimes fully support it. CRI-O has supported the initial feature since v1.31 and will add beta-level support for Image Volumes in v1.33. containerd has merged support for the alpha feature, which will be part of its v2.1.0 release, and is working on beta support as part of PR #11578.
What's new
The major change for the beta graduation of Image Volumes is the support for subPath and subPathExpr mounts for containers via spec.containers[*].volumeMounts.[subPath,subPathExpr]. This allows end users to mount a specific subdirectory of an image volume, which is still mounted read-only (and noexec). This means that non-existent subdirectories cannot be mounted by default. As for other subPath and subPathExpr values, Kubernetes ensures that no absolute path or relative path components are part of the specified sub path. Container runtimes are also required to double-check those requirements for safety reasons. If a specified subdirectory does not exist within a volume, runtimes should fail on container creation and provide user feedback by using existing kubelet events.
Besides that, there are also three new kubelet metrics available for image volumes:
- kubelet_image_volume_requested_total: Outlines the number of requested image volumes.
- kubelet_image_volume_mounted_succeed_total: Counts the number of successful image volume mounts.
- kubelet_image_volume_mounted_errors_total: Counts the number of failed image volume mounts.
To use an existing subdirectory for a specific image volume, just use it as the subPath (or subPathExpr) value of the container's volumeMounts:
apiVersion: v1
kind: Pod
metadata:
name: image-volume
spec:
containers:
- name: shell
command: ["sleep", "infinity"]
image: debian
volumeMounts:
- name: volume
mountPath: /volume
subPath: dir
volumes:
- name: volume
image:
reference: quay.io/crio/artifact:v2
pullPolicy: IfNotPresent
Then, create the pod on your cluster:
kubectl apply -f image-volumes-subpath.yaml
Now you can attach to the container:
kubectl attach -it image-volume bash
And check the content of the file from the dir sub path in the volume:
cat /volume/file
The output will be similar to:
1
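If the subdirectory name should come from the Pod's environment instead, subPathExpr can be used the same way. The sketch below reuses the example image and assumes an environment variable named DIR_NAME:

apiVersion: v1
kind: Pod
metadata:
  name: image-volume-subpathexpr
spec:
  containers:
  - name: shell
    command: ["sleep", "infinity"]
    image: debian
    env:
    - name: DIR_NAME
      value: dir                 # assumed variable holding the subdirectory name
    volumeMounts:
    - name: volume
      mountPath: /volume
      subPathExpr: $(DIR_NAME)   # expands to "dir" at container start
  volumes:
  - name: volume
    image:
      reference: quay.io/crio/artifact:v2
      pullPolicy: IfNotPresent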
Thank you for reading through the end of this blog post! SIG Node is proud and happy to deliver this feature graduation as part of Kubernetes v1.33.
As writer of this blog post, I would like to emphasize my special thanks to all involved individuals out there!
If you would like to provide feedback or suggestions feel free to reach out to SIG Node using the Kubernetes Slack (#sig-node) channel or the SIG Node mailing list.
Further reading
29 Apr 2025 6:30pm GMT
28 Apr 2025
Kubernetes Blog
Kubernetes v1.33: HorizontalPodAutoscaler Configurable Tolerance
This post describes configurable tolerance for horizontal Pod autoscaling, a new alpha feature first available in Kubernetes 1.33.
What is it?
Horizontal Pod Autoscaling is a well-known Kubernetes feature that allows your workload to automatically resize by adding or removing replicas based on resource utilization.
Let's say you have a web application running in a Kubernetes cluster with 50 replicas. You configure the HorizontalPodAutoscaler (HPA) to scale based on CPU utilization, with a target of 75% utilization. Now, imagine that the current CPU utilization across all replicas is 90%, which is higher than the desired 75%. The HPA will calculate the required number of replicas using the formula:
\(\text{desiredReplicas} = \left\lceil \text{currentReplicas} \times \frac{\text{currentMetricValue}}{\text{desiredMetricValue}} \right\rceil\)
In this example:
\(\left\lceil 50 \times \frac{90}{75} \right\rceil = 60\)
So, the HPA will increase the number of replicas from 50 to 60 to reduce the load on each pod. Similarly, if the CPU utilization were to drop below 75%, the HPA would scale down the number of replicas accordingly. The Kubernetes documentation provides a detailed description of the scaling algorithm.
In order to avoid replicas being created or deleted whenever a small metric fluctuation occurs, Kubernetes applies a form of hysteresis: it only changes the number of replicas when the current and desired metric values differ by more than 10%. In the example above, since the ratio between the current and desired metric values is \(90/75\), or 20% above target, exceeding the 10% tolerance, the scale-up action will proceed.
This default tolerance of 10% is cluster-wide; in older Kubernetes releases, it could not be fine-tuned. It's a suitable value for most usage, but too coarse for large deployments, where a 10% tolerance represents tens of pods. As a result, the community has long asked to be able to tune this value.
In Kubernetes v1.33, this is now possible.
How do I use it?
After enabling the HPAConfigurableTolerance feature gate in your Kubernetes v1.33 cluster, you can add your desired tolerance to your HorizontalPodAutoscaler object.
Tolerances appear under the spec.behavior.scaleDown and spec.behavior.scaleUp fields and can thus be different for scale up and scale down. A typical usage would be to specify a small tolerance on scale up (to react quickly to spikes), but a higher one on scale down (to avoid adding and removing replicas too quickly in response to small metric fluctuations).
For example, an HPA with a tolerance of 5% on scale-down, and no tolerance on scale-up, would look like the following:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: my-app
spec:
...
behavior:
scaleDown:
tolerance: 0.05
scaleUp:
tolerance: 0
I want all the details!
Get all the technical details by reading KEP-4951 and follow issue 4951 to be notified of the feature graduation.
28 Apr 2025 6:30pm GMT