20 Jan 2023
Kubernetes – Production-Grade Container Orchestration
Blog: Consider All Microservices Vulnerable — And Monitor Their Behavior
Author: David Hadas (IBM Research Labs)
This post warns DevOps against a false sense of security. Following security best practices when developing and configuring microservices does not result in non-vulnerable microservices. The post shows that although all deployed microservices are vulnerable, there is much that can be done to ensure microservices are not exploited. It explains how analyzing the behavior of clients and services from a security standpoint, named here "Security-Behavior Analysis", can protect the deployed vulnerable microservices. It points to Guard, an open source project offering security-behavior monitoring and control of Kubernetes microservices presumed vulnerable.
As cyber attacks continue to intensify in sophistication, organizations deploying cloud services continue to grow their cyber investments aiming to produce safe and non-vulnerable services. However, the year-by-year growth in cyber investments does not result in a parallel reduction in cyber incidents. Instead, the number of cyber incidents continues to grow annually. Evidently, organizations are doomed to fail in this struggle - no matter how much effort is made to detect and remove cyber weaknesses from deployed services, it seems offenders always have the upper hand.
Considering the current spread of offensive tools, sophistication of offensive players, and ever-growing cyber financial gains to offenders, any cyber strategy that relies on constructing a non-vulnerable, weakness-free service in 2023 is clearly too naïve. It seems the only viable strategy is to:
➥ Admit that your services are vulnerable!
In other words, consciously accept that you will never create completely invulnerable services. If your opponents find even a single weakness as an entry-point, you lose! Admitting that in spite of your best efforts, all your services are still vulnerable is an important first step. Next, this post discusses what you can do about it...
How to protect microservices from being exploited
Being vulnerable does not necessarily mean that your service will be exploited. Though your services are vulnerable in some ways unknown to you, offenders still need to identify these vulnerabilities and then exploit them. If offenders fail to exploit your service vulnerabilities, you win! In other words, having a vulnerability that can't be exploited represents a risk that can't be realized.

Figure 1. An Offender gaining foothold in a vulnerable service
The above diagram shows an example in which the offender does not yet have a foothold in the service; that is, it is assumed that your service does not run code controlled by the offender on day 1. In our example the service has vulnerabilities in the API exposed to clients. To gain an initial foothold the offender uses a malicious client to try and exploit one of the service API vulnerabilities. The malicious client sends an exploit that triggers some unplanned behavior of the service.
More specifically, let's assume the service is vulnerable to an SQL injection. The developer failed to sanitize the user input properly, thereby allowing clients to send values that would change the intended behavior. In our example, if a client sends a query string with key "username" and value of "tom or 1=1", the client will receive the data of all users. Exploiting this vulnerability requires the client to send an irregular string as the value. Note that benign users will not be sending a string with spaces or with the equal sign character as a username; instead, they will normally send legal usernames, which for example may be defined as a short sequence of the characters a-z. No legal username can trigger unplanned service behavior.
In this simple example, one can already identify several opportunities to detect and block an attempt to exploit the vulnerability (un)intentionally left behind by the developer, making the vulnerability unexploitable. First, the malicious client behavior differs from the behavior of benign clients, as it sends irregular requests. If such a change in behavior is detected and blocked, the exploit will never reach the service. Second, the service behavior in response to the exploit differs from the service behavior in response to a regular request. Such behavior may include making subsequent irregular calls to other services such as a data store, taking irregular time to respond, and/or responding to the malicious client with an irregular response (for example, containing much more data than normally sent in case of benign clients making regular requests). Service behavioral changes, if detected, will also allow blocking the exploit in different stages of the exploitation attempt.
More generally:
- Monitoring the behavior of clients can help detect and block exploits against service API vulnerabilities. In fact, deploying efficient client behavior monitoring makes many vulnerabilities unexploitable and others very hard to exploit. To succeed, the offender needs to craft an exploit that is indistinguishable from regular requests.
- Monitoring the behavior of services can help detect services as they are being exploited, regardless of the attack vector used. Efficient service behavior monitoring limits what an attacker may be able to achieve, since the offender needs to ensure the service behavior remains indistinguishable from regular service behavior.
Combining both approaches may add a protection layer to the deployed vulnerable services, drastically decreasing the probability that anyone will successfully exploit any of the deployed vulnerable services. Next, let us identify four use cases where you need to use security-behavior monitoring.
Use cases
One can identify the following four different stages in the life of any service from a security standpoint. In each stage, security-behavior monitoring is required to meet different challenges:
Service State | Use case | What do you need in order to cope with this use case? |
---|---|---|
Normal | No known vulnerabilities: The service owner is normally not aware of any known vulnerabilities in the service image or configuration. Yet, it is reasonable to assume that the service has weaknesses. | Provide generic protection against any unknown, zero-day, service vulnerabilities - Detect/block irregular patterns sent as part of incoming client requests that may be used as exploits. |
Vulnerable | An applicable CVE is published: The service owner is required to release a new non-vulnerable revision of the service. Research shows that in practice this process of removing a known vulnerability may take many weeks to accomplish (2 months on average). | Add protection based on the CVE analysis - Detect/block incoming requests that include specific patterns that may be used to exploit the discovered vulnerability. Continue to offer services, although the service has a known vulnerability. |
Exploitable | A known exploit is published: The service owner needs a way to filter incoming requests that contain the known exploit. | Add protection based on a known exploit signature - Detect/block incoming client requests that carry signatures identifying the exploit. Continue to offer services, despite the presence of an exploit. |
Misused | An offender misuses pods backing the service: The offender can follow an attack pattern enabling him/her to misuse pods. The service owner needs to restart any compromised pods while using non-compromised pods to continue offering the service. Note that once a pod is restarted, the offender needs to repeat the attack pattern before he/she may again misuse it. | Identify and restart instances of the component that is being misused - At any given time, some backing pods may be compromised and misused, while others behave as designed. Detect/remove the misused pods while allowing other pods to continue servicing client requests. |
Fortunately, microservice architecture is well suited to security-behavior monitoring as discussed next.
Security-Behavior of microservices versus monoliths
Kubernetes is often used to support workloads designed with microservice architecture. By design, microservices aim to follow the UNIX philosophy of "Do One Thing And Do It Well". Each microservice has a bounded context and a clear interface. In other words, you can expect the microservice clients to send relatively regular requests and the microservice to present a relatively regular behavior as a response to these requests. Consequently, a microservice architecture is an excellent candidate for security-behavior monitoring.

Figure 2. Microservices are well suited for security-behavior monitoring
The diagram above clarifies how dividing a monolithic service into a set of microservices improves our ability to perform security-behavior monitoring and control. In a monolithic service approach, different client requests are intertwined, resulting in a diminished ability to identify irregular client behaviors. Without prior knowledge, an observer of the intertwined client requests will find it hard to distinguish between types of requests and their related characteristics. Further, internal client requests are not exposed to the observer. Lastly, the aggregated behavior of the monolithic service is a compound of the many different internal behaviors of its components, making it hard to identify irregular service behavior.
In a microservice environment, each microservice is expected by design to offer a more well-defined service and serve better-defined types of requests. This makes it easier for an observer to identify irregular client behavior and irregular service behavior. Further, a microservice design exposes the internal requests and internal services, which offers an observer more security-behavior data for identifying irregularities. Overall, this makes the microservice design pattern better suited for security-behavior monitoring and control.
Security-Behavior monitoring on Kubernetes
Kubernetes deployments seeking to add Security-Behavior may use Guard, developed under the CNCF project Knative. Guard is integrated into the full Knative automation suite that runs on top of Kubernetes. Alternatively, you can deploy Guard as a standalone tool to protect any HTTP-based workload on Kubernetes.
See:
- Guard on Github, for using Guard as a standalone tool.
- The Knative automation suite - Read about Knative in the blog post Opinionated Kubernetes, which describes how Knative simplifies and unifies the way web services are deployed on Kubernetes.
- You may contact Guard maintainers on the SIG Security Slack channel or on the Knative community security Slack channel. The Knative community channel will move soon to the CNCF Slack under the name #knative-security.
The goal of this post is to invite the Kubernetes community to action and to introduce Security-Behavior monitoring and control to help secure Kubernetes-based deployments. Hopefully, as a follow up, the community will:
- Analyze the cyber challenges presented for different Kubernetes use cases
- Add appropriate security documentation for users on how to introduce Security-Behavior monitoring and control.
- Consider how to integrate with tools that can help users monitor and control their vulnerable services.
Getting involved
You are welcome to get involved and join the effort to develop security behavior monitoring and control for Kubernetes; to share feedback and contribute to code or documentation; and to make or suggest improvements of any kind.
20 Jan 2023 12:00am GMT
12 Jan 2023
Kubernetes – Production-Grade Container Orchestration
Blog: Protect Your Mission-Critical Pods From Eviction With PriorityClass
Author: Sunny Bhambhani (InfraCloud Technologies)
Kubernetes has been widely adopted, and many organizations use it as their de-facto orchestration engine for running workloads that need to be created and deleted frequently.
Therefore, proper scheduling of the pods is key to ensuring that application pods are up and running within the Kubernetes cluster without any issues. This article delves into the use cases around resource management by leveraging the PriorityClass object to protect mission-critical or high-priority pods from getting evicted and making sure that the application pods are up, running, and serving traffic.
Resource management in Kubernetes
The control plane consists of multiple components. One of them is the scheduler (usually the built-in kube-scheduler), which is responsible for assigning a node to a pod.
Whenever a pod is created, it enters a "pending" state, after which the scheduler determines which node is best suited for the placement of the new pod.
In the background, the scheduler runs as an infinite loop looking for pods without a nodeName set that are ready for scheduling. For each Pod that needs scheduling, the scheduler tries to decide which node should run that Pod. If the scheduler cannot find any node, the pod remains in the pending state, which is not ideal.
nodeSelector, taints and tolerations, nodeAffinity, the rank of nodes based on available resources (for example, CPU and memory), and several other criteria are used to determine the pod's placement.
The below diagram, from point number 1 through 4, explains the request flow:
Scheduling in Kubernetes
Typical use cases
Below are some real-life scenarios where control over the scheduling and eviction of pods may be required.
- Let's say the pod you plan to deploy is critical, and you have some resource constraints. An example would be the DaemonSet of an infrastructure component like Grafana Loki. The Loki pods must run on every node before other pods can. In such cases, you could ensure resource availability by manually identifying and deleting the pods that are not required, or by adding a new node to the cluster. Both of these approaches are unsuitable, since the former would be tedious to execute and the latter could involve an expenditure of time and money.
- Another use case could be a single cluster that holds the pods for the below environments with associated priorities:
  - Production (prod): top priority
  - Preproduction (preprod): intermediate priority
  - Development (dev): least priority
In the event of high resource consumption in the cluster, there is competition for CPU and memory resources on the nodes. While cluster-level autoscaling may add more nodes, it takes time. In the interim, if there are no further nodes to scale the cluster, some Pods could remain in a Pending state, or the service could be degraded as they compete for resources. If the kubelet does evict a Pod from the node, that eviction would be random because the kubelet doesn't have any special information about which Pods to evict and which to keep.
- A third example could be a microservice backed by a queuing application or a database running into a resource crunch and the queue or database getting evicted. In such a case, all the other services would be rendered useless until the database can serve traffic again.
There can also be other scenarios where you want to control the order of scheduling or order of eviction of pods.
PriorityClasses in Kubernetes
PriorityClass is a cluster-wide API object in Kubernetes and part of the scheduling.k8s.io/v1 API group. It contains a mapping of the PriorityClass name (defined in .metadata.name) and an integer value (defined in .value). This represents the value that the scheduler uses to determine a Pod's relative priority.
Additionally, when you create a cluster using kubeadm or a managed Kubernetes service (for example, Azure Kubernetes Service), Kubernetes uses PriorityClasses to safeguard the pods that are hosted on the control plane nodes. This ensures that critical cluster components such as CoreDNS and kube-proxy can run even if resources are constrained.
This availability of pods is achieved through the use of a special PriorityClass that ensures the pods are up and running and that the overall cluster is not affected.
$ kubectl get priorityclass
NAME VALUE GLOBAL-DEFAULT AGE
system-cluster-critical 2000000000 false 82m
system-node-critical 2000001000 false 82m
The diagram below shows exactly how it works with the help of an example, which will be detailed in the upcoming section.
Pod scheduling and preemption
Pod priority and preemption
Pod preemption is a Kubernetes feature that allows the cluster to preempt pods (removing an existing Pod in favor of a new Pod) on the basis of priority. Pod priority indicates the importance of a pod relative to other pods while scheduling. If there aren't enough resources to run all the current pods, the scheduler tries to evict lower-priority pods over high-priority ones.
Also, when a healthy cluster experiences a node failure, typically, lower-priority pods get preempted to create room for higher-priority pods on the available node. This happens even if the cluster can bring up a new node automatically since pod creation is usually much faster than bringing up a new node.
PriorityClass requirements
Before you set up PriorityClasses, there are a few things to consider.
- Decide which PriorityClasses are needed. For instance, based on environment, type of pods, type of applications, etc.
- The default PriorityClass resource for your cluster. The pods without a priorityClassName will be treated as priority 0.
- Use a consistent naming convention for all PriorityClasses.
- Make sure that the pods for your workloads are running with the right PriorityClass.
PriorityClass hands-on example
Let's say there are 3 application pods: one for prod, one for preprod, and one for development. Below are three sample YAML manifest files for each of those.
---
# development
apiVersion: v1
kind: Pod
metadata:
name: dev-nginx
labels:
env: dev
spec:
containers:
- name: dev-nginx
image: nginx
resources:
requests:
memory: "256Mi"
cpu: "0.2"
limits:
memory: ".5Gi"
cpu: "0.5"
---
# preproduction
apiVersion: v1
kind: Pod
metadata:
name: preprod-nginx
labels:
env: preprod
spec:
containers:
- name: preprod-nginx
image: nginx
resources:
requests:
memory: "1.5Gi"
cpu: "1.5"
limits:
memory: "2Gi"
cpu: "2"
---
# production
apiVersion: v1
kind: Pod
metadata:
name: prod-nginx
labels:
env: prod
spec:
containers:
- name: prod-nginx
image: nginx
resources:
requests:
memory: "2Gi"
cpu: "2"
limits:
memory: "2Gi"
cpu: "2"
You can create these pods with the kubectl create -f <FILE.yaml> command, and then check their status using the kubectl get pods command. You can see if they are up and look ready to serve traffic:
$ kubectl get pods --show-labels
NAME READY STATUS RESTARTS AGE LABELS
dev-nginx 1/1 Running 0 55s env=dev
preprod-nginx 1/1 Running 0 55s env=preprod
prod-nginx 0/1 Pending 0 55s env=prod
Bad news. The pod for the Production environment is still Pending and isn't serving any traffic.
Let's see why this is happening:
$ kubectl get events
...
...
5s Warning FailedScheduling pod/prod-nginx 0/2 nodes are available: 1 Insufficient cpu, 2 Insufficient memory.
In this example, there is only one worker node, and that node has a resource crunch.
Now, let's look at how PriorityClass can help in this situation since prod should be given higher priority than the other environments.
PriorityClass API
Before creating PriorityClasses based on these requirements, let's see what a basic manifest for a PriorityClass looks like and outline some prerequisites:
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: PRIORITYCLASS_NAME
value: 0 # any integer value between -1000000000 and 1000000000
description: >-
(Optional) description goes here!
globalDefault: false # or true. Only one PriorityClass can be the global default.
Below are some prerequisites for PriorityClasses:
- The name of a PriorityClass must be a valid DNS subdomain name.
- When you make your own PriorityClass, the name should not start with system-, as those names are reserved by Kubernetes itself (for example, they are used for two built-in PriorityClasses).
- Its absolute value should be between -1000000000 and 1000000000 (1 billion).
- Larger numbers are reserved by PriorityClasses such as system-cluster-critical (this Pod is critically important to the cluster) and system-node-critical (the node critically relies on this Pod). system-node-critical is a higher priority than system-cluster-critical, because a cluster-critical Pod can only work well if the node where it is running has all its node-level critical requirements met.
- There are two optional fields:
  - globalDefault: When true, this PriorityClass is used for pods where a priorityClassName is not specified. Only one PriorityClass with globalDefault set to true can exist in a cluster. If there is no PriorityClass defined with globalDefault set to true, all the pods with no priorityClassName defined will be treated with 0 priority (i.e. the least priority).
  - description: A string with a meaningful value so that people know when to use this PriorityClass.

Setting globalDefault to true does not mean it will apply to existing pods that are already running. It applies only to pods that came into existence after the PriorityClass was created.

PriorityClass in action
Here's an example. Next, create some environment-specific PriorityClasses:
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: dev-pc
value: 1000000
globalDefault: false
description: >-
(Optional) This priority class should only be used for all development pods.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: preprod-pc
value: 2000000
globalDefault: false
description: >-
(Optional) This priority class should only be used for all preprod pods.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: prod-pc
value: 4000000
globalDefault: false
description: >-
(Optional) This priority class should only be used for all prod pods.
Use the kubectl create -f <FILE.YAML> command to create a PriorityClass, and kubectl get pc to check its status.
$ kubectl get pc
NAME VALUE GLOBAL-DEFAULT AGE
dev-pc 1000000 false 3m13s
preprod-pc 2000000 false 2m3s
prod-pc 4000000 false 7s
system-cluster-critical 2000000000 false 82m
system-node-critical 2000001000 false 82m
The new PriorityClasses are in place now. A small change is needed in the pod manifest or pod template (in a ReplicaSet or Deployment). In other words, you need to specify the priority class name at .spec.priorityClassName (which is a string value).
First update the previous production pod manifest file to have a PriorityClass assigned, then delete the Production pod and recreate it. You can't edit the priority class for a Pod that already exists.
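For illustration, here is a sketch of what the updated production manifest could look like; it mirrors the earlier prod-nginx manifest, and only the priorityClassName field is new (the value prod-pc refers to the PriorityClass created above):
---
# production (updated with a PriorityClass)
apiVersion: v1
kind: Pod
metadata:
  name: prod-nginx
  labels:
    env: prod
spec:
  priorityClassName: prod-pc # reference the PriorityClass created earlier
  containers:
  - name: prod-nginx
    image: nginx
    resources:
      requests:
        memory: "2Gi"
        cpu: "2"
      limits:
        memory: "2Gi"
        cpu: "2"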
In my cluster, when I tried this, here's what happened. First, that change seems successful; the status of pods has been updated:
$ kubectl get pods --show-labels
NAME READY STATUS RESTARTS AGE LABELS
dev-nginx 1/1 Terminating 0 55s env=dev
preprod-nginx 1/1 Running 0 55s env=preprod
prod-nginx 0/1 Pending 0 55s env=prod
The dev-nginx pod is getting terminated. Once that is successfully terminated and there are enough resources for the prod pod, the control plane can schedule the prod pod:
Warning FailedScheduling pod/prod-nginx 0/2 nodes are available: 1 Insufficient cpu, 2 Insufficient memory.
Normal Preempted pod/dev-nginx by default/prod-nginx on node node01
Normal Killing pod/dev-nginx Stopping container dev-nginx
Normal Scheduled pod/prod-nginx Successfully assigned default/prod-nginx to node01
Normal Pulling pod/prod-nginx Pulling image "nginx"
Normal Pulled pod/prod-nginx Successfully pulled image "nginx"
Normal Created pod/prod-nginx Created container prod-nginx
Normal Started pod/prod-nginx Started container prod-nginx
Enforcement
When you set up PriorityClasses, they exist exactly as you defined them. However, people (and tools) that make changes to your cluster are free to set any PriorityClass, or to not set any PriorityClass at all. Nevertheless, you can use other Kubernetes features to make sure that the priorities you wanted are actually applied.
As an alpha feature, you can define a ValidatingAdmissionPolicy and a ValidatingAdmissionPolicyBinding so that, for example, Pods that go into the prod namespace must use the prod-pc PriorityClass. With another ValidatingAdmissionPolicyBinding you ensure that the preprod namespace uses the preprod-pc PriorityClass, and so on. In any cluster, you can enforce similar controls using external projects such as Kyverno or Gatekeeper, through validating admission webhooks.
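As a rough sketch (not a drop-in manifest), a pair of objects along these lines could require the prod-pc PriorityClass for Pods created in the prod namespace; the object names, the CEL expression, and the use of the kubernetes.io/metadata.name namespace label are assumptions for this example, and the v1alpha1 API must be enabled in your cluster:
apiVersion: admissionregistration.k8s.io/v1alpha1
kind: ValidatingAdmissionPolicy
metadata:
  name: require-prod-priorityclass # hypothetical name
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups: [""]
      apiVersions: ["v1"]
      operations: ["CREATE", "UPDATE"]
      resources: ["pods"]
  validations:
  - expression: "object.spec.priorityClassName == 'prod-pc'"
    message: "Pods in the prod namespace must use the prod-pc PriorityClass."
---
apiVersion: admissionregistration.k8s.io/v1alpha1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: require-prod-priorityclass-binding # hypothetical name
spec:
  policyName: require-prod-priorityclass
  matchResources:
    namespaceSelector:
      matchLabels:
        kubernetes.io/metadata.name: prod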
However you do it, Kubernetes gives you options to make sure that the PriorityClasses are used how you wanted them to be, or perhaps just to warn users when they pick an unsuitable option.
Summary
The above example and its events show you what this feature of Kubernetes brings to the table, along with several scenarios where you can use this feature. To reiterate, this helps ensure that mission-critical pods are up and available to serve the traffic and, in the case of a resource crunch, determines cluster behavior.
It gives you some power to decide the order of scheduling and order of preemption for Pods. Therefore, you need to define the PriorityClasses sensibly. For example, if you have a cluster autoscaler to add nodes on demand, make sure to run it with the system-cluster-critical PriorityClass. You don't want to get in a situation where the autoscaler has been preempted and there are no new nodes coming online.
If you have any queries or feedback, feel free to reach out to me on LinkedIn.
12 Jan 2023 12:00am GMT
06 Jan 2023
Kubernetes – Production-Grade Container Orchestration
Blog: Kubernetes 1.26: Eviction policy for unhealthy pods guarded by PodDisruptionBudgets
Authors: Filip Křepinský (Red Hat), Morten Torkildsen (Google), Ravi Gudimetla (Apple)
Ensuring that disruptions to your applications do not affect their availability isn't a simple task. Last month's release of Kubernetes v1.26 lets you specify an unhealthy pod eviction policy for PodDisruptionBudgets (PDBs) to help you maintain that availability during node management operations. In this article, we will dive deeper into what modifications were introduced for PDBs to give application owners greater flexibility in managing disruptions.
What problems does this solve?
API-initiated eviction of pods respects PodDisruptionBudgets (PDBs). This means that a requested voluntary disruption via an eviction of a Pod should not disrupt a guarded application, and .status.currentHealthy of a PDB should not fall below .status.desiredHealthy. Running pods that are Unhealthy do not count towards the PDB status, but eviction of these is only possible in case the application is not disrupted. This helps disrupted or not-yet-started applications to achieve availability as soon as possible without additional downtime that would be caused by evictions.
Unfortunately, this poses a problem for cluster administrators that would like to drain nodes without any manual intervention. Misbehaving applications with pods in a CrashLoopBackOff state (due to a bug or misconfiguration), or pods that are simply failing to become ready, make this task much harder. Any eviction request will fail due to violation of a PDB when all pods of an application are unhealthy. Draining of a node cannot make any progress in that case.
On the other hand there are users that depend on the existing behavior, in order to:
- prevent data-loss that would be caused by deleting pods that are guarding an underlying resource or storage
- achieve the best availability possible for their application
Kubernetes 1.26 introduced a new experimental field to the PodDisruptionBudget API: .spec.unhealthyPodEvictionPolicy. When enabled, this field lets you support both of those requirements.
How does it work?
API-initiated eviction is the process that triggers graceful pod termination. The process can be initiated either by calling the API directly, by using a kubectl drain command, or by other actors in the cluster. During this process every pod removal is consulted with the appropriate PDBs, to ensure that a sufficient number of pods is always running in the cluster.
The following policies allow PDB authors to have greater control over how the process deals with unhealthy pods.
There are two policies, IfHealthyBudget and AlwaysAllow, to choose from.
The former, IfHealthyBudget, follows the existing behavior to achieve the best availability that you get by default. Unhealthy pods can be disrupted only if their application has a minimum available .status.desiredHealthy number of pods.
By setting the spec.unhealthyPodEvictionPolicy field of your PDB to AlwaysAllow, you are choosing the best effort availability for your application. With this policy it is always possible to evict unhealthy pods. This will make it easier to maintain and upgrade your clusters.
We think that AlwaysAllow will often be a better choice, but for some critical workloads you may still prefer to protect even unhealthy Pods from node drains or other forms of API-initiated eviction.
How do I use it?
This is an alpha feature, which means you have to enable the PDBUnhealthyPodEvictionPolicy feature gate, with the command line argument --feature-gates=PDBUnhealthyPodEvictionPolicy=true to the kube-apiserver.
Here's an example. Assume that you've enabled the feature gate in your cluster, and that you already defined a Deployment that runs a plain webserver. You labelled the Pods for that Deployment with app: nginx. You want to limit avoidable disruption, and you know that best effort availability is sufficient for this app. You decide to allow evictions even if those webserver pods are unhealthy. You create a PDB to guard this application, with the AlwaysAllow policy for evicting unhealthy pods:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: nginx-pdb
spec:
selector:
matchLabels:
app: nginx
maxUnavailable: 1
unhealthyPodEvictionPolicy: AlwaysAllow
How can I learn more?
- Read the KEP: Unhealthy Pod Eviction Policy for PDBs
- Read the documentation: Unhealthy Pod Eviction Policy for PodDisruptionBudgets
- Review the Kubernetes documentation for PodDisruptionBudgets, draining of Nodes and evictions
How do I get involved?
If you have any feedback, please reach out to us in the #sig-apps channel on Slack (visit https://slack.k8s.io/ for an invitation if you need one), or on the SIG Apps mailing list: kubernetes-sig-apps@googlegroups.com
06 Jan 2023 12:00am GMT
05 Jan 2023
Kubernetes – Production-Grade Container Orchestration
Blog: Kubernetes 1.26: Retroactive Default StorageClass
Author: Roman Bednář (Red Hat)
The v1.25 release of Kubernetes introduced an alpha feature to change how a default StorageClass was assigned to a PersistentVolumeClaim (PVC). With the feature enabled, you no longer need to create a default StorageClass first and PVC second to assign the class. Additionally, any PVCs without a StorageClass assigned can be updated later. This feature was graduated to beta in Kubernetes 1.26.
You can read about retroactive default StorageClass assignment in the Kubernetes documentation for more details about how to use it, or you can read on to learn about why the Kubernetes project is making this change.
Why did StorageClass assignment need improvements
Users might already be familiar with a similar feature that assigns default StorageClasses to new PVCs at the time of creation. This is currently handled by the admission controller.
But what if there wasn't a default StorageClass defined at the time of PVC creation? Users would end up with a PVC that would never be assigned a class. As a result, no storage would be provisioned, and the PVC would be somewhat "stuck" at this point. Generally, two main scenarios could result in "stuck" PVCs and cause problems later down the road. Let's take a closer look at each of them.
Changing default StorageClass
Before this feature, there were two options admins had when they wanted to change the default StorageClass:
- Creating a new StorageClass as default before removing the old one associated with the PVC. This would result in having two defaults for a short period. At this point, if a user were to create a PersistentVolumeClaim with storageClassName set to null (implying default StorageClass), the newest default StorageClass would be chosen and assigned to this PVC.
- Removing the old default first and creating a new default StorageClass. This would result in having no default for a short time. Subsequently, if a user were to create a PersistentVolumeClaim with storageClassName set to null (implying default StorageClass), the PVC would be in Pending state forever. The user would have to fix this by deleting the PVC and recreating it once the default StorageClass was available.
Resource ordering during cluster installation
If a cluster installation tool needed to create resources that required storage, for example, an image registry, it was difficult to get the ordering right. This is because any Pods that required storage would rely on the presence of a default StorageClass and would fail to be created if it wasn't defined.
What changed
We've changed the PersistentVolume (PV) controller to assign a default StorageClass to any unbound PersistentVolumeClaim that has the storageClassName set to null. We've also modified the PersistentVolumeClaim admission within the API server to allow the change of values from an unset value to an actual StorageClass name.
Null storageClassName versus storageClassName: "" - does it matter?
Before this feature was introduced, those values were equal in terms of behavior. Any PersistentVolumeClaim with the storageClassName set to null or "" would bind to an existing PersistentVolume resource with storageClassName also set to null or "".
With this new feature enabled we wanted to maintain this behavior but also be able to update the StorageClass name. With these constraints in mind, the feature changes the semantics of null. If a default StorageClass is present, null would translate to "Give me a default" and "" would mean "Give me PersistentVolume that also has "" StorageClass name." In the absence of a StorageClass, the behavior would remain unchanged.
Summarizing the above, we've changed the semantics of null so that its behavior depends on the presence or absence of a definition of default StorageClass.
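As a concrete illustration (the claim names are made up), the first claim below leaves storageClassName unset (null) and can be retroactively assigned a default StorageClass, while the second explicitly asks for no class and is left alone:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-wants-default # storageClassName unset (null): eligible for the default class
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-no-class # storageClassName: "" explicitly opts out of any StorageClass
spec:
  storageClassName: ""
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi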
The tables below show all these cases to better describe when PVC binds and when its StorageClass gets updated.
Default class | PV | PVC storageClassName = "" | PVC storageClassName = null |
---|---|---|---|
Without default class | PV storageClassName = "" | binds | binds |
Without default class | PV without storageClassName | binds | binds |
With default class | PV storageClassName = "" | binds | class updates |
With default class | PV without storageClassName | binds | class updates |
How to use it
If you want to test the feature whilst it's alpha, you need to enable the relevant feature gate in the kube-controller-manager and the kube-apiserver. Use the --feature-gates command line argument:
--feature-gates="...,RetroactiveDefaultStorageClass=true"
Test drive
If you would like to see the feature in action and verify it works fine in your cluster here's what you can try:
- Define a basic PersistentVolumeClaim:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-1
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
- Create the PersistentVolumeClaim when there is no default StorageClass. The PVC won't provision or bind (unless there is an existing, suitable PV already present) and will remain in Pending state.
$ kc get pvc
NAME    STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
pvc-1   Pending
- Configure one StorageClass as default.
$ kc patch sc -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
storageclass.storage.k8s.io/my-storageclass patched
- Verify that the PersistentVolumeClaim is now provisioned correctly and was updated retroactively with the new default StorageClass.
$ kc get pvc
NAME    STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
pvc-1   Bound    pvc-06a964ca-f997-4780-8627-b5c3bf5a87d8   1Gi        RWO            my-storageclass   87m
New metrics
To help you see that the feature is working as expected we also introduced a new retroactive_storageclass_total metric to show how many times the PV controller attempted to update a PersistentVolumeClaim, and retroactive_storageclass_errors_total to show how many of those attempts failed.
Getting involved
We always welcome new contributors so if you would like to get involved you can join our Kubernetes Storage Special-Interest-Group (SIG).
If you would like to share feedback, you can do so on our public Slack channel.
Special thanks to all the contributors that provided great reviews, shared valuable insight and helped implement this feature (alphabetical order):
- Deep Debroy (ddebroy)
- Divya Mohan (divya-mohan0209)
- Jan Šafránek (jsafrane)
- Joe Betz (jpbetz)
- Jordan Liggitt (liggitt)
- Michelle Au (msau42)
- Seokho Son (seokho-son)
- Shannon Kularathna (shannonxtreme)
- Tim Bannister (sftim)
- Tim Hockin (thockin)
- Wojciech Tyczynski (wojtek-t)
- Xing Yang (xing-yang)
05 Jan 2023 12:00am GMT
02 Jan 2023
Kubernetes – Production-Grade Container Orchestration
Blog: Kubernetes v1.26: Alpha support for cross-namespace storage data sources
Author: Takafumi Takahashi (Hitachi Vantara)
Kubernetes v1.26, released last month, introduced an alpha feature that lets you specify a data source for a PersistentVolumeClaim, even where the source data belongs to a different namespace. With the new feature enabled, you specify a namespace in the dataSourceRef field of a new PersistentVolumeClaim. Once Kubernetes checks that access is OK, the new PersistentVolume can populate its data from the storage source specified in that other namespace. Before Kubernetes v1.26, provided your cluster had the AnyVolumeDataSource feature enabled, you could already provision new volumes from a data source in the same namespace. However, that only worked for a data source in the same namespace, therefore users couldn't provision a PersistentVolume with a claim in one namespace from a data source in another namespace. To solve this problem, Kubernetes v1.26 added a new alpha namespace field to the dataSourceRef field in the PersistentVolumeClaim API.
How it works
Once the csi-provisioner finds that a data source is specified with a dataSourceRef that has a non-empty namespace name, it checks all reference grants within the namespace that's specified by the .spec.dataSourceRef.namespace field of the PersistentVolumeClaim, in order to see if access to the data source is allowed. If any ReferenceGrant allows access, the csi-provisioner provisions a volume from the data source.
Trying it out
The following things are required to use cross namespace volume provisioning:
- Enable the AnyVolumeDataSource and CrossNamespaceVolumeDataSource feature gates for the kube-apiserver and kube-controller-manager
- Install a CRD for the specific VolumeSnapShot controller
- Install the CSI Provisioner controller and enable the CrossNamespaceVolumeDataSource feature gate
- Install the CSI driver
- Install a CRD for ReferenceGrants
Putting it all together
To see how this works, you can install the sample and try it out. This sample creates a PVC in the dev namespace from a VolumeSnapshot in the prod namespace. That is a simple example. For real world use, you might want to use a more complex approach.
Assumptions for this example
- Your Kubernetes cluster was deployed with the AnyVolumeDataSource and CrossNamespaceVolumeDataSource feature gates enabled
- There are two namespaces, dev and prod
- CSI driver is being deployed
- There is an existing VolumeSnapshot named new-snapshot-demo in the prod namespace
- The ReferenceGrant CRD (from the Gateway API project) is already deployed
Grant ReferenceGrants read permission to the CSI Provisioner
Access to ReferenceGrants is only needed when the CSI driver has the CrossNamespaceVolumeDataSource controller capability. For this example, the external-provisioner needs get, list, and watch permissions for referencegrants (API group gateway.networking.k8s.io).
- apiGroups: ["gateway.networking.k8s.io"]
resources: ["referencegrants"]
verbs: ["get", "list", "watch"]
Enable the CrossNamespaceVolumeDataSource feature gate for the CSI Provisioner
Add --feature-gates=CrossNamespaceVolumeDataSource=true to the csi-provisioner command line. For example, use this manifest snippet to redefine the container:
- args:
- -v=5
- --csi-address=/csi/csi.sock
- --feature-gates=Topology=true
- --feature-gates=CrossNamespaceVolumeDataSource=true
image: csi-provisioner:latest
imagePullPolicy: IfNotPresent
name: csi-provisioner
Create a ReferenceGrant
Here's a manifest for an example ReferenceGrant.
apiVersion: gateway.networking.k8s.io/v1beta1
kind: ReferenceGrant
metadata:
name: allow-prod-pvc
namespace: prod
spec:
from:
- group: ""
kind: PersistentVolumeClaim
namespace: dev
to:
- group: snapshot.storage.k8s.io
kind: VolumeSnapshot
name: new-snapshot-demo
Create a PersistentVolumeClaim by using cross namespace data source
Kubernetes creates a PersistentVolumeClaim on dev and the CSI driver populates the PersistentVolume used on dev from snapshots on prod.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: example-pvc
namespace: dev
spec:
storageClassName: example
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 1Gi
dataSourceRef:
apiGroup: snapshot.storage.k8s.io
kind: VolumeSnapshot
name: new-snapshot-demo
namespace: prod
volumeMode: Filesystem
How can I learn more?
The enhancement proposal, Provision volumes from cross-namespace snapshots, includes lots of detail about the history and technical implementation of this feature.
Please get involved by joining the Kubernetes Storage Special Interest Group (SIG) to help us enhance this feature. There are a lot of good ideas already and we'd be thrilled to have more!
Acknowledgments
It takes a wonderful group to make wonderful software. Special thanks to the following people for the insightful reviews, thorough consideration and valuable contribution to the CrossNamespaceVolumeDataSource feature:
- Michelle Au (msau42)
- Xing Yang (xing-yang)
- Masaki Kimura (mkimuram)
- Tim Hockin (thockin)
- Ben Swartzlander (bswartz)
- Rob Scott (robscott)
- John Griffith (j-griffith)
- Michael Henriksen (mhenriks)
- Mustafa Elbehery (Elbehery)
It's been a joy to work with y'all on this.
02 Jan 2023 12:00am GMT
30 Dec 2022
Kubernetes – Production-Grade Container Orchestration
Blog: Kubernetes v1.26: Advancements in Kubernetes Traffic Engineering
Authors: Andrew Sy Kim (Google)
Kubernetes v1.26 includes significant advancements in network traffic engineering with the graduation of two features (Service internal traffic policy support, and EndpointSlice terminating conditions) to GA, and a third feature (Proxy terminating endpoints) to beta. The combination of these enhancements aims to address shortcomings in traffic engineering that people face today, and unlock new capabilities for the future.
Traffic Loss from Load Balancers During Rolling Updates
Prior to Kubernetes v1.26, clusters could experience loss of traffic from Service load balancers during rolling updates when setting the externalTrafficPolicy field to Local. There are a lot of moving parts at play here, so a quick overview of how Kubernetes manages load balancers might help!
In Kubernetes, you can create a Service with type: LoadBalancer to expose an application externally with a load balancer. The load balancer implementation varies between clusters and platforms, but the Service provides a generic abstraction representing the load balancer that is consistent across all Kubernetes installations.
apiVersion: v1
kind: Service
metadata:
name: my-service
spec:
selector:
app.kubernetes.io/name: my-app
ports:
- protocol: TCP
port: 80
targetPort: 9376
type: LoadBalancer
Under the hood, Kubernetes allocates a NodePort for the Service, which is then used by kube-proxy to provide a network data path from the NodePort to the Pod. A controller will then add all available Nodes in the cluster to the load balancer's backend pool, using the designated NodePort for the Service as the backend target port.

Figure 1: Overview of Service load balancers
Oftentimes it is beneficial to set externalTrafficPolicy: Local for Services, to avoid extra hops between Nodes that are not running healthy Pods backing that Service. When using externalTrafficPolicy: Local, an additional NodePort is allocated for health checking purposes, such that Nodes that do not contain healthy Pods are excluded from the backend pool for a load balancer.

Figure 2: Load balancer traffic to a healthy Node, when externalTrafficPolicy is Local
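To opt in, you set the field directly on the Service spec. A minimal sketch, extending the my-service example above:
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app.kubernetes.io/name: my-app
  ports:
  - protocol: TCP
    port: 80
    targetPort: 9376
  type: LoadBalancer
  externalTrafficPolicy: Local # only Nodes with healthy local Pods receive load balancer traffic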
One such scenario where traffic can be lost is when a Node loses all Pods for a Service, but the external load balancer has not probed the health check NodePort yet. The likelihood of this situation is largely dependent on the health checking interval configured on the load balancer. The larger the interval, the more likely this will happen, since the load balancer will continue to send traffic to a node even after kube-proxy has removed forwarding rules for that Service. This also occurs when Pods start terminating during rolling updates. Since Kubernetes does not consider terminating Pods as "Ready", traffic can be lost when there are only terminating Pods on any given Node during a rolling update.

Figure 3: Load balancer traffic to terminating endpoints, when externalTrafficPolicy is Local
Starting in Kubernetes v1.26, kube-proxy enables the ProxyTerminatingEndpoints feature by default, which adds automatic failover and routing to terminating endpoints in scenarios where the traffic would otherwise be dropped. More specifically, when there is a rolling update and a Node only contains terminating Pods, kube-proxy will route traffic to the terminating Pods based on their readiness. In addition, kube-proxy will actively fail the health check NodePort if there are only terminating Pods available. By doing so, kube-proxy alerts the external load balancer that new connections should not be sent to that Node but will gracefully handle requests for existing connections.

Figure 4: Load Balancer traffic to terminating endpoints with ProxyTerminatingEndpoints enabled, when externalTrafficPolicy is Local
EndpointSlice Conditions
In order to support this new capability in kube-proxy, the EndpointSlice API introduced new conditions for endpoints: serving and terminating.

Figure 5: Overview of EndpointSlice conditions
The serving condition is semantically identical to ready, except that it can be true or false while a Pod is terminating, unlike ready which will always be false for terminating Pods for compatibility reasons. The terminating condition is true for Pods undergoing termination (non-empty deletionTimestamp), false otherwise.
The addition of these two conditions enables consumers of this API to understand Pod states that were previously not possible. For example, we can now track "ready" and "not ready" Pods that are also terminating.

Figure 6: EndpointSlice conditions with a terminating Pod
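For illustration, an EndpointSlice endpoint carrying these conditions for a terminating but still serving Pod might look roughly like this (the slice name, Service name, and address are made up):
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: my-service-abc12 # hypothetical, generated name
  labels:
    kubernetes.io/service-name: my-service
addressType: IPv4
ports:
- name: http
  protocol: TCP
  port: 9376
endpoints:
- addresses:
  - "10.1.2.3"
  conditions:
    ready: false # always false once the Pod is terminating
    serving: true # the Pod still passes its readiness probe
    terminating: true # the Pod has a non-empty deletionTimestamp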
Consumers of the EndpointSlice API, such as kube-proxy and Ingress controllers, can now use these conditions to coordinate connection draining events, by continuing to forward traffic for existing connections but rerouting new connections to other non-terminating endpoints.
Optimizing Internal Node-Local Traffic
Similar to how Services can set externalTrafficPolicy: Local to avoid extra hops for externally sourced traffic, Kubernetes now supports internalTrafficPolicy: Local, to enable the same optimization for traffic originating within the cluster, specifically for traffic using the Service Cluster IP as the destination address. This feature graduated to Beta in Kubernetes v1.24 and is graduating to GA in v1.26.
Services default the internalTrafficPolicy field to Cluster, where traffic is randomly distributed to all endpoints.

Figure 7: Service routing when internalTrafficPolicy is Cluster
When internalTrafficPolicy is set to Local, kube-proxy will forward internal traffic for a Service only if there is an available endpoint that is local to the same Node.

Figure 8: Service routing when internalTrafficPolicy is Local
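A minimal sketch of a Service opting into this behavior (the Service name is made up; the field sits directly under spec):
apiVersion: v1
kind: Service
metadata:
  name: my-internal-service # hypothetical name
spec:
  selector:
    app.kubernetes.io/name: my-app
  ports:
  - protocol: TCP
    port: 80
    targetPort: 9376
  internalTrafficPolicy: Local # prefer endpoints on the same Node as the client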
With internalTrafficPolicy: Local, traffic will be dropped by kube-proxy when no local endpoints are available.
Getting Involved
If you're interested in future discussions on Kubernetes traffic engineering, you can get involved through SIG Network.
30 Dec 2022 12:00am GMT
29 Dec 2022
Kubernetes – Production-Grade Container Orchestration
Blog: Kubernetes 1.26: Job Tracking, to Support Massively Parallel Batch Workloads, Is Generally Available
Authors: Aldo Culquicondor (Google)
The Kubernetes 1.26 release includes a stable implementation of the Job controller that can reliably track a large number of Jobs with high levels of parallelism. SIG Apps and WG Batch have worked on this foundational improvement since Kubernetes 1.22. After multiple iterations and scale verifications, this is now the default implementation of the Job controller.
Paired with the Indexed completion mode, the Job controller can handle massively parallel batch Jobs, supporting up to 100k concurrent Pods.
The new implementation also made possible the development of Pod failure policy, which is in beta in the 1.26 release.
How do I use this feature?
To use Job tracking with finalizers, upgrade to Kubernetes 1.25 or newer and create new Jobs. You can also use this feature in v1.23 and v1.24, if you have the ability to enable the JobTrackingWithFinalizers feature gate.
If your cluster runs Kubernetes 1.26, Job tracking with finalizers is a stable feature. For v1.25, it's behind that feature gate, and your cluster administrators may have explicitly disabled it - for example, if you have a policy of not using beta features.
Jobs created before the upgrade will still be tracked using the legacy behavior. This is to avoid retroactively adding finalizers to running Pods, which might introduce race conditions.
For maximum performance on large Jobs, the Kubernetes project recommends using the Indexed completion mode. In this mode, the control plane is able to track Job progress with fewer API calls.
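As a sketch (the Job name, image, and sizes are made up), an Indexed Job for a large batch workload could look like this; each Pod receives its completion index through the JOB_COMPLETION_INDEX environment variable:
apiVersion: batch/v1
kind: Job
metadata:
  name: parallel-work # hypothetical name
spec:
  completionMode: Indexed
  completions: 10000
  parallelism: 500
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: busybox:1.36 # placeholder image
        command: ["sh", "-c", "echo processing shard $JOB_COMPLETION_INDEX"]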
If you are a developer of operator(s) for batch, HPC, AI, ML or related workloads, we encourage you to use the Job API to delegate accurate progress tracking to Kubernetes. If there is something missing in the Job API that forces you to manage plain Pods, the Working Group Batch welcomes your feedback and contributions.
Deprecation notices
During the development of the feature, the control plane added the annotation batch.kubernetes.io/job-tracking to the Jobs that were created when the feature was enabled. This allowed a safe transition for older Jobs, but it was never meant to stay.
In the 1.26 release, we deprecated the annotation batch.kubernetes.io/job-tracking and the control plane will stop adding it in Kubernetes 1.27. Along with that change, we will remove the legacy Job tracking implementation. As a result, the Job controller will track all Jobs using finalizers and it will ignore Pods that don't have the aforementioned finalizer.
Before you upgrade your cluster to 1.27, we recommend that you verify that there are no running Jobs that don't have the annotation, or you wait for those jobs to complete. Otherwise, you might observe the control plane recreating some Pods. We expect that this shouldn't affect any users, as the feature is enabled by default since Kubernetes 1.25, giving enough buffer for old jobs to complete.
What problem does the new implementation solve?
Generally, Kubernetes workload controllers, such as ReplicaSet or StatefulSet, rely on the existence of Pods or other objects in the API to determine the status of the workload and whether replacements are needed. For example, if a Pod that belonged to a ReplicaSet terminates or ceases to exist, the ReplicaSet controller needs to create a replacement Pod to satisfy the desired number of replicas (.spec.replicas).
Since its inception, the Job controller also relied on the existence of Pods in the API to track Job status. A Job has completion and failure handling policies, requiring the end state of a finished Pod to determine whether to create a replacement Pod or mark the Job as completed or failed. As a result, the Job controller depended on Pods, even terminated ones, to remain in the API in order to keep track of the status.
This dependency made the tracking of Job status unreliable, because Pods can be deleted from the API for a number of reasons, including:
- The garbage collector removing orphan Pods when a Node goes down.
- The garbage collector removing terminated Pods when they reach a threshold.
- The Kubernetes scheduler preempting a Pod to accommodate higher priority Pods.
- The taint manager evicting a Pod that doesn't tolerate a NoExecute taint.
- External controllers, not included as part of Kubernetes, or humans deleting Pods.
The new implementation
When a controller needs to take an action on objects before they are removed, it should add a finalizer to the objects that it manages. A finalizer prevents the objects from being deleted from the API until the finalizers are removed. Once the controller is done with the cleanup and accounting for the deleted object, it can remove the finalizer from the object and the control plane removes the object from the API.
This is what the new Job controller is doing: adding a finalizer during Pod creation, and removing the finalizer after the Pod has terminated and has been accounted for in the Job status. However, it wasn't that simple.
The main challenge is that there are at least two objects involved: the Pod and the Job. While the finalizer lives in the Pod object, the accounting lives in the Job object. There is no mechanism to atomically remove the finalizer in the Pod and update the counters in the Job status. Additionally, there could be more than one terminated Pod at a given time.
To solve this problem, we implemented a three-staged approach, each stage translating to an API call.
- For each terminated Pod, add the unique ID (UID) of the Pod into short-lived lists stored in the .status of the owning Job (.status.uncountedTerminatedPods); a sketch of such a status appears after this list.
- Remove the finalizer from the Pod(s).
- Atomically do the following operations:
  - remove UIDs from the short-lived lists
  - increment the overall succeeded and failed counters in the status of the Job.
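As an illustration of stage 1, the status of a Job part-way through this process might look roughly like this (the UIDs and counts are made up); once stage 3 completes, the UIDs disappear from the list and the succeeded counter is incremented accordingly:
status:
  active: 2
  succeeded: 10
  uncountedTerminatedPods:
    succeeded:
    - "2a3f9c41-7c6f-4bb5-9d6e-0d2a51e3a111" # terminated Pod awaiting finalizer removal and accounting
    - "6b1d2e90-45af-4c2b-8f3a-9c7d8e4f2222"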
Additional complications come from the fact that the Job controller might receive the results of the API changes in steps 1 and 2 out of order. We solved this by adding an in-memory cache for removed finalizers.
Still, we faced some issues during the beta stage, leaving some pods stuck with finalizers in some conditions (#108645, #109485, and #111646). As a result, we decided to switch that feature gate to be disabled by default for the 1.23 and 1.24 releases.
Once resolved, we re-enabled the feature for the 1.25 release. Since then, we have received reports from our customers running tens of thousands of Pods at a time in their clusters through the Job API. Seeing this success, we decided to graduate the feature to stable in 1.26, as part of our long term commitment to make the Job API the best way to run large batch Jobs in a Kubernetes cluster.
To learn more about the feature, you can read the KEP.
Acknowledgments
As with any Kubernetes feature, multiple people contributed to getting this done, from testing and filing bugs to reviewing code.
On behalf of SIG Apps, I would like to especially thank Jordan Liggitt (Google) for helping me debug and brainstorm solutions for more than one race condition and Maciej Szulik (Red Hat) for his thorough reviews.
29 Dec 2022 12:00am GMT
27 Dec 2022
Kubernetes – Production-Grade Container Orchestration
Blog: Kubernetes v1.26: CPUManager goes GA
Author: Francesco Romani (Red Hat)
The CPU Manager is a part of the kubelet, the Kubernetes node agent, which enables the user to allocate exclusive CPUs to containers. Since Kubernetes v1.10, where it graduated to Beta, the CPU Manager proved itself reliable and fulfilled its role of allocating exclusive CPUs to containers, so adoption has steadily grown making it a staple component of performance-critical and low-latency setups. Over time, most changes were about bugfixes or internal refactoring, with the following noteworthy user-visible changes:
- support explicit reservation of CPUs: it was already possible to request to reserve a given number of CPUs for system resources, including the kubelet itself, which will not be used for exclusive CPU allocation. Now it is possible to also explicitly select which CPUs to reserve instead of letting the kubelet pick them up automatically.
- report the exclusively allocated CPUs to containers, much like is already done for devices, using the kubelet-local PodResources API.
- optimize the usage of system resources, eliminating unnecessary sysfs changes.
The CPU Manager reached the point at which it "just works", so in Kubernetes v1.26 it has graduated to generally available (GA).
Customization options for CPU Manager
The CPU Manager supports two operation modes, configured using its policies. With the none policy, the CPU Manager allocates CPUs to containers without any specific constraint except the (optional) quota set in the Pod spec. With the static policy, provided that the Pod is in the Guaranteed QoS class and every container in that Pod requests an integer amount of vCPU cores, the CPU Manager allocates CPUs exclusively. Exclusive assignment means that other containers (whether from the same Pod, or from a different Pod) do not get scheduled onto that CPU.
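As a minimal sketch (names and image are illustrative), a Pod like the following qualifies for exclusive CPUs under the static policy, because requests equal limits (Guaranteed QoS) and the CPU amount is an integer:

apiVersion: v1
kind: Pod
metadata:
  name: exclusive-cpus-example   # hypothetical name
spec:
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
    resources:
      requests:
        cpu: "2"        # integer CPU count
        memory: "1Gi"
      limits:
        cpu: "2"        # requests == limits puts the Pod in the Guaranteed QoS class
        memory: "1Gi"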
This simple operational model served the user base pretty well, but as the CPU Manager matured more and more, users started to look at more elaborate use cases and how to better support them.
Rather than add more policies, the community realized that pretty much all the novel use cases are some variation of the behavior enabled by the static CPU Manager policy. Hence, it was decided to add options to tune the behavior of the static policy. The options have a varying degree of maturity, like any other Kubernetes feature, and in order to be accepted, each new option must provide backward-compatible behavior when disabled and document how it interacts with the other options, should they interact at all.
This enabled the Kubernetes project to graduate the CPU Manager core component and core CPU allocation algorithms to GA, while also enabling a new age of experimentation in this area. In Kubernetes v1.26, the CPU Manager supports three different policy options (a configuration sketch follows the list):
- full-pcpus-only: restrict the CPU Manager core allocation algorithm to full physical cores only, reducing noisy neighbor issues from hardware technologies that allow sharing cores.
- distribute-cpus-across-numa: drive the CPU Manager to evenly distribute CPUs across NUMA nodes, for cases where more than one NUMA node is required to satisfy the allocation.
- align-by-socket: change how the CPU Manager allocates CPUs to a container: consider CPUs to be aligned at the socket boundary, instead of NUMA node boundary.
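For reference, here is a minimal kubelet configuration sketch (values are illustrative) that enables the static policy together with one of these options; changing the policy typically also requires clearing the CPU manager state file and restarting the kubelet:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
cpuManagerPolicyOptions:
  full-pcpus-only: "true"
reservedSystemCPUs: "0,1"   # explicit CPU reservation, as mentioned above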
Further development
After graduating the main CPU Manager feature, each existing policy option will follow its own graduation process, independent from the CPU Manager and from each other option. There is room for new options to be added, but there's also a growing demand for even more flexibility than what the CPU Manager, and its policy options, currently grant.
Conversations are in progress in the community about splitting the CPU Manager and the other resource managers currently part of the kubelet executable into pluggable, independent kubelet plugins. If you are interested in this effort, please join the conversation on SIG Node communication channels (Slack, mailing list, weekly meeting).
Further reading
Please check out the Control CPU Management Policies on the Node task page to learn more about the CPU Manager, and how it fits in relation to the other node-level resource managers.
Getting involved
This feature is driven by the SIG Node community. Please join us to connect with the community and share your ideas and feedback around the above feature and beyond. We look forward to hearing from you!
27 Dec 2022 12:00am GMT
26 Dec 2022
Kubernetes – Production-Grade Container Orchestration
Blog: Kubernetes 1.26: Pod Scheduling Readiness
Author: Wei Huang (Apple), Abdullah Gharaibeh (Google)
Kubernetes 1.26 introduced a new Pod feature: scheduling gates. In Kubernetes, scheduling gates are keys that tell the scheduler when a Pod is ready to be considered for scheduling.
What problem does it solve?
When a Pod is created, the scheduler will continuously attempt to find a node that fits it. This infinite loop continues until the scheduler either finds a node for the Pod, or the Pod gets deleted.
Pods that remain unschedulable for long periods of time (e.g., ones that are blocked on some external event) waste scheduling cycles. A scheduling cycle may take ≅20ms or more depending on the complexity of the Pod's scheduling constraints. Therefore, at scale, those wasted cycles significantly impact the scheduler's performance. See the arrows in the "scheduler" box below.
Scheduling gates help address this problem. They allow declaring that newly created Pods are not ready for scheduling. When scheduling gates are present on a Pod, the scheduler ignores the Pod and therefore saves unnecessary scheduling attempts. Those Pods will also be ignored by Cluster Autoscaler if you have it installed in the cluster.
Clearing the gates is the responsibility of external controllers with knowledge of when the Pod should be considered for scheduling (e.g., a quota manager).
How does it work?
Scheduling gates in general work very similarly to finalizers. Pods with a non-empty spec.schedulingGates field show the status SchedulingGated and are blocked from scheduling. Note that more than one gate can be added, but they all should be added upon Pod creation (e.g., you can add them as part of the spec or via a mutating webhook).
NAME READY STATUS RESTARTS AGE
test-pod 0/1 SchedulingGated 0 10s
To clear the gates, you update the Pod by removing all of the items from the Pod's schedulingGates field. The gates do not need to be removed all at once, but only when all the gates are removed will the scheduler start to consider the Pod for scheduling.
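For illustration, a gated Pod could be created from a manifest like this minimal sketch (the gate name example.com/quota-check is a placeholder; any qualified name works):

apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  schedulingGates:
  - name: example.com/quota-check   # removed later by the controller that owns it
  containers:
  - name: pause
    image: registry.k8s.io/pause:3.6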
Under the hood, scheduling gates are implemented as a PreEnqueue scheduler plugin, a new scheduler framework extension point that is invoked at the beginning of each scheduling cycle.
Use Cases
An important use case this feature enables is dynamic quota management. Kubernetes supports ResourceQuota; however, the API server enforces quota at the time you attempt Pod creation. For example, if a new Pod exceeds the CPU quota, it gets rejected. The API server doesn't queue the Pod; therefore, whoever created the Pod needs to continuously attempt to recreate it. This either means a delay between resources becoming available and the Pod actually running, or it means load on the API server and scheduler due to constant attempts.
Scheduling gates allow an external quota manager to address the above limitation of ResourceQuota. Specifically, the manager could add an example.com/quota-check scheduling gate to all Pods created in the cluster (using a mutating webhook). The manager would then remove the gate when there is quota to start the Pod.
What's next?
To use this feature, the PodSchedulingReadiness feature gate must be enabled in the API Server and scheduler. You're more than welcome to test it out and tell us (SIG Scheduling) what you think!
Additional resources
- Pod Scheduling Readiness in the Kubernetes documentation
- Kubernetes Enhancement Proposal
26 Dec 2022 12:00am GMT
23 Dec 2022
Kubernetes – Production-Grade Container Orchestration
Blog: Kubernetes 1.26: Support for Passing Pod fsGroup to CSI Drivers At Mount Time
Authors: Fabio Bertinatto (Red Hat), Hemant Kumar (Red Hat)
Delegation of fsGroup to CSI drivers was first introduced as alpha in Kubernetes 1.22, and graduated to beta in Kubernetes 1.25. For Kubernetes 1.26, we are happy to announce that this feature has graduated to General Availability (GA).
In this release, if you specify an fsGroup in the security context for a (Linux) Pod, all processes in the Pod's containers are part of the additional group that you specified.
In previous Kubernetes releases, the kubelet would always apply the fsGroup ownership and permission changes to files in the volume according to the policy you specified in the Pod's .spec.securityContext.fsGroupChangePolicy field.
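As a minimal sketch (names and the group ID are illustrative), the relevant fields live in the Pod-level security context alongside a CSI-backed volume claim:

apiVersion: v1
kind: Pod
metadata:
  name: fsgroup-example   # hypothetical name
spec:
  securityContext:
    fsGroup: 1000
    fsGroupChangePolicy: "OnRootMismatch"
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: my-csi-claim   # assumed to be provisioned by a CSI driver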
Starting with Kubernetes 1.26, CSI drivers have the option to apply the fsGroup settings during volume mount time, which frees the kubelet from changing the permissions of files and directories in those volumes.
How does it work?
CSI drivers that support this feature should advertise the VOLUME_MOUNT_GROUP node capability.
After recognizing this information, the kubelet passes the fsGroup information to the CSI driver during pod startup. This is done through the NodeStageVolumeRequest and NodePublishVolumeRequest CSI calls.
Consequently, the CSI driver is expected to apply the fsGroup to the files in the volume using a mount option. As an example, Azure File CSIDriver utilizes the gid mount option to map the fsGroup information to all the files in the volume.
It should be noted that in the example above the kubelet refrains from directly applying the permission changes to the files and directories in that volume. Additionally, two policy definitions no longer have an effect: neither .spec.fsGroupPolicy for the CSIDriver object, nor .spec.securityContext.fsGroupChangePolicy for the Pod.
For more details about the inner workings of this feature, check out the enhancement proposal and the CSI Driver fsGroup Support in the CSI developer documentation.
Why is it important?
Without this feature, applying the fsGroup information to files is not possible in certain storage environments.
For instance, Azure File does not support a concept of POSIX-style ownership and permissions of files. The CSI driver is only able to set the file permissions at the volume level.
How do I use it?
This feature should be mostly transparent to users. If you maintain a CSI driver that should support this feature, read CSI Driver fsGroup Support for more information on how to support this feature in your CSI driver.
Existing CSI drivers that do not support this feature will continue to work as usual: they will not receive any fsGroup information from the kubelet. In addition to that, the kubelet will continue to perform the ownership and permissions changes to files for those volumes, according to the policies specified in .spec.fsGroupPolicy for the CSIDriver and .spec.securityContext.fsGroupChangePolicy for the relevant Pod.
23 Dec 2022 12:00am GMT
22 Dec 2022
Kubernetes – Production-Grade Container Orchestration
Blog: Kubernetes v1.26: GA Support for Kubelet Credential Providers
Authors: Andrew Sy Kim (Google), Dixita Narang (Google)
Kubernetes v1.26 introduced generally available (GA) support for kubelet credential provider plugins, offering an extensible plugin framework to dynamically fetch credentials for any container image registry.
Background
Kubernetes supports the ability to dynamically fetch credentials for a container registry service. Prior to Kubernetes v1.20, this capability was compiled into the kubelet and only available for Amazon Elastic Container Registry, Azure Container Registry, and Google Cloud Container Registry.

Figure 1: Kubelet built-in credential provider support for Amazon Elastic Container Registry, Azure Container Registry, and Google Cloud Container Registry.
Kubernetes v1.20 introduced alpha support for kubelet credential provider plugins, which provide a mechanism for the kubelet to dynamically authenticate and pull images for arbitrary container registries - whether these are public registries, managed services, or even a self-hosted registry. In Kubernetes v1.26, this feature is now GA.

Figure 2: Kubelet credential provider overview
Why is it important?
Prior to Kubernetes v1.20, if you wanted to dynamically fetch credentials for image registries other than ACR (Azure Container Registry), ECR (Elastic Container Registry), or GCR (Google Container Registry), you needed to modify the kubelet code. The new plugin mechanism can be used in any cluster, and lets you authenticate to new registries without any changes to Kubernetes itself. Any cloud provider or vendor can publish a plugin that lets you authenticate with their image registry.
How it works
The kubelet and the exec plugin binary communicate through stdio (stdin, stdout, and stderr) by sending and receiving json-serialized api-versioned types. If the exec plugin is enabled and the kubelet requires authentication information for an image that matches against a plugin, the kubelet will execute the plugin binary, passing the CredentialProviderRequest API via stdin. Then the exec plugin communicates with the container registry to dynamically fetch the credentials and returns the credentials in an encoded response of the CredentialProviderResponse API to the kubelet via stdout.

Figure 3: Kubelet credential provider plugin flow
On receiving credentials from the kubelet, the plugin can also indicate how long credentials can be cached for, to prevent unnecessary execution of the plugin by the kubelet for subsequent image pull requests to the same registry. In cases where the cache duration is not specified by the plugin, a default cache duration can be specified by the kubelet (more details below).
{
  "apiVersion": "kubelet.k8s.io/v1",
  "kind": "CredentialProviderResponse",
  "cacheDuration": "6h",
  "auth": {
    "private-registry.io/my-app": {
      "username": "exampleuser",
      "password": "token12345"
    }
  }
}
In addition, the plugin can specify the scope in which cached credentials are valid. This is specified through the cacheKeyType field in CredentialProviderResponse. When the value is Image, the kubelet will only use cached credentials for future image pulls that exactly match the image of the first request. When the value is Registry, the kubelet will use cached credentials for any subsequent image pulls destined for the same registry host but using different paths (for example, gcr.io/foo/bar and gcr.io/bar/foo refer to different images from the same registry). Lastly, when the value is Global, the kubelet will use returned credentials for all images that match against the plugin, including images that can map to different registry hosts (for example, gcr.io vs k8s.gcr.io). The cacheKeyType field is required by plugin implementations.
{
  "apiVersion": "kubelet.k8s.io/v1",
  "kind": "CredentialProviderResponse",
  "cacheKeyType": "Registry",
  "auth": {
    "private-registry.io/my-app": {
      "username": "exampleuser",
      "password": "token12345"
    }
  }
}
Using kubelet credential providers
You can configure credential providers by installing the exec plugin(s) into a local directory accessible by the kubelet on every node. Then you set two command line arguments for the kubelet:
- --image-credential-provider-config: the path to the credential provider plugin config file.
- --image-credential-provider-bin-dir: the path to the directory where credential provider plugin binaries are located.
The configuration file passed into --image-credential-provider-config is read by the kubelet to determine which exec plugins should be invoked for a container image used by a Pod. Note that the name of each provider must match the name of the binary located in the local directory specified in --image-credential-provider-bin-dir, otherwise the kubelet cannot locate the path of the plugin to invoke.
kind: CredentialProviderConfig
apiVersion: kubelet.config.k8s.io/v1
providers:
  - name: auth-provider-gcp
    apiVersion: credentialprovider.kubelet.k8s.io/v1
    matchImages:
    - "container.cloud.google.com"
    - "gcr.io"
    - "*.gcr.io"
    - "*.pkg.dev"
    args:
    - get-credentials
    - --v=3
    defaultCacheDuration: 1m
Below is an overview of how the Kubernetes project is using kubelet credential providers for end-to-end testing.

Figure 4: Kubelet credential provider configuration used for Kubernetes e2e testing
For more configuration details, see Kubelet Credential Providers.
Getting Involved
Come join SIG Node if you want to report bugs or have feature requests for the Kubelet Credential Provider. You can reach us through the usual SIG Node communication channels (Slack, mailing list, weekly meeting).
22 Dec 2022 12:00am GMT
20 Dec 2022
Kubernetes – Production-Grade Container Orchestration
Blog: Kubernetes 1.26: Introducing Validating Admission Policies
Authors: Joe Betz (Google), Cici Huang (Google)
In Kubernetes 1.26, the first alpha release of validating admission policies is available!
Validating admission policies use the Common Expression Language (CEL) to offer a declarative, in-process alternative to validating admission webhooks.
CEL was first introduced to Kubernetes for the Validation rules for CustomResourceDefinitions. This enhancement expands the use of CEL in Kubernetes to support a far wider range of admission use cases.
Admission webhooks can be burdensome to develop and operate. Webhook developers must implement and maintain a webhook binary to handle admission requests. Also, admission webhooks are complex to operate. Each webhook must be deployed, monitored, and have a well-defined upgrade and rollback plan. To make matters worse, if a webhook times out or becomes unavailable, the Kubernetes control plane can become unavailable. This enhancement avoids much of this complexity of admission webhooks by embedding CEL expressions into Kubernetes resources instead of calling out to a remote webhook binary.
For example, to set a limit on how many replicas a Deployment can have, start by defining a validation policy:
apiVersion: admissionregistration.k8s.io/v1alpha1
kind: ValidatingAdmissionPolicy
metadata:
  name: "demo-policy.example.com"
spec:
  matchConstraints:
    resourceRules:
    - apiGroups: ["apps"]
      apiVersions: ["v1"]
      operations: ["CREATE", "UPDATE"]
      resources: ["deployments"]
  validations:
  - expression: "object.spec.replicas <= 5"
The expression field contains the CEL expression that is used to validate admission requests. matchConstraints declares what types of requests this ValidatingAdmissionPolicy may validate.
Next bind the policy to the appropriate resources:
apiVersion: admissionregistration.k8s.io/v1alpha1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: "demo-binding-test.example.com"
spec:
  policyName: "demo-policy.example.com"
  matchResources:
    namespaceSelector:
      matchExpressions:
      - key: environment
        operator: In
        values:
        - test
This ValidatingAdmissionPolicyBinding resource binds the above policy only to namespaces where the environment label is set to test. Once this binding is created, the kube-apiserver will begin enforcing this admission policy.
To emphasize how much simpler this approach is than admission webhooks, if this example were instead implemented with a webhook, an entire binary would need to be developed and maintained just to perform a <= check. In our review of a wide range of admission webhooks used in production, the vast majority performed relatively simple checks, all of which can easily be expressed using CEL.
Validating admission policies are highly configurable, enabling policy authors to define policies that can be parameterized and scoped to resources as needed by cluster administrators.
For example, the above admission policy can be modified to make it configurable:
apiVersion: admissionregistration.k8s.io/v1alpha1
kind: ValidatingAdmissionPolicy
metadata:
  name: "demo-policy.example.com"
spec:
  paramKind:
    apiVersion: rules.example.com/v1 # You also need a CustomResourceDefinition for this API
    kind: ReplicaLimit
  matchConstraints:
    resourceRules:
    - apiGroups: ["apps"]
      apiVersions: ["v1"]
      operations: ["CREATE", "UPDATE"]
      resources: ["deployments"]
  validations:
  - expression: "object.spec.replicas <= params.maxReplicas"
Here, paramKind defines the resources used to configure the policy and the expression uses the params variable to access the parameter resource.
This allows multiple bindings to be defined, each configured differently. For example:
apiVersion: admissionregistration.k8s.io/v1alpha1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: "demo-binding-production.example.com"
spec:
  policyName: "demo-policy.example.com"
  paramRef:
    name: "demo-params-production.example.com"
  matchResources:
    namespaceSelector:
      matchExpressions:
      - key: environment
        operator: In
        values:
        - production
---
apiVersion: rules.example.com/v1 # defined via a CustomResourceDefinition
kind: ReplicaLimit
metadata:
  name: "demo-params-production.example.com"
maxReplicas: 1000
This binding and parameter resource pair limit Deployments in namespaces with the environment label set to production to a maximum of 1000 replicas. You can then use a separate binding and parameter pair to set a different limit for namespaces in the test environment.
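As a sketch of what that second pair could look like (the names and the limit of 5 replicas are illustrative, not prescribed by the release):

apiVersion: admissionregistration.k8s.io/v1alpha1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: "demo-binding-test.example.com"
spec:
  policyName: "demo-policy.example.com"
  paramRef:
    name: "demo-params-test.example.com"
  matchResources:
    namespaceSelector:
      matchExpressions:
      - key: environment
        operator: In
        values:
        - test
---
apiVersion: rules.example.com/v1 # defined via a CustomResourceDefinition
kind: ReplicaLimit
metadata:
  name: "demo-params-test.example.com"
maxReplicas: 5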
I hope this has given you a glimpse of what is possible with validating admission policies! There are many features that we have not yet touched on.
To learn more, read Validating Admission Policy.
We are working hard to add more features to admission policies and make the enhancement easier to use. Try it out, send us your feedback and help us build a simpler alternative to admission webhooks!
How do I get involved?
If you want to get involved in development of admission policies, discuss enhancement roadmaps, or report a bug, you can get in touch with developers at SIG API Machinery.
20 Dec 2022 12:00am GMT
19 Dec 2022
Kubernetes – Production-Grade Container Orchestration
Blog: Kubernetes 1.26: Device Manager graduates to GA
Author: Swati Sehgal (Red Hat)
The Device Plugin framework was introduced in the Kubernetes v1.8 release as a vendor independent framework to enable discovery, advertisement and allocation of external devices without modifying core Kubernetes. The feature graduated to Beta in v1.10. With the recent release of Kubernetes v1.26, Device Manager is now generally available (GA).
Within the kubelet, the Device Manager facilitates communication with device plugins using gRPC through Unix sockets. Device Manager and Device plugins both act as gRPC servers and clients by serving and connecting to the exposed gRPC services respectively. Device plugins serve a gRPC service that kubelet connects to for device discovery, advertisement (as extended resources) and allocation. Device plugins connect to the Registration gRPC service served by the kubelet to register themselves with the kubelet.
Please refer to the documentation for an example on how a pod can request a device exposed to the cluster by a device plugin.
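As a quick illustration (the extended resource name vendor.example.com/foo is hypothetical; real names are advertised by whichever device plugin you deploy), a Pod requests a device through its resource limits:

apiVersion: v1
kind: Pod
metadata:
  name: device-consumer
spec:
  containers:
  - name: demo
    image: registry.k8s.io/pause:3.9
    resources:
      limits:
        vendor.example.com/foo: 1   # one device exposed by the device plugin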
Here are some example implementations of device plugins:
- AMD GPU device plugin
- Collection of Intel device plugins for Kubernetes
- NVIDIA device plugin for Kubernetes
- SRIOV network device plugin for Kubernetes
Noteworthy developments since Device Plugin framework introduction
Kubelet APIs moved to kubelet staging repo
External facing deviceplugin API packages moved from k8s.io/kubernetes/pkg/kubelet/apis/ to k8s.io/kubelet/pkg/apis/ in v1.17. Refer to Move external facing kubelet apis to staging for more details on the rationale behind this change.
Device Plugin API updates
Additional gRPC endpoints were introduced:
- GetDevicePluginOptions is used by device plugins to communicate options to the DeviceManager in order to indicate whether PreStartContainer, GetPreferredAllocation or other future optional calls are supported and can be called before making devices available to the container.
- GetPreferredAllocation allows a device plugin to forward allocation preferences to the DeviceManager so it can incorporate this information into its allocation decisions. The DeviceManager will call out to a plugin at pod admission time asking for a preferred device allocation of a given size from a list of available devices to make a more informed decision. For example, specifying inter-device constraints to indicate a preference for the best-connected set of devices when allocating devices to a container.
- PreStartContainer is called before each container start if indicated by device plugins during the registration phase. It allows Device Plugins to run device-specific operations on the devices requested. For example, reconfiguring or reprogramming FPGAs before the container starts running.
Pull Requests that introduced these changes are here:
- Invoke preStart RPC call before container start, if desired by plugin
- Add GetPreferredAllocation() call to the v1beta1 device plugin API
With the introduction of the above endpoints, the interaction between the Device Manager in the kubelet and the device plugins can be shown as below:
Device Plugin framework Overview
Change in semantics of device plugin registration process
Device plugin code was refactored into a separate 'plugin' package under the devicemanager package to lay the groundwork for introducing a v1beta2 device plugin API. This would allow adding support in devicemanager to service multiple device plugin APIs at the same time.
With this refactoring work, it is now mandatory for a device plugin to start serving its gRPC service before registering itself with kubelet. Previously, these two operations were asynchronous and a device plugin could register itself before starting its gRPC server, which is no longer the case. For more details, refer to PR #109016 and Issue #112395.
Dynamic resource allocation
In Kubernetes 1.26, inspired by how Persistent Volumes are handled in Kubernetes, Dynamic Resource Allocation has been introduced to cater to devices that have more sophisticated resource requirements like:
- Decouple device initialization and allocation from the pod lifecycle.
- Facilitate dynamic sharing of devices between containers and pods.
- Support custom resource-specific parameters
- Enable resource-specific setup and cleanup actions
- Enable support for Network-attached resources, not just node-local resources
Is the Device Plugin API stable now?
No, the Device Plugin API is still not stable; the latest Device Plugin API version available is v1beta1. There are plans in the community to introduce v1beta2 API to service multiple plugin APIs at once. A per-API call with request/response types would allow adding support for newer API versions without explicitly bumping the API.
In addition to that, there are existing proposals in the community to introduce additional endpoints KEP-3162: Add Deallocate and PostStopContainer to Device Manager API.
19 Dec 2022 12:00am GMT
16 Dec 2022
Kubernetes – Production-Grade Container Orchestration
Blog: Kubernetes 1.26: Non-Graceful Node Shutdown Moves to Beta
Author: Xing Yang (VMware), Ashutosh Kumar (VMware)
Kubernetes v1.24 introduced an alpha quality implementation of improvements for handling a non-graceful node shutdown. In Kubernetes v1.26, this feature moves to beta. This feature allows stateful workloads to fail over to a different node after the original node is shut down or in a non-recoverable state, such as a hardware failure or a broken OS.
What is a node shutdown in Kubernetes?
In a Kubernetes cluster, it is possible for a node to shut down. This could happen either in a planned way or it could happen unexpectedly. You may plan for a security patch, or a kernel upgrade and need to reboot the node, or it may shut down due to preemption of VM instances. A node may also shut down due to a hardware failure or a software problem.
To trigger a node shutdown, you could run a shutdown or poweroff command in a shell, or physically press a button to power off a machine.
A node shutdown could lead to workload failure if the node is not drained before the shutdown.
In the following sections, we describe what a graceful node shutdown is and what a non-graceful node shutdown is.
What is a graceful node shutdown?
The kubelet's handling for a graceful node shutdown allows the kubelet to detect a node shutdown event, properly terminate the pods on that node, and release resources before the actual shutdown. Critical pods are terminated after all the regular pods are terminated, to ensure that the essential functions of an application can continue to work as long as possible.
What is a non-graceful node shutdown?
A node shutdown can be graceful only if the kubelet's node shutdown manager can detect the upcoming node shutdown action. However, there are cases where a kubelet does not detect a node shutdown action. This could happen because the shutdown command does not trigger the Inhibitor Locks mechanism used by the kubelet on Linux, or because of a user error, for example if the shutdownGracePeriod and shutdownGracePeriodCriticalPods details are not configured correctly for that node.
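For context, graceful shutdown detection is driven by kubelet configuration like the following minimal sketch (values are illustrative); if both durations are left at zero, graceful node shutdown handling is effectively disabled and any shutdown becomes non-graceful:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
shutdownGracePeriod: "30s"               # total time the node delays shutdown for pod termination
shutdownGracePeriodCriticalPods: "10s"   # portion of that time reserved for critical pods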
When a node is shut down (or crashes) and that shutdown was not detected by the kubelet node shutdown manager, it becomes a non-graceful node shutdown. Non-graceful node shutdown is a problem for stateful apps. If a node containing a Pod that is part of a StatefulSet is shut down in a non-graceful way, the Pod will be stuck in Terminating status indefinitely, and the control plane cannot create a replacement Pod for that StatefulSet on a healthy node. You can delete the failed Pods manually, but this is not ideal for a self-healing cluster. Similarly, Pods created by ReplicaSets as part of a Deployment that were bound to the now-shutdown node also stay in Terminating status indefinitely. If you have set a horizontal scaling limit, even those terminating Pods count against the limit, so your workload may struggle to self-heal if it was already at maximum scale. (By the way: if the node that had done a non-graceful shutdown comes back up, the kubelet does delete the old Pod, and the control plane can make a replacement.)
What's new for the beta?
For Kubernetes v1.26, the non-graceful node shutdown feature is beta and enabled by default. The NodeOutOfServiceVolumeDetach feature gate is enabled by default on kube-controller-manager instead of being opt-in; you can still disable it if needed (please also file an issue to explain the problem).
On the instrumentation side, the kube-controller-manager reports two new metrics.
- force_delete_pods_total: the number of pods that are being forcibly deleted (resets on Pod garbage collection controller restart)
- force_delete_pod_errors_total: the number of errors encountered when attempting forcible Pod deletion (also resets on Pod garbage collection controller restart)
How does it work?
In the case of a node shutdown, if a graceful shutdown is not working or the node is in a non-recoverable state due to hardware failure or broken OS, you can manually add an out-of-service taint on the Node. For example, this can be node.kubernetes.io/out-of-service=nodeshutdown:NoExecute or node.kubernetes.io/out-of-service=nodeshutdown:NoSchedule. This taint triggers pods on the node to be forcefully deleted if there are no matching tolerations on the pods. Persistent volumes attached to the shutdown node will be detached, and new pods will be created successfully on a different running node.
kubectl taint nodes <node-name> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute
Note: Before applying the out-of-service taint, you must verify that a node is already in shutdown or power-off state (not in the middle of restarting), either because the user intentionally shut it down or the node is down due to hardware failures, OS issues, etc.
Once all the workload pods that are linked to the out-of-service node have moved to a new running node, and the shutdown node has been recovered, you should remove that taint from the affected node.
What's next?
Depending on feedback and adoption, the Kubernetes team plans to push the Non-Graceful Node Shutdown implementation to GA in either 1.27 or 1.28.
This feature requires a user to manually add a taint to the node to trigger the failover of workloads and remove the taint after the node is recovered.
The cluster operator can automate this process by automatically applying the out-of-service taint if there is a programmatic way to determine that the node is really shut down and there isn't IO between the node and storage. The cluster operator can then automatically remove the taint after the workload fails over successfully to another running node and the shutdown node has been recovered.
In the future, we plan to find ways to automatically detect and fence nodes that are shut down or in a non-recoverable state and fail their workloads over to another node.
How can I learn more?
To learn more, read Non Graceful node shutdown in the Kubernetes documentation.
How to get involved?
We offer a huge thank you to all the contributors who helped with design, implementation, and review of this feature:
- Michelle Au (msau42)
- Derek Carr (derekwaynecarr)
- Danielle Endocrimes (endocrimes)
- Tim Hockin (thockin)
- Ashutosh Kumar (sonasingh46)
- Hemant Kumar (gnufied)
- Yuiko Mouri (YuikoTakada)
- Mrunal Patel (mrunalp)
- David Porter (bobbypage)
- Yassine Tijani (yastij)
- Jing Xu (jingxu97)
- Xing Yang (xing-yang)
There are many people who have helped review the design and implementation along the way. We want to thank everyone who has contributed to this effort including the about 30 people who have reviewed the KEP and implementation over the last couple of years.
This feature is a collaboration between SIG Storage and SIG Node. For those interested in getting involved with the design and development of any part of the Kubernetes Storage system, join the Kubernetes Storage Special Interest Group (SIG). For those interested in getting involved with the design and development of the components that support the controlled interactions between pods and host resources, join the Kubernetes Node SIG.
16 Dec 2022 6:00pm GMT
15 Dec 2022
Kubernetes – Production-Grade Container Orchestration
Blog: Kubernetes 1.26: Alpha API For Dynamic Resource Allocation
Authors: Patrick Ohly (Intel), Kevin Klues (NVIDIA)
Dynamic resource allocation is a new API for requesting resources. It is a generalization of the persistent volumes API for generic resources, making it possible to:
- access the same resource instance in different pods and containers,
- attach arbitrary constraints to a resource request to get the exact resource you are looking for,
- initialize a resource according to parameters provided by the user.
Third-party resource drivers are responsible for interpreting these parameters as well as tracking and allocating resources as requests come in.
Dynamic resource allocation is an alpha feature and only enabled when the DynamicResourceAllocation feature gate and the resource.k8s.io/v1alpha1 API group are enabled. For details, see the --feature-gates and --runtime-config kube-apiserver parameters. The kube-scheduler, kube-controller-manager and kubelet components all need the feature gate enabled as well.
The default configuration of kube-scheduler enables the DynamicResources plugin if and only if the feature gate is enabled. Custom configurations may have to be modified to include it.
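For a custom scheduler configuration, a minimal sketch (illustrative; with the feature gate on, the default profile already enables the plugin) could list the plugin explicitly like this:

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  plugins:
    multiPoint:
      enabled:
      - name: DynamicResources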
Once dynamic resource allocation is enabled, resource drivers can be installed to manage certain kinds of hardware. Kubernetes has a test driver that is used for end-to-end testing, but also can be run manually. See below for step-by-step instructions.
API
The new resource.k8s.io/v1alpha1 API group provides four new types:
- ResourceClass: defines which resource driver handles a certain kind of resource and provides common parameters for it. ResourceClasses are created by a cluster administrator when installing a resource driver.
- ResourceClaim: defines a particular resource instance that is required by a workload. Created by a user (lifecycle managed manually, can be shared between different Pods) or for individual Pods by the control plane based on a ResourceClaimTemplate (automatic lifecycle, typically used by just one Pod).
- ResourceClaimTemplate: defines the spec and some metadata for creating ResourceClaims. Created by a user when deploying a workload.
- PodScheduling: used internally by the control plane and resource drivers to coordinate pod scheduling when ResourceClaims need to be allocated for a Pod.
Parameters for ResourceClass and ResourceClaim are stored in separate objects, typically using the type defined by a CRD that was created when installing a resource driver.
With this alpha feature enabled, the spec of a Pod defines ResourceClaims that are needed for the Pod to run: this information goes into a new resourceClaims field. Entries in that list reference either a ResourceClaim or a ResourceClaimTemplate. When referencing a ResourceClaim, all Pods using this .spec (for example, inside a Deployment or StatefulSet) share the same ResourceClaim instance. When referencing a ResourceClaimTemplate, each Pod gets its own ResourceClaim instance.
For a container defined within a Pod, the resources.claims list defines whether that container gets access to these resource instances, which makes it possible to share resources between one or more containers inside the same Pod. For example, an init container could set up the resource before the application uses it.
Here is an example of a fictional resource driver. Two ResourceClaim objects will get created for this Pod and each container gets access to one of them.
Assuming a resource driver called resource-driver.example.com was installed together with the following resource class:
apiVersion: resource.k8s.io/v1alpha1
kind: ResourceClass
metadata:
  name: resource.example.com
driverName: resource-driver.example.com
An end-user could then allocate two specific resources of type resource.example.com as follows:
---
apiVersion: cats.resource.example.com/v1
kind: ClaimParameters
name: large-black-cats
spec:
  color: black
  size: large
---
apiVersion: resource.k8s.io/v1alpha1
kind: ResourceClaimTemplate
metadata:
  name: large-black-cats
spec:
  spec:
    resourceClassName: resource.example.com
    parametersRef:
      apiGroup: cats.resource.example.com
      kind: ClaimParameters
      name: large-black-cats
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-cats
spec:
  containers: # two example containers; each container claims one cat resource
  - name: first-example
    image: ubuntu:22.04
    command: ["sleep", "9999"]
    resources:
      claims:
      - name: cat-0
  - name: second-example
    image: ubuntu:22.04
    command: ["sleep", "9999"]
    resources:
      claims:
      - name: cat-1
  resourceClaims:
  - name: cat-0
    source:
      resourceClaimTemplateName: large-black-cats
  - name: cat-1
    source:
      resourceClaimTemplateName: large-black-cats
Scheduling
In contrast to native resources (such as CPU or RAM) and extended resources (managed by a device plugin, advertised by kubelet), the scheduler has no knowledge of what dynamic resources are available in a cluster or how they could be split up to satisfy the requirements of a specific ResourceClaim. Resource drivers are responsible for that. Drivers mark ResourceClaims as allocated once resources for it are reserved. This also then tells the scheduler where in the cluster a claimed resource is actually available.
ResourceClaims can get resources allocated as soon as the ResourceClaim is created (immediate allocation), without considering which Pods will use the resource. The default (wait for first consumer) is to delay allocation until a Pod that relies on the ResourceClaim becomes eligible for scheduling. This design with two allocation options is similar to how Kubernetes handles storage provisioning with PersistentVolumes and PersistentVolumeClaims.
In the wait for first consumer mode, the scheduler checks all ResourceClaims needed by a Pod. If the Pod has any ResourceClaims, the scheduler creates a PodScheduling object (a special object that requests scheduling details on behalf of the Pod). The PodScheduling object has the same name and namespace as the Pod, and the Pod as its owner. Using its PodScheduling object, the scheduler informs the resource drivers responsible for those ResourceClaims about nodes that the scheduler considers suitable for the Pod. The resource drivers respond by excluding nodes that don't have enough of the driver's resources left.
Once the scheduler has that resource information, it selects one node and stores that choice in the PodScheduling object. The resource drivers then allocate resources based on the relevant ResourceClaims so that the resources will be available on that selected node. Once that resource allocation is complete, the scheduler attempts to schedule the Pod to a suitable node. Scheduling can still fail at this point; for example, a different Pod could be scheduled to the same node in the meantime. If this happens, already allocated ResourceClaims may get deallocated to enable scheduling onto a different node.
As part of this process, ResourceClaims also get reserved for the Pod. Currently ResourceClaims can either be used exclusively by a single Pod or an unlimited number of Pods.
One key feature is that Pods do not get scheduled to a node unless all of their resources are allocated and reserved. This avoids the scenario where a Pod gets scheduled onto one node and then cannot run there, which is bad because such a pending Pod also blocks all other resources like RAM or CPU that were set aside for it.
Limitations
The scheduler plugin must be involved in scheduling Pods which use ResourceClaims. Bypassing the scheduler by setting the nodeName field leads to Pods that the kubelet refuses to start because the ResourceClaims are not reserved or not even allocated. It may be possible to remove this limitation in the future.
Writing a resource driver
A dynamic resource allocation driver typically consists of two separate-but-coordinating components: a centralized controller, and a DaemonSet of node-local kubelet plugins. Most of the work required by the centralized controller to coordinate with the scheduler can be handled by boilerplate code. Only the business logic required to actually allocate ResourceClaims against the ResourceClasses owned by the plugin needs to be customized. As such, Kubernetes provides a package that includes APIs for invoking this boilerplate code as well as a Driver interface that you can implement to provide your custom business logic.
Likewise, boilerplate code can be used to register the node-local plugin with the kubelet, as well as start a gRPC server to implement the kubelet plugin API. For drivers written in Go, a corresponding helper package is recommended.
It is up to the driver developer to decide how these two components communicate. The KEP outlines an approach using CRDs.
Within SIG Node, we also plan to provide a complete example driver that can serve as a template for other drivers.
Running the test driver
The following steps bring up a local, one-node cluster directly from the Kubernetes source code. As a prerequisite, your cluster must have nodes with a container runtime that supports the Container Device Interface (CDI). For example, you can run CRI-O v1.23.2 or later. Once containerd v1.7.0 is released, we expect that you can run that or any later version. In the example below, we use CRI-O.
First, clone the Kubernetes source code. Inside that directory, run:
$ hack/install-etcd.sh
...
$ RUNTIME_CONFIG=resource.k8s.io/v1alpha1 \
FEATURE_GATES=DynamicResourceAllocation=true \
DNS_ADDON="coredns" \
CGROUP_DRIVER=systemd \
CONTAINER_RUNTIME_ENDPOINT=unix:///var/run/crio/crio.sock \
LOG_LEVEL=6 \
ENABLE_CSI_SNAPSHOTTER=false \
API_SECURE_PORT=6444 \
ALLOW_PRIVILEGED=1 \
PATH=$(pwd)/third_party/etcd:$PATH \
./hack/local-up-cluster.sh -O
...
To start using your cluster, you can open up another terminal/tab and run:
export KUBECONFIG=/var/run/kubernetes/admin.kubeconfig
...
Once the cluster is up, in another terminal run the test driver controller. KUBECONFIG must be set for all of the following commands.
$ go run ./test/e2e/dra/test-driver --feature-gates ContextualLogging=true -v=5 controller
In another terminal, run the kubelet plugin:
$ sudo mkdir -p /var/run/cdi && \
sudo chmod a+rwx /var/run/cdi /var/lib/kubelet/plugins_registry /var/lib/kubelet/plugins/
$ go run ./test/e2e/dra/test-driver --feature-gates ContextualLogging=true -v=6 kubelet-plugin
Changing the permissions of the directories makes it possible to run and (when using delve) debug the kubelet plugin as a normal user, which is convenient because it uses the already populated Go cache. Remember to restore permissions with sudo chmod go-w when done. Alternatively, you can also build the binary and run that as root.
Now the cluster is ready to create objects:
$ kubectl create -f test/e2e/dra/test-driver/deploy/example/resourceclass.yaml
resourceclass.resource.k8s.io/example created
$ kubectl create -f test/e2e/dra/test-driver/deploy/example/pod-inline.yaml
configmap/test-inline-claim-parameters created
resourceclaimtemplate.resource.k8s.io/test-inline-claim-template created
pod/test-inline-claim created
$ kubectl get resourceclaims
NAME RESOURCECLASSNAME ALLOCATIONMODE STATE AGE
test-inline-claim-resource example WaitForFirstConsumer allocated,reserved 8s
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
test-inline-claim 0/2 Completed 0 21s
The test driver doesn't do much; it only sets environment variables as defined in the ConfigMap. The test pod dumps the environment, so the log can be checked to verify that everything worked:
$ kubectl logs test-inline-claim with-resource | grep user_a
user_a='b'
Next steps
- See the Dynamic Resource Allocation KEP for more information on the design.
- Read Dynamic Resource Allocation in the official Kubernetes documentation.
- You can participate in SIG Node and / or the CNCF Container Orchestrated Device Working Group.
- You can view or comment on the project board for dynamic resource allocation.
- In order to move this feature towards beta, we need feedback from hardware vendors, so here's a call to action: try out this feature, consider how it can help with problems that your users are having, and write resource drivers…
15 Dec 2022 12:00am GMT
13 Dec 2022
Kubernetes – Production-Grade Container Orchestration
Blog: Kubernetes 1.26: Windows HostProcess Containers Are Generally Available
Authors: Brandon Smith (Microsoft) and Mark Rossetti (Microsoft)
The long-awaited day has arrived: HostProcess containers, the Windows equivalent to Linux privileged containers, has finally made it to GA in Kubernetes 1.26!
What are HostProcess containers and why are they useful?
Cluster operators are often faced with the need to configure their nodes upon provisioning, such as installing Windows services, configuring registry keys, managing TLS certificates, making network configuration changes, or even deploying monitoring tools such as Prometheus's node-exporter. Previously, performing these actions on Windows nodes was usually done by running PowerShell scripts over SSH or WinRM sessions and/or working with your cloud provider's virtual machine management tooling. HostProcess containers now enable you to do all of this and more with minimal effort using Kubernetes native APIs.
With HostProcess containers you can now package any payload into the container image, map volumes into containers at runtime, and manage them like any other Kubernetes workload. You get all the benefits of containerized packaging and deployment methods combined with a reduction in both administrative and development cost. Gone are the days where cluster operators would need to manually log onto Windows nodes to perform administrative duties.
HostProcess containers differ quite significantly from regular Windows Server containers. They are run directly as processes on the host with the access policies of a user you specify. HostProcess containers run as either the built-in Windows system accounts or ephemeral users within a user group defined by you. HostProcess containers also share the host's network namespace and access/configure storage mounts visible to the host. On the other hand, Windows Server containers are highly isolated and exist in a separate execution namespace. Direct access to the host from a Windows Server container is explicitly disallowed by default.
How does it work?
Windows HostProcess containers are implemented with Windows Job Objects, a break from the previous container model which uses server silos. Job Objects are components of the Windows OS which offer the ability to manage a group of processes as a group (also known as a job) and assign resource constraints to the group as a whole. Job objects are specific to the Windows OS and are not associated with the Kubernetes Job API. They have no process or file system isolation, enabling the privileged payload to view and edit the host file system with the desired permissions, among other host resources. The init process, and any processes it launches (including processes explicitly launched by the user) are all assigned to the job object of that container. When the init process exits or is signaled to exit, all the processes in the job will be signaled to exit, the job handle will be closed and the storage will be unmounted.
HostProcess and Linux privileged containers enable similar scenarios but differ greatly in their implementation (hence the naming difference). HostProcess containers have their own PodSecurityContext fields. Those used to configure Linux privileged containers do not apply. Enabling privileged access to a Windows host is a fundamentally different process than with Linux so the configuration and capabilities of each differ significantly. Below is a diagram detailing the overall architecture of Windows HostProcess containers:
Two major features were added prior to moving to stable: the ability to run as local user accounts, and a simplified method of accessing volume mounts. To learn more, read Create a Windows HostProcess Pod.
HostProcess containers in action
Kubernetes SIG Windows has been busy putting HostProcess containers to use - even before GA! They've been very excited to use HostProcess containers for a number of important activities that were a pain to perform in the past.
Here are just a few of the many use cases with example deployments:
How do I use it?
A HostProcess container can be built using any base image of your choosing; however, for convenience we have created a HostProcess container base image. This image is only a few KB in size and does not inherit any of the same compatibility requirements as regular Windows server containers, which allows it to run on any Windows server version.
To use that Microsoft image, put this in your Dockerfile:
FROM mcr.microsoft.com/oss/kubernetes/windows-host-process-containers-base-image:v1.0.0
You can run HostProcess containers from within a HostProcess Pod.
To get started with running Windows containers, see the general guidance for deploying Windows nodes. If you have a compatible node (for example: Windows as the operating system with containerd v1.7 or later as the container runtime), you can deploy a Pod with one or more HostProcess containers. See the Create a Windows HostProcess Pod - Prerequisites for more information.
Please note that within a Pod, you can't mix HostProcess containers with normal Windows containers.
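For orientation, here is a minimal HostProcess Pod sketch (the Pod name, command, and payload are illustrative; the base image is the one mentioned above):

apiVersion: v1
kind: Pod
metadata:
  name: hostprocess-example
spec:
  securityContext:
    windowsOptions:
      hostProcess: true
      runAsUserName: "NT AUTHORITY\\SYSTEM"   # one of the built-in Windows accounts
  hostNetwork: true                           # required for HostProcess Pods
  containers:
  - name: configure-node
    image: mcr.microsoft.com/oss/kubernetes/windows-host-process-containers-base-image:v1.0.0
    command: ["powershell.exe", "-Command", "Get-Service kubelet"]   # illustrative payload
  nodeSelector:
    kubernetes.io/os: windows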
How can I learn more?
- Work through Create a Windows HostProcess Pod
- Read about Kubernetes Pod Security Standards and Pod Security Admission
- Read the enhancement proposal Windows Privileged Containers and Host Networking Mode (KEP-1981)
- Watch the Windows HostProcess for Configuration and Beyond KubeCon NA 2022 talk
How do I get involved?
Get involved with SIG Windows to contribute!
13 Dec 2022 12:00am GMT