03 Jul 2025

Kubernetes Blog

Navigating Failures in Pods With Devices

Kubernetes is the de facto standard for container orchestration, but when it comes to handling specialized hardware like GPUs and other accelerators, things get a bit complicated. This blog post dives into the challenges of managing failure modes when operating pods with devices in Kubernetes, based on insights from Sergey Kanzhelev and Mrunal Patel's talk at KubeCon NA 2024. You can follow the links to slides and recording.

The AI/ML boom and its impact on Kubernetes

The rise of AI/ML workloads has brought new challenges to Kubernetes. These workloads often rely heavily on specialized hardware, and any device failure can significantly impact performance and lead to frustrating interruptions. As highlighted in the 2024 Llama paper, hardware issues, particularly GPU failures, are a major cause of disruption in AI/ML training. You can also learn how much effort NVIDIA spends on handling device failures and maintenance in the KubeCon talk by Ryan Hallisey and Piotr Prokop, All-Your-GPUs-Are-Belong-to-Us: An Inside Look at NVIDIA's Self-Healing GeForce NOW Infrastructure (recording), as they see 19 remediation requests per 1000 nodes a day! We also see data centers offering spot consumption models and overcommitting on power, making device failures commonplace and a part of the business model.

However, Kubernetes's view on resources is still very static. The resource is either there or not. And if it is there, the assumption is that it will stay there fully functional - Kubernetes lacks good support for handling full or partial hardware failures. These long-existing assumptions combined with the overall complexity of a setup lead to a variety of failure modes, which we discuss here.

Understanding AI/ML workloads

Generally, all AI/ML workloads require specialized hardware, have challenging scheduling requirements, and are expensive when idle. AI/ML workloads typically fall into two categories - training and inference. Here is an oversimplified view of those categories' characteristics, which are different from traditional workloads like web services:

Training
These workloads are resource-intensive, often consuming entire machines and running as gangs of pods. Training jobs are usually "run to completion" - but that could be days, weeks or even months. Any failure in a single pod can necessitate restarting the entire step across all the pods.
Inference
These workloads are usually long-running or run indefinitely, and can be small enough to consume a subset of a Node's devices or large enough to span multiple nodes. They often require downloading huge files with the model weights.

These workload types specifically break many past assumptions:

Workload assumptions, before and now:

Before: Can get a better CPU and the app will work faster. Now: Require a specific device (or class of devices) to run.
Before: When something doesn't work, just recreate it. Now: Allocation or reallocation is expensive.
Before: Any node will work. No need to coordinate between Pods. Now: Scheduled in a special way - devices often connected in a cross-node topology.
Before: Each Pod can be plug-and-play replaced if failed. Now: Pods are a part of a larger task. Lifecycle of an entire task depends on each Pod.
Before: Container images are slim and easily available. Now: Container images may be so big that they require special handling.
Before: Long initialization can be offset by slow rollout. Now: Initialization may be long and should be optimized, sometimes across many Pods together.
Before: Compute nodes are commoditized and relatively inexpensive, so some idle time is acceptable. Now: Nodes with specialized hardware can be an order of magnitude more expensive than those without, so idle time is very wasteful.

The existing failure model relies on these old assumptions. It may still work for the new workload types, but it has limited knowledge about devices and is very expensive for them - in some cases, prohibitively expensive. You will see more examples later in this article.

Why Kubernetes still reigns supreme

This article does not go deeper into the question of why not start fresh for AI/ML workloads, given how different they are from traditional Kubernetes workloads. Despite many challenges, Kubernetes remains the platform of choice for AI/ML workloads. Its maturity, security, and rich ecosystem of tools make it a compelling option. While alternatives exist, they often lack the years of development and refinement that Kubernetes offers. And the Kubernetes developers are actively addressing the gaps identified in this article and beyond.

The current state of device failure handling

This section outlines different failure modes and the best practices and DIY (Do-It-Yourself) solutions used today. The next section describes a roadmap for improving the handling of those failure modes.

Failure modes: K8s infrastructure

In order to understand the failures related to the Kubernetes infrastructure, you need to understand how many moving parts are involved in scheduling a Pod on a node. The sequence of events when a Pod is scheduled on a Node is as follows:

  1. Device plugin is scheduled on the Node
  2. Device plugin is registered with the kubelet via local gRPC
  3. Kubelet uses device plugin to watch for devices and updates capacity of the node
  4. Scheduler places a user Pod on a Node based on the updated capacity
  5. Kubelet asks Device plugin to Allocate devices for a User Pod
  6. Kubelet creates a User Pod with the allocated devices attached to it

This diagram shows some of those actors involved:

The diagram shows relationships between the kubelet, Device plugin, and a user Pod. It shows that the kubelet connects to the Device plugin named my-device, the kubelet reports the node status with my-device availability, and the user Pod requests 2 of my-device.
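
To make this concrete, here is a minimal sketch of a user Pod requesting two devices through the extended resource name advertised by a device plugin (the resource name example.com/my-device and the image are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: training-worker
spec:
  containers:
  - name: main
    image: example.com/training:latest   # hypothetical image
    resources:
      limits:
        example.com/my-device: 2         # extended resource advertised by the device plugin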

As there are so many actors interconnected, every one of them and every connection may experience interruptions. This leads to many exceptional situations that are often considered failures, and may cause serious workload interruptions:

The same diagram as the one above, but with orange "bang" drawings overlaid on individual components, with text indicating what can break in that component. Over the kubelet the text reads: 'kubelet restart: loses all device info before re-Watch'. Over the Device plugin the text reads: 'device plugin update, eviction, restart: kubelet cannot Allocate devices or loses all device state'. Over the user Pod the text reads: 'slow pod termination: devices are unavailable'.

The goal for Kubernetes is to make the communication between these components as reliable as possible. The kubelet already implements retries, grace periods, and other techniques to improve it. The roadmap section goes into details on other edge cases that the Kubernetes project tracks. However, all these improvements only work when the relevant best practices are followed.

Another class of Kubernetes infra-related issues is driver-related. With traditional resources like CPU and memory, no compatibility checks between the application and hardware were needed. With special devices like hardware accelerators, there are new failure modes. Device drivers installed on the node:

Best practices for handling driver versions:

Following the best practices in this section and using device plugins and device driver installers from trusted and reliable sources generally eliminate this class of failures. Kubernetes is tracking work to make this space even better.

Failure modes: device failed

There is very little handling of device failure in Kubernetes today. Device plugins report a device failure only by changing the count of allocatable devices. And Kubernetes relies on standard mechanisms like liveness probes or container failures to allow Pods to communicate the failure condition to the kubelet. However, Kubernetes does not correlate device failures with container crashes and does not offer any mitigation beyond restarting the container while it stays attached to the same device.

This is why many plugins and DIY solutions exist to handle device failures based on various signals.

Health controller

In many cases a failed device will result in unrecoverable and very expensive nodes doing nothing. A simple DIY solution is a node health controller. The controller could compare the device allocatable count with the capacity and, if the capacity is greater, start a timer. Once the timer reaches a threshold, the health controller kills and recreates the node.
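
To illustrate the signal such a controller watches for, here is a sketch of the relevant part of a Node status, with example.com/my-device as a placeholder resource name:

status:
  capacity:
    example.com/my-device: "8"
  allocatable:
    example.com/my-device: "7"   # one device is reported unhealthy, so allocatable < capacity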

There are problems with the health controller approach:

There are variations of the health controller solving some of the problems above. The overall theme here though is that to best handle failed devices, you need customized handling for the specific workload. Kubernetes doesn't yet offer enough abstraction to express how critical the device is for a node, for the cluster, and for the Pod it is assigned to.

Pod failure policy

Another DIY approach for device failure handling is a per-pod reaction on a failed device. This approach is applicable for training workloads that are implemented as Jobs.

A Pod can use special exit codes to signal device failures. For example, whenever unexpected device behavior is encountered, the Pod exits with a special exit code. The Pod failure policy can then handle the device failure in a special way. Read more in Handling retriable and non-retriable pod failures with Pod failure policy.
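
As a hedged sketch of this pattern, assume the training code exits with code 42 when it detects a broken device (both the exit code and the image are hypothetical); a Job podFailurePolicy can then retry such failures without counting them against the backoff limit:

apiVersion: batch/v1
kind: Job
metadata:
  name: training-job
spec:
  backoffLimit: 3
  podFailurePolicy:
    rules:
    - action: Ignore            # retry without counting towards backoffLimit
      onExitCodes:
        containerName: main
        operator: In
        values: [42]            # hypothetical "device failed" exit code
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: example.com/training:latest   # hypothetical image
        resources:
          limits:
            example.com/my-device: 1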

There are some problems with the Pod failure policy approach for Jobs:

So, this solution has limited applicability.

Custom pod watcher

A slightly more generic approach is to implement a Pod watcher as a DIY solution, or to use a third-party tool offering this functionality. The pod watcher is most often used to handle device failures for inference workloads.

Since Kubernetes keeps a pod assigned to a device even if the device is reportedly unhealthy, the idea is to detect this situation with the pod watcher and apply some remediation. This often involves obtaining the device health status and its mapping to the Pod using the Pod Resources API on the node. If a device fails, the watcher can then delete the attached Pod as a remediation. The ReplicaSet will handle the Pod recreation on a healthy device.
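
Such a watcher is typically deployed as a DaemonSet that mounts the kubelet's Pod Resources socket. The sketch below assumes a hypothetical watcher image and uses the default socket directory /var/lib/kubelet/pod-resources:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: device-pod-watcher
spec:
  selector:
    matchLabels:
      app: device-pod-watcher
  template:
    metadata:
      labels:
        app: device-pod-watcher
    spec:
      containers:
      - name: watcher
        image: example.com/device-pod-watcher:latest   # hypothetical image
        volumeMounts:
        - name: pod-resources
          mountPath: /var/lib/kubelet/pod-resources
          readOnly: true
      volumes:
      - name: pod-resources
        hostPath:
          path: /var/lib/kubelet/pod-resources   # directory containing the kubelet's Pod Resources socket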

The other reasons to implement this watcher:

Problems with the custom pod watcher:

There are more variations of DIY solutions for handling device failures or upcoming maintenance. Overall, Kubernetes has enough extension points to implement these solutions. However, some extension points require higher privilege than users may be comfortable with or are too disruptive. The roadmap section goes into more details on specific improvements in handling the device failures.

Failure modes: container code failed

When the container code fails or something bad happens to it, like an out-of-memory condition, Kubernetes knows how to handle those cases: it either restarts the container or, for a Pod with restartPolicy: Never, fails the Pod so that it can be recreated and scheduled on another node. Kubernetes has limited expressiveness on what counts as a failure (for example, a non-zero exit code or a liveness probe failure) and how to react to such a failure (mostly either always restart or immediately fail the Pod).

This level of expressiveness is often not enough for complicated AI/ML workloads. AI/ML pods are better rescheduled locally or even restarted in-place, as that saves on image pulling time and device allocation. AI/ML pods are often interconnected and need to be restarted together. This adds another level of complexity, and optimizing it often brings major savings in running AI/ML workloads.

There are various DIY solutions for orchestrating Pod failures. The most typical one is to wrap the main executable in the container with some orchestrator, which restarts the main executable whenever the job needs to be restarted because some other pod has failed.

Solutions like this are very fragile and elaborate. They are often worth the money saved compared to a regular JobSet delete/recreate cycle when used in large training jobs. Making these solutions less fragile and more streamlined, by developing new hooks and extension points in Kubernetes, will make them easier to apply to smaller jobs as well, benefiting everybody.

Failure modes: device degradation

Not all device failures are terminal for the overall workload or batch job. As the hardware stack gets more and more complex, a misconfiguration on one of the hardware stack layers, or a driver failure, may result in devices that are functional but lagging in performance. One device that is lagging behind can slow down the whole training job.

We see reports of such cases more and more often. Kubernetes has no way to express this type of failure today, and since it is the newest failure mode, there is little best-practice guidance from hardware vendors for detection, or third-party tooling for remediation.

Typically, these failures are detected based on observed workload characteristics, for example, the expected speed of AI/ML training steps on particular hardware. Remediation for these issues depends highly on the needs of the workload.

Roadmap

As outlined in the section above, Kubernetes offers a lot of extension points which are used to implement various DIY solutions. The AI/ML space is developing very fast, with changing requirements and usage patterns. SIG Node is taking a measured approach, favoring the enablement of more extension points for implementing workload-specific scenarios over the introduction of new semantics to support specific scenarios. This means prioritizing making information about failures readily available over implementing automatic remediations for those failures that might only be suitable for a subset of workloads.

This approach ensures there are no drastic changes for workload handling which may break existing, well-oiled DIY solutions or experiences with the existing more traditional workloads.

Many error handling techniques used today work for AI/ML, but are very expensive. SIG Node will invest in extension points to make those cheaper, with the understanding that cutting these costs for AI/ML workloads is critical.

The following is the set of specific investments we envision for various failure modes.

Roadmap for failure modes: K8s infrastructure

The area of Kubernetes infrastructure is the easiest to understand and very important to make right for the upcoming transition from Device Plugins to DRA. SIG Node is tracking many work items in this area, most notably the following:

Basically, every interaction between Kubernetes components must be made reliable, either through kubelet improvements or through best practices in plugin development and deployment.

Roadmap for failure modes: device failed

For device failures, some patterns are already emerging in common scenarios that Kubernetes can support. However, the very first step is to make information about failed devices easier to obtain; this is the focus of KEP 4680 (Add Resource Health Status to the Pod Status for Device Plugin and DRA).
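
Once that work lands, device health should surface in the Pod status roughly along these lines (an illustrative sketch based on KEP 4680; the exact field names may change):

status:
  containerStatuses:
  - name: main
    allocatedResourcesStatus:
    - name: example.com/my-device       # placeholder resource name
      resources:
      - resourceID: dev-1
        health: Unhealthy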

Longer-term ideas that still need to be validated include:

Roadmap for failure modes: container code failed

The main improvements to handle container code failures for AI/ML workloads all target cheaper error handling and recovery. Most of the savings come from reusing pre-allocated resources as much as possible: reusing Pods by restarting containers in-place, restarting containers locally on the node instead of rescheduling whenever possible, adding snapshotting support, and prioritizing the same node during rescheduling to save on image pulls.

Consider this scenario: a big training job needs 512 Pods to run, and one of the Pods fails. That means all Pods need to be interrupted and synced up to restart the failed step. The most efficient way to achieve this is generally to reuse as many Pods as possible by restarting them in-place, while replacing the failed pod to clear up the error from it, as demonstrated in this picture:

The picture shows 512 Pods; most of them are green and have a recycle sign next to them, indicating that they can be reused, while one Pod is drawn in red with a new green replacement Pod next to it, indicating that it needs to be replaced.

It is possible to implement this scenario, but all solutions implementing it are fragile due to lack of certain extension points in Kubernetes. Adding these extension points to implement this scenario is on the Kubernetes roadmap.

Roadmap for failure modes: device degradation

There is very little done in this area - there is no clear detection signal, very limited troubleshooting tooling, and no built-in semantics to express the "degraded" device on Kubernetes. There has been discussion of adding data on device performance or degradation in the ResourceSlice used by DRA to represent devices, but it is not yet clearly defined. There are also projects like node-healthcheck-operator that can be used for some scenarios.

We expect developments in this area from hardware vendors and cloud providers, and we expect to see mostly DIY solutions in the near future. As more users get exposed to AI/ML workloads, this is a space that needs feedback on the patterns being used.

Join the conversation

The Kubernetes community encourages feedback and participation in shaping the future of device failure handling. Join SIG Node and contribute to the ongoing discussions!

This blog post provides a high-level overview of the challenges and future directions for device failure management in Kubernetes. By addressing these issues, Kubernetes can solidify its position as the leading platform for AI/ML workloads, ensuring resilience and reliability for applications that depend on specialized hardware.


25 Jun 2025

Kubernetes Blog

Image Compatibility In Cloud Native Environments

In industries where systems must run very reliably and meet strict performance criteria such as telecommunication, high-performance or AI computing, containerized applications often need specific operating system configuration or hardware presence. It is common practice to require the use of specific versions of the kernel, its configuration, device drivers, or system components. Despite the existence of the Open Container Initiative (OCI), a governing community to define standards and specifications for container images, there has been a gap in expression of such compatibility requirements. The need to address this issue has led to different proposals and, ultimately, an implementation in Kubernetes' Node Feature Discovery (NFD).

NFD is an open source Kubernetes project that automatically detects and reports hardware and system features of cluster nodes. This information helps users to schedule workloads on nodes that meet specific system requirements, which is especially useful for applications with strict hardware or operating system dependencies.

The need for image compatibility specification

Dependencies between containers and host OS

A container image is built on a base image, which provides a minimal runtime environment, often a stripped-down Linux userland, completely empty or distroless. When an application requires certain features from the host OS, compatibility issues arise. These dependencies can manifest in several ways:

While containers in Kubernetes are the most likely unit of abstraction for these needs, the definition of compatibility can extend further to include other container technologies such as Singularity and other OCI artifacts such as binaries from a spack binary cache.

Multi-cloud and hybrid cloud challenges

Containerized applications are deployed across various Kubernetes distributions and cloud providers, where different host operating systems introduce compatibility challenges. Often those have to be pre-configured before workload deployment or are immutable. For instance, different cloud providers will include different operating systems like:

Each OS comes with unique kernel versions, configurations, and drivers, making compatibility a non-trivial issue for applications requiring specific features. It must be possible to quickly assess a container for its suitability to run on any specific environment.

Image compatibility initiative

An effort was made within the Open Containers Initiative Image Compatibility working group to introduce a standard for image compatibility metadata. A specification for compatibility would allow container authors to declare required host OS features, making compatibility requirements discoverable and programmable. The specification implemented in Kubernetes Node Feature Discovery is one of the discussed proposals. It aims to:

The concept has since been implemented in the Kubernetes Node Feature Discovery project.

Implementation in Node Feature Discovery

The solution integrates compatibility metadata into Kubernetes via NFD features and the NodeFeatureGroup API. This interface enables the user to match containers to nodes based on exposed hardware and software features, allowing for intelligent scheduling and workload optimization.

Compatibility specification

The compatibility specification is a structured list of compatibility objects containing Node Feature Groups. These objects define image requirements and facilitate validation against host nodes. The feature requirements are described by using the list of available features from the NFD project. The schema has the following structure:

An example might look like the following:

version: v1alpha1
compatibilities:
- description: "My image requirements"
  rules:
  - name: "kernel and cpu"
    matchFeatures:
    - feature: kernel.loadedmodule
      matchExpressions:
        vfio-pci: {op: Exists}
    - feature: cpu.model
      matchExpressions:
        vendor_id: {op: In, value: ["Intel", "AMD"]}
  - name: "one of available nics"
    matchAny:
    - matchFeatures:
      - feature: pci.device
        matchExpressions:
          vendor: {op: In, value: ["0eee"]}
          class: {op: In, value: ["0200"]}
    - matchFeatures:
      - feature: pci.device
        matchExpressions:
          vendor: {op: In, value: ["0fff"]}
          class: {op: In, value: ["0200"]}

Client implementation for node validation

To streamline compatibility validation, we implemented a client tool that allows for node validation based on an image's compatibility artifact. In this workflow, the image author would generate a compatibility artifact that points to the image it describes in a registry via the referrers API. When a need arises to assess the fit of an image to a host, the tool can discover the artifact and verify compatibility of an image to a node before deployment. The client can validate nodes both inside and outside a Kubernetes cluster, extending the utility of the tool beyond the single Kubernetes use case. In the future, image compatibility could play a crucial role in creating specific workload profiles based on image compatibility requirements, aiding in more efficient scheduling. Additionally, it could potentially enable automatic node configuration to some extent, further optimizing resource allocation and ensuring seamless deployment of specialized workloads.

Examples of usage

  1. Define image compatibility metadata

    A container image can have metadata that describes its requirements based on features discovered from nodes, like kernel modules or CPU models. The previous compatibility specification example in this article exemplified this use case.

  2. Attach the artifact to the image

    The image compatibility specification is stored as an OCI artifact. You can attach this metadata to your container image using the oras tool. The registry only needs to support OCI artifacts; support for arbitrary types is not required. Keep in mind that the container image and the artifact must be stored in the same registry. Use the following command to attach the artifact to the image:

    oras attach \
      --artifact-type application/vnd.nfd.image-compatibility.v1alpha1 <image-url> \
      <path-to-spec>.yaml:application/vnd.nfd.image-compatibility.spec.v1alpha1+yaml
    
  3. Validate image compatibility

    After attaching the compatibility specification, you can validate whether a node meets the image's requirements. This validation can be done using the nfd client:

    nfd compat validate-node --image <image-url>
    
  4. Read the output from the client

    Finally you can read the report generated by the tool or use your own tools to act based on the generated JSON report.

    validate-node command output

Conclusion

The addition of image compatibility to Kubernetes through Node Feature Discovery underscores the growing importance of addressing compatibility in cloud native environments. It is only a start, as further work is needed to integrate compatibility into scheduling of workloads within and outside of Kubernetes. However, by integrating this feature into Kubernetes, mission-critical workloads can now define and validate host OS requirements more efficiently. Moving forward, the adoption of compatibility metadata within Kubernetes ecosystems will significantly enhance the reliability and performance of specialized containerized applications, ensuring they meet the stringent requirements of industries like telecommunications, high-performance computing or any environment that requires special hardware or host OS configuration.

Get involved

Join the Kubernetes Node Feature Discovery project if you're interested in getting involved with the design and development of Image Compatibility API and tools. We always welcome new contributors.


16 Jun 2025

Kubernetes Blog

Changes to Kubernetes Slack

UPDATE: We've received notice from Salesforce that our Slack workspace WILL NOT BE DOWNGRADED on June 20th. Stand by for more details, but for now, there is no urgency to back up private channels or direct messages.

Kubernetes Slack will lose its special status and will be changing into a standard free Slack on June 20, 2025. Sometime later this year, our community may move to a new platform. If you are responsible for a channel or private channel, or are a member of a User Group, you will need to take some actions as soon as you can.

For the last decade, Slack has supported our project with a free customized enterprise account. They have let us know that they can no longer do so, particularly since our Slack is one of the largest and most active ones on the platform. As such, they will be downgrading it to a standard free Slack while we decide on, and implement, other options.

On Friday, June 20, we will be subject to the feature limitations of free Slack. The primary ones which will affect us will be only retaining 90 days of history, and having to disable several apps and workflows which we are currently using. The Slack Admin team will do their best to manage these limitations.

Responsible channel owners, members of private channels, and members of User Groups should take some actions to prepare for the downgrade and preserve information as soon as possible.

The CNCF Projects Staff have proposed that our community look at migrating to Discord. Because of existing issues where we have been pushing the limits of Slack, they have already explored what a Kubernetes Discord would look like. Discord would allow us to implement new tools and integrations which would help the community, such as GitHub group membership synchronization. The Steering Committee will discuss and decide on our future platform.

Please see our FAQ, and check the kubernetes-dev mailing list and the #announcements channel for further news. If you have specific feedback on our Slack status join the discussion on GitHub.


10 Jun 2025

Kubernetes Blog

Enhancing Kubernetes Event Management with Custom Aggregation

Kubernetes Events provide crucial insights into cluster operations, but as clusters grow, managing and analyzing these events becomes increasingly challenging. This blog post explores how to build custom event aggregation systems that help engineering teams better understand cluster behavior and troubleshoot issues more effectively.

The challenge with Kubernetes events

In a Kubernetes cluster, events are generated for various operations - from pod scheduling and container starts to volume mounts and network configurations. While these events are invaluable for debugging and monitoring, several challenges emerge in production environments:

  1. Volume: Large clusters can generate thousands of events per minute
  2. Retention: Default event retention is limited to one hour
  3. Correlation: Related events from different components are not automatically linked
  4. Classification: Events lack standardized severity or category classifications
  5. Aggregation: Similar events are not automatically grouped

To learn more about Events in Kubernetes, read the Event API reference.
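
For reference, a single Event object in the events.k8s.io/v1 API looks roughly like this (an illustrative example with made-up object names):

apiVersion: events.k8s.io/v1
kind: Event
metadata:
  name: web-7b9d6f-abcde.17f3a1b2c4d5e6f7
  namespace: default
type: Warning
reason: FailedMount
action: Mounting
note: "MountVolume.SetUp failed for volume data-vol: timed out waiting for the condition"
regarding:
  apiVersion: v1
  kind: Pod
  name: web-7b9d6f-abcde
  namespace: default
reportingController: kubelet
reportingInstance: node-1
eventTime: "2025-06-10T08:22:31.000000Z"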

Real-World value

Consider a production environment with tens of microservices where the users report intermittent transaction failures:

Traditional event aggregation process: engineers waste hours sifting through thousands of standalone events spread across namespaces. By the time they look into an issue, the older events have long since been purged, and correlating pod restarts to node-level issues is practically impossible.

With custom event aggregation in place: the system groups events across resources, instantly surfacing correlation patterns such as volume mount timeouts preceding pod restarts. History shows the same pattern occurred during past record traffic spikes, highlighting a storage scalability issue in minutes rather than hours.

The benefit of this approach is that organizations that implement it commonly cut their troubleshooting time significantly and increase the reliability of their systems by detecting patterns early.

Building an Event aggregation system

This post explores how to build a custom event aggregation system that addresses these challenges, aligned to Kubernetes best practices. I've picked the Go programming language for my example.

Architecture overview

This event aggregation system consists of three main components:

  1. Event Watcher: Monitors the Kubernetes API for new events
  2. Event Processor: Processes, categorizes, and correlates events
  3. Storage Backend: Stores processed events for longer retention

Here's a sketch for how to implement the event watcher:

package main

import (
 "context"
 metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
 "k8s.io/client-go/kubernetes"
 "k8s.io/client-go/rest"
 eventsv1 "k8s.io/api/events/v1"
)

type EventWatcher struct {
 clientset *kubernetes.Clientset
}

func NewEventWatcher(config *rest.Config) (*EventWatcher, error) {
 clientset, err := kubernetes.NewForConfig(config)
 if err != nil {
 return nil, err
 }
 return &EventWatcher{clientset: clientset}, nil
}

func (w *EventWatcher) Watch(ctx context.Context) (<-chan *eventsv1.Event, error) {
 events := make(chan *eventsv1.Event)

 watcher, err := w.clientset.EventsV1().Events("").Watch(ctx, metav1.ListOptions{})
 if err != nil {
 return nil, err
 }

 go func() {
 defer close(events)
 for {
 select {
 case event := <-watcher.ResultChan():
 if e, ok := event.Object.(*eventsv1.Event); ok {
 events <- e
 }
 case <-ctx.Done():
 watcher.Stop()
 return
 }
 }
 }()

 return events, nil
}
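
One assumption in the sketch above is that the watcher is allowed to watch events cluster-wide. If it runs in-cluster, a minimal RBAC setup could look like the following (the event-aggregator ServiceAccount and monitoring namespace are hypothetical):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: event-watcher
rules:
- apiGroups: ["", "events.k8s.io"]   # core and events.k8s.io event resources
  resources: ["events"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: event-watcher
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: event-watcher
subjects:
- kind: ServiceAccount
  name: event-aggregator     # hypothetical ServiceAccount the watcher runs under
  namespace: monitoring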

Event processing and classification

The event processor enriches events with additional context and classification:

type EventProcessor struct {
 categoryRules []CategoryRule
 correlationRules []CorrelationRule
}

type ProcessedEvent struct {
 Event *eventsv1.Event
 Category string
 Severity string
 CorrelationID string
 Metadata map[string]string
}

func (p *EventProcessor) Process(event *eventsv1.Event) *ProcessedEvent {
 processed := &ProcessedEvent{
 Event: event,
 Metadata: make(map[string]string),
 }

 // Apply classification rules
 processed.Category = p.classifyEvent(event)
 processed.Severity = p.determineSeverity(event)

 // Generate correlation ID for related events
 processed.CorrelationID = p.correlateEvent(event)

 // Add useful metadata
 processed.Metadata = p.extractMetadata(event)

 return processed
}

Implementing Event correlation

One of the key features you could implement is a way of correlating related Events. Here's an example correlation strategy:

func (p *EventProcessor) correlateEvent(event *eventsv1.Event) string {
 // Correlation strategies:
 // 1. Time-based: Events within a time window
 // 2. Resource-based: Events affecting the same resource
 // 3. Causation-based: Events with cause-effect relationships

 correlationKey := generateCorrelationKey(event)
 return correlationKey
}

func generateCorrelationKey(event *eventsv1.Event) string {
 // Example: Combine namespace, resource type, and name
 // (events.k8s.io/v1 exposes the object an event is about via the Regarding field)
 return fmt.Sprintf("%s/%s/%s",
 event.Regarding.Namespace,
 event.Regarding.Kind,
 event.Regarding.Name,
 )
}

Event storage and retention

For long-term storage and analysis, you'll probably want a backend that supports:

Here's a sample storage interface:

type EventStorage interface {
 Store(context.Context, *ProcessedEvent) error
 Query(context.Context, EventQuery) ([]ProcessedEvent, error)
 Aggregate(context.Context, AggregationParams) ([]EventAggregate, error)
}

type EventQuery struct {
 TimeRange TimeRange
 Categories []string
 Severity []string
 CorrelationID string
 Limit int
}

type AggregationParams struct {
 GroupBy []string
 TimeWindow string
 Metrics []string
}

Good practices for Event management

  1. Resource Efficiency

    • Implement rate limiting for event processing
    • Use efficient filtering at the API server level
    • Batch events for storage operations
  2. Scalability

    • Distribute event processing across multiple workers
    • Use leader election for coordination
    • Implement backoff strategies for API rate limits
  3. Reliability

    • Handle API server disconnections gracefully
    • Buffer events during storage backend unavailability
    • Implement retry mechanisms with exponential backoff

Advanced features

Pattern detection

Implement pattern detection to identify recurring issues:

type PatternDetector struct {
 patterns map[string]*Pattern
 threshold int
}

func (d *PatternDetector) Detect(events []ProcessedEvent) []Pattern {
 // Group similar events
 groups := groupSimilarEvents(events)

 // Analyze frequency and timing
 patterns := identifyPatterns(groups)

 return patterns
}

func groupSimilarEvents(events []ProcessedEvent) map[string][]ProcessedEvent {
 groups := make(map[string][]ProcessedEvent)

 for _, event := range events {
 // Create similarity key based on event characteristics
 // (Regarding is the events.k8s.io/v1 counterpart of the core/v1 InvolvedObject field)
 similarityKey := fmt.Sprintf("%s:%s:%s",
 event.Event.Reason,
 event.Event.Regarding.Kind,
 event.Event.Regarding.Namespace,
 )

 // Group events with the same key
 groups[similarityKey] = append(groups[similarityKey], event)
 }

 return groups
}


func identifyPatterns(groups map[string][]ProcessedEvent) []Pattern {
 var patterns []Pattern

 for key, events := range groups {
 // Only consider groups with enough events to form a pattern
 if len(events) < 3 {
 continue
 }

 // Sort events by time (events.k8s.io/v1 carries the classic timestamps in the
 // Deprecated* fields; EventTime and Series hold the newer, higher-resolution data)
 sort.Slice(events, func(i, j int) bool {
 return events[i].Event.DeprecatedLastTimestamp.Time.Before(events[j].Event.DeprecatedLastTimestamp.Time)
 })

 // Calculate time range and frequency
 firstSeen := events[0].Event.DeprecatedFirstTimestamp.Time
 lastSeen := events[len(events)-1].Event.DeprecatedLastTimestamp.Time
 duration := lastSeen.Sub(firstSeen).Minutes()

 var frequency float64
 if duration > 0 {
 frequency = float64(len(events)) / duration
 }

 // Create a pattern if it meets threshold criteria
 if frequency > 0.5 { // More than 1 event per 2 minutes
 pattern := Pattern{
 Type: key,
 Count: len(events),
 FirstSeen: firstSeen,
 LastSeen: lastSeen,
 Frequency: frequency,
 EventSamples: events[:min(3, len(events))], // Keep up to 3 samples
 }
 patterns = append(patterns, pattern)
 }
 }

 return patterns
}

With this implementation, the system can identify recurring patterns such as node pressure events, pod scheduling failures, or networking issues that occur with a specific frequency.

Real-time alerts

The following example provides a starting point for building an alerting system based on event patterns. It is not a complete solution but a conceptual sketch to illustrate the approach.

type AlertManager struct {
 rules []AlertRule
 notifiers []Notifier
}

func (a *AlertManager) EvaluateEvents(events []ProcessedEvent) {
 for _, rule := range a.rules {
 if rule.Matches(events) {
 alert := rule.GenerateAlert(events)
 a.notify(alert)
 }
 }
}

Conclusion

A well-designed event aggregation system can significantly improve cluster observability and troubleshooting capabilities. By implementing custom event processing, correlation, and storage, operators can better understand cluster behavior and respond to issues more effectively.

The solutions presented here can be extended and customized based on specific requirements while maintaining compatibility with the Kubernetes API and following best practices for scalability and reliability.

Next steps

Future enhancements could include:

For more information on Kubernetes events and custom controllers, refer to the official Kubernetes documentation.


05 Jun 2025

Kubernetes Blog

Introducing Gateway API Inference Extension

Modern generative AI and large language model (LLM) services create unique traffic-routing challenges on Kubernetes. Unlike typical short-lived, stateless web requests, LLM inference sessions are often long-running, resource-intensive, and partially stateful. For example, a single GPU-backed model server may keep multiple inference sessions active and maintain in-memory token caches.

Traditional load balancers focused on HTTP path or round-robin lack the specialized capabilities needed for these workloads. They also don't account for model identity or request criticality (e.g., interactive chat vs. batch jobs). Organizations often patch together ad-hoc solutions, but a standardized approach is missing.

Gateway API Inference Extension

Gateway API Inference Extension was created to address this gap by building on the existing Gateway API, adding inference-specific routing capabilities while retaining the familiar model of Gateways and HTTPRoutes. By adding an inference extension to your existing gateway, you effectively transform it into an Inference Gateway, enabling you to self-host GenAI/LLMs with a "model-as-a-service" mindset.

The project's goal is to improve and standardize routing to inference workloads across the ecosystem. Key objectives include enabling model-aware routing, supporting per-request criticalities, facilitating safe model roll-outs, and optimizing load balancing based on real-time model metrics. By achieving these, the project aims to reduce latency and improve accelerator (GPU) utilization for AI workloads.

How it works

The design introduces two new Custom Resources (CRDs) with distinct responsibilities, each aligning with a specific user persona in the AI/ML serving workflow:

Resource Model
  1. InferencePool Defines a pool of pods (model servers) running on shared compute (e.g., GPU nodes). The platform admin can configure how these pods are deployed, scaled, and balanced. An InferencePool ensures consistent resource usage and enforces platform-wide policies. An InferencePool is similar to a Service but specialized for AI/ML serving needs and aware of the model-serving protocol.

  2. InferenceModel A user-facing model endpoint managed by AI/ML owners. It maps a public name (e.g., "gpt-4-chat") to the actual model within an InferencePool. This lets workload owners specify which models (and optional fine-tuning) they want served, plus a traffic-splitting or prioritization policy.
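
As a rough sketch of how the two resources relate, based on the experimental v1alpha2 API (field names may have changed since; treat this as illustrative rather than authoritative):

apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: llm-pool
spec:
  targetPortNumber: 8000
  selector:
    app: llm-server                 # placeholder label on the model server pods
  extensionRef:
    name: llm-endpoint-picker       # hypothetical Endpoint Selection Extension service
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: gpt-4-chat
spec:
  modelName: gpt-4-chat             # public model name exposed to clients
  criticality: Critical
  poolRef:
    name: llm-pool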

In summary, the InferenceModel API lets AI/ML owners manage what is served, while the InferencePool lets platform operators manage where and how it's served.

Request flow

The flow of a request builds on the Gateway API model (Gateways and HTTPRoutes) with one or more extra inference-aware steps (extensions) in the middle. Here's a high-level example of the request flow with the Endpoint Selection Extension (ESE):

Request Flow
  1. Gateway Routing
    A client sends a request (e.g., an HTTP POST to /completions). The Gateway (like Envoy) examines the HTTPRoute and identifies the matching InferencePool backend.

  2. Endpoint Selection
    Instead of simply forwarding to any available pod, the Gateway consults an inference-specific routing extension, the Endpoint Selection Extension, to pick the best of the available pods. This extension examines live pod metrics (queue lengths, memory usage, loaded adapters) to choose the ideal pod for the request.

  3. Inference-Aware Scheduling
    The chosen pod is the one that can handle the request with the lowest latency or highest efficiency, given the user's criticality or resource needs. The Gateway then forwards traffic to that specific pod.

Endpoint Extension Scheduling

This extra step provides a smarter, model-aware routing mechanism that still feels like a normal single request to the client. Additionally, the design is extensible: any Inference Gateway can be enhanced with additional inference-specific extensions to handle new routing strategies, advanced scheduling logic, or specialized hardware needs. As the project continues to grow, contributors are encouraged to develop new extensions that are fully compatible with the same underlying Gateway API model, further expanding the possibilities for efficient and intelligent GenAI/LLM routing.

Benchmarks

We evaluated this extension against a standard Kubernetes Service for a vLLM-based model serving deployment. The test environment consisted of multiple H100 (80 GB) GPU pods running vLLM (version 1) on a Kubernetes cluster, with 10 Llama2 model replicas. The Latency Profile Generator (LPG) tool was used to generate traffic and measure throughput, latency, and other metrics. The ShareGPT dataset served as the workload, and traffic was ramped from 100 Queries per Second (QPS) up to 1000 QPS.

Key results

Endpoint Extension Scheduling

These results suggest that this extension's model-aware routing significantly reduced latency for GPU-backed LLM workloads. By dynamically selecting the least-loaded or best-performing model server, it avoids hotspots that can appear when using traditional load balancing methods for large, long-running inference requests.

Roadmap

As the Gateway API Inference Extension heads toward GA, planned features include:

  1. Prefix-cache aware load balancing for remote caches
  2. LoRA adapter pipelines for automated rollout
  3. Fairness and priority between workloads in the same criticality band
  4. HPA support for scaling based on aggregate, per-model metrics
  5. Support for large multi-modal inputs/outputs
  6. Additional model types (e.g., diffusion models)
  7. Heterogeneous accelerators (serving on multiple accelerator types with latency- and cost-aware load balancing)
  8. Disaggregated serving for independently scaling pools

Summary

By aligning model serving with Kubernetes-native tooling, Gateway API Inference Extension aims to simplify and standardize how AI/ML traffic is routed. With model-aware routing, criticality-based prioritization, and more, it helps ops teams deliver the right LLM services to the right users, smoothly and efficiently.

Ready to learn more? Visit the project docs to dive deeper, give an Inference Gateway extension a try with a few simple steps, and get involved if you're interested in contributing to the project!


03 Jun 2025

Kubernetes Blog

Start Sidecar First: How To Avoid Snags

From the Kubernetes Multicontainer Pods: An Overview blog post you know what their job is, what the main architectural patterns are, and how they are implemented in Kubernetes. The main thing I'll cover in this article is how to ensure that your sidecar containers start before the main app. It's more complicated than you might think!

A gentle refresher

I'd just like to remind readers that the v1.29.0 release of Kubernetes added native support for sidecar containers, which can now be defined within the .spec.initContainers field, but with restartPolicy: Always. You can see that illustrated in the following example Pod manifest snippet:

initContainers:
  - name: logshipper
    image: alpine:latest
    restartPolicy: Always # this is what makes it a sidecar container
    command: ['sh', '-c', 'tail -F /opt/logs.txt']
    volumeMounts:
      - name: data
        mountPath: /opt

What are the specifics of defining sidecars with a .spec.initContainers block, rather than as a legacy multi-container pod with multiple .spec.containers? Well, all .spec.initContainers are always launched before the main application. If you define Kubernetes-native sidecars, those are terminated after the main application. Furthermore, when used with Jobs, a sidecar container should still be alive and could potentially even restart after the owning Job is complete; Kubernetes-native sidecar containers do not block pod completion.

To learn more, you can also read the official Pod sidecar containers tutorial.
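
To illustrate the point about Jobs, here is a sketch of a Job whose log-shipping sidecar (the same pattern as above) runs alongside the worker container without blocking Job completion; the images and commands are placeholders:

apiVersion: batch/v1
kind: Job
metadata:
  name: batch-with-sidecar
spec:
  template:
    spec:
      restartPolicy: Never
      initContainers:
        - name: logshipper
          image: alpine:latest
          restartPolicy: Always   # native sidecar; it does not block Job completion
          command: ['sh', '-c', 'tail -F /opt/logs.txt']
          volumeMounts:
            - name: data
              mountPath: /opt
      containers:
        - name: worker
          image: alpine:latest
          command: ['sh', '-c', 'echo "doing work" >> /opt/logs.txt; sleep 10']
          volumeMounts:
            - name: data
              mountPath: /opt
      volumes:
        - name: data
          emptyDir: {}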

The problem

Now you know that defining a sidecar with this native approach will always start it before the main application. From the kubelet source code, it's visible that this often means being started almost in parallel, and this is not always what an engineer wants to achieve. What I'm really interested in is whether I can delay the start of the main application until the sidecar is not just started, but fully running and ready to serve.

It might be a bit tricky because the problem with sidecars is there's no obvious success signal, contrary to init containers - designed to run only for a specified period of time. With an init container, exit status 0 is unambiguously "I succeeded". With a sidecar, there are lots of points at which you can say "a thing is running". Starting one container only after the previous one is ready is part of a graceful deployment strategy, ensuring proper sequencing and stability during startup.

It's also actually how I'd expect sidecar containers to work as well, to cover the scenario where the main application is dependent on the sidecar. For example, it may happen that an app errors out if the sidecar isn't available to serve requests (e.g., logging with DataDog). Sure, one could change the application code (and it would actually be the "best practice" solution), but sometimes they can't - and this post focuses on this use case.

I'll explain some ways that you might try, and show you what approaches will really work.

Readiness probe

To check whether Kubernetes native sidecar delays the start of the main application until the sidecar is ready, let's simulate a short investigation. Firstly, I'll simulate a sidecar container which will never be ready by implementing a readiness probe which will never succeed. As a reminder, a readiness probe checks if the container is ready to start accepting traffic and therefore, if the pod can be used as a backend for services.

(Unlike standard init containers, sidecar containers can have probes so that the kubelet can supervise the sidecar and intervene if there are problems. For example, restarting a sidecar container if it fails a health check.)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  labels:
    app: myapp
spec:
  replicas: 1
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: myapp
        image: alpine:latest
        command: ["sh", "-c", "sleep 3600"]
      initContainers:
      - name: nginx
        image: nginx:latest
        restartPolicy: Always
        ports:
        - containerPort: 80
          protocol: TCP
        readinessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - exit 1 # this command always fails, keeping the container "Not Ready"
          periodSeconds: 5
      volumes:
      - name: data
        emptyDir: {}

The result is:

controlplane $ kubectl get pods -w
NAME READY STATUS RESTARTS AGE
myapp-db5474f45-htgw5 1/2 Running 0 9m28s

controlplane $ kubectl describe pod myapp-db5474f45-htgw5
Name: myapp-db5474f45-htgw5
Namespace: default
(...)
Events:
 Type Reason Age From Message
 ---- ------ ---- ---- -------
 Normal Scheduled 17s default-scheduler Successfully assigned default/myapp-db5474f45-htgw5 to node01
 Normal Pulling 16s kubelet Pulling image "nginx:latest"
 Normal Pulled 16s kubelet Successfully pulled image "nginx:latest" in 163ms (163ms including waiting). Image size: 72080558 bytes.
 Normal Created 16s kubelet Created container nginx
 Normal Started 16s kubelet Started container nginx
 Normal Pulling 15s kubelet Pulling image "alpine:latest"
 Normal Pulled 15s kubelet Successfully pulled image "alpine:latest" in 159ms (160ms including waiting). Image size: 3652536 bytes.
 Normal Created 15s kubelet Created container myapp
 Normal Started 15s kubelet Started container myapp
 Warning Unhealthy 1s (x6 over 15s) kubelet Readiness probe failed:

From these logs it's evident that only one container is ready - and I know it can't be the sidecar, because I've defined it so it'll never be ready (you can also check container statuses in kubectl get pod -o json). I also saw that myapp has been started before the sidecar is ready. That was not the result I wanted to achieve; in this case, the main app container has a hard dependency on its sidecar.

Maybe a startup probe?

To ensure that the sidecar is ready before the main app container starts, I can define a startupProbe. It will delay the start of the main container until the probe succeeds. If you're wondering why I've added it to my initContainer, let's analyse what would happen if I'd added it to the myapp container instead: there would be no guarantee that the probe runs before the main application code - which can potentially error out without the sidecar being up and running.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  labels:
    app: myapp
spec:
  replicas: 1
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: myapp
        image: alpine:latest
        command: ["sh", "-c", "sleep 3600"]
      initContainers:
      - name: nginx
        image: nginx:latest
        ports:
        - containerPort: 80
          protocol: TCP
        restartPolicy: Always
        startupProbe:
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 5
          periodSeconds: 30
          failureThreshold: 10
          timeoutSeconds: 20
      volumes:
      - name: data
        emptyDir: {}

This results in 2/2 containers being ready and running, and from events, it can be inferred that the main application started only after nginx had already been started. But to confirm whether it waited for the sidecar readiness, let's change the startupProbe to the exec type of command:

startupProbe:
  exec:
    command:
    - /bin/sh
    - -c
    - sleep 15

and run kubectl get pods -w to watch in real time whether the readiness of both containers only changes after a 15 second delay. Again, events confirm the main application starts after the sidecar. That means that using the startupProbe with a correct startupProbe.httpGet request helps to delay the main application start until the sidecar is ready. It's not optimal, but it works.

What about the postStart lifecycle hook?

Fun fact: using the postStart lifecycle hook block will also do the job, but I'd have to write my own mini-shell script, which is even less efficient.

initContainers:
  - name: nginx
    image: nginx:latest
    restartPolicy: Always
    ports:
      - containerPort: 80
        protocol: TCP
    lifecycle:
      postStart:
        exec:
          command:
          - /bin/sh
          - -c
          - |
            echo "Waiting for readiness at http://localhost:80"
            until curl -sf http://localhost:80; do
              echo "Still waiting for http://localhost:80..."
              sleep 5
            done
            echo "Service is ready at http://localhost:80"

Liveness probe

An interesting exercise would be to check the sidecar container behavior with a liveness probe. A liveness probe behaves and is configured similarly to a readiness probe - only with the difference that it doesn't affect the readiness of the container but restarts it in case the probe fails.

livenessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    - exit 1 # this command always fails, causing the kubelet to restart the container
  periodSeconds: 5

After adding the liveness probe configured just as the previous readiness probe and checking events of the pod by kubectl describe pod it's visible that the sidecar has a restart count above 0. Nevertheless, the main application is not restarted nor influenced at all, even though I'm aware that (in our imaginary worst-case scenario) it can error out when the sidecar is not there serving requests. What if I'd used a livenessProbe without lifecycle postStart? Both containers will be immediately ready: at the beginning, this behavior will not be different from the one without any additional probes since the liveness probe doesn't affect readiness at all. After a while, the sidecar will begin to restart itself, but it won't influence the main container.

Findings summary

I'll summarize the startup behavior below:

readinessProbe: the sidecar starts before the main app, but almost in parallel (effectively no ordering guarantee). The main app does not wait for the sidecar to be ready. If the probe never passes, the sidecar stays not ready and the main app continues running.

livenessProbe: the sidecar starts before the main app, but almost in parallel (effectively no ordering guarantee). The main app does not wait for the sidecar to be ready. If the probe never passes, the sidecar is restarted and the main app continues running.

startupProbe: the sidecar starts before the main app, and the main app waits for the sidecar to be ready. If the probe never passes, the main app is not started.

postStart: the main app container starts only after postStart completes, so it waits for the sidecar, but you have to provide the readiness logic yourself. If that logic never succeeds, the main app is not started.

To summarize: with sidecars often being a dependency of the main application, you may want to delay the start of the latter until the sidecar is healthy. The ideal pattern is to start both containers simultaneously and have the app container logic delay at all levels, but it's not always possible. If that's what you need, you have to use the right kind of customization to the Pod definition. Thankfully, it's nice and quick, and you have the recipe ready above.

Happy deploying!


02 Jun 2025

Kubernetes Blog

Gateway API v1.3.0: Advancements in Request Mirroring, CORS, Gateway Merging, and Retry Budgets

Gateway API logo

Join us in the Kubernetes SIG Network community in celebrating the general availability of Gateway API v1.3.0! We are also pleased to announce that there are already a number of conformant implementations to try, made possible by postponing this blog announcement. Version 1.3.0 of the API was released about a month ago on April 24, 2025.

Gateway API v1.3.0 brings a new feature to the Standard channel (Gateway API's GA release channel): percentage-based request mirroring, and introduces three new experimental features: cross-origin resource sharing (CORS) filters, a standardized mechanism for listener and gateway merging, and retry budgets.

Also see the full release notes and applaud the v1.3.0 release team next time you see them.

Graduation to Standard channel

Graduation to the Standard channel is a notable achievement for Gateway API features, as inclusion in the Standard release channel denotes a high level of confidence in the API surface and provides guarantees of backward compatibility. Of course, as with any other Kubernetes API, Standard channel features can continue to evolve with backward-compatible additions over time, and we (SIG Network) certainly expect further refinements and improvements in the future. For more information on how all of this works, refer to the Gateway API Versioning Policy.

Percentage-based request mirroring

Leads: Lior Lieberman,Jake Bennert

GEP-3171: Percentage-Based Request Mirroring

Percentage-based request mirroring is an enhancement to the existing support for HTTP request mirroring, which allows HTTP requests to be duplicated to another backend using the RequestMirror filter type. Request mirroring is particularly useful in blue-green deployment. It can be used to assess the impact of request scaling on application performance without impacting responses to clients.

The previous mirroring capability worked on all the requests to a backendRef. Percentage-based request mirroring allows users to specify a subset of requests they want to be mirrored, either by percentage or fraction. This can be particularly useful when services are receiving a large volume of requests. Instead of mirroring all of those requests, this new feature can be used to mirror a smaller subset of them.

Here's an example with 42% of the requests to "foo-v1" being mirrored to "foo-v2":

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: http-filter-mirror
  labels:
    gateway: mirror-gateway
spec:
  parentRefs:
  - name: mirror-gateway
  hostnames:
  - mirror.example
  rules:
  - backendRefs:
    - name: foo-v1
      port: 8080
    filters:
    - type: RequestMirror
      requestMirror:
        backendRef:
          name: foo-v2
          port: 8080
        percent: 42 # This value must be an integer.

You can also configure partial mirroring using a fraction. Here is an example with 5 out of every 1000 requests to "foo-v1" being mirrored to "foo-v2".

  rules:
  - backendRefs:
    - name: foo-v1
      port: 8080
    filters:
    - type: RequestMirror
      requestMirror:
        backendRef:
          name: foo-v2
          port: 8080
        fraction:
          numerator: 5
          denominator: 1000

Additions to Experimental channel

The Experimental channel is Gateway API's channel for experimenting with new features and gaining confidence with them before allowing them to graduate to standard. Please note: the experimental channel may include features that are changed or removed later.

Starting in release v1.3.0, in an effort to distinguish Experimental channel resources from Standard channel resources, any new experimental API kinds have the prefix "X". For the same reason, experimental resources are now added to the API group gateway.networking.x-k8s.io instead of gateway.networking.k8s.io. Bear in mind that using new experimental channel resources means they can coexist with standard channel resources, but migrating these resources to the standard channel will require recreating them with the standard channel names and API group (both of which lack the "x-k8s" designator or "X" prefix).

The v1.3 release introduces two new experimental API kinds: XBackendTrafficPolicy and XListenerSet. To be able to use experimental API kinds, you need to install the Experimental channel Gateway API YAMLs from the locations listed below.
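
For example, installing the Experimental channel typically looks like the following; the URL is assumed from the usual Gateway API release assets, so verify it against the v1.3.0 release notes:

kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.3.0/experimental-install.yaml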

CORS filtering

Leads: Liang Li, Eyal Pazz, Rob Scott

GEP-1767: CORS Filter

Cross-origin resource sharing (CORS) is an HTTP-header based mechanism that allows a web page to access restricted resources from a server on an origin (domain, scheme, or port) different from the domain that served the web page. This feature adds a new HTTPRoute filter type, called "CORS", to configure the handling of cross-origin requests before the response is sent back to the client.

To be able to use experimental CORS filtering, you need to install the Experimental channel Gateway API HTTPRoute yaml.

Here's an example of a simple cross-origin configuration:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: http-route-cors
spec:
  parentRefs:
  - name: http-gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /resource/foo
    filters:
    - type: CORS
      cors:
        allowOrigins:
        - "*"
        allowMethods:
        - GET
        - HEAD
        - POST
        allowHeaders:
        - Accept
        - Accept-Language
        - Content-Language
        - Content-Type
        - Range
    backendRefs:
    - kind: Service
      name: http-route-cors
      port: 80

In this case, the Gateway returns an origin header of "*", which means that the requested resource can be referenced from any origin, a methods header (Access-Control-Allow-Methods) that permits the GET, HEAD, and POST verbs, and a headers header allowing Accept, Accept-Language, Content-Language, Content-Type, and Range.

HTTP/1.1 200 OK
Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: GET, HEAD, POST
Access-Control-Allow-Headers: Accept,Accept-Language,Content-Language,Content-Type,Range

The complete list of fields in the new CORS filter:

See CORS protocol for details.

XListenerSets (standardized mechanism for Listener and Gateway merging)

Lead: Dave Protasowski

GEP-1713: ListenerSets - Standard Mechanism to Merge Multiple Gateways

This release adds a new experimental API kind, XListenerSet, that allows a shared list of listeners to be attached to one or more parent Gateway(s). In addition, it expands upon the existing suggestion that Gateway API implementations may merge configuration from multiple Gateway objects.

To be able to use experimental XListenerSet, you need to install the Experimental channel Gateway API XListenerSet yaml.

The following example shows a Gateway with an HTTP listener and two child HTTPS XListenerSets with unique hostnames and certificates. The combined set of listeners attached to the Gateway includes the two additional HTTPS listeners in the XListenerSets that attach to the Gateway. This example illustrates the delegation of listener TLS config to application owners in different namespaces ("store" and "app"). The HTTPRoute has both the Gateway listener named "foo" and one XListenerSet listener named "second" as parentRefs.

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: prod-external
  namespace: infra
spec:
  gatewayClassName: example
  allowedListeners:
  - from: All
  listeners:
  - name: foo
    hostname: foo.com
    protocol: HTTP
    port: 80
---
apiVersion: gateway.networking.x-k8s.io/v1alpha1
kind: XListenerSet
metadata:
  name: store
  namespace: store
spec:
  parentRef:
    name: prod-external
  listeners:
  - name: first
    hostname: first.foo.com
    protocol: HTTPS
    port: 443
    tls:
      mode: Terminate
      certificateRefs:
      - kind: Secret
        group: ""
        name: first-workload-cert
---
apiVersion: gateway.networking.x-k8s.io/v1alpha1
kind: XListenerSet
metadata:
  name: app
  namespace: app
spec:
  parentRef:
    name: prod-external
  listeners:
  - name: second
    hostname: second.foo.com
    protocol: HTTPS
    port: 443
    tls:
      mode: Terminate
      certificateRefs:
      - kind: Secret
        group: ""
        name: second-workload-cert
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: httproute-example
spec:
  parentRefs:
  - name: app
    kind: XListenerSet
    sectionName: second
  - name: parent-gateway
    kind: Gateway
    sectionName: foo
  ...

Each listener in a Gateway must have a unique combination of port, protocol, and (where the protocol supports it) hostname, so that all listeners are compatible and do not conflict over which traffic they should receive.

Furthermore, implementations can merge separate Gateways into a single set of listener addresses if all listeners across those Gateways are compatible. The management of merged listeners was under-specified in releases prior to v1.3.0.

With the new feature, the specification on merging is expanded. Implementations must treat the parent Gateways as having the merged list of all listeners from itself and from attached XListenerSets, and validation of this list of listeners must behave the same as if the list were part of a single Gateway. Within a single Gateway, listeners are ordered using the following precedence:

  1. Single Listeners (not a part of an XListenerSet) first,
  2. Remaining listeners ordered by:
    • object creation time (oldest first), and if two listeners are defined in objects that have the same timestamp, then
    • alphabetically based on "{namespace}/{name of listener}"

Retry budgets (XBackendTrafficPolicy)

Leads: Eric Bishop, Mike Morris

GEP-3388: Retry Budgets

This feature allows you to configure a retry budget across all endpoints of a destination Service, limiting additional client-side retries after a configured threshold is reached. When configuring the budget, you can specify the maximum percentage of active requests that may consist of retries, as well as the interval over which requests are considered when calculating the retry threshold. The development of this specification changed the existing experimental API kind BackendLBPolicy into a new experimental API kind, XBackendTrafficPolicy, in the interest of reducing the proliferation of policy resources that have commonalities.

To be able to use experimental retry budgets, you need to install the Experimental channel Gateway API XBackendTrafficPolicy yaml.

The following example shows an XBackendTrafficPolicy that applies a retryConstraint that represents a budget that limits the retries to a maximum of 20% of requests, over a duration of 10 seconds, and to a minimum of 3 retries over 1 second.

apiVersion: gateway.networking.x-k8s.io/v1alpha1
kind: XBackendTrafficPolicy
metadata:
  name: traffic-policy-example
spec:
  retryConstraint:
    budget:
      percent: 20
      interval: 10s
    minRetryRate:
      count: 3
      interval: 1s
  ...

Try it out

Unlike other Kubernetes APIs, you don't need to upgrade to the latest version of Kubernetes to get the latest version of Gateway API. As long as you're running Kubernetes 1.26 or later, you'll be able to get up and running with this version of Gateway API.

To try out the API, follow the Getting Started Guide. As of this writing, four implementations are already conformant with Gateway API v1.3 experimental channel features. In alphabetical order:

Get involved

Wondering when a feature will be added? There are lots of opportunities to get involved and help define the future of Kubernetes routing APIs for both ingress and service mesh.

The maintainers would like to thank everyone who's contributed to Gateway API, whether in the form of commits to the repo, discussion, ideas, or general support. We could never have made this kind of progress without the support of this dedicated and active community.

Related Kubernetes blog articles

02 Jun 2025 5:00pm GMT

16 May 2025

feedKubernetes Blog

Kubernetes v1.33: In-Place Pod Resize Graduated to Beta

On behalf of the Kubernetes project, I am excited to announce that the in-place Pod resize feature (also known as In-Place Pod Vertical Scaling), first introduced as alpha in Kubernetes v1.27, has graduated to Beta and will be enabled by default in the Kubernetes v1.33 release! This marks a significant milestone in making resource management for Kubernetes workloads more flexible and less disruptive.

What is in-place Pod resize?

Traditionally, changing the CPU or memory resources allocated to a container required restarting the Pod. While acceptable for many stateless applications, this could be disruptive for stateful services, batch jobs, or any workloads sensitive to restarts.

In-place Pod resizing allows you to change the CPU and memory requests and limits assigned to containers within a running Pod, often without requiring a container restart.

Here's the core idea:

You can try it out on a v1.33 Kubernetes cluster by using kubectl to edit a Pod (requires kubectl v1.32+):

kubectl edit pod <pod-name> --subresource resize
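
Alternatively, a resize can be applied with kubectl patch against the same resize subresource. This is only a sketch; the container name and resource values are illustrative:

kubectl patch pod <pod-name> --subresource resize --patch \
  '{"spec":{"containers":[{"name":"app","resources":{"requests":{"cpu":"800m"},"limits":{"cpu":"800m"}}}]}}'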

For detailed usage instructions and examples, please refer to the official Kubernetes documentation: Resize CPU and Memory Resources assigned to Containers.

Why does in-place Pod resize matter?

Kubernetes still excels at scaling workloads horizontally (adding or removing replicas), but in-place Pod resizing unlocks several key benefits for vertical scaling:

What's changed between Alpha and Beta?

Since the alpha release in v1.27, significant work has gone into maturing the feature, improving its stability, and refining the user experience based on feedback and further development. Here are the key changes:

Notable user-facing changes

Stability and reliability enhancements

What's next?

Graduating to Beta means the feature is ready for broader adoption, but development doesn't stop here! Here's what the community is focusing on next:

Getting started and providing feedback

With the InPlacePodVerticalScaling feature gate enabled by default in v1.33, you can start experimenting with in-place Pod resizing right away!

Refer to the documentation for detailed guides and examples.

As this feature moves through Beta, your feedback is invaluable. Please report any issues or share your experiences via the standard Kubernetes communication channels (GitHub issues, mailing lists, Slack). You can also review the KEP-1287: In-place Update of Pod Resources for the full in-depth design details.

We look forward to seeing how the community leverages in-place Pod resize to build more efficient and resilient applications on Kubernetes!

16 May 2025 6:30pm GMT

Announcing etcd v3.6.0

This announcement originally appeared on the etcd blog.

Today, we are releasing etcd v3.6.0, the first minor release since etcd v3.5.0 on June 15, 2021. This release introduces several new features, makes significant progress on long-standing efforts like downgrade support and migration to v3store, and addresses numerous critical & major issues. It also includes major optimizations in memory usage, improving efficiency and performance.

In addition to the features of v3.6.0, etcd has joined Kubernetes as a SIG (sig-etcd), enabling us to improve project sustainability. We've introduced systematic robustness testing to ensure correctness and reliability. Through the etcd-operator Working Group, we plan to improve usability as well.

What follows are the most significant changes introduced in etcd v3.6.0, along with the discussion of the roadmap for future development. For a detailed list of changes, please refer to the CHANGELOG-3.6.

A heartfelt thank you to all the contributors who made this release possible!

Security

etcd takes security seriously. To enhance software security in v3.6.0, we have improved our workflow checks by integrating govulncheck to scan the source code and trivy to scan container images. These improvements have also been backported to supported stable releases.

etcd continues to follow the Security Release Process to ensure vulnerabilities are properly managed and addressed.

Features

Migration to v3store

The v2store has been deprecated since etcd v3.4 but could still be enabled via --enable-v2. It remained the source of truth for membership data. In etcd v3.6.0, v2store can no longer be enabled as the --enable-v2 flag has been removed, and v3store has become the sole source of truth for membership data.

While v2store still exists in v3.6.0, etcd will fail to start if it contains any data other than membership information. To assist with migration, etcd v3.5.18+ provides the etcdutl check v2store command, which verifies that v2store contains only membership data (see PR 19113).

Compared to v2store, v3store offers better performance and transactional support. It is also the actively maintained storage engine moving forward.

The removal of v2store is still ongoing and is tracked in issues/12913.

Downgrade

etcd v3.6.0 is the first version to fully support downgrade. The effort for this downgrade task spans both versions 3.5 and 3.6, and all related work is tracked in issues/11716.

At a high level, the process involves migrating the data schema to the target version (e.g., v3.5), followed by a rolling downgrade.

Ensure the cluster is healthy, and take a snapshot backup. Validate whether the downgrade is valid:

$ etcdctl downgrade validate 3.5
Downgrade validate success, cluster version 3.6

If the downgrade is valid, enable downgrade mode:

$ etcdctl downgrade enable 3.5
Downgrade enable success, cluster version 3.6

etcd will then migrate the data schema in the background. Once complete, proceed with the rolling downgrade.

For details, refer to the Downgrade-3.6 guide.

Feature gates

In etcd v3.6.0, we introduced Kubernetes-style feature gates for managing new features. Previously, we indicated unstable features through the --experimental prefix in feature flag names. The prefix was removed once the feature was stable, causing a breaking change. Now, features will start in Alpha, progress to Beta, then GA, or get deprecated. This ensures a much smoother upgrade and downgrade experience for users.

See feature-gates for details.
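
As a rough sketch, a feature gate can be toggled on the etcd command line much like in Kubernetes components; the gate name below is purely illustrative, so consult the feature-gates documentation for the real list:

$ etcd --feature-gates=ExampleFeature=true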

livez / readyz checks

etcd now supports /livez and /readyz endpoints, aligning with Kubernetes' Liveness and Readiness probes. /livez indicates whether the etcd instance is alive, while /readyz indicates when it is ready to serve requests. This feature has also been backported to release-3.5 (starting from v3.5.11) and release-3.4 (starting from v3.4.29). See livez/readyz for details.

The existing /health endpoint remains functional. /livez is similar to /health?serializable=true, while /readyz is similar to /health or /health?serializable=false. The /livez and /readyz endpoints provide clearer semantics and are easier to understand.
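
For a quick check against a local member (assuming the default client address of http://127.0.0.1:2379), you can query the endpoints directly:

$ curl http://127.0.0.1:2379/livez
$ curl http://127.0.0.1:2379/readyz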

v3discovery

In etcd v3.6.0, the new discovery protocol v3discovery was introduced, based on clientv3. It facilitates the discovery of all cluster members during the bootstrap phase.

The previous v2discovery protocol, based on clientv2, has been deprecated. Additionally, the public discovery service at https://discovery.etcd.io/, which relied on v2discovery, is no longer maintained.

Performance

Memory

In this release, we reduced average memory consumption by at least 50% (see Figure 1). This improvement is primarily due to two changes:

Diagram of memory usage

Figure 1: Memory usage comparison between etcd v3.5.20 and v3.6.0-rc.2 under different read/write ratios. Each subplot shows the memory usage over time with a specific read/write ratio. The red line represents etcd v3.5.20, while the teal line represents v3.6.0-rc.2. Across all tested ratios, v3.6.0-rc.2 exhibits lower and more stable memory usage.

Throughput

Compared to v3.5, etcd v3.6 delivers an average performance improvement of approximately 10% in both read and write throughput (see Figure 2, 3, 4 and 5). This improvement is not attributed to any single major change, but rather the cumulative effect of multiple minor enhancements. One such example is the optimization of the free page queries introduced in PR/419.

etcd read transaction performance with a high write ratio

Figure 2: Read throughput comparison between etcd v3.5.20 and v3.6.0-rc.2 under a high write ratio. The read/write ratio is 0.0078, meaning 1 read per 128 writes. The right bar shows the percentage improvement in read throughput of v3.6.0-rc.2 over v3.5.20, ranging from 3.21% to 25.59%.

etcd read transaction performance with a high read ratio

Figure 3: Read throughput comparison between etcd v3.5.20 and v3.6.0-rc.2 under a high read ratio. The read/write ratio is 8, meaning 8 reads per write. The right bar shows the percentage improvement in read throughput of v3.6.0-rc.2 over v3.5.20, ranging from 4.38% to 27.20%.

etcd write transaction performance with a high write ratio

Figure 4: Write throughput comparison between etcd v3.5.20 and v3.6.0-rc.2 under a high write ratio. The read/write ratio is 0.0078, meaning 1 read per 128 writes. The right bar shows the percentage improvement in write throughput of v3.6.0-rc.2 over v3.5.20, ranging from 2.95% to 24.24%.

etcd write transaction performance with a high read ratio

Figure 5: Write throughput comparison between etcd v3.5.20 and v3.6.0-rc.2 under a high read ratio. The read/write ratio is 8, meaning 8 reads per write. The right bar shows the percentage improvement in write throughput of v3.6.0-rc.2 over v3.5.20, ranging from 3.86% to 28.37%.

Breaking changes

This section highlights a few notable breaking changes. For a complete list, please refer to the Upgrade etcd from v3.5 to v3.6 and the CHANGELOG-3.6.

Old binaries are incompatible with new schema versions

Old etcd binaries are not compatible with newer data schema versions. For example, etcd 3.5 cannot start with data created by etcd 3.6, and etcd 3.4 cannot start with data created by either 3.5 or 3.6.

When downgrading etcd, it's important to follow the documented downgrade procedure. Simply replacing the binary or image will run into this incompatibility.

Peer endpoints no longer serve client requests

Client endpoints (--advertise-client-urls) are intended to serve client requests only, while peer endpoints (--initial-advertise-peer-urls) are intended solely for peer communication. However, due to an implementation oversight, the peer endpoints were also able to handle client requests in etcd 3.4 and 3.5. This behavior was misleading and encouraged incorrect usage patterns. In etcd 3.6, this misleading behavior was corrected via PR/13565; peer endpoints no longer serve client requests.

Clear boundary between etcdctl and etcdutl

Both etcdctl and etcdutl are command line tools. etcdutl is an offline utility designed to operate directly on etcd data files, while etcdctl is an online tool that interacts with etcd over a network. Previously, there were some overlapping functionalities between the two, but these overlaps were removed in 3.6.0.

Critical bug fixes

Correctness has always been a top priority for the etcd project. In the process of developing 3.6.0, we found and fixed a few notable bugs that could lead to data inconsistency in specific cases. These fixes have been backported to previous releases, but we believe they deserve special mention here.

Previously, when etcd was applying data, it would update the consistent-index first, followed by committing the data. However, these operations were not atomic. If etcd crashed in between, it could lead to data inconsistency (see issue/13766). The issue was introduced in v3.5.0, and fixed in v3.5.3 with PR/13854.

When a client writes data and receives a success response, the data is expected to be persisted. However, the data might be lost if etcd crashes immediately after sending the success response to the client. This was a legacy issue (see issue/14370) affecting all previous releases. It was addressed in v3.4.21 and v3.5.5 with PR/14400, and fixed on the raft side in the main branch (now release-3.6) with PR/14413.

If etcd crashed during a defragmentation operation, it might, upon restart, reapply some entries that had already been applied, leading to a revision inconsistency issue (see the discussions in PR/14685). The issue was introduced in v3.5.0, and fixed in v3.5.6 with PR/14730.

Upgrade issue

This section highlights a common issue (issues/19557) in the etcd v3.5 to v3.6 upgrade that may cause the upgrade process to fail. For a complete upgrade guide, refer to Upgrade etcd from v3.5 to v3.6.

The issue was introduced in etcd v3.5.1, and resolved in v3.5.20.

Key takeaway: users are required to first upgrade to etcd v3.5.20 (or a higher patch version) before upgrading to etcd v3.6.0; otherwise, the upgrade may fail.

For more background and technical context, see upgrade_from_3.5_to_3.6_issue.

Testing

We introduced robustness testing to verify correctness, which has always been our top priority. It plays traffic of various types and volumes against an etcd cluster, concurrently injects a random failpoint, records all operations (including both requests and responses), and finally performs a linearizability check. It also verifies that the Watch API guarantees have not been violated. The robustness tests increase our confidence in the quality of each etcd release.

We have migrated most of the etcd workflow tests to Kubernetes' Prow testing infrastructure to take advantage of its benefits, such as nice dashboards for viewing test results and the ability for contributors to rerun failed tests themselves.

Platforms

While retaining all existing supported platforms, we have promoted Linux/ARM64 to Tier 1 support. For more details, please refer to issues/15951. For the complete list of supported platforms, see supported-platform.

Dependencies

Dependency bumping guide

We have published an official guide on how to bump dependencies for etcd's main branch and stable releases. It also covers how to update the Go version. For more details, please refer to dependency_management. With this guide available, any contributor can now help with dependency upgrades.

Core Dependency Updates

bbolt and raft are two core dependencies of etcd.

Both etcd v3.4 and v3.5 depend on bbolt v1.3, while etcd v3.6 depends on bbolt v1.4.

For the release-3.4 and release-3.5 branches, raft is included in the etcd repository itself, so etcd v3.4 and v3.5 do not depend on an external raft module. Starting from etcd v3.6, raft was moved to a separate repository (raft), and the first standalone raft release is v3.6.0. As a result, etcd v3.6.0 depends on raft v3.6.0.

Please see the table below for a summary:

etcd versions | bbolt versions | raft versions
3.4.x | v1.3.x | N/A
3.5.x | v1.3.x | N/A
3.6.x | v1.4.x | v3.6.x

grpc-gateway@v2

We upgraded grpc-gateway from v1 to v2 via PR/16595 in etcd v3.6.0. This is a major step toward migrating to protobuf-go, the second major version of the Go protocol buffer API implementation.

grpc-gateway@v2 is designed to work with protobuf-go. However, etcd v3.6 still depends on the deprecated gogo/protobuf, which is a protocol buffer v1 implementation. To resolve this incompatibility, we applied a patch to the generated *.pb.gw.go files to convert v1 messages to v2 messages.

grpc-ecosystem/go-grpc-middleware/providers/prometheus

We switched from the deprecated (and archived) grpc-ecosystem/go-grpc-prometheus to grpc-ecosystem/go-grpc-middleware/providers/prometheus via PR/19195. This change ensures continued support and access to the latest features and improvements in the gRPC Prometheus integration.

Community

There are exciting developments in the etcd community that reflect our ongoing commitment to strengthening collaboration, improving maintainability, and evolving the project's governance.

etcd Becomes a Kubernetes SIG

etcd has officially become a Kubernetes Special Interest Group: SIG-etcd. This change reflects etcd's critical role as the primary datastore for Kubernetes and establishes a more structured and transparent home for long-term stewardship and cross-project collaboration. The new SIG designation will help streamline decision-making, align roadmaps with Kubernetes needs, and attract broader community involvement.

New contributors, maintainers, and reviewers

We've seen increasing engagement from contributors, which has resulted in the addition of three new maintainers:

Their continued contributions have been instrumental in driving the project forward.

We also welcome two new reviewers to the project:

We appreciate their dedication to code quality and their willingness to take on broader review responsibilities within the community.

New release team

We've formed a new release team led by ivanvc and jmhbnz, streamlining the release process by automating many previously manual steps. Inspired by Kubernetes SIG Release, we've adopted several best practices, including clearly defined release team roles and the introduction of release shadows to support knowledge sharing and team sustainability. These changes have made our releases smoother and more reliable, allowing us to approach each release with greater confidence and consistency.

Introducing the etcd Operator Working Group

To further advance etcd's operational excellence, we have formed a new working group: WG-etcd-operator. The working group is dedicated to enabling the automatic and efficient operation of etcd clusters that run in the Kubernetes environment using an etcd-operator.

Future Development

The legacy v2store has been deprecated since etcd v3.4, and the flag --enable-v2 was removed entirely in v3.6. This means that starting from v3.6, there is no longer a way to enable or use the v2store. However, etcd still bootstraps internally from the legacy v2 snapshots. To address this inconsistency, we plan to change etcd to bootstrap from the v3store and replay the WAL entries based on the consistent-index. The work is being tracked in issues/12913.

One of the most persistent challenges remains large range queries from the kube-apiserver, which can lead to process crashes due to their unpredictable nature. The range stream feature, originally outlined in the v3.5 release blog/Future roadmaps, remains an idea worth revisiting to address the challenges of large range queries.

For more details and upcoming plans, please refer to the etcd roadmap.

16 May 2025 12:00am GMT

15 May 2025

feedKubernetes Blog

Kubernetes 1.33: Job's SuccessPolicy Goes GA

On behalf of the Kubernetes project, I'm pleased to announce that Job success policy has graduated to General Availability (GA) as part of the v1.33 release.

About Job's Success Policy

In batch workloads, you might want to use leader-follower patterns like MPI, in which the leader controls the execution, including the followers' lifecycle.

In this case, you might want to mark the Job as succeeded even if some of the indexes failed. Unfortunately, a leader-follower Kubernetes Job that didn't use a success policy would, in most cases, require all Pods to finish successfully for that Job to reach an overall succeeded state.

For Kubernetes Jobs, the API allows you to specify the early exit criteria using the .spec.successPolicy field (you can only use the .spec.successPolicy field for an indexed Job). This field describes a set of rules, either using a list of succeeded indexes for a job, or defining a minimal required count of succeeded indexes.

This newly stable field is especially valuable for scientific simulation, AI/ML and High-Performance Computing (HPC) batch workloads. Users in these areas often run numerous experiments and may only need a specific number to complete successfully, rather than requiring all of them to succeed. In the leader-follower case, the leader index failure is the only relevant Job exit criterion, and the outcomes for individual follower Pods are handled only indirectly via the status of the leader index. Moreover, followers do not know when they can terminate themselves.

After a Job meets any success policy rule, the Job is marked as succeeded, and all Pods are terminated, including the ones still running.

How it works

The following excerpt from a Job manifest, using .successPolicy.rules[0].succeededCount, shows an example of using a custom success policy:

parallelism: 10
completions: 10
completionMode: Indexed
successPolicy:
  rules:
  - succeededCount: 1

Here, the Job is marked as succeeded when one index succeeds, regardless of which one. Additionally, you can constrain which indexes count toward succeededCount by also specifying .successPolicy.rules[0].succeededIndexes, as shown below:

parallelism: 10
completions: 10
completionMode: Indexed
successPolicy:
  rules:
  - succeededIndexes: "0" # index of the leader Pod
    succeededCount: 1

This example shows that the Job will be marked as succeeded once a Pod with a specific index (Pod index 0) has succeeded.

Once the Job either reaches one of the successPolicy rules, or achieves its Complete criteria based on .spec.completions, the Job controller within kube-controller-manager adds the SuccessCriteriaMet condition to the Job status. After that, the job controller initiates cleanup and termination of Pods for Jobs with the SuccessCriteriaMet condition. Eventually, the Job obtains the Complete condition once the job controller has finished cleanup and termination.
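
To observe this in practice, you can inspect the Job's conditions; the Job name here is illustrative:

kubectl get job my-job -o jsonpath='{.status.conditions}'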

Learn more

Get involved

This work was led by the Kubernetes batch working group in close collaboration with the SIG Apps community.

If you are interested in working on new features in the space, I recommend subscribing to our Slack channel and attending the regular community meetings.

15 May 2025 6:30pm GMT

14 May 2025

feedKubernetes Blog

Kubernetes v1.33: Updates to Container Lifecycle

Kubernetes v1.33 introduces a few updates to the lifecycle of containers. The Sleep action for container lifecycle hooks now supports a zero sleep duration (feature enabled by default). There is also alpha support for customizing the stop signal sent to containers when they are being terminated.

This blog post goes into the details of these new aspects of the container lifecycle, and how you can use them.

Zero value for Sleep action

Kubernetes v1.29 introduced the Sleep action for container PreStop and PostStart Lifecycle hooks. The Sleep action lets your containers pause for a specified duration after the container is started or before it is terminated. This was needed to provide a straightforward way to manage graceful shutdowns. Before the Sleep action, folks used to run the sleep command using the exec action in their container lifecycle hooks. If you wanted to do this you'd need to have the binary for the sleep command in your container image. This is difficult if you're using third party images.

When the Sleep action was initially added, it didn't support a sleep duration of zero seconds. The time.Sleep function, which the Sleep action uses under the hood, supports a duration of zero seconds: using a negative or zero value returns immediately, resulting in a no-op. We wanted the same behaviour for the Sleep action. Support for a zero duration was later added in v1.32, behind the PodLifecycleSleepActionAllowZero feature gate.

The PodLifecycleSleepActionAllowZero feature gate has graduated to beta in v1.33, and is now enabled by default. The original Sleep action for preStop and postStart hooks has been enabled by default since Kubernetes v1.30. With a cluster running Kubernetes v1.33, you can set a zero duration for sleep lifecycle hooks. For a cluster with default configuration, you don't need to enable any feature gate to make that possible.
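
For illustration, here is a minimal Pod snippet (the names and image are illustrative) that sets a zero-second Sleep action as a preStop hook, which effectively makes the hook a no-op:

apiVersion: v1
kind: Pod
metadata:
  name: sleep-zero-demo
spec:
  containers:
  - name: app
    image: nginx:latest
    lifecycle:
      preStop:
        sleep:
          seconds: 0 # returns immediately; effectively a no-op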

Container stop signals

Container runtimes such as containerd and CRI-O honor a StopSignal instruction in the container image definition. This can be used to specify a custom stop signal that the runtime will use to terminate containers based on that image. Stop signal configuration was not originally part of the Pod API in Kubernetes. Until Kubernetes v1.33, the only way to override the stop signal for containers was by rebuilding your container image with the new custom stop signal (for example, specifying STOPSIGNAL in a Containerfile or Dockerfile).

The ContainerStopSignals feature gate which is newly added in Kubernetes v1.33 adds stop signals to the Kubernetes API. This allows users to specify a custom stop signal in the container spec. Stop signals are added to the API as a new lifecycle along with the existing PreStop and PostStart lifecycle handlers. In order to use this feature, we expect the Pod to have the operating system specified with spec.os.name. This is enforced so that we can cross-validate the stop signal against the operating system and make sure that the containers in the Pod are created with a valid stop signal for the operating system the Pod is being scheduled to. For Pods scheduled on Windows nodes, only SIGTERM and SIGKILL are allowed as valid stop signals. Find the full list of signals supported in Linux nodes here.

Default behaviour

If a container has a custom stop signal defined in its lifecycle, the container runtime would use the signal defined in the lifecycle to kill the container, given that the container runtime also supports custom stop signals. If there is no custom stop signal defined in the container lifecycle, the runtime would fall back to the stop signal defined in the container image. If there is no stop signal defined in the container image, the default stop signal of the runtime would be used. The default signal is SIGTERM for both containerd and CRI-O.

Version skew

For the feature to work as intended, both the version of Kubernetes and the container runtime should support container stop signals. The changes to the Kubernetes API and kubelet are available in alpha stage from v1.33, and can be enabled with the ContainerStopSignals feature gate. The container runtime implementations for containerd and CRI-O are still a work in progress and will be rolled out soon.

Using container stop signals

To enable this feature, you need to turn on the ContainerStopSignals feature gate in both the kube-apiserver and the kubelet. Once you have nodes where the feature gate is turned on, you can create Pods with a StopSignal lifecycle and a valid OS name like so:

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  os:
    name: linux
  containers:
  - name: nginx
    image: nginx:latest
    lifecycle:
      stopSignal: SIGUSR1

Do note that the SIGUSR1 signal in this example can only be used if the container's Pod is scheduled to a Linux node, hence we need to specify spec.os.name as linux to be able to use the signal. You will only be able to configure SIGTERM and SIGKILL signals if the Pod is being scheduled to a Windows node. You also cannot specify containers[*].lifecycle.stopSignal if the spec.os.name field is unset.

How do I get involved?

This feature is driven by the SIG Node. If you are interested in helping develop this feature, sharing feedback, or participating in any other ongoing SIG Node projects, please reach out to us!

You can reach SIG Node by several means:

You can also contact me directly:

14 May 2025 6:30pm GMT

13 May 2025

feedKubernetes Blog

Kubernetes v1.33: Job's Backoff Limit Per Index Goes GA

In Kubernetes v1.33, the Backoff Limit Per Index feature reaches general availability (GA). This blog describes the Backoff Limit Per Index feature and its benefits.

About backoff limit per index

When you run workloads on Kubernetes, you must consider scenarios where Pod failures can affect the completion of your workloads. Ideally, your workload should tolerate transient failures and continue running.

To achieve failure tolerance in a Kubernetes Job, you can set the spec.backoffLimit field. This field specifies the total number of tolerated failures.

However, for workloads where every index is considered independent, like embarrassingly parallel workloads, the spec.backoffLimit field is often not flexible enough. For example, you may choose to run multiple suites of integration tests by representing each suite as an index within an Indexed Job. In that setup, a fast-failing index (test suite) is likely to consume your entire budget for tolerating Pod failures, and you might not be able to run the other indexes.

In order to address this limitation, Kubernetes introduced backoff limit per index, which allows you to control the number of retries per index.

How backoff limit per index works

To use Backoff Limit Per Index for Indexed Jobs, specify the number of tolerated Pod failures per index with the spec.backoffLimitPerIndex field. When you set this field, the Job executes all indexes by default.

Additionally, to fine-tune the error handling, you can combine backoff limit per index with the spec.maxFailedIndexes field, which caps how many indexes may fail before the entire Job is marked as failed, and with a Pod failure policy, which controls how individual Pod failures are counted (see the example below).

When the number of tolerated failures is exceeded, the Job marks that index as failed and lists it in the Job's status.failedIndexes field.

Example

The following Job spec snippet is an example of how to combine backoff limit per index with the Pod Failure Policy feature:

completions: 10
parallelism: 10
completionMode: Indexed
backoffLimitPerIndex: 1
maxFailedIndexes: 5
podFailurePolicy:
  rules:
  - action: Ignore
    onPodConditions:
    - type: DisruptionTarget
  - action: FailIndex
    onExitCodes:
      operator: In
      values: [ 42 ]

In this example, the Job handles Pod failures as follows: failures caused by Pod disruptions (the DisruptionTarget condition) are ignored and do not count against the per-index backoff limit; a container exiting with exit code 42 fails the index immediately; and any other failure counts against that index's backoff limit of 1.

Learn more

Get involved

This work was sponsored by the Kubernetes batch working group in close collaboration with the SIG Apps community.

If you are interested in working on new features in the space, we recommend subscribing to our Slack channel and attending the regular community meetings.

13 May 2025 6:30pm GMT

12 May 2025

feedKubernetes Blog

Kubernetes v1.33: Image Pull Policy the way you always thought it worked!

Some things in Kubernetes are surprising, and the way imagePullPolicy behaves might be one of them. Given Kubernetes is all about running pods, it may be peculiar to learn that there has been a caveat to restricting pod access to authenticated images for over 10 years in the form of issue 18787! It is an exciting release when you can resolve a ten-year-old issue.

IfNotPresent, even if I'm not supposed to have it

The gist of the problem is that the imagePullPolicy: IfNotPresent strategy has done precisely what it says, and nothing more. Let's set up a scenario. To begin, Pod A in Namespace X is scheduled to Node 1 and requires image Foo from a private repository. For its image pull authentication material, the pod references Secret 1 in its imagePullSecrets. Secret 1 contains the necessary credentials to pull from the private repository. The Kubelet will utilize the credentials from Secret 1 as supplied by Pod A and it will pull container image Foo from the registry. This is the intended (and secure) behavior.

But now things get curious. If Pod B in Namespace Y happens to also be scheduled to Node 1, unexpected (and potentially insecure) things happen. Pod B may reference the same private image, specifying the IfNotPresent image pull policy. Pod B does not reference Secret 1 (or in our case, any secret) in its imagePullSecrets. When the Kubelet tries to run the pod, it honors the IfNotPresent policy. The Kubelet sees that the image Foo is already present locally, and will provide image Foo to Pod B. Pod B gets to run the image even though it did not provide credentials authorizing it to pull the image in the first place.
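
To make the scenario concrete, here is a hedged sketch of the two Pods (the names, namespaces, and image reference are illustrative): Pod A supplies a pull secret, Pod B does not, yet both reference the same private image with IfNotPresent.

apiVersion: v1
kind: Pod
metadata:
  name: pod-a
  namespace: namespace-x
spec:
  imagePullSecrets:
  - name: secret-1 # credentials authorizing the pull of image Foo
  containers:
  - name: app
    image: registry.example.com/private/foo:1.0
    imagePullPolicy: IfNotPresent
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-b
  namespace: namespace-y
spec:
  # no imagePullSecrets: before v1.33 this Pod could still run the cached image Foo
  containers:
  - name: app
    image: registry.example.com/private/foo:1.0
    imagePullPolicy: IfNotPresent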

Illustration of the process of two pods trying to access a private image, the first one with a pull secret, the second one without it

Using a private image pulled by a different pod

While IfNotPresent should not pull image Foo if it is already present on the node, it is an incorrect security posture to allow all pods scheduled to a node to have access to a previously pulled private image. These pods were never authorized to pull the image in the first place.

IfNotPresent, but only if I am supposed to have it

In Kubernetes v1.33, we - SIG Auth and SIG Node - have finally started to address this (really old) problem and to get the verification right! The basic expected behavior is not changed. If an image is not present, the Kubelet will attempt to pull the image. The credentials each pod supplies will be utilized for this task. This matches behavior prior to 1.33.

If the image is present, then the behavior of the Kubelet changes. The Kubelet will now verify the pod's credentials before allowing the pod to use the image.

Performance and service stability have been a consideration while revising the feature. Pods utilizing the same credential will not be required to re-authenticate. This is also true when pods source credentials from the same Kubernetes Secret object, even when the credentials are rotated.

Never pull, but use if authorized

The imagePullPolicy: Never option does not fetch images. However, if the container image is already present on the node, any pod attempting to use the private image will be required to provide credentials, and those credentials require verification.

Pods utilizing the same credential will not be required to re-authenticate. Pods that do not supply credentials previously used to successfully pull an image will not be allowed to use the private image.

Always pull, if authorized

The imagePullPolicy: Always has always worked as intended. Each time an image is requested, the request goes to the registry and the registry will perform an authentication check.

In the past, forcing the Always image pull policy via pod admission was the only way to ensure that your private container images didn't get reused by other pods on nodes which already pulled the images.

Fortunately, this was somewhat performant. Only the image manifest was pulled, not the image. However, there was still a cost and a risk. During a new rollout, scale up, or pod restart, the image registry that provided the image MUST be available for the auth check, putting the image registry in the critical path for stability of services running inside of the cluster.

How it all works

The feature is based on persistent, file-based caches that are present on each of the nodes. The following is a simplified description of how the feature works. For the complete version, please see KEP-2535.

The process of requesting an image for the first time goes like this:

  1. A pod requesting an image from a private registry is scheduled to a node.
  2. The image is not present on the node.
  3. The Kubelet makes a record of the intention to pull the image.
  4. The Kubelet extracts credentials from the Kubernetes Secret referenced by the pod as an image pull secret, and uses them to pull the image from the private registry.
  5. After the image has been successfully pulled, the Kubelet makes a record of the successful pull. This record includes details about credentials used (in the form of a hash) as well as the Secret from which they originated.
  6. The Kubelet removes the original record of intent.
  7. The Kubelet retains the record of successful pull for later use.

When future pods scheduled to the same node request the previously pulled private image:

  1. The Kubelet checks the credentials that the new pod provides for the pull.
  2. If the hash of these credentials, or the source Secret of the credentials match the hash or source Secret which were recorded for a previous successful pull, the pod is allowed to use the previously pulled image.
  3. If the credentials or their source Secret are not found in the records of successful pulls for that image, the Kubelet will attempt to use these new credentials to request a pull from the remote registry, triggering the authorization flow.

Try it out

In Kubernetes v1.33 we shipped the alpha version of this feature. To give it a spin, enable the KubeletEnsureSecretPulledImages feature gate for your 1.33 Kubelets.
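
A minimal sketch of enabling the gate via the kubelet configuration file (other fields omitted):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  KubeletEnsureSecretPulledImages: true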

You can learn more about the feature and additional optional configuration on the concept page for Images in the official Kubernetes documentation.

What's next?

In future releases we are going to:

  1. Make this feature work together with Projected service account tokens for Kubelet image credential providers, which adds a new, workload-specific source of image pull credentials.
  2. Write a benchmarking suite to measure the performance of this feature and assess the impact of any future changes.
  3. Implement an in-memory caching layer so that we don't need to read files for each image pull request.
  4. Add support for credential expirations, thus forcing previously validated credentials to be re-authenticated.

How to get involved

Reading KEP-2535 is a great way to understand these changes in depth.

If you are interested in further involvement, reach out to us on the #sig-auth-authenticators-dev channel on Kubernetes Slack (for an invitation, visit https://slack.k8s.io/). You are also welcome to join the bi-weekly SIG Auth meetings, held every other Wednesday.

12 May 2025 6:30pm GMT

09 May 2025

feedKubernetes Blog

Kubernetes v1.33: Streaming List responses

Managing Kubernetes cluster stability becomes increasingly critical as your infrastructure grows. One of the most challenging aspects of operating large-scale clusters has been handling List requests that fetch substantial datasets - a common operation that could unexpectedly impact your cluster's stability.

Today, the Kubernetes community is excited to announce a significant architectural improvement: streaming encoding for List responses.

The problem: unnecessary memory consumption with large resources

Current API response encoders just serialize an entire response into a single contiguous block of memory and perform one ResponseWriter.Write call to transmit the data to the client. Despite HTTP/2's capability to split responses into smaller frames for transmission, the underlying HTTP server continues to hold the complete response data as a single buffer. Even as individual frames are transmitted to the client, the memory associated with these frames cannot be freed incrementally.

When cluster size grows, the single response body can be substantial - hundreds of megabytes in size. At large scale, the current approach becomes particularly inefficient, as it prevents incremental memory release during transmission. Imagine that network congestion occurs: that large response body's memory block stays active for tens of seconds or even minutes. This limitation leads to unnecessarily high and prolonged memory consumption in the kube-apiserver process. If multiple large List requests occur simultaneously, the cumulative memory consumption can escalate rapidly, potentially leading to an Out-of-Memory (OOM) situation that compromises cluster stability.

The encoding/json package uses sync.Pool to reuse memory buffers during serialization. While efficient for consistent workloads, this mechanism creates challenges with sporadic large List responses. When processing these large responses, memory pools expand significantly. But due to sync.Pool's design, these oversized buffers remain reserved after use. Subsequent small List requests continue utilizing these large memory allocations, preventing garbage collection and maintaining persistently high memory consumption in the kube-apiserver even after the initial large responses complete.

Additionally, Protocol Buffers are not designed to handle large datasets, but they are great for handling individual messages within a large data set. This highlights the need for streaming-based approaches that can process and transmit large collections incrementally rather than as monolithic blocks.

As a general rule of thumb, if you are dealing in messages larger than a megabyte each, it may be time to consider an alternate strategy.

From https://protobuf.dev/programming-guides/techniques/

Streaming encoder for List responses

The streaming encoding mechanism is specifically designed for List responses, leveraging their common well-defined collection structures. The core idea focuses exclusively on the Items field within collection structures, which represents the bulk of memory consumption in large responses. Rather than encoding the entire Items array as one contiguous memory block, the new streaming encoder processes and transmits each item individually, allowing memory to be freed progressively as frame or chunk is transmitted. As a result, encoding items one by one significantly reduces the memory footprint required by the API server.

With Kubernetes objects typically limited to 1.5 MiB (from etcd), streaming encoding keeps memory consumption predictable and manageable regardless of how many objects are in a List response. The result is significantly improved API server stability, reduced memory spikes, and better overall cluster performance - especially in environments where multiple large List operations might occur simultaneously.

To ensure perfect backward compatibility, the streaming encoder validates Go struct tags rigorously before activation, guaranteeing byte-for-byte consistency with the original encoder. Standard encoding mechanisms process all fields except Items, maintaining identical output formatting throughout. This approach seamlessly supports all Kubernetes List types - from built-in *List objects to Custom Resource UnstructuredList objects - requiring zero client-side modifications or awareness that the underlying encoding method has changed.

Performance gains you'll notice

Benchmark results

To validate the results, Kubernetes introduced a new list benchmark which concurrently executes 10 List requests, each returning 1 GB of data.

The benchmark showed a 20x improvement, reducing memory usage from 70-80 GB to 3 GB.

Screenshot of a K8s performance dashboard showing memory usage for benchmark list going down from 60GB to 3GB

List benchmark memory usage

09 May 2025 6:30pm GMT

08 May 2025

feedKubernetes Blog

Kubernetes 1.33: Volume Populators Graduate to GA

Kubernetes volume populators are now generally available (GA)! The AnyVolumeDataSource feature gate is treated as always enabled for Kubernetes v1.33, which means that users can specify any appropriate custom resource as the data source of a PersistentVolumeClaim (PVC).

An example of how to use dataSourceRef in PVC:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc1
spec:
  ...
  dataSourceRef:
    apiGroup: provider.example.com
    kind: Provider
    name: provider1
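
For the data source above to be recognized, the custom resource kind is typically registered with a VolumePopulator object. This sketch assumes the VolumePopulator CRD from the volume-data-source-validator project, and the group and kind simply mirror the illustrative example above:

apiVersion: populator.storage.k8s.io/v1beta1
kind: VolumePopulator
metadata:
  name: provider-populator # illustrative name
sourceKind:
  group: provider.example.com
  kind: Provider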

What is new

There are four major enhancements from beta.

Populator Pod is optional

During the beta phase, contributors to Kubernetes identified potential resource leaks with PersistentVolumeClaim (PVC) deletion while volume population was in progress; these leaks happened due to limitations in finalizer handling. Ahead of the graduation to general availability, the Kubernetes project added support to delete temporary resources (PVC prime, etc.) if the original PVC is deleted.

To accommodate this, we've introduced three new plugin-based functions:

A provider example is added in lib-volume-populator/example.

Mutator functions to modify the Kubernetes resources

For GA, the CSI volume populator controller code gained a MutatorConfig, allowing the specification of mutator functions to modify Kubernetes resources. For example, if the PVC prime is not an exact copy of the PVC and you need provider-specific information for the driver, you can include this information in the optional MutatorConfig. This allows you to customize the Kubernetes objects in the volume populator.

Flexible metric handling for providers

Our beta phase highlighted a new requirement: the need to aggregate metrics not just from lib-volume-populator, but also from other components within the provider's codebase.

To address this, SIG Storage introduced a provider metric manager. This enhancement delegates the implementation of metrics logic to the provider itself, rather than relying solely on lib-volume-populator. This shift provides greater flexibility and control over metrics collection and aggregation, enabling a more comprehensive view of provider performance.

Clean up for temporary resources

During the beta phase, we identified potential resource leaks with PersistentVolumeClaim (PVC) deletion while volume population was in progress, due to limitations in finalizer handling. We have improved the populator to support the deletion of temporary resources (PVC prime, etc.) if the original PVC is deleted in this GA release.

How to use it

To try it out, please follow the steps in the previous beta blog.

Future directions and potential feature requests

As next steps, there are several potential feature requests for the volume populator:

To ensure we're building something truly valuable, Kubernetes SIG Storage would love to hear about any specific use cases you have in mind for this feature. For any inquiries or specific questions related to volume populator, please reach out to the SIG Storage community.

08 May 2025 6:30pm GMT

07 May 2025

feedKubernetes Blog

Kubernetes v1.33: From Secrets to Service Accounts: Kubernetes Image Pulls Evolved

Kubernetes has steadily evolved to reduce reliance on long-lived credentials stored in the API. A prime example of this shift is the transition of Kubernetes Service Account (KSA) tokens from long-lived, static tokens to ephemeral, automatically rotated tokens with OpenID Connect (OIDC)-compliant semantics. This advancement enables workloads to securely authenticate with external services without needing persistent secrets.

However, one major gap remains: image pull authentication. Today, Kubernetes clusters rely on image pull secrets stored in the API, which are long-lived and difficult to rotate, or on node-level kubelet credential providers, which allow any pod running on a node to access the same credentials. This presents security and operational challenges.

To address this, Kubernetes is introducing Service Account Token Integration for Kubelet Credential Providers, now available in alpha. This enhancement allows credential providers to use pod-specific service account tokens to obtain registry credentials, which kubelet can then use for image pulls - eliminating the need for long-lived image pull secrets.

The problem with image pull secrets

Currently, Kubernetes administrators have two primary options for handling private container image pulls:

  1. Image pull secrets stored in the Kubernetes API

    • These secrets are often long-lived because they are hard to rotate.
    • They must be explicitly attached to a service account or pod.
    • Compromise of a pull secret can lead to unauthorized image access.
  2. Kubelet credential providers

    • These providers fetch credentials dynamically at the node level.
    • Any pod running on the node can access the same credentials.
    • There's no per-workload isolation, increasing security risks.

Neither approach aligns with the principles of least privilege or ephemeral authentication, leaving Kubernetes with a security gap.

The solution: Service Account token integration for Kubelet credential providers

This new enhancement enables kubelet credential providers to use workload identity when fetching image registry credentials. Instead of relying on long-lived secrets, credential providers can use service account tokens to request short-lived credentials tied to a specific pod's identity.

This approach provides:

How it works

1. Service Account tokens for credential providers

Kubelet generates short-lived, automatically rotated tokens for service accounts if the credential provider it communicates with has opted into receiving a service account token for image pulls. These tokens conform to OIDC ID token semantics and are provided to the credential provider as part of the CredentialProviderRequest. The credential provider can then use this token to authenticate with an external service.

2. Image registry authentication flow

Benefits of this approach

What's next?

For Kubernetes v1.34, we expect to ship this feature in beta while continuing to gather feedback from users.

In the coming releases, we will focus on:

You can learn more about this feature on the service account token for image pulls page in the Kubernetes documentation.

You can also follow along on the KEP-4412 to track progress across the coming Kubernetes releases.

Try it out

To try out this feature:

  1. Ensure you are running Kubernetes v1.33 or later.
  2. Enable the ServiceAccountTokenForKubeletCredentialProviders feature gate on the kubelet.
  3. Ensure credential provider support: Modify or update your credential provider to use service account tokens for authentication.
  4. Update the credential provider configuration to opt into receiving service account tokens for the credential provider by configuring the tokenAttributes field (see the sketch after this list).
  5. Deploy a pod that uses the credential provider to pull images from a private registry.
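
A rough sketch of such a kubelet CredentialProviderConfig follows; the provider name, image match pattern, and audience are illustrative, and the tokenAttributes field names follow KEP-4412 and may change while the feature is alpha:

apiVersion: kubelet.config.k8s.io/v1
kind: CredentialProviderConfig
providers:
- name: example-credential-provider # illustrative provider binary name
  matchImages:
  - "*.registry.example.com"
  defaultCacheDuration: "10m"
  apiVersion: credentialprovider.kubelet.k8s.io/v1
  tokenAttributes: # per KEP-4412; alpha field names, may change
    serviceAccountTokenAudience: "registry.example.com"
    requireServiceAccount: true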

We would love to hear your feedback on this feature. Please reach out to us on the #sig-auth-authenticators-dev channel on Kubernetes Slack (for an invitation, visit https://slack.k8s.io/).

How to get involved

If you are interested in getting involved in the development of this feature, sharing feedback, or participating in any other ongoing SIG Auth projects, please reach out on the #sig-auth channel on Kubernetes Slack.

You are also welcome to join the bi-weekly SIG Auth meetings, held every other Wednesday.

07 May 2025 6:30pm GMT