06 Nov 2025
Kubernetes Blog
Gateway API 1.4: New Features
Ready to rock your Kubernetes networking? The Kubernetes SIG Network community presented the General Availability (GA) release of Gateway API (v1.4.0)! Released on October 6, 2025, version 1.4.0 reinforces the path for modern, expressive, and extensible service networking in Kubernetes.
Gateway API v1.4.0 brings three new features to the Standard channel (Gateway API's GA release channel):
- BackendTLSPolicy for TLS between gateways and backends
- supportedFeatures in GatewayClass status
- Named rules for Routes
and introduces three new experimental features:
- Mesh resource for service mesh configuration
- Default gateways to ease configuration burden
- externalAuth filter for HTTPRoute
Graduations to Standard Channel
Backend TLS policy
Leads: Candace Holman, Norwin Schnyder, Katarzyna Łach
GEP-1897: BackendTLSPolicy
BackendTLSPolicy is a new Gateway API type for specifying the TLS configuration of the connection from the Gateway to backend pod(s). Prior to the introduction of BackendTLSPolicy, there was no API specification that allowed encrypted traffic on the hop from Gateway to backend.
The BackendTLSPolicy validation configuration requires a hostname. This hostname serves two purposes. It is used as the SNI header when connecting to the backend, and it is used for authentication: the certificate presented by the backend must match this hostname, unless subjectAltNames is explicitly specified.
If subjectAltNames (SANs) are specified, the hostname is only used for SNI, and authentication is performed against the SANs instead. If you still need to authenticate against the hostname value in this case, you MUST add it to the subjectAltNames list.
BackendTLSPolicy validation configuration also requires either caCertificateRefs or wellKnownCACertificates. caCertificateRefs refers to one or more (up to 8) PEM-encoded TLS certificate bundles. If there are no specific certificates to use, then depending on your implementation, you may use wellKnownCACertificates, set to "System", to tell the Gateway to use an implementation-specific set of trusted CA certificates.
In this example, the BackendTLSPolicy is configured to use certificates defined in the auth-cert ConfigMap to connect with a TLS-encrypted upstream connection where pods backing the auth service are expected to serve a valid certificate for auth.example.com. It uses subjectAltNames with a Hostname type, but you may also use a URI type.
apiVersion: gateway.networking.k8s.io/v1
kind: BackendTLSPolicy
metadata:
  name: tls-upstream-auth
spec:
  targetRefs:
  - kind: Service
    name: auth
    group: ""
    sectionName: "https"
  validation:
    caCertificateRefs:
    - group: "" # core API group
      kind: ConfigMap
      name: auth-cert
    subjectAltNames:
    - type: "Hostname"
      hostname: "auth.example.com"
In this example, the BackendTLSPolicy is configured to use system certificates to connect with a TLS-encrypted backend connection where Pods backing the dev Service are expected to serve a valid certificate for dev.example.com.
apiVersion: gateway.networking.k8s.io/v1
kind: BackendTLSPolicy
metadata:
  name: tls-upstream-dev
spec:
  targetRefs:
  - kind: Service
    name: dev
    group: ""
    sectionName: "btls"
  validation:
    wellKnownCACertificates: "System"
    hostname: dev.example.com
More information on the configuration of TLS in Gateway API can be found in Gateway API - TLS Configuration.
Status information about the features that an implementation supports
Leads: Lior Lieberman, Beka Modebadze
GEP-2162: Supported features in GatewayClass Status
GatewayClass status has a new field, supportedFeatures. This addition allows implementations to declare the set of features they support. This provides a clear way for users and tools to understand the capabilities of a given GatewayClass.
This feature's name for conformance tests (and GatewayClass status reporting) is SupportedFeatures. Implementations must populate the supportedFeatures field in the .status of the GatewayClass before the GatewayClass is accepted, or in the same operation.
Here's an example of supportedFeatures published under a GatewayClass's .status:
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
...
status:
  conditions:
  - lastTransitionTime: "2022-11-16T10:33:06Z"
    message: Handled by Foo controller
    observedGeneration: 1
    reason: Accepted
    status: "True"
    type: Accepted
  supportedFeatures:
  - HTTPRoute
  - HTTPRouteHostRewrite
  - HTTPRoutePortRedirect
  - HTTPRouteQueryParamMatching
Graduation of SupportedFeatures to Standard helped improve the conformance testing process for Gateway API. The conformance test suite will now automatically run tests based on the features populated in the GatewayClass's status. This creates a strong, verifiable link between an implementation's declared capabilities and the test results, making it easier for implementers to run the correct conformance tests and for users to trust the conformance reports.
This means that when the SupportedFeatures field is populated in the GatewayClass status, there is no need for additional conformance test flags like -supported-features, -exempt, or -all-features. It's important to note that Mesh features are an exception to this and can be tested for conformance by using Conformance Profiles, or by manually providing any combination of feature-related flags, until the dedicated resource graduates from the experimental channel.
Named rules for Routes
GEP-995: Adding a new name field to all xRouteRule types (HTTPRouteRule, GRPCRouteRule, etc.)
Leads: Guilherme Cassolato
This enhancement enables route rules to be explicitly identified and referenced across the Gateway API ecosystem. Some of the key use cases include:
- Status: Allowing status conditions to reference specific rules directly by name.
- Observability: Making it easier to identify individual rules in logs, traces, and metrics.
- Policies: Enabling policies (GEP-713) to target specific route rules via the sectionName field in their targetRef[s].
- Tooling: Simplifying filtering and referencing of route rules in tools such as gwctl, kubectl, and general-purpose utilities like jq and yq.
- Internal configuration mapping: Facilitating the generation of internal configurations that reference route rules by name within gateway and mesh implementations.
This follows the same well-established pattern already adopted for Gateway listeners, Service ports, Pods (and containers), and many other Kubernetes resources.
While the new name field is optional (so existing resources remain valid), its use is strongly encouraged. Implementations are not expected to assign a default value, but they may enforce constraints such as immutability.
Finally, keep in mind that the name format is validated, and other fields (such as sectionName) may impose additional, indirect constraints.
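As a rough illustration, a named rule might look like the manifest below. This is a minimal sketch: the Route, Gateway, and Service names (store-route, example-gateway, checkout-svc) and the rule name checkout are made up for the example; only the name field on the rule is the new part.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: store-route            # hypothetical Route name
spec:
  parentRefs:
  - name: example-gateway      # hypothetical Gateway name
  rules:
  - name: checkout             # the new, optional rule name
    matches:
    - path:
        type: PathPrefix
        value: /checkout
    backendRefs:
    - name: checkout-svc       # hypothetical Service name
      port: 8080
```

A policy could then refer to this rule by setting sectionName: checkout in its targetRefs, assuming the implementation supports targeting individual rules.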
Experimental channel changes
Enabling external Auth for HTTPRoute
Giving Gateway API the ability to enforce authentication and maybe authorization as well at the Gateway or HTTPRoute level has been a highly requested feature for a long time. (See the GEP-1494 issue for some background.)
This Gateway API release adds an Experimental filter in HTTPRoute that tells the Gateway API implementation to call out to an external service to authenticate (and, optionally, authorize) requests.
This filter is based on the Envoy ext_authz API, and allows talking to an Auth service that uses either gRPC or HTTP for its protocol.
Both methods allow the configuration of what headers to forward to the Auth service, with the HTTP protocol allowing some extra information like a prefix path.
An HTTP example might look like this (noting that this example requires the Experimental channel to be installed and an implementation that supports External Auth to actually understand the config):
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: require-auth
  namespace: default
spec:
  parentRefs:
  - name: your-gateway-here
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /admin
    filters:
    - type: ExternalAuth
      externalAuth:
        protocol: HTTP
        backendRef:
          name: auth-service
        http:
          # These headers are always sent for the HTTP protocol,
          # but are included here for illustrative purposes
          allowedHeaders:
          - Host
          - Method
          - Path
          - Content-Length
          - Authorization
    backendRefs:
    - name: admin-backend
      port: 8080
This allows the backend Auth service to use the supplied headers to make a determination about the authentication for the request.
When a request is allowed, the external Auth service will respond with a 200 HTTP response code, and optionally extra headers to be included in the request that is forwarded to the backend. When the request is denied, the Auth service will respond with a 403 HTTP response.
Since the Authorization header is used in many authentication methods, this method can be used to do Basic, OAuth, JWT, and other common authentication and authorization methods.
Mesh resource
Lead(s): Flynn
GEP-3949: Mesh-wide configuration and supported features
Gateway API v1.4.0 introduces a new experimental Mesh resource, which provides a way to configure mesh-wide settings and discover the features supported by a given mesh implementation. This resource is analogous to the Gateway resource and will initially be mainly used for conformance testing, with plans to extend its use to off-cluster Gateways in the future.
The Mesh resource is cluster-scoped and, as an experimental feature, is named XMesh and resides in the gateway.networking.x-k8s.io API group. A key field is controllerName, which specifies the mesh implementation responsible for the resource. The resource's status stanza indicates whether the mesh implementation has accepted it and lists the features the mesh supports.
One of the goals of this GEP is to avoid making it more difficult for users to adopt a mesh. To simplify adoption, mesh implementations are expected to create a default Mesh resource upon startup if one with a matching controllerName doesn't already exist. This avoids the need for manual creation of the resource to begin using a mesh.
The new XMesh API kind, within the gateway.networking.x-k8s.io/v1alpha1 API group, provides a central point for mesh configuration and feature discovery (source).
A minimal XMesh object specifies the controllerName:
apiVersion: gateway.networking.x-k8s.io/v1alpha1
kind: XMesh
metadata:
  name: one-mesh-to-mesh-them-all
spec:
  controllerName: one-mesh.example.com/one-mesh
The mesh implementation populates the status field to confirm it has accepted the resource and to list its supported features (source):
status:
  conditions:
  - type: Accepted
    status: "True"
    reason: Accepted
  supportedFeatures:
  - name: MeshHTTPRoute
  - name: OffClusterGateway
Introducing default Gateways
Lead(s): Flynn
GEP-3793: Allowing Gateways to program some routes by default.
For application developers, one common piece of feedback has been the need to explicitly name a parent Gateway for every single north-south Route. While this explicitness prevents ambiguity, it adds friction, especially for developers who just want to expose their application to the outside world without worrying about the underlying infrastructure's naming scheme. To address this, we have introduced the concept of Default Gateways.
For application developers: Just "use the default"
As an application developer, you often don't care about the specific Gateway your traffic flows through; you just want it to work. With this enhancement, you can now create a Route and simply ask it to use a default Gateway.
This is done by setting the new useDefaultGateways field in your Route's spec.
Here's a simple HTTPRoute that uses a default Gateway:
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: my-route
spec:
  useDefaultGateways: All
  rules:
  - backendRefs:
    - name: my-service
      port: 80
That's it! No more need to hunt down the correct Gateway name for your environment. Your Route is now a "defaulted Route."
For cluster operators: You're still in control
This feature doesn't take control away from cluster operators ("Chihiro"). In fact, they have explicit control over which Gateways can act as a default. A Gateway will only accept these defaulted Routes if it is configured to do so.
You can also use a ValidatingAdmissionPolicy to either require Routes to rely on a default Gateway or forbid them from doing so.
As a cluster operator, you can designate a Gateway as a default by setting the (new) .spec.defaultScope field:
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: my-default-gateway
  namespace: default
spec:
  defaultScope: All
  # ... other gateway configuration
Operators can choose to have no default Gateways, or even multiple.
How it works and key details
- To maintain a clean, GitOps-friendly workflow, a default Gateway does not modify the spec.parentRefs of your Route. Instead, the binding is reflected in the Route's status field. You can always inspect the status.parents stanza of your Route to see exactly which Gateway or Gateways have accepted it. This preserves your original intent and avoids conflicts with CD tools.
- The design explicitly supports having multiple Gateways designated as defaults within a cluster. When this happens, a defaulted Route will bind to all of them. This enables cluster operators to perform zero-downtime migrations and testing of new default Gateways.
- You can create a single Route that handles both north-south traffic (traffic entering or leaving the cluster, via a default Gateway) and east-west/mesh traffic (traffic between services within the cluster), by explicitly referencing a Service in parentRefs.
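To make the last point concrete, here is a hedged sketch of a Route that combines both attachment styles. The Service names (my-service, my-service-v1) are placeholders, and because useDefaultGateways is experimental, the exact shape may differ in your implementation.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: my-route
spec:
  # North-south: bind to whichever Gateways the operator designated as defaults.
  useDefaultGateways: All
  # East-west: also attach to a Service so mesh traffic follows the same rules.
  parentRefs:
  - group: ""              # core API group
    kind: Service
    name: my-service       # placeholder Service name
  rules:
  - backendRefs:
    - name: my-service-v1  # placeholder backend Service
      port: 80
```

After applying a Route like this, the status.parents stanza is where you would expect to see one entry per parent that accepted it.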
Default Gateways represent a significant step forward in making the Gateway API simpler and more intuitive for everyday use cases, bridging the gap between the flexibility needed by operators and the simplicity desired by developers.
Configuring client certificate validation
Lead(s): Arko Dasgupta, Katarzyna Łach
GEP-91: Address connection coalescing security issue
This release brings updates for configuring client certificate validation, addressing a critical security vulnerability related to connection reuse. HTTP connection coalescing is a web performance optimization that allows a client to reuse an existing TLS connection for requests to different domains. While this reduces the overhead of establishing new connections, it introduces a security risk in the context of API gateways. The ability to reuse a single TLS connection across multiple Listeners brings the need to introduce shared client certificate configuration in order to avoid unauthorized access.
Why SNI-based mTLS is not the answer
One might think that using Server Name Indication (SNI) to differentiate between Listeners would solve this problem. However, TLS SNI is not a reliable mechanism for enforcing security policies in a connection coalescing scenario. A client could use a single TLS connection for multiple peer connections, as long as they are all covered by the same certificate. This means that a client could establish a connection by indicating one peer identity (using SNI), and then reuse that connection to access a different virtual host that is listening on the same IP address and port. That reuse, which is controlled by client side heuristics, could bypass mutual TLS policies that were specific to the second listener configuration.
Here's an example to help explain it:
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: wildcard-tls-gateway
spec:
  gatewayClassName: example
  listeners:
  - name: foo-https
    protocol: HTTPS
    port: 443
    hostname: foo.example.com
    tls:
      certificateRefs:
      - group: "" # core API group
        kind: Secret
        name: foo-example-com-cert # SAN: foo.example.com
  - name: wildcard-https
    protocol: HTTPS
    port: 443
    hostname: "*.example.com"
    tls:
      certificateRefs:
      - group: "" # core API group
        kind: Secret
        name: wildcard-example-com-cert # SAN: *.example.com
I have configured a Gateway with two listeners that have overlapping hostnames. My intention is for the foo-https listener to be accessible only by clients presenting the foo-example-com-cert certificate. In contrast, the wildcard-https listener should allow access to a broader audience using any certificate valid for the *.example.com domain.
Consider a scenario where a client initially connects to foo.example.com. The server requests and successfully validates the foo-example-com-cert certificate, establishing the connection. Subsequently, the same client wishes to access other sites within this domain, such as bar.example.com, which is handled by the wildcard-https listener. Due to connection reuse, clients can access wildcard-https backends without an additional TLS handshake on the existing connection. This process functions as expected.
However, a critical security vulnerability arises when the order of access is reversed. If a client first connects to bar.example.com and presents a valid bar.example.com certificate, the connection is successfully established. If this client then attempts to access foo.example.com, the existing connection's client certificate will not be re-validated. This allows the client to bypass the specific certificate requirement for the foo backend, leading to a serious security breach.
The solution: per-port TLS configuration
The updated Gateway API gains a tls field in the .spec of a Gateway, that allows you to define a default client certificate validation configuration for all Listeners, and then if needed override it on a per-port basis. This provides a flexible and powerful way to manage your TLS policies.
Here's a look at the updated API definitions (shown as Go source code):
// GatewaySpec defines the desired state of Gateway.
type GatewaySpec struct {
    ...
    // GatewayTLSConfig specifies frontend tls configuration for gateway.
    TLS *GatewayTLSConfig `json:"tls,omitempty"`
}

// GatewayTLSConfig specifies frontend tls configuration for gateway.
type GatewayTLSConfig struct {
    // Default specifies the default client certificate validation configuration
    Default TLSConfig `json:"default"`
    // PerPort specifies tls configuration assigned per port.
    PerPort []TLSPortConfig `json:"perPort,omitempty"`
}

// TLSPortConfig describes a TLS configuration for a specific port.
type TLSPortConfig struct {
    // The Port indicates the Port Number to which the TLS configuration will be applied.
    Port PortNumber `json:"port"`
    // TLS store the configuration that will be applied to all Listeners handling
    // HTTPS traffic and matching given port.
    TLS TLSConfig `json:"tls"`
}
Breaking changes
Standard GRPCRoute - .spec field required (technicality)
The promotion of GRPCRoute to Standard introduces a minor but technically breaking change regarding the presence of the top-level .spec field. As part of achieving Standard status, the Gateway API has tightened the OpenAPI schema validation within the GRPCRoute CustomResourceDefinition (CRD) to explicitly ensure the spec field is required for all GRPCRoute resources. This change enforces stricter conformance to Kubernetes object standards and enhances the resource's stability and predictability. While it is highly unlikely that users were attempting to define a GRPCRoute without any specification, any existing automation or manifests that might have relied on a relaxed interpretation allowing a completely absent spec field will now fail validation and must be updated to include the .spec field, even if empty.
Experimental CORS support in HTTPRoute - breaking change for allowCredentials field
The Gateway API subproject has introduced a breaking change to the Experimental CORS support in HTTPRoute, concerning the allowCredentials field within the CORS policy. This field's definition has been strictly aligned with the upstream CORS specification, which dictates that the corresponding Access-Control-Allow-Credentials header must represent a Boolean value. Previously, the implementation might have been overly permissive, potentially accepting non-standard or string representations such as true due to relaxed schema validation. Users who were configuring CORS rules must now review their manifests and ensure the value for allowCredentials strictly conforms to the new, more restrictive schema. Any existing HTTPRoute definitions that do not adhere to this stricter validation will now be rejected by the API server, requiring a configuration update to maintain functionality.
Improving the development and usage experience
As part of this release, we have improved parts of the developer workflow:
- Added Kube API Linter to the CI/CD pipelines, reducing the burden on API reviewers and the number of common mistakes.
- Improved the execution time of CRD tests by using envtest.
Additionally, as part of the effort to improve the Gateway API usage experience, we removed some ambiguities and old tech debt from our documentation website:
- The API reference is now explicit when a field is experimental.
- The GEP (GatewayAPI Enhancement Proposal) navigation bar is automatically generated, reflecting the real status of the enhancements.
Try it out
Unlike other Kubernetes APIs, you don't need to upgrade to the latest version of Kubernetes to get the latest version of Gateway API. As long as you're running Kubernetes 1.26 or later, you'll be able to get up and running with this version of Gateway API.
To try out the API, follow the Getting Started Guide.
As of this writing, seven implementations are already conformant with Gateway API v1.4.0. In alphabetical order:
- Agent Gateway (with kgateway)
- Airlock Microgateway
- Envoy Gateway
- GKE Gateway
- Istio
- kgateway
- Traefik Proxy
Get involved
Wondering when a feature will be added? There are lots of opportunities to get involved and help define the future of Kubernetes routing APIs for both ingress and service mesh.
- Check out the user guides to see what use-cases can be addressed.
- Try out one of the existing Gateway controllers.
- Or join us in the community and help us build the future of Gateway API together!
The maintainers would like to thank everyone who's contributed to Gateway API, whether in the form of commits to the repo, discussion, ideas, or general support. We could never have made this kind of progress without the support of this dedicated and active community.
Related Kubernetes blog articles
- Gateway API v1.3.0: Advancements in Request Mirroring, CORS, Gateway Merging, and Retry Budgets (June 2025)
- Gateway API v1.2: WebSockets, Timeouts, Retries, and More (November 2024)
- Gateway API v1.1: Service mesh, GRPCRoute, and a whole lot more (May 2024)
- New Experimental Features in Gateway API v1.0 (November 2023)
- Gateway API v1.0: GA Release (October 2023)
06 Nov 2025 5:00pm GMT
20 Oct 2025
7 Common Kubernetes Pitfalls (and How I Learned to Avoid Them)
It's no secret that Kubernetes can be both powerful and frustrating at times. When I first started dabbling with container orchestration, I made more than my fair share of mistakes, enough to compile a whole list of pitfalls. In this post, I want to walk through seven big gotchas I've encountered (or seen others run into) and share some tips on how to avoid them. Whether you're just kicking the tires on Kubernetes or already managing production clusters, I hope these insights help you steer clear of a little extra stress.
1. Skipping resource requests and limits
The pitfall: Not specifying CPU and memory requirements in Pod specifications. This typically happens because Kubernetes does not require these fields, and workloads can often start and run without them, which makes the omission easy to overlook in early configurations or during rapid deployment cycles.
Context: In Kubernetes, resource requests and limits are critical for efficient cluster management. Resource requests ensure that the scheduler reserves the appropriate amount of CPU and memory for each pod, guaranteeing that it has the necessary resources to operate. Resource limits cap the amount of CPU and memory a pod can use, preventing any single pod from consuming excessive resources and potentially starving other pods. When resource requests and limits are not set:
- Resource Starvation: Pods may get insufficient resources, leading to degraded performance or failures. This is because Kubernetes schedules pods based on these requests. Without them, the scheduler might place too many pods on a single node, leading to resource contention and performance bottlenecks.
- Resource Hoarding: Conversely, without limits, a pod might consume more than its fair share of resources, impacting the performance and stability of other pods on the same node. This can lead to issues such as other pods getting evicted or killed by the Out-Of-Memory (OOM) killer due to lack of available memory.
How to avoid it:
- Start with modest requests (for example 100m CPU, 128Mi memory) and see how your app behaves.
- Monitor real-world usage and refine your values; the HorizontalPodAutoscaler can help automate scaling based on metrics.
- Keep an eye on kubectl top pods or your logging/monitoring tool to confirm you're not over- or under-provisioning.
My reality check: Early on, I never thought about memory limits. Things seemed fine on my local cluster. Then, on a larger environment, Pods got OOMKilled left and right. Lesson learned. For detailed instructions on configuring resource requests and limits for your containers, please refer to Assign Memory Resources to Containers and Pods (part of the official Kubernetes documentation).
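For reference, a starting point along those lines might look like the sketch below. The Pod name and image are placeholders, and the 256Mi memory limit is an assumed value to tune against real usage, not a recommendation.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-app                                   # placeholder name
spec:
  containers:
  - name: app
    image: registry.example.com/demo-app:1.0.0     # placeholder image
    resources:
      requests:
        cpu: 100m          # what the scheduler reserves for this container
        memory: 128Mi
      limits:
        memory: 256Mi      # assumed cap; exceeding it gets the container OOMKilled
```

Once the Pod has run under realistic load for a while, compare these numbers with what kubectl top pods reports and adjust.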
2. Underestimating liveness and readiness probes
The pitfall: Deploying containers without explicitly defining how Kubernetes should check their health or readiness. This tends to happen because Kubernetes will consider a container "running" as long as the process inside hasn't exited. Without additional signals, Kubernetes assumes the workload is functioning, even if the application inside is unresponsive, initializing, or stuck.
Context:
Liveness, readiness, and startup probes are mechanisms Kubernetes uses to monitor container health and availability.
- Liveness probes determine if the application is still alive. If a liveness check fails, the container is restarted.
- Readiness probes control whether a container is ready to serve traffic. Until the readiness probe passes, the container is removed from Service endpoints.
- Startup probes help distinguish between long startup times and actual failures.
How to avoid it:
- Add a simple HTTP livenessProbe to check a health endpoint (for example /healthz) so Kubernetes can restart a hung container.
- Use a readinessProbe to ensure traffic doesn't reach your app until it's warmed up.
- Keep probes simple. Overly complex checks can create false alarms and unnecessary restarts.
My reality check: I once forgot a readiness probe for a web service that took a while to load. Users hit it prematurely, got weird timeouts, and I spent hours scratching my head. A 3-line readiness probe would have saved the day.
For comprehensive instructions on configuring liveness, readiness, and startup probes for containers, please refer to Configure Liveness, Readiness and Startup Probes in the official Kubernetes documentation.
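As a minimal sketch of the three probe types together, assuming a hypothetical web container that listens on port 8080 and exposes /healthz and /ready endpoints (the names, paths, and timings are illustrative, not prescribed values):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-web                                   # placeholder name
spec:
  containers:
  - name: web
    image: registry.example.com/demo-web:1.0.0     # placeholder image
    ports:
    - containerPort: 8080
    startupProbe:            # gives a slow-starting app up to 30 x 5s before liveness kicks in
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 5
      failureThreshold: 30
    livenessProbe:           # a failing check restarts the container
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
    readinessProbe:          # a failing check removes the Pod from Service endpoints
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 5
```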
3. "We'll just look at container logs" (famous last words)
The pitfall: Relying solely on container logs retrieved via kubectl logs. This often happens because the command is quick and convenient, and in many setups, logs appear accessible during development or early troubleshooting. However, kubectl logs only retrieves logs from currently running or recently terminated containers, and those logs are stored on the node's local disk. As soon as the container is deleted, evicted, or the node is restarted, the log files may be rotated out or permanently lost.
How to avoid it:
- Centralize logs using CNCF tools like Fluentd or Fluent Bit to aggregate output from all Pods.
- Adopt OpenTelemetry for a unified view of logs, metrics, and (if needed) traces. This lets you spot correlations between infrastructure events and app-level behavior.
- Pair logs with Prometheus metrics to track cluster-level data alongside application logs. If you need distributed tracing, consider CNCF projects like Jaeger.
My reality check: The first time I lost Pod logs to a quick restart, I realized how flimsy "kubectl logs" can be on its own. Since then, I've set up a proper pipeline for every cluster to avoid missing vital clues.
4. Treating dev and prod exactly the same
The pitfall: Deploying the same Kubernetes manifests with identical settings across development, staging, and production environments. This often occurs when teams aim for consistency and reuse, but overlook that environment-specific factors (such as traffic patterns, resource availability, scaling needs, or access control) can differ significantly. Without customization, configurations optimized for one environment may cause instability, poor performance, or security gaps in another.
How to avoid it:
- Use environment overlays or kustomize to maintain a shared base while customizing resource requests, replicas, or config for each environment.
- Extract environment-specific configuration into ConfigMaps and / or Secrets. You can use a specialized tool such as Sealed Secrets to manage confidential data.
- Plan for scale in production. Your dev cluster can probably get away with minimal CPU/memory, but prod might need significantly more.
My reality check: One time, I scaled up replicaCount from 2 to 10 in a tiny dev environment just to "test." I promptly ran out of resources and spent half a day cleaning up the aftermath. Oops.
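To show what an overlay can look like, here is a small kustomize sketch for a production environment. The directory layout (a shared base two levels up), the Deployment name web, and the values are all assumptions for illustration; the patch also assumes the base already sets a memory request on the first container.

```yaml
# overlays/prod/kustomization.yaml (hypothetical path)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ../../base                 # shared manifests used by every environment
replicas:
- name: web                  # hypothetical Deployment name
  count: 5
patches:
- target:
    kind: Deployment
    name: web
  patch: |-
    - op: replace
      path: /spec/template/spec/containers/0/resources/requests/memory
      value: 512Mi
```

You would apply it with kubectl apply -k overlays/prod, while a dev overlay keeps smaller replica counts and requests.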
5. Leaving old stuff floating around
The pitfall: Leaving unused or outdated resources (such as Deployments, Services, ConfigMaps, or PersistentVolumeClaims) running in the cluster. This often happens because Kubernetes does not automatically remove resources unless explicitly instructed, and there is no built-in mechanism to track ownership or expiration. Over time, these forgotten objects can accumulate, consuming cluster resources, increasing cloud costs, and creating operational confusion, especially when stale Services or LoadBalancers continue to route traffic.
How to avoid it:
- Label everything with a purpose or owner label. That way, you can easily query resources you no longer need.
- Regularly audit your cluster: run kubectl get all -n <namespace> to see what's actually running, and confirm it's all legit.
- Adopt Kubernetes' Garbage Collection: K8s docs show how to remove dependent objects automatically.
- Leverage policy automation: Tools like Kyverno can automatically delete or block stale resources after a certain period, or enforce lifecycle policies so you don't have to remember every single cleanup step.
My reality check: After a hackathon, I forgot to tear down a "test-svc" pinned to an external load balancer. Three weeks later, I realized I'd been paying for that load balancer the entire time. Facepalm.
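Here is a sketch of what labeling might have looked like for that forgotten Service. The label keys owner and purpose, and their values, are just one possible convention, not a Kubernetes requirement.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: test-svc
  labels:
    owner: alice                 # hypothetical owner
    purpose: hackathon-demo      # hypothetical purpose label
spec:
  type: LoadBalancer
  selector:
    app: test
  ports:
  - port: 80
    targetPort: 8080
```

With labels like these in place, something like kubectl get all -n <namespace> -l purpose=hackathon-demo narrows an audit to the leftovers from one event. Keep in mind that kubectl get all does not list every resource kind (ConfigMaps and PersistentVolumeClaims, for example), so query those separately.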
6. Diving too deep into networking too soon
The pitfall: Introducing advanced networking solutions (such as service meshes, custom CNI plugins, or multi-cluster communication) before fully understanding Kubernetes' native networking primitives. This commonly occurs when teams implement features like traffic routing, observability, or mTLS using external tools without first mastering how core Kubernetes networking works, including Pod-to-Pod communication, ClusterIP Services, DNS resolution, and basic ingress traffic handling. As a result, network-related issues become harder to troubleshoot, especially when overlays introduce additional abstractions and failure points.
How to avoid it:
- Start small: a Deployment, a Service, and a basic ingress controller such as one based on NGINX (e.g., Ingress-NGINX).
- Make sure you understand how traffic flows within the cluster, how service discovery works, and how DNS is configured.
- Only move to a full-blown mesh or advanced CNI features when you actually need them; complex networking adds overhead.
My reality check: I tried Istio on a small internal app once, then spent more time debugging Istio itself than the actual app. Eventually, I stepped back, removed Istio, and everything worked fine.
7. Going too light on security and RBAC
The pitfall: Deploying workloads with insecure configurations, such as running containers as the root user, using the latest image tag, disabling security contexts, or assigning overly broad RBAC roles like cluster-admin. These practices persist because Kubernetes does not enforce strict security defaults out of the box, and the platform is designed to be flexible rather than opinionated. Without explicit security policies in place, clusters can remain exposed to risks like container escape, unauthorized privilege escalation, or accidental production changes due to unpinned images.
How to avoid it:
- Use RBAC to define roles and permissions within Kubernetes. While RBAC is the default and most widely supported authorization mechanism, Kubernetes also allows the use of alternative authorizers. For more advanced or external policy needs, consider solutions like OPA Gatekeeper (based on Rego), Kyverno, or custom webhooks using policy languages such as CEL or Cedar.
- Pin images to specific versions (no more :latest!). This helps you know what's actually deployed.
- Look into Pod Security Admission (or other solutions like Kyverno) to enforce non-root containers, read-only filesystems, etc.
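The points above can be sketched as a small set of manifests: a namespace that enforces the "restricted" Pod Security Standard through Pod Security Admission, a Pod with a pinned image and a locked-down security context, and a narrowly scoped Role instead of cluster-admin. The namespace, image, user ID, and ServiceAccount name are hypothetical placeholders.

```yaml
# Sketch only: namespace, image, UID, and ServiceAccount name are assumptions.
apiVersion: v1
kind: Namespace
metadata:
  name: secure-apps
  labels:
    pod-security.kubernetes.io/enforce: restricted   # Pod Security Admission
---
apiVersion: v1
kind: Pod
metadata:
  name: hardened-app
  namespace: secure-apps
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 10001
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      image: registry.example.com/team/app:1.4.2     # pinned tag, not :latest
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: secure-apps
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]                  # read-only, namespaced
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-reader-binding
  namespace: secure-apps
subjects:
  - kind: ServiceAccount
    name: app-viewer
    namespace: secure-apps
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-reader
```

Pinning to an image digest rather than a tag is stricter still, since tags can be re-pushed; whether that extra rigor is worth it depends on your release process.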
My reality check: I never had a huge security breach, but I've heard plenty of cautionary tales. If you don't tighten things up, it's only a matter of time before something goes wrong.
Final thoughts
Kubernetes is amazing, but it's not psychic; it won't magically do the right thing if you don't tell it what you need. By keeping these pitfalls in mind, you'll avoid a lot of headaches and wasted time. Mistakes happen (trust me, I've made my share), but each one is a chance to learn more about how Kubernetes truly works under the hood. If you're curious to dive deeper, the official docs and the community Slack are excellent next steps. And of course, feel free to share your own horror stories or success tips, because at the end of the day, we're all in this cloud native adventure together.
Happy Shipping!
20 Oct 2025 3:30pm GMT
18 Oct 2025
Kubernetes Blog
Spotlight on Policy Working Group
(Note: The Policy Working Group has completed its mission and is no longer active. This article reflects its work, accomplishments, and insights into how a working group operates.)
In the complex world of Kubernetes, policies play a crucial role in managing and securing clusters. But have you ever wondered how these policies are developed, implemented, and standardized across the Kubernetes ecosystem? To answer that, let's take a look back at the work of the Policy Working Group.
The Policy Working Group was dedicated to a critical mission: providing an overall architecture that encompasses both current policy-related implementations and future policy proposals in Kubernetes. Their goal was both ambitious and essential: to develop a universal policy architecture that benefits developers and end-users alike.
Through collaborative methods, this working group strove to bring clarity and consistency to the often complex world of Kubernetes policies. By focusing on both existing implementations and future proposals, they ensured that the policy landscape in Kubernetes remains coherent and accessible as the technology evolves.
This blog post dives deeper into the work of the Policy Working Group, guided by insights from its former co-chairs:
- Jim Bugwadia
- Andy Suderman
- Poonam Lamba
Interviewed by Arujjwal Negi.
These co-chairs explained what the Policy Working Group was all about.
Introduction
Hello, thank you for the time! Let's start with some introductions. Could you tell us a bit about yourself, your role, and how you got involved in Kubernetes?
Jim Bugwadia: My name is Jim Bugwadia, and I am a co-founder and the CEO at Nirmata, which provides solutions that automate security and compliance for cloud-native workloads. At Nirmata, we have been working with Kubernetes since it started in 2014. We initially built a Kubernetes policy engine in our commercial platform and later donated it to CNCF as the Kyverno project. I joined the CNCF Kubernetes Policy Working Group to help build and standardize various aspects of policy management for Kubernetes and later became a co-chair.
Andy Suderman: My name is Andy Suderman and I am the CTO of Fairwinds, a managed Kubernetes-as-a-Service provider. I began working with Kubernetes in 2016 building a web conferencing platform. I am an author and/or maintainer of several Kubernetes-related open-source projects such as Goldilocks, Pluto, and Polaris. Polaris is a JSON-schema-based policy engine, which started Fairwinds' journey into the policy space and my involvement in the Policy Working Group.
Poonam Lamba: My name is Poonam Lamba, and I currently work as a Product Manager for Google Kubernetes Engine (GKE) at Google. My journey with Kubernetes began back in 2017 when I was building an SRE platform for a large enterprise, using a private cloud built on Kubernetes. Intrigued by its potential to revolutionize the way we deployed and managed applications at the time, I dove headfirst into learning everything I could about it. Since then, I've had the opportunity to build the policy and compliance products for GKE. I lead and contribute to GKE CIS benchmarks. I am also involved with the Gatekeeper project, and I contributed to the Policy WG for over 2 years and served as a co-chair for the group.
Responses to the following questions represent an amalgamation of insights from the former co-chairs.
About Working Groups
One thing even I am not aware of is the difference between a working group and a SIG. Can you help us understand what a working group is and how it is different from a SIG?
Unlike SIGs, working groups are temporary and focused on tackling specific, cross-cutting issues or projects that may involve multiple SIGs. Their lifespan is defined, and they disband once they've achieved their objective. Generally, working groups don't own code or have long-term responsibility for managing a particular area of the Kubernetes project.
(To know more about SIGs, visit the list of Special Interest Groups)
You mentioned that Working Groups involve multiple SIGs. What SIGs was the Policy WG closely involved with, and how did you coordinate with them?
The group collaborated closely with Kubernetes SIG Auth throughout our existence, and with SIG Security since its formation. Our collaboration occurred in a few ways. We provided periodic updates during the SIG meetings to keep them informed of our progress and activities. Additionally, we used other community forums to maintain open lines of communication and ensure our work aligned with the broader Kubernetes ecosystem. This collaborative approach helped the group stay coordinated with related efforts across the Kubernetes community.
Policy WG
Why was the Policy Working Group created?
To enable a broad set of use cases, we recognize that Kubernetes is powered by a highly declarative, fine-grained, and extensible configuration management system. We've observed that a Kubernetes configuration manifest may have different portions that are important to various stakeholders. For example, some parts may be crucial for developers, while others might be of particular interest to security teams or address operational concerns. Given this complexity, we believe that policies governing the usage of these intricate configurations are essential for success with Kubernetes.
Our Policy Working Group was created specifically to research the standardization of policy definitions and related artifacts. We saw a need to bring consistency and clarity to how policies are defined and implemented across the Kubernetes ecosystem, given the diverse requirements and stakeholders involved in Kubernetes deployments.
Can you give me an idea of the work you did in the group?
We worked on several Kubernetes policy-related projects. Our initiatives included:
- We worked on a Kubernetes Enhancement Proposal (KEP) for the Kubernetes Policy Reports API. This aims to standardize how policy reports are generated and consumed within the Kubernetes ecosystem.
- We conducted a CNCF survey to better understand policy usage in the Kubernetes space. This helped gauge the practices and needs across the community at the time.
- We wrote a paper that will guide users in achieving PCI-DSS compliance for containers. This is intended to help organizations meet important security standards in their Kubernetes environments.
- We also worked on a paper highlighting how shifting security down can benefit organizations. This focuses on the advantages of implementing security measures earlier in the development and deployment process.
Can you tell us what were the main objectives of the Policy Working Group and some of your key accomplishments?
The charter of the Policy WG was to help standardize policy management for Kubernetes and educate the community on best practices.
To accomplish this we updated the Kubernetes documentation (Policies | Kubernetes), produced several whitepapers (Kubernetes Policy Management, Kubernetes GRC), and created the Policy Reports API (API reference) which standardizes reporting across various tools. Several popular tools such as Falco, Trivy, Kyverno, kube-bench, and others support the Policy Report API. A major milestone for the Policy WG was promoting the Policy Reports API to a SIG-level API or finding it a stable home.
Beyond that, as ValidatingAdmissionPolicy and MutatingAdmissionPolicy approached GA in Kubernetes, a key goal of the WG was to guide and educate the community on the tradeoffs and appropriate usage patterns for these built-in API objects and other CNCF policy management solutions like OPA/Gatekeeper and Kyverno.
Challenges
What were some of the major challenges that the Policy Working Group worked on?
During our work in the Policy Working Group, we encountered several challenges:
- One of the main issues we faced was finding time to consistently contribute. Given that many of us had other professional commitments, it was difficult to dedicate regular time to the working group's initiatives.
- Another challenge we experienced was related to our consensus-driven model. While this approach ensured that all voices were heard, it sometimes led to slower decision-making. We valued thorough discussion and agreement, but this occasionally delayed progress on our projects.
- We also encountered occasional differences of opinion among group members. These situations required careful navigation to ensure that we maintained a collaborative and productive environment while addressing diverse viewpoints.
- Lastly, we noticed that newcomers to the group could find it difficult to contribute effectively without consistent attendance at our meetings. The complex nature of our work often required ongoing context, which was challenging for those who weren't able to participate regularly.
Can you tell me more about those challenges? How did you discover each one? What has the impact been? What were some strategies you used to address them?
There are no easy answers, but having more contributors and maintainers greatly helps! Overall the CNCF community is great to work with and is very welcoming to beginners. So, if folks out there are hesitating to get involved, I highly encourage them to attend a WG or SIG meeting and just listen in.
It often takes a few meetings to fully understand the discussions, so don't feel discouraged if you don't grasp everything right away. We made a point to emphasize this and encouraged new members to review documentation as a starting point for getting involved.
Additionally, differences of opinion were valued and encouraged within the Policy-WG. We adhered to the CNCF core values and resolved disagreements by maintaining respect for one another. We also strove to timebox our decisions and assign clear responsibilities to keep things moving forward.
This is where our discussion about the Policy Working Group ends. The working group, and especially the people who took part in this article, hope this gave you some insights into the group's aims and workings. You can get more info about Working Groups here.
18 Oct 2025 12:00am GMT
06 Oct 2025
Kubernetes Blog
Introducing Headlamp Plugin for Karpenter - Scaling and Visibility
Headlamp is an open‑source, extensible Kubernetes SIG UI project designed to let you explore, manage, and debug cluster resources.
Karpenter is a Kubernetes Autoscaling SIG node provisioning project that helps clusters scale quickly and efficiently. It launches new nodes in seconds, selects appropriate instance types for workloads, and manages the full node lifecycle, including scale-down.
The new Headlamp Karpenter Plugin adds real-time visibility into Karpenter's activity directly from the Headlamp UI. It shows how Karpenter resources relate to Kubernetes objects, displays live metrics, and surfaces scaling events as they happen. You can inspect pending pods during provisioning, review scaling decisions, and edit Karpenter-managed resources with built-in validation. The Karpenter plugin was made as part of an LFX mentor project.
The Karpenter plugin for Headlamp aims to make it easier for Kubernetes users and operators to understand, debug, and fine-tune autoscaling behavior in their clusters. Now we will give a brief tour of the Headlamp plugin.
Map view of Karpenter Resources and how they relate to Kubernetes resources
Easily see how Karpenter resources like NodeClasses, NodePools, and NodeClaims connect with core Kubernetes resources like Pods and Nodes.

Visualization of Karpenter Metrics
Get instant insights into resource usage vs. limits, allowed disruptions, pending Pods, provisioning latency, and many more.


Scaling decisions
See which instances are being provisioned for your workloads and understand why Karpenter made those choices. This is helpful when debugging.


Config editor with validation support
Make live edits to Karpenter configurations. The editor includes diff previews and resource validation for safer adjustments.
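For orientation, this is roughly the kind of resource you might open in the editor: a sketch of a Karpenter NodePool, assuming the karpenter.sh/v1 API and the AWS provider's EC2NodeClass. The names, limit, and budget values are hypothetical, and field names may differ across Karpenter versions, so check them against the version you run.

```yaml
# Sketch under assumptions: karpenter.sh/v1 API, AWS provider, illustrative values.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
  limits:
    cpu: "100"                  # cap on total CPU this NodePool may provision
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
    budgets:
      - nodes: "10%"            # at most 10% of nodes disrupted at once
```

The limits and disruption budgets here correspond to the "usage vs. limits" and "allowed disruptions" figures that the metrics view surfaces.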

Real time view of Karpenter resources
View and track Karpenter-specific resources such as NodeClaims in real time as your cluster scales up and down.



Dashboard for Pending Pods
View all pending Pods with unmet scheduling requirements or FailedScheduling events, with highlights showing why they couldn't be scheduled.

Karpenter Providers
This plugin should work with most Karpenter providers, but so far it has been tested only on the ones marked as tested in the table below. Additionally, each provider exposes some extra provider-specific information; the table also shows for which providers the plugin displays it.
| Provider Name | Tested | Extra provider specific info supported |
|---|---|---|
| AWS | ✅ | ✅ |
| Azure | ✅ | ✅ |
| AlibabaCloud | ❌ | ❌ |
| Bizfly Cloud | ❌ | ❌ |
| Cluster API | ❌ | ❌ |
| GCP | ❌ | ❌ |
| Proxmox | ❌ | ❌ |
| Oracle Cloud Infrastructure (OCI) | ❌ | ❌ |
Please submit an issue if you test one of the untested providers or if you want support for a particular provider (PRs also gladly accepted).
How to use
Please see the plugins/karpenter/README.md for instructions on how to use.
Feedback and Questions
Please submit an issue if you use Karpenter and have any other ideas or feedback. Or come to the Kubernetes slack headlamp channel for a chat.
06 Oct 2025 12:00am GMT
25 Sep 2025
Kubernetes Blog
Announcing Changed Block Tracking API support (alpha)
We're excited to announce the alpha support for a changed block tracking mechanism. This enhances the Kubernetes storage ecosystem by providing an efficient way for CSI storage drivers to identify changed blocks in PersistentVolume snapshots. With a driver that can use the feature, you could benefit from faster and more resource-efficient backup operations.
If you're eager to try this feature, you can skip to the Getting Started section.
What is changed block tracking?
Changed block tracking enables storage systems to identify and track modifications at the block level between snapshots, eliminating the need to scan entire volumes during backup operations. The improvement is a change to the Container Storage Interface (CSI), and also to the storage support in Kubernetes itself. With the alpha feature enabled, your cluster can:
- Identify allocated blocks within a CSI volume snapshot
- Determine changed blocks between two snapshots of the same volume
- Streamline backup operations by focusing only on changed data blocks
For Kubernetes users managing large datasets, this API enables significantly more efficient backup processes. Backup applications can now focus only on the blocks that have changed, rather than processing entire volumes.
Note:
As of now, the Changed Block Tracking API is supported only for block volumes and not for file volumes. CSI drivers that manage file-based storage systems will not be able to implement this capability.
Benefits of changed block tracking support in Kubernetes
As Kubernetes adoption grows for stateful workloads managing critical data, the need for efficient backup solutions becomes increasingly important. Traditional full backup approaches face challenges with:
- Long backup windows: Full volume backups can take hours for large datasets, making it difficult to complete within maintenance windows.
- High resource utilization: Backup operations consume substantial network bandwidth and I/O resources, especially for large data volumes and data-intensive applications.
- Increased storage costs: Repetitive full backups store redundant data, causing storage requirements to grow linearly even when only a small percentage of data actually changes between backups.
The Changed Block Tracking API addresses these challenges by providing native Kubernetes support for incremental backup capabilities through the CSI interface.
Key components
The implementation consists of three primary components:
- CSI SnapshotMetadata Service API: An API, served over gRPC, that provides volume snapshot and changed block data.
- SnapshotMetadataService API: A Kubernetes CustomResourceDefinition (CRD) that advertises CSI driver metadata service availability and connection details to cluster clients.
- External Snapshot Metadata Sidecar: An intermediary component that connects CSI drivers to backup applications via a standardized gRPC interface.
Implementation requirements
Storage provider responsibilities
If you're an author of a storage integration with Kubernetes and want to support the changed block tracking feature, you must implement specific requirements:
- Implement CSI RPCs: Storage providers need to implement the SnapshotMetadata service as defined in the CSI specifications protobuf. This service requires server-side streaming implementations for the following RPCs:
  - GetMetadataAllocated: For identifying allocated blocks in a snapshot
  - GetMetadataDelta: For determining changed blocks between two snapshots
- Storage backend capabilities: Ensure the storage backend has the capability to track and report block-level changes.
- Deploy external components: Integrate with the external-snapshot-metadata sidecar to expose the snapshot metadata service.
- Register custom resource: Register the SnapshotMetadataService resource using a CustomResourceDefinition and create a SnapshotMetadataService custom resource that advertises the availability of the metadata service and provides connection details.
- Support error handling: Implement proper error handling for these RPCs according to the CSI specification requirements.
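To give a feel for the "register custom resource" step, here is a hedged sketch of what a SnapshotMetadataService object could look like. The API group and version (cbt.storage.k8s.io/v1alpha1) and the spec fields (address, audience, caCert) reflect my reading of the alpha design and should be verified against the CRD shipped in the external-snapshot-metadata repository; the driver name, address, and audience values are purely hypothetical.

```yaml
# Sketch under assumptions: group/version and field names should be checked
# against the published CRD; all values below are placeholders.
apiVersion: cbt.storage.k8s.io/v1alpha1
kind: SnapshotMetadataService
metadata:
  name: example.csi.vendor.io            # assumed to match the CSI driver name
spec:
  address: snapshot-metadata.csi-driver-ns:6443    # gRPC endpoint of the sidecar
  audience: example-snapshot-metadata-audience     # expected token audience
  caCert: <base64-encoded CA bundle>               # placeholder, supply your own
```

Backup applications would read this object to learn where the sidecar listens and which audience to request when obtaining a ServiceAccount token.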
Backup solution responsibilities
A backup solution looking to leverage this feature must:
- Set up authentication: The backup application must provide a Kubernetes ServiceAccount token when using the Kubernetes SnapshotMetadataService API. Appropriate access grants, such as RBAC RoleBindings, must be established to authorize the backup application ServiceAccount to obtain such tokens.
- Implement streaming client-side code: Develop clients that implement the streaming gRPC APIs defined in the schema.proto file. Specifically:
  - Implement streaming client code for GetMetadataAllocated and GetMetadataDelta methods
  - Handle server-side streaming responses efficiently as the metadata comes in chunks
  - Process the SnapshotMetadataResponse message format with proper error handling

  The external-snapshot-metadata GitHub repository provides a convenient iterator support package to simplify client implementation.
- Handle large dataset streaming: Design clients to efficiently handle large streams of block metadata that could be returned for volumes with significant changes.
- Optimize backup processes: Modify backup workflows to use the changed block metadata to identify and only transfer changed blocks to make backups more efficient, reducing both backup duration and resource consumption.
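The authentication step above might translate into RBAC along these lines. This is a sketch, not the project's official manifest: the resource and group names for the snapshot metadata service are assumptions to verify against the external-snapshot-metadata repository, and the ServiceAccount name and namespace are hypothetical.

```yaml
# Sketch under assumptions: verify group/resource names against the real CRDs.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: snapshot-metadata-client
rules:
  - apiGroups: ["cbt.storage.k8s.io"]              # assumed API group
    resources: ["snapshotmetadataservices"]
    verbs: ["get", "list"]
  - apiGroups: ["snapshot.storage.k8s.io"]
    resources: ["volumesnapshots", "volumesnapshotcontents"]
    verbs: ["get", "list"]
  - apiGroups: [""]
    resources: ["serviceaccounts/token"]           # TokenRequest subresource
    verbs: ["create"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: snapshot-metadata-client-binding
subjects:
  - kind: ServiceAccount
    name: backup-app                               # hypothetical ServiceAccount
    namespace: backup-system                       # hypothetical namespace
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: snapshot-metadata-client
```

In a real deployment you would likely scope the token and snapshot permissions to specific namespaces with Roles rather than granting them cluster-wide.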
Getting started
To use changed block tracking in your cluster:
- Ensure your CSI driver supports volume snapshots and implements the snapshot metadata capabilities with the required external-snapshot-metadata sidecar
- Make sure the SnapshotMetadataService custom resource is registered using CRD
- Verify the presence of a SnapshotMetadataService custom resource for your CSI driver
- Create clients that can access the API using appropriate authentication (via Kubernetes ServiceAccount tokens)
The API provides two main functions:
- GetMetadataAllocated: Lists blocks allocated in a single snapshot
- GetMetadataDelta: Lists blocks changed between two snapshots
What's next?
Depending on feedback and adoption, the Kubernetes developers hope to promote the CSI Snapshot Metadata implementation to Beta in a future release.
Where can I learn more?
For those interested in trying out this new feature:
- Official Kubernetes CSI Developer Documentation
- The enhancement proposal for the snapshot metadata feature.
- GitHub repository for implementation and release status of external-snapshot-metadata
- Complete gRPC protocol definitions for snapshot metadata API: schema.proto
- Example snapshot metadata client implementation: snapshot-metadata-lister
- End-to-end example with csi-hostpath-driver: example documentation
How do I get involved?
This project, like all of Kubernetes, is the result of hard work by many contributors from diverse backgrounds working together. On behalf of SIG Storage, I would like to offer a huge thank you to the contributors who helped review the design and implementation of the project, including but not limited to the following:
- Ben Swartzlander (bswartz)
- Carl Braganza (carlbraganza)
- Daniil Fedotov (hairyhum)
- Ivan Sim (ihcsim)
- Nikhil Ladha (Nikhil-Ladha)
- Prasad Ghangal (PrasadG193)
- Praveen M (iPraveenParihar)
- Rakshith R (Rakshith-R)
- Xing Yang (xing-yang)
Thanks also to everyone who has contributed to the project, including others who helped review the KEP and the CSI spec PR.
For those interested in getting involved with the design and development of CSI or any part of the Kubernetes Storage system, join the Kubernetes Storage Special Interest Group (SIG). We always welcome new contributors.
The SIG also holds regular Data Protection Working Group meetings. New attendees are welcome to join our discussions.
25 Sep 2025 1:00pm GMT
22 Sep 2025
Kubernetes Blog
Kubernetes v1.34: Pod Level Resources Graduated to Beta
On behalf of the Kubernetes community, I am thrilled to announce that the Pod Level Resources feature has graduated to Beta in the Kubernetes v1.34 release and is enabled by default! This significant milestone introduces a new layer of flexibility for defining and managing resource allocation for your Pods. This flexibility stems from the ability to specify CPU and memory resources for the Pod as a whole. Pod level resources can be combined with the container-level specifications to express the exact resource requirements and limits your application needs.
Pod-level specification for resources
Until recently, resource specifications that applied to Pods were primarily defined at the individual container level. While effective, this approach sometimes required duplicating or meticulously calculating resource needs across multiple containers within a single Pod. As a beta feature, Kubernetes allows you to specify the CPU, memory and hugepages resources at the Pod-level. This means you can now define resource requests and limits for an entire Pod, enabling easier resource sharing without requiring granular, per-container management of these resources where it's not needed.
Why does Pod-level specification matter?
This feature offers flexible resource management at both the Pod and container levels.
- It provides a consolidated approach to resource declaration, reducing the need for meticulous, per-container management, especially for Pods with multiple containers.
- Pod-level resources enable containers within a Pod to share unused resources amongst themselves, promoting efficient utilization within the Pod. For example, it prevents sidecar containers from becoming performance bottlenecks. Previously, a sidecar (e.g., a logging agent or service mesh proxy) hitting its individual CPU limit could be throttled and slow down the entire Pod, even if the main application container had plenty of spare CPU. With pod-level resources, the sidecar and the main container can share the Pod's resource budget, ensuring smooth operation during traffic spikes: either the whole Pod is throttled or all containers keep working.
- When both pod-level and container-level resources are specified, pod-level requests and limits take precedence. This gives you, and cluster administrators, a powerful way to enforce overall resource boundaries for your Pods.

  For scheduling, if a pod-level request is explicitly defined, the scheduler uses that specific value to find a suitable node, instead of the aggregated requests of the individual containers. At runtime, the pod-level limit acts as a hard ceiling for the combined resource usage of all containers. Crucially, this pod-level limit is the absolute enforcer; even if the sum of the individual container limits is higher, the total resource consumption can never exceed the pod-level limit.
- Pod-level resources take priority when determining the Quality of Service (QoS) class of the Pod.
- For Pods running on Linux nodes, the Out-Of-Memory (OOM) score adjustment calculation considers both pod-level and container-level resource requests.
- Pod-level resources are designed to be compatible with existing Kubernetes functionalities, ensuring a smooth integration into your workflows.
How to specify resources for an entire Pod
Using the PodLevelResources feature gate requires Kubernetes v1.34 or newer for all cluster components, including the control plane and every node. This feature gate is in beta and enabled by default in v1.34.
Example manifest
You can specify CPU, memory and hugepages resources for the entire Pod directly in the Pod spec, using the spec.resources field.
Here's an example demonstrating a Pod with both CPU and memory requests and limits defined at the Pod level:
apiVersion: v1
kind: Pod
metadata:
  name: pod-resources-demo
  namespace: pod-resources-example
spec:
  # The 'resources' field at the Pod specification level defines the overall
  # resource budget for all containers within this Pod combined.
  resources: # Pod-level resources
    # 'limits' specifies the maximum amount of resources the Pod is allowed to use.
    # The combined usage of all containers in the Pod cannot exceed these values.
    limits:
      cpu: "1" # The entire Pod cannot use more than 1 CPU core.
      memory: "200Mi" # The entire Pod cannot use more than 200 MiB of memory.
    # 'requests' specifies the minimum amount of resources guaranteed to the Pod.
    # This value is used by the Kubernetes scheduler to find a node with enough capacity.
    requests:
      cpu: "1" # The Pod is guaranteed 1 CPU core when scheduled.
      memory: "100Mi" # The Pod is guaranteed 100 MiB of memory when scheduled.
  containers:
  - name: main-app-container
    image: nginx
    ...
    # This container has no resource requests or limits specified.
  - name: auxiliary-container
    image: fedora
    command: ["sleep", "inf"]
    ...
    # This container has no resource requests or limits specified.
In this example, the pod-resources-demo Pod as a whole requests 1 CPU and 100 MiB of memory, and is limited to 1 CPU and 200 MiB of memory. The containers within will operate under these overall Pod-level constraints, as explained in the next section.
Interaction with container-level resource requests or limits
When both pod-level and container-level resources are specified, pod-level requests and limits take precedence. This means the node allocates resources based on the pod-level specifications.
Consider a Pod with two containers where pod-level CPU and memory requests and limits are defined, and only one container has its own explicit resource definitions:
apiVersion: v1
kind: Pod
metadata:
  name: pod-resources-demo
  namespace: pod-resources-example
spec:
  resources:
    limits:
      cpu: "1"
      memory: "200Mi"
    requests:
      cpu: "1"
      memory: "100Mi"
  containers:
  - name: main-app-container
    image: nginx
    resources:
      requests:
        cpu: "0.5"
        memory: "50Mi"
  - name: auxiliary-container
    image: fedora
    command: ["sleep", "inf"]
    # This container has no resource requests or limits specified.
- Pod-Level Limits: The pod-level limits (cpu: "1", memory: "200Mi") establish an absolute boundary for the entire Pod. The sum of resources consumed by all its containers is enforced at this ceiling and cannot be surpassed.
- Resource Sharing and Bursting: Containers can dynamically borrow any unused capacity, allowing them to burst as needed, so long as the Pod's aggregate usage stays within the overall limit.
- Pod-Level Requests: The pod-level requests (cpu: "1", memory: "100Mi") serve as the foundational resource guarantee for the entire Pod. This value informs the scheduler's placement decision and represents the minimum resources the Pod can rely on during node-level contention.
- Container-Level Requests: Container-level requests create a priority system within the Pod's guaranteed budget. Because main-app-container has an explicit request (cpu: "0.5", memory: "50Mi"), it is given precedence for its share of resources under resource pressure over the auxiliary-container, which has no such explicit claim.
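The point about container limits summing to more than the pod-level limit can be illustrated with a variation on the manifest above. This is a sketch with made-up numbers and a hypothetical Pod name; it assumes, based on my understanding of the validation rules, that each individual container limit must stay at or below the pod-level limit and that the container requests together must not exceed the pod-level request, so verify against the API documentation before relying on it.

```yaml
# Sketch: values and the Pod name are illustrative assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: pod-resources-overcommit-demo
  namespace: pod-resources-example
spec:
  resources:
    requests:
      cpu: "500m"
      memory: "100Mi"
    limits:
      cpu: "1"            # hard ceiling for both containers combined
      memory: "200Mi"
  containers:
  - name: main-app-container
    image: nginx
    resources:
      requests:
        cpu: "300m"
        memory: "60Mi"
      limits:
        cpu: "800m"       # may burst to 800m if the sidecar is idle
        memory: "150Mi"
  - name: auxiliary-container
    image: fedora
    command: ["sleep", "inf"]
    resources:
      requests:
        cpu: "200m"
        memory: "40Mi"
      limits:
        cpu: "800m"       # container limits sum to 1600m, above the Pod's 1 CPU
        memory: "150Mi"
```

Here either container can burst toward 800m of CPU on its own, but the two together are still held to the Pod's 1 CPU and 200 MiB ceiling.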
Limitations
- First of all, in-place resize of pod-level resources is not supported in Kubernetes v1.34 (or earlier). Attempting to modify the pod-level resource limits or requests on a running Pod results in an error: the resize is rejected. The v1.34 implementation of Pod level resources focuses on allowing initial declaration of an overall resource envelope that applies to the entire Pod. That is distinct from in-place pod resize, which (despite what the name might suggest) allows you to make dynamic adjustments to container resource requests and limits, within a running Pod, and potentially without a container restart. In-place resizing is also not yet a stable feature; it graduated to Beta in the v1.33 release.
- Only CPU, memory, and hugepages resources can be specified at pod-level.
- Pod-level resources are not supported for Windows pods. If the Pod specification explicitly targets Windows (e.g., by setting spec.os.name: "windows"), the API server will reject the Pod during the validation step. If the Pod is not explicitly marked for Windows but is scheduled to a Windows node (e.g., via a nodeSelector), the Kubelet on that Windows node will reject the Pod during its admission process.
- The Topology Manager, Memory Manager and CPU Manager do not align pods and containers based on pod-level resources as these resource managers don't currently support pod-level resources.
Getting started and providing feedback
Ready to explore the pod-level resources feature? You'll need a Kubernetes cluster running version 1.34 or later. Remember to enable the PodLevelResources feature gate across your control plane and all nodes.
As this feature moves through Beta, your feedback is invaluable. Please report any issues or share your experiences via the standard Kubernetes communication channels.
22 Sep 2025 6:30pm GMT
19 Sep 2025
Kubernetes Blog
Kubernetes v1.34: Recovery From Volume Expansion Failure (GA)
Have you ever made a typo when expanding your persistent volumes in Kubernetes? Meant to specify 2TB but specified 20TiB? This seemingly innocuous problem was kinda hard to fix - and took the project almost 5 years to solve. Automated recovery from failed storage expansion has been around for a while in beta; however, with the v1.34 release, we have graduated this to general availability.
While it was always possible to recover from failing volume expansions manually, it usually required cluster-admin access and was tedious to do (see the aforementioned link for more information).
What if you make a mistake and then realize immediately? With Kubernetes v1.34, you can reduce the requested size of the PersistentVolumeClaim (PVC), as long as the expansion to the previously requested size hasn't finished. Kubernetes will automatically work to correct it. Any quota consumed by the failed expansion will be returned to the user, and the associated PersistentVolume should be resized to the latest size you specified.
I'll walk through an example of how all of this works.
Reducing PVC size to recover from failed expansion
Imagine that you are running out of disk space for one of your database servers, and you want to expand the PVC from previously specified 10TB to 100TB - but you make a typo and specify 1000TB.
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: myclaim
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1000T # newly specified size (1000TB) - but incorrect!
Now, you may be out of disk space on your disk array, or you may simply have run out of allocated quota with your cloud provider. Either way, assume that expansion to 1000TB is never going to succeed.
In Kubernetes v1.34, you can simply correct your mistake and request a new PVC size that is smaller than the mistaken one, provided it is still larger than the original size of the actual PersistentVolume.
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: myclaim
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 100T # Corrected size (100TB); has to be greater than 10TB.
                    # You cannot shrink the volume below its actual size.
This requires no admin intervention. Even better, any surplus Kubernetes quota that you temporarily consumed will be automatically returned.
This fault recovery mechanism does have a caveat: whatever new size you specify for the PVC, it must still be larger than the original size in .status.capacity. Since Kubernetes doesn't support shrinking your PV objects, you can never go below the size that was originally allocated for your PVC request.
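To make the caveat concrete, here is a minimal Python sketch of the size check described above. It is not how Kubernetes implements the validation; the helper names `parse_quantity` and `can_recover_with` are hypothetical, and the parser only handles whole-number quantities with the usual Kubernetes suffixes (such as `T` or `Ti`).

```python
import re

# Decimal and binary suffixes used by Kubernetes resource quantities.
_SUFFIXES = {
    "": 1,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12, "P": 10**15, "E": 10**18,
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40, "Pi": 2**50, "Ei": 2**60,
}


def parse_quantity(text):
    """Convert a whole-number quantity such as '100T' or '10Ti' to bytes."""
    match = re.fullmatch(r"(\d+)([A-Za-z]*)", text.strip())
    if match is None or match.group(2) not in _SUFFIXES:
        raise ValueError(f"unsupported quantity: {text!r}")
    return int(match.group(1)) * _SUFFIXES[match.group(2)]


def can_recover_with(new_request, status_capacity):
    """Return True if new_request is larger than the volume's actual capacity.

    status_capacity stands in for the value in the PVC's .status.capacity.
    """
    return parse_quantity(new_request) > parse_quantity(status_capacity)
```

With a volume whose actual capacity is `10T`, `can_recover_with("100T", "10T")` is true, while a request of `5T` would be refused because it would shrink the volume.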
Improved error handling and observability of volume expansion
Implementing what might look like a relatively minor change also required us to almost fully redo how volume expansion works under the hood in Kubernetes. There are new API fields available in PVC objects which you can monitor to observe progress of volume expansion.
Improved observability of in-progress expansion
You can query .status.allocatedResourceStatuses['storage'] of a PVC to monitor progress of a volume expansion operation. For a typical block volume, this should transition between ControllerResizeInProgress, NodeResizePending and NodeResizeInProgress, and become nil/empty when volume expansion has finished.
If, for some reason, volume expansion to the requested size is not feasible, the field should instead report a state such as ControllerResizeInfeasible or NodeResizeInfeasible.
You can also observe the size toward which Kubernetes is working by watching pvc.status.allocatedResources.
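As a rough illustration of how you might interpret those fields in a script, the sketch below classifies a PVC (given as a plain dictionary, for example parsed from `kubectl get pvc -o json`). The function name `expansion_state` and the returned strings are hypothetical, and the status key is spelled `allocatedResourceStatuses` here on the assumption that this is how your API server serializes it; adjust the constant if your objects differ.

```python
# States named in the text above.
IN_PROGRESS = {"ControllerResizeInProgress", "NodeResizePending", "NodeResizeInProgress"}
INFEASIBLE = {"ControllerResizeInfeasible", "NodeResizeInfeasible"}

# Assumed JSON key for the per-resource resize status map.
STATUS_KEY = "allocatedResourceStatuses"


def expansion_state(pvc):
    """Summarize the storage expansion state of a PVC dictionary."""
    status = pvc.get("status", {})
    state = (status.get(STATUS_KEY) or {}).get("storage")
    target = (status.get("allocatedResources") or {}).get("storage")
    if not state:
        # An empty value means no expansion is pending (or it has finished).
        return "idle"
    if state in IN_PROGRESS:
        return f"in progress toward {target}"
    if state in INFEASIBLE:
        return f"infeasible at {target}"
    return f"unrecognized state {state}"
```

A monitoring script could call this periodically and alert only when the result starts with "infeasible", since that is the case where reducing the requested size is worth considering.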
Improved error handling and reporting
Kubernetes should now retry your failed volume expansions at a slower rate, so it makes fewer requests to both the storage system and the Kubernetes API server.
Errors observed during volume expansion are now reported as conditions on PVC objects and, unlike events, should persist. Kubernetes will now populate pvc.status.conditions with error keys ControllerResizeError or NodeResizeError when volume expansion fails.
Fixes long-standing bugs in resizing workflows
This feature has also allowed us to fix long-standing bugs in the resizing workflow, such as Kubernetes issue #115294. If you observe anything broken, please report your bugs to https://github.com/kubernetes/kubernetes/issues, along with details about how to reproduce the problem.
Working on this feature through its lifecycle was challenging and it wouldn't have been possible to reach GA without feedback from @msau42, @jsafrane and @xing-yang.
All of the contributors who worked on this also appreciate the input provided by @thockin and @liggitt at various Kubernetes contributor summits.
19 Sep 2025 6:30pm GMT
18 Sep 2025
Kubernetes Blog
Kubernetes v1.34: DRA Consumable Capacity
Dynamic Resource Allocation (DRA) is a Kubernetes API for managing scarce resources across Pods and containers. It enables flexible resource requests, going beyond simply allocating N number of devices to support more granular usage scenarios. With DRA, users can request specific types of devices based on their attributes, define custom configurations tailored to their workloads, and even share the same resource among multiple containers or Pods.
In this blog, we focus on the device sharing feature and dive into a new capability introduced in Kubernetes 1.34: DRA consumable capacity, which extends DRA to support finer-grained device sharing.
Background: device sharing via ResourceClaims
From the beginning, DRA introduced the ability for multiple Pods to share a device by referencing the same ResourceClaim. This design decouples resource allocation from specific hardware, allowing for more dynamic and reusable provisioning of devices.
In Kubernetes 1.33, the new support for partitionable devices allowed resource drivers to advertise slices of a device that are available, rather than exposing the entire device as an all-or-nothing resource. This enabled Kubernetes to model shareable hardware more accurately.
But there was still a missing piece: DRA didn't yet support scenarios where the device driver manages fine-grained, dynamic portions of a device resource - like network bandwidth - based on user demand, nor did it allow sharing those resources independently of ResourceClaims, which are restricted by their spec and namespace.
That's where consumable capacity for DRA comes in.
Benefits of DRA consumable capacity support
Here's a taste of what you get in a cluster with the DRAConsumableCapacity feature gate enabled.
Device sharing across multiple ResourceClaims or DeviceRequests
Resource drivers can now support sharing the same device - or even a slice of a device - across multiple ResourceClaims or across multiple DeviceRequests.
This means that Pods from different namespaces can simultaneously share the same device, if permitted and supported by the specific DRA driver.
Device resource allocation
Kubernetes extends the allocation algorithm in the scheduler to support allocating a portion of a device's resources, as defined in the capacity field. The scheduler ensures that the total allocated capacity across all consumers never exceeds the device's total capacity, even when shared across multiple ResourceClaims or DeviceRequests. This is very similar to the way the scheduler allows Pods and containers to share allocatable resources on Nodes; in this case, it allows them to share allocatable (consumable) resources on Devices.
This feature expands support for scenarios where the device driver is able to manage resources within a device and on a per-process basis - for example, allocating a specific amount of memory (e.g., 8 GiB) from a virtual GPU, or setting bandwidth limits on virtual network interfaces allocated to specific Pods. This aims to provide safe and efficient resource sharing.
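The accounting rule described above (the sum of all allocations must never exceed the device's capacity) can be sketched in a few lines of Python. This is only a toy model, not the scheduler's allocation algorithm; the `SharedDevice` class and its method names are hypothetical.

```python
class SharedDevice:
    """Toy model of a device whose capacity is consumed by several claims."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.allocations = []  # list of (consumer, amount) pairs

    def remaining(self):
        """Capacity not yet consumed by any allocation."""
        return self.capacity - sum(amount for _, amount in self.allocations)

    def allocate(self, consumer, amount):
        """Record the allocation if it fits; return whether it succeeded."""
        if amount <= 0 or amount > self.remaining():
            return False
        self.allocations.append((consumer, amount))
        return True
```

For example, a device with 40 units of capacity can serve a 10-unit claim and a 30-unit claim, after which any further request is refused until capacity is released.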
DistinctAttribute constraint
This feature also introduces a new constraint: DistinctAttribute, which is the complement of the existing MatchAttribute constraint.
The primary goal of DistinctAttribute is to prevent the same underlying device from being allocated multiple times within a single ResourceClaim, which could happen since we are allocating shares (or subsets) of devices. This constraint ensures that each allocation refers to a distinct resource, even if they belong to the same device class.
It is useful for use cases such as allocating network devices connecting to different subnets to expand coverage or provide redundancy across failure domains.
How to use consumable capacity?
DRAConsumableCapacity is introduced as an alpha feature in Kubernetes 1.34. The feature gate DRAConsumableCapacity must be enabled in kubelet, kube-apiserver, kube-scheduler and kube-controller-manager.
--feature-gates=...,DRAConsumableCapacity=true
As a DRA driver developer
As a DRA driver developer writing in Golang, you can make a device within a ResourceSlice allocatable to multiple ResourceClaims (or devices.requests) by setting AllowMultipleAllocations to true.
Device{
    ...
    AllowMultipleAllocations: ptr.To(true),
    ...
}
Additionally, you can define a policy to restrict how each device's Capacity should be consumed by each DeviceRequest by defining RequestPolicy field in the DeviceCapacity. The example below shows how to define a policy that requires a GPU with 40 GiB of memory to allocate at least 5 GiB per request, with each allocation in multiples of 5 GiB.
DeviceCapacity{
    Value: resource.MustParse("40Gi"),
    RequestPolicy: &CapacityRequestPolicy{
        Default: ptr.To(resource.MustParse("5Gi")),
        ValidRange: &CapacityRequestPolicyRange{
            Min:  ptr.To(resource.MustParse("5Gi")),
            Step: ptr.To(resource.MustParse("5Gi")),
        },
    },
}
This will be published to the ResourceSlice, as partially shown below:
apiVersion: resource.k8s.io/v1
kind: ResourceSlice
...
spec:
  devices:
  - name: gpu0
    allowMultipleAllocations: true
    capacity:
      memory:
        value: 40Gi
        requestPolicy:
          default: 5Gi
          validRange:
            min: 5Gi
            step: 5Gi
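To illustrate what a policy with a default, a minimum and a step means for an individual request, here is a small Python sketch. It assumes that a request is rounded up to the next valid value (`min + n * step`) and refused when the result exceeds the device capacity; that rounding behaviour is an assumption made for this example, and the function name `resolve_request` is hypothetical.

```python
GI = 2**30  # bytes in one GiB


def resolve_request(requested, capacity, default, minimum, step):
    """Return the amount a request would consume, or None if it cannot fit.

    requested may be None, in which case the policy default is used.
    Assumption: amounts are rounded up to minimum + n * step.
    """
    amount = default if requested is None else requested
    if amount <= minimum:
        amount = minimum
    else:
        # Ceiling division to find how many steps above the minimum are needed.
        steps = -(-(amount - minimum) // step)
        amount = minimum + steps * step
    return amount if amount <= capacity else None
```

Under these assumptions, a 7 GiB request against the 40 GiB device above would consume 10 GiB, a request with no explicit amount would consume the 5 GiB default, and a 45 GiB request could not be satisfied.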
An allocated device with a specified portion of consumed capacity will have a ShareID field set in the allocation status.
claim.Status.Allocation.Devices.Results[i].ShareID
This ShareID allows the driver to distinguish between different allocations that refer to the same device or same statically-partitioned slice but come from different ResourceClaim requests.
It acts as a unique identifier for each shared slice, enabling the driver to manage and enforce resource limits independently across multiple consumers.
As a consumer
As a consumer (or user), you can request the device resource with a ResourceClaim like this:
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
...
spec:
  devices:
    requests: # for devices
    - name: req0
      exactly:
        deviceClassName: resource.example.com
        capacity:
          requests: # for resources which must be provided by those devices
            memory: 10Gi
This configuration ensures that the requested device can provide at least 10GiB of memory.
Note that any resource.example.com device that has at least 10GiB of memory can be allocated. If a device that does not support multiple allocations is chosen, the allocation would consume the entire device. To filter only devices that support multiple allocations, you can define a selector like this:
selectors:
- cel:
    expression: |-
      device.allowMultipleAllocations == true
Integration with DRA device status
In device sharing, general device information is provided through the resource slice. However, some details are set dynamically after allocation. These can be conveyed using the .status.devices field of a ResourceClaim. That field is only published in clusters where the DRAResourceClaimDeviceStatus feature gate is enabled.
If you do have device status support available, a driver can expose additional device-specific information beyond the ShareID. One particularly useful use case is for virtual networks, where a driver can include the assigned IP address(es) in the status. This is valuable for both network service operations and troubleshooting.
You can find more information by watching our recording at: KubeCon Japan 2025 - Reimagining Cloud Native Networks: The Critical Role of DRA.
What can you do next?
- Check out the CNI DRA Driver project for an example of DRA integration in Kubernetes networking. Try integrating with network resources like macvlan, ipvlan, or smart NICs.
- Start enabling the DRAConsumableCapacity feature gate and experimenting with virtualized or partitionable devices. Specify your workloads with consumable capacity (for example: fractional bandwidth or memory).
- Let us know your feedback:
  - ✅ What worked well?
  - ⚠️ What didn't?
If you encountered issues to fix or opportunities to enhance, please file a new issue and reference KEP-5075 there, or reach out via Slack (#wg-device-management).
Conclusion
Consumable capacity support enhances the device sharing capability of DRA by allowing effective device sharing across namespaces, across claims, and tailored to each Pod's actual needs. It also empowers drivers to enforce capacity limits, improves scheduling accuracy, and unlocks new use cases like bandwidth-aware networking and multi-tenant device sharing.
Try it out, experiment with consumable resources, and help shape the future of dynamic resource allocation in Kubernetes!
Further Reading
- DRA in the Kubernetes documentation
- KEP for DRA Partitionable Devices
- KEP for DRA Device Status
- KEP for DRA Consumable Capacity
- Kubernetes 1.34 Release Notes
18 Sep 2025 6:30pm GMT
17 Sep 2025
Kubernetes Blog
Kubernetes v1.34: Pods Report DRA Resource Health
The rise of AI/ML and other high-performance workloads has made specialized hardware like GPUs, TPUs, and FPGAs a critical component of many Kubernetes clusters. However, as discussed in a previous blog post about navigating failures in Pods with devices, when this hardware fails, it can be difficult to diagnose, leading to significant downtime. With the release of Kubernetes v1.34, we are excited to announce a new alpha feature that brings much-needed visibility into the health of these devices.
This work extends the functionality of KEP-4680, which first introduced a mechanism for reporting the health of devices managed by Device Plugins. Now, this capability is being extended to Dynamic Resource Allocation (DRA). Controlled by the ResourceHealthStatus feature gate, this enhancement allows DRA drivers to report device health directly into a Pod's .status field, providing crucial insights for operators and developers.
Why expose device health in Pod status?
For stateful applications or long-running jobs, a device failure can be disruptive and costly. By exposing device health in the .status field for a Pod, Kubernetes provides a standardized way for users and automation tools to quickly diagnose issues. If a Pod is failing, you can now check its status to see if an unhealthy device is the root cause, saving valuable time that might otherwise be spent debugging application code.
How it works
This feature introduces a new, optional communication channel between the Kubelet and DRA drivers, built on three core components.
A new gRPC health service
A new gRPC service, DRAResourceHealth, is defined in the dra-health/v1alpha1 API group. DRA drivers can implement this service to stream device health updates to the Kubelet. The service includes a NodeWatchResources server-streaming RPC that sends the health status (Healthy, Unhealthy, or Unknown) for the devices it manages.
Kubelet integration
The Kubelet's DRAPluginManager discovers which drivers implement the health service. For each compatible driver, it starts a long-lived NodeWatchResources stream to receive health updates. The DRA Manager then consumes these updates and stores them in a persistent healthInfoCache that can survive Kubelet restarts.
Populating the Pod status
When a device's health changes, the DRA manager identifies all Pods affected by the change and triggers a Pod status update. A new field, allocatedResourcesStatus, is now part of the v1.ContainerStatus API object. The Kubelet populates this field with the current health of each device allocated to the container.
A practical example
If a Pod is in a CrashLoopBackOff state, you can use kubectl describe pod <pod-name> to inspect its status. If an allocated device has failed, the output will now include the allocatedResourcesStatus field, clearly indicating the problem:
status:
  containerStatuses:
  - name: my-gpu-intensive-container
    # ... other container statuses
    allocatedResourcesStatus:
    - name: "claim:my-gpu-claim"
      resources:
      - resourceID: "example.com/gpu-a1b2-c3d4"
        health: "Unhealthy"
This explicit status makes it clear that the issue is with the underlying hardware, not the application.
You can now improve your failure detection logic to react to unhealthy devices associated with a Pod, for example by de-scheduling that Pod.
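As a starting point for such detection logic, the following Python sketch walks a Pod object (as a plain dictionary, for example parsed from `kubectl get pod -o json`) and collects every allocated device reported as Unhealthy. The function name `unhealthy_devices` is hypothetical; the field names follow the status example above.

```python
def unhealthy_devices(pod):
    """Return (container, claim, resourceID) tuples for unhealthy devices."""
    found = []
    statuses = pod.get("status", {}).get("containerStatuses", [])
    for container in statuses:
        for allocated in container.get("allocatedResourcesStatus", []):
            for resource in allocated.get("resources", []):
                if resource.get("health") == "Unhealthy":
                    found.append(
                        (container.get("name"), allocated.get("name"), resource.get("resourceID"))
                    )
    return found
```

An automation tool could use a non-empty result as the signal to evict or reschedule the Pod, rather than spending time on application-level debugging.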
How to use this feature
As this is an alpha feature in Kubernetes v1.34, you must take the following steps to use it:
- Enable the ResourceHealthStatus feature gate on your kube-apiserver and kubelets.
- Ensure you are using a DRA driver that implements the v1alpha1 DRAResourceHealth gRPC service.
DRA drivers
If you are developing a DRA driver, make sure to think about device failure detection strategy and ensure that your driver is integrated with this feature. This way, your driver will improve the user experience and simplify debuggability of hardware issues.
What's next?
This is the first step in a broader effort to improve how Kubernetes handles device failures. As we gather feedback on this alpha feature, the community is planning several key enhancements before graduating to Beta:
- Detailed health messages: To improve the troubleshooting experience, we plan to add a human-readable message field to the gRPC API. This will allow DRA drivers to provide specific context for a health status, such as "GPU temperature exceeds threshold" or "NVLink connection lost".
- Configurable health timeouts: The timeout for marking a device's health as "Unknown" is currently hardcoded. We plan to make this configurable, likely on a per-driver basis, to better accommodate the different health-reporting characteristics of various hardware.
- Improved post-mortem troubleshooting: We will address a known limitation where health updates may not be applied to pods that have already terminated. This fix will ensure that the health status of a device at the time of failure is preserved, which is crucial for troubleshooting batch jobs and other "run-to-completion" workloads.
This feature was developed as part of KEP-4680, and community feedback is crucial as we work toward graduating it to Beta. We have more improvements to device failure handling planned for Kubernetes, and we encourage you to try this feature out and share your experiences with the SIG Node community!
17 Sep 2025 6:30pm GMT
16 Sep 2025
Kubernetes Blog
Kubernetes v1.34: Moving Volume Group Snapshots to v1beta2
Volume group snapshots were introduced as an Alpha feature with the Kubernetes 1.27 release and moved to Beta in the Kubernetes 1.32 release. The recent release of Kubernetes v1.34 moved that support to a second beta. The support for volume group snapshots relies on a set of extension APIs for group snapshots. These APIs allow users to take crash-consistent snapshots for a set of volumes. Behind the scenes, Kubernetes uses a label selector to group multiple PersistentVolumeClaims for snapshotting. A key aim is to allow you to restore that set of snapshots to new volumes and recover your workload based on a crash-consistent recovery point.
This new feature is only supported for CSI volume drivers.
What's new in Beta 2?
While testing the beta version, we encountered an issue where the restoreSize field is not set for individual VolumeSnapshotContents and VolumeSnapshots if the CSI driver does not implement the ListSnapshots RPC call. We evaluated various options and decided to fix this by releasing a new beta version of the API.
Specifically, a VolumeSnapshotInfo struct is added in v1beta2; it contains information for an individual volume snapshot that is a member of a volume group snapshot. VolumeSnapshotInfoList, a list of VolumeSnapshotInfo, is added to VolumeGroupSnapshotContentStatus, replacing VolumeSnapshotHandlePairList. VolumeSnapshotInfoList is a list of snapshot information returned by the CSI driver to identify snapshots on the storage system. VolumeSnapshotInfoList is populated by the csi-snapshotter sidecar based on the CSI CreateVolumeGroupSnapshotResponse returned by the CSI driver's CreateVolumeGroupSnapshot call.
The existing v1beta1 API objects will be converted to the new v1beta2 API objects by a conversion webhook.
What's next?
Depending on feedback and adoption, the Kubernetes project plans to push the volume group snapshot implementation to general availability (GA) in a future release.
How can I learn more?
- The design spec for the volume group snapshot feature.
- The code repository for volume group snapshot APIs and controller.
- CSI documentation on the group snapshot feature.
How do I get involved?
This project, like all of Kubernetes, is the result of hard work by many contributors from diverse backgrounds working together. On behalf of SIG Storage, I would like to offer a huge thank you to the contributors who stepped up these last few quarters to help the project reach beta:
- Ben Swartzlander (bswartz)
- Hemant Kumar (gnufied)
- Jan Šafránek (jsafrane)
- Madhu Rajanna (Madhu-1)
- Michelle Au (msau42)
- Niels de Vos (nixpanic)
- Leonardo Cecchi (leonardoce)
- Saad Ali (saad-ali)
- Xing Yang (xing-yang)
- Yati Padia (yati1998)
For those interested in getting involved with the design and development of CSI or any part of the Kubernetes Storage system, join the Kubernetes Storage Special Interest Group (SIG). We always welcome new contributors.
We also hold regular Data Protection Working Group meetings. New attendees are welcome to join our discussions.
16 Sep 2025 6:30pm GMT
15 Sep 2025
Kubernetes Blog
Kubernetes v1.34: Decoupled Taint Manager Is Now Stable
This enhancement separates the responsibility of managing node lifecycle and pod eviction into two distinct components. Previously, the node lifecycle controller handled both marking nodes as unhealthy with NoExecute taints and evicting pods from them. Now, a dedicated taint eviction controller manages the eviction process, while the node lifecycle controller focuses solely on applying taints. This separation not only improves code organization but also makes it easier to improve the taint eviction controller or build custom implementations of taint-based eviction.
What's new?
The feature gate SeparateTaintEvictionController has been promoted to GA in this release. Users can optionally disable taint-based eviction by setting --controllers=-taint-eviction-controller in kube-controller-manager.
How can I learn more?
For more details, refer to the KEP and to the beta announcement article: Kubernetes 1.29: Decoupling taint manager from node lifecycle controller.
How to get involved?
We offer a huge thank you to all the contributors who helped with design, implementation, and review of this feature and helped move it from beta to stable:
- Ed Bartosh (@bart0sh)
- Yuan Chen (@yuanchen8911)
- Aldo Culquicondor (@alculquicondor)
- Baofa Fan (@carlory)
- Sergey Kanzhelev (@SergeyKanzhelev)
- Tim Bannister (@lmktfy)
- Maciej Skoczeń (@macsko)
- Maciej Szulik (@soltysh)
- Wojciech Tyczynski (@wojtek-t)
15 Sep 2025 6:30pm GMT
12 Sep 2025
Kubernetes Blog
Kubernetes v1.34: Autoconfiguration for Node Cgroup Driver Goes GA
Historically, configuring the correct cgroup driver has been a pain point for users running new Kubernetes clusters. On Linux systems, there are two different cgroup drivers: cgroupfs and systemd. In the past, both the kubelet and CRI implementation (like CRI-O or containerd) needed to be configured to use the same cgroup driver, or else the kubelet would misbehave without any explicit error message. This was a source of headaches for many cluster admins. Now, we've (almost) arrived at the end of that headache.
Automated cgroup driver detection
In v1.28.0, the SIG Node community introduced the feature gate KubeletCgroupDriverFromCRI, which instructs the kubelet to ask the CRI implementation which cgroup driver to use. You can read more here. After many releases of waiting for each CRI implementation to have major versions released and packaged in major operating systems, this feature has gone GA as of Kubernetes 1.34.0.
In addition to setting the feature gate, a cluster admin needs to ensure their CRI implementation is new enough:
- containerd: Support was added in v2.0.0
- CRI-O: Support was added in v1.28.0
Announcement: Kubernetes is deprecating containerd v1.y support
While CRI-O releases versions that match Kubernetes versions, and thus CRI-O versions without this behavior are no longer supported, containerd maintains its own release cycle. containerd support for this feature is only in v2.0 and later, but Kubernetes 1.34 still supports containerd 1.7 and other LTS releases of containerd.
The Kubernetes SIG Node community has formally agreed upon a final support timeline for containerd v1.y. The last Kubernetes release to offer this support will be the last released version of v1.35, and support will be dropped in v1.36.0. To assist administrators in managing this future transition, a new detection mechanism is available. You are able to monitor the kubelet_cri_losing_support metric to determine if any nodes in your cluster are using a containerd version that will soon be outdated. The presence of this metric with a version label of 1.36.0 will indicate that the node's containerd runtime is not new enough for the upcoming requirements. Consequently, an administrator will need to upgrade containerd to v2.0 or a later version before, or at the same time as, upgrading the kubelet to v1.36.0.
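If you scrape kubelet metrics yourself, a check for this metric can be as small as the Python sketch below. It assumes the metric appears in the usual Prometheus text exposition format with a `version` label, as described above; the function name `losing_support_versions` is hypothetical, and the exact label set on your nodes may differ.

```python
import re

# Matches sample lines such as: kubelet_cri_losing_support{version="1.36.0"} 1
_SAMPLE = re.compile(r"^kubelet_cri_losing_support\{([^}]*)\}\s+\S+")
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def losing_support_versions(metrics_text):
    """Return the set of 'version' label values found for the metric."""
    versions = set()
    for line in metrics_text.splitlines():
        match = _SAMPLE.match(line.strip())
        if match is None:
            continue  # comments, other metrics, blank lines
        labels = dict(_LABEL.findall(match.group(1)))
        if "version" in labels:
            versions.add(labels["version"])
    return versions
```

If the returned set contains `1.36.0` for a node, that node's containerd would need upgrading to v2.0 or later before (or together with) the kubelet upgrade to v1.36.0.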
12 Sep 2025 6:30pm GMT
11 Sep 2025
Kubernetes Blog
Kubernetes v1.34: Mutable CSI Node Allocatable Graduates to Beta
The functionality for CSI drivers to update information about attachable volume count on the nodes, first introduced as Alpha in Kubernetes v1.33, has graduated to Beta in the Kubernetes v1.34 release! This marks a significant milestone in enhancing the accuracy of stateful pod scheduling by reducing failures due to outdated attachable volume capacity information.
Background
Traditionally, Kubernetes CSI drivers report a static maximum volume attachment limit when initializing. However, actual attachment capacities can change during a node's lifecycle for various reasons, such as:
- Manual or external operations attaching/detaching volumes outside of Kubernetes control.
- Dynamically attached network interfaces or specialized hardware (GPUs, NICs, etc.) consuming available slots.
- Multi-driver scenarios, where one CSI driver's operations affect available capacity reported by another.
Static reporting can cause Kubernetes to schedule pods onto nodes that appear to have capacity but don't, leading to pods stuck in a ContainerCreating state.
Dynamically adapting CSI volume limits
With this new feature, Kubernetes enables CSI drivers to dynamically adjust and report node attachment capacities at runtime. This ensures that the scheduler, as well as other components relying on this information, have the most accurate, up-to-date view of node capacity.
How it works
Kubernetes supports two mechanisms for updating the reported node volume limits:
- Periodic Updates: CSI drivers specify an interval to periodically refresh the node's allocatable capacity.
- Reactive Updates: An immediate update triggered when a volume attachment fails due to exhausted resources (ResourceExhausted error).
Enabling the feature
To use this beta feature, the MutableCSINodeAllocatableCount feature gate must be enabled in these components:
- kube-apiserver
- kubelet
Example CSI driver configuration
Below is an example of configuring a CSI driver to enable periodic updates every 60 seconds:
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: example.csi.k8s.io
spec:
  nodeAllocatableUpdatePeriodSeconds: 60
This configuration directs kubelet to periodically call the CSI driver's NodeGetInfo method every 60 seconds, updating the node's allocatable volume count. Kubernetes enforces a minimum update interval of 10 seconds to balance accuracy and resource usage.
Immediate updates on attachment failures
When a volume attachment operation fails due to a ResourceExhausted error (gRPC code 8), Kubernetes immediately updates the allocatable count instead of waiting for the next periodic update. The Kubelet then marks the affected pods as Failed, enabling their controllers to recreate them. This prevents pods from getting permanently stuck in the ContainerCreating state.
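The two triggers (periodic and reactive) can be summarized as a single decision function. The Python sketch below is a simplified model, not kubelet code: the function name `should_refresh` and its parameters are hypothetical, and rejecting periods below 10 seconds with an error is just one way to model the enforced minimum.

```python
RESOURCE_EXHAUSTED = 8   # gRPC status code for ResourceExhausted
MIN_PERIOD_SECONDS = 10  # minimum update interval mentioned above


def should_refresh(now, last_refresh, period_seconds, attach_error_code=None):
    """Decide whether the node's allocatable volume count should be re-read.

    now and last_refresh are timestamps in seconds; period_seconds is the
    configured nodeAllocatableUpdatePeriodSeconds, or None if unset.
    """
    if period_seconds is not None and period_seconds < MIN_PERIOD_SECONDS:
        raise ValueError("update period must be at least 10 seconds")
    # Reactive update: an attachment failed because resources were exhausted.
    if attach_error_code == RESOURCE_EXHAUSTED:
        return True
    # Periodic update: only when a period is configured and has elapsed.
    if period_seconds is None:
        return False
    return now - last_refresh >= period_seconds
```

With a 60-second period, a check 50 seconds after the last refresh returns false unless an attachment has just failed with code 8, in which case the refresh happens immediately.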
Getting started
To enable this feature in your Kubernetes v1.34 cluster:
- Enable the feature gate MutableCSINodeAllocatableCount on the kube-apiserver and kubelet components.
- Update your CSI driver configuration by setting nodeAllocatableUpdatePeriodSeconds.
- Monitor and observe improvements in scheduling accuracy and pod placement reliability.
Next steps
This feature is currently in beta and the Kubernetes community welcomes your feedback. Test it, share your experiences, and help guide its evolution to GA stability.
Join discussions in the Kubernetes Storage Special Interest Group (SIG-Storage) to shape the future of Kubernetes storage capabilities.
11 Sep 2025 6:30pm GMT
10 Sep 2025
Kubernetes Blog
Kubernetes v1.34: Use An Init Container To Define App Environment Variables
Kubernetes typically uses ConfigMaps and Secrets to set environment variables, which introduces additional API calls and complexity. For example, you need to separately manage the Pods of your workloads and their configurations, while ensuring orderly updates for both the configurations and the workload Pods.
Alternatively, you might be using a vendor-supplied container that requires environment variables (such as a license key or a one-time token), but you don't want to hard-code them or mount volumes just to get the job done.
If that's the situation you are in, you now have a new (alpha) way to achieve that. Provided you have the EnvFiles feature gate enabled across your cluster, you can tell the kubelet to load a container's environment variables from a volume (the volume must be part of the Pod that the container belongs to). This feature gate allows you to load environment variables directly from a file in an emptyDir volume without actually mounting that file into the container. It's a simple yet elegant solution to some surprisingly common problems.
What's this all about?
At its core, this feature allows you to point your container to a file, one generated by an initContainer, and have Kubernetes parse that file to set your environment variables. The file lives in an emptyDir volume (a temporary storage space that lasts as long as the pod does). Your main container doesn't need to mount the volume. The kubelet will read the file and inject these variables when the container starts.
How It Works
Here's a simple example:
apiVersion: v1
kind: Pod
spec:
  initContainers:
  - name: generate-config
    image: busybox
    command: ['sh', '-c', 'echo "CONFIG_VAR=HELLO" > /config/config.env']
    volumeMounts:
    - name: config-volume
      mountPath: /config
  containers:
  - name: app-container
    image: gcr.io/distroless/static
    env:
    - name: CONFIG_VAR
      valueFrom:
        fileKeyRef:
          path: config.env
          volumeName: config-volume
          key: CONFIG_VAR
  volumes:
  - name: config-volume
    emptyDir: {}
Using this approach is a breeze. You define your environment variables in the pod spec using the fileKeyRef field, which tells Kubernetes where to find the file and which key to pull. The file itself resembles standard .env syntax (think KEY=VALUE), and (for this alpha stage at least) you must ensure that it is written into an emptyDir volume. Other volume types aren't supported for this feature. At least one init container must mount that emptyDir volume (to write the file), but the main container doesn't need to; it just gets the variables handed to it at startup.
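To illustrate how the key selects one entry from a file that holds several, here is a hedged variation on the example above. The Pod name, image names, file name, and variable names are all hypothetical, and the init container stands in for whatever process would really fetch a license key or token.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: envfile-multi-demo        # hypothetical name
spec:
  initContainers:
  - name: fetch-credentials       # hypothetical; writes two KEY=VALUE lines
    image: busybox
    command:
    - sh
    - -c
    - |
      cat > /config/app.env <<EOF
      LICENSE_KEY=example-license
      ONE_TIME_TOKEN=example-token
      EOF
    volumeMounts:
    - name: config-volume
      mountPath: /config
  containers:
  - name: vendor-app
    image: registry.example.com/vendor/app:1.0   # hypothetical image
    env:
    # Each variable names the same file but pulls a different key from it.
    - name: LICENSE_KEY
      valueFrom:
        fileKeyRef:
          volumeName: config-volume
          path: app.env
          key: LICENSE_KEY
    - name: ONE_TIME_TOKEN
      valueFrom:
        fileKeyRef:
          volumeName: config-volume
          path: app.env
          key: ONE_TIME_TOKEN
  volumes:
  - name: config-volume
    emptyDir: {}
```

The container-side variable name and the key in the file happen to match here; the name under env is what the application sees, while key is what gets looked up in the file.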
A word on security
While this feature supports handling sensitive data such as keys or tokens, note that its implementation relies on emptyDir volumes mounted into the Pod. Operators with node filesystem access could therefore easily retrieve this sensitive data through pod directory paths.
If storing sensitive data like keys or tokens using this feature, ensure your cluster security policies effectively protect nodes against unauthorized access to prevent exposure of confidential information.
Summary
This feature will eliminate a number of complex workarounds used today, simplifying application authoring and opening doors for more use cases. Kubernetes stays flexible and open for feedback. Tell us how you use this feature or what is missing.
10 Sep 2025 6:30pm GMT
09 Sep 2025
Kubernetes v1.34: Snapshottable API server cache
For years, the Kubernetes community has been on a mission to improve the stability and performance predictability of the API server. A major focus of this effort has been taming list requests, which have historically been a primary source of high memory usage and heavy load on the etcd datastore. With each release, we've chipped away at the problem, and today, we're thrilled to announce the final major piece of this puzzle.
The snapshottable API server cache feature has graduated to Beta in Kubernetes v1.34, culminating a multi-release effort to allow virtually all read requests to be served directly from the API server's cache.
Evolving the cache for performance and stability
The path to the current state involved several key enhancements over recent releases that paved the way for today's announcement.
Consistent reads from cache (Beta in v1.31)
While the API server has long used a cache for performance, a key milestone was guaranteeing consistent reads of the latest data from it. This v1.31 enhancement allowed the watch cache to be used for strongly consistent read requests for the first time. That was a huge win, as it enabled filtered collections (e.g. "a list of pods bound to this node") to be safely served from the cache instead of etcd, dramatically reducing etcd load for common workloads.
Taming large responses with streaming (Beta in v1.33)
Another key improvement was tackling the problem of memory spikes when transmitting large responses. The streaming encoder, introduced in v1.33, allowed the API server to send list items one by one, rather than buffering the entire multi-gigabyte response in memory. This made the memory cost of sending a response predictable and minimal, regardless of its size.
The missing piece
Despite these huge improvements, a critical gap remained. Any request for a historical LIST (most commonly used for paginating through large result sets) still had to bypass the cache and query etcd directly. This meant that the cost of retrieving the data was still unpredictable and could put significant memory pressure on the API server.
Kubernetes 1.34: snapshots complete the picture
The snapshottable API server cache solves this final piece of the puzzle. This feature enhances the watch cache, enabling it to generate efficient, point-in-time snapshots of its state.
Here's how it works: for each update, the cache creates a lightweight snapshot. These snapshots are "lazy copies," meaning they don't duplicate objects but simply store pointers, making them incredibly memory-efficient.
When a list request for a historical resourceVersion arrives, the API server now finds the corresponding snapshot and serves the response directly from its memory. This closes the final major gap, allowing paginated requests to be served entirely from the cache.
A new era of API Server performance 🚀
With this final piece in place, the synergy of these three features ushers in a new era of API server predictability and performance:
- Get data from cache: Consistent reads and the snapshottable cache work together to ensure nearly all read requests, whether for the latest data or a historical snapshot, are served from the API server's memory.
- Send data via stream: Streaming list responses ensure that sending this data to the client has a minimal and constant memory footprint.
The result is a system where the resource cost of read operations is almost fully predictable and much more resilient to spikes in request load. This means dramatically reduced memory pressure, a lighter load on etcd, and a more stable, scalable, and reliable control plane for all Kubernetes clusters.
How to get started
With its graduation to Beta, the SnapshottableCache feature gate is enabled by default in Kubernetes v1.34. There are no actions required to start benefiting from these performance and stability improvements.
Acknowledgements
Special thanks for designing, implementing, and reviewing these critical features go to:
- Ahmad Zolfaghari (@ah8ad3)
- Ben Luddy (@benluddy) - Red Hat
- Chen Chen (@z1cheng) - Microsoft
- Davanum Srinivas (@dims) - Nvidia
- David Eads (@deads2k) - Red Hat
- Han Kang (@logicalhan) - CoreWeave
- haosdent (@haosdent) - Shopee
- Joe Betz (@jpbetz) - Google
- Jordan Liggitt (@liggitt) - Google
- Łukasz Szaszkiewicz (@p0lyn0mial) - Red Hat
- Maciej Borsz (@mborsz) - Google
- Madhav Jivrajani (@MadhavJivrajani) - UIUC
- Marek Siarkowicz (@serathius) - Google
- NKeert (@NKeert)
- Tim Bannister (@lmktfy)
- Wei Fu (@fuweid) - Microsoft
- Wojtek Tyczyński (@wojtek-t) - Google
...and many others in SIG API Machinery. This milestone is a testament to the community's dedication to building a more scalable and robust Kubernetes.
09 Sep 2025 6:30pm GMT
08 Sep 2025
Kubernetes v1.34: VolumeAttributesClass for Volume Modification GA
The VolumeAttributesClass API, which empowers users to dynamically modify volume attributes, has officially graduated to General Availability (GA) in Kubernetes v1.34. This marks a significant milestone, providing a robust and stable way to tune your persistent storage directly within Kubernetes.
What is VolumeAttributesClass?
At its core, VolumeAttributesClass is a cluster-scoped resource that defines a set of mutable parameters for a volume. Think of it as a "profile" for your storage, allowing cluster administrators to expose different quality-of-service (QoS) levels or performance tiers.
Users can then specify a volumeAttributesClassName in their PersistentVolumeClaim (PVC) to indicate which class of attributes they desire. The magic happens through the Container Storage Interface (CSI): when a PVC referencing a VolumeAttributesClass is updated, the associated CSI driver interacts with the underlying storage system to apply the specified changes to the volume.
This means you can now:
- Dynamically scale performance: Increase IOPS or throughput for a busy database, or reduce it for a less critical application.
- Optimize costs: Adjust attributes on the fly to match your current needs, avoiding over-provisioning.
- Simplify operations: Manage volume modifications directly within the Kubernetes API, rather than relying on external tools or manual processes.
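As a minimal sketch of the two objects involved, assuming a hypothetical CSI driver named csi.example.com: the parameter keys under parameters are driver-specific, so the iops and throughput names below are placeholders rather than any real driver's settings. The class, claim, and StorageClass names are also hypothetical.

```yaml
# Cluster-scoped "profile" defined by an administrator.
apiVersion: storage.k8s.io/v1
kind: VolumeAttributesClass
metadata:
  name: silver                      # hypothetical class name
driverName: csi.example.com         # hypothetical CSI driver
parameters:
  iops: "3000"                      # placeholder key; check your driver's docs
  throughput: "125"                 # placeholder key; check your driver's docs
---
# A claim that opts in to the class by name.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: db-data                     # hypothetical claim name
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: example-csi-sc  # hypothetical StorageClass
  resources:
    requests:
      storage: 100Gi
  volumeAttributesClassName: silver
```

Changing volumeAttributesClassName on the existing claim to another class (say, a hypothetical gold) is what would ask the CSI driver to modify the volume in place.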
What is new from Beta to GA
There are two major enhancements from beta.
Cancellation support when errors occur
To improve resilience and user experience, the GA release introduces explicit cancel support when a requested volume modification encounters an error. If the underlying storage system or CSI driver indicates that the requested changes cannot be applied (e.g., due to invalid arguments), users can cancel the operation and revert the volume to its previous stable configuration, preventing the volume from being left in an inconsistent state.
Quota support based on scope
While VolumeAttributesClass doesn't add a new quota type, the Kubernetes control plane can be configured to enforce quotas on PersistentVolumeClaims that reference a specific VolumeAttributesClass.
This is achieved by using the scopeSelector field in a ResourceQuota to target PVCs that have .spec.volumeAttributesClassName set to a particular VolumeAttributesClass name. See the Kubernetes ResourceQuota documentation for more details.
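A quota of this kind might look roughly like the following. This is a sketch under assumptions: the quota name, namespace, class name, and limits are hypothetical, and the VolumeAttributesClass scope name should be checked against the ResourceQuota documentation for your Kubernetes version.

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: pvcs-gold                  # hypothetical quota name
  namespace: team-a                # hypothetical namespace
spec:
  hard:
    persistentvolumeclaims: "10"   # at most 10 matching PVCs
    requests.storage: 500Gi        # total storage requested by matching PVCs
  scopeSelector:
    matchExpressions:
    - scopeName: VolumeAttributesClass
      operator: In
      values: ["gold"]             # hypothetical VolumeAttributesClass name
```

Only PVCs in the namespace that reference the listed class count against these limits; claims using other classes, or none, are unaffected by this quota.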
Drivers that support VolumeAttributesClass
- Amazon EBS CSI Driver: The AWS EBS CSI driver has robust support for VolumeAttributesClass and allows you to modify parameters like volume type (e.g., gp2 to gp3, io1 to io2), IOPS, and throughput of EBS volumes dynamically.
- Google Compute Engine (GCE) Persistent Disk CSI Driver (pd.csi.storage.gke.io): This driver also supports dynamic modification of persistent disk attributes, including IOPS and throughput, via VolumeAttributesClass.
Contact
For any inquiries or specific questions related to VolumeAttributesClass, please reach out to the SIG Storage community.
08 Sep 2025 6:30pm GMT