06 Nov 2025
Kubernetes Blog
Gateway API 1.4: New Features
Ready to rock your Kubernetes networking? The Kubernetes SIG Network community presented the General Availability (GA) release of Gateway API (v1.4.0)! Released on October 6, 2025, version 1.4.0 reinforces the path for modern, expressive, and extensible service networking in Kubernetes.
Gateway API v1.4.0 brings three new features to the Standard channel (Gateway API's GA release channel):
- BackendTLSPolicy for TLS between gateways and backends
- supportedFeatures in GatewayClass status
- Named rules for Routes
and introduces three new experimental features:
- Mesh resource for service mesh configuration
- Default gateways to ease configuration burden
- externalAuth filter for HTTPRoute
Graduations to Standard Channel
Backend TLS policy
Leads: Candace Holman, Norwin Schnyder, Katarzyna Łach
GEP-1897: BackendTLSPolicy
BackendTLSPolicy is a new Gateway API type for specifying the TLS configuration of the connection from the Gateway to backend pod(s). Prior to the introduction of BackendTLSPolicy, there was no API specification that allowed encrypted traffic on the hop from Gateway to backend.
The BackendTLSPolicy validation configuration requires a hostname, which serves two purposes: it is used as the SNI header when connecting to the backend, and, for authentication, the certificate presented by the backend must match this hostname unless subjectAltNames is explicitly specified.
If subjectAltNames (SANs) are specified, the hostname is only used for SNI, and authentication is performed against the SANs instead. If you still need to authenticate against the hostname value in this case, you MUST add it to the subjectAltNames list.
BackendTLSPolicy validation configuration also requires either caCertificateRefs or wellKnownCACertificates. caCertificateRefs refers to one or more (up to 8) PEM-encoded TLS certificate bundles. If there are no specific certificates to use, then, depending on your implementation, you may set wellKnownCACertificates to "System" to tell the Gateway to use an implementation-specific set of trusted CA certificates.
In this example, the BackendTLSPolicy is configured to use certificates defined in the auth-cert ConfigMap to connect with a TLS-encrypted upstream connection where pods backing the auth service are expected to serve a valid certificate for auth.example.com. It uses subjectAltNames with a Hostname type, but you may also use a URI type.
apiVersion: gateway.networking.k8s.io/v1
kind: BackendTLSPolicy
metadata:
  name: tls-upstream-auth
spec:
  targetRefs:
  - kind: Service
    name: auth
    group: ""
    sectionName: "https"
  validation:
    caCertificateRefs:
    - group: ""  # core API group
      kind: ConfigMap
      name: auth-cert
    subjectAltNames:
    - type: "Hostname"
      hostname: "auth.example.com"
In this example, the BackendTLSPolicy is configured to use system certificates to connect with a TLS-encrypted backend connection where Pods backing the dev Service are expected to serve a valid certificate for dev.example.com.
apiVersion: gateway.networking.k8s.io/v1
kind: BackendTLSPolicy
metadata:
  name: tls-upstream-dev
spec:
  targetRefs:
  - kind: Service
    name: dev
    group: ""
    sectionName: "btls"
  validation:
    wellKnownCACertificates: "System"
    hostname: dev.example.com
More information on the configuration of TLS in Gateway API can be found in Gateway API - TLS Configuration.
Status information about the features that an implementation supports
Leads: Lior Lieberman, Beka Modebadze
GEP-2162: Supported features in GatewayClass Status
GatewayClass status has a new field, supportedFeatures. This addition allows implementations to declare the set of features they support. This provides a clear way for users and tools to understand the capabilities of a given GatewayClass.
This feature's name for conformance tests (and GatewayClass status reporting) is SupportedFeatures. Implementations must populate the supportedFeatures field in the .status of the GatewayClass before the GatewayClass is accepted, or in the same operation.
Here's an example of supportedFeatures published under a GatewayClass's .status:
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
...
status:
  conditions:
  - lastTransitionTime: "2022-11-16T10:33:06Z"
    message: Handled by Foo controller
    observedGeneration: 1
    reason: Accepted
    status: "True"
    type: Accepted
  supportedFeatures:
  - HTTPRoute
  - HTTPRouteHostRewrite
  - HTTPRoutePortRedirect
  - HTTPRouteQueryParamMatching
The graduation of SupportedFeatures to Standard helped improve the conformance testing process for Gateway API. The conformance test suite will now automatically run tests based on the features populated in the GatewayClass's status. This creates a strong, verifiable link between an implementation's declared capabilities and the test results, making it easier for implementers to run the correct conformance tests and for users to trust the conformance reports.
This means that when the supportedFeatures field is populated in the GatewayClass status, there is no need for additional conformance test flags like -supported-features, -exempt, or -all-features. It's important to note that Mesh features are an exception to this and can be tested for conformance by using Conformance Profiles, or by manually providing any combination of feature-related flags, until the dedicated resource graduates from the Experimental channel.
Named rules for Routes
GEP-995: Adding a new name field to all xRouteRule types (HTTPRouteRule, GRPCRouteRule, etc.)
Leads: Guilherme Cassolato
This enhancement enables route rules to be explicitly identified and referenced across the Gateway API ecosystem. Some of the key use cases include:
- Status: Allowing status conditions to reference specific rules directly by name.
- Observability: Making it easier to identify individual rules in logs, traces, and metrics.
- Policies: Enabling policies (GEP-713) to target specific route rules via the sectionName field in their targetRef[s].
- Tooling: Simplifying filtering and referencing of route rules in tools such as gwctl, kubectl, and general-purpose utilities like jq and yq.
- Internal configuration mapping: Facilitating the generation of internal configurations that reference route rules by name within gateway and mesh implementations.
This follows the same well-established pattern already adopted for Gateway listeners, Service ports, Pods (and containers), and many other Kubernetes resources.
While the new name field is optional (so existing resources remain valid), its use is strongly encouraged. Implementations are not expected to assign a default value, but they may enforce constraints such as immutability.
Finally, keep in mind that the name format is validated, and other fields (such as sectionName) may impose additional, indirect constraints.
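For illustration, here is a minimal sketch of an HTTPRoute that names one of its rules; the Route, Gateway, and Service names are hypothetical:
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: store-route            # hypothetical Route name
spec:
  parentRefs:
  - name: my-gateway           # hypothetical Gateway
  rules:
  - name: checkout             # the new, optional rule name
    matches:
    - path:
        type: PathPrefix
        value: /checkout
    backendRefs:
    - name: checkout-service   # hypothetical Service
      port: 8080
A policy could then target this specific rule by setting sectionName: checkout in its targetRef, and status conditions, logs, or metrics can refer to the checkout rule by name.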
Experimental channel changes
Enabling external Auth for HTTPRoute
Giving Gateway API the ability to enforce authentication and maybe authorization as well at the Gateway or HTTPRoute level has been a highly requested feature for a long time. (See the GEP-1494 issue for some background.)
This Gateway API release adds an Experimental filter in HTTPRoute that tells the Gateway API implementation to call out to an external service to authenticate (and, optionally, authorize) requests.
This filter is based on the Envoy ext_authz API, and allows talking to an Auth service that uses either gRPC or HTTP for its protocol.
Both methods allow the configuration of what headers to forward to the Auth service, with the HTTP protocol allowing some extra information like a prefix path.
An HTTP example might look like this (noting that this example requires the Experimental channel to be installed and an implementation that supports External Auth to actually understand the config):
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: require-auth
  namespace: default
spec:
  parentRefs:
  - name: your-gateway-here
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /admin
    filters:
    - type: ExternalAuth
      externalAuth:
        protocol: HTTP
        backendRef:
          name: auth-service
        http:
          # These headers are always sent for the HTTP protocol,
          # but are included here for illustrative purposes
          allowedHeaders:
          - Host
          - Method
          - Path
          - Content-Length
          - Authorization
    backendRefs:
    - name: admin-backend
      port: 8080
This allows the backend Auth service to use the supplied headers to make a determination about the authentication for the request.
When a request is allowed, the external Auth service will respond with a 200 HTTP response code, and optionally extra headers to be included in the request that is forwarded to the backend. When the request is denied, the Auth service will respond with a 403 HTTP response code.
Since the Authorization header is used in many authentication methods, this filter can be used to implement Basic, OAuth, JWT, and other common authentication and authorization methods.
Mesh resource
Lead(s): Flynn
GEP-3949: Mesh-wide configuration and supported features
Gateway API v1.4.0 introduces a new experimental Mesh resource, which provides a way to configure mesh-wide settings and discover the features supported by a given mesh implementation. This resource is analogous to the Gateway resource and will initially be mainly used for conformance testing, with plans to extend its use to off-cluster Gateways in the future.
The Mesh resource is cluster-scoped and, as an experimental feature, is named XMesh and resides in the gateway.networking.x-k8s.io API group. A key field is controllerName, which specifies the mesh implementation responsible for the resource. The resource's status stanza indicates whether the mesh implementation has accepted it and lists the features the mesh supports.
One of the goals of this GEP is to avoid making it more difficult for users to adopt a mesh. To simplify adoption, mesh implementations are expected to create a default Mesh resource upon startup if one with a matching controllerName doesn't already exist. This avoids the need for manual creation of the resource to begin using a mesh.
The new XMesh API kind, within the gateway.networking.x-k8s.io/v1alpha1 API group, provides a central point for mesh configuration and feature discovery (source).
A minimal XMesh object specifies the controllerName:
apiVersion: gateway.networking.x-k8s.io/v1alpha1
kind: XMesh
metadata:
  name: one-mesh-to-mesh-them-all
spec:
  controllerName: one-mesh.example.com/one-mesh
The mesh implementation populates the status field to confirm it has accepted the resource and to list its supported features (source):
status:
  conditions:
  - type: Accepted
    status: "True"
    reason: Accepted
  supportedFeatures:
  - name: MeshHTTPRoute
  - name: OffClusterGateway
Introducing default Gateways
Lead(s): Flynn
GEP-3793: Allowing Gateways to program some routes by default.
For application developers, one common piece of feedback has been the need to explicitly name a parent Gateway for every single north-south Route. While this explicitness prevents ambiguity, it adds friction, especially for developers who just want to expose their application to the outside world without worrying about the underlying infrastructure's naming scheme. To address this, we have introduced the concept of Default Gateways.
For application developers: Just "use the default"
As an application developer, you often don't care about the specific Gateway your traffic flows through, you just want it to work. With this enhancement, you can now create a Route and simply ask it to use a default Gateway.
This is done by setting the new useDefaultGateways field in your Route's spec.
Here's a simple HTTPRoute that uses a default Gateway:
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: my-route
spec:
  useDefaultGateways: All
  rules:
  - backendRefs:
    - name: my-service
      port: 80
That's it! No more need to hunt down the correct Gateway name for your environment. Your Route is now a "defaulted Route."
For cluster operators: You're still in control
This feature doesn't take control away from cluster operators ("Chihiro"). In fact, they have explicit control over which Gateways can act as a default. A Gateway will only accept these defaulted Routes if it is configured to do so.
You can also use a ValidatingAdmissionPolicy to either require or forbid the use of default Gateways by Routes.
As a cluster operator, you can designate a Gateway as a default by setting the (new) .spec.defaultScope field:
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: my-default-gateway
  namespace: default
spec:
  defaultScope: All
  # ... other gateway configuration
Operators can choose to have no default Gateways, or even multiple.
How it works and key details
- To maintain a clean, GitOps-friendly workflow, a default Gateway does not modify the spec.parentRefs of your Route. Instead, the binding is reflected in the Route's status field. You can always inspect the status.parents stanza of your Route to see exactly which Gateway or Gateways have accepted it (see the sketch after this list). This preserves your original intent and avoids conflicts with CD tools.
- The design explicitly supports having multiple Gateways designated as defaults within a cluster. When this happens, a defaulted Route will bind to all of them. This enables cluster operators to perform zero-downtime migrations and testing of new default Gateways.
- You can create a single Route that handles both north-south traffic (traffic entering or leaving the cluster, via a default Gateway) and east-west/mesh traffic (traffic between services within the cluster), by explicitly referencing a Service in parentRefs.
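As a rough sketch only, the status of a defaulted Route could report its adopted parent using the standard status.parents structure shown below; the Gateway and controller names are placeholders, and the exact representation for defaulted Routes is up to the implementation:
status:
  parents:
  - parentRef:
      group: gateway.networking.k8s.io
      kind: Gateway
      name: my-default-gateway                      # the default Gateway that adopted the Route
      namespace: default
    controllerName: example.com/gateway-controller  # placeholder controller name
    conditions:
    - type: Accepted
      status: "True"
      reason: Accepted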
Default Gateways represent a significant step forward in making the Gateway API simpler and more intuitive for everyday use cases, bridging the gap between the flexibility needed by operators and the simplicity desired by developers.
Configuring client certificate validation
Lead(s): Arko Dasgupta, Katarzyna Łach
GEP-91: Address connection coalescing security issue
This release brings updates for configuring client certificate validation, addressing a critical security vulnerability related to connection reuse. HTTP connection coalescing is a web performance optimization that allows a client to reuse an existing TLS connection for requests to different domains. While this reduces the overhead of establishing new connections, it introduces a security risk in the context of API gateways. The ability to reuse a single TLS connection across multiple Listeners brings the need to introduce shared client certificate configuration in order to avoid unauthorized access.
Why SNI-based mTLS is not the answer
One might think that using Server Name Indication (SNI) to differentiate between Listeners would solve this problem. However, TLS SNI is not a reliable mechanism for enforcing security policies in a connection coalescing scenario. A client could use a single TLS connection for multiple peer connections, as long as they are all covered by the same certificate. This means that a client could establish a connection by indicating one peer identity (using SNI), and then reuse that connection to access a different virtual host that is listening on the same IP address and port. That reuse, which is controlled by client side heuristics, could bypass mutual TLS policies that were specific to the second listener configuration.
Here's an example to help explain it:
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: wildcard-tls-gateway
spec:
  gatewayClassName: example
  listeners:
  - name: foo-https
    protocol: HTTPS
    port: 443
    hostname: foo.example.com
    tls:
      certificateRefs:
      - group: ""  # core API group
        kind: Secret
        name: foo-example-com-cert  # SAN: foo.example.com
  - name: wildcard-https
    protocol: HTTPS
    port: 443
    hostname: "*.example.com"
    tls:
      certificateRefs:
      - group: ""  # core API group
        kind: Secret
        name: wildcard-example-com-cert  # SAN: *.example.com
I have configured a Gateway with two listeners that have overlapping hostnames. My intention is for the foo-https listener to be accessible only by clients presenting the foo-example-com-cert certificate. In contrast, the wildcard-https listener should allow access to a broader audience using any certificate valid for the *.example.com domain.
Consider a scenario where a client initially connects to foo.example.com. The server requests and successfully validates the foo-example-com-cert certificate, establishing the connection. Subsequently, the same client wishes to access other sites within this domain, such as bar.example.com, which is handled by the wildcard-https listener. Due to connection reuse, clients can access wildcard-https backends without an additional TLS handshake on the existing connection. This process functions as expected.
However, a critical security vulnerability arises when the order of access is reversed. If a client first connects to bar.example.com and presents a valid bar.example.com certificate, the connection is successfully established. If this client then attempts to access foo.example.com, the existing connection's client certificate will not be re-validated. This allows the client to bypass the specific certificate requirement for the foo backend, leading to a serious security breach.
The solution: per-port TLS configuration
The updated Gateway API gains a tls field in the .spec of a Gateway that allows you to define a default client certificate validation configuration for all Listeners and then, if needed, override it on a per-port basis. This provides a flexible and powerful way to manage your TLS policies.
Here's a look at the updated API definitions (shown as Go source code):
// GatewaySpec defines the desired state of Gateway.
type GatewaySpec struct {
    ...

    // GatewayTLSConfig specifies frontend tls configuration for gateway.
    TLS *GatewayTLSConfig `json:"tls,omitempty"`
}

// GatewayTLSConfig specifies frontend tls configuration for gateway.
type GatewayTLSConfig struct {
    // Default specifies the default client certificate validation configuration
    Default TLSConfig `json:"default"`

    // PerPort specifies tls configuration assigned per port.
    PerPort []TLSPortConfig `json:"perPort,omitempty"`
}

// TLSPortConfig describes a TLS configuration for a specific port.
type TLSPortConfig struct {
    // The Port indicates the Port Number to which the TLS configuration will be applied.
    Port PortNumber `json:"port"`

    // TLS store the configuration that will be applied to all Listeners handling
    // HTTPS traffic and matching given port.
    TLS TLSConfig `json:"tls"`
}
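To make this concrete, here is a hedged YAML sketch of a Gateway that sets a default client certificate validation configuration and overrides it for port 443. The frontendValidation and caCertificateRefs field names inside TLSConfig are assumptions based on the existing listener-level TLS validation; consult the experimental API reference for the exact schema:
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: wildcard-tls-gateway
spec:
  gatewayClassName: example
  tls:
    default:                        # applies to all Listeners unless overridden
      frontendValidation:           # assumed field name for client certificate validation
        caCertificateRefs:
        - group: ""                 # core API group
          kind: ConfigMap
          name: default-client-ca   # hypothetical CA bundle
    perPort:
    - port: 443                     # override for all Listeners on port 443
      tls:
        frontendValidation:
          caCertificateRefs:
          - group: ""
            kind: ConfigMap
            name: strict-client-ca  # hypothetical CA bundle
  listeners:
  - name: wildcard-https
    protocol: HTTPS
    port: 443
    hostname: "*.example.com"
    tls:
      certificateRefs:
      - kind: Secret
        name: wildcard-example-com-cert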
Breaking changes
Standard GRPCRoute - .spec field required (technicality)
The promotion of GRPCRoute to Standard introduces a minor but technically breaking change regarding the presence of the top-level .spec field. As part of achieving Standard status, the Gateway API has tightened the OpenAPI schema validation within the GRPCRoute CustomResourceDefinition (CRD) to explicitly ensure the spec field is required for all GRPCRoute resources. This change enforces stricter conformance to Kubernetes object standards and enhances the resource's stability and predictability. While it is highly unlikely that users were attempting to define a GRPCRoute without any specification, any existing automation or manifests that might have relied on a relaxed interpretation allowing a completely absent spec field will now fail validation and must be updated to include the .spec field, even if empty.
Experimental CORS support in HTTPRoute - breaking change for allowCredentials field
The Gateway API subproject has introduced a breaking change to the Experimental CORS support in HTTPRoute, concerning the allowCredentials field within the CORS policy. This field's definition has been strictly aligned with the upstream CORS specification, which dictates that the corresponding Access-Control-Allow-Credentials header must represent a Boolean value. Previously, the implementation might have been overly permissive, potentially accepting non-standard or string representations such as true due to relaxed schema validation. Users who were configuring CORS rules must now review their manifests and ensure the value for allowCredentials strictly conforms to the new, more restrictive schema. Any existing HTTPRoute definitions that do not adhere to this stricter validation will now be rejected by the API server, requiring a configuration update to maintain functionality.
Improving the development and usage experience
As part of this release, we have improved parts of the developer experience workflow:
- Added Kube API Linter to the CI/CD pipelines, reducing the burden on API reviewers and the number of common mistakes.
- Improved the execution time of CRD tests by using envtest.
Additionally, as part of the effort to improve the Gateway API usage experience, we removed some ambiguities and old tech debt from our documentation website:
- The API reference is now explicit when a field is experimental.
- The GEP (Gateway API Enhancement Proposal) navigation bar is automatically generated, reflecting the real status of the enhancements.
Try it out
Unlike other Kubernetes APIs, you don't need to upgrade to the latest version of Kubernetes to get the latest version of Gateway API. As long as you're running Kubernetes 1.26 or later, you'll be able to get up and running with this version of Gateway API.
To try out the API, follow the Getting Started Guide.
As of this writing, seven implementations are already conformant with Gateway API v1.4.0. In alphabetical order:
- Agent Gateway (with kgateway)
- Airlock Microgateway
- Envoy Gateway
- GKE Gateway
- Istio
- kgateway
- Traefik Proxy
Get involved
Wondering when a feature will be added? There are lots of opportunities to get involved and help define the future of Kubernetes routing APIs for both ingress and service mesh.
- Check out the user guides to see what use-cases can be addressed.
- Try out one of the existing Gateway controllers.
- Or join us in the community and help us build the future of Gateway API together!
The maintainers would like to thank everyone who's contributed to Gateway API, whether in the form of commits to the repo, discussion, ideas, or general support. We could never have made this kind of progress without the support of this dedicated and active community.
Related Kubernetes blog articles
- Gateway API v1.3.0: Advancements in Request Mirroring, CORS, Gateway Merging, and Retry Budgets (June 2025)
- Gateway API v1.2: WebSockets, Timeouts, Retries, and More (November 2024)
- Gateway API v1.1: Service mesh, GRPCRoute, and a whole lot more (May 2024)
- New Experimental Features in Gateway API v1.0 (November 2023)
- Gateway API v1.0: GA Release (October 2023)
06 Nov 2025 5:00pm GMT
20 Oct 2025
Kubernetes Blog
7 Common Kubernetes Pitfalls (and How I Learned to Avoid Them)
It's no secret that Kubernetes can be both powerful and frustrating at times. When I first started dabbling with container orchestration, I made more than my fair share of mistakes, enough to compile a whole list of pitfalls. In this post, I want to walk through seven big gotchas I've encountered (or seen others run into) and share some tips on how to avoid them. Whether you're just kicking the tires on Kubernetes or already managing production clusters, I hope these insights help you spare yourself a little extra stress.
1. Skipping resource requests and limits
The pitfall: Not specifying CPU and memory requirements in Pod specifications. This typically happens because Kubernetes does not require these fields, and workloads can often start and run without them, making the omission easy to overlook in early configurations or during rapid deployment cycles.
Context: In Kubernetes, resource requests and limits are critical for efficient cluster management. Resource requests ensure that the scheduler reserves the appropriate amount of CPU and memory for each pod, guaranteeing that it has the necessary resources to operate. Resource limits cap the amount of CPU and memory a pod can use, preventing any single pod from consuming excessive resources and potentially starving other pods. When resource requests and limits are not set:
- Resource Starvation: Pods may get insufficient resources, leading to degraded performance or failures. This is because Kubernetes schedules pods based on these requests. Without them, the scheduler might place too many pods on a single node, leading to resource contention and performance bottlenecks.
- Resource Hoarding: Conversely, without limits, a pod might consume more than its fair share of resources, impacting the performance and stability of other pods on the same node. This can lead to issues such as other pods getting evicted or killed by the Out-Of-Memory (OOM) killer due to lack of available memory.
How to avoid it:
- Start with modest requests (for example 100m CPU, 128Mi memory) and see how your app behaves; see the sketch after this list.
- Monitor real-world usage and refine your values; the HorizontalPodAutoscaler can help automate scaling based on metrics.
- Keep an eye on kubectl top pods or your logging/monitoring tool to confirm you're not over- or under-provisioning.
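As a starting point, a minimal sketch of a Pod with modest requests and conservative limits might look like this; the name, image, and values are illustrative only, not a recommendation for your workload:
apiVersion: v1
kind: Pod
metadata:
  name: demo-app                # hypothetical name
spec:
  containers:
  - name: web
    image: nginx:1.27           # pin a specific tag rather than :latest
    resources:
      requests:
        cpu: 100m               # the scheduler reserves this much CPU
        memory: 128Mi           # and this much memory
      limits:
        cpu: 500m               # the container is throttled above this
        memory: 256Mi           # the container is OOM-killed above this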
My reality check: Early on, I never thought about memory limits. Things seemed fine on my local cluster. Then, on a larger environment, Pods got OOMKilled left and right. Lesson learned. For detailed instructions on configuring resource requests and limits for your containers, please refer to Assign Memory Resources to Containers and Pods (part of the official Kubernetes documentation).
2. Underestimating liveness and readiness probes
The pitfall: Deploying containers without explicitly defining how Kubernetes should check their health or readiness. This tends to happen because Kubernetes will consider a container "running" as long as the process inside hasn't exited. Without additional signals, Kubernetes assumes the workload is functioning, even if the application inside is unresponsive, initializing, or stuck.
Context:
Liveness, readiness, and startup probes are mechanisms Kubernetes uses to monitor container health and availability.
- Liveness probes determine if the application is still alive. If a liveness check fails, the container is restarted.
- Readiness probes control whether a container is ready to serve traffic. Until the readiness probe passes, the container is removed from Service endpoints.
- Startup probes help distinguish between long startup times and actual failures.
How to avoid it:
- Add a simple HTTP livenessProbe to check a health endpoint (for example /healthz) so Kubernetes can restart a hung container.
- Use a readinessProbe to ensure traffic doesn't reach your app until it's warmed up.
- Keep probes simple. Overly complex checks can create false alarms and unnecessary restarts.
My reality check: I once forgot a readiness probe for a web service that took a while to load. Users hit it prematurely, got weird timeouts, and I spent hours scratching my head. A 3-line readiness probe would have saved the day.
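For illustration, here is a minimal sketch of liveness and readiness probes on a single container; the image, port, paths, and timings are assumptions about a hypothetical app:
apiVersion: v1
kind: Pod
metadata:
  name: web-app                 # hypothetical name
spec:
  containers:
  - name: web
    image: my-web-app:1.2.3     # hypothetical image
    ports:
    - containerPort: 8080
    livenessProbe:
      httpGet:
        path: /healthz          # assumed health endpoint
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 10
    readinessProbe:
      httpGet:
        path: /ready            # assumed readiness endpoint
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5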
For comprehensive instructions on configuring liveness, readiness, and startup probes for containers, please refer to Configure Liveness, Readiness and Startup Probes in the official Kubernetes documentation.
3. "We'll just look at container logs" (famous last words)
The pitfall: Relying solely on container logs retrieved via kubectl logs. This often happens because the command is quick and convenient, and in many setups, logs appear accessible during development or early troubleshooting. However, kubectl logs only retrieves logs from currently running or recently terminated containers, and those logs are stored on the node's local disk. As soon as the container is deleted, evicted, or the node is restarted, the log files may be rotated out or permanently lost.
How to avoid it:
- Centralize logs using CNCF tools like Fluentd or Fluent Bit to aggregate output from all Pods.
- Adopt OpenTelemetry for a unified view of logs, metrics, and (if needed) traces. This lets you spot correlations between infrastructure events and app-level behavior.
- Pair logs with Prometheus metrics to track cluster-level data alongside application logs. If you need distributed tracing, consider CNCF projects like Jaeger.
My reality check: The first time I lost Pod logs to a quick restart, I realized how flimsy "kubectl logs" can be on its own. Since then, I've set up a proper pipeline for every cluster to avoid missing vital clues.
4. Treating dev and prod exactly the same
The pitfall: Deploying the same Kubernetes manifests with identical settings across development, staging, and production environments. This often occurs when teams aim for consistency and reuse, but overlook that environment-specific factors (such as traffic patterns, resource availability, scaling needs, or access control) can differ significantly. Without customization, configurations optimized for one environment may cause instability, poor performance, or security gaps in another.
How to avoid it:
- Use environment overlays or kustomize to maintain a shared base while customizing resource requests, replicas, or config for each environment (see the sketch after this list).
- Extract environment-specific configuration into ConfigMaps and/or Secrets. You can use a specialized tool such as Sealed Secrets to manage confidential data.
- Plan for scale in production. Your dev cluster can probably get away with minimal CPU/memory, but prod might need significantly more.
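As a small sketch of the overlay approach with kustomize, a production overlay might reuse a shared base and patch only what differs; the directory layout and file names here are assumptions:
# overlays/prod/kustomization.yaml (hypothetical path)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ../../base                          # shared manifests
patches:
- path: replicas-and-resources.yaml   # prod-only patch (e.g. more replicas, higher requests)
The dev overlay would point at the same base with its own patches, keeping the shared configuration in one place while letting each environment diverge where it must.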
My reality check: One time, I scaled up replicaCount from 2 to 10 in a tiny dev environment just to "test." I promptly ran out of resources and spent half a day cleaning up the aftermath. Oops.
5. Leaving old stuff floating around
The pitfall: Leaving unused or outdated resources (such as Deployments, Services, ConfigMaps, or PersistentVolumeClaims) running in the cluster. This often happens because Kubernetes does not automatically remove resources unless explicitly instructed, and there is no built-in mechanism to track ownership or expiration. Over time, these forgotten objects can accumulate, consuming cluster resources, increasing cloud costs, and creating operational confusion, especially when stale Services or LoadBalancers continue to route traffic.
How to avoid it:
- Label everything with a purpose or owner label. That way, you can easily query resources you no longer need.
- Regularly audit your cluster: run kubectl get all -n <namespace> to see what's actually running, and confirm it's all legit.
- Adopt Kubernetes' Garbage Collection: the Kubernetes docs show how to remove dependent objects automatically (see the sketch after this list).
- Leverage policy automation: Tools like Kyverno can automatically delete or block stale resources after a certain period, or enforce lifecycle policies so you don't have to remember every single cleanup step.
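To show how garbage collection ties a dependent object to its owner, here is a sketch of an ownerReferences entry; the names and UID are placeholders:
apiVersion: v1
kind: ConfigMap
metadata:
  name: test-config                              # hypothetical dependent object
  ownerReferences:
  - apiVersion: apps/v1
    kind: Deployment
    name: test-app                               # hypothetical owner
    uid: d9607e19-f88f-11e6-a518-42010a800195    # placeholder UID of the owner
When the owning Deployment is deleted, the garbage collector cleans up the dependent ConfigMap automatically.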
My reality check: After a hackathon, I forgot to tear down a "test-svc" pinned to an external load balancer. Three weeks later, I realized I'd been paying for that load balancer the entire time. Facepalm.
6. Diving too deep into networking too soon
The pitfall: Introducing advanced networking solutions (such as service meshes, custom CNI plugins, or multi-cluster communication) before fully understanding Kubernetes' native networking primitives. This commonly occurs when teams implement features like traffic routing, observability, or mTLS using external tools without first mastering how core Kubernetes networking works, including Pod-to-Pod communication, ClusterIP Services, DNS resolution, and basic ingress traffic handling. As a result, network-related issues become harder to troubleshoot, especially when overlays introduce additional abstractions and failure points.
How to avoid it:
- Start small: a Deployment, a Service, and a basic ingress controller such as one based on NGINX (e.g., Ingress-NGINX).
- Make sure you understand how traffic flows within the cluster, how service discovery works, and how DNS is configured.
- Only move to a full-blown mesh or advanced CNI features when you actually need them; complex networking adds overhead.
My reality check: I tried Istio on a small internal app once, then spent more time debugging Istio itself than the actual app. Eventually, I stepped back, removed Istio, and everything worked fine.
7. Going too light on security and RBAC
The pitfall: Deploying workloads with insecure configurations, such as running containers as the root user, using the :latest image tag, disabling security contexts, or assigning overly broad RBAC roles like cluster-admin. These practices persist because Kubernetes does not enforce strict security defaults out of the box, and the platform is designed to be flexible rather than opinionated. Without explicit security policies in place, clusters can remain exposed to risks like container escape, unauthorized privilege escalation, or accidental production changes due to unpinned images.
How to avoid it:
- Use RBAC to define roles and permissions within Kubernetes (see the sketch after this list). While RBAC is the default and most widely supported authorization mechanism, Kubernetes also allows the use of alternative authorizers. For more advanced or external policy needs, consider solutions like OPA Gatekeeper (based on Rego), Kyverno, or custom webhooks using policy languages such as CEL or Cedar.
- Pin images to specific versions (no more :latest!). This helps you know what's actually deployed.
- Look into Pod Security Admission (or other solutions like Kyverno) to enforce non-root containers, read-only filesystems, etc.
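As a sketch of least-privilege RBAC, here is a namespaced Role and RoleBinding that grant read-only access to a single ServiceAccount; the names and namespace are hypothetical:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: app-reader                    # hypothetical Role name
  namespace: team-a                   # hypothetical namespace
rules:
- apiGroups: [""]
  resources: ["pods", "configmaps"]
  verbs: ["get", "list", "watch"]     # read-only access, no cluster-admin
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: app-reader-binding
  namespace: team-a
subjects:
- kind: ServiceAccount
  name: app-sa                        # hypothetical ServiceAccount
  namespace: team-a
roleRef:
  kind: Role
  name: app-reader
  apiGroup: rbac.authorization.k8s.io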
My reality check: I never had a huge security breach, but I've heard plenty of cautionary tales. If you don't tighten things up, it's only a matter of time before something goes wrong.
Final thoughts
Kubernetes is amazing, but it's not psychic; it won't magically do the right thing if you don't tell it what you need. By keeping these pitfalls in mind, you'll avoid a lot of headaches and wasted time. Mistakes happen (trust me, I've made my share), but each one is a chance to learn more about how Kubernetes truly works under the hood. If you're curious to dive deeper, the official docs and the community Slack are excellent next steps. And of course, feel free to share your own horror stories or success tips, because at the end of the day, we're all in this cloud native adventure together.
Happy Shipping!
20 Oct 2025 3:30pm GMT
18 Oct 2025
Kubernetes Blog
Spotlight on Policy Working Group
(Note: The Policy Working Group has completed its mission and is no longer active. This article reflects its work, accomplishments, and insights into how a working group operates.)
In the complex world of Kubernetes, policies play a crucial role in managing and securing clusters. But have you ever wondered how these policies are developed, implemented, and standardized across the Kubernetes ecosystem? To answer that, let's take a look back at the work of the Policy Working Group.
The Policy Working Group was dedicated to a critical mission: providing an overall architecture that encompasses both current policy-related implementations and future policy proposals in Kubernetes. Their goal was both ambitious and essential: to develop a universal policy architecture that benefits developers and end-users alike.
Through collaborative methods, this working group strove to bring clarity and consistency to the often complex world of Kubernetes policies. By focusing on both existing implementations and future proposals, they ensured that the policy landscape in Kubernetes remains coherent and accessible as the technology evolves.
This blog post dives deeper into the work of the Policy Working Group, guided by insights from its former co-chairs:
Interviewed by Arujjwal Negi.
These co-chairs explained what the Policy Working Group was all about.
Introduction
Hello, thank you for the time! Let's start with some introductions, could you tell us a bit about yourself, your role, and how you got involved in Kubernetes?
Jim Bugwadia: My name is Jim Bugwadia, and I am a co-founder and the CEO at Nirmata which provides solutions that automate security and compliance for cloud-native workloads. At Nirmata, we have been working with Kubernetes since it started in 2014. We initially built a Kubernetes policy engine in our commercial platform and later donated it to CNCF as the Kyverno project. I joined the CNCF Kubernetes Policy Working Group to help build and standardize various aspects of policy management for Kubernetes and later became a co-chair.
Andy Suderman: My name is Andy Suderman and I am the CTO of Fairwinds, a managed Kubernetes-as-a-Service provider. I began working with Kubernetes in 2016 building a web conferencing platform. I am an author and/or maintainer of several Kubernetes-related open-source projects such as Goldilocks, Pluto, and Polaris. Polaris is a JSON-schema-based policy engine, which started Fairwinds' journey into the policy space and my involvement in the Policy Working Group.
Poonam Lamba: My name is Poonam Lamba, and I currently work as a Product Manager for Google Kubernetes Engine (GKE) at Google. My journey with Kubernetes began back in 2017 when I was building an SRE platform for a large enterprise, using a private cloud built on Kubernetes. Intrigued by its potential to revolutionize the way we deployed and managed applications at the time, I dove headfirst into learning everything I could about it. Since then, I've had the opportunity to build the policy and compliance products for GKE. I lead and contribute to GKE CIS benchmarks. I am involved with the Gatekeeper project as well as I have contributed to Policy-WG for over 2 years and served as a co-chair for the group.
Responses to the following questions represent an amalgamation of insights from the former co-chairs.
About Working Groups
One thing even I am not aware of is the difference between a working group and a SIG. Can you help us understand what a working group is and how it is different from a SIG?
Unlike SIGs, working groups are temporary and focused on tackling specific, cross-cutting issues or projects that may involve multiple SIGs. Their lifespan is defined, and they disband once they've achieved their objective. Generally, working groups don't own code or have long-term responsibility for managing a particular area of the Kubernetes project.
(To know more about SIGs, visit the list of Special Interest Groups)
You mentioned that working groups involve multiple SIGs. What SIGs was the Policy WG closely involved with, and how did you coordinate with them?
The group collaborated closely with Kubernetes SIG Auth throughout its existence, and more recently it also worked with SIG Security since that SIG's formation. Our collaboration occurred in a few ways. We provided periodic updates during the SIG meetings to keep them informed of our progress and activities. Additionally, we utilized other community forums to maintain open lines of communication and ensured our work aligned with the broader Kubernetes ecosystem. This collaborative approach helped the group stay coordinated with related efforts across the Kubernetes community.
Policy WG
Why was the Policy Working Group created?
To enable a broad set of use cases, we recognize that Kubernetes is powered by a highly declarative, fine-grained, and extensible configuration management system. We've observed that a Kubernetes configuration manifest may have different portions that are important to various stakeholders. For example, some parts may be crucial for developers, while others might be of particular interest to security teams or address operational concerns. Given this complexity, we believe that policies governing the usage of these intricate configurations are essential for success with Kubernetes.
Our Policy Working Group was created specifically to research the standardization of policy definitions and related artifacts. We saw a need to bring consistency and clarity to how policies are defined and implemented across the Kubernetes ecosystem, given the diverse requirements and stakeholders involved in Kubernetes deployments.
Can you give me an idea of the work you did in the group?
We worked on several Kubernetes policy-related projects. Our initiatives included:
- We worked on a Kubernetes Enhancement Proposal (KEP) for the Kubernetes Policy Reports API. This aims to standardize how policy reports are generated and consumed within the Kubernetes ecosystem.
- We conducted a CNCF survey to better understand policy usage in the Kubernetes space. This helped gauge the practices and needs across the community at the time.
- We wrote a paper that will guide users in achieving PCI-DSS compliance for containers. This is intended to help organizations meet important security standards in their Kubernetes environments.
- We also worked on a paper highlighting how shifting security down can benefit organizations. This focuses on the advantages of implementing security measures earlier in the development and deployment process.
Can you tell us what were the main objectives of the Policy Working Group and some of your key accomplishments?
The charter of the Policy WG was to help standardize policy management for Kubernetes and educate the community on best practices.
To accomplish this we updated the Kubernetes documentation (Policies | Kubernetes), produced several whitepapers (Kubernetes Policy Management, Kubernetes GRC), and created the Policy Reports API (API reference) which standardizes reporting across various tools. Several popular tools such as Falco, Trivy, Kyverno, kube-bench, and others support the Policy Report API. A major milestone for the Policy WG was promoting the Policy Reports API to a SIG-level API or finding it a stable home.
Beyond that, as ValidatingAdmissionPolicy and MutatingAdmissionPolicy approached GA in Kubernetes, a key goal of the WG was to guide and educate the community on the tradeoffs and appropriate usage patterns for these built-in API objects and other CNCF policy management solutions like OPA/Gatekeeper and Kyverno.
Challenges
What were some of the major challenges that the Policy Working Group worked on?
During our work in the Policy Working Group, we encountered several challenges:
- One of the main issues we faced was finding time to consistently contribute. Given that many of us have other professional commitments, it can be difficult to dedicate regular time to the working group's initiatives.
- Another challenge we experienced was related to our consensus-driven model. While this approach ensures that all voices are heard, it can sometimes lead to slower decision-making processes. We valued thorough discussion and agreement, but this can occasionally delay progress on our projects.
- We've also encountered occasional differences of opinion among group members. These situations require careful navigation to ensure that we maintain a collaborative and productive environment while addressing diverse viewpoints.
- Lastly, we've noticed that newcomers to the group may find it difficult to contribute effectively without consistent attendance at our meetings. The complex nature of our work often requires ongoing context, which can be challenging for those who aren't able to participate regularly.
Can you tell me more about those challenges? How did you discover each one? What has the impact been? What were some strategies you used to address them?
There are no easy answers, but having more contributors and maintainers greatly helps! Overall the CNCF community is great to work with and is very welcoming to beginners. So, if folks out there are hesitating to get involved, I highly encourage them to attend a WG or SIG meeting and just listen in.
It often takes a few meetings to fully understand the discussions, so don't feel discouraged if you don't grasp everything right away. We made a point to emphasize this and encouraged new members to review documentation as a starting point for getting involved.
Additionally, differences of opinion were valued and encouraged within the Policy-WG. We adhered to the CNCF core values and resolved disagreements by maintaining respect for one another. We also strove to timebox our decisions and assign clear responsibilities to keep things moving forward.
This is where our discussion about the Policy Working Group ends. The working group, and especially the people who took part in this article, hope this gave you some insights into the group's aims and workings. You can get more info about Working Groups here.
18 Oct 2025 12:00am GMT