20 Oct 2025

feedKubernetes Blog

7 Common Kubernetes Pitfalls (and How I Learned to Avoid Them)

It's no secret that Kubernetes can be both powerful and frustrating at times. When I first started dabbling with container orchestration, I made more than my fair share of mistakes enough to compile a whole list of pitfalls. In this post, I want to walk through seven big gotchas I've encountered (or seen others run into) and share some tips on how to avoid them. Whether you're just kicking the tires on Kubernetes or already managing production clusters, I hope these insights help you steer clear of a little extra stress.

1. Skipping resource requests and limits

The pitfall: Not specifying CPU and memory requirements in Pod specifications. This typically happens because Kubernetes does not require these fields, and workloads can often start and run without them-making the omission easy to overlook in early configurations or during rapid deployment cycles.

Context: In Kubernetes, resource requests and limits are critical for efficient cluster management. Resource requests ensure that the scheduler reserves the appropriate amount of CPU and memory for each pod, guaranteeing that it has the necessary resources to operate. Resource limits cap the amount of CPU and memory a pod can use, preventing any single pod from consuming excessive resources and potentially starving other pods. When resource requests and limits are not set:

  1. Resource Starvation: Pods may get insufficient resources, leading to degraded performance or failures. This is because Kubernetes schedules pods based on these requests. Without them, the scheduler might place too many pods on a single node, leading to resource contention and performance bottlenecks.
  2. Resource Hoarding: Conversely, without limits, a pod might consume more than its fair share of resources, impacting the performance and stability of other pods on the same node. This can lead to issues such as other pods getting evicted or killed by the Out-Of-Memory (OOM) killer due to lack of available memory.

How to avoid it:

My reality check: Early on, I never thought about memory limits. Things seemed fine on my local cluster. Then, on a larger environment, Pods got OOMKilled left and right. Lesson learned. For detailed instructions on configuring resource requests and limits for your containers, please refer to Assign Memory Resources to Containers and Pods (part of the official Kubernetes documentation).

2. Underestimating liveness and readiness probes

The pitfall: Deploying containers without explicitly defining how Kubernetes should check their health or readiness. This tends to happen because Kubernetes will consider a container "running" as long as the process inside hasn't exited. Without additional signals, Kubernetes assumes the workload is functioning-even if the application inside is unresponsive, initializing, or stuck.

Context:
Liveness, readiness, and startup probes are mechanisms Kubernetes uses to monitor container health and availability.

How to avoid it:

My reality check: I once forgot a readiness probe for a web service that took a while to load. Users hit it prematurely, got weird timeouts, and I spent hours scratching my head. A 3-line readiness probe would have saved the day.

For comprehensive instructions on configuring liveness, readiness, and startup probes for containers, please refer to Configure Liveness, Readiness and Startup Probes in the official Kubernetes documentation.

3. "We'll just look at container logs" (famous last words)

The pitfall: Relying solely on container logs retrieved via kubectl logs. This often happens because the command is quick and convenient, and in many setups, logs appear accessible during development or early troubleshooting. However, kubectl logs only retrieves logs from currently running or recently terminated containers, and those logs are stored on the node's local disk. As soon as the container is deleted, evicted, or the node is restarted, the log files may be rotated out or permanently lost.

How to avoid it:

My reality check: The first time I lost Pod logs to a quick restart, I realized how flimsy "kubectl logs" can be on its own. Since then, I've set up a proper pipeline for every cluster to avoid missing vital clues.

4. Treating dev and prod exactly the same

The pitfall: Deploying the same Kubernetes manifests with identical settings across development, staging, and production environments. This often occurs when teams aim for consistency and reuse, but overlook that environment-specific factors-such as traffic patterns, resource availability, scaling needs, or access control-can differ significantly. Without customization, configurations optimized for one environment may cause instability, poor performance, or security gaps in another.

How to avoid it:

My reality check: One time, I scaled up replicaCount from 2 to 10 in a tiny dev environment just to "test." I promptly ran out of resources and spent half a day cleaning up the aftermath. Oops.

5. Leaving old stuff floating around

The pitfall: Leaving unused or outdated resources-such as Deployments, Services, ConfigMaps, or PersistentVolumeClaims-running in the cluster. This often happens because Kubernetes does not automatically remove resources unless explicitly instructed, and there is no built-in mechanism to track ownership or expiration. Over time, these forgotten objects can accumulate, consuming cluster resources, increasing cloud costs, and creating operational confusion, especially when stale Services or LoadBalancers continue to route traffic.

How to avoid it:

My reality check: After a hackathon, I forgot to tear down a "test-svc" pinned to an external load balancer. Three weeks later, I realized I'd been paying for that load balancer the entire time. Facepalm.

6. Diving too deep into networking too soon

The pitfall: Introducing advanced networking solutions-such as service meshes, custom CNI plugins, or multi-cluster communication-before fully understanding Kubernetes' native networking primitives. This commonly occurs when teams implement features like traffic routing, observability, or mTLS using external tools without first mastering how core Kubernetes networking works: including Pod-to-Pod communication, ClusterIP Services, DNS resolution, and basic ingress traffic handling. As a result, network-related issues become harder to troubleshoot, especially when overlays introduce additional abstractions and failure points.

How to avoid it:

My reality check: I tried Istio on a small internal app once, then spent more time debugging Istio itself than the actual app. Eventually, I stepped back, removed Istio, and everything worked fine.

7. Going too light on security and RBAC

The pitfall: Deploying workloads with insecure configurations, such as running containers as the root user, using the latest image tag, disabling security contexts, or assigning overly broad RBAC roles like cluster-admin. These practices persist because Kubernetes does not enforce strict security defaults out of the box, and the platform is designed to be flexible rather than opinionated. Without explicit security policies in place, clusters can remain exposed to risks like container escape, unauthorized privilege escalation, or accidental production changes due to unpinned images.

How to avoid it:

My reality check: I never had a huge security breach, but I've heard plenty of cautionary tales. If you don't tighten things up, it's only a matter of time before something goes wrong.

Final thoughts

Kubernetes is amazing, but it's not psychic, it won't magically do the right thing if you don't tell it what you need. By keeping these pitfalls in mind, you'll avoid a lot of headaches and wasted time. Mistakes happen (trust me, I've made my share), but each one is a chance to learn more about how Kubernetes truly works under the hood. If you're curious to dive deeper, the official docs and the community Slack are excellent next steps. And of course, feel free to share your own horror stories or success tips, because at the end of the day, we're all in this cloud native adventure together.

Happy Shipping!

20 Oct 2025 3:30pm GMT

18 Oct 2025

feedKubernetes Blog

Spotlight on Policy Working Group

(Note: The Policy Working Group has completed its mission and is no longer active. This article reflects its work, accomplishments, and insights into how a working group operates.)

In the complex world of Kubernetes, policies play a crucial role in managing and securing clusters. But have you ever wondered how these policies are developed, implemented, and standardized across the Kubernetes ecosystem? To answer that, let's take a look back at the work of the Policy Working Group.

The Policy Working Group was dedicated to a critical mission: providing an overall architecture that encompasses both current policy-related implementations and future policy proposals in Kubernetes. Their goal was both ambitious and essential: to develop a universal policy architecture that benefits developers and end-users alike.

Through collaborative methods, this working group strove to bring clarity and consistency to the often complex world of Kubernetes policies. By focusing on both existing implementations and future proposals, they ensured that the policy landscape in Kubernetes remains coherent and accessible as the technology evolves.

This blog post dives deeper into the work of the Policy Working Group, guided by insights from its former co-chairs:

Interviewed by Arujjwal Negi.

These co-chairs explained what the Policy Working Group was all about.

Introduction

Hello, thank you for the time! Let's start with some introductions, could you tell us a bit about yourself, your role, and how you got involved in Kubernetes?

Jim Bugwadia: My name is Jim Bugwadia, and I am a co-founder and the CEO at Nirmata which provides solutions that automate security and compliance for cloud-native workloads. At Nirmata, we have been working with Kubernetes since it started in 2014. We initially built a Kubernetes policy engine in our commercial platform and later donated it to CNCF as the Kyverno project. I joined the CNCF Kubernetes Policy Working Group to help build and standardize various aspects of policy management for Kubernetes and later became a co-chair.

Andy Suderman: My name is Andy Suderman and I am the CTO of Fairwinds, a managed Kubernetes-as-a-Service provider. I began working with Kubernetes in 2016 building a web conferencing platform. I am an author and/or maintainer of several Kubernetes-related open-source projects such as Goldilocks, Pluto, and Polaris. Polaris is a JSON-schema-based policy engine, which started Fairwinds' journey into the policy space and my involvement in the Policy Working Group.

Poonam Lamba: My name is Poonam Lamba, and I currently work as a Product Manager for Google Kubernetes Engine (GKE) at Google. My journey with Kubernetes began back in 2017 when I was building an SRE platform for a large enterprise, using a private cloud built on Kubernetes. Intrigued by its potential to revolutionize the way we deployed and managed applications at the time, I dove headfirst into learning everything I could about it. Since then, I've had the opportunity to build the policy and compliance products for GKE. I lead and contribute to GKE CIS benchmarks. I am involved with the Gatekeeper project as well as I have contributed to Policy-WG for over 2 years and served as a co-chair for the group.

Responses to the following questions represent an amalgamation of insights from the former co-chairs.

About Working Groups

One thing even I am not aware of is the difference between a working group and a SIG. Can you help us understand what a working group is and how it is different from a SIG?

Unlike SIGs, working groups are temporary and focused on tackling specific, cross-cutting issues or projects that may involve multiple SIGs. Their lifespan is defined, and they disband once they've achieved their objective. Generally, working groups don't own code or have long-term responsibility for managing a particular area of the Kubernetes project.

(To know more about SIGs, visit the list of Special Interest Groups)

You mentioned that Working Groups involve multiple SIGS. What SIGS was the Policy WG closely involved with, and how did you coordinate with them?

The group collaborated closely with Kubernetes SIG Auth throughout our existence, and more recently, the group also worked with SIG Security since its formation. Our collaboration occurred in a few ways. We provided periodic updates during the SIG meetings to keep them informed of our progress and activities. Additionally, we utilize other community forums to maintain open lines of communication and ensured our work aligned with the broader Kubernetes ecosystem. This collaborative approach helped the group stay coordinated with related efforts across the Kubernetes community.

Policy WG

Why was the Policy Working Group created?

To enable a broad set of use cases, we recognize that Kubernetes is powered by a highly declarative, fine-grained, and extensible configuration management system. We've observed that a Kubernetes configuration manifest may have different portions that are important to various stakeholders. For example, some parts may be crucial for developers, while others might be of particular interest to security teams or address operational concerns. Given this complexity, we believe that policies governing the usage of these intricate configurations are essential for success with Kubernetes.

Our Policy Working Group was created specifically to research the standardization of policy definitions and related artifacts. We saw a need to bring consistency and clarity to how policies are defined and implemented across the Kubernetes ecosystem, given the diverse requirements and stakeholders involved in Kubernetes deployments.

Can you give me an idea of the work you did in the group?

We worked on several Kubernetes policy-related projects. Our initiatives included:

Can you tell us what were the main objectives of the Policy Working Group and some of your key accomplishments?

The charter of the Policy WG was to help standardize policy management for Kubernetes and educate the community on best practices.

To accomplish this we updated the Kubernetes documentation (Policies | Kubernetes), produced several whitepapers (Kubernetes Policy Management, Kubernetes GRC), and created the Policy Reports API (API reference) which standardizes reporting across various tools. Several popular tools such as Falco, Trivy, Kyverno, kube-bench, and others support the Policy Report API. A major milestone for the Policy WG was promoting the Policy Reports API to a SIG-level API or finding it a stable home.

Beyond that, as ValidatingAdmissionPolicy and MutatingAdmissionPolicy approached GA in Kubernetes, a key goal of the WG was to guide and educate the community on the tradeoffs and appropriate usage patterns for these built-in API objects and other CNCF policy management solutions like OPA/Gatekeeper and Kyverno.

Challenges

What were some of the major challenges that the Policy Working Group worked on?

During our work in the Policy Working Group, we encountered several challenges:

Can you tell me more about those challenges? How did you discover each one? What has the impact been? What were some strategies you used to address them?

There are no easy answers, but having more contributors and maintainers greatly helps! Overall the CNCF community is great to work with and is very welcoming to beginners. So, if folks out there are hesitating to get involved, I highly encourage them to attend a WG or SIG meeting and just listen in.

It often takes a few meetings to fully understand the discussions, so don't feel discouraged if you don't grasp everything right away. We made a point to emphasize this and encouraged new members to review documentation as a starting point for getting involved.

Additionally, differences of opinion were valued and encouraged within the Policy-WG. We adhered to the CNCF core values and resolve disagreements by maintaining respect for one another. We also strove to timebox our decisions and assign clear responsibilities to keep things moving forward.


This is where our discussion about the Policy Working Group ends. The working group, and especially the people who took part in this article, hope this gave you some insights into the group's aims and workings. You can get more info about Working Groups here.

18 Oct 2025 12:00am GMT

06 Oct 2025

feedKubernetes Blog

Introducing Headlamp Plugin for Karpenter - Scaling and Visibility

Headlamp is an open‑source, extensible Kubernetes SIG UI project designed to let you explore, manage, and debug cluster resources.

Karpenter is a Kubernetes Autoscaling SIG node provisioning project that helps clusters scale quickly and efficiently. It launches new nodes in seconds, selects appropriate instance types for workloads, and manages the full node lifecycle, including scale-down.

The new Headlamp Karpenter Plugin adds real-time visibility into Karpenter's activity directly from the Headlamp UI. It shows how Karpenter resources relate to Kubernetes objects, displays live metrics, and surfaces scaling events as they happen. You can inspect pending pods during provisioning, review scaling decisions, and edit Karpenter-managed resources with built-in validation. The Karpenter plugin was made as part of a LFX mentor project.

The Karpenter plugin for Headlamp aims to make it easier for Kubernetes users and operators to understand, debug, and fine-tune autoscaling behavior in their clusters. Now we will give a brief tour of the Headlamp plugin.

Map view of Karpenter Resources and how they relate to Kubernetes resources

Easily see how Karpenter Resources like NodeClasses, NodePool and NodeClaims connect with core Kubernetes resources like Pods, Nodes etc.

Map view showing relationships between resources

Visualization of Karpenter Metrics

Get instant insights of Resource Usage v/s Limits, Allowed disruptions, Pending Pods, Provisioning Latency and many more .

NodePool default metrics shown with controls to see different frequencies

NodeClaim default metrics shown with controls to see different frequencies

Scaling decisions

Shows which instances are being provisioned for your workloads and understand the reason behind why Karpenter made those choices. Helpful while debugging.

Pod Placement Decisions data including reason, from, pod, message, and age

Node decision data including Type, Reason, Node, From, Message

Config editor with validation support

Make live edits to Karpenter configurations. The editor includes diff previews and resource validation for safer adjustments.
Config editor with validation support

Real time view of Karpenter resources

View and track Karpenter specific resources in real time such as "NodeClaims" as your cluster scales up and down.

Node claims data including Name, Status, Instance Type, CPU, Zone, Age, and Actions

Node Pools data including Name, NodeClass, CPU, Memory, Nodes, Status, Age, Actions

EC2 Node Classes data including Name, Cluster, Instance Profile, Status, IAM Role, Age, and Actions

Dashboard for Pending Pods

View all pending pods with unmet scheduling requirements/Failed Scheduling highlighting why they couldn't be scheduled.

Pending Pods data including Name, Namespace, Type, Reason, From, and Message

Karpenter Providers

This plugin should work with most Karpenter providers, but has only so far been tested on the ones listed in the table. Additionally, each provider gives some extra information, and the ones in the table below are displayed by the plugin.

Provider Name Tested Extra provider specific info supported
AWS
Azure
AlibabaCloud
Bizfly Cloud
Cluster API
GCP
Proxmox
Oracle Cloud Infrastructure (OCI)

Please submit an issue if you test one of the untested providers or if you want support for this provider (PRs also gladly accepted).

How to use

Please see the plugins/karpenter/README.md for instructions on how to use.

Feedback and Questions

Please submit an issue if you use Karpenter and have any other ideas or feedback. Or come to the Kubernetes slack headlamp channel for a chat.

06 Oct 2025 12:00am GMT