I started running Kubernetes in production in early 2022. At the time, I'd done the tutorials, understood the concepts, and felt reasonably prepared. Then production happened.
Three years later, here are the things I actually learned from running it — not the things you read in the official documentation, but the things you discover at 2am when something stops working in a way you didn't know was possible.
Resource Limits Are Not Optional
The first major incident I had was a runaway process in one container that consumed all available CPU on a node, causing unrelated pods to start failing. Kubernetes applies no resource limits by default: a container with none set can use as much CPU and memory as the node has. If you don't set limits on every container, one misbehaving workload can destabilise everything running on that node.
Setting limits feels pedantic when you're moving fast. It isn't pedantic; it's resource isolation. Set limits on everything, and use LimitRange objects to enforce defaults at the namespace level so newly deployed workloads can't accidentally run unlimited.
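As a rough sketch of what that namespace-level safety net looks like (the name, namespace, and values here are illustrative, not a recommendation), a LimitRange can backfill limits and requests for any container that omits them:

```yaml
# Hypothetical LimitRange: containers deployed to this namespace without
# an explicit resources block get these defaults instead of running unlimited.
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults   # illustrative name
  namespace: team-apps       # illustrative namespace
spec:
  limits:
    - type: Container
      default:               # becomes the container's limits if unset
        cpu: 500m
        memory: 256Mi
      defaultRequest:        # becomes the container's requests if unset
        cpu: 100m
        memory: 128Mi
```

With this in place, a deployment that forgets its resources block gets capped at sane defaults rather than being unbounded.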
Networking Is Where Things Get Weird
DNS within a Kubernetes cluster seems simple until it doesn't work. I've spent hours debugging pods that couldn't resolve service names, external name resolution failing from inside the cluster, and network policies blocking traffic I expected to be allowed. The Kubernetes networking model is powerful and flexible, which is another way of saying it has a lot of configuration surface area that can be wrong.
What saved me: learning to use kubectl exec to drop into a pod and run nslookup and curl directly. Being able to test connectivity from inside the cluster, not just from outside it, cuts debugging time dramatically.
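For illustration, these are the kinds of in-cluster checks I mean (pod and service names are placeholders):

```sh
# Test DNS resolution and HTTP connectivity from inside the cluster,
# using a pod you already have running (names are placeholders).
kubectl exec -it my-app-pod -- nslookup my-service.my-namespace.svc.cluster.local
kubectl exec -it my-app-pod -- curl -sv http://my-service.my-namespace:8080/

# If the app image ships no debug tools, attach an ephemeral debug
# container instead (supported on recent clusters).
kubectl debug -it my-app-pod --image=busybox:1.36 -- nslookup my-service
```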
RBAC Will Bite You When You Least Expect It
RBAC permissions in Kubernetes are additive and not immediately obvious from any single place. I've had deployments fail in production because a service account was missing a specific permission that hadn't been needed in the staging environment (different namespace, different setup). Auditing what your service accounts can actually do — not what you think they can do — is worth doing periodically.
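One way to run that audit, sketched here with placeholder namespace and service account names, is kubectl's built-in impersonation check:

```sh
# List everything a service account is actually allowed to do in a namespace.
kubectl auth can-i --list \
  --as=system:serviceaccount:staging:app-deployer -n staging

# Spot-check the specific permission that differs between environments.
kubectl auth can-i create deployments \
  --as=system:serviceaccount:production:app-deployer -n production
```

Running the same check against staging and production side by side is how you catch the "worked in one environment, missing in the other" class of failure before a deploy does.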
The Monitoring That Actually Matters
The standard advice is "set up Prometheus and Grafana." That's correct but incomplete. The signals that actually matter, and that nobody tells you to watch, are node pressure events (disk and memory pressure, which precede pod evictions), pending pods (pods that can't be scheduled because no node has room), and OOMKilled events (containers killed for exceeding their memory limit, which is usually invisible in the logs unless you're looking for it).
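A few quick, admittedly crude, ways to surface those signals with kubectl alone while your real alerting catches up (the last command assumes jq is installed):

```sh
# Node conditions: MemoryPressure / DiskPressure flip to True before evictions.
kubectl describe nodes | grep -E 'MemoryPressure|DiskPressure'

# Pods the scheduler can't place anywhere.
kubectl get pods --all-namespaces --field-selector=status.phase=Pending

# Containers recently killed for exceeding their memory limit.
kubectl get pods --all-namespaces -o json | jq -r \
  '.items[]
   | select(.status.containerStatuses[]?.lastState.terminated.reason == "OOMKilled")
   | "\(.metadata.namespace)/\(.metadata.name)"'
```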
Upgrades Are Harder Than Setup
The Kubernetes upgrade process for managed clusters (EKS, GKE, AKS) has gotten better, but it's still not "click a button, wait five minutes." API deprecations across minor versions catch you by surprise. The node upgrade process requires draining pods and managing PodDisruptionBudgets carefully. I now plan a half-day for every minor version upgrade and have never finished significantly faster than that.
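On the PodDisruptionBudget point, here is a minimal sketch, with a hypothetical app label, of a PDB that keeps a node drain from taking down too many replicas at once:

```yaml
# Hypothetical PDB: node drains during an upgrade pause rather than
# evicting below two available replicas of this workload.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb              # illustrative name
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api               # illustrative label
```

The flip side is that a PDB that can never be satisfied (say, minAvailable equal to the replica count) will stall the drain indefinitely, which is exactly the kind of thing that eats your half day.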
The Honest Summary
Kubernetes is worth learning and running when you actually need what it offers. But go in expecting operational complexity, invest in your monitoring from day one, and treat resource limits and RBAC as core configuration rather than nice-to-haves. The platform rewards careful operation.