Everything Was Green, But Production Was Slow
This wasn’t an outage.
Nothing crashed. No alerts fired.
Users reported slow requests and timeouts.
Dashboards were green.
CPU low
Memory stable
Pods running
No packet loss
No restarts
At first glance, there was nothing to fix.
That’s exactly why it was dangerous.
What looked healthy (and fooled us)
CPU usage looked fine
```
kubectl top pod -n prod

NAME    CPU(cores)   MEMORY(bytes)
api-1   120m         300Mi
api-2   130m         310Mi
```

CPU wasn't close to limits.
But CPU usage only tells you how much time the process ran, not how long it waited.
In Kubernetes, CPU limits are enforced using cgroups.
A container can be paused even when the node has idle CPU.
Low CPU can mean throttling, not health.
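If you suspect this, the cgroup itself will tell you. Here is a minimal sketch that computes the share of time a container spent throttled from cgroup v2 `cpu.stat` counters. The numbers below are made up for illustration; inside a pod you would read the real `/sys/fs/cgroup/cpu.stat`:

```python
# Sample cpu.stat contents (invented numbers for illustration).
# In a container: open("/sys/fs/cgroup/cpu.stat").read()
SAMPLE_CPU_STAT = """\
usage_usec 4800000
user_usec 3100000
system_usec 1700000
nr_periods 1200
nr_throttled 840
throttled_usec 9600000
"""

def throttle_ratio(cpu_stat_text):
    # Parse "key value" lines into a dict of counters.
    stats = dict(line.split() for line in cpu_stat_text.splitlines())
    throttled = int(stats["throttled_usec"])  # time paused by the CPU limit
    used = int(stats["usage_usec"])           # time actually running
    return throttled / (throttled + used)

ratio = throttle_ratio(SAMPLE_CPU_STAT)
print(f"time spent throttled: {ratio:.0%}")
```

A container like this shows low CPU usage on every dashboard while spending most of its life paused, which is exactly the signal the CPU graph hides.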
Ping showed no packet loss
```
ping service.prod.svc.cluster.local

64 bytes from 10.96.0.12: time=0.5 ms
```

Ping only proves:
tiny packets make it through
the ICMP path works
Ping does not exercise:
TCP congestion control
queueing
TLS handshakes
retransmissions
Ping worked. Apps were still slow.
The first real signal: latency
From inside a pod:
```
time curl http://service.prod.svc.cluster.local

real    0m0.87s
```

No errors.
Just waiting.
Latency was being added before the application logic even ran.
DNS was slow, not failing
```
dig kubernetes.default.svc.cluster.local

;; Query time: 180 ms
```

DNS resolved successfully, but 180 ms per lookup adds up fast.
Applications often:
resolve DNS on startup
resolve again on reconnect
resolve on retries
Slow DNS becomes application latency.
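A back-of-the-envelope calculation shows how fast it compounds. With the Kubernetes default of `ndots:5`, any name with fewer than five dots is tried against every search domain before being queried as-is. The search path below is typical but illustrative:

```python
# Typical pod search path (illustrative; see /etc/resolv.conf in a pod).
search_domains = [
    "prod.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
]

# With ndots:5, "api.example.com" (2 dots) is first tried with every
# search suffix, then finally as the absolute name.
queries_per_lookup = len(search_domains) + 1

per_query_ms = 180  # the dig latency we measured

print(queries_per_lookup * per_query_ms, "ms per lookup")  # 720 ms
```

One "slow but successful" DNS server turns a single external lookup into nearly three-quarters of a second, before the application sends a single byte.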
CoreDNS was busy
```
kubectl top pod -n kube-system

coredns-abc   CPU: 180m   MEM: 140Mi
```

CoreDNS wasn't down.
It was overloaded.
That means:
queries queue
responses slow
apps block
No alerts fired because nothing was “broken”.
TCP told the truth
```
ss -i dst <service-ip>

cwnd:10 rtt:120/30 rto:400
```

What this means:
RTT was high and unstable
TCP reduced how much it sent
the app spent more time waiting
Even without packet loss.
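Those two numbers alone cap throughput. TCP keeps at most `cwnd` segments in flight per round trip, so a rough upper bound (assuming a typical 1460-byte MSS) is:

```python
# Throughput ceiling implied by the ss output above:
# at most cwnd segments can be in flight per round trip.
cwnd_segments = 10
mss_bytes = 1460   # typical Ethernet MSS (assumption)
rtt_s = 0.120      # the RTT ss reported

max_throughput = cwnd_segments * mss_bytes / rtt_s  # bytes/sec

print(f"~{max_throughput / 1024:.0f} KiB/s ceiling per connection")
```

Roughly 119 KiB/s per connection, no matter how fast the underlying network is. Nothing is dropped; the connection is simply pacing itself around a high, jittery RTT.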
Retransmissions existed
```
ss -s

TCP:
  retransmits: 12489
```

ICMP showed no loss.
TCP showed retries.
Packets were delayed or queued, then resent quietly.
Kubernetes networking added invisible cost
This cluster used VXLAN.
VXLAN wraps packets and adds extra hops.
Actual packet path (simplified)
```
Pod
 ↓
veth
 ↓
Linux bridge
 ↓
VXLAN encapsulation
 ↓
Node NIC
 ↓
Network
 ↓
Node NIC
 ↓
VXLAN decapsulation
 ↓
Linux bridge
 ↓
veth
 ↓
Pod
```

Each hop adds:
buffering
queueing
scheduling delay
Nothing fails.
Everything waits.
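One of those costs is concrete and measurable: VXLAN encapsulation eats into the MTU. The usual accounting (the pod's inner Ethernet frame carried as payload, plus VXLAN, UDP, and outer IPv4 headers) looks like this:

```python
# Per-packet VXLAN overhead, relative to the node's IP MTU.
inner_ethernet = 14   # the pod's L2 header, carried as payload
vxlan_header = 8
outer_udp = 8
outer_ipv4 = 20

overhead = inner_ethernet + vxlan_header + outer_udp + outer_ipv4

node_mtu = 1500
pod_mtu = node_mtu - overhead

print(f"overhead: {overhead} bytes, pod MTU should be {pod_mtu}")
```

If the pod interface is left at 1500, full-size packets either fragment or get dropped on the overlay, which shows up as exactly the kind of quiet retransmission we saw earlier.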
Readiness probes amplified the problem
```
kubectl describe pod api-1

Readiness probe failed: context deadline exceeded
```

The app wasn't dead; it was slow.
Readiness probes:
have short timeouts
remove pods from traffic immediately
Under load:
app slows
readiness fails
pod removed from Service
remaining pods get more traffic
they slow down too
A feedback loop.
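The loop is easy to see in a toy model. Every number below is invented; the shape of the curve is the point:

```python
# Toy model: per-pod latency grows with per-pod load, and any pod
# whose latency exceeds the probe timeout is pulled from the Service,
# shifting its traffic onto the rest. Numbers are made up.
PROBE_TIMEOUT_MS = 1000
TOTAL_RPS = 900

def latency_ms(rps_per_pod):
    # Assumed linear load->latency relationship, purely illustrative.
    return 200 + rps_per_pod * 6

def simulate(pods):
    history = []
    while pods:
        per_pod = TOTAL_RPS / pods
        lat = round(latency_ms(per_pod))
        history.append((pods, lat))
        if lat <= PROBE_TIMEOUT_MS:
            break      # stable: every remaining pod passes its probe
        pods -= 1      # slowest pod fails readiness and is removed
    return history

for ready, lat in simulate(6):
    print(f"{ready} ready pod(s) -> ~{lat} ms per request")
```

Starting just over the probe timeout, every removal makes the survivors slower, until no pod is ready at all. The probes never lie, yet they turn "slightly slow" into "down".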
Why dashboards stayed green
We were watching:
CPU
memory
pod status
packet loss
We were not watching:
DNS latency
TCP RTT variance
queue depth
throttling time
readiness failures over time
The system wasn’t broken.
It was waiting.
Fixes that actually helped
Reduced DNS amplification (ndots:2)
Gave CoreDNS more CPU
Tuned readiness probe timeouts
Verified MTU alignment for VXLAN
Focused on latency, not error rate
No blind scaling.
No random restarts.
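For reference, here is roughly what the ndots and probe changes look like at the pod-spec level. Names, image, endpoint, and timeout values are all illustrative, not the exact values we shipped:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api            # illustrative name
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"     # stop search-domain expansion for external names
  containers:
    - name: api
      image: api:latest        # placeholder image
      readinessProbe:
        httpGet:
          path: /healthz       # assumed health endpoint
          port: 8080
        timeoutSeconds: 5      # was too short for a slow-but-alive app
        periodSeconds: 10
        failureThreshold: 3    # require sustained failure before removal
```

The theme of all four fixes is the same: stop treating "slow" as "broken", and stop adding work (extra queries, extra evictions) to a system that is already waiting.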
The real lesson
Production systems don’t always fail loudly.
More often, they fail by:
waiting
queueing
backing off
timing out
Green dashboards only tell you nothing crashed.
They don’t tell you the system is fast.
Final takeaway
If users say the system is slow and everything looks healthy:
measure latency
trust TCP over ping
assume DNS matters
remember Kubernetes adds hops
remember “Running” ≠ “Ready”
Most incidents start quietly.
You only catch them if you know where to look.