Everything Was Green, But Production Was Slow
This wasn’t an outage.
Nothing crashed. No alerts fired.
Users reported slow requests and timeouts.
Dashboards were green.
CPU low
Memory stable
Pods running
No packet loss
No restarts
At first glance, there was nothing to fix.
That’s exactly why it was dangerous.
What looked healthy (and fooled us)
CPU usage looked fine
```
kubectl top pod -n prod

NAME    CPU(cores)   MEMORY(bytes)
api-1   120m         300Mi
api-2   130m         310Mi
```

CPU wasn't close to limits.
But CPU usage only tells you how much time the process ran, not how long it waited.
In Kubernetes, CPU limits are enforced using cgroups.
A container can be paused even when the node has idle CPU.
Low CPU can mean throttling, not health.
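If you suspect this, the cgroup itself will tell you. Here is a minimal sketch that computes the share of time a container spent throttled from cgroup v2 `cpu.stat` counters. The numbers below are made up for illustration; inside a pod you would read the real `/sys/fs/cgroup/cpu.stat`:

```python
# Sample cpu.stat contents (invented numbers for illustration).
# In a container: open("/sys/fs/cgroup/cpu.stat").read()
SAMPLE_CPU_STAT = """\
usage_usec 4800000
user_usec 3100000
system_usec 1700000
nr_periods 1200
nr_throttled 840
throttled_usec 9600000
"""

def throttle_ratio(cpu_stat_text):
    # Parse "key value" lines into a dict of counters.
    stats = dict(line.split() for line in cpu_stat_text.splitlines())
    throttled = int(stats["throttled_usec"])  # time paused by the CPU limit
    used = int(stats["usage_usec"])           # time actually running
    return throttled / (throttled + used)

ratio = throttle_ratio(SAMPLE_CPU_STAT)
print(f"time spent throttled: {ratio:.0%}")
```

A container like this shows low CPU usage on every dashboard while spending most of its life paused, which is exactly the signal the CPU graph hides.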
Ping showed no packet loss
```
ping service.prod.svc.cluster.local

64 bytes from 10.96.0.12: time=0.5 ms
```

Ping only proves:
tiny packets make it through
the ICMP path works
Ping does not exercise:
TCP congestion control
queueing
TLS handshakes
retransmissions
Ping worked. Apps were still slow.
The first real signal: latency
From inside a pod:
```
time curl http://service.prod.svc.cluster.local

real    0m0.87s
```

No errors.
Just waiting.
Latency was being added before the application logic even ran.
DNS was slow, not failing
```
dig kubernetes.default.svc.cluster.local

;; Query time: 180 ms
```

DNS resolved successfully, but 180 ms per lookup adds up fast.
Applications often:
resolve DNS on startup
resolve again on reconnect
resolve on retries
Slow DNS becomes application latency.
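A back-of-the-envelope calculation shows how fast it compounds. With the Kubernetes default of `ndots:5`, any name with fewer than five dots is tried against every search domain before being queried as-is. The search path below is typical but illustrative:

```python
# Typical pod search path (illustrative; see /etc/resolv.conf in a pod).
search_domains = [
    "prod.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
]

# With ndots:5, "api.example.com" (2 dots) is first tried with every
# search suffix, then finally as the absolute name.
queries_per_lookup = len(search_domains) + 1

per_query_ms = 180  # the dig latency we measured

print(queries_per_lookup * per_query_ms, "ms per lookup")  # 720 ms
```

One "slow but successful" DNS server turns a single external lookup into nearly three-quarters of a second, before the application sends a single byte.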
CoreDNS was busy
```
kubectl top pod -n kube-system

coredns-abc   CPU: 180m   MEM: 140Mi
```

CoreDNS wasn't down.
It was overloaded.
That means:
queries queue
responses slow
apps block
No alerts fired because nothing was “broken”.
TCP told the truth
```
ss -i dst <service-ip>

cwnd:10 rtt:120/30 rto:400
```

What this means:
RTT was high and unstable
TCP reduced how much it sent
the app spent more time waiting
Even without packet loss.
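Those two numbers alone cap throughput. TCP keeps at most `cwnd` segments in flight per round trip, so a rough upper bound (assuming a typical 1460-byte MSS) is:

```python
# Throughput ceiling implied by the ss output above:
# at most cwnd segments can be in flight per round trip.
cwnd_segments = 10
mss_bytes = 1460   # typical Ethernet MSS (assumption)
rtt_s = 0.120      # the RTT ss reported

max_throughput = cwnd_segments * mss_bytes / rtt_s  # bytes/sec

print(f"~{max_throughput / 1024:.0f} KiB/s ceiling per connection")
```

Roughly 119 KiB/s per connection, no matter how fast the underlying network is. Nothing is dropped; the connection is simply pacing itself around a high, jittery RTT.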
Retransmissions existed
```
ss -s

TCP:
  retransmits: 12489
```

ICMP showed no loss.
TCP showed retries.
Packets were delayed or queued, then resent quietly.
Kubernetes networking added invisible cost
This cluster used VXLAN.
VXLAN wraps packets and adds extra hops.
Actual packet path (simplified)
```
Pod
 ↓
veth
 ↓
Linux bridge
 ↓
VXLAN encapsulation
 ↓
Node NIC
 ↓
Network
 ↓
Node NIC
 ↓
VXLAN decapsulation
 ↓
Linux bridge
 ↓
veth
 ↓
Pod
```

Each hop adds:
buffering
queueing
scheduling delay
Nothing fails.
Everything waits.
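One of those costs is concrete and measurable: VXLAN encapsulation eats into the MTU. The usual accounting (the pod's inner Ethernet frame carried as payload, plus VXLAN, UDP, and outer IPv4 headers) looks like this:

```python
# Per-packet VXLAN overhead, relative to the node's IP MTU.
inner_ethernet = 14   # the pod's L2 header, carried as payload
vxlan_header = 8
outer_udp = 8
outer_ipv4 = 20

overhead = inner_ethernet + vxlan_header + outer_udp + outer_ipv4

node_mtu = 1500
pod_mtu = node_mtu - overhead

print(f"overhead: {overhead} bytes, pod MTU should be {pod_mtu}")
```

If the pod interface is left at 1500, full-size packets either fragment or get dropped on the overlay, which shows up as exactly the kind of quiet retransmission we saw earlier.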
Readiness probes amplified the problem
```
kubectl describe pod api-1

Readiness probe failed: context deadline exceeded
```

The app wasn't dead; it was slow.
Readiness probes:
have short timeouts
remove pods from traffic immediately
Under load:
app slows
readiness fails
pod removed from Service
remaining pods get more traffic
they slow down too
A feedback loop.
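The loop is easy to see in a toy model. Every number below is invented; the shape of the curve is the point:

```python
# Toy model: per-pod latency grows with per-pod load, and any pod
# whose latency exceeds the probe timeout is pulled from the Service,
# shifting its traffic onto the rest. Numbers are made up.
PROBE_TIMEOUT_MS = 1000
TOTAL_RPS = 900

def latency_ms(rps_per_pod):
    # Assumed linear load->latency relationship, purely illustrative.
    return 200 + rps_per_pod * 6

def simulate(pods):
    history = []
    while pods:
        per_pod = TOTAL_RPS / pods
        lat = round(latency_ms(per_pod))
        history.append((pods, lat))
        if lat <= PROBE_TIMEOUT_MS:
            break      # stable: every remaining pod passes its probe
        pods -= 1      # slowest pod fails readiness and is removed
    return history

for ready, lat in simulate(6):
    print(f"{ready} ready pod(s) -> ~{lat} ms per request")
```

Starting just over the probe timeout, every removal makes the survivors slower, until no pod is ready at all. The probes never lie, yet they turn "slightly slow" into "down".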
Why dashboards stayed green
We were watching:
CPU
memory
pod status
packet loss
We were not watching:
DNS latency
TCP RTT variance
queue depth
throttling time
readiness failures over time
The system wasn’t broken.
It was waiting.
Fixes that actually helped
Reduced DNS amplification (ndots:2)
Gave CoreDNS more CPU
Tuned readiness probe timeouts
Verified MTU alignment for VXLAN
Focused on latency, not error rate
No blind scaling.
No random restarts.
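For reference, here is roughly what the ndots and probe changes look like at the pod-spec level. Names, image, endpoint, and timeout values are all illustrative, not the exact values we shipped:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api            # illustrative name
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"     # stop search-domain expansion for external names
  containers:
    - name: api
      image: api:latest        # placeholder image
      readinessProbe:
        httpGet:
          path: /healthz       # assumed health endpoint
          port: 8080
        timeoutSeconds: 5      # was too short for a slow-but-alive app
        periodSeconds: 10
        failureThreshold: 3    # require sustained failure before removal
```

The theme of all four fixes is the same: stop treating "slow" as "broken", and stop adding work (extra queries, extra evictions) to a system that is already waiting.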
The real lesson
Production systems don’t always fail loudly.
More often, they fail by:
waiting
queueing
backing off
timing out
Green dashboards only tell you nothing crashed.
They don’t tell you the system is fast.
Final takeaway
If users say the system is slow and everything looks healthy:
measure latency
trust TCP over ping
assume DNS matters
remember Kubernetes adds hops
remember “Running” ≠ “Ready”
Most incidents start quietly.
You only catch them if you know where to look.