Everything Was Green, But Production Was Slow

This wasn’t an outage.
Nothing crashed. No alerts fired.

Users reported slow requests and timeouts.
Dashboards were green.

  • CPU low

  • Memory stable

  • Pods running

  • No packet loss

  • No restarts

At first glance, there was nothing to fix.

That’s exactly why it was dangerous.

What looked healthy (and fooled us)

CPU usage looked fine

kubectl top pod -n prod
NAME        CPU(cores)   MEMORY(bytes)
api-1       120m         300Mi
api-2       130m         310Mi

CPU wasn’t close to limits.

But CPU usage only tells you how much time the process ran, not how long it waited.

In Kubernetes, CPU limits are enforced by the kernel's CFS scheduler via cgroups.
Once a container burns through its quota for a scheduling period, it is paused for the rest of that period — even when the node has idle CPU.

Low CPU can mean throttling, not health.
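You can confirm throttling by reading the container's cgroup CPU stats directly; CFS throttling shows up there even when the usage graphs look flat. A sketch, assuming cgroup v2 and illustrative numbers:

```shell
# Inside the container, cgroup v2 exposes /sys/fs/cgroup/cpu.stat.
# The values below are illustrative stand-ins for that file's contents.
cpu_stat='usage_usec 1200000
nr_periods 5000
nr_throttled 1400
throttled_usec 9000000'

# Fraction of CFS scheduling periods in which the container was throttled:
echo "$cpu_stat" | awk '
  /^nr_periods/   { periods   = $2 }
  /^nr_throttled/ { throttled = $2 }
  END { printf "throttled in %.0f%% of periods\n", 100 * throttled / periods }'
```

On cgroup v1 the same counters live under /sys/fs/cgroup/cpu/cpu.stat. Anything beyond a few percent of throttled periods tends to surface as tail latency.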

Ping showed no packet loss

ping service.prod.svc.cluster.local
64 bytes from 10.96.0.12: icmp_seq=1 ttl=64 time=0.5 ms

Ping only proves:

  • tiny packets get through

  • the ICMP path works

Ping does not exercise:

  • TCP congestion control

  • queueing

  • TLS handshakes

  • retransmissions

Ping worked. Apps were still slow.

The first real signal: latency

From inside a pod:

time curl http://service.prod.svc.cluster.local
real    0m0.87s

No errors.
Just waiting.

Latency was being added before the application logic even ran.
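curl can break that wait down by phase: its -w write-out variables expose timers for name lookup, TCP connect, and time to first byte. A sketch (the service URL is the same hypothetical one as above; the demo run uses a local file:// URL so it needs no network):

```shell
# Each -w timer is a cumulative timestamp, in seconds, from transfer start.
# Against the real service you would run:
#   curl -s -o /dev/null -w "$fmt" http://service.prod.svc.cluster.local
fmt='dns=%{time_namelookup}s connect=%{time_connect}s ttfb=%{time_starttransfer}s total=%{time_total}s\n'

# Demonstrate the format locally:
curl -s -o /dev/null -w "$fmt" file:///dev/null
```

If dns dominates, the problem is resolution; if connect minus dns dominates, it's the network path; if ttfb dominates, it's the application.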

DNS was slow, not failing

dig kubernetes.default.svc.cluster.local
;; Query time: 180 msec

DNS resolved successfully — but 180ms per lookup adds up fast.

Applications often:

  • resolve DNS on startup

  • resolve again on reconnect

  • resolve on retries

Slow DNS becomes application latency.
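The amplification comes from the pod's resolv.conf. With the typical ndots:5 and cluster search domains, any name with fewer than five dots is tried against every search suffix before being queried as-is. A sketch of the lookup sequence for one external name (the search domains shown are common cluster defaults; yours may differ):

```shell
# "api.example.com" has 2 dots, below ndots:5, so the resolver walks the
# search list first. Each suffixed query usually returns NXDOMAIN, but
# still costs a full round trip to CoreDNS.
name=api.example.com
for suffix in prod.svc.cluster.local svc.cluster.local cluster.local; do
  echo "lookup: $name.$suffix"
done
echo "lookup: $name"   # only this final query succeeds
```

Four queries (doubled again for A/AAAA pairs) for a single name is why capping ndots helps so much.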

CoreDNS was busy

kubectl top pod -n kube-system
NAME          CPU(cores)   MEMORY(bytes)
coredns-abc   180m         140Mi

CoreDNS wasn’t down.
It was overloaded.

That means:

  • queries queue

  • responses slow

  • apps block

No alerts fired because nothing was “broken”.

TCP told the truth

ss -i dst <service-ip>
cwnd:10 rtt:120/30 rto:400

What this means:

  • rtt:120/30 means a 120ms average RTT with high variance

  • cwnd:10 means TCP kept its congestion window small, limiting data in flight

  • the app spent more time waiting on the socket

Even without packet loss.
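Those numbers translate directly into a throughput ceiling: TCP can keep at most cwnd segments in flight per round trip, so roughly cwnd × MSS / RTT. A back-of-the-envelope sketch, assuming a 1448-byte MSS (1500-byte MTU minus IP and TCP headers):

```shell
# Upper bound on one connection's throughput, from the ss output above.
cwnd=10        # segments in flight per round trip (from ss)
mss=1448       # bytes per segment (assumed; depends on MTU and options)
rtt_ms=120     # round-trip time in ms (from ss)
awk -v c="$cwnd" -v m="$mss" -v r="$rtt_ms" \
  'BEGIN { printf "~%.0f KB/s per connection\n", c * m / (r / 1000) / 1024 }'
```

That works out to roughly 118 KB/s: a connection that could move tens of MB/s on a clean path gets capped at dial-up-era numbers purely by RTT and a shy congestion window.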

Retransmissions existed

netstat -s | grep -i retrans
    12489 segments retransmitted

ICMP showed no loss.
TCP showed retries.

Packets were delayed or queued, then resent quietly.

Kubernetes networking added invisible cost

This cluster used VXLAN.

VXLAN wraps packets and adds extra hops.

Actual packet path (simplified)

Pod
 ↓
veth
 ↓
Linux bridge
 ↓
VXLAN encapsulation
 ↓
Node NIC
 ↓
Network
 ↓
Node NIC
 ↓
VXLAN decapsulation
 ↓
Linux bridge
 ↓
veth
 ↓
Pod

Each hop adds:

  • buffering

  • queueing

  • scheduling delay

Nothing fails.
Everything waits.
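The wrapping also has a concrete size cost. VXLAN over an IPv4 underlay carries the inner Ethernet frame plus VXLAN, UDP, and outer IP headers; if the pod MTU isn't lowered to match, large packets fragment or quietly drop. The arithmetic:

```shell
# Per-packet VXLAN overhead (IPv4 underlay, no VLAN tags):
inner_eth=14   # inner Ethernet header carried as payload
vxlan=8        # VXLAN header
udp=8          # outer UDP header
outer_ip=20    # outer IPv4 header
overhead=$((inner_eth + vxlan + udp + outer_ip))
echo "overhead: ${overhead} bytes"
echo "pod MTU on a 1500-byte underlay: $((1500 - overhead))"
```

This is why CNI plugins default pod MTU to 1450 on a 1500-byte network, and why a mismatch is worth ruling out (as in the fixes below).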

Readiness probes amplified the problem

kubectl describe pod api-1
Readiness probe failed: context deadline exceeded

The app wasn’t dead — it was slow.

Readiness probes:

  • have short timeouts

  • remove pods from traffic immediately

Under load:

  1. app slows

  2. readiness fails

  3. pod removed from Service

  4. remaining pods get more traffic

  5. they slow down too

A feedback loop.
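One way to damp the loop is to let the probe tolerate a slow-but-alive app before pulling it from the Service. An illustrative fragment only; the endpoint, port, and values are assumptions, not a prescription:

```yaml
readinessProbe:
  httpGet:
    path: /healthz      # hypothetical health endpoint
    port: 8080
  timeoutSeconds: 3     # default is 1s, which is brutal under added latency
  periodSeconds: 10
  failureThreshold: 3   # require sustained failure, not one slow probe
```

The trade-off is slower removal of genuinely dead pods, so tune these against your real tail latency rather than copying numbers.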

Why dashboards stayed green

We were watching:

  • CPU

  • memory

  • pod status

  • packet loss

We were not watching:

  • DNS latency

  • TCP RTT variance

  • queue depth

  • throttling time

  • readiness failures over time

The system wasn’t broken.

It was waiting.

Fixes that actually helped

  • Reduced DNS amplification (ndots:2)

  • Gave CoreDNS more CPU

  • Tuned readiness probe timeouts

  • Verified MTU alignment for VXLAN

  • Focused on latency, not error rate

No blind scaling.
No random restarts.
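The ndots fix from the list above is a small pod spec change. An illustrative fragment (ndots:2 matches the fix above; adjust if your apps rely on very short cluster names):

```yaml
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"   # names with 2+ dots skip the search-domain walk
```

With this in place, api.example.com is queried as-is on the first try, while in-cluster short names like service.prod still resolve through the search list.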

The real lesson

Production systems don’t always fail loudly.

More often, they fail by:

  • waiting

  • queueing

  • backing off

  • timing out

Green dashboards only tell you nothing crashed.

They don’t tell you the system is fast.

Final takeaway

If users say the system is slow and everything looks healthy:

  • measure latency

  • trust TCP over ping

  • assume DNS matters

  • remember Kubernetes adds hops

  • remember “Running” ≠ “Ready”

Most incidents start quietly.

You only catch them if you know where to look.
