Code of the Day

Monitoring and alerting

Turn signals into timely warnings — without drowning in noise.

Fundamentals · Advanced · 8 min read
By the end of this lesson you will be able to:
  • Define SLIs and SLOs and explain why they anchor alerts
  • Alert on symptoms users feel, not every internal metric
  • Avoid alert fatigue

Monitoring watches your metrics and logs; alerting notifies a human when something needs attention. The hard part isn't collecting data. It's deciding what's worth waking someone up for. Good alerting catches real problems early; bad alerting trains everyone to ignore it.

Measure what users feel: SLIs and SLOs

Anchor monitoring in the user's experience, not internal trivia:

  • An SLI (Service Level Indicator) is a measurement of how the service behaves for users: request latency, error rate, availability.
  • An SLO (Service Level Objective) is the target for an SLI — e.g. "99.9% of requests succeed" or "95% complete under 300ms."

SLOs give alerts an objective threshold and give the team a shared definition of "healthy" — and explicit permission to not chase perfection beyond the target.
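To make the SLI/SLO relationship concrete, here is a minimal sketch that computes two SLIs from a batch of request records and checks each against a target. The `Request` type, the function names, and the sample numbers are all assumptions made up for illustration; a real system would typically read these values from a metrics store.

```python
from dataclasses import dataclass


@dataclass
class Request:
    ok: bool            # did the request succeed?
    latency_ms: float   # how long it took, in milliseconds


def availability_sli(requests):
    """Fraction of requests that succeeded (assumes a non-empty batch)."""
    return sum(r.ok for r in requests) / len(requests)


def latency_sli(requests, threshold_ms):
    """Fraction of requests that completed under threshold_ms."""
    return sum(r.latency_ms < threshold_ms for r in requests) / len(requests)


def meets_slo(sli, target):
    """An SLO is met when the measured SLI is at or above its target."""
    return sli >= target


# Hypothetical sample: 1000 requests, 40 slow successes, 2 failures that timed out.
requests = (
    [Request(ok=True, latency_ms=120.0)] * 958
    + [Request(ok=True, latency_ms=450.0)] * 40
    + [Request(ok=False, latency_ms=5000.0)] * 2
)

availability = availability_sli(requests)   # 998 / 1000
fast = latency_sli(requests, 300)           # 958 / 1000

print(meets_slo(availability, 0.999))  # False: 99.8% is below the 99.9% target
print(meets_slo(fast, 0.95))           # True: 95.8% is above the 95% target
```

In this sample the latency objective is met while the availability objective is missed, so only the availability shortfall would be a candidate for an alert.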

Alert on symptoms, not causes

Alert on what the user feels — error rate up, latency up, the site down — not on every internal fluctuation ("CPU at 80%"). High CPU might be fine; a spike in failed checkouts is never fine. Symptom-based alerts catch real problems regardless of cause and don't fire on harmless internal noise.
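A symptom-based check can be sketched as a rule over recent request outcomes rather than over resource metrics. The function names, the 1% error limit, and the minimum-traffic guard below are illustrative assumptions, not recommended values.

```python
def error_rate(outcomes):
    """outcomes: list of booleans, True meaning the request succeeded."""
    if not outcomes:
        return 0.0
    return outcomes.count(False) / len(outcomes)


def should_alert(outcomes, max_error_rate=0.01, min_requests=100):
    """Fire only when there is enough traffic to judge and errors exceed the limit."""
    if len(outcomes) < min_requests:
        return False  # too few requests for the rate to be meaningful
    return error_rate(outcomes) > max_error_rate


# Hypothetical checkout window: 500 requests, 25 of them failed.
checkout_window = [True] * 475 + [False] * 25
print(should_alert(checkout_window))  # True: a 5% failure rate exceeds the 1% limit
```

Note that CPU usage never appears in the rule. If high CPU starts causing failed checkouts, the error rate rises and the alert fires; if it doesn't, nobody is interrupted.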

A useful frame: page a human for things that are urgent and actionable; everything else goes to a dashboard or a ticket, not someone's phone at night.
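One way to encode that frame is a small routing function. The `Route` enum and the split between ticket and dashboard are assumptions for illustration; the lesson only says that non-paging signals go to one or the other.

```python
from enum import Enum


class Route(Enum):
    PAGE = "page"
    TICKET = "ticket"
    DASHBOARD = "dashboard"


def route(urgent: bool, actionable: bool) -> Route:
    """Page only when a signal is both urgent and actionable."""
    if urgent and actionable:
        return Route.PAGE
    if actionable:
        return Route.TICKET      # assumed: worth doing, but can wait for working hours
    return Route.DASHBOARD       # assumed: nothing to do, keep it visible only


print(route(urgent=True, actionable=True))    # Route.PAGE
print(route(urgent=False, actionable=True))   # Route.TICKET
print(route(urgent=True, actionable=False))   # Route.DASHBOARD
```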

Beware alert fatigue

The fastest way to make monitoring useless is too many alerts. When alerts fire constantly — especially false ones — people stop trusting them and miss the real one. Every alert should be actionable: if there's nothing to do about it, it's not an alert, it's noise. Prune relentlessly.

An alert that fires and is ignored is worse than no alert — it adds noise and erodes trust in the whole system. If an alert isn't actionable and urgent, turn it into a dashboard panel instead.
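Pruning is easier with data. As a rough sketch, assuming you keep a history of which alerts fired and whether anyone acted on them, you can flag the alerts that are mostly ignored. The history format, the alert names, and the 50% cutoff are hypothetical.

```python
from collections import defaultdict


def prune_candidates(history, min_action_rate=0.5):
    """history: iterable of (alert_name, acted_on) pairs, one per firing.

    Returns the names of alerts acted on less often than min_action_rate.
    """
    fired = defaultdict(int)
    acted = defaultdict(int)
    for name, acted_on in history:
        fired[name] += 1
        if acted_on:
            acted[name] += 1
    return sorted(
        name for name in fired
        if acted[name] / fired[name] < min_action_rate
    )


# Hypothetical history: checkout alerts always led to action, CPU alerts rarely did.
history = (
    [("checkout_errors", True)] * 3
    + [("cpu_high", False)] * 9
    + [("cpu_high", True)]
)
print(prune_candidates(history))  # ['cpu_high']
```

Alerts on the returned list are candidates for retuning, demoting to a dashboard panel, or deleting.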

Where to go next

When an alert does fire on something real, you need a calm process. Next: incident response.
