Incident response

A calm, repeatable process for when production breaks, and for learning from it afterwards.

Production will break. An incident is any unplanned disruption to service, and good monitoring is usually what catches it first. How a team responds separates the calm from the chaotic. The goal during an incident is narrow and clear: restore service first, understand fully later.

Stabilise before you diagnose

The instinct to find the root cause immediately is usually wrong. Priorities, in order:

1. Mitigate: stop the bleeding. Roll back the recent deploy, flip the feature flag off, scale up, or fail over. Get users working again.
2. Communicate: tell stakeholders what's happening (see below).
3. Diagnose: now find the root cause, with the pressure off.
4. Fix and verify: apply the real fix and confirm recovery.

The point from the deployment lesson returns here: a fast rollback is the most powerful incident tool you have, because it mitigates without needing a diagnosis.
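To make "mitigate without a diagnosis" concrete, here is a minimal sketch of choosing a rollback target. It assumes you keep an ordered release history with a health marker per release. The `Release` type, the `healthy` field, and the version names are hypothetical and not taken from any particular deploy tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Release:
    version: str
    healthy: bool  # hypothetical marker: did this release run cleanly in production?


def rollback_target(history: list[Release]) -> Optional[Release]:
    """Return the most recent known-good release before the live one.

    `history` is ordered oldest to newest; the last entry is the release
    that is currently live and presumed broken.
    """
    for release in reversed(history[:-1]):
        if release.healthy:
            return release
    return None


# Illustrative history, not real data.
history = [
    Release("v41", healthy=True),
    Release("v42", healthy=False),  # an earlier bad release
    Release("v43", healthy=True),
    Release("v44", healthy=False),  # currently live, incident in progress
]

target = rollback_target(history)
print(target.version if target else "no known-good release")  # prints "v43"
```

Note what the function does not need: any idea of why `v44` is failing. That is the property that makes rollback useful in the first minutes of an incident.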

Communicate while you work

Silence during an outage breeds panic and duplicated effort. Designate someone to post regular, honest updates ("investigating", "identified, mitigating", "resolved"), even when there's little news. For user-facing outages, a status page sets expectations and cuts the flood of "is it down?" questions.
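One way to keep updates regular is to treat them as structured data and check when the next one is due. The sketch below uses the three states quoted above. The 30-minute cadence, the `Update` type, and the sample message are assumptions for illustration only; pick whatever interval your team has agreed on.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum


class Phase(Enum):
    INVESTIGATING = "investigating"
    MITIGATING = "identified, mitigating"
    RESOLVED = "resolved"


UPDATE_INTERVAL = timedelta(minutes=30)  # assumed cadence, not a standard


@dataclass
class Update:
    at: datetime
    phase: Phase
    note: str

    def render(self) -> str:
        """Format the update as a single status-page line."""
        return f"[{self.at:%H:%M} UTC] {self.phase.value}: {self.note}"


def update_overdue(last: Update, now: datetime) -> bool:
    """True when an unresolved incident has gone quiet for too long."""
    if last.phase is Phase.RESOLVED:
        return False
    return now - last.at >= UPDATE_INTERVAL


# Illustrative incident, not real data.
start = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
first = Update(start, Phase.INVESTIGATING, "Elevated error rates on checkout.")

print(first.render())
print(update_overdue(first, start + timedelta(minutes=45)))  # prints True
```

The overdue check fires even when nothing has changed, which matches the advice to post updates when there's little news.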

A clear incident commander — one person coordinating, not necessarily fixing — keeps a multi-person response from descending into chaos.

The blameless postmortem

After recovery, write up what happened: timeline, impact, root cause, and, most importantly, what will prevent recurrence. The cardinal rule is blamelessness: focus on the systemic gaps (missing alert, fragile deploy, unclear runbook), not on who typed the command. People are honest only when they're not on trial, and honesty is what makes the lesson stick.

The best teams treat incidents as expensive lessons they refuse to waste. Each postmortem produces concrete follow-ups (a new alert, a guardrail, a fixed runbook) so the same failure is far less likely to happen twice.
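A postmortem is easier to hold to this standard when its sections are explicit. The sketch below models the four parts named above (timeline, impact, root cause, follow-ups) and flags any that are still empty. The `Postmortem` and `TimelineEvent` types and the sample incident are hypothetical, one possible shape rather than a prescribed template.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class TimelineEvent:
    at: datetime
    what: str  # describe what the system did, not who acted


@dataclass
class Postmortem:
    title: str
    impact: str
    root_cause: str
    timeline: list[TimelineEvent] = field(default_factory=list)
    follow_ups: list[str] = field(default_factory=list)

    def duration(self) -> timedelta:
        """Time from the first to the last recorded event."""
        if not self.timeline:
            return timedelta(0)
        times = [event.at for event in self.timeline]
        return max(times) - min(times)

    def missing_sections(self) -> list[str]:
        """Names of the sections that are still empty."""
        sections = {
            "impact": self.impact,
            "root_cause": self.root_cause,
            "timeline": self.timeline,
            "follow_ups": self.follow_ups,
        }
        return [name for name, value in sections.items() if not value]


# Illustrative write-up, not real data.
pm = Postmortem(
    title="Checkout errors",
    impact="Some checkout requests failed.",
    root_cause="No alert covered the failing dependency.",
    timeline=[
        TimelineEvent(datetime(2024, 1, 1, 9, 0), "Error rate begins to climb."),
        TimelineEvent(datetime(2024, 1, 1, 9, 25), "Rollback completes; errors stop."),
    ],
)

print(pm.duration())          # prints 0:25:00
print(pm.missing_sections())  # prints ['follow_ups']
```

A write-up with no follow-ups is the case worth catching: it records the lesson without doing anything to prevent a repeat.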

Where to go next

That completes Observability & Operations. Many of the hardest production problems come from systems that aren't safe by default. Next module: Security Fundamentals.
