Logging, metrics, and tracing
The three pillars that let you see what a running system is actually doing.
- Distinguish logs, metrics, and traces and what each answers
- Write structured logs that are actually searchable
- Explain observability as designing systems to be debuggable
You can't debug what you can't see, and once code is in production you can't step through it with a debugger. Observability is building in the means to understand a running system from the outside. The debugging discipline lesson said to make systems explain themselves — this is how.
The three pillars
Each answers a different question:
- Logs — discrete, timestamped events. "What happened, and when?" The detailed narrative of individual events.
- Metrics — numbers aggregated over time. "How much, how often, how fast?" Request rate, error rate, latency, memory. Cheap to store, great for trends and dashboards.
- Traces — the path of a single request as it flows across functions and services. "Where did the time go in this one request?" Essential once a system spans multiple services.
Logs tell the story, metrics show the trends, traces follow one request through the whole system. You want all three.
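To make the distinction concrete, here is a minimal sketch of one request producing all three signals. It uses only the Python standard library. The in-memory sinks (`LOGS`, `METRICS`, `SPANS`), the `span` helper, and `handle_request` are hypothetical names invented for illustration; a real system would send this data to a logging, metrics, or tracing backend instead.

```python
import time
import uuid
from collections import Counter

# Hypothetical in-memory sinks; a real system would ship these to a backend.
LOGS = []            # logs: discrete, timestamped events
METRICS = Counter()  # metrics: numbers aggregated over time
SPANS = []           # traces: timed steps that share one trace_id


def log(event, **fields):
    """Record a single timestamped event with its context."""
    LOGS.append({"ts": time.time(), "event": event, **fields})


class span:
    """Context manager that records how long one step of a request took."""

    def __init__(self, trace_id, name):
        self.trace_id = trace_id
        self.name = name

    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb):
        SPANS.append({
            "trace_id": self.trace_id,
            "name": self.name,
            "seconds": time.perf_counter() - self.start,
        })
        return False  # never swallow exceptions


def handle_request(user_id):
    trace_id = uuid.uuid4().hex          # ties every span to this one request
    METRICS["requests_total"] += 1       # metric: how often?
    with span(trace_id, "handle_request"):
        with span(trace_id, "load_user"):
            time.sleep(0.001)            # stand-in for real work
        with span(trace_id, "charge_card"):
            time.sleep(0.002)            # stand-in for real work
    log("request_handled", trace_id=trace_id, user_id=user_id)  # log: what happened?
    return trace_id
```

After a call to `handle_request(42)`, `METRICS` answers "how often", `LOGS` answers "what happened, and when", and filtering `SPANS` by the returned `trace_id` answers "where did the time go in this one request".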
Structured logs
A log you can't search is nearly useless at scale. Prefer structured logs — key/value or JSON — over free-form prose:
{"level":"error","event":"payment_failed","user_id":42,"amount":19.99,"reason":"card_declined"}
Now you can query "all payment_failed events for user 42." Log the context that makes an event findable and actionable; skip noise. And never log secrets or personal data you wouldn't want leaked.
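One way to produce events like this is a small JSON formatter on top of Python's standard `logging` module. This is a sketch under assumptions, not a prescribed setup: the `JsonFormatter` class, the `"payments"` logger name, and the convention of passing context under a `fields` key in `extra` are all choices made for this example. The output is written to an in-memory stream so the query at the end can run without any log backend.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname.lower(),
            "event": record.getMessage(),
        }
        # "fields" is this example's own convention, supplied via `extra`.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def make_logger(stream):
    logger = logging.getLogger("payments")  # hypothetical logger name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep output out of the root logger
    return logger


stream = io.StringIO()  # stands in for a log file or collector
logger = make_logger(stream)

logger.error(
    "payment_failed",
    extra={"fields": {"user_id": 42, "amount": 19.99, "reason": "card_declined"}},
)
logger.info("payment_succeeded", extra={"fields": {"user_id": 7, "amount": 5.00}})

# Because every line is JSON, "all payment_failed events for user 42" is a filter.
events = [json.loads(line) for line in stream.getvalue().splitlines()]
failed_for_42 = [
    e for e in events
    if e["event"] == "payment_failed" and e.get("user_id") == 42
]
```

The event name stays a short, stable identifier (`payment_failed`) and the variable parts go into fields, which is what keeps the logs searchable. The same query against free-form prose would need fragile text matching.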
Observability is a design choice
The key shift: observability isn't bolted on after an outage; it's designed in before you need it. Instrument the important flows as you build them, covering their inputs, outputs, and side effects in the systems-thinking sense. A system that explains itself turns a 2 a.m. incident from guesswork into reading the evidence.
A practical signal of good observability: when something breaks, can you tell what broke and where from your dashboards and logs alone, without adding new logging and redeploying? If you have to redeploy to debug, you instrumented too little.
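As a sketch of what "designed in" can look like, the decorator below records the outcome and duration of every call to a flow at the moment the flow is written, so the evidence already exists when something fails. The names `instrumented`, `EVENTS`, and `checkout` are hypothetical, and the list stands in for whatever log or metrics pipeline a real system would use.

```python
import functools
import time

EVENTS = []  # hypothetical sink; stands in for a real log pipeline


def instrumented(flow):
    """Decorator: record outcome and duration for every call of a flow."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            event = {"flow": flow, "outcome": "ok"}
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                event["outcome"] = "error"
                event["error_type"] = type(exc).__name__
                raise  # instrumentation must not change behavior
            finally:
                elapsed = time.perf_counter() - start
                event["duration_ms"] = round(elapsed * 1000, 3)
                EVENTS.append(event)

        return wrapper

    return decorator


@instrumented("checkout")
def checkout(cart_total):
    if cart_total <= 0:
        raise ValueError("empty cart")
    return {"charged": cart_total}
```

A failing `checkout(0)` still raises `ValueError`, but it also leaves an event with `"outcome": "error"` and `"error_type": "ValueError"` behind, so the question "what broke and where" can be answered without a redeploy.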
Where to go next
Collecting signals is step one; being told when they go wrong is step two. Next: monitoring and alerting.