Incident Postmortem (Blameless)
SRE-grade postmortem: timeline · 5-whys · impact · action items · audit chain — assumes blameless culture.
This document assumes everyone acted with the best information they had at the time. We are looking at systems, not people.
The blameless premise
Every postmortem starts from the same assumption: everyone who acted during the incident did so with the best information they had at the time. They made the decisions they made because the information available to them (the alerts that fired, the dashboards they saw, the runbook they consulted) pointed that way. The post-incident question is never "why did Alice do X"; it is "why did the system make X look like the right move?" If you fix the latent system property, the next on-caller won't face the same trap.
Capture the timeline within 24 hours
Memory decays fast. Within 24 hours of resolution, capture every observation, decision, and action with timestamps. For example: detected at 13:42 UTC, paged at 13:43, initial diagnosis at 13:55, first mitigation attempt at 14:01, successful rollback at 14:17, resolved at 14:22. The timeline is the spine of the postmortem; everything else hangs off it.
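Once the timestamps are captured, the intervals between them (time to page, time to mitigate, time to resolve) fall out mechanically. A minimal sketch using the example times above; the calendar date and the helper names `at` and `minutes_since_detection` are placeholders invented for illustration, since the text gives only times of day.

```python
from datetime import datetime, timezone

DAY = "2024-01-01"  # hypothetical date; the example only states times of day


def at(hhmm: str) -> datetime:
    """Parse an HH:MM time on the placeholder day as a UTC datetime."""
    parsed = datetime.strptime(f"{DAY} {hhmm}", "%Y-%m-%d %H:%M")
    return parsed.replace(tzinfo=timezone.utc)


# Events in the order they happened, taken from the example timeline.
timeline = [
    ("detected", at("13:42")),
    ("paged", at("13:43")),
    ("initial diagnosis", at("13:55")),
    ("first mitigation attempt", at("14:01")),
    ("successful rollback", at("14:17")),
    ("resolved", at("14:22")),
]


def minutes_since_detection(events):
    """Return {event name: whole minutes elapsed since the first event}."""
    start = events[0][1]
    return {name: int((ts - start).total_seconds() // 60) for name, ts in events}


elapsed = minutes_since_detection(timeline)
for name, mins in elapsed.items():
    print(f"{name:<26} +{mins} min")
```

Storing events as timezone-aware UTC datetimes, rather than free text, keeps the arithmetic unambiguous when responders sit in different time zones.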
Quantify the impact
Three numbers turn an abstract incident into a concrete business case for fixing it:
- Users affected — by count or by percentage of monthly actives
- Revenue impact — direct revenue loss during outage
- SLO burn — percentage of monthly error budget consumed in this single incident
A 40-minute outage that burned 35% of your monthly error budget is a different conversation from one that burned 2%.
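The SLO-burn number can be estimated with simple arithmetic. The sketch below assumes a time-based availability SLO, a 30-day window, and a full outage for the whole duration; the 99.9% target and the function names are hypothetical. The 35% figure above would correspond to a looser target or a partial outage, where a failed-request ratio is a better basis than wall-clock minutes.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over the window."""
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_burned_pct(bad_minutes: float, slo_target: float, window_days: int = 30) -> float:
    """Share of the window's error budget consumed by `bad_minutes` of full outage."""
    return 100.0 * bad_minutes / error_budget_minutes(slo_target, window_days)


# Hypothetical inputs: a 99.9% availability target and a 40-minute full outage.
budget = error_budget_minutes(0.999)
burned = budget_burned_pct(40, 0.999)
print(f"budget: {budget:.1f} min, burned: {burned:.1f}%")
```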
The postmortem meeting
For SEV-0 and SEV-1 incidents, hold the meeting within 7 days with all responders present. A facilitator (often not the on-caller) reads the blameless statement aloud, walks the group through the timeline, and runs the 5-whys. The goal of the meeting is agreement on action items. The goal is not to identify who messed up.
5-whys done right
The 5-whys digs down to the deepest layer of cause. An example for an outage with a realistic shape:
- Why did the API return 500s? → DB pool exhausted
- Why was the pool exhausted? → Connection leak in error path
- Why was the connection leaked? → Transaction not closed when an exception fired
- Why was the exception path not closing transactions? → No integration test for that path
- Why is there no integration test? → PR template doesn't ask for error-path test (root cause / action item)
Stop when the answer is a system property. If your answer at depth 5 is "Alice forgot," ask another why — "why is it possible to forget?"
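To make the chain above concrete, here is a toy reproduction of the leak and of the error-path test that would have caught it. Everything in it is hypothetical: `Pool` only counts checked-out connections and stands in for a real database pool, and the handler names are invented for illustration.

```python
class Pool:
    """Toy connection pool that only counts checked-out connections."""

    def __init__(self, size: int):
        self.size = size
        self.in_use = 0

    def acquire(self) -> None:
        if self.in_use >= self.size:
            raise RuntimeError("pool exhausted")
        self.in_use += 1

    def release(self) -> None:
        self.in_use -= 1


def handle_leaky(pool: Pool, fail: bool) -> None:
    """Buggy shape: release is skipped whenever the work raises."""
    pool.acquire()
    if fail:
        raise ValueError("query failed")
    pool.release()


def handle_fixed(pool: Pool, fail: bool) -> None:
    """Fixed shape: finally releases on both the success and the error path."""
    pool.acquire()
    try:
        if fail:
            raise ValueError("query failed")
    finally:
        pool.release()


def leaked_after_errors(handler, attempts: int = 3) -> int:
    """Error-path check: run failing requests, return connections still held."""
    pool = Pool(size=10)
    for _ in range(attempts):
        try:
            handler(pool, fail=True)
        except ValueError:
            pass
    return pool.in_use


print(leaked_after_errors(handle_leaky))  # connections leaked by the buggy handler
print(leaked_after_errors(handle_fixed))  # connections leaked by the fixed handler
```

The check only exercises the failing path, which is exactly the test the example's PR template never asked for.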
Detection gap
Always ask: why didn't monitoring catch this earlier? The answer often reveals an alert that should have existed (DB pool utilisation), a runbook that was stale, or a dashboard nobody looks at. Action items from detection-gap analysis tend to have the highest long-term value.
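As an illustration of the kind of alert a detection-gap review might produce, here is a minimal sketch of a sustained-threshold check on pool utilisation. The 80% threshold, the three-sample window, and the function name are assumptions chosen for the example, not recommendations; a real alert would live in your monitoring system's rule language.

```python
def should_alert(samples, threshold=0.8, sustained=3):
    """Return True if the last `sustained` utilisation samples are all >= threshold.

    `samples` is a chronological list of in_use / pool_size fractions.
    Requiring several consecutive samples avoids paging on a single spike.
    """
    if len(samples) < sustained:
        return False
    return all(u >= threshold for u in samples[-sustained:])


print(should_alert([0.50, 0.85, 0.90, 0.95]))  # last three samples are all high
print(should_alert([0.90, 0.95, 0.50]))        # most recent sample has recovered
```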
Action items: SMART, owned, due-dated
Each action item is SMART: Specific, Measurable, Achievable, Relevant, Time-bound. It is assigned to a named person, has a due date and a priority (P0, P1, P2, P3), and is tracked to closure in your normal sprint planning. The most common postmortem failure mode is "write action items, never do them." Make them tickets, assign them, close them.
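The owner, due-date, and priority requirements are easy to enforce mechanically before an item is accepted. A minimal sketch, assuming a hypothetical `ActionItem` record; the owner name and dates are invented, and in practice these checks would be required fields in your ticket tracker.

```python
from dataclasses import dataclass
from datetime import date

PRIORITIES = ("P0", "P1", "P2", "P3")


@dataclass
class ActionItem:
    title: str
    owner: str       # a named person, not a team alias
    due: date
    priority: str
    done: bool = False

    def problems(self) -> list:
        """Return the reasons this item is not yet acceptable (empty if fine)."""
        issues = []
        if not self.owner.strip():
            issues.append("no named owner")
        if self.priority not in PRIORITIES:
            issues.append(f"unknown priority {self.priority!r}")
        return issues

    def overdue(self, today: date) -> bool:
        """An item is overdue if it is still open after its due date."""
        return not self.done and today > self.due


# Hypothetical item based on the 5-whys example; owner and dates are invented.
item = ActionItem(
    title="Add error-path test prompt to PR template",
    owner="alice",
    due=date(2024, 1, 15),
    priority="P1",
)
print(item.problems())
print(item.overdue(date(2024, 2, 1)))
```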
Publish — internally, blamelessly
Postmortems are organisational learning. Share with engineering, support, leadership. The pattern an incident exposed may be present in three other systems nobody's looking at. Public postmortems (like Cloudflare's) demonstrate trust; internal-only is fine for proprietary detail. Never name-shame.