Incident Postmortem (Blameless)

SRE-grade postmortem: timeline · 5-whys · impact · action items · audit chain — assumes blameless culture.

This document assumes everyone acted with the best information they had at the time. We are looking at systems, not people.

The blameless premise

Every postmortem starts with the same assumption: every person who acted during the incident did so with the best information they had at the time. They decided as they did because the information available to them (the alerts that fired, the dashboards they saw, the runbook they consulted) pointed that way. The post-incident question is never "why did Alice do X"; it is "why did the system make X look like the right move?" If you fix the latent system property, the next on-call engineer won't face the same trap.

Capture the timeline within 24 hours

Memory decays fast. Within 24 hours of resolution, capture every observation, decision, and action with timestamps. For example: detected at 13:42 UTC, paged at 13:43, initial diagnosis at 13:55, first mitigation attempt at 14:01, successful rollback at 14:17, resolved at 14:22. The timeline is the spine of the postmortem; everything else hangs off it.
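
As a rough sketch of why timestamps matter, the example times above can be turned into the intervals a postmortem usually reports. The calendar date, the event labels, and the helper names (`at`, `minutes_between`) are assumptions made for illustration; only the clock times come from the example.

```python
from datetime import datetime, timezone

# Placeholder date (an assumption); the example timeline only gives clock times in UTC.
DAY = "2024-01-01"


def at(hhmm: str) -> datetime:
    """Build a timezone-aware UTC timestamp for a clock time on the placeholder day."""
    parsed = datetime.strptime(f"{DAY} {hhmm}", "%Y-%m-%d %H:%M")
    return parsed.replace(tzinfo=timezone.utc)


# Event labels are hypothetical names for the entries in the example timeline.
timeline = [
    (at("13:42"), "detected"),
    (at("13:43"), "paged"),
    (at("13:55"), "initial_diagnosis"),
    (at("14:01"), "first_mitigation_attempt"),
    (at("14:17"), "rollback_succeeded"),
    (at("14:22"), "resolved"),
]


def minutes_between(events, start: str, end: str) -> int:
    """Whole minutes elapsed between two labelled events."""
    times = {label: ts for ts, label in events}
    return int((times[end] - times[start]).total_seconds() // 60)


time_to_page = minutes_between(timeline, "detected", "paged")
time_to_mitigate = minutes_between(timeline, "detected", "rollback_succeeded")
time_to_resolve = minutes_between(timeline, "detected", "resolved")

print(time_to_page, time_to_mitigate, time_to_resolve)  # 1 35 40
```

With these timestamps, detection to resolution is 40 minutes, and 35 of those passed before the rollback succeeded.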

Quantify the impact

Three numbers turn an abstract incident into a concrete business case for fixing it:

A 40-minute outage that burned 35% of your monthly error budget is a different conversation from one that burned 2%.
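
The error-budget share can be estimated with simple arithmetic. The sketch below assumes a time-based availability SLO over a rolling 30-day window and treats `failure_fraction` as the share of requests that failed during the outage; the function names, the SLO value, and the window length are illustrative assumptions, not figures from any particular service.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full downtime the SLO tolerates over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_burned(
    outage_minutes: float,
    slo: float,
    window_days: int = 30,
    failure_fraction: float = 1.0,
) -> float:
    """Fraction of the window's error budget consumed by one outage.

    failure_fraction scales the outage for partial failures,
    e.g. 0.5 if half of all requests failed for the whole duration.
    """
    budget = error_budget_minutes(slo, window_days)
    return outage_minutes * failure_fraction / budget


# Assumed SLO of 99.9% over 30 days: roughly 43.2 minutes of budget.
full = budget_burned(40, slo=0.999)
partial = budget_burned(40, slo=0.999, failure_fraction=0.5)
print(f"full outage: {full:.0%}, half of requests failing: {partial:.0%}")
```

The same 40 minutes burns a very different share depending on the SLO target and on how much traffic actually failed, which is why the percentage belongs in the postmortem alongside the duration.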

The postmortem meeting

For SEV-0 and SEV-1 incidents, hold the meeting within 7 days with all responders present. The facilitator (often someone other than the on-call engineer) reads the blameless statement aloud, walks the group through the timeline, and runs the 5-whys. The goal of the meeting is agreement on action items. The goal is not to identify who messed up.

5-whys done right

The 5-whys technique drills down to the deepest layer of cause. Example for a realistic outage:

Stop when the answer is a system property. If your answer at depth 5 is "Alice forgot," ask another why — "why is it possible to forget?"

Detection gap

Always ask: why didn't monitoring catch this earlier? The answer often reveals an alert that should have existed (DB pool utilisation), a runbook that was stale, or a dashboard nobody looks at. Action items from detection-gap analysis tend to have the highest long-term value.
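
To make the missing-alert case concrete, here is a minimal sketch of the kind of rule a detection-gap action item might ask for: fire when DB pool utilisation stays above a threshold for several consecutive samples. The function name, the 0.9 threshold, and the three-sample window are assumptions for illustration; a real rule would live in your monitoring system rather than in application code.

```python
from typing import Sequence


def should_alert(
    samples: Sequence[float],
    threshold: float = 0.9,
    sustained: int = 3,
) -> bool:
    """True if the most recent `sustained` samples all exceed `threshold`.

    samples holds pool utilisation as fractions (0.0 to 1.0), oldest first.
    Requiring several consecutive breaches avoids paging on a single spike.
    """
    if len(samples) < sustained:
        return False
    return all(s > threshold for s in samples[-sustained:])


# Hypothetical utilisation readings, one per scrape interval.
print(should_alert([0.50, 0.95, 0.96, 0.97]))  # True: last three all above 0.9
print(should_alert([0.95, 0.50, 0.96, 0.97]))  # False: the dip breaks the streak
```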

Action items: SMART, owned, due-dated

Each action item is SMART: Specific, Measurable, Achievable, Relevant, Time-bound. It is assigned to a named person, has a due date, has a priority (P0, P1, P2, P3), and is tracked to closure in your normal sprint planning. The most common postmortem failure mode is "write action items, never do them." Make them tickets, assign them, close them.
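
One way to keep action items honest is to check them mechanically before the postmortem is published. The sketch below assumes a small in-memory record; the `ActionItem` fields, the `problems` and `overdue` helpers, and the sample owner and ticket ID are hypothetical, and in practice the same checks would run against your ticket tracker.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

PRIORITIES = ("P0", "P1", "P2", "P3")


@dataclass
class ActionItem:
    title: str
    owner: str                 # a named person, not a team alias
    due: Optional[date]
    priority: str
    ticket: Optional[str] = None
    closed: bool = False

    def problems(self) -> List[str]:
        """List the ways this item falls short of owned, due-dated, and tracked."""
        found = []
        if not self.owner.strip():
            found.append("no named owner")
        if self.due is None:
            found.append("no due date")
        if self.priority not in PRIORITIES:
            found.append("priority must be one of P0-P3")
        if self.ticket is None:
            found.append("not filed as a ticket")
        return found


def overdue(items: List[ActionItem], today: date) -> List[ActionItem]:
    """Open items whose due date has already passed."""
    return [i for i in items if not i.closed and i.due is not None and i.due < today]


# Hypothetical items for illustration only.
items = [
    ActionItem("Add alert on DB pool utilisation", "alice", date(2024, 1, 15), "P1", ticket="OPS-101"),
    ActionItem("Refresh the stale runbook", "", None, "P2"),
]

for item in items:
    print(item.title, "->", item.problems() or "ok")
print([i.title for i in overdue(items, today=date(2024, 2, 1))])
```

The second item fails three checks at once, which is the "write action items, never do them" failure mode in miniature.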

Publish — internally, blamelessly

Postmortems are organisational learning. Share with engineering, support, leadership. The pattern an incident exposed may be present in three other systems nobody's looking at. Public postmortems (like Cloudflare's) demonstrate trust; internal-only is fine for proprietary detail. Never name-shame.