What is a blameless postmortem?

A blameless postmortem is an incident review that assumes everyone acted with the best information available at the time. It focuses on systems, processes, and detection gaps, never on assigning fault. Popularised by engineering teams at companies such as Google, Etsy, and Netflix, blameless culture is now common practice at mature engineering organisations.

What sections does a postmortem need?

A modern postmortem includes: summary, severity, key times (detected/mitigated/resolved), incident commander and scribe, impact (users/revenue/SLO burn), timeline, root cause, 5-whys, contributing factors, what went well/wrong, detection gap, and action items with owners and due dates.

What is the 5-whys technique?

Start with the symptom and ask 'why' five times. Each answer should lead deeper than the last. Stop when you reach a system property — not a person. Example: 'why did the API return 500s?' → 'DB pool exhausted' → 'connection leak' → 'transaction not closed in error path' → 'no test for that path' → 'PR template doesn't ask for error-path tests' (action item: update PR template).

When is a postmortem required?

Always for SEV-0 and SEV-1 incidents. Strongly recommended for SEV-2. Optional for SEV-3 and lower severities when there is a learning opportunity. Writing one costs hours; repeating the same incident costs days, plus customer trust.

How is a postmortem different from an RCA?

Root Cause Analysis (RCA) is the section of a postmortem that identifies the deepest causal factor. A postmortem is the complete document including timeline, impact, contributing factors, action items, and lessons. The postmortem is what gets published; the RCA is the most important section inside it.

Incident Postmortem (Blameless)

SRE-grade postmortem: timeline · 5-whys · impact · action items · audit chain — assumes blameless culture.


This document assumes everyone acted with the best information they had at the time. We are looking at systems, not people.

Incident ID

Date

Severity

Status

Title

Detected At

Mitigated At

Resolved At

Duration (minutes)

Incident Commander

Scribe

On-call / Responders

Summary (TL;DR)

Impact

Users Affected

Revenue Impact

SLO Burn

Timeline

Root Cause

5 Whys

Contributing Factors

What Went Well

What Went Wrong

Detection Gap

Action Items (SMART · owner · due date)

The blameless premise

Every postmortem starts with the same assumption: every person who acted during the incident did so with the best information they had at the time. They made the decisions they made because the information available to them (the alerts that fired, the dashboards they saw, the runbook they consulted) pointed that way. The post-incident question is never "why did Alice do X"; it is "why did the system make X look like the right move?" If you fix the latent system property, the next on-caller won't face the same trap.

Capture the timeline within 24 hours

Memory decays fast. Within 24 hours of resolution, capture every observation, decision, and action with timestamps. For example: detected at 13:42 UTC, paged at 13:43, initial diagnosis at 13:55, first mitigation attempt at 14:01, successful rollback at 14:17, resolved at 14:22. The timeline is the spine of the postmortem; everything else hangs off it.
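Once the timestamps are written down, the key durations fall out mechanically. The following is a minimal Python sketch using the example times above. The calendar date, the helper names (`at`, `minutes_between`, `key_durations`), and the choice to treat the successful rollback as the "mitigated" moment are all assumptions made for illustration.

```python
from datetime import datetime, timezone

# Times come from the example timeline; the calendar date is a placeholder
# (assumption), since the text gives only times of day in UTC.
PLACEHOLDER_DAY = (2024, 1, 15)


def at(hhmm: str) -> datetime:
    """Build a timezone-aware UTC timestamp on the placeholder day."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(*PLACEHOLDER_DAY, hour, minute, tzinfo=timezone.utc)


timeline = [
    (at("13:42"), "detected"),
    (at("13:43"), "paged"),
    (at("13:55"), "initial diagnosis"),
    (at("14:01"), "first mitigation attempt"),
    (at("14:17"), "successful rollback"),  # treated as "mitigated" (assumption)
    (at("14:22"), "resolved"),
]


def minutes_between(start: datetime, end: datetime) -> int:
    """Whole minutes elapsed from start to end."""
    return int((end - start).total_seconds() // 60)


def key_durations(events):
    """Derive time-to-mitigate and total duration from labelled events."""
    when = {label: ts for ts, label in events}
    return {
        "time_to_mitigate_min": minutes_between(when["detected"], when["successful rollback"]),
        "duration_min": minutes_between(when["detected"], when["resolved"]),
    }


print(key_durations(timeline))
# {'time_to_mitigate_min': 35, 'duration_min': 40}
```

Storing timestamps as timezone-aware UTC values avoids ambiguity when responders sit in different time zones.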

Quantify the impact

Three numbers turn an abstract incident into a concrete business case for fixing it:

Users affected — by count or by percentage of monthly actives
Revenue impact — direct revenue loss during outage
SLO burn — percentage of monthly error budget consumed in this single incident

A 40-minute outage that burned 35% of your monthly error budget is a different conversation from one that burned 2%.
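SLO burn can be estimated with simple arithmetic. The sketch below assumes an availability-style SLO measured over a 30-day window and an outage in which a fixed fraction of requests failed. The function names, the 30-day window, and the `failure_fraction` parameter are illustrative assumptions, not part of any particular monitoring tool.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full unavailability the SLO tolerates over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1.0 - slo_target)


def budget_burn_percent(
    outage_minutes: float,
    slo_target: float,
    window_days: int = 30,
    failure_fraction: float = 1.0,
) -> float:
    """Percentage of the window's error budget consumed by one incident.

    failure_fraction is the share of requests that failed during the
    outage (1.0 means a total outage).
    """
    bad_minutes = outage_minutes * failure_fraction
    return 100.0 * bad_minutes / error_budget_minutes(slo_target, window_days)


# A 40-minute total outage against a 99.9% target over 30 days:
print(round(budget_burn_percent(40, 0.999), 1))  # 92.6
```

Under these assumptions the same 40-minute outage burns very different shares of the budget depending on the SLO target and on how many requests actually failed, which is why the percentage is more informative than the raw duration.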

The postmortem meeting

For SEV-0 and SEV-1, hold the meeting within 7 days with all responders present. The facilitator (often someone other than the on-caller) reads the blameless statement aloud, walks the group through the timeline, and runs the 5-whys. The goal of the meeting is agreement on action items. It is not to identify who messed up.

5-whys done right

The 5-whys finds the deepest layer of cause. Example for a realistic outage:

Why did the API return 500s? → DB pool exhausted
Why was the pool exhausted? → Connection leak in error path
Why was the connection leaked? → Transaction not closed when an exception fired
Why was the exception path not closing transactions? → No integration test for that path
Why is there no integration test? → PR template doesn't ask for error-path test (root cause / action item)

Stop when the answer is a system property. If your answer at depth 5 is "Alice forgot," ask another why — "why is it possible to forget?"
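The stopping rule above can be recorded alongside the chain itself. This is a minimal Python sketch, assuming the facilitator marks each answer by hand as pointing at a person or not. The `Why` class, the `blames_person` flag, and `needs_another_why` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Why:
    question: str
    answer: str
    blames_person: bool = False  # set by the facilitator, not inferred


def needs_another_why(chain: list[Why]) -> bool:
    """True if the chain is empty or its last answer points at a person."""
    return not chain or chain[-1].blames_person


chain = [
    Why("Why did the API return 500s?", "DB pool exhausted"),
    Why("Why was the pool exhausted?", "Connection leak in error path"),
    Why("Why was the connection leaked?",
        "Transaction not closed when an exception fired"),
    Why("Why was the exception path not closing transactions?",
        "No integration test for that path"),
    Why("Why is there no integration test?",
        "PR template doesn't ask for error-path test"),
]

print(needs_another_why(chain))  # False: the last answer is a system property

# If depth 5 had blamed a person instead, the chain is not finished.
blaming = chain[:4] + [Why("Why is there no integration test?",
                           "Alice forgot", blames_person=True)]
print(needs_another_why(blaming))  # True
```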

Detection gap

Always ask: why didn't monitoring catch this earlier? The answer often reveals an alert that should have existed (DB pool utilisation), a runbook that was stale, or a dashboard nobody looks at. Action items from detection-gap analysis tend to have the highest long-term value.

Action items: SMART, owned, due-dated

Each action item is Specific, Measurable, Achievable, Relevant, and Time-bound. It is assigned to a named person, has a due date and a priority (P0, P1, P2, P3), and is tracked to closure in your normal sprint planning. The most common postmortem failure mode is "write action items, never do them." Make them tickets, assign them, close them.
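A small check can catch action items that are missing an owner, a due date, a priority, or a ticket before the postmortem is published. The following Python sketch is one possible shape for that check. The `ActionItem` fields, the `problems` function, the owner name, and the ticket ID are hypothetical and not tied to any specific tracker.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

PRIORITIES = ("P0", "P1", "P2", "P3")


@dataclass
class ActionItem:
    title: str
    owner: Optional[str] = None      # a named person, not a team alias
    due: Optional[date] = None
    priority: Optional[str] = None   # one of PRIORITIES
    ticket: Optional[str] = None     # hypothetical tracker ID
    done: bool = False


def problems(item: ActionItem, today: date) -> list[str]:
    """List the ways an action item falls short of owned, due-dated, tracked."""
    issues = []
    if not item.owner:
        issues.append("no named owner")
    if item.due is None:
        issues.append("no due date")
    elif not item.done and item.due < today:
        issues.append("overdue")
    if item.priority not in PRIORITIES:
        issues.append("priority must be one of P0-P3")
    if not item.ticket:
        issues.append("not tracked as a ticket")
    return issues


items = [
    ActionItem("Update PR template to ask for error-path tests",
               owner="alice", due=date(2024, 2, 1), priority="P1",
               ticket="ENG-101"),
    ActionItem("Add alert on DB pool utilisation"),
]

for item in items:
    print(item.title, "->", problems(item, today=date(2024, 1, 20)) or "ok")
```

Running a check like this at review time and again during sprint planning addresses the "write action items, never do them" failure mode directly.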

Publish — internally, blamelessly

Postmortems are organisational learning. Share with engineering, support, leadership. The pattern an incident exposed may be present in three other systems nobody's looking at. Public postmortems (like Cloudflare's) demonstrate trust; internal-only is fine for proprietary detail. Never name-shame.
