Why should you wait before writing a postmortem instead of doing it the same day?

Same-day postmortems are written while people are still tired and looking for a quick close, which tends to produce a blame-shaped conclusion instead of a systemic one. Waiting at least a day lets the adrenaline wear off so the team can ask one more "why" and find the actual gap, such as a missing alert or an untested rollback, rather than settling on a person's in-the-moment judgment call. If your postmortem template doesn't force this delay, you will ship the blame version by default.

Who should be in the room when a postmortem is written, besides the engineer involved in the incident?

Include the on-call engineer who made the decision, someone from the team that owns the affected system or monitoring, and ideally someone from a completely unrelated team who can ask naive questions. Fresh eyes catch systemic gaps faster than people who are too close to the specific decision to see the pattern around it, as in the case where a dashboard looked healthy for eleven days because nobody outside the immediate team questioned why it never changed.

How do you stop postmortem action items from just sitting in a spreadsheet unfixed?

Triage action items into the same sprint planning as feature work so they compete for real engineering capacity, rather than parking them in a separate tech-debt backlog. Tie remediation explicitly to SLO or error-budget risk, have leadership ask about follow-through on the same cadence as roadmap reviews, and track time-to-close as a metric: an action item open for four months is functionally the same as having no action item at all.

Do you need expensive incident management tooling to make blameless postmortems work?

No. Tooling matters far less than whether someone with the authority to reprioritize work actually reviews open items on a schedule. Teams have bought sophisticated platforms and seen no improvement in repeat-incident rate, while teams running the whole process from a plain wiki page with real discipline have cut repeat incidents by more than half within two quarters. A shared document with a status column and a monthly fifteen-minute review can outperform an expensive platform that nobody configured to enforce follow-up.

Incident Response Playbooks: Postmortems That Change the System, Not the People

Incidents are stress tests for your sociotechnical system. If postmortems end with “human error” or a single scapegoated owner, you will see repeat failures, because blame gives people an incentive to hide systemic gaps instead of fixing them.

Clarify roles before the pager fires: the incident commander coordinates, the communications lead handles customers, and the scribe captures timeline facts. Rotating these roles builds skill across the team without creating dependence on a single hero.

Playbooks should live next to services: how to fail over, how to drain traffic, where logs live, which dashboards matter. During the incident, prefer short status updates on a single channel; after, preserve timelines with UTC stamps and decision rationale.
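One way to keep timelines consistent is to record every entry with a timezone-aware timestamp and normalize it to UTC when it is rendered. The sketch below is an illustration only; the `TimelineEntry` class, its fields, and the output format are hypothetical choices, not a prescribed schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class TimelineEntry:
    """One scribe entry: who did what, when, and why (hypothetical schema)."""

    actor: str
    event: str
    rationale: str = ""
    # Default to the current time as a timezone-aware UTC datetime.
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def render(self) -> str:
        # Convert to UTC before formatting so entries written from
        # different local timezones sort and read consistently.
        stamp = self.at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
        line = f"{stamp} {self.actor}: {self.event}"
        return f"{line} (why: {self.rationale})" if self.rationale else line
```

Storing the rationale next to the event keeps the decision context attached to the fact, so the later write-up does not have to reconstruct it from memory.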

Blameless does not mean consequence-free; it means focusing on the conditions that allowed the mistake. Ask why safeguards were missing: no canary, no feature flag, an unclear rollback path, a gap in test coverage.

Action items need owners and due dates tracked like product work. Tie remediation to SLO risk: if a gap threatens the error budget, prioritize it explicitly in the next sprint.
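To make "threatens the error budget" concrete, the usual arithmetic is: the budget is the share of requests the SLO allows to fail, and the remaining budget is whatever has not yet been spent in the current window. The function below is a minimal sketch under that request-based assumption; the name `error_budget_remaining` and its parameters are hypothetical, and time-based SLOs would need a different calculation.

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Return the fraction of the error budget still unspent.

    slo_target is a success ratio such as 0.999. A negative result means
    the budget for this window is already overspent.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    # Number of failures the SLO tolerates over this many requests.
    allowed_failures = (1 - slo_target) * total_requests
    if allowed_failures == 0:
        return 0.0  # no traffic yet, so there is no budget to spend
    return 1 - failed_requests / allowed_failures
```

With a 99.9% target and one million requests, the SLO tolerates roughly 1,000 failures, so 250 failures leaves about three quarters of the budget. An action item whose underlying gap could plausibly burn the rest is a candidate for the next sprint.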

Practice with game days on non-critical paths. Synthetic incidents reveal whether runbooks are accurate and whether permissions are actually granted to on-call engineers.

A worked example: the postmortem that almost blamed the wrong layer

A payments team we worked with had a two-hour outage in which a deploy triggered cascading timeouts across three downstream services. The first draft of the postmortem, written the same day under pressure, concluded that the on-call engineer "deployed without checking the dashboard." That sentence would have closed the incident with a training reminder and changed nothing else in the system. It also happened to be false in a useful way: the dashboard in question had been silently broken for eleven days, showing stale green checks because a metrics exporter had crashed and nobody noticed.

The revised postmortem, written two days later once the initial adrenaline wore off, asked a different question: why did a broken dashboard look identical to a healthy one? The answer was that the dashboard had no self-monitoring—no alert for stale data, no heartbeat check on the exporter. That is a fixable, unglamorous gap, and fixing it prevents the next five incidents that would have hidden behind the same false green light, not just this one. The engineer who deployed was never at fault; the fault was a monitoring system that lied by omission.
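The missing self-monitoring can be sketched in a few lines. This is a minimal illustration, assuming the dashboard can read the timestamp of the exporter's most recent sample; `check_freshness`, `dashboard_status`, and the five-minute `MAX_SAMPLE_AGE` threshold are hypothetical names and values, not details of the team's actual system.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold: how old the newest sample may be before the
# data is treated as stale.
MAX_SAMPLE_AGE = timedelta(minutes=5)


def check_freshness(last_sample_at, now=None, max_age=MAX_SAMPLE_AGE):
    """Return (is_fresh, age) for a metric's most recent sample.

    last_sample_at is None when the exporter has never reported.
    """
    now = now or datetime.now(timezone.utc)
    if last_sample_at is None:
        return False, None
    age = now - last_sample_at
    return age <= max_age, age


def dashboard_status(check_passed, last_sample_at, now=None):
    """Combine a check result with the freshness of its underlying data.

    Stale data is reported as UNKNOWN, never OK, so a crashed exporter
    cannot keep showing a green check.
    """
    fresh, _ = check_freshness(last_sample_at, now)
    if not fresh:
        return "UNKNOWN"
    return "OK" if check_passed else "FAILING"
```

The design choice that matters is the third state: a check that can only be green or red has no way to say "I have not heard from the exporter," which is how a broken dashboard ends up looking identical to a healthy one.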

This is the pattern worth internalizing: the first draft of almost every postmortem blames a person or a single decision, because that is the simplest story available under time pressure and the one that requires the least further investigation. The useful draft comes from asking one more "why" than feels natural—not five whys mechanically applied, just enough to reach a system property (a missing alert, an untested rollback, a runbook nobody had opened in a year) rather than a person's judgment call in the moment. If your postmortem template does not force a delay of at least a day between the incident and the write-up, you will ship the blame-shaped version by default, because it is the one written while everyone is still tired and looking for a quick close.

It also matters who is in the room when the second draft gets written. The on-call engineer who made the deploy decision should be present, but so should someone from the team that owns the dashboard, and ideally someone from a completely unrelated team who can ask naive questions without the shared assumptions that let a broken monitor go unnoticed for eleven days in the first place. Fresh eyes catch the systemic gap faster than the people closest to the incident, who are often too close to the specific decision to see the pattern around it.

What separates a review that changes behavior from one that gets filed away

Most teams that adopt blameless postmortems get the language right and the outcomes wrong. The review is genuinely blameless, the meeting is well facilitated, everyone nods at the right moments—and then the action items sit in a spreadsheet nobody revisits until the next similar incident forces a re-read. The failure at that point is not cultural. It is a tracking and prioritization problem, and it deserves the same rigor you would apply to a product backlog, not a separate, lower-status process that only gets attention right after something breaks.

Three things distinguish teams where postmortems actually change the system, in our experience running this process across different orgs. First, action items get triaged into the same sprint planning as feature work, competing for the same engineering capacity, rather than living in a separate "tech debt" graveyard that only gets attention during a slow quarter that rarely arrives. Second, someone re-reads the last quarter's postmortems before writing a new one, explicitly checking whether this incident is a repeat of an unfixed gap—if it is, that fact belongs in the opening summary, not buried three paragraphs into the timeline where it is easy to skim past. Third, leadership asks about postmortem follow-through in the same review cadence as they ask about roadmap progress, which signals, in a way that a values statement in an onboarding doc never will, that the work is taken as seriously as it is claimed to be.

None of this requires elaborate tooling. A shared document with a status column and a monthly fifteen-minute review of open items outperforms an expensive incident management platform that nobody configured to actually enforce follow-up. We have seen teams buy sophisticated tooling and see no improvement in repeat-incident rate, and we have seen teams run the whole process from a plain wiki page with real discipline behind it and cut their repeat incidents by more than half within two quarters. The tooling matters far less than whether someone with the authority to reprioritize work actually looks at the list on a schedule, and says no to new feature requests when an unresolved action item is sitting on a known risk to the error budget.

The last thing worth measuring, and rarely tracked, is time-to-close on action items themselves. An action item open for four months is functionally the same as no action item at all: the gap it was meant to close has been sitting there the whole time, waiting for the next incident to find it again. Track time-to-close and the count of open items alongside your incident count and your MTTR, and they will tell you more about whether your postmortem culture is real or performative than any survey question ever will. A rising open count with a stable incident count is an early warning that your review process has quietly become theater, well before anyone notices the trend in a retro. Treat these numbers as leading indicators worth a dashboard of their own, not a footnote in a quarterly slide deck nobody opens until the next audit.
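A time-to-close report does not need a platform; it can be computed from whatever list the action items already live in. The sketch below assumes each item records an open date and an optional close date. The `ActionItem` fields and the `time_to_close_report` function are hypothetical, chosen only to illustrate the calculation.

```python
from dataclasses import dataclass
from datetime import date
from statistics import median
from typing import Optional


@dataclass
class ActionItem:
    title: str
    owner: str
    opened: date
    closed: Optional[date] = None  # None while the item is still open


def time_to_close_report(items, today):
    """Summarize how fast items close and how old the open ones are."""
    closed_days = [(i.closed - i.opened).days for i in items if i.closed]
    open_ages = [(today - i.opened).days for i in items if i.closed is None]
    return {
        "median_days_to_close": median(closed_days) if closed_days else None,
        "open_count": len(open_ages),
        "oldest_open_days": max(open_ages, default=0),
    }
```

Reporting the age of the oldest open item separately from the median keeps a single long-ignored item from disappearing behind a healthy-looking average.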
