Conducting post-mortems after responding to incidents is a powerful practice. It enables organisational learning and helps increase service reliability. However, the terminology used in this context is misleading. In this article, I propose a new definition of the term “post-mortem” and point out four key aspects of the post-mortem practice: culture, structure, facilitation and documentation.
Post-Mortems in the Field – an Observation
In 2020 it became popular for teams at my company to publish post-mortem reports after incidents. Having participated in on-call rotations for incident response for many years as a software and Linux system engineer, I love reading in-depth reports of incidents.
One day I received an email from a colleague notifying me of a newly published post-mortem report. Having some spare time, I started reading it. In particular, I like to read the timeline of an incident to understand how the events unfolded. In this case, it seemed as if the team’s attention had been caught by a “red herring” that led them down a “rabbit hole”: the team had been investigating a presumed network issue for four hours until they realised the outage was being caused by something completely different. A couple of minutes later they were able to restore the service, without the help of the network team.
- red herring: a clue or piece of information which is or is intended to be misleading or distracting. “the book is fast-paced, exciting, and full of red herrings”
The total downtime was five hours, and in hindsight it turned out that the team had been following a “red herring” for over four of those five hours. It seemed odd to me that this was not discussed further in the report.
I contacted the colleague to share some thoughts I had taken from a presentation by John Allspaw:
For every incident that has a “red herring” episode … capture the red herring part of the story in detail in the write-up, especially on what made following the “rabbit hole” seem reasonable at the time.
—John Allspaw, “Findings from the Field: Two Years of Studying Incidents Closely”, DevOps Enterprise Summit London-Virtual 2020
One statement in my colleague’s response surprised me a lot:
It occurs to me that we write and proofread post-mortems, but never use them as a learning tool within our team.
It turned out that for that particular team the term “post-mortem” was synonymous with “written record of an incident”, specifically a written record based on a particular template. Hence the use of statements such as “we write post-mortems”. More surprisingly, I also found out that each of those records was written by a single engineer working alone. After writing it, the engineer would publish it and notify a group of people, and that was the end of the story. No questions asked, no collaborative process, no review, no meeting in which the people involved in the incident would come together to discuss what happened and learn from it.
I was utterly surprised at how different that team’s interpretation of “post-mortem” was compared to my understanding.
Misleading Post-Mortem Terminology
After reading lots of articles on post-mortems, it became apparent to me that the terminology in this context is insufficient and even misleading. Many articles contained phrases like “how to write a post-mortem” or “post-mortem templates”. Those phrases convey the notion that a post-mortem is primarily a document, one that is written after an incident.
Compare the following definition from ITIL 4:
A post-mortem is a formal record of an incident in terms of its impact, resolution/mitigation efforts, causes, and measures to prevent recurrence.
—“ITIL 4: High-velocity IT”, Chapter 188.8.131.52 “Blameless post-mortems”
And a similar definition from the seminal book Site Reliability Engineering:
A postmortem is a written record of an incident, its impact, the actions taken to mitigate or resolve it, the root cause(s), and the follow-up actions to prevent the incident from recurring.
Interestingly enough, John Lunney and Sue Lueder, the authors of the definition above, reference an article written by John Allspaw in 2012, titled Blameless PostMortems and a Just Culture. However, as I read it, John Allspaw describes something far more comprehensive than just “a written record of an incident”.
On the Importance of Terminology
Language is important: it shapes the way we think. I believe we should be careful about how we use the term “post-mortem” and how we define its meaning.
Terminology is a necessity for all professionals involved in the representation, expression, communication and teaching of specialized knowledge. Scientists, technicians or professionals in any field require terms to represent and express their knowledge to inform, transfer or buy and sell their products. There is no specialty that does not have specific units to denominate their concepts.
—Besharat Fathi, Some Important Reasons for Studying Terminology
I propose a new definition of the word “post-mortem” which, I hope, describes the concept more clearly and more holistically:
post-mortem | pəʊs(t)ˈmɔːtəm |
the practice of recording, analysing and discussing an incident soon after it has occurred, especially in order to understand how the incident occurred and to learn from it: a post-mortem was conducted after the service disruption.
My proposal is based on the meaning of the word “post-mortem” as per the Oxford Dictionary provided by Google (retrieved in March 2021).
How to conduct post-mortems is just as important as actually conducting them. In 2016 John Allspaw, Morgan Evans and Daniel Schauenberg published a Debriefing Facilitation Guide (PDF Version) for post-mortem debriefings. In it, they emphasise the importance of the skills required to facilitate effective post-mortems.
How a postmortem debriefing (hereafter “debriefing”) is done (in whatever form it takes) is at the core of the approach, and therefore hinges on the expertise of the debriefing facilitator
—John Allspaw, Morgan Evans, Daniel Schauenberg, Debriefing Facilitation Guide
Based on that, I think there are four key aspects to the practice of conducting effective post-mortems:
- Culture: an effective post-mortem requires psychological safety, a “restorative just culture” as described in “Just Culture” in 4 short courses by Sidney Dekker.
- Meeting Structure: a post-mortem is a collaborative, co-creative process with the purpose of shared organisational learning and improvement. This requires people to come together to discuss the various and multi-faceted aspects of an incident, in order to learn from and with each other.
- Facilitation: the ability of a facilitator to ask the right questions, create a favourable atmosphere and keep the people involved focused on learning is crucial for a truly successful post-mortem. It requires both expertise and experience. Without such facilitation, people inexperienced with the post-mortem practice tend to drift into merely wanting to prevent a future event from happening, justifying their decisions and actions, or, even worse, blaming.
- Documentation: the documentation of a post-mortem is one building block of the practice as a whole. The documentation is created throughout the post-mortem process and is then published to enable widespread organisational learning.
Establishing an effective post-mortem practice requires all four of these key aspects.
A Final Note on Terminology
I hope you now agree that a post-mortem is a comprehensive practice and, as such, much more than a “formal record of an incident” as suggested in ITIL 4, and much more than a “written record of an incident” as suggested by John Lunney, Sue Lueder et al. in the seminal book Site Reliability Engineering.
To help others understand more easily and to avoid misunderstanding, I invite you to use formulations such as “document a post-mortem” (or “write a post-mortem report”) and “publish post-mortem documentation”, and to avoid misleading statements like “write a post-mortem” or “publish a post-mortem”.
The following is a list of resources I found useful for delving into the topic of post-mortems and Just Culture:
- “Advanced Postmortem Fu and Human Error 101” by John Allspaw at Velocity Conference 2011, 2011
- Blameless PostMortems and a Just Culture, 2012
- Debriefing Facilitation Guide by John Allspaw, Morgan Evans, Daniel Schauenberg, 2016
- “Just Culture” in 4 short courses by Sidney Dekker, 2015
- Stella Report by SNAFUcatchers, 2017
- “Working at the Center of the Cyclone” by Richard Cook at the DevOps Enterprise Summit in London 2018, 2018
- “The Bone Talk: Resilience and Resilience Engineering” by Richard Cook at the DevOpsDays Chicago 2020, 2020
- Learning from Incidents
- “Top 12 Best Practices for Better Incident Management Postmortems” by The New Stack, 2020
- Culturing Resiliency with Data: a Taxonomy of Outages, 2020
- PagerDuty Incident Response, 2017
- John Allspaw - Findings From the Field: 2 Years of Learning From Incidents, 2020
Templates for writing post-mortem documentation: