BreachExchange mailing list archives
IT Postmortems: How to continuously improve by learning from failure and success
From: Destry Winant <destry () riskbasedsecurity com>
Date: Fri, 24 Aug 2018 08:50:54 -0500
http://www.bmc.com/blogs/it-postmortems/

The worst thing that can happen to an IT team is a production outage – critical systems, services, or data are unavailable. No matter what, you immediately go from a normal day to feeling stressed, angry, frustrated, and pressured to get it fixed ASAP.

Once you have the problem fixed and major systems restored, you probably want to forget the whole thing ever happened. Don’t. Instead, reflect on what went wrong to determine a way to minimize the chances this exact outage will happen again.

The good news is you don’t have to create a process for this reflection from scratch. The even better news is you can do it in just a couple of hours, if you’re really focused. The process is called an IT postmortem, and you can follow some very specific templates, too.

What is a postmortem?

Performing a postmortem may sound a little dark and depressing, but it’s actually meant to shed light on a significant problem. A postmortem process comes at the end of a project and helps you both determine and analyze successes, non-successes, and failures. The outcome of this process is usually a report that aims to inform best practices and mitigate risks in the future. You may know it by other names, like lessons learned.

In IT, postmortems tend to be very focused: they follow a severe problem, such as an event that has an immediate impact on users. This could be an outage, downtime, or a data loss.

The problem with IT postmortems

The idea of an IT postmortem probably isn’t foreign to you. In fact, maybe you’ve been involved in one but decided to scrap it for more “important” work. Or maybe you filed the report, but now that it’s hidden away somewhere, the recommendations therein haven’t been adopted. These are the two biggest problems with creating IT postmortems: people dismiss them as non-essential, and the reports aren’t always read, let alone adopted, by the people who can effect change.
Because of this, many people immediately see postmortems as an unworthy investment of time and resources. Depending on the workplace, you may think that it’s just a blame game: determining who did what incorrectly at a moment of significance. Or you may just think your memory is better than it actually is, that you’ll remember what to do or not do the next time this arises.

For a postmortem to be useful, it must provide specific recommendations for changes, such as changes to policy or processes. If it’s just documenting for documenting’s sake, it’s a waste of everyone’s time.

Creating a good IT postmortem

The responsibility to research, write, and publish a postmortem report lies with the project manager or the person most responsible for a particular outage or data loss. (By responsible for, we mean the person who immediately begins fixing it, not the person who caused it – many times, these outages occur without human interference.)

An IT postmortem report does not need to be complicated. In fact, its simplicity encourages completion. A good report should list a lot of information, but most of it is readily known or quickly determined upon addressing the problem.

Reports should include the following information:

- Report details: title, date, authors, status, and summary
- Problem details: size and time of event, software used, impact and objects, detection
- Resolution details: triggering event(s), root cause(s), and who worked on it
- Recommendations: lessons learned and action items to effect change

For a little more structure, make sure your IT postmortem answers these questions:

What happened?

Why did it happen? This can include:

- Identifying major events and isolating root causes, if possible
- Looking at technical pieces: Were design, process, or poor maintenance the underlying cause or the trigger that led to a technical failure?
- Looking at non-technical pieces: How did organization, management, and team environment help or hinder the resolution of the problem?
- What about the effect of things like culture, time crunches, and budget pressures?

How did the team respond?

- Include each attempt to fix something, whether it resulted in a fix or not.

What steps will prevent this from occurring again?

- This is the crucial step, so it might feel the hardest.
- Create an action plan that continues to implement the successes and begins to address what didn’t work.
- Be bold in identifying big, sweeping changes that need to occur but might be beyond your authority or budget.
- Identify smaller changes that take no time or money to implement, perhaps just a process change or an added step that can verify something.

Tips for conducting IT postmortems

- Do it right away! The time for a postmortem is immediately after you’ve wrapped the project or as soon after the triggering incident as possible, especially if it had an immediate impact on users, such as an outage, downtime, or data loss. The postmortem process should be built into your scheduling. If not, you lose precious recall around exactly what happened and how good or bad something was. We tend to remember really bad things, gloss over other things, and forget our successes.
- Do it quickly. Do not spend a lot of time on this – as the project manager, you should have the answers to most tracked items. It can take a quick half-day or even just 1-2 hours to provide impactful information.
- Use a tried and true template. You’re not writing award-winning stuff here; it’s the content and recommendations that matter. A good template should track a lot of things, worrying little about how well written it is. A strong template also helps you get a postmortem completed in an afternoon, so that it doesn’t have to take a long time. (A quick online search turns up dozens of templates – experiment to find what works best for your team.)
- Involve more parties. Different people have different insights, and involving the whole team prevents scapegoating. This can be as simple as asking each person who was involved to send the thing they thought went best when dealing with the problem and the thing that went worst.
- Ignore punishment. The point of a postmortem is not to find fault, place blame, or punish. The point is to improve, so encourage honesty.
- Track positives and negatives. Not all postmortems have to be gloom and doom – some can highlight positives in a process that you may not have been aware of. In that case, perhaps your recommendation is to roll out these positives more widely.
- Publish the report. Postmortems don’t have to lurk in a basement storage area, among old files. In fact, you don’t even have to print it out – simply share the findings with the team, the department, or the company and decision makers as a whole, whatever makes sense for your work environment. A bonus: publishing will help you keep things short and concise, too!

The outcome of (and attitude around) IT postmortems won’t improve if you continue to minimize their importance. Next time you create a postmortem, consider following a reliable template and commit to implementing the changes, as much as you have the authority to do so.