In Part 1 of this blog post, I explained why it’s important to operationally test your disaster recovery (DR) plan, and why a tabletop session is insufficient. In this post, I’d like to talk about what an operational recovery exercise might look like.
An exercise isn't something you do off the cuff. There are several different approaches you can take, but they all require forethought and preparation.
In between an operational recovery test and a tabletop is a walk-through—rather like a “tabletop plus” because a walk-through also includes an inventory. As you talk through your response to a hypothetical incident, you perform an “eyeball” inventory: are the fire extinguishers in place and inspected? Are the backup tapes in the bin where they’re supposed to be? Are the relevant SOPs on the SharePoint site in the folder where you think they are, and are they current?
In terms of actual operational tests, there are two flavors:
- Parallel processing, where you continue to operate on your primary systems as usual while a select subset of users works from the backup systems to confirm they can still get work done; and
- A simulation, where you actually turn off your primary systems, initiate your recovery plan, and see what truly works and what doesn't.
Simulations take time, training, correct documentation, and current backups. If you don't have a mature recovery capability, something will go wrong, especially the first time through, and that could very well disrupt normal operations. But that's the nature of disasters: they're disruptive to normal operations.
Many organizations are understandably hesitant to perform a simulation. But that mindset creates a Catch-22: you don't think your recovery capability will stand up to testing, so you don't test it. Thus you don't know what you don't know, and when a real disaster strikes, your recovery may not be as efficient and effective as you need it to be. As I asked rhetorically in Part 1 of this post: Do you want to find this out in a controlled, simulated environment, or in an actual disaster?
Because it really shows you what works and what doesn’t, a simulation is the foundation of continuous improvement in your recovery capability. The first test-drive may be bumpy. But then you’ll have a wealth of recovery-related considerations and experience that you can incorporate into everyday operations to provide you with a far more robust recovery capability.
If you run simulations regularly and incorporate lessons learned into everyday operations, you can achieve resiliency. Resiliency is the recovery end-state where everyone is so comfortable with how and what to do if an event occurs that there’s no guesswork—critical functions continue within their recovery time objectives (RTOs) and customers may not even notice you had an issue.
Having said all of that, the question remains: How do you develop a realistic and worthwhile exercise scenario? There's considerable homework involved. Start by taking a look at your risk assessment to see what disruptions are most likely to occur, and pick one. Consider designing a scenario that lets you exercise your incident response (IR) capability first and then invoke your recovery capability based on the results.
If your industry is getting slammed by viruses and data breaches, you might pick a cyberattack or malware infection as the cause of the outage for your exercise.
For example, start with the help desk. Tell them there's a simulated system failure. They should refer to their procedures (which you read when you wrote the scenario, so you know what they would do) and escalate the problem to the IR team. When the IR team comes up with an estimated time to repair (ETR) that exceeds the RTO for the impacted system(s) (which it will, because you figured that out when you wrote the scenario), the IR team should coordinate with the business continuity/disaster recovery folks, kicking off your BC/DR plan.
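To make that escalation trigger concrete, here's a minimal sketch in Python of the ETR-versus-RTO check that hands an incident off to the BC/DR plan. The system names and hour values are hypothetical, not taken from any real plan:

```python
# Hypothetical recovery parameters for the systems in the exercise scenario;
# the names and hour values are illustrative, not from any real plan.
RTO_HOURS = {
    "System X": 4,   # critical customer-facing system
    "System Y": 24,  # internal reporting
}

def should_invoke_bcdr(system: str, etr_hours: float) -> bool:
    """True when the estimated time to repair exceeds the system's RTO,
    i.e., the IR team should hand off to the BC/DR plan."""
    return etr_hours > RTO_HOURS[system]

# In the scripted scenario, the IR team's ETR is designed to exceed the RTO:
if should_invoke_bcdr("System X", etr_hours=12):
    print("ETR exceeds RTO: coordinate with BC/DR and invoke the plan")
```

The point isn't the code itself; it's that the decision rule is simple and scriptable, so you can bake the "ETR exceeds RTO" moment into your scenario deliberately.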
Here's another example scenario: You have a fire, which kicks off your emergency response protocol. How do you get people out the door? How do you handle evacuation, building shutdown, and so on? Then have the incident's impact result in outages that exceed the RTOs for key systems, which should trigger implementation of the BC/DR plan.
But since your scenario is simulated, how do you move it forward? You develop what I call injects (the US federal government calls them a Master Scenario Events List, or MSEL, pronounced "measles"… really). Regardless of what you call them, these are the tidbits of information that allow an exercise to move forward.
You might walk into the help desk and hand them a scrap of paper that says: "Five users have called in the last ten minutes saying they can't access System X." The help desk then goes through its procedures and does its assessment. A few minutes later you hand them another scrap of paper that says: "Investigation determines that the system is down." Or the database is corrupted, or whatever else you're going to put them through.
In other words, you ratchet up the activity based on a series of carefully planned, hypothetical injects that tell people what they would find if they were working from procedures and policies to respond to an actual disaster.
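Injects don't need fancy tooling; a list with an exercise-clock time, a recipient, and a message is enough to keep the facilitator on script. Here's a hypothetical sketch in Python, with made-up times and messages:

```python
# A minimal, hypothetical MSEL: each inject has an exercise-clock time,
# a recipient, and the message handed to them. All entries are made up.
msel = [
    {"time": "T+0:00", "to": "Help desk",
     "message": "Five users called in the last ten minutes; "
                "they can't access System X."},
    {"time": "T+0:15", "to": "Help desk",
     "message": "Investigation determines that System X is down."},
    {"time": "T+0:45", "to": "IR team",
     "message": "Estimated time to repair is 12 hours, exceeding the 4-hour RTO."},
]

# The facilitator works down the list, handing out each inject on schedule.
for inject in msel:
    print(f"{inject['time']} -> {inject['to']}: {inject['message']}")
```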
But what if you want to complete your exercise in four hours and your recovery capabilities don't kick in until twelve hours into an event? Using injects, you develop your exercise timeline and map it to real time. It's all part of how you design the exercise: How much "real time" do you want to simulate? How do you "fake" the passage of time so responses kick in as they would in real life?
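One simple way to handle that mapping is a fixed compression ratio: to fit a twelve-hour event into a four-hour exercise, each wall-clock minute represents three minutes of scenario time. Here's a minimal sketch in Python, assuming that ratio and a made-up start time:

```python
from datetime import datetime, timedelta

# Hypothetical compression: a 4-hour exercise stands in for a 12-hour event,
# so scenario time runs three times faster than the wall clock.
COMPRESSION = 12 / 4

def scenario_elapsed(exercise_start: datetime, now: datetime) -> timedelta:
    """Translate elapsed wall-clock time into elapsed scenario time."""
    return (now - exercise_start) * COMPRESSION

start = datetime(2024, 1, 14, 9, 0)
# Two wall-clock hours in equals six scenario hours in, so injects scheduled
# for T+6:00 (say, "alternate site is ready for users") fire now.
print(scenario_elapsed(start, datetime(2024, 1, 14, 11, 0)))  # 6:00:00
```

Scheduling your injects against the scenario clock rather than the wall clock is what lets a twelve-hour recovery play out in a four-hour session.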
Other things to consider when developing a scenario: Who needs to play, and where are they going to play from? What systems are involved? Are we going to serve coffee and donuts? (People are more likely to show up for exercises if you feed them.)
Among the key players in your scenario are your BC coordinator and possibly the command team. You'll also need someone to act as facilitator: the person who hands out the injects and tracks the exercise, usually your BC coordinator. Working with the facilitator are monitors, the people who walk around observing and evaluating responses. For example, if IT is involved in your scenario, the IT director might walk around monitoring the IT staff's responses.
Regardless of who's involved, at the end of the day everybody should provide feedback: what worked, what didn't work, what procedures weren't complete, what documentation wasn't available, what would've been "nice to have," all that stuff. Based on this feedback and experience, you then update your IR procedures and DR plan.
Then when a real incident occurs and a real disaster is declared, you’ll be ready.