Educause Security Discussion mailing list archives

Re: DR/BC Planning

From: Joe St Sauver <joe () OREGON UOREGON EDU>
Date: Mon, 10 May 2010 10:42:24 -0700

<dsarazen () UMASSP EDU> asked:

#To what degree do you conduct disaster recovery and business resumption
#planning? Do you test your plans? If so, how (I.E.: table top testing, call
#trees, fail over testing?) Are you using software or templates to write
#your plans?
#
#And finally, how do you ensure departments have completed and tested their
#plans? How do you ensure they are kept up to date.
#
#Any BRP policies/procedures you can share would be helpful.

Internet2 has had a Disaster Recovery/Business Continuity activity going
for a while now... you can see an outline of the sort of disasters that
we've talked about (hurricanes, earthquakes, facilities fires,
widespread loss of power, loss of facilities access, DDoS, etc.) and some
of the constraints that may largely determine what's a viable potential
solution, in the introductory slides from the Winter 2007 Joint Techs
meeting of the Salsa DR group -- see:
http://www.uoregon.edu/~joe/dr-bcp-bof/disaster-recovery-bof.pdf

The key slides from that talk are probably slides 12-19:

Slide 12: The Old Disaster Recovery Paradigm

-- Reciprocal shared space at a partner site
-- Data archived to tape
-- Just-in-time delivery of replacement hardware
-- Small number of key applications (typically enterprise ERP system)
-- At least some down time is acceptable
-- Pro forma/low probability of occurring
-- Is that still a realistic paradigm? NO.

Slide 13: What's Mission Critical?

-- Domain name system?
-- Enterprise SAN/NAS (data storage)
-- Enterprise Identity Management System?
-- ERP System?
-- Voice over IP?
-- Teaching and Learning System?
-- Institutional Web Presence?
-- Email and Calendaring?
-- Building control and access systems (smart building HVAC,
   elevators, door controls, alarm systems, etc.)?
-- The network itself?
-- All of the above and more?

Slide 14: What Are Today's Restoration/Recovery Time Frames?

-- Hitless/non-interruptible?
-- Restoration on the order of seconds?
-- Minutes?
-- Hours? <== I suspect this is what we need
-- Days?
-- Weeks? <== Is this where we are?
-- Longer?

-- Assertion: time to recover is a key driver.

Slide 15: Key Driver? Total Data Volume

-- How many GB/TB/PB worth of data needs to be available post-event?
-- If that data needed to be transferred over a network or restored
   from archival media post-event, how long would it take to do that?
-- What about failing back over to a primary system once the crisis
   is over (including moving all the data that's been modified during
   the outage)?
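
To put rough numbers on the transfer-time question above, here is a
back-of-the-envelope sketch. The function name, the sample data sizes,
the 1 Gbps link, and the 70% usable-throughput figure are all
illustrative assumptions, not figures from the talk; substitute your
own measurements.

```python
def transfer_hours(data_tb: float, link_gbps: float, efficiency: float = 0.7) -> float:
    """Estimate hours needed to move data_tb terabytes over a network link.

    data_tb    -- data volume in decimal terabytes (1 TB = 1e12 bytes)
    link_gbps  -- nominal link speed in gigabits per second
    efficiency -- assumed fraction of nominal speed actually achieved
                  (hypothetical default; measure your own path)
    """
    if data_tb < 0 or link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("need data_tb >= 0, link_gbps > 0, 0 < efficiency <= 1")
    total_bits = data_tb * 1e12 * 8
    usable_bits_per_second = link_gbps * 1e9 * efficiency
    return total_bits / usable_bits_per_second / 3600


if __name__ == "__main__":
    # Hypothetical volumes, all over an assumed 1 Gbps path.
    for tb in (1, 10, 100):
        print(f"{tb} TB over 1 Gbps: about {transfer_hours(tb, 1.0):.1f} hours")
```

The same estimate applies in both directions: once for the initial
restore, and again for moving modified data back when failing over to
the primary system after the crisis.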

Slide 16: Key Driver? Data Change Rate

-- If restoration has to occur from a checkpoint/periodically archived
   media, how much data would be at risk of loss since that snapshot?
-- Are the transactions which occurred since that time securely
   journaled, and can they be replayed if need be? Or would those
   transactions simply be lost?
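
The exposure described above can be estimated the same way. This is
only a sketch: the function name, the steady change rate, and the
"journaled fraction" parameter are assumptions made for illustration,
and real change rates are usually bursty rather than constant.

```python
def data_at_risk_gb(change_rate_gb_per_hour: float,
                    hours_since_snapshot: float,
                    journaled_fraction: float = 0.0) -> float:
    """Estimate how many GB of changes would be lost if restoring from a snapshot.

    change_rate_gb_per_hour -- assumed average rate at which data is modified
    hours_since_snapshot    -- time elapsed since the last good checkpoint
    journaled_fraction      -- assumed share of those changes that are securely
                               journaled and can be replayed (0.0 = none)
    """
    if change_rate_gb_per_hour < 0 or hours_since_snapshot < 0:
        raise ValueError("rates and durations must be non-negative")
    if not 0 <= journaled_fraction <= 1:
        raise ValueError("journaled_fraction must be between 0 and 1")
    changed = change_rate_gb_per_hour * hours_since_snapshot
    return changed * (1 - journaled_fraction)


if __name__ == "__main__":
    # Hypothetical case: 5 GB/hour of changes, nightly snapshot, no journal.
    print(data_at_risk_gb(5, 24), "GB at risk")
```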

Slide 17: Key Driver? Required Lower Level Infrastructure

-- Secure space with racks
-- Power and cooling
-- Local loop and wide area connectivity
-- System and network hardware
-- How long would it take to get/install/configure that lower level
   infrastructure from scratch, if it isn't already there?
-- Office space for staff?

Slide 18: Key Driver? System Complexity

-- Today's systems are complex
-- Replicating complex systems takes time and may require specialized
   expertise
-- Specialized expertise may not be available during a crisis
-- Detailed system documentation may not be available during a crisis
-- Debugging a specialized system may take time...
-- Not going to want to try rebuilding everything on a crash basis

Slide 19: Strawman Proposal/Suggestion

-- Doing disaster recovery/business continuity today requires a
   hot/spinning off-site facility with synchronized data.

But obviously disaster recovery and business continuity can cover a lot
of additional ground, including things like:

-- Real Time Notification During a Disaster or Other Emergency
   http://www.uoregon.edu/~joe/notification/emergency-notification.pdf

-- Pandemic Flu and Computer and Network Disaster Recovery Planning
   http://www.uoregon.edu/~joe/flu/flu.pdf

-- Loss of Network Control Incidents
   http://www.uoregon.edu/~joe/loss-of-network-control/loss-of-network-control.pdf

-- Electromagnetic Pulse (EMP)
   http://www.uoregon.edu/~joe/infragard-2009/infragard-eugene-2009.pdf

-- Volcanoes ("secretly," this is really a talk about particulate control
   in machine rooms, etc.)
   http://www.uoregon.edu/~joe/volcanoes/volcanoes.pdf

If you're interested, I'd encourage you to join the Internet2 Disaster
Recovery group and discuss your questions in more depth there...

To join it, send email to sympa () internet2 edu with the subject line
subscribe salsa-dr yourfirstname yourlastname

The list is typically fairly quiet, so we'd welcome your input/comments.

Regards,

Joe St Sauver