Educause Security Discussion mailing list archives
Re: DR/BC Planning
From: Joe St Sauver <joe () OREGON UOREGON EDU>
Date: Mon, 10 May 2010 10:42:24 -0700
<dsarazen () UMASSP EDU> asked:

#To what degree do you conduct disaster recovery and business resumption
#planning? Do you test your plans? If so, how (I.E.: table top testing, call
#trees, fail over testing?) Are you using software or templates to write
#your plans?
#
#And finally, how do you ensure departments have completed and tested their
#plans? How do you ensure they are kept up to date.
#
#Any BRP policies/procedures you can share would be helpful.

Internet2 has had a Disaster Recovery/Business Continuity activity going
for a while now... you can see an outline of the sort of disasters that
we've talked about (hurricanes, earthquakes, facilities fires, widespread
loss of power, loss of facilities access, DDoS, etc.) and some of the
constraints that may largely determine what's a viable potential solution,
in the introductory slides from the Winter 2007 Joint Techs meeting of the
Salsa DR group -- see:

http://www.uoregon.edu/~joe/dr-bcp-bof/disaster-recovery-bof.pdf

The key slides from that talk are probably slides 12-19:

Slide 12: The Old Disaster Recovery Paradigm

-- Reciprocal shared space at a partner site
-- Data archived to tape
-- Just-in-time delivery of replacement hardware
-- Small number of key applications (typically enterprise ERP system)
-- At least some down time is acceptable
-- Proforma/low probability of occurring
-- Is that still a realistic paradigm? NO.

Slide 13: What's Mission Critical?

-- Domain name system?
-- Enterprise SAN/NAS (data storage)
-- Enterprise Identity Management System?
-- ERP System?
-- Voice over IP?
-- Teaching and Learning System?
-- Institutional Web Presence?
-- Email and Calendaring?
-- Building control and access systems (smart building HVAC, elevators,
   door controls, alarm systems, etc.)?
-- The network itself?
-- All of the above and more?

Slide 14: What Are Today's Restoration/Recovery Time Frames?

-- Hitless/non-interruptible?
-- Restoration on the order of seconds?
-- Minutes?
-- Hours?    <== I suspect this is what we need
-- Days?
-- Weeks?    <== Is this where we are?
-- Longer?
-- Assertion: time to recover is a key driver.

Slide 15: Key Driver? Total Data Volume

-- How many GB/TB/PB worth of data needs to be available post-event?
-- If that data needed to be transferred over a network or restored from
   archival media post-event, how long would it take to do that?
-- What about failing back over to a primary system once the crisis is
   over (including moving all the data that's been modified during the
   outage)?

Slide 16: Key Driver? Data Change Rate

-- If restoration has to occur from a checkpoint/periodically archived
   media, how much data would be at risk of loss since that snapshot?
-- Are the transactions which occurred since that time securely
   journal'd, and can they be replayed if need be? Or would those
   transactions simply be lost?

Slide 17: Key Driver? Required Lower Level Infrastructure

-- Secure space with rackage
-- Power and cooling
-- Local loop and wide area connectivity
-- System and network hardware
-- How long would it take to get/install/configure that lower level
   infrastructure from scratch, if it isn't already there?
-- Office space for staff?

Slide 18: Key Driver? System Complexity

-- Today's systems are complex
-- Replicating complex systems takes time and may require specialized
   expertise
-- Specialized expertise may not be available during a crisis
-- Detailed system documentation may not be available during a crisis
-- Debugging a specialized system may take time...
-- Not going to want to try rebuilding everything on a crash basis

Slide 19: Strawman Proposal/Suggestion

-- Doing disaster recovery/business continuity today requires a
   hot/spinning off site facility with synchronized data.
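As a rough illustration of the questions raised on slides 15 and 16, here
is a minimal back-of-the-envelope sketch in Python. The function names,
the 70% usable-link efficiency, and the sample figures (50 TB of data, 1
and 10 Gbps links, 5 GB/hour of change, 24 hours since the last snapshot)
are all assumptions made up for illustration; none of them come from the
talk.

```python
def transfer_hours(data_tb: float, link_gbps: float, efficiency: float = 0.7) -> float:
    """Hours needed to move data_tb terabytes over a link_gbps link.

    efficiency is an assumed fraction of the nominal link rate that is
    actually usable (protocol overhead, contention, etc.).
    """
    bits_to_move = data_tb * 1e12 * 8            # decimal TB -> bits
    usable_bits_per_sec = link_gbps * 1e9 * efficiency
    return bits_to_move / usable_bits_per_sec / 3600


def data_at_risk_gb(change_rate_gb_per_hour: float, hours_since_snapshot: float) -> float:
    """Data modified since the last snapshot, i.e. what could be lost
    if the intervening transactions cannot be replayed."""
    return change_rate_gb_per_hour * hours_since_snapshot


if __name__ == "__main__":
    # Hypothetical inputs, for illustration only.
    for gbps in (1.0, 10.0):
        hours = transfer_hours(data_tb=50, link_gbps=gbps)
        print(f"50 TB over {gbps:g} Gbps: {hours:.1f} hours ({hours / 24:.1f} days)")
    print(f"Data at risk: {data_at_risk_gb(5, 24):.0f} GB")
```

With those assumed inputs, restoring 50 TB takes roughly 159 hours (about
6.6 days) at 1 Gbps and roughly 16 hours at 10 Gbps, and about 120 GB of
changes would be exposed. Substituting your own volumes and link speeds
gives a quick sense of whether an "hours" recovery target is plausible
without already-synchronized data at the recovery site.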
But obviously disaster recovery and business continuity can cover a lot
of additional ground, including things like:

-- Real Time Notification During a Disaster or Other Emergency
   http://www.uoregon.edu/~joe/notification/emergency-notification.pdf

-- Pandemic Flu and Computer and Network Disaster Recovery Planning
   http://www.uoregon.edu/~joe/flu/flu.pdf

-- Loss of Network Control Incidents
   www.uoregon.edu/~joe/loss-of-network-control/loss-of-network-control.pdf

-- Electromagnetic Pulse (EMP)
   http://www.uoregon.edu/~joe/infragard-2009/infragard-eugene-2009.pdf

-- Volcanoes ("secretly," this is really a talk about particulate control
   in machine rooms, etc.)
   http://www.uoregon.edu/~joe/volcanoes/volcanoes.pdf

If you're interested, I'd encourage you to join the Internet2 Disaster
Recovery group and discuss your questions in more depth there... To join
it, send email to sympa () internet2 edu with the subject line

   subscribe salsa-dr yourfirstname yourlastname

The list is typically fairly quiet, so we'd welcome your input/comments.

Regards,

Joe St Sauver