
Re: [outages] News item: Blackberry services down worldwide, Egypt affected (not N.A.)


From: -Hammer- <bhmccie () gmail com>
Date: Wed, 12 Oct 2011 11:06:29 -0500

I have been witness to N+1 HUMAN failures but never an N+1 hardware failure or system/design failure that warranted questioning the need for N+2. Usually your N+1 failure is (as already referenced) pasting in a bad config that gets replicated or something like that. Not saying the hardware is perfect. It's just that I haven't personally seen a full-blown failure like that without human help.

Closest example would be an update that wasn't properly vetted in dev/test before migrating to prod. I've seen a few of those that I guess you could blame on the system. Even though the humans could have tested better....

-Hammer-

"I was a normal American nerd"
-Jack Herer



On 10/12/2011 10:58 AM, Chris Campbell wrote:
I think it raises serious questions about RIM's DR strategy if a DB corruption or switch failure or whatever can cause this much outage. 'Surely' RIM have a second site, independent of the primary (within reason), that they could have flipped to when they realised the DB was borked. If not, then any business that relies on them needs to be shouting from the rooftops to get RIM to fix it.
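
Even something as crude as the loop below would at least surface the "do we flip now?" decision instead of leaving it to be argued out on a bridge call. This is only a sketch: the health URL and the promote_secondary() hook are made up, and the actual cutover (promoting the standby DB, repointing routing, re-registering clients) is the genuinely hard part.

#!/usr/bin/env python3
# Illustrative sketch of a DR "flip" decision loop.
# The health URL and promote_secondary() are hypothetical placeholders;
# a real cutover involves DB promotion, rerouting, and a human in the loop.

import time
import urllib.request

PRIMARY_HEALTH_URL = "https://primary.example.net/health"   # hypothetical
FAILURES_BEFORE_FLIP = 3       # consecutive failed probes before acting
PROBE_INTERVAL_SECONDS = 30

def primary_healthy() -> bool:
    try:
        with urllib.request.urlopen(PRIMARY_HEALTH_URL, timeout=5) as resp:
            return resp.status == 200
    except Exception:
        return False

def promote_secondary() -> None:
    # Placeholder: in practice, promote the standby DB, repoint DNS/anycast,
    # and page a human before doing anything irreversible.
    print("ALERT: primary unhealthy, initiating (or at least paging for) a flip")

def main() -> None:
    failures = 0
    while True:
        if primary_healthy():
            failures = 0
        else:
            failures += 1
            if failures >= FAILURES_BEFORE_FLIP:
                promote_secondary()
                break
        time.sleep(PROBE_INTERVAL_SECONDS)

if __name__ == "__main__":
    main()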

Chris.


On 12 Oct 2011, at 16:49, Valdis.Kletnieks () vt edu wrote:

On Wed, 12 Oct 2011 09:52:02 CDT, -Hammer- said:
What kills me is what they have told the public. They lost a "core
switch". I don't know if they actually mean a network switch or not, but
I'm pretty sure any of us who work in an enterprise environment know
how to factor in N+1 just for these types of days. And then the backup
solution failed? I'm not buying it either.
Yeah, and that extra comma in the one config file that didn't make a difference
when you tested the failover in the lab *never* makes a difference when it hits
the production network, right?  Or they changed the config of the primary and
it didn't get propagated just right to the backup, or they had mismatched firmware
levels on the blades in the primary and backup switches, so traffic that
didn't tickle a bug on the primary blades caused a blade to crash on the backup,
or...
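
Catching that second failure mode before the bad day is mostly a matter of diffing the two boxes on a schedule. A minimal sketch, assuming the running configs have already been pulled down to local files (say via RANCID or a vendor API); the file paths and the ignore list here are hypothetical and far too crude for real gear:

#!/usr/bin/env python3
# Illustrative sketch: did the config actually propagate to the backup?
# Assumes primary/backup configs are already dumped to local files;
# diffs them and exits nonzero so a cron job or monitor can alert.

import difflib
import sys

def load_lines(path):
    with open(path) as fh:
        # Skip lines that legitimately differ between peers
        # (hostnames, save timestamps) -- crude, but shows the idea.
        return [ln for ln in fh
                if not ln.startswith(("hostname", "! Last configuration"))]

def main(primary_path, backup_path):
    diff = list(difflib.unified_diff(
        load_lines(primary_path), load_lines(backup_path),
        fromfile="primary", tofile="backup"))
    if diff:
        sys.stdout.writelines(diff)
        sys.exit(1)    # nonzero exit so the drift gets noticed
    print("configs match")

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])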

Anybody on this list who's been around long enough probably has enough "We
should have had N+2 because the N+1'th device failed too" stories to drain
*several* pitchers of beer at a good pub... I've even had one case where my
butt got *saved* from an ohnosecond-class whoops because the N+1'th device *was*
crashed (I stomped a config file, it replicated, and I was able to salvage a copy
from a device that didn't replicate because it was down at the time).



