nanog mailing list archives

Re: Fiber cut in SF area


From: George William Herbert <gherbert () retro com>
Date: Mon, 13 Apr 2009 18:30:27 -0700



Matthew Petach wrote:
George William Herbert <gherbert () retro com> wrote:
 Matthew Petach writes:
 >"protected rings" are a technology of the past.  Don't count on your
 >vendor to provide "redundancy" for you.  Get two unprotected runs
 >for half the cost each, from two different providers, and verify the
 >path separation and diversity yourself with GIS data from the two
 >providers; handle the failover yourself.  That way, you *know* what
 >your risks and potential impact scenarios are.  It adds a bit of
 >initial planning overhead, but in the long run, it generally costs a
 >similar amount for two unprotected runs as it does to get a
 >protected run, and you can plan your survival scenarios *much*
 >better, including surviving things like one provider going under,
 >work stoppages at one provider, etc.

This completely ignores the grooming problem.

Not completely; it just gives you teeth for exiting your
contract earlier and finding a more responsible provider
to go with who won't violate the terms of the contract
and re-groom you without proper notification. 

That's a post-facto financial recovery / liability limitation
technique, not a high availability / hardening technique...

I'll admit
I'm somewhat simplifying the scenario, in that I also
insist on no single point of failure, so even an entire
site going dark doesn't completely knock out service;
those who have been around since the early days will
remember my email to NANOG about the gas main cut
in Santa Clara that knocked a good chunk of the area's
connectivity out, *not* because the fiber was damaged,
but because the fire marshal insisted that all active
electrical devices be powered off (including all UPSes)
until the gas in the area had dissipated.  Ever since then,
I've just acknowledged you can't keep a single site always
up and running; there *will* be events that require it to be
powered down, and part of my planning process accounts
for that, as much as possible, via BCP planning. 

I was less than a mile away from that, I remember it well.
My corner cube even faced in that direction.

I heard the noise then the net went poof.  One of those
"Oh, that's not good at all" combinations.

Now, I'll
be the first to admit it's a different game if you're providing
last-mile access to single-homed customers.  But sitting
on the content provider side of the fence, it's entirely possible
to build your infrastructure such that having 3 or more OC192s
cut at random places has no impact on your ability to carry
traffic and continue functioning.

 You have to get out of the game the fiber owners are playing.
 They can't even keep score for themselves, much less accurately
 for the rest of us.  If you count on them playing fair or
 right, they're going to break your heart and your business.

You simply count on them not playing entirely fair, and penalize
them when they don't; and you have enough parallel contracts with
different providers at different sites that outages don't take you
completely offline.

The problem with grooming is that in many cases, due to provider
consolidation and fiber vendor consolidation and cable swap and
so forth, you end up with parallel contracts with different
providers at different sites that all end up going through
one fiber link anyways.

I had (at another site) separate vendors with fiber going
northbound and southbound out of the two diverse sites.

Both directions from both sites got groomed without notification.

Slightly later, the northbound fiber was then rerouted a bit up the road,
into a southbound bundle (same one as our now-groomed southbound link),
south to another datacenter, then north again via another path,
to improve northbound route redundancy overall for the providers'
customer links.

And the shared link south of us was what got backhoed.
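
For anyone who wants to automate the check, here is a minimal sketch of
the idea, assuming you can get each circuit's physical path from the
provider as a list of segment identifiers. The circuit names, segment
IDs, and both function names are hypothetical, made up for illustration,
and the data is only as good as the last time the provider told you the
truth about the path.

```python
from collections import defaultdict


def shared_segments(circuits):
    """Return the physical segments carrying more than one circuit.

    circuits maps a circuit name to the list of physical segment IDs
    (conduit, bundle, span) that the circuit currently rides.
    """
    usage = defaultdict(set)
    for name, segments in circuits.items():
        for seg in segments:
            usage[seg].add(name)
    # Any segment used by two or more circuits is a shared-risk point.
    return {seg: names for seg, names in usage.items() if len(names) > 1}


def surviving(circuits, cut_segment):
    """Return the circuits that stay up when one segment is cut."""
    return sorted(
        name for name, segments in circuits.items()
        if cut_segment not in segments
    )


# Hypothetical paths after the regrooming: both vendors now share a bundle.
circuits = {
    "vendorA-north": ["site1-exit-n", "south-bundle", "dc2-north-alt"],
    "vendorB-south": ["site1-exit-s", "south-bundle", "dc2-south"],
}

risks = shared_segments(circuits)
up = surviving(circuits, "south-bundle")
```

With the paths above, risks flags "south-bundle" as carrying both
circuits, and up comes back empty, which is the backhoe case. Rerunning
a check like this each time a provider sends updated path data is one
way to notice a regroom before the outage does it for you.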

This was all in one geographical area.  Diversity out of area will get
you around single points like that, if you know the overall topology
of the fiber networks around the US and choose locations carefully.

But even that won't protect you against common mode vendor hardware
failures, or a large-scale BGP outage, or the routing chaos that comes
with a very serious regional net outage (exchange points, major
undersea cable cuts, etc)....

There may be 4 or 5 nines, but the 1 at the end has your name on it.


-george william herbert
gherbert () retro com


