nanog mailing list archives

Re: Centurylink having a bad morning?


From: Saku Ytti <saku () ytti fi>
Date: Mon, 31 Aug 2020 07:55:08 +0300

On Sun, 30 Aug 2020 at 20:00, Baldur Norddahl <baldur.norddahl () gmail com> wrote:

> Not really the point. BGP is designed such that if I take down the link, the
> prefixes MUST be withdrawn within a reasonable time. The self-healing aspect
> of the internet entirely depends on this. Clearly they have some kind of
> system that does not respect that by design. I am guessing they have
> something homebrewed going on with their route reflectors?

Add scale and BGP implementations can take a lot of time, even hours.
The best thing you can do is add contractual obligations, so that the
people at your provider who agree with you have some ammo. Instant is
not on the table; I'm sure that is obvious. Beyond that, it is less
than obvious what is good enough.

> It is like a plane. It is impossible to prove or even design a plane that can
> never fall out of the sky. But now we had a plane that crashed in a very bad
> way, so that plane (Centurylink) is grounded until they can prove that
> something like this cannot happen again. Which means they need to redesign
> whatever the hell they have going on here.

Nothing ever works like this; it's naive to think any RCA leads to
something being fixed so that it can never happen again. The only thing
that can be affected is the frequency of an event; removing it is not
on the cards. And usually, affecting frequency is mostly about belief,
not something provable. In addition to MTBF, questions should be raised
about MTTR. Provable MTTR efforts are far more likely to exist than
provable MTBF efforts, but if we buy in to the notion that it will
never happen again, because we is good, then no MTTR focus is needed:
why fix something that will never happen?
What if this outage took 5min to solve?
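
To put rough numbers on the MTTR point, here is a minimal sketch using the
standard steady-state availability formula, availability = MTBF / (MTBF + MTTR).
The failure rate (one event per year) and the two repair times are made-up
illustrations, not figures from this outage, and the function names are
hypothetical.

```python
HOURS_PER_YEAR = 365 * 24


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the service is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_minutes_per_year(mtbf_hours: float, mttr_hours: float) -> float:
    """Expected minutes of downtime per year for the given MTBF and MTTR."""
    unavailable = 1.0 - availability(mtbf_hours, mttr_hours)
    return unavailable * HOURS_PER_YEAR * 60


if __name__ == "__main__":
    # Assumption for illustration only: one failure per year on average.
    mtbf = float(HOURS_PER_YEAR)
    for label, mttr in [("5 hours", 5.0), ("5 minutes", 5.0 / 60.0)]:
        minutes = downtime_minutes_per_year(mtbf, mttr)
        print(f"MTTR {label}: about {minutes:.1f} minutes of downtime per year")
```

With the same event frequency, the shorter repair time cuts the expected
downtime roughly in proportion, without claiming anything about whether the
event recurs.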

-- 
  ++ytti

