nanog mailing list archives

Re: CenturyLink RCA?


From: Brielle <bruns () 2mbit com>
Date: Mon, 31 Dec 2018 08:20:52 -0700

(Forgive my top posting, not on my desktop as I’m out of town)

Wild guess, based on my own experience as a NOC admin/head of operations at a large ISP - they have an automated 
deployment system for new firmware for a (mission critical) piece of backbone hardware.

They may have tested said firmware on a chassis with cards that did not exactly match the hardware they had in actual 
deployment (i.e., the card was an older hw revision in deployed hardware), and while it worked fine there, it proceeded 
to shit the bed in production.

Or, they missed a mandatory low level hardware firmware upgrade that has to be applied separately before the other main 
upgrade.

Kinda picturing in my mind that they staged all the updates, set a timer, staggered reboot, and after the first hit the 
fan, they couldn’t stop the rest as it fell apart as each upgraded unit fell on its own sword on reboot.
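
For what it's worth, the failure mode above is what a health gate between reboots is meant to prevent. A minimal sketch in Python, not anything CenturyLink is known to run: the names staggered_rollout, upgrade and is_healthy are made up for illustration, and the two callbacks stand in for whatever your deployment tooling actually does.

```python
from typing import Callable, List, Optional, Tuple


def staggered_rollout(
    units: List[str],
    upgrade: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> Tuple[List[str], Optional[str], List[str]]:
    """Upgrade one unit at a time and halt at the first unhealthy one.

    Returns (upgraded_ok, failed_unit, untouched). The callbacks are
    hypothetical hooks into your own tooling.
    """
    upgraded_ok: List[str] = []
    for index, unit in enumerate(units):
        upgrade(unit)
        if not is_healthy(unit):
            # Stop here: every unit after this one is left untouched.
            return upgraded_ok, unit, units[index + 1:]
        upgraded_ok.append(unit)
    return upgraded_ok, None, []
```

The point of the design is that the next reboot is only triggered by a passing health check on the previous unit, never by a timer.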

I’ve been bit by the ‘this card revision is not supported under this platform/release’ bug more often than I’d like to 
admit.
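
A pre-flight check against the deployed inventory is one way to catch that before the push. Rough sketch only: the SUPPORTED_REVISIONS table, the release string and the revision names are invented for illustration, and in practice the data would come from the vendor's release notes and your own inventory system.

```python
from typing import Dict, List, Set

# Hypothetical support matrix: release -> card hardware revisions it supports.
SUPPORTED_REVISIONS: Dict[str, Set[str]] = {
    "2.1.0": {"rev-b", "rev-c"},
}


def unsupported_cards(release: str, inventory: Dict[str, str]) -> List[str]:
    """Return the slots whose card revision the release does not list.

    inventory maps a slot name to the hardware revision of the card in it.
    An unknown release supports nothing, so every slot is reported.
    """
    supported = SUPPORTED_REVISIONS.get(release, set())
    return sorted(slot for slot, rev in inventory.items() if rev not in supported)
```

If the returned list is non-empty for any production chassis, the release does not go out to that chassis, regardless of how well it behaved in the lab.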

And, yes, my eyes did start to get glossy and hazy the more I read their explanation as well.  It’s exactly the kind of 
useless post I’d write when I want to get (stupid) people off my back about a problem.

Sent from my iPad

On Dec 31, 2018, at 7:53 AM, Naslund, Steve <SNaslund () medline com> wrote:

Not buying this explanation for a number of reasons:

1.  Are you telling me that several line cards failed in multiple cities in the same way at the same time?  Don't 
think so unless the same software fault was propagated to all of them.  If the problem was that they needed to be 
reset, couldn't that be accomplished by simply reseating them?

2.  Do we believe that an OOB management card was able to generate so much traffic as to bring down the optical 
switching?  Very doubtful, which means that the systems were actually broken due to trying to PROCESS the "invalid 
frames".  Seems like very poor control plane management if the system is attempting to process invalid data and 
bringing down the forwarding plane.

3.  In the cited document it was stated that the offending packet did not have source or destination information.  If 
so, how did it get propagated throughout the network?

My guess at the time and my current opinion (which has no real factual basis, just years of experience) is that a bad 
software package was propagated through their network.

Steven Naslund
Chicago IL


One thing that is troubling when reading that URL is that it appears several steps of restoration required teams to 
go onsite for local login, etc. Granted, to troubleshoot hardware you need to be physically present to pop a line 
card in and out, but CTL/LVL3 should have full out-of-band console and power control to all core devices; we 
shouldn't be waiting for someone to drive to a location to get console or do power cycling. And I would imagine the 
first step to a lot of the troubleshooting was power cycling and local console logs.


-John



On 12/30/18 10:42 AM, Mike Hammett wrote:

It's technical enough so that laypeople immediately lose interest, yet completely useless to anyone that works with 
this stuff.



-----
Mike Hammett
Intelligent Computing Solutions
http://www.ics-il.com

Midwest-IX
http://www.midwest-ix.com

________________________________
From: "Saku Ytti" <saku () ytti fi>
To: "nanog list" <nanog () nanog org>
Sent: Sunday, December 30, 2018 7:42:49 AM
Subject: CenturyLink RCA?

Apologies for the URL, I do not know the official source and I do not 
share the URL's sentiment.
https://fuckingcenturylink.com/

Can someone translate this to IP engineer? What did actually happen?
From my own history, I rarely recognise the problem I fixed from 
reading the public RCA. I hope CenturyLink will do better.

Best guess so far that I've heard is

a) CenturyLink runs global L2 DCN/OOB
b) there was a HW fault which caused an L2 loop (perhaps HW dropped BPDU, 
I've had this failure mode)
c) DCN had direct access to control-plane, and L2 congested 
control-plane resources causing it to deprovision waves

Now of course this is entirely speculation, but intended to show what 
type of explanation is acceptable and can be used to fix things.
Hopefully CenturyLink does come out with IP-engineering readable 
explanation, so that we may use it as leverage to support work in our 
own domains to remove such risks.

a) do not run L2 DCN/OOB
b) do not connect MGMT ETH (it is unprotected access to control-plane, 
it cannot be protected by CoPP/lo0 filter/LPTS etc.)
c) do add in your RFP scoring item for proper OOB port (Like Cisco 
CMP)
d) do fail optical network up

--
 ++ytti




