nanog mailing list archives

RE: CenturyLink RCA?


From: "Naslund, Steve" <SNaslund () medline com>
Date: Mon, 31 Dec 2018 15:23:52 +0000

See my comments in line.

Steve

Hey Steve,

I will continue to speculate, as that's all we have.

1.  Are you telling me that several line cards failed in multiple cities in the same way at the same time?  Don't 
think so unless the same software fault was propagated to all of them.  If the problem was that they needed to be 
reset, couldn't that be accomplished by simply reseating them?

L2 DCN/OOB, whole network shares single broadcast domain. 

Bad design if that’s the case; that would be a huge subnet.  However, even if that were the case, you would not need to
replace hardware in multiple places.  You might have to reset it, but not replace it.  Also, for an ILEC, it is hard
to believe how long their dispatches to their own central offices took.  It might have taken a while to locate the
original problem, but they should have been able to send a corrective procedure to CO personnel, who are a lot closer to
the equipment.  In my region (Northern Illinois) we can typically get access to a CO in under 30 minutes, 24/7.  They
are essentially smart-hands technicians who can reseat or replace line cards.
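
Just to put rough numbers on the "huge subnet" concern (a made-up back-of-the-envelope sketch in Python, nothing to do
with CenturyLink's actual DCN): on one flat broadcast domain every node eats everyone else's flooded traffic, whereas
segmenting the DCN keeps the blast radius local.

    # Hypothetical illustration only: load seen by each management port when
    # broadcast/unknown-unicast traffic is flooded across the whole domain.
    def flooded_rate_per_node(nodes_in_domain, bcast_frames_per_sec_per_node):
        # Every node receives the flooded frames of every other node.
        return (nodes_in_domain - 1) * bcast_frames_per_sec_per_node

    # One flat DCN spanning the footprint vs. per-CO domains (numbers invented).
    print(flooded_rate_per_node(5000, 10))   # 49,990 frames/sec per mgmt port
    print(flooded_rate_per_node(50, 10))     # 490 frames/sec per mgmt port

A storm multiplies those baseline numbers, but the point stands either way: on a flat domain the whole footprint sees it.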

2.  Do we believe that an OOB management card was able to generate so much traffic as to bring down the optical 
switching?  Very doubtful, which means the systems were actually broken by trying to PROCESS the "invalid
frames".  It seems like very poor control-plane management if the system attempts to process invalid data and brings
down the forwarding plane in the process.

L2 loop. You will kill your JNPR/CSCO with enough trash on MGMT ETH.
However, it can be argued that an optical network should fail up in the absence of its control plane, while an IP network has to fail down.

Most of the optical muxes I have worked with will run without any management card or control plane at all.  Usually the
line cards keep forwarding according to the existing configuration even in the absence of all management functions.  It
would help if we knew what gear this was.  True optical muxes do not require much care and feeding once they have a
configuration loaded.  If they are truly dependent on that control plane, then it needs to be redundant, with
watchdogs to reset it if it becomes non-responsive, and they need policers and rate limiters on their interfaces.
It seems they would be vulnerable to a DoS if a bad BPDU can wipe them out.
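
For what it's worth, the policing I have in mind is nothing exotic.  Here is a rough token-bucket sketch in Python (my
own illustration of the concept, not any vendor's implementation) of what a policer in front of a management CPU would do:

    import time

    class TokenBucketPolicer:
        """Toy control-plane policer: admit at most `rate` frames per second
        with bursts up to `burst`; everything else is dropped at the interface
        before it can consume CPU."""
        def __init__(self, rate, burst):
            self.rate = rate                  # tokens (frames) refilled per second
            self.burst = burst                # bucket depth
            self.tokens = burst
            self.last = time.monotonic()

        def admit(self):
            now = time.monotonic()
            self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True                   # frame is punted to the control plane
            return False                      # frame is dropped

With something like that in the punt path, a broadcast storm costs you some dropped management frames, not the line card.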

3.  In the cited document it was stated that the offending packet did not have source or destination information.  If 
so, how did it get propagated throughout the network?

BPDU

Maybe, but it would be strange for a frame to be invalid yet valid enough to keep being forwarded.  In any case, loss
of the management network should not interrupt forwarding.  I also would not be happy with an optical network that
relies on spanning tree to remain operational.
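
As an aside, the "no source or destination" wording does fit a BPDU reasonably well: a classic 802.1D BPDU rides
directly over 802.3/LLC to a well-known multicast MAC and never carries an IP header, so there are no L3 addresses to
report.  A rough scapy sketch of such a frame (illustrative only, obviously not the actual offending frame):

    from scapy.all import Dot3, LLC, STP

    # Plain 802.1D BPDU: LLC DSAP/SSAP 0x42, destination is the STP multicast
    # MAC.  There is no IP layer, hence no L3 source or destination.
    bpdu = (Dot3(dst="01:80:c2:00:00:00", src="00:11:22:33:44:55") /
            LLC(dsap=0x42, ssap=0x42, ctrl=3) /
            STP(rootid=32768, bridgeid=32768))
    bpdu.show()

Which only reinforces the point that forwarding should not depend on the management network digesting frames like this.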

My guess at the time and my current opinion (which has no real factual basis, just years of experience) is that a bad 
software package was propagated through their network.

Lots of possible reasons. I choose to believe that what they've communicated is what the writer of the communication
thought happened, but as they are likely not an SME it's broken-radio communication. A BCAST storm on the L2 DCN would
plausibly fit the very ambiguous reason offered and is something people actually do.

My biggest problem with their explanation is the replacement of line cards in multiple cities.  The only way that 
happens is when bad code gets pushed to them.  If it took them that long to fix an L2 broadcast storm, something is 
seriously wrong with their engineering.  Resetting the management interfaces should be sufficient once the offending 
line card is removed.  That is why I think this was a software update failure or a configuration push.  Either way, 
they should be jumping up and down on their vendor as to why this caused such large-scale effects.
