nanog mailing list archives

Re: Worldcom and Qwest switch places


From: michael.dillon () gtsip net
Date: 7 Feb 2000 19:28:41 +0000


On Sat, 05 February 2000, Sean Donelan wrote:

Since Lucent equipment was also involved in the 10 days of Worldcom problems,
is there a common root cause between the Worldcom's problems and Qwest's
problems?  Is there some lesson other providers should be learning from
these events?  Or is each service provider expected to learn and re-learn
these lessons individually?  Is there some network design decision engineers
are getting wrong?

Lucent people told me that the Worldcom problem resulted from a software upgrade to Worldcom's Lucent switches that was 
done without having a good fallback plan. Lucent engineers had recommended a different strategy to Worldcom but 
Worldcom went ahead and did it their way. Then the software upgrade triggered some kind of cascading problem that 
either affected the old code or travelled through the network or both.

In other words, they created a problem as a side effect of the upgrade but didn't have a good strategy to contain or 
kill the problem that propagated like some kind of living organism. Seems to me that we *HAVE* seen this type of 
problem before in the Internet with things like AS7007 routes which seemed to hang around parts of the net for days.

How do you plan to roll back to a known state when you can't simply backtrack or reverse your actions?

---
Michael Dillon   Phone: +44 (20) 7769 8489   
                 Mobile: +44 (79) 7099 2658
Director of Product Engineering, GTS IP Services
151 Shaftesbury Ave.
London WC2H 8AL
UK


