nanog mailing list archives

Level 3 RFO


From: "erikk" <erikk () eknetwork com>
Date: Sat, 22 Oct 2005 23:15:29 -0400


Customer Information



           Customer Company Name:  (Internap)



           Customer Contact Information:  (noc () internap com)



           Customer Location:  (All services with Level3 Communications)



           Original Ticket Number:  SM Parent 1429209



           Customer Impact:  Outage



Event Summary



           Outage location:  IP North America, Trans-Atlantic and European Markets



           Ticket Create Date and Time:  10/21/2005 12:01 MDT



           Service Restore Date and Time:  Between 10/21/2005 12:25 MDT and
10/21/2005 5:31 MDT, depending on location



           Total Duration:  Varied by Location



           Event Description:

A configuration update was applied to an edge router in Chicago as part of an
approved low-risk maintenance activity.  The same validated and approved
configuration change was applied in four other major markets with no impact.
In this case, however, the configuration was corrupted during the deployment
process on this specific edge router.  When the corrupted configuration was
loaded, the device created an open-ended policy that allowed this router's
routes to be redistributed into OSPF.
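
The RFO does not say how the configuration was corrupted or what checks the
deployment process performs.  As an illustration only, a pre-commit integrity
check could be sketched as below, assuming the approved configuration text is
available to compare against what actually arrived on the device.  The
function names and the vendor-neutral policy lines are hypothetical and are
not taken from the report.

```python
import hashlib
import hmac


def config_digest(config_text: str) -> str:
    """Return the SHA-256 hex digest of a configuration's UTF-8 bytes."""
    return hashlib.sha256(config_text.encode("utf-8")).hexdigest()


def matches_approved(approved_text: str, deployed_text: str) -> bool:
    """True only if the deployed text is byte-identical to the approved text."""
    return hmac.compare_digest(config_digest(approved_text),
                               config_digest(deployed_text))


# Hypothetical, vendor-neutral config lines, for illustration only.
APPROVED = "policy export-loopback\n  match interface lo0\n  then accept\n"
# Assumed failure mode: the match line is lost, leaving an open-ended policy.
CORRUPTED = "policy export-loopback\n  then accept\n"

if __name__ == "__main__":
    for name, candidate in (("intact", APPROVED), ("corrupted", CORRUPTED)):
        ok = matches_approved(APPROVED, candidate)
        print(f"{name}: {'ok to commit' if ok else 'abort'}")
```

A digest comparison only detects that the text changed in transit.  It says
nothing about whether the approved configuration itself is safe.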



The engineering team immediately reverted to the previously saved
configuration to mitigate route propagation.  The rollback was followed by
deliberate isolation of the router and a complete device reload, to ensure
that no stale LSAs (Link State Advertisements) existed on the device; this
was completed by 12:08 MDT.  After the edge router was reloaded, the initial
cause of the event was effectively mitigated.  However, because of the number
of flooded LSAs, other devices in the Level 3 network had difficulty fully
loading their OSPF tables and processing the volume of updates.  This caused
abnormal conditions within portions of the Level 3 network.  Manual
intervention on specific routers was required to return a number of routers
to a normal routing state.





Root Cause Analysis



A redistribution-of-loopback statement was committed in an erroneous state.









Repair



           On devices with a large number of adjacent neighbors, selectively
disabling interfaces on redundant paths or restarting the OSPF process
stabilized the affected portions of the network.



Future Preventive Actions



The Level 3 engineering team is currently analyzing the event in order to
determine an appropriate action plan.  Details of this specific plan will be
available after the analysis is complete.

