nanog mailing list archives

Re: Famous operational issues


From: "Daniel Karrenberg" <dfk () ripe net>
Date: Fri, 19 Feb 2021 12:07:58 +0100



On 16 Feb 2021, at 20:37, John Kristoff wrote:

I'd like to start a thread about the most famous and widespread Internet
operational issues, outages or implementation incompatibilities you
have seen.

Which examples would make up your top three?


My absolute top one happened in 1995. Traffic engineering was not a widely used term then. A bright colleague, who will remain unnamed, decided that he could make AS paths longer by repeating the same AS number more than once. Unfortunately the prevalent software on Cisco routers was not resilient to such trickery and reacted with a reboot. This caused an avalanche of yo-yo-ing routers. Think it through!

It took some time before the offending path could be purged from the whole Internet; yes, we all roughly knew the topology and the players of the BGP-speaking parts of it at that time. Luckily this happened during the set-up for the Danvers IETF, and co-ordination between major operators was quick because most of their routing geeks happened to be in the same room, the ‘terminal room’; remember those?

Since at the time I personally had no responsibility for operations any more, I went back to pulling cables and crimping RJ45s.

Lessons: HW/SW mono-cultures are dangerous. Input testing is good practice at all levels of software. Operational co-ordination is key in times of crisis.
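
To make the input-testing lesson concrete, here is a minimal sketch in Python of how a BGP speaker might treat an AS path containing repeated AS numbers as ordinary input instead of falling over. The function names, the length cap and the rejection strings are all hypothetical, invented for this illustration; nothing here is taken from any router implementation, and the AS number range assumes today's 4-byte AS numbers rather than the 16-bit ones of 1995.

```python
from itertools import groupby

# Hypothetical sanity cap chosen for this sketch; not a protocol constant.
MAX_AS_PATH_LEN = 255


def collapse_prepends(as_path):
    """Return the path with consecutive repeats of the same AS collapsed."""
    return [asn for asn, _ in groupby(as_path)]


def check_as_path(as_path, local_as):
    """Classify a received AS path instead of crashing on unusual input.

    Returns "accept" or a short rejection reason.
    """
    if not as_path:
        return "reject: empty path"
    if len(as_path) > MAX_AS_PATH_LEN:
        return "reject: path too long"
    if any(not isinstance(asn, int) or not 0 < asn < 2**32 for asn in as_path):
        return "reject: invalid AS number"
    if local_as in as_path:
        # Standard BGP loop prevention: our own AS is already in the path.
        return "reject: loop (own AS in path)"
    # Repeated AS numbers (prepending) are legitimate and simply accepted.
    return "accept"


prepended = [3333, 3333, 3333, 1299]
print(check_as_path(prepended, local_as=64512))
print(len(prepended), collapse_prepends(prepended))
```

A prepended path is accepted and keeps its full length, since each repetition counts towards the path length used in route selection, which is why prepending works as a traffic engineering tool at all. Unusual input is rejected with a reason. Neither case takes the process down.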

Daniel

