nanog mailing list archives

Re: Famous operational issues


From: Justin Streiner <streinerj () gmail com>
Date: Tue, 23 Feb 2021 17:11:11 -0500

Beyond the widespread outages, I have so many personal war stories that
it's hard to pick a favorite.

My first job out of college in the mid-late 90s was at an ISP in Pittsburgh
that I joined pretty early in its existence, and everyone did a bit of
everything. I was hired to do sysadmin stuff, networking, pretty much
whatever was needed. About a year after I started, we brought up a new mail
system with an external RAID enclosure for the mail store itself.  One day,
we saw indications that one of the disks in the RAID enclosure was starting
to fail, so I scheduled a maintenance window to replace the disk and let
the controller rebuild the data and integrate it back into the RAID set.
No big worries, right?

It's Tuesday at about 2 AM.

Well, the kernel on the RAID controller itself decided that when I pulled
the failing drive would be a fine time to panic, and more or less turn
itself into a bit-blender, and take all the mailstore down with it.  After
a few hours of watching fsck make no progress on anything, in terms of
trying to un-fsck the mailstore, we made the decision in consultation with
the CEO to pull the plug on trying to bring the old RAID enclosure back to
life, and focus on finding suitable replacement hardware and rebuild from
scratch.  We also discovered that the most recent backups of the mailstore
were over a month old :(

I think our CEO ended up driving several hours to procure a suitable
enclosure.  By the time we got the enclosure installed, filesystems built,
and got whatever tape backups we had restored, and tested the integrity of
the system, it was now Thursday around 8 AM. Coincidentally, that was the
same day the company hosted a big VIP gathering (the mayor was there, along
with lots of investors and other bigwigs), so I had to come back and put on
a suit to hobnob with the VIPs after getting a total of 6 hours of sleep in
about the previous 3 days.  I still don't know how I got home that night
without wrapping my vehicle around a utility pole (due to being over-tired,
not due to alcohol).

Many painful lessons learned over that stretch of days, as often the case
as a company grows from startup mode and builds more robust technology and
business processes as a consequence of growth.

jms

On Tue, Feb 16, 2021 at 2:37 PM John Kristoff <jtk () dataplane org> wrote:

Friends,

I'd like to start a thread about the most famous and widespread Internet
operational issues, outages or implementation incompatibilities you
have seen.

Which examples would make up your top three?

To get things started, I'd suggest the AS 7007 event is perhaps  the
most notorious and likely to top many lists including mine.  So if
that is one for you I'm asking for just two more.

I'm particularly interested in this as the first step in developing a
future NANOG session.  I'd be particularly interested in any issues
that also identify key individuals that might still be around and
interested in participating in a retrospective.  I already have someone
that is willing to talk about AS 7007, which shouldn't be hard to guess
who.

Thanks in advance for your suggestions,

John


Current thread: