nanog mailing list archives

Re: Famous operational issues


From: Shawn L via NANOG <nanog () nanog org>
Date: Tue, 23 Feb 2021 17:37:28 -0500 (EST)


That brings back memories... I had a similar experience.  First month on the job, a large Sun RAID array storing ~5k 
mailboxes dies in the middle of the afternoon.  So, I start troubleshooting and determine it's most likely a bad disk.  
The CEO walked into the server room right about the time I had 20 disks laid out on a table.  He had a fit and called 
the desktop support guy to come and 'show me how to fix a pc'.
 
Never mind the fact that we had a 90% ready to go replacement box sitting at another site, and just needed to either go 
get it, or bring the disks to it..... So we sat there until the desktop guy, who was 30 minutes away, got there.  He took 
one look at it and said 'never touched that thing before, looks like he knows what he's doing' and pointed to me.  4 
hours later we were driving the new server to the data center strapped down in the back of a pickup.  Fun times.
 
 
-----Original Message-----
From: "Justin Streiner" <streinerj () gmail com>
Sent: Tuesday, February 23, 2021 5:11pm
To: "John Kristoff" <jtk () dataplane org>
Cc: "NANOG" <nanog () nanog org>
Subject: Re: Famous operational issues



Beyond the widespread outages, I have so many personal war stories that it's hard to pick a favorite.
My first job out of college in the mid-late 90s was at an ISP in Pittsburgh that I joined pretty early in its 
existence, and everyone did a bit of everything. I was hired to do sysadmin stuff, networking, pretty much whatever was 
needed. About a year after I started, we brought up a new mail system with an external RAID enclosure for the mail 
store itself.  One day, we saw indications that one of the disks in the RAID enclosure was starting to fail, so I 
scheduled a maintenance window to replace the disk and let the controller rebuild the data and integrate it back into 
the RAID set.  No big worries, right?
It's Tuesday at about 2 AM.
Well, the kernel on the RAID controller itself decided that when I pulled the failing drive would be a fine time to 
panic, and more or less turn itself into a bit-blender, and take all the mailstore down with it.  After a few hours of 
watching fsck make no progress on anything, in terms of trying to un-fsck the mailstore, we made the decision in 
consultation with the CEO to pull the plug on trying to bring the old RAID enclosure back to life, and focus on finding 
suitable replacement hardware and rebuild from scratch.  We also discovered that the most recent backups of the 
mailstore were over a month old :(
I think our CEO ended up driving several hours to procure a suitable enclosure.  By the time we got the enclosure 
installed, filesystems built, and got whatever tape backups we had restored, and tested the integrity of the system, it 
was now Thursday around 8 AM. Coincidentally, that was the same day the company hosted a big VIP gathering (the mayor 
was there, along with lots of investors and other bigwigs), so I had to come back and put on a suit to hobnob with the 
VIPs after getting a total of 6 hours of sleep in about the previous 3 days.  I still don't know how I got home that 
night without wrapping my vehicle around a utility pole (due to being over-tired, not due to alcohol).
Many painful lessons learned over that stretch of days, as is often the case as a company grows from startup mode and 
builds more robust technology and business processes as a consequence of growth.
jms


On Tue, Feb 16, 2021 at 2:37 PM John Kristoff <jtk () dataplane org> wrote:

 Friends,

 I'd like to start a thread about the most famous and widespread Internet
 operational issues, outages or implementation incompatibilities you
 have seen.

 Which examples would make up your top three?

 To get things started, I'd suggest the AS 7007 event is perhaps the
 most notorious and likely to top many lists including mine.  So if
 that is one for you I'm asking for just two more.

 I'm particularly interested in this as the first step in developing a
 future NANOG session.  I'd be particularly interested in any issues
 that also identify key individuals that might still be around and
 interested in participating in a retrospective.  I already have someone
 that is willing to talk about AS 7007, which shouldn't be hard to guess
 who.

 Thanks in advance for your suggestions,

 John
