nanog mailing list archives

Re: Global Akamai Outage


From: Jared Mauch <jared () puck nether net>
Date: Sun, 25 Jul 2021 11:12:43 -0400

Work hat is not on, but context is included from prior workplaces etc. 

On Jul 25, 2021, at 2:22 AM, Saku Ytti <saku () ytti fi> wrote:

It doesn't seem like a tenable solution, when the solution is 'do
better', since I'm sure whoever did those checks did their best in the
first place. So we must assume we have some fundamental limits what
'do better' can achieve, we have to assume we have similar level of
outage potential in all work we've produced and continue to produce
for which we exert very little control over.

I have seen a very strong culture around risk, and risk avoidance whenever possible, at Akamai. Even minor changes are 
taken very seriously. 

I appreciate that on a daily basis, and when mistakes are made (I am human after all), the mistakes are reviewed 
and corrective steps are planned and followed up on. I'm sure this time will not be different. 

I also get how easy it is to be cynical about these issues. There's always someone with power who can break things, but 
those same people can often fix them just as fast. 

Focus on how you can make a routing change as a transaction and roll it back, how you can test it, etc. 

This is why for years I told one vendor that had a line-by-line parser that their system was too unsafe for operation. 
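To make the contrast concrete, here is a rough sketch of line-by-line versus transactional application of a routing 
change. It is not any vendor's implementation. The config layout (a dict of prefix to next hop) and the names 
check_route, apply_line_by_line, apply_transactional and rollback are all made up for illustration.

```python
import copy


class ConfigError(Exception):
    """Raised when a route fails the (hypothetical) sanity check."""


def check_route(prefix, next_hop):
    # Hypothetical sanity check: a real system would do far more than this.
    if "/" not in prefix or not next_hop:
        raise ConfigError(f"bad route: {prefix!r} -> {next_hop!r}")


def apply_line_by_line(running, changes):
    """Unsafe model: each line takes effect as soon as it is parsed.

    If a later line is bad, the earlier lines are already live and the
    running config is left half-applied.
    """
    for prefix, next_hop in changes:
        check_route(prefix, next_hop)
        running["routes"][prefix] = next_hop


def apply_transactional(running, changes):
    """Safer model: build a candidate, validate all of it, then commit.

    Returns a snapshot of the previous config so the caller can roll back.
    If validation fails, the running config is never touched.
    """
    candidate = copy.deepcopy(running)
    for prefix, next_hop in changes:
        candidate["routes"][prefix] = next_hop
    for prefix, next_hop in candidate["routes"].items():
        check_route(prefix, next_hop)
    previous = copy.deepcopy(running)
    running["routes"] = candidate["routes"]  # single swap at commit time
    return previous


def rollback(running, previous):
    """Restore the snapshot returned by apply_transactional."""
    running["routes"] = copy.deepcopy(previous["routes"])
```

With a change set whose second line is bad, the line-by-line version leaves the first line installed, while the 
transactional version leaves the running config exactly as it was. The snapshot returned on a successful commit is 
what makes a quick rollback possible.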

There are also other questions, like:

How can we improve response times when things are routed poorly? Time to mitigate hijacks is improved by the majority 
of providers doing RPKI OV, but interprovider response time scales are much longer. I also think about the two big CTL 
long haul and routing issues last year. How can you mitigate these externalities?
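
For anyone unfamiliar with what RPKI OV checks, here is a minimal sketch of the origin validation outcome as 
described in RFC 6811 (valid, invalid, not-found). It is only an illustration, not how any router implements it. 
The VRP tuple and validate_origin are names made up here, and the prefixes and AS numbers are documentation values.

```python
import ipaddress
from typing import NamedTuple


class VRP(NamedTuple):
    """A validated ROA payload: prefix, maximum length, authorized origin AS."""
    prefix: str
    max_length: int
    asn: int


def validate_origin(route_prefix, origin_asn, vrps):
    """Return 'valid', 'invalid' or 'not-found' for a route announcement."""
    route = ipaddress.ip_network(route_prefix)
    covered = False
    for vrp in vrps:
        vrp_net = ipaddress.ip_network(vrp.prefix)
        # subnet_of() raises TypeError across IP versions, so skip those.
        if route.version != vrp_net.version:
            continue
        if not route.subnet_of(vrp_net):
            continue
        covered = True  # at least one VRP covers this route
        if vrp.asn == origin_asn and route.prefixlen <= vrp.max_length:
            return "valid"
    # Covered but no VRP matched: invalid. No covering VRP at all: not-found.
    return "invalid" if covered else "not-found"


vrps = [VRP("192.0.2.0/24", 24, 64496)]
print(validate_origin("192.0.2.0/24", 64496, vrps))     # right origin
print(validate_origin("192.0.2.0/24", 64511, vrps))     # wrong origin
print(validate_origin("198.51.100.0/24", 64496, vrps))  # no covering VRP
```

A provider that drops invalids stops a mis-originated announcement at the first hop that validates, which is why 
mitigation time improves without anyone having to pick up a phone.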

- Jared 
