nanog mailing list archives

Re: Global Akamai Outage


From: Mark Tinka <mark@tinka.africa>
Date: Sun, 25 Jul 2021 16:37:37 +0200



On 7/25/21 08:18, Saku Ytti wrote:

Hey,

Not a critique of Akamai specifically; it applies just the same to me.
Everything seems so complex and fragile.

Very often the corrective and preventive actions appear to be
different versions and wordings of 'don't make mistakes', in this case:

- Reviewing and improving input safety checks for mapping components
- Validate and strengthen the safety checks for the configuration
deployment zoning process

'Do better' doesn't seem like a tenable solution, since I'm sure
whoever built those checks did their best in the first place. So we
must assume there are fundamental limits to what 'do better' can
achieve, and that a similar level of outage potential exists in all
the work we've produced and continue to produce, over which we exert
very little control.

I think the mean-time-to-repair actions described are more actionable
than the 'do better' ones. However, Akamai already resolved this one
very quickly, and it may not be reasonable to expect big improvements
on roughly one hour from start of fault to resolution for a big
organisation with a complex product.

One thing that comes to mind: what if Akamai assumes they cannot
reasonably make it fail less often and cannot fix it faster? Is this
particular product/solution such that entirely independent A and B
sides, between which clients fail over, are not an option? If it was a
DNS problem, it seems it might have been possible for A to fail
entirely and for clients to revert to B automatically, perhaps adding
some latency, but also allowing the system to detect automatically
that A and B are performing at an unacceptable delta.

Did some of their affected customers recover faster than Akamai did,
through their own actions, automated or manual?

Can we learn something from how the airline industry has incrementally improved safety through decades of incidents?

"Doing better" is the lowest hanging fruit any network operator can strive for. Unlike airlines, the Internet community - despite being built on standards - is quite diverse in how we choose to operate our own islands. So "doing better", while a universal goal, means different things to different operators. This is why we would likely see different RFO's and remedial recommendations from different operators for the "same kind of" outage.

In most cases, continuing to "do better" may be the most appealing prospect, because anything beyond that will require significantly more funding, in an industry where most operators are generally threading the P&L needle.

Mark.

