nanog mailing list archives

Re: CloudFlare issues?

From: Ross Tajvar <ross () tajvar io>
Date: Mon, 24 Jun 2019 21:39:13 -0400

On Mon, Jun 24, 2019 at 9:01 PM Jared Mauch <jared () puck nether net> wrote:

On Jun 24, 2019, at 8:50 PM, Ross Tajvar <ross () tajvar io> wrote:

Maybe I'm in the minority here, but I have higher standards for a T1
than any of the other players involved. Clearly several entities failed to
do what they should have done, but Verizon is not a small or inexperienced
operation. Taking 8+ hours to respond to a critical operational problem is
what stood out to me as unacceptable.


Are you talking about a press response or a technical one?  The impacts I
saw were for around 2h or so based on monitoring I’ve had up since 2007.
Not great but far from the worst as Tom mentioned.  I’ve seen people cease
to announce IP space we reclaimed from them for months (or years) because
of stale config.  I’ve also seen routes come back from the dead because
they were pinned to an interface that was down for 2 years but never fully
cleaned up.  (Then the telco looped the circuit, interface came up, route
in table, announced globally — bad day all around).


A technical one - see below from CF's blog post:
"It is unfortunate that while we tried both e-mail and phone calls to reach
out to Verizon, at the time of writing this article (over 8 hours after the
incident), we have not heard back from them, nor are we aware of them
taking action to resolve the issue."

And really - does it matter if the protection *was* there but something
broke it? I don't think it does. Ultimately, Verizon failed to implement
correct protections on their network. And then failed to respond when it
became a problem.


I think it does matter.  As I said in my other reply, people do things
like drop ACLs to debug.  Perhaps that’s unsafe, but it is something you do
to debug.  Not knowing what happened, I dunno.  It is also 2019 so I hold
networks to a higher standard than I did in 2009 or 1999.


Dropping an ACL is fine, but then you have to clean it up when you're done.
Your customers don't care that you *almost* didn't have an outage because
you *almost* did your job right. Yeah, there's a difference between not
following policy and not having a policy, but neither one is acceptable
behavior from a T1 imo. If it's that easy to cause an outage by not
following policy, then I argue that the policy should be better, or *something*
should be better - monitoring, automation, sanity checks, etc. There are
lots of ways to solve that problem. And in 2019 I really think there's no
excuse for a T1 not to be doing that kind of thing.

- Jared
