Re: CloudFlare issues?


From: Jared Mauch <jared () puck nether net>
Date: Mon, 24 Jun 2019 20:57:29 -0400



On Jun 24, 2019, at 8:03 PM, Tom Beecher <beecher () beecher cc> wrote:

> Disclaimer: I am a Verizon employee via the Yahoo acquisition. I do not work on 701. My comments are my own opinions only.

> Respectfully, I believe Cloudflare’s public comments today have been a real disservice. This blog post, and your CEO on Twitter today, took every opportunity to say “DAMN THOSE MORONS AT 701!”. They’re not.

I presume that seeing a CF blog post isn’t a regular thing for you. :-) Please read on.

> You are 100% right that 701 should have had some sort of protection mechanism in place to prevent this. But do we know they didn’t? Do we know it was there and just set up wrong? Did another change at another time break what was there? I used 701 many jobs ago and they absolutely had filtering in place; it saved my bacon when I screwed up once and started readvertising a full table from a 2nd provider. They smacked my session down and I got a nice call about it.

> You guys have repeatedly accused them of being dumb without even speaking to anyone yet, from the sounds of it. Shouldn’t we be working on facts?

> Should they have been easier to reach once an issue was detected? Probably. They’re certainly not the first vendor to have a slow response time though. Seems like when an APAC carrier takes 18 hours to get back to us, we write it off as the cost of doing business.

> It also would have been nice, in my opinion, to take a harder stance on the BGP optimizer that generated the bogus routes, and the steel company that failed BGP 101 and just gladly reannounced one upstream to another. 701 is culpable for their mistakes, but there doesn’t seem to be much appetite to shame the other contributors.

> You’re right to use this as a lever to push for proper filtering, RPKI, and best practices. I’m 100% behind that. We can all be a hell of a lot better at what we do. This stuff happens more than it should, but less than it could.

> But this industry is one big ass glass house. What’s that thing about stones again?

I’m careful not to talk about the people impacted. A lot of people were affected: roughly 3-4% of the IP space was impacted today, and I personally heard from more providers than can be counted on a single hand about their impact.

Not everyone is going to write about their business impact in public.  I’m not authorized to speak for my employer 
about any impacts that we may have had (for example) but if there was impact to 3-4% of IP space, statistically 
speaking there’s always a chance someone was impacted.

I do agree about the glass house thing. There’s a lot of blame to go around, and today I’ve been quoting “go read _Normal Accidents_” to people. It’s because sufficiently complex systems tend to have complex failures where numerous safety systems or controls were bypassed. Those of us with more than a few days of experience likely know what some of them are; we also don’t know if those safety systems were disabled as part of debugging by one or more parties. Who hasn’t dropped an ACL to debug why it isn’t working, or to see if that fixed the problem?

I don’t know what happened, but I sure know the symptoms and sets of fixes that the industry should apply and enforce.  
I have been communicating some of them in public and many of them in private today, including offering help to other 
operators with how to implement some of the fixes.

It’s a bad day when someone changes your /16 into two /17’s and sends them out, regardless of whether the packets flow through or not. These things aren’t new, nor do I expect things to be significantly better tomorrow. I know people at VZ and suspect that once they woke up, they did something about it. I also know how hard it is to contact someone you don’t have a business relationship with. A number of the larger providers have no way for a non-customer to phone, message, or open a ticket online about problems they may have. Who knows, their ticket system may be in the cloud and was also impacted.

What I do know is that if 3-4% of homes or structures were flooded or temporarily unusable because of some form of disaster or evacuation, people would be proposing better engineering methods or inspection techniques for those structures.

If you are a small network and just point default, there is nothing for you to see here and nothing that you can do. If you speak BGP with your upstream, you can filter out some of the bad routes. You perhaps know that 1239, 3356, and others should only be seen directly from a network like 701, and you can apply filters of this sort to avoid accepting those more-specifics. I don’t believe 174 was the only network the routes went to, but they were one of the networks aside from 701 where I saw the paths today.
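
As a rough illustration of the idea (in Python rather than vendor config; the peer ASN, prefixes, and tier-1 list below are just example values to make the point concrete), an AS_PATH sanity check might look something like this:

# A minimal sketch of the AS_PATH sanity check described above: if a route
# learned from a small peer carries a large transit network deeper in the
# path, the peer is probably leaking. ASNs below are illustrative only.

TIER1_ASNS = {174, 701, 1239, 1299, 2914, 3257, 3356, 6453, 6461, 7018}

def leaked_via_peer(as_path: list[int], peer_asn: int) -> bool:
    """Return True if a route learned from `peer_asn` has a tier-1 network
    somewhere behind the peer, i.e. the peer is transiting routes it
    should not be transiting toward us."""
    if not as_path or as_path[0] != peer_asn:
        return False  # malformed, or not actually learned from this peer
    # Any tier-1 ASN after the first hop means the peer is re-exporting
    # routes it learned from a transit provider -- a classic leak.
    return any(asn in TIER1_ASNS for asn in as_path[1:] if asn != peer_asn)

if __name__ == "__main__":
    # A small peer (documentation ASNs) re-exporting routes learned via 3356:
    print(leaked_via_peer([64496, 64511, 3356, 13335], peer_asn=64496))  # True
    # A normal customer route from the same peer:
    print(leaked_via_peer([64496, 64500], peer_asn=64496))               # False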

(Now the part where you as a 3rd party to this event can help!)

If you peer, build some pre-flight and post-flight scripts to check how many routes you are sending. Most router vendors support either on-box scripting, or you can do a "show | display xml", JSON, or some other structured output you can automate against. AS_PATH filters are simple, low cost, and can help mitigate problems. Consider monitoring your routes with a BMP server (pmacct has a great one!). Set max-prefix (and monitor if you’re nearing thresholds!). Configure automatic restarts if you won’t be around to fix it.
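
A minimal pre-flight/post-flight sketch, assuming your platform can emit advertised-route counts as JSON (the CLI command and JSON shape below are placeholders for whatever your vendor actually provides), could look like:

# Pre-flight/post-flight check: capture advertised-route counts per peer
# before a change, re-check after, and fail loudly if they blow up.
import json
import subprocess
import sys

# Placeholder command that returns advertised-route data as JSON.
SHOW_ADVERTISED_CMD = ["vendor-cli", "show-bgp-advertised-routes", "--format", "json"]

def advertised_counts() -> dict[str, int]:
    """Return {peer_address: number_of_advertised_prefixes}."""
    out = subprocess.run(SHOW_ADVERTISED_CMD, capture_output=True, text=True, check=True)
    data = json.loads(out.stdout)
    # Assumed shape: {"peers": [{"address": "192.0.2.1", "advertised": 123}, ...]}
    return {p["address"]: p["advertised"] for p in data["peers"]}

def compare(pre: dict[str, int], post: dict[str, int], max_growth: float = 1.2) -> list[str]:
    """Flag any peer whose advertised-route count grew more than max_growth x."""
    alerts = []
    for peer, after in post.items():
        before = pre.get(peer, 0)
        if before and after > before * max_growth:
            alerts.append(f"{peer}: {before} -> {after} advertised routes")
    return alerts

if __name__ == "__main__":
    pre = advertised_counts()
    input("Pre-flight captured. Apply your change, then press Enter... ")
    problems = compare(pre, advertised_counts())
    if problems:
        print("Post-flight check FAILED:", *problems, sep="\n  ")
        sys.exit(1)
    print("Post-flight check passed.")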

I hate to say “automate all the things”, but at least start with monitoring so you can know when things go bad.  Slack 
and other things have great APIs and you can have alerts sent to your systems telling you of problems.  Try hard to 
automate your debugging.  Monitor for announcements of your space.  The new RIS Live API lets you do this and it’s 
super easy to spin something up.
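
For example, a quick-and-dirty watcher for your own space using RIS Live might look like the sketch below (the endpoint and message fields follow RIPE’s published RIS Live interface; the prefix and origin ASN are documentation values you would replace with your own, and the alert should go to Slack or your pager rather than stdout):

# Watch RIS Live for announcements of our space and alert on more-specifics
# or unexpected origins. Requires: pip install websocket-client
import json
from websocket import create_connection

MY_PREFIX = "192.0.2.0/24"   # your aggregate (documentation prefix here)
MY_ORIGIN = 64496            # your ASN (documentation ASN here)

ws = create_connection("wss://ris-live.ripe.net/v1/ws/?client=leak-watcher-example")
ws.send(json.dumps({
    "type": "ris_subscribe",
    "data": {"prefix": MY_PREFIX, "moreSpecific": True, "type": "UPDATE"},
}))

while True:
    msg = json.loads(ws.recv())
    if msg.get("type") != "ris_message":
        continue  # ignore pongs, errors, etc.
    data = msg.get("data", {})
    path = data.get("path") or []
    # Last path element is normally the origin ASN (can be an AS-set list).
    origin = path[-1] if path else None
    for ann in data.get("announcements", []):
        for prefix in ann.get("prefixes", []):
            if prefix != MY_PREFIX or origin != MY_ORIGIN:
                # More-specific of our block, or wrong origin: worth an alert.
                print(f"ALERT: {prefix} seen with path {path} via peer AS{data.get('peer_asn')}")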

Hold your suppliers accountable as well. If you are a customer of a network that was impacted or that accepted these routes, ask for a formal RFO and ask what the corrective actions are. Don’t let them off the hook, as it will happen again.

If you are using route optimization technology, make double certain it’s not possible to leak routes. Cisco IOS and Noction are two products that I either know, or have been told, don’t have safe settings enabled by default. I learned early on in the 90s the perils of having “everything on, unprotected” by default. There were great bugs in software that allowed devices to be compromised at scale, which created cleanup problems comparable to what we’ve seen in recent years with IoT and other technologies. Tell your vendors you want them to be secure by default, and vote with your personal and corporate wallet when you can.

It won’t always work; some vendors will not be able or willing to clean up their acts. But unless we act together as an industry to clean up the glass inside our own houses, expect someone from the outside to come along at some point who can force it. It may not even make sense (ask anyone who deals with security audit checklists), but you will be required to do it.

Please take action within your power at your company.  Stand up for what is right for everyone with this shared risk 
and threat.  You may not enjoy who the messenger is (or the one who is the loudest) but set that aside for the industry.

</soapbox>

- Jared

PS. We often call ourselves network engineers or architects. If we are truly that, we should be using those industry standards as building blocks to ensure a solid foundation. Make sure your foundation is stable. Learn from others’ mistakes to design and operate the best network feasible.
