Re: CloudFlare issues?


From: Jared Mauch <jared () puck nether net>
Date: Mon, 24 Jun 2019 20:57:29 -0400



On Jun 24, 2019, at 8:03 PM, Tom Beecher <beecher () beecher cc> wrote:

> Disclaimer: I am a Verizon employee via the Yahoo acquisition. I do not work on 701. My comments are my own opinions only.

> Respectfully, I believe Cloudflare’s public comments today have been a real disservice. This blog post, and your CEO on Twitter today, took every opportunity to say “DAMN THOSE MORONS AT 701!”. They’re not.

I presume that seeing a CF blog post isn’t a regular thing for you. :-) Please read on.

> You are 100% right that 701 should have had some sort of protection mechanism in place to prevent this. But do we know they didn’t? Do we know it was there and just set up wrong? Did another change at another time break what was there? I used 701 many jobs ago and they absolutely had filtering in place; it saved my bacon when I screwed up once and started readvertising a full table from a 2nd provider. They smacked my session down and I got a nice call about it.

> You guys have repeatedly accused them of being dumb without even speaking to anyone yet, from the sounds of it. Shouldn’t we be working on facts?

> Should they have been easier to reach once an issue was detected? Probably. They’re certainly not the first vendor to have a slow response time though. Seems like when an APAC carrier takes 18 hours to get back to us, we write it off as the cost of doing business.

> It also would have been nice, in my opinion, to take a harder stance on the BGP optimizer that generated the bogus routes, and the steel company that failed BGP 101 and just gladly reannounced one upstream to another. 701 is culpable for their mistakes, but there doesn’t seem to be much appetite to shame the other contributors.

> You’re right to use this as a lever to push for proper filtering, RPKI, and best practices. I’m 100% behind that. We can all be a hell of a lot better at what we do. This stuff happens more than it should, but less than it could.

> But this industry is one big ass glass house. What’s that thing about stones again?

I’m careful not to talk about the people impacted. A lot of people were affected: roughly 3-4% of the IP space was impacted today, and I personally heard from more providers than can be counted on a single hand about their impact.

Not everyone is going to write about their business impact in public.  I’m not authorized to speak for my employer 
about any impacts that we may have had (for example) but if there was impact to 3-4% of IP space, statistically 
speaking there’s always a chance someone was impacted.

I do agree about the glass house thing. There’s a lot of blame to go around, and today I’ve been quoting “go read _Normal Accidents_” to people. It’s because sufficiently complex systems tend to have complex failures where numerous safety systems or controls were bypassed. Those of us with more than a few days of experience likely know what some of them are; we also don’t know if those safety systems were disabled as part of debugging by one or more parties. Who hasn’t dropped an ACL to debug why it isn’t working, or to see if that fixed the problem?

I don’t know what happened, but I sure know the symptoms and sets of fixes that the industry should apply and enforce.  
I have been communicating some of them in public and many of them in private today, including offering help to other 
operators with how to implement some of the fixes.

It’s a bad day when someone changes your /16 into two /17’s and sends them out, regardless of whether the packets flow through or not. These things aren’t new, nor do I expect things to be significantly better tomorrow. I know people at VZ and suspect that once they woke up, they did something about it. I also know how hard it is to contact someone you don’t have a business relationship with. A number of the larger providers have no way for a non-customer to phone, message, or open a ticket online about problems they may have. Who knows, their ticket system may be in the cloud and was also impacted.

What I do know is that if 3-4% of homes or structures were flooded or temporarily unusable because of some form of disaster or evacuation, people would be proposing better engineering methods or inspection techniques for those structures.

If you are a small network and just point default, there is nothing for you to see here and nothing that you can do. If you speak BGP with your upstream, you can filter out some of the bad routes. You perhaps know that 1239, 3356, and others should only be seen directly from a network like 701, and you can apply filters of this sort to avoid accepting those more-specifics. I don’t believe 174 was the only network the routes went to, but they were one of the networks aside from 701 where I saw the paths today.
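
As a rough illustration of the idea (in Python rather than vendor config; the peer ASN, prefixes, and tier-1 list below are just example values to make the point concrete), an AS_PATH sanity check might look something like this:

# A minimal sketch of the AS_PATH sanity check described above: if a route
# learned from a small peer carries a large transit network deeper in the
# path, the peer is probably leaking. ASNs below are illustrative only.

TIER1_ASNS = {174, 701, 1239, 1299, 2914, 3257, 3356, 6453, 6461, 7018}

def leaked_via_peer(as_path: list[int], peer_asn: int) -> bool:
    """Return True if a route learned from `peer_asn` has a tier-1 network
    somewhere behind the peer, i.e. the peer is transiting routes it
    should not be transiting toward us."""
    if not as_path or as_path[0] != peer_asn:
        return False  # malformed, or not actually learned from this peer
    # Any tier-1 ASN after the first hop means the peer is re-exporting
    # routes it learned from a transit provider -- a classic leak.
    return any(asn in TIER1_ASNS for asn in as_path[1:] if asn != peer_asn)

if __name__ == "__main__":
    # A small peer (documentation ASNs) re-exporting routes learned via 3356:
    print(leaked_via_peer([64496, 64511, 3356, 13335], peer_asn=64496))  # True
    # A normal customer route from the same peer:
    print(leaked_via_peer([64496, 64500], peer_asn=64496))               # False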

(Now the part where you as a 3rd party to this event can help!)

If you peer, build some pre-flight and post-flight scripts to check how many routes you are sending. Most router vendors support either on-box scripting, or you can do a "show | display xml", JSON, or some other structured output you can automate against. AS_PATH filters are simple, low cost, and can help mitigate problems. Consider monitoring your routes with a BMP server (pmacct has a great one!). Set max-prefix (and monitor if you’re nearing thresholds!). Configure automatic restarts if you won’t be around to fix it.
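
A minimal pre-flight/post-flight sketch, assuming your platform can emit advertised-route counts as JSON (the CLI command and JSON shape below are placeholders for whatever your vendor actually provides), could look like:

# Pre-flight/post-flight check: capture advertised-route counts per peer
# before a change, re-check after, and fail loudly if they blow up.
import json
import subprocess
import sys

# Placeholder command that returns advertised-route data as JSON.
SHOW_ADVERTISED_CMD = ["vendor-cli", "show-bgp-advertised-routes", "--format", "json"]

def advertised_counts() -> dict[str, int]:
    """Return {peer_address: number_of_advertised_prefixes}."""
    out = subprocess.run(SHOW_ADVERTISED_CMD, capture_output=True, text=True, check=True)
    data = json.loads(out.stdout)
    # Assumed shape: {"peers": [{"address": "192.0.2.1", "advertised": 123}, ...]}
    return {p["address"]: p["advertised"] for p in data["peers"]}

def compare(pre: dict[str, int], post: dict[str, int], max_growth: float = 1.2) -> list[str]:
    """Flag any peer whose advertised-route count grew more than max_growth x."""
    alerts = []
    for peer, after in post.items():
        before = pre.get(peer, 0)
        if before and after > before * max_growth:
            alerts.append(f"{peer}: {before} -> {after} advertised routes")
    return alerts

if __name__ == "__main__":
    pre = advertised_counts()
    input("Pre-flight captured. Apply your change, then press Enter... ")
    problems = compare(pre, advertised_counts())
    if problems:
        print("Post-flight check FAILED:", *problems, sep="\n  ")
        sys.exit(1)
    print("Post-flight check passed.")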

I hate to say “automate all the things”, but at least start with monitoring so you can know when things go bad.  Slack 
and other things have great APIs and you can have alerts sent to your systems telling you of problems.  Try hard to 
automate your debugging.  Monitor for announcements of your space.  The new RIS Live API lets you do this and it’s 
super easy to spin something up.
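
For example, a quick-and-dirty watcher for your own space using RIS Live might look like the sketch below (the endpoint and message fields follow RIPE’s published RIS Live interface; the prefix and origin ASN are documentation values you would replace with your own, and the alert should go to Slack or your pager rather than stdout):

# Watch RIS Live for announcements of our space and alert on more-specifics
# or unexpected origins. Requires: pip install websocket-client
import json
from websocket import create_connection

MY_PREFIX = "192.0.2.0/24"   # your aggregate (documentation prefix here)
MY_ORIGIN = 64496            # your ASN (documentation ASN here)

ws = create_connection("wss://ris-live.ripe.net/v1/ws/?client=leak-watcher-example")
ws.send(json.dumps({
    "type": "ris_subscribe",
    "data": {"prefix": MY_PREFIX, "moreSpecific": True, "type": "UPDATE"},
}))

while True:
    msg = json.loads(ws.recv())
    if msg.get("type") != "ris_message":
        continue  # ignore pongs, errors, etc.
    data = msg.get("data", {})
    path = data.get("path") or []
    # Last path element is normally the origin ASN (can be an AS-set list).
    origin = path[-1] if path else None
    for ann in data.get("announcements", []):
        for prefix in ann.get("prefixes", []):
            if prefix != MY_PREFIX or origin != MY_ORIGIN:
                # More-specific of our block, or wrong origin: worth an alert.
                print(f"ALERT: {prefix} seen with path {path} via peer AS{data.get('peer_asn')}")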

Hold your suppliers accountable as well. If you are a customer of a network that was impacted or that accepted these routes, ask for a formal RFO and ask what the corrective actions are. Don’t let them off the hook, as it will happen again.

If you are using route optimization technology, make double certain it’s not possible to leak routes. Cisco IOS and Noction are two products that I either know, or have been told, don’t have safe settings enabled by default. I learned early on in the 90s the perils of having “everything on, unprotected” by default. There were great bugs in software that allowed devices to be compromised at scale, which created cleanup problems comparable to what we’ve seen in recent years with IoT and other technologies. Tell your vendors you want them to be secure by default, and vote with your personal and corporate wallet when you can.

It won’t always work; some vendors will not be able or willing to clean up their acts. But unless we act together as an industry to clean up the glass inside our own houses, expect someone from the outside to come along at some point who can force it. It may not even make sense (ask anyone who deals with security audit checklists), but you will be required to do it.

Please take action within your power at your company.  Stand up for what is right for everyone with this shared risk 
and threat.  You may not enjoy who the messenger is (or the one who is the loudest) but set that aside for the industry.

</soapbox>

- Jared

PS. We often call ourselves network engineers or architects. If we are truly that, we should be using those industry standards as building blocks to ensure a solid foundation. Make sure your foundation is stable. Learn from others’ mistakes to design and operate the best network feasible.
