nanog mailing list archives

Re: Verizon Routing issue


From: Jared Mauch <jared () puck nether net>
Date: Mon, 24 Jun 2019 11:15:56 -0400



On Jun 24, 2019, at 11:12 AM, Max Tulyev <maxtul () netassist ua> wrote:

24.06.19 17:44, Jared Mauch wrote:
1. Why did Cloudflare not immediately announce all of their address space as /24s? That could have restored the 
service almost everywhere instantly.
They may not want to pollute the global routing table with these entries.  It has a cost for everyone.  If we all 
did this, the table would be a mess.

Yes, it does. But it is a quick, working, temporary fix for the problem.

Like many things (e.g., AT&T had similar issues with 12.0.0.0/8), there's now a bunch of /9s in the table that will 
likely never go away.
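As a rough sketch of what deaggregation costs the global table (the prefix below is made up, not a real Cloudflare allocation), splitting one aggregate into /24s multiplies one entry into hundreds or thousands:

```python
import ipaddress

def deaggregate(prefix: str, new_prefix: int = 24):
    """Split an aggregate into more-specific announcements.

    More-specifics win longest-match against a leaked covering
    route, but every /24 becomes a separate entry in everyone's
    routing table.
    """
    net = ipaddress.ip_network(prefix)
    return [str(p) for p in net.subnets(new_prefix=new_prefix)]

# A hypothetical /16 becomes 2**(24-16) = 256 separate /24 routes.
routes = deaggregate("198.51.0.0/16")
```

This is the trade-off above in miniature: the fix is instant, but if every origin did it, the table would bloat permanently.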

2. Why did almost all carriers not filter the leak on their side, but instead wait several hours for "better 
weather on Mars"?
There’s several major issues here
- Verizon accepted garbage from their customer
- Other networks accepted the garbage from Verizon (e.g., Cogent)
- known best practices from over a decade ago are not applied
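The "known best practices" here are customer prefix filters built from registry data (IRR route objects or RPKI ROAs). A minimal sketch of the edge check that would have dropped the leak, with an entirely made-up allowed set:

```python
import ipaddress

# Prefixes the customer has registered; in practice this set would be
# built automatically from IRR/RPKI data. Contents are hypothetical.
CUSTOMER_ALLOWED = {
    ipaddress.ip_network("192.0.2.0/24"),
}

def accept_announcement(prefix: str) -> bool:
    """Accept a customer route only if it falls within a registered
    prefix and is no more specific than a /24."""
    net = ipaddress.ip_network(prefix)
    if net.prefixlen > 24:
        return False
    return any(net.subnet_of(allowed) for allowed in CUSTOMER_ALLOWED)
```

Anything outside the registered set, including leaked more-specifics of other networks' space, never makes it into the table, regardless of what the customer's router emits.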

That's it.

We have several IXes connected, and all of them had a correct aggregated route to CF. It was one upstream that 
distributed the leaked more-specifics.

I think 30 minutes at most is enough to find the problem and filter out its source on their side. Almost nobody did 
it. Why?

I have heard people say "we don't look for problems". This is often the case; there is a lack of monitoring/awareness. 
 I had several systems detect the problem, and things like bgpmon also saw it.

My guess is the people that passed this on weren't monitoring either. It's often manual procedures vs. automated 
scripts watching things. Instrumenting your network elements tends to be done by a small set of people who invest in 
it. You tend to need some scale for it to make sense, and it also requires people who understand the underlying data 
well enough to know what is "odd".

This is why I've had my monitoring system up for the past 12+ years. It's super simple (dumb) and catches a lot of 
issues. I implemented it again on top of the RIPE RIS Live service, but haven't cut that over to be the primary 
(realtime) monitoring method instead of watching route-views.
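Jared's monitor is his own, but the core check such a system performs against a feed like RIPE RIS Live or route-views can be sketched simply: flag any observed route that is a more-specific of a prefix you originate but is not one of your own announcements. The watched prefix below is hypothetical:

```python
import ipaddress

# Prefixes this network originates, at the exact lengths it announces
# (hypothetical values; a real monitor loads these from config).
EXPECTED = {ipaddress.ip_network("198.51.100.0/22")}

def is_suspicious(prefix: str) -> bool:
    """Flag a route that is a more-specific of a prefix we originate
    but not one of our own announcements -- the signature of a leak
    stealing traffic via longest-match."""
    net = ipaddress.ip_network(prefix)
    if net in EXPECTED:
        return False
    return any(net.subnet_of(parent) for parent in EXPECTED)
```

A "simple (dumb)" loop applying this to every update from a live BGP feed is enough to catch an event like this leak within seconds, which is exactly the kind of awareness most of the networks that passed it on lacked.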

I think it’s time to do that.

- Jared
