nanog mailing list archives

Re: Setting sensible max-prefix limits


From: Tom Beecher <beecher () beecher cc>
Date: Wed, 18 Aug 2021 11:51:01 -0400


Depending on what failure cases you actually see from your peers in the
wild, I can see (at least as a thought experiment), a two-bucket solution -
"transit" and "everyone else".  (Excluding downstream customers, who you
obviously hold some responsibility for the hygiene of.)


Although I didn't say it clearly, that's exactly what we do. The described
'bucket' logic is only applied to the 'everyone else' pile; our transit
stuff gets its own special care and feeding.
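If it helps to see the bucket/headroom idea written down, here's a rough
Python sketch of the logic. The bucket values, the headroom factor, and the
helper names are made-up placeholders (only the 90% escalation trigger comes
from the description quoted below); our real tooling doesn't look like this.

    # Illustrative sketch of the bucket/headroom scheme; values are placeholders.
    V4_BUCKETS = [100, 500, 2000, 10000, 50000]  # hypothetical max-prefix buckets

    def initial_bucket(expected_prefixes: int) -> int:
        """Pick the smallest bucket that leaves decent headroom over what the
        peer publishes in PeeringDB (or tells us directly to expect)."""
        for limit in V4_BUCKETS:
            if expected_prefixes < 0.5 * limit:  # assumed headroom factor
                return limit
        return V4_BUCKETS[-1]

    def bump_if_needed(current_limit: int, observed_prefixes: int) -> int:
        """At 90% of the limit, move the peer up to the next bucket; a peer
        already in the last bucket gets a manual look instead."""
        if observed_prefixes < 0.9 * current_limit:
            return current_limit
        idx = V4_BUCKETS.index(current_limit)
        if idx + 1 < len(V4_BUCKETS):
            return V4_BUCKETS[idx + 1]
        return current_limit  # last bucket: flag for manual review
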

How often do folks see a failure case that's "deaggregated something and
announced you 1000 /24s, rather than the expected/configured 100 max", vs
"fat-fingered being a transit provider, and announced you the global table"?


I can count on one hand the number of times I can remember that a peer has
gone on a deagg party and run over limits. Maybe twice in the last 8 years?
It's possible it's happened more often than I'm aware of.

We have additional protections in place for that second scenario. If a
generic peer tries to send us a route with a transit provider in the
as-path, we just toss the route on the floor. That protection has been much
more useful than prefix limits IMO.
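
To make that second protection concrete, here's a rough Python sketch of the
check. The ASN list and function name are placeholders, and in practice this
lives in router policy (an as-path filter) rather than a script:

    # Sketch of the "transit provider in the as-path" check described above.
    # The ASN set is only an example of well-known transit networks.
    TRANSIT_ASNS = {174, 1299, 3356}

    def accept_from_peer(as_path: list[int]) -> bool:
        """Toss any route from a generic peer whose AS path contains a transit
        provider's ASN; everything else is subject to normal policy."""
        return not any(asn in TRANSIT_ASNS for asn in as_path)

    # Example: a peer leaking a transit-learned path gets dropped on the floor.
    assert accept_from_peer([64500, 64501])
    assert not accept_from_peer([64500, 3356, 15169])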

On Wed, Aug 18, 2021 at 11:37 AM tim () pelican org <tim () pelican org> wrote:

On Wednesday, 18 August, 2021 14:21, "Tom Beecher" <beecher () beecher cc>
said:

We created 5 or 6 different buckets of limit values (for v4 and v6 of
course.) Depending on what you have published in PeeringDB (or told us
directly what to expect), you're placed in a bucket that gives you a decent
amount of headroom to that bucket's max. If your ASN reaches 90% of your
limit, our ops folks just move you up to the next bucket. If you start to
get up there in the last bucket, then we'll take a manual look and decide
what is appropriate. This covers well over 95% of our non-transit sessions,
and has dramatically reduced the volume of tickets and changes our ops team
has had to sort through.

Depending on what failure cases you actually see from your peers in the
wild, I can see (at least as a thought experiment), a two-bucket solution -
"transit" and "everyone else".  (Excluding downstream customers, who you
obviously hold some responsibility for the hygiene of.)

How often do folks see a failure case that's "deaggregated something and
announced you 1000 /24s, rather than the expected/configured 100 max", vs
"fat-fingered being a transit provider, and announced you the global table"?

My gut says it's the latter case that breaks things and that you need to
make damn sure doesn't happen.  Curious to hear others' experience.

Thanks,
Tim.
