nanog mailing list archives

Re: Cogent Layer 2


From: Saku Ytti <saku () ytti fi>
Date: Thu, 15 Oct 2020 09:29:16 +0300

On Thu, 15 Oct 2020 at 09:11, Ryan Hamel <ryan () rkhtech org> wrote:

Yep. Make sure you run BFD with your peering protocols, to catch outages
very quickly.


Make sure you get higher availability with BFD than without it; it is easy
to get this wrong and end up losing availability.

First issue is that BFD has quite a lot of bug surface, because unlike most
of your control-plane protocols, BFD is implemented in your NPU ucode when
done right.
We've had an entire linecard go down on ASR9k due to BFD, and there is a
BFD-of-death packet you can send over the internet to crash a JNPR FPC.
When done in the control plane, poor scheduling can cause false positives
more often than it protects from actual outages (CISCO7600).

Even in a world where BFD is perfect, you still need to consider what you
are protecting yourself from. Say you bought Martini from someone and run
your backbone over that Martini. What is an outage? Is your provider's IGP
rerouting due to a backbone outage an outage to you? Or would you rather
the provider converge their network while you don't converge and simply
take the outage?
If provider rerouting is not an outage, you need to know what their SLA is
regarding rerouting time and make BFD less aggressive than that. If
provider rerouting is an outage, you can of course run as aggressive timers
as you want, but you probably have lower availability than without BFD.
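
A rough sketch of that timer check, assuming the usual approximation that
BFD detection time is the negotiated interval multiplied by the detect
multiplier. The function names and the 500 ms provider reroute figure are
hypothetical, for illustration only; use whatever your provider's SLA
actually states.

```python
def bfd_detection_time_ms(interval_ms: int, multiplier: int) -> int:
    """Approximate BFD detection time: negotiated interval x detect multiplier."""
    return interval_ms * multiplier


def survives_provider_reroute(interval_ms: int, multiplier: int,
                              provider_reroute_ms: int) -> bool:
    """True if BFD is less aggressive than the provider's rerouting time,
    i.e. the session should stay up while the provider converges."""
    return bfd_detection_time_ms(interval_ms, multiplier) > provider_reroute_ms


# Hypothetical SLA: the provider promises to reroute within 500 ms.
PROVIDER_REROUTE_MS = 500

for interval_ms, multiplier in [(50, 3), (300, 3)]:
    ok = survives_provider_reroute(interval_ms, multiplier, PROVIDER_REROUTE_MS)
    print(f"{interval_ms} ms x {multiplier}: "
          f"{'rides through' if ok else 'trips on'} a provider reroute")
```

With these assumed numbers, 50 ms x 3 trips on every provider reroute,
while 300 ms x 3 rides through it.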

Also, don't add complexity to solve problems you don't have. If you don't
know whether BFD improved your availability, you didn't need it.
Networking is full of belief practices: we do things because we believe
they help, and faux data is often used to dress the beliefs up as science.
The problem space tends to be complex and good quality data is hard to come
by, so we necessarily fly a lot by the seat of our pants, whether we admit
it or not.
My belief is that the majority of BFD implementations in real life on
average reduce availability. My belief is that you need a frequently
failing link which does not propagate link-down to reliably improve
availability by deploying BFD.
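
One way to put numbers on that belief is a back-of-the-envelope downtime
model. This is only a sketch: the model itself and every figure in it
(failure counts, 90 s hold-timer detection, 5 s reconvergence, 20 false
trips a year) are hypothetical assumptions, not measurements.

```python
def yearly_downtime_s(silent_failures: int, detect_s: float,
                      reconverge_s: float, false_trips: int = 0) -> float:
    """Toy model of yearly downtime in seconds.

    Each real failure that does not propagate link-down costs detection
    time plus reconvergence. Each false trip (bug, poor scheduling, BFD
    firing during a provider reroute) costs one extra reconvergence.
    """
    return silent_failures * (detect_s + reconverge_s) + false_trips * reconverge_s


def bfd_helps(silent_failures: int) -> bool:
    # Without BFD: rely on a protocol hold timer (assumed 90 s), no false trips.
    without_bfd = yearly_downtime_s(silent_failures, detect_s=90.0,
                                    reconverge_s=5.0)
    # With BFD: fast detection (assumed 150 ms), but 20 false trips a year.
    with_bfd = yearly_downtime_s(silent_failures, detect_s=0.15,
                                 reconverge_s=5.0, false_trips=20)
    return with_bfd < without_bfd


print("1 silent failure/year, BFD helps:", bfd_helps(1))
print("10 silent failures/year, BFD helps:", bfd_helps(10))
```

Under these assumptions BFD only pays off once silent failures are
frequent enough to outweigh the false trips, so the useful exercise is to
plug in your own measured numbers.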





-- 
  ++ytti
