nanog mailing list archives

Re: Cogent Layer 2


From: Ryan Hamel <ryan () rkhtech org>
Date: Thu, 15 Oct 2020 00:28:26 -0700

Saku,

My experience with multiple carriers is that reroutes take under a minute and rarely happen. I also have redundant 
backup circuits to another datacenter, so no traffic is truly lost. If an outage lasts longer than 5 minutes, or the 
circuit is flapping very frequently, then I call the carrier. Last-mile carriers install CPE at the sites, which makes 
BFD a requirement: the port facing us can stay up when the fiber uplink on the CPE goes down or there is an issue 
upstream.
As for security vulnerabilities, none can be leveraged from outside if the BFD sessions use internal IPs, and if they 
don't, a quick ACL can drop BFD traffic from unknown sources the same way BGP sessions are filtered.
In Juniper speak, the ACL would look like:
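(under policy-options)
prefix-list bgp_hosts {
    /* expands to every configured BGP neighbor address */
    apply-path "protocols bgp group <*> neighbor <*>";
}

(under firewall family inet(6) filter mgmt_acl)
/* 3784 = single-hop BFD control, 3785 = BFD echo, 4784 = multihop BFD */
/* in an inet6 filter, match on "next-header udp" instead of "protocol udp" */
term allow_bfd {
    from {
        protocol udp;
        destination-port [ 3784 3785 4784 ];
        source-prefix-list bgp_hosts;
    }
    then accept;
}
term deny_bfd {
    from {
        protocol udp;
        destination-port [ 3784 3785 4784 ];
    }
    then discard;
}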

Ryan
On Oct 14 2020, at 11:29 pm, Saku Ytti <saku () ytti fi> wrote:
On Thu, 15 Oct 2020 at 09:11, Ryan Hamel <ryan () rkhtech org> wrote:


Yep. Make sure you run BFD with your peering protocols, to catch outages very quickly.

Make sure you get higher availability with BFD than without it; it is easy to get this wrong and end up losing 
availability.

The first issue is that BFD has quite a lot of bug surface because, unlike most of your control-plane protocols, BFD 
is implemented in your NPU ucode when done right.
We've had the entire linecard down on ASR9k due to BFD, and there is a BFD-of-death packet you can send over the 
internet to crash a JNPR FPC.
When done in the control plane, poor scheduling can cause false positives more often than it protects from actual 
outages (CISCO7600).

Even in a world where BFD is perfect, you still need to consider what you are protecting yourself from. Say you bought 
Martini from someone and run your backbone over that Martini. What is an outage? Is your provider's IGP rerouting due 
to a backbone outage an outage to you? Or would you rather the provider converge their network while you don't 
reconverge and simply take the brief outage?
If provider rerouting is not an outage, you need to know what their SLA is regarding rerouting time and make BFD less 
aggressive than that. If provider rerouting is an outage, you can of course run timers as aggressive as you want, but 
you probably have lower availability than without BFD.
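
A rough sketch of the timer arithmetic behind "less aggressive than that", assuming BFD detection time is 
approximately the negotiated interval times the detect multiplier. The function names and the 1000 ms provider 
reroute figure are made up for illustration; take the real number from the provider's SLA.

```python
def bfd_detection_time_ms(interval_ms: int, multiplier: int) -> int:
    """Approximate BFD detection time: interval times detect multiplier."""
    return interval_ms * multiplier


def survives_provider_reroute(interval_ms: int, multiplier: int, reroute_ms: int) -> bool:
    """True if BFD should stay up while the provider reconverges.

    BFD only rides through the reroute if its detection time is longer
    than the time the provider needs to restore the path.
    """
    return bfd_detection_time_ms(interval_ms, multiplier) > reroute_ms


if __name__ == "__main__":
    # Hypothetical SLA: the provider reroutes within 1000 ms.
    provider_reroute_ms = 1000
    for interval_ms, multiplier in [(50, 3), (300, 3), (1000, 3)]:
        detect_ms = bfd_detection_time_ms(interval_ms, multiplier)
        ok = survives_provider_reroute(interval_ms, multiplier, provider_reroute_ms)
        verdict = "rides through the reroute" if ok else "declares the session down"
        print(f"{interval_ms} ms x {multiplier} = {detect_ms} ms: {verdict}")
```

If provider rerouting does count as an outage for you, the check is irrelevant and the timers can be as short as 
the platform reliably supports.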

Also, don't add complexity to solve problems you don't have. If you don't know if BFD improved your availability, you 
didn't need it.
Networking is full of belief practices; we do things because we believe they help, and faux data is often used to 
dress the beliefs up as science. The problem space tends to be complex and good-quality data is hard to come by, so 
we necessarily fly a lot by the seat of our pants, whether we admit it or not.
My belief is that the majority of BFD implementations in real life reduce availability on average; you need a 
frequently failing link which does not propagate link-down to reliably improve availability by deploying BFD.





--
++ytti


