nanog mailing list archives

Re: Arista “IP-SLA” / Active Probing


From: David Zimmerman via NANOG <nanog () nanog org>
Date: Fri, 22 Dec 2023 20:13:11 +0000

Hi, Alex.  If it helps, I've had a variant of this on our transit routers for enterprise purposes for a few years.  We 
run DFZ and originate 0/0 and ::/0 internally, but because we follow them to the nearest egress (0/0 using NAT for path 
symmetry, ::/0 using conditional advertisement for path symmetry), we want to only originate the internal default 
routes if the external peering at that egress is "healthy".  For IOS-XR, it effectively looks like this:

route-policy PS-TRANSIT-UP-IPV4
  if rib-has-route in TRANSIT-SUBNETS-V4 then
    pass
  endif
end-policy
!
route-policy PS-TRANSIT-UP-IPV6
  if rib-has-route in TRANSIT-SUBNETS-V6 then
    pass
  endif
end-policy

prefix-set TRANSIT-SUBNETS-V4
  212.123.212.184/30
end-set
!
prefix-set TRANSIT-SUBNETS-V6
  2001:920:3815::64/127
end-set

neighbor-group EBGP-CRT-IPV4
  ...
  address-family ipv4 unicast
  ...
   default-originate route-policy PS-TRANSIT-UP-IPV4

neighbor-group EBGP-CRT-IPV6
  ...
  address-family ipv6 unicast
   ...
   default-originate route-policy PS-TRANSIT-UP-IPV6

We keep those stanzas simple (is the direct link to the peer up, and therefore is the direct route in our RIB?), but 
depending on the platform you're using, you may have more knobs to check things.  For example, I can't directly check 
whether a BGP peer is up or down, though I can match on routes the peer has given us (or the lack thereof).
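
To make the decision in those stanzas concrete, here is a rough Python model of the check. It is only an illustration, not router code: the function name and the sample RIB contents are made up, and it assumes a prefix-set entry with no length modifiers matches only that exact prefix.

```python
import ipaddress

# Prefix sets copied from the IOS-XR config above.
TRANSIT_SUBNETS_V4 = ["212.123.212.184/30"]
TRANSIT_SUBNETS_V6 = ["2001:920:3815::64/127"]


def should_originate_default(rib_prefixes, prefix_set):
    """Model of 'if rib-has-route in <prefix-set> then pass'.

    Hypothetical helper: returns True when at least one RIB prefix is
    exactly equal to an entry in the prefix set (exact match is an
    assumption about entries written without length modifiers).
    """
    wanted = {ipaddress.ip_network(p) for p in prefix_set}
    return any(ipaddress.ip_network(p) in wanted for p in rib_prefixes)


# Illustrative RIB snapshots (invented for this sketch).
rib_link_up = ["10.0.0.0/8", "212.123.212.184/30"]  # connected route present
rib_link_down = ["10.0.0.0/8"]                      # connected route withdrawn

print(should_originate_default(rib_link_up, TRANSIT_SUBNETS_V4))
print(should_originate_default(rib_link_down, TRANSIT_SUBNETS_V4))
```

When the transit link goes down, the connected route leaves the RIB, the policy no longer passes, and the internal default stops being originated at that egress. Matching on a route learned from the peer instead would just mean putting that prefix in the set.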

-dp

From: NANOG <nanog-bounces+dzimmerman=linkedin.com () nanog org> on behalf of Alex Buie <abuie () cytracom com>
Date: Wednesday, December 20, 2023 at 10:45 AM
To: nanog () nanog org <nanog () nanog org>
Subject: Arista “IP-SLA” / Active Probing

Hello all,

We are trying to solve a requirement where we would like to test the viability of our paths to the internet and tear 
down the BGP session if a path is determined to be faulty. We had an issue recently where we did not lose link or BGP, 
but the carrier lost the ability to route traffic to the internet for us. Our existing automatic detection and 
remediation strategies failed to detect this condition, and we lost customer packets.

Conceptually, we have a pair of DCS7050-QX landing a fiber each from two ISPs with default routes on BGP at a dozen 
POPs around the US.

One of the ISPs is our primary transit, and one is predominantly for peered customers, but we can use it for transit 
during issues with the primary circuits.

I did some research on this, and it seems the best option available to us might be an on-boot event handler that 
launches a Python daemon, which actively probes out each ISP circuit and makes config changes in response to transit 
failures.
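
As a rough illustration of the daemon idea, the sketch below keeps the probing decision separate from the action taken. Everything in it is an assumption rather than a tested Arista implementation: the class and function names are invented, the thresholds are arbitrary, `ping_once` assumes a Linux iputils-style `ping` is available, and the config change itself is left as a caller-supplied callback.

```python
import subprocess
import time


def ping_once(target, source_ip, timeout_s=1):
    """Send one ICMP echo sourced from source_ip; True if a reply arrives.

    Assumes a Linux iputils-style ping (-c count, -W timeout, -I source).
    """
    cmd = ["ping", "-c", "1", "-W", str(timeout_s), "-I", source_ip, target]
    result = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0


class PathMonitor:
    """Tracks the health of one ISP path with simple hysteresis.

    A round is good if ANY target answers, so one flaky target does not
    trip the path. The state flips only after enough consecutive rounds
    disagree with it.
    """

    def __init__(self, targets, probe, fail_threshold=3, recover_threshold=5):
        self.targets = list(targets)
        self.probe = probe  # callable(target) -> bool
        self.fail_threshold = fail_threshold
        self.recover_threshold = recover_threshold
        self.healthy = True
        self._streak = 0  # consecutive rounds contradicting current state

    def poll(self):
        """Run one probe round; return 'down', 'up', or None (no change)."""
        round_ok = any(self.probe(t) for t in self.targets)
        if round_ok == self.healthy:
            self._streak = 0
            return None
        self._streak += 1
        needed = self.fail_threshold if self.healthy else self.recover_threshold
        if self._streak < needed:
            return None
        self.healthy = round_ok
        self._streak = 0
        return "up" if round_ok else "down"


def run(monitor, on_change, interval_s=5.0):
    """Poll forever, calling on_change('down' or 'up') on each transition."""
    while True:
        event = monitor.poll()
        if event:
            on_change(event)
        time.sleep(interval_s)
```

One monitor per ISP circuit would be built with a probe such as `lambda t: ping_once(t, "198.51.100.2")` (a placeholder source address) and a handful of targets that are only reachable through that carrier. The `on_change` callback is where the BGP session would be shut down or restored. Requiring more good rounds to recover than bad rounds to fail is a deliberate choice to avoid flapping the session.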

However, I thought I’d reach out to the broader community to see if there’s a better way to solve this, if anyone has 
an example script, or if anyone has recommendations for methods of active monitoring to protect against this sort of 
failure.

Thanks in advance for any insight and time.





Alex Buie
Senior Cloud Operations Engineer

450 Century Pkwy # 100 Allen, TX 75013
D: 469-884-0225 | www.cytracom.com
