nanog mailing list archives

Re: best practice for advertising peering fabric routes


From: Clay Fiske <clay () bloomcounty org>
Date: Wed, 15 Jan 2014 11:33:57 -0800

On Jan 15, 2014, at 10:26 AM, William Herrin <bill () herrin us> wrote:


> Of course working, monitorable and testable are three different
> things. If my NMS can't reach the IXP's addresses, my view of the IXP
> is impaired. And "the Internet is broken" is not a trouble report that
> leads to a successful outcome with customer support... it helps to be
> able to pin things down with some specificity.

This approach concerns me for a number of reasons.

First, having your NMS ping your upstream’s IXP peers probably doesn’t scale. If I’m a peer of a reasonably large 
provider, I’m pretty sure I don’t want all their customers hammering my management plane. Even if you’re the only one 
doing it, you also don’t know if I’m rate-limiting pings for that or any other reason.

Second, what information do you get that you didn’t already have? If you saw the IP in a traceroute then you know it 
exists, is alive, and is in the path, and you have a rough estimate of the latency. Pinging it may even give you negative 
information. Platforms vary and all, but in my experience pinging a router, especially a potentially busy one peering 
at an IXP, shows notably worse performance than “real” traffic experiences (admittedly this is somewhat true of TTL 
Expired responses too, though less so). Now you’re potentially seeing high latency and packet loss which in reality 
might not even be there at all.

Third, you don’t know that your ping to the peering IP is even taking the same path as the packets addressed to the 
real destination. MTR for example looks nice, but it would probably be more accurate if it simply ran the traceroute 
over and over instead of pinging each hop directly. You would also detect path changes for the real destination that 
pinging intermediate hops wouldn’t show you.
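
To illustrate the point about path changes, here is a minimal sketch that compares the hop lists from repeated traceroute runs toward the real destination. It assumes the hop lists have already been collected somehow (for example by parsing traceroute output); the function name `path_changes` and the addresses are hypothetical and use documentation ranges.

```python
from typing import List, Tuple

Path = List[str]  # ordered hop addresses from one traceroute run


def path_changes(runs: List[Path]) -> List[Tuple[int, Path, Path]]:
    """Return (run_index, previous_path, current_path) for every run
    whose hop list differs from the run immediately before it."""
    changes = []
    for i in range(1, len(runs)):
        if runs[i] != runs[i - 1]:
            changes.append((i, runs[i - 1], runs[i]))
    return changes


# Hypothetical data: three runs toward the same destination.
runs = [
    ["192.0.2.1", "198.51.100.7", "203.0.113.9"],
    ["192.0.2.1", "198.51.100.7", "203.0.113.9"],
    ["192.0.2.1", "198.51.100.42", "203.0.113.9"],  # middle hop moved
]

for index, before, after in path_changes(runs):
    print(f"run {index}: path changed from {before} to {after}")
```

Pinging 198.51.100.7 directly would keep reporting it as reachable after the third run, even though traffic to the destination no longer goes through it.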

While I appreciate the desire to be able to do as much of your own detective work as possible, I can also see where 
you’re now shifting workload onto someone else’s support organization when they’re not necessarily the problem either 
(“Hey, my NMS says your peering router is causing latency and packet loss, fix it!”).

I’m also not saying there isn’t a troubleshooting gap caused by this. I’m just not sure being able to ping the IXP hop 
solves that problem either.


Semi-related tangent: Working in an IXP setting I have seen weird corner cases cause issues in conjunction with the IXP 
subnet existing in BGP. Say someone’s got proxy ARP enabled on their router (sadly, more common than it should be, and 
not just from noobs at startups). Now say your IXP is growing and you expand the subnet. No matter how much you harp on 
the customers to make the change, they don’t all do it at once. Someone announces the new, larger subnet in BGP. Now 
when anyone ARPs for IPs in the new part of the range, proxy ARP guy (still on the smaller subnet) says “hey I have a 
route for that, send it here”. That was fun to troubleshoot. :)
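
For anyone who wants to see why that happens, here is a rough sketch of the decision using Python's `ipaddress` module. It is a deliberately simplified model of proxy ARP (real implementations apply further checks), and the prefixes and the function name `proxy_arp_would_answer` are made up for illustration, not taken from the actual incident.

```python
import ipaddress


def proxy_arp_would_answer(target, connected, routes):
    """Simplified model: a router with proxy ARP enabled answers an ARP
    request when the target is NOT on what it believes is the connected
    subnet, but it does have some other route covering the target."""
    ip = ipaddress.ip_address(target)
    if ip in connected:
        # On-link as far as this router knows; the real owner answers.
        return False
    return any(ip in net for net in routes)


# Hypothetical addressing (documentation range).
old_lan = ipaddress.ip_network("198.51.100.0/25")  # stale interface config
new_lan = ipaddress.ip_network("198.51.100.0/24")  # expanded subnet, seen in BGP

# A peer in the new half of the range: outside the stale /25, inside the /24 route.
print(proxy_arp_would_answer("198.51.100.200", old_lan, [new_lan]))

# A peer in the old half: still on-link for the stale router, so it stays quiet.
print(proxy_arp_would_answer("198.51.100.10", old_lan, [new_lan]))
```

Without the larger subnet in the routing table (an empty `routes` list), the stale router has no reason to answer for the new half at all, which is why the problem only shows up once the IXP subnet is in BGP.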


-c


