nanog mailing list archives

Re: interesting troubleshooting


From: Saku Ytti <saku () ytti fi>
Date: Sat, 21 Mar 2020 00:07:16 +0200

Hey Nimrod,

> I was contacted by my NOC to investigate a LAG that was not distributing
> traffic evenly among the members to the point where one member was
> congested while the utilization on the LAG was reasonably low. Looking at
> my netflow data, I was able to confirm that this was caused by a single
> large flow of ESP traffic. Fortunately, I was able to shift this flow to
> another path that had enough headroom available so that the flow could be
> accommodated on a single member link.
>
> With the increase in remote workers and VPN traffic that won't hash across
> multiple paths, I thought this anecdote might help someone else track down
> a problem that might not be so obvious.

This problem is known as an elephant flow. Some vendors have a solution
for this: they dynamically monitor utilisation and remap the
hashResult => egressInt table to create a bias that offsets the elephant
flow.
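
The remapping idea can be sketched roughly as follows. This is only an
illustration of the general technique, not any vendor's implementation:
the function names, the bucket count, and the per-bucket load figures are
all made up for the example. It assumes the device keeps a table from
hash bucket to member link and can measure load per bucket.

```python
def member_load(table, bucket_load, n_members):
    """Sum the measured load of every hash bucket mapped to each member."""
    load = [0] * n_members
    for bucket, member in enumerate(table):
        load[member] += bucket_load[bucket]
    return load


def rebalance(table, bucket_load, n_members):
    """Hypothetical adaptive remap: move small buckets off the busiest member.

    table[b] is the member link that hash bucket b egresses on.
    A bucket is moved only if the move leaves the coldest member below the
    current load of the hottest one, so every move narrows the imbalance.
    """
    table = list(table)
    while True:
        load = member_load(table, bucket_load, n_members)
        hot = max(range(n_members), key=load.__getitem__)
        cold = min(range(n_members), key=load.__getitem__)
        # Buckets on the hot member, smallest first; idle buckets are skipped.
        movable = sorted(
            (b for b, m in enumerate(table) if m == hot and bucket_load[b] > 0),
            key=bucket_load.__getitem__,
        )
        pick = next(
            (b for b in movable if load[cold] + bucket_load[b] < load[hot]),
            None,
        )
        if pick is None:
            return table
        table[pick] = cold


# Assumed example: 8 buckets over 2 members, bucket 0 carries the elephant.
table = [0, 1, 0, 1, 0, 1, 0, 1]
bucket_load = [90, 5, 5, 5, 5, 5, 5, 5]
new_table = rebalance(table, bucket_load, 2)
print(member_load(table, bucket_load, 2))      # [105, 20]
print(member_load(new_table, bucket_load, 2))  # [90, 35]
```

Note that the elephant itself stays on one member, since a single flow
cannot be split without reordering. The bias only moves the other buckets
away from it.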

One particular example:
https://www.juniper.net/documentation/en_US/junos/topics/reference/configuration-statement/adaptive-edit-interfaces-aex-aggregated-ether-options-load-balance.html

Ideally VPN providers would be defensive and would use the UDP source
port (SPORT) for entropy, like MPLSoUDP does.
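
A minimal sketch of the source-port entropy idea, assuming the tunnel is
carried in UDP and the encapsulator can see the inner flow key. The
function name and the flow-key format are hypothetical, CRC32 is just a
stand-in hash, and the 49152-65535 range follows the dynamic port range
that RFC 7510 recommends for MPLS-in-UDP.

```python
import zlib

PORT_BASE = 49152   # start of the dynamic/private port range
PORT_SPAN = 16384   # 49152..65535 gives 14 bits of entropy


def entropy_sport(inner_flow_key: bytes) -> int:
    """Derive an outer UDP source port from a hash of the inner flow.

    Packets of the same inner flow always get the same port (no
    reordering), while different inner flows spread over many ports, so
    transit routers hashing on the outer 5-tuple can load balance them.
    """
    return PORT_BASE + (zlib.crc32(inner_flow_key) % PORT_SPAN)


# Assumed flow-key format, for illustration only: src|dst|proto|sport|dport
flows = [f"10.0.0.{i}|192.0.2.1|6|{1024 + i}|443".encode() for i in range(100)]
ports = [entropy_sport(f) for f in flows]
print(len(set(ports)), "distinct outer source ports for", len(flows), "inner flows")
```

Transit routers then see many outer 5-tuples instead of one, without
needing to look inside the tunnel.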

-- 
  ++ytti

