nanog mailing list archives

Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey


From: Yang Yu <yang.yu.list () gmail com>
Date: Fri, 9 Jul 2021 16:16:41 -0500

On Thu, Jul 8, 2021 at 4:03 PM William Herrin <bill () herrin us> wrote:
> On Thu, Jul 8, 2021 at 5:31 AM Saku Ytti <saku () ytti fi> wrote:
>> Network experiences gray failures all the time, and I almost never
>> care, unless a customer does.
>
> I would suggest that your customer does care, but as there is no
> simple test to demonstrate gray failures, your customer rarely makes
> it past first tier support to bring the issue to your attention and
> gives up trying. Indeed, name the networks with the worst reputations
> around here and many of them have those reputations because of a
> routine, uncorrected state of gray failure.

Networks originating/receiving the traffic tend to have more
incentives to resolve these issues, which might not be so rare.

If you have connection/application level health metrics (e.g. TLS
handshake failures, TCP retransmits), identifying that a problem
exists is not too difficult. Having health metrics associated with
network paths can greatly simplify repro. Then it's mostly
troubleshooting datapath issues on your favorite platform.

It takes quite some effort to figure out and collect the relevant
metrics and present them in a usable way. Something like "connections
from PoP A to destination ASN/prefix (via interface X) saw the TLS
handshake failure rate increase from 0.02% to 1%" is a good starting
point for troubleshooting (it may or may not be a network issue, but
the origin/receiver probably wants to fix it regardless).
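
As a rough sketch of how that kind of per-path signal could be
computed: the record layout (PoP, destination ASN, interface,
handshake result), the function names, and the thresholds below are
all made up for illustration and not taken from any particular
system.

```python
from collections import defaultdict


def failure_rates(records):
    """Aggregate TLS handshake results per network path.

    `records` is an iterable of hypothetical tuples:
    (pop, dest_asn, interface, handshake_ok).
    Returns {(pop, dest_asn, interface): (failure_rate, total)}.
    """
    counts = defaultdict(lambda: [0, 0])  # path -> [failures, total]
    for pop, dest_asn, interface, handshake_ok in records:
        entry = counts[(pop, dest_asn, interface)]
        entry[1] += 1
        if not handshake_ok:
            entry[0] += 1
    return {path: (fails / total, total)
            for path, (fails, total) in counts.items()}


def flag_regressions(baseline, current,
                     min_samples=1000, ratio=10.0, floor=0.001):
    """List paths whose failure rate rose sharply versus a baseline.

    A path is flagged when it has at least `min_samples` connections,
    its current rate is at least `floor`, and the rate is at least
    `ratio` times the baseline rate. All three values are arbitrary
    illustrative defaults.
    """
    flagged = []
    for path, (rate, total) in current.items():
        if total < min_samples:
            continue  # too few connections to trust the rate
        base_rate = baseline.get(path, (0.0, 0))[0]
        if rate >= floor and rate >= ratio * base_rate:
            flagged.append((path, base_rate, rate))
    # Worst current failure rate first.
    return sorted(flagged, key=lambda item: item[2], reverse=True)
```

Feeding it one window of records as the baseline and a later window
as current gives a list of (path, old rate, new rate) entries, which
is roughly the "PoP A to ASN via interface X went from 0.02% to 1%"
statement above. The minimum sample count is there so low-traffic
paths don't generate noise.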

Things can get more complicated when traffic crosses network
boundaries with things you don't have visibility into (IX fabric,
remote peering, another network's optical systems, complicated setups
like stateful firewall / MC-LAG).

