nanog mailing list archives
Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey
From: Yang Yu <yang.yu.list () gmail com>
Date: Fri, 9 Jul 2021 16:16:41 -0500
On Thu, Jul 8, 2021 at 4:03 PM William Herrin <bill () herrin us> wrote:
> On Thu, Jul 8, 2021 at 5:31 AM Saku Ytti <saku () ytti fi> wrote:
>> Network experiences gray failures all the time, and I almost never
>> care, unless a customer does.
>
> I would suggest that your customer does care, but as there is no
> simple test to demonstrate gray failures, your customer rarely makes
> it past first tier support to bring the issue to your attention and
> gives up trying. Indeed, name the networks with the worst reputations
> around here and many of them have those reputations because of a
> routine, uncorrected state of gray failure.
Networks originating or receiving the traffic tend to have more incentive to resolve these issues, which might not be so rare.

If you have connection- or application-level health metrics (e.g. TLS handshake failures, TCP retransmits), identifying that a problem exists is not too difficult. Having health metrics associated with network paths can greatly simplify repro. Then it's mostly troubleshooting datapath issues on your favorite platform.

It takes quite some effort to figure out and collect the relevant metrics and present them in a usable way. Something like "connections from PoP A to destination ASN/prefix (via interface X) had the TLS handshake failure rate increase from 0.02% to 1%" is a good starting point for troubleshooting (it may or may not be a network issue; the origin/receiver probably wants to fix it regardless).

Things can get more complicated when traffic crosses network boundaries with things you don't have visibility into (IX fabric, remote peering, another network's optical systems, complicated setups like stateful firewall / MC-LAG).
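
To make the per-path metric idea concrete, here is a rough Python sketch, not a description of any particular production system. It assumes you can already export one record per TLS handshake, tagged with the PoP, destination ASN and egress interface. The names PathKey, failure_rates and flag_regressions, and the threshold values, are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PathKey:
    """Hypothetical path label: where the connection left from and went to."""
    pop: str
    dest_asn: int
    interface: str


def failure_rates(records):
    """Compute the TLS handshake failure rate per path.

    records: iterable of (PathKey, handshake_ok) tuples.
    Returns a dict mapping PathKey -> failure rate in [0, 1].
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for key, handshake_ok in records:
        totals[key] += 1
        if not handshake_ok:
            failures[key] += 1
    return {key: failures[key] / totals[key] for key in totals}


def flag_regressions(baseline, current, min_ratio=10.0, min_rate=0.005):
    """Return paths whose failure rate rose sharply against the baseline.

    A path is flagged when its current rate is at least min_rate and at
    least min_ratio times its baseline rate. Both thresholds are
    illustrative assumptions and would need tuning per network.
    """
    flagged = []
    for key, rate in current.items():
        base = baseline.get(key, 0.0)
        if rate >= min_rate and rate >= min_ratio * base:
            flagged.append((key, base, rate))
    # Worst current failure rate first.
    return sorted(flagged, key=lambda item: item[2], reverse=True)
```

Feeding it a baseline window and a current window gives a short list of (path, old rate, new rate) entries, i.e. the "PoP A to ASN via interface X went from 0.02% to 1%" statement above. A real deployment would also want minimum sample counts per path so that low-traffic paths do not trigger on a handful of failures.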