nanog mailing list archives

Do you care about "gray" failures? Can we (network academics) help? A 10-min survey


From: "Vanbever Laurent" <lvanbever () ethz ch>
Date: Thu, 8 Jul 2021 11:57:14 +0000

Dear NANOG,

Detecting whole-link and node failures is relatively easy nowadays (e.g., using BFD). But what about detecting gray 
failures that only affect a *subset* of the traffic, e.g. a router randomly dropping 0.1% of the packets? Does your 
network often experience these gray failures? Are they problematic? Do you care? And can we (network researchers) do 
anything about it?

Please help us find out by answering our short (<10 minutes), anonymous survey.

Survey URL: https://forms.gle/v99mBNEPrLjcFCEu8

## Context:

When we think about network failures, we often think about a link or a network device going down. These failures are 
"obvious" in that *all* the traffic crossing the corresponding resource is dropped. But network failures can also be 
more subtle and only affect a *subset* of the traffic (e.g. 0.01% of the packets crossing a link/router). These 
failures are commonly referred to as "gray" failures. Because they don't drop *all* the traffic, gray failures are much 
harder to detect.

Many studies have revealed that cloud and datacenter networks routinely suffer from gray failures and, as such, many 
techniques exist to track them down in these environments (see, e.g., this study from Microsoft Azure: 
https://www.microsoft.com/en-us/research/wp-content/uploads/2017/06/paper-1.pdf). What is less well known, though, is 
how much gray failures affect *other* types of networks such as Internet Service Providers (ISPs), Wide Area Networks 
(WANs), or enterprise networks. While the bug reports submitted to popular routing vendors (Cisco, Juniper, etc.) 
suggest that gray failures are pervasive and hard to catch in all networks, we would love to know more about 
first-hand experiences.

## About the survey:

The questionnaire is intended for network operators. It has a total of 15 questions and should take at most 10 minutes 
to complete. The survey and the collected data are totally anonymous (so please do not include information that may 
help to identify you or your organization). All questions are optional, so if you don't like a question or don't know 
the answer, just skip it.

Thank you so much in advance, and we look forward to reading your responses!

Laurent Vanbever, ETH Zurich

PS: Of course, we would be extremely grateful if you could forward this email to any operator you might know who may 
not read NANOG (assuming those even exist? :-) )!
