nanog mailing list archives

Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey


From: "Vanbever Laurent" <lvanbever () ethz ch>
Date: Thu, 8 Jul 2021 13:13:57 +0000


On 8 Jul 2021, at 14:29, Saku Ytti <saku () ytti fi> wrote:

On Thu, 8 Jul 2021 at 15:00, Vanbever Laurent <lvanbever () ethz ch> wrote:

Detecting whole-link and node failures is relatively easy nowadays (e.g., using BFD). But what about detecting gray 
failures that only affect a *subset* of the traffic, e.g. a router randomly dropping 0.1% of the packets? Does your 
network often experience these gray failures? Are they problematic? Do you care? And can we (network researchers) do 
anything about it?

Network experiences gray failures all the time, and I almost never
care, unless a customer does. If there is a network which does not
experience these, then it's likely due to lack of visibility rather
than issues not existing.

Fixing these can take months of working with vendors and attempts to
remedy will usually cause planned or unplanned outages. So it rarely
makes sense to try to fix as they usually impact a trivial amount of
traffic.

Thanks for chiming in. That's also my feeling: a *lot* of gray failures routinely happen, and a small percentage of them 
end up being really damaging (the ones hitting customer traffic, as you pointed out). For this small percentage though, 
I can imagine that being able to detect and locate them rapidly (i.e. before the customer submits a ticket) would be 
interesting? Even if fixing the root cause might take months (since it is up to the vendors), one could still hope 
to remediate the situation temporarily by rerouting traffic, combined with the traditional rebooting of the affected 
resources?
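
As a rough illustration of why rapid detection is not trivial at a 0.1% drop rate, here is a minimal sketch of how many probe packets one would need before seeing at least one loss. It assumes drops are independent and random and that every probe actually crosses the faulty element; the function names are hypothetical and only for illustration.

```python
import math


def detection_probability(drop_rate: float, probes: int) -> float:
    """Probability of observing at least one drop among `probes` packets,
    assuming each packet is dropped independently with `drop_rate`."""
    return 1.0 - (1.0 - drop_rate) ** probes


def probes_needed(drop_rate: float, confidence: float) -> int:
    """Smallest number of probes such that at least one drop is observed
    with probability >= `confidence` (same independence assumption)."""
    if not (0.0 < drop_rate < 1.0 and 0.0 < confidence < 1.0):
        raise ValueError("drop_rate and confidence must be in (0, 1)")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - drop_rate))


if __name__ == "__main__":
    # Example from this thread: a router randomly dropping 0.1% of packets.
    n = probes_needed(0.001, 0.95)
    print(n, detection_probability(0.001, n))
```

Under these assumptions, roughly three thousand probes through the affected path are needed just to see a single loss with 95% confidence, and more to tell it apart from ordinary background loss.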

Networks also routinely mangle packets in-memory which are not visible
to FCS check.

Added to the list... Thanks!

Best,
Laurent
