nanog mailing list archives

Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey


From: Warren Kumari <warren () kumari net>
Date: Fri, 9 Jul 2021 08:51:52 -0400

On Thu, Jul 8, 2021 at 5:04 PM William Herrin <bill () herrin us> wrote:

On Thu, Jul 8, 2021 at 5:31 AM Saku Ytti <saku () ytti fi> wrote:
Network experiences gray failures all the time, and I almost never
care, unless a customer does.

Greetings,

I would suggest that your customer does care, but as there is no
simple test to demonstrate gray failures, your customer rarely makes
it past first tier support to bring the issue to your attention and
gives up trying. Indeed, name the networks with the worst reputations
around here and many of them have those reputations because of a
routine, uncorrected state of gray failure.

To answer Laurent's question:

Yes, gray failures are a regular problem. Yes, most of us care. And
for the most part we don't have particularly good ways to detect and
isolate the problems, let alone fix them.

Depending on the actual failure mode, and the architecture of the
device itself, one technique is to run test traffic through the
box/path/whatever while twiddling the source and destination ports,
and sometimes the source IP as well.
This sometimes helps find the issue if there is a bad interface in a
LAG, or in a device which sprays packets/cells across an internal
fabric, etc. If you are really lucky you can convince the vendor to
share how they spray/hash (or at least demonstrate a deterministic
failure, and hopefully they can run the hash and tell you which of the
N fabric cards is broken).
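
A minimal sketch of the port-twiddling idea above, not anyone's production tooling. It assumes a UDP echo responder is reachable on the far side of the suspect box/path, and that the device hashes on source port (that is vendor-specific and may not hold). The function names probe_flow, sweep and classify are made up for illustration.

```python
import socket
from collections import defaultdict


def probe_flow(dst, src_port, payload, timeout=0.5, src_ip=""):
    """Send one UDP probe from (src_ip, src_port) to dst.

    Returns True if the payload is echoed back, False on timeout,
    and None if the source port could not be bound locally.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        try:
            sock.bind((src_ip, src_port))
        except OSError:
            return None  # port busy on this host; says nothing about the path
        sock.settimeout(timeout)
        sock.sendto(payload, dst)
        try:
            while True:
                data, _ = sock.recvfrom(2048)
                if data == payload:  # ignore late echoes of earlier probes
                    return True
        except socket.timeout:
            return False


def sweep(dst, src_ports, rounds=3, timeout=0.5, src_ip=""):
    """Probe every source port `rounds` times; return {port: [replies, sent]}."""
    ports = list(src_ports)
    results = defaultdict(lambda: [0, 0])
    for round_no in range(rounds):
        for port in ports:
            payload = f"probe {port} {round_no}".encode()
            ok = probe_flow(dst, port, payload, timeout, src_ip)
            if ok is None:
                continue  # skipped, not counted as sent
            results[port][1] += 1
            results[port][0] += int(ok)
    return dict(results)


def classify(results):
    """Split flows into always/sometimes/never failing.

    A stable "always_fail" set is the deterministic failure worth
    showing to a vendor; "sometimes_fail" is more likely plain loss.
    """
    groups = {"always_fail": [], "sometimes_fail": [], "never_fail": []}
    for port, (replies, sent) in sorted(results.items()):
        if replies == 0:
            groups["always_fail"].append(port)
        elif replies < sent:
            groups["sometimes_fail"].append(port)
        else:
            groups["never_fail"].append(port)
    return groups
```

Usage would look something like classify(sweep(("192.0.2.10", 7), range(40000, 40064))), where 192.0.2.10 is a placeholder documentation address and port 7 assumes an echo service is running there. If the same source ports land in always_fail run after run, that is the deterministic pattern to hand to the vendor. Varying the destination port or source IP works the same way with a wider loop.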

Hopefully you noticed the number of weasel words in there...

W



When it's not a clean
failure we really are driven by: customer says blank is broken, often
followed by grueling manual effort just to duplicate the problem
within our view.

Can network researchers do anything about it? Maybe. Because of the
end to end principle, only the endpoints understand the state of the
connection and they don't know the difference between capacity and error. They
mostly process that information locally sharing only limited
information with the other endpoint. Which means there's not much
passing over the wire for the middle to examine and learn that there's
a problem... and when there is, it often takes correlating multiple
packets to understand that a problem exists, which is not feasible in
the stateless middle with asymmetric routing. The middle can only
look at its immediate link stats which, when there's a bug, are
misleading.

What would you change to dig us out of this hole?

Regards,
Bill Herrin


--
William Herrin
bill () herrin us
https://bill.herrin.us/



-- 
The computing scientist’s main challenge is not to get confused by the
complexities of his own making.
  -- E. W. Dijkstra

