nanog mailing list archives

Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey


From: Saku Ytti <saku () ytti fi>
Date: Thu, 8 Jul 2021 16:53:33 +0300

On Thu, 8 Jul 2021 at 16:13, Vanbever Laurent <lvanbever () ethz ch> wrote:

Thanks for chiming in. That's also my feeling: a *lot* of gray failures routinely happen, a small percentage of which
end up being really damaging (the ones hitting customer traffic, as you pointed out). For this small percentage
though, I can imagine that being able to detect / locate them rapidly (i.e. before the customer submits a ticket) would be
interesting? Even if fixing the root cause might take months (since it is up to the vendors), one could still hope
to remediate the situation transiently by rerouting traffic, combined with the traditional rebooting of the
affected resources?

One method is collecting lookup exceptions. We scrape these:

npu_triton_trapstats.py:    command = "start shell sh command \"for fpc in $(cli -c 'show chassis fpc' | grep Online | awk '{print $1;}'); do echo FPC$fpc; vty -c 'show cda trapstats' fpc$fpc; done\""
ptx1k_trapstats.py:    command = "start shell sh command \"for fpc in $(cli -c 'show chassis fpc' | grep Online | awk '{print $1;}'); do echo FPC$fpc; vty -c 'show pechip trapstats' fpc$fpc; done\""
asr9k_npu_counters.py:    command = "show controllers np counters all"
junos_trio_exceptions.py:    command = "show pfe statistics exceptions"
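
For illustration, a minimal sketch of what such a scraper might look like in Python (the actual scripts are not shown in this thread; the ssh invocation, the host argument and the "name  value" counter-line format are assumptions):

#!/usr/bin/env python3
# Hypothetical sketch, not the actual npu_triton_trapstats.py.
import re
import subprocess

# Same shell one-liner as above: walk every online FPC and dump its trapstats.
CMD = ("start shell sh command \"for fpc in "
       "$(cli -c 'show chassis fpc' | grep Online | awk '{print $1;}'); "
       "do echo FPC$fpc; vty -c 'show cda trapstats' fpc$fpc; done\"")

def scrape(host):
    """Return {fpc: {counter_name: value}} parsed from the raw CLI output."""
    raw = subprocess.run(["ssh", host, CMD],
                         capture_output=True, text=True, check=True).stdout
    counters, fpc = {}, None
    for line in raw.splitlines():
        if line.startswith("FPC"):
            fpc = line.strip()
            counters[fpc] = {}
        elif fpc:
            # assume "counter name   12345"-style lines; anything else is skipped
            m = re.match(r"\s*(\S.*?)\s+(\d+)\s*$", line)
            if m:
                counters[fpc][m.group(1)] = int(m.group(2))
    return counters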

No need for ML or AI, as trivial heuristics like 'which counter is
incrementing here but isn't incrementing elsewhere' or 'which counter
is not incrementing here but is incrementing elsewhere' surface a lot
of real problems, and capturing those exceptions and reviewing them
confirms it.
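
A minimal sketch of that heuristic, assuming you already have per-FPC or per-device counter deltas (e.g. from the scrape above); the function and field names are made up:

def odd_counters(deltas):
    """deltas: {device: {counter: increment since last poll}}.
    Flag counters moving on exactly one device but flat everywhere else,
    and counters flat on exactly one device but moving everywhere else."""
    names = set()
    for counts in deltas.values():
        names.update(counts)
    findings = []
    for name in sorted(names):
        moving = [d for d in deltas if deltas[d].get(name, 0) > 0]
        flat = [d for d in deltas if deltas[d].get(name, 0) == 0]
        if len(moving) == 1 and len(flat) >= 2:
            findings.append((name, "only incrementing on", moving[0]))
        elif len(flat) == 1 and len(moving) >= 2:
            findings.append((name, "only flat on", flat[0]))
    return findings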

We do not use these to proactively find problems, as that would lead to
poorer overall availability. But we regularly use them to expedite
time to resolution.
Very recently we had Tomahawk (EZchip) reset the whole linecard;
looking at the counters and identifying one which was incrementing but
likely should not be yielded the problem. The customer was sending us IP
packets where the ethernet header and the IP header up to the total
length field were missing on the wire. This accidentally fuzzed the NPU
ucode, periodically triggering an NPU bug which causes a total LC reload
when it happens often enough.

Networks also routinely mangle packets in memory, in ways that are not
visible to the FCS check.

Added to the list... Thanks!

The only way I know of to try to find these memory corruptions is to
look at the egress PE device's backbone-facing interface and see if
there are IP checksum errors.
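
For reference, the check behind those interface counters is the RFC 1071 one's-complement sum over the IPv4 header: a packet corrupted in a router's memory goes back out with a freshly computed, valid FCS, but the IP header checksum set by the sender (only adjusted for TTL along the way) can still expose corruption of the header. A minimal verification sketch in Python:

def ipv4_header_checksum_ok(header):
    """True if the one's-complement sum of the IPv4 header bytes
    (checksum field included) folds to 0xFFFF, i.e. the header is intact."""
    if len(header) < 20 or len(header) % 2:
        return False
    total = 0
    for i in range(0, len(header), 2):
        total += (header[i] << 8) | header[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total == 0xFFFF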

--
  ++ytti

