nanog mailing list archives
Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey
From: "Vanbever Laurent" <lvanbever () ethz ch>
Date: Thu, 8 Jul 2021 14:59:23 +0000
> One method is collecting lookup exceptions. We scrape these:
>
>   npu_triton_trapstats.py:
>     command = "start shell sh command \"for fpc in $(cli -c 'show chassis fpc' | grep Online | awk '{print $1;}'); do echo FPC$fpc; vty -c 'show cda trapstats' fpc$fpc; done\""
>   ptx1k_trapstats.py:
>     command = "start shell sh command \"for fpc in $(cli -c 'show chassis fpc' | grep Online | awk '{print $1;}'); do echo FPC$fpc; vty -c 'show pechip trapstats' fpc$fpc; done\""
>   asr9k_npu_counters.py:
>     command = "show controllers np counters all"
>   junos_trio_exceptions.py:
>     command = "show pfe statistics exceptions"
>
> No need for ML or AI, as trivial algorithms like 'what counter is incrementing which isn't incrementing elsewhere' or 'what counter is not incrementing which is incrementing elsewhere' show a lot of real problems, and capturing those exceptions and reviewing them confirms the problems.
>
> We do not use these to proactively find problems, as that would lead to poorer overall availability. But we regularly use them to shorten time to resolution.
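The 'incrementing here but not elsewhere' comparison described above can be sketched roughly as follows. This is only an illustration, not the poster's actual tooling: it assumes the scraped output has already been parsed into two snapshots of the form {device: {counter_name: value}}, and the names incrementing and odd_ones_out, as well as the device and counter names in any usage, are hypothetical.

```python
from typing import Dict, Set

Counters = Dict[str, int]        # counter name -> current value
Snapshot = Dict[str, Counters]   # device or FPC name -> its counters


def incrementing(before: Snapshot, after: Snapshot) -> Dict[str, Set[str]]:
    """Per device, the counters whose value grew between the two snapshots.

    Only devices present in both snapshots are compared; a counter that is
    missing from the earlier snapshot is treated as having been 0.
    """
    result: Dict[str, Set[str]] = {}
    for dev in before.keys() & after.keys():
        old = before[dev]
        result[dev] = {
            name for name, value in after[dev].items()
            if value > old.get(name, 0)
        }
    return result


def odd_ones_out(before: Snapshot, after: Snapshot) -> Dict[str, Dict[str, Set[str]]]:
    """Flag devices whose set of incrementing counters differs from their peers.

    only_here:    incrementing on this device and on no other device
    missing_here: incrementing on every other device but not on this one
    """
    inc = incrementing(before, after)
    findings: Dict[str, Dict[str, Set[str]]] = {}
    for dev, mine in inc.items():
        others = [counters for d, counters in inc.items() if d != dev]
        if not others:
            continue  # nothing to compare against
        anywhere_else = set().union(*others)
        everywhere_else = set.intersection(*others)
        only_here = mine - anywhere_else
        missing_here = everywhere_else - mine
        if only_here or missing_here:
            findings[dev] = {"only_here": only_here, "missing_here": missing_here}
    return findings
```

In practice the comparison probably only makes sense between devices of the same platform and role, since counter names and normal traffic mix differ between them; that grouping is left out of the sketch.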
Thanks for sharing! I guess this process working means the counters are "standard" / close enough across vendors to allow for comparisons?
> Very recently we had a Tomahawk (EZchip) reset the whole linecard, and looking through the counters for one that was incrementing but likely should not have been revealed the problem. A customer was sending us IP packets in which the Ethernet header and the IP header up to the total length field were missing on the wire. This accidentally fuzzed the NPU ucode, periodically triggering an NPU bug that causes a full LC reload when it happens often enough.
> Networks also routinely mangle packets in memory, in ways that are not visible to the FCS check.

Added to the list... Thanks!

> The only way I know to try to find these memory corruptions is to look at the egress PE device's backbone-facing interface and see if there are IP checksum errors.
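For readers less familiar with why the IP checksum catches what the FCS misses: the FCS is recomputed on every hop, so a packet corrupted in a router's memory leaves with a valid FCS, while the IPv4 header checksum was computed before the corruption and may no longer match. A minimal sketch of the check itself, following the ones' complement sum of RFC 1071, is below. The function name is hypothetical, and this only illustrates what a receiving device verifies; it is not how the interface counters mentioned above are read. Note that it covers the IPv4 header only, so payload corruption would go unnoticed by this check.

```python
import struct


def ipv4_header_checksum_ok(header: bytes) -> bool:
    """Return True if the IPv4 header's checksum field is consistent.

    Sums the header as 16-bit big-endian words using ones' complement
    arithmetic; a header with a correct checksum sums to 0xFFFF.
    """
    if len(header) < 20:
        return False
    ihl = (header[0] & 0x0F) * 4      # header length in bytes
    if ihl < 20 or len(header) < ihl:
        return False
    total = 0
    for (word,) in struct.iter_unpack("!H", header[:ihl]):
        total += word
    while total >> 16:                # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return total == 0xFFFF
```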