nanog mailing list archives

Re: Flapping Transport


From: Jared Mauch <jared () puck nether net>
Date: Tue, 1 Aug 2023 14:37:51 -0400



On Aug 1, 2023, at 2:18 PM, Mike Hammett <nanog () ics-il net> wrote:

I have a wave transport vendor that suffered issues twice, about ten days apart, causing my link to flap a bunch. I 
put in a ticket on the second set of occurrences. I was told that a card issue had been identified and that I would be 
notified when the replacement happened. Ticket closed.

Three weeks later, I opened a new ticket asking for the status. The new card arrived the next day, but since no more 
flaps were happening, the card would not be replaced. Ticket closed.


A) It doesn't seem like they actually did anything to fix the circuit.
B) They admitted a problem and sent a new card.
C) They later decided to not do anything.


Is that normal?
Is that acceptable?


To avoid the issues flapping causes, I disabled that circuit until repaired, but it seems like they're not going to do 
anything, and I only know that because I asked.


With passive components like amplifiers and such, I get that. Or they might have had someone do work that they don’t 
want to fess up to (which is kinda silly).

I have our Junipers configured with a 5 second up timer, e.g.: "hold-time up 5000"

This way a flapping circuit must be stable for at least a few seconds before it can be placed back into service; 
otherwise a prefix that comes from connected/direct/static/qualified-next-hop can be redistributed into another 
protocol and possibly cause a globally visible BGP event.
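For anyone who wants to try this, a minimal Junos sketch of the hold-time described above (the interface name here is just an example; adjust to your own port, and note the down timer is left at 0 so faults are still reported immediately):

```
interfaces {
    xe-0/0/0 {
        /* require 5000 ms of stable link before declaring the interface up;
           report link-down immediately */
        hold-time up 5000 down 0;
    }
}
```

The asymmetry is deliberate: you want to learn about a fault right away, but you don't want a marginal circuit bouncing routes back into service until it has proven itself stable.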

Some providers have a much more disruptive layer-1 infrastructure and will ask you to configure a 1s+ up timer.  I 
think there’s an interesting question here that could go either way: do you want transport-side faults to be exposed to 
you, or should the client interface in a system be held up so that fault condition isn’t forwarded (sometimes called 
FDI, Forward Defect Indication) to the client interface?

They may have had the system misconfigured, so you saw a fault on a protected path when there was a switchover.  If 
your A path is 25km and your B path is 5km and you have an optical switch, it can take some time for the transponder 
to re-tune because the path timing differs, and at the higher PHY rates it takes some extra time.

I know that Cisco also has these interface timers, but some of the others may not (e.g. I don’t know if Mikrotik has 
them, but cue the wiki in a reply).
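On the Cisco side, the rough equivalent (on IOS XR at least; the interface name below is just an example, and values are in milliseconds, so check your platform's syntax before copying):

```
interface TenGigE0/0/0/0
 ! wait 5000 ms of stable carrier before declaring up; report down immediately
 carrier-delay up 5000 down 0
```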

If it’s stable for 48 hours, I would place it back into service, but you should escalate at the same time and determine 
if they were truly hands-off.  It may be that a fiber was bent and is now fixed, and that actually was the root cause.

Hope this helps you and a few others.

- jared
