nanog mailing list archives

Re: cross connect reliability


From: Warren Kumari <warren () kumari net>
Date: Fri, 18 Sep 2009 13:55:25 -0400


On Sep 17, 2009, at 7:45 PM, Richard A Steenbergen wrote:

[ SNIP ]

Story 2. Had a customer report that they were getting extremely slow
transfers to another network, despite not being able to find any packet loss. Shifting the traffic to a different port to reach the same network resolved the problem. After removing the traffic and attempting to ping
the far side, I got the following:

<drop>
64 bytes from x.x.x.x: icmp_seq=1 ttl=61 time=0.194 ms
64 bytes from x.x.x.x: icmp_seq=2 ttl=61 time=0.196 ms
64 bytes from x.x.x.x: icmp_seq=3 ttl=61 time=0.183 ms
64 bytes from x.x.x.x: icmp_seq=0 ttl=61 time=4.159 ms
<drop>
64 bytes from x.x.x.x: icmp_seq=5 ttl=61 time=0.194 ms
64 bytes from x.x.x.x: icmp_seq=6 ttl=61 time=0.196 ms
64 bytes from x.x.x.x: icmp_seq=7 ttl=61 time=0.183 ms
64 bytes from x.x.x.x: icmp_seq=4 ttl=61 time=4.159 ms

After a little bit more testing, it turned out that every 4th packet
that was being sent to the peer's router was being queued until another
"4th packet" would come along and knock it out. If you increased the
interval time of the ping, you would see the amount of time the packet
spent in the queue increase. At one point I had it up to over 350
seconds (not milliseconds) that the packet stayed in the other router's
queue before that 4th packet came along and knocked it free. I suspect
it could have gone higher, but random scanning traffic on the internet
was coming in. When there was a lot of traffic on the interface you
would never see the packet loss, just reordering of every 4th packet and
thus slow TCP transfers. :)
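
A minimal sketch of how that pattern shows up programmatically (not from the original post; the host and counts are placeholders): parse ordinary ping output and flag any reply whose sequence number is lower than one already seen, i.e. a packet that was parked somewhere and released late.

#!/usr/bin/env python3
# Hypothetical helper: run ping, parse the replies, and report any packet
# that arrives after a later-numbered one (the "queued until the next
# Nth packet knocks it free" behaviour described above).
import re
import subprocess
import sys

REPLY = re.compile(r"icmp_seq=(\d+) ttl=\d+ time=([\d.]+) ms")

def check_reordering(host, count=20, interval=1.0):
    out = subprocess.run(
        ["ping", "-c", str(count), "-i", str(interval), host],
        capture_output=True, text=True, check=False,
    ).stdout
    highest = -1
    for m in REPLY.finditer(out):
        seq, rtt = int(m.group(1)), float(m.group(2))
        if seq < highest:
            print(f"seq {seq} arrived late (after seq {highest}), rtt {rtt} ms")
        highest = max(highest, seq)

if __name__ == "__main__":
    check_reordering(sys.argv[1] if len(sys.argv) > 1 else "192.0.2.1")

Increasing the interval argument reproduces the "packet sits in the queue longer" effect described above.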

Story 1:
-----------
I had a router where I was suddenly unable to reach certain hosts on the (/24) ethernet interface -- pinging from the router worked fine, transit traffic wouldn't. I decided to try to figure out if there was any sort of rhyme or reason to which hosts had gone unreachable. I could successfully reach
xxx.yyy.zzz.1
xxx.yyy.zzz.2
xxx.yyy.zzz.3
xxx.yyy.zzz.5
xxx.yyy.zzz.7
xxx.yyy.zzz.11
xxx.yyy.zzz.13
xxx.yyy.zzz.17
...
xxx.yyy.zzz.197
xxx.yyy.zzz.199

There were only 200 hosts on the LAN, but I'd bet dollars to donuts that I know what the next reachable one would have been if there had been more. Unfortunately the box rebooted itself (when I tried to view the FIB) before I could collect more info.
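
For anyone who doesn't see it immediately, the pattern in that list is .1 plus every prime final octet, which is presumably the punchline. A throwaway illustration (mine, not the poster's):

# Purely illustrative: the reachable final octets above are 1 plus the primes.
def is_prime(n):
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

reachable = [1] + [n for n in range(2, 255) if is_prime(n)]
print(reachable[:10])                        # [1, 2, 3, 5, 7, 11, 13, 17, 19, 23]
print(min(n for n in reachable if n > 199))  # 211 -- the next one, had the LAN been bigger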

Story 2:
----------
Had a small router connecting a remote office over a multilink PPP[1] interface (4xE1). Site starts getting massive packet-loss, so I figure one of the circuits has gone bad but didn't get removed from the bundle. I'm having a hard time reaching the remote side, so I pull the interfaces from protocols and try pinging the remote router -- no replies.... Luckily I didn't hit Ctrl-C on the ping, because suddenly I start getting replies with no drops:

64 bytes from x.x.x.x: icmp_seq=1 ttl=120 time=30132.148 ms
64 bytes from x.x.x.x: icmp_seq=2 ttl=120 time=30128.178 ms
64 bytes from x.x.x.x: icmp_seq=3 ttl=120 time=30133.231 ms
64 bytes from x.x.x.x: icmp_seq=4 ttl=120 time=30112.571 ms
64 bytes from x.x.x.x: icmp_seq=5 ttl=120 time=30132.632 ms


What?! I figure it's gotta be MLPPP stupidity and / or depref of ICMP, so I connect OOB and A: remove MLPPP and use just a single interface and B: start pinging a host behind the router instead...

64 bytes from x.x.x.x: icmp_seq=1 ttl=120 time=30142.323 ms
64 bytes from x.x.x.x: icmp_seq=2 ttl=120 time=30144.571 ms
64 bytes from x.x.x.x: icmp_seq=3 ttl=120 time=30141.632 ms
64 bytes from x.x.x.x: icmp_seq=4 ttl=120 time=30142.420 ms
64 bytes from x.x.x.x: icmp_seq=5 ttl=120 time=30159.706 ms

I fire up tcpdump and try ssh to a host on the remote side -- I see the SYN leave my machine and then, 30 *seconds* later, I get back a SYN-ACK. I change the queuing on the interface from FIFO to something else and the problem goes away. I change the queuing back to FIFO and it's 30 second RTT again. Somehow it seems to be buffering as much traffic as it can (and anything more than one copy of ping running, or ping with anything larger than the default packet size, makes it start dropping badly). I ran "show buffers" to try to get more of an idea what was happening, but it didn't like that and reloaded. Came back up fine though...
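
Back-of-the-envelope arithmetic (not in the original, and it assumes the FIFO was actually draining at line rate, which is itself questionable given how little traffic was on the link): a queue that imposes a 30-second delay at E1 speed has to be holding several megabytes.

# Rough numbers only: how much data a FIFO must hold to add ~30 s of delay.
E1_BPS = 2_048_000            # E1 line rate, bits per second
DELAY_S = 30                  # delay observed in the ping output above

queued_bytes = E1_BPS * DELAY_S / 8
print(f"{queued_bytes / 1e6:.1f} MB per E1")           # ~7.7 MB
print(f"{4 * queued_bytes / 1e6:.1f} MB across 4xE1")  # ~30.7 MB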

Story 3:
----------

Running a network that had a large number of L3 switches from a vendor (let's call them "X") in a single OSPF area. This area also contained a large number of poor quality international circuits that would flap often, so there was *lots* of churn. Apparently this vendor X's OSPF implementation didn't much like this and so would become unhappy. The way it would express its displeasure was by corrupting a pointer to / in the LSDB so it was off-by-one and you'd get:

Nov 24 22:23:53.633 EST: %OSPF-4-BADLSAMASK: Bad LSA mask: Type 5, LSID 0.9.32.5
Mask 10.160.8.0 from 10.178.255.252
NOTE: This route will not be installed in the routing table.
Nov 26 11:01:32.997 EST: %OSPF-4-BADLSAMASK: Bad LSA mask: Type 5, LSID 0.4.2.3
Mask 10.2.153.0 from 10.178.255.252
NOTE: This route will not be installed in the routing table.
Nov 27 23:14:00.660 EST: %OSPF-4-BADLSAMASK: Bad LSA mask: Type 5, LSID 0.4.2.3
Mask 10.2.153.0 from 10.178.255.252
NOTE: This route will not be installed in the routing table.
(This network was addressed out of 10/8 - 10.178.255.252 is one of vendor X's boxes and 10.160.8.0 is a valid subnet, but, surprisingly enough, not a valid mask..... ). To make matters even more fun, the OSPF adjacency would go down and then come back up -- and the grumpy box would flood all of its (corrupt) LSAs...
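
One way to spot the grumpy box from the logs (a sketch, not something from the original post; it assumes the two-line "Mask x.x.x.x from y.y.y.y" format shown above): tally the BADLSAMASK continuation lines by the advertising router.

# Illustrative only: count corrupt-LSA log lines per advertising router.
# Usage: python3 badlsa_count.py < syslog.txt
import re
import sys
from collections import Counter

MASK_LINE = re.compile(r"Mask\s+\S+\s+from\s+(\d+\.\d+\.\d+\.\d+)")

counts = Counter()
for line in sys.stdin:
    m = MASK_LINE.search(line)
    if m:
        counts[m.group(1)] += 1

for router, n in counts.most_common():
    print(f"{router}: {n} corrupt LSAs")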

W

[1]: Hey, not my idea...


--
Richard A Steenbergen <ras () e-gerbil net>       http://www.e-gerbil.net/ras
GPG Key ID: 0xF8B12CBC (7535 7F59 8204 ED1F CC1C 53AF 4C41 5ECA F8B1 2CBC)


--
"Real children don't go hoppity-skip unless they are on drugs."

-- Susan, the ultimate sensible governess (Terry Pratchett, Hogfather)





