nanog mailing list archives

Europe-to-US congestion and packet loss on he.net network, and their NOC@ won't even respond


From: "Constantine A. Murenin" <mureninc () gmail com>
Date: Fri, 29 Nov 2013 23:30:39 -0800

Dear NANOG@,

I'm not exactly sure how else I can get he.net's attention, because I've been experiencing congestion issues between my dedi and Indiana for a couple of months now, all due to he.net's poor transit, as it turns out. The issue was complicated by the fact that the routes are asymmetric, and it appears as if the traffic loss is going on somewhere where there is none at all.

I will just provide the data, and people can make their own conclusions, any insights are welcome.

During all of this, since some late September 2013, all 4 networks involved have been contacted -- hetzner, init7, he.net, indiana; all except for he.net have responded and did troubleshooting.

After pressing the lack of any kind of response from he.net, all they did was ask for a customer number, and that was back in September. I have not heard from their NOC@ ever since, with requests left unanswered, sans the "we have received your request" autoreply.

Interestingly enough, only some of their Europe-to-US routes are blatantly congested and have very obvious packet loss (often making ssh unusable), whereas others appear to be doing just fine (at least, not losing packets and not experiencing jitter, and the increased latency). E.g. IPv6 routes don't appear affected, for example. IPv4 addresses in North America that are announced directly from AS6939 (e.g. Linode in Fremont) don't appear affected, either. But the multi-homed indiana.edu and wiscnet.net are affected. The single-homed ntp1.yycix.ca is affected, too. Probably other customers are affected as well.

Where's the end to this?

Or is the ongoing 0.5+% traffic loss, and the 140+ms avg latency on a 114ms route, with random spikes and jitter in certain hours of the day (generally around midnight ET), every day for several weeks or even months, an acceptable practice?



From hetzner.de through he.net:


Cns# date ; mtr --report{,-wide,-cycles=600} --interval 0.1 --order "SRL BGAWV" -4 ????c????????.indiana.edu ; date
Fri Nov 29 21:06:17 PST 2013
HOST: Cns??????? Snt Rcv Loss% Best Gmean Avg Wrst StDev 1.|-- static.??.???.4.46.clients.your-server.de 600 600 0.0% 0.5 1.0 1.3 4.9 1.1 2.|-- hos-tr1.juniper1.rz13.hetzner.de 600 600 0.0% 0.1 0.2 1.9 66.0 7.6 3.|-- core21.hetzner.de 600 600 0.0% 0.2 0.2 0.2 5.8 0.4 4.|-- core22.hetzner.de 600 600 0.0% 0.2 0.2 0.2 19.4 1.2 5.|-- core1.hetzner.de 600 600 0.0% 4.8 4.8 4.8 13.2 0.7 6.|-- juniper1.ffm.hetzner.de 600 600 0.0% 4.8 4.8 4.8 27.4 1.4 7.|-- 30gigabitethernet1-3.core1.ams1.he.net 600 600 0.0% 11.2 14.0 14.6 48.7 4.5 8.|-- 10gigabitethernet1-4.core1.lon1.he.net 600 600 0.0% 18.2 19.6 19.9 53.9 4.1 9.|-- 10gigabitethernet10-4.core1.nyc4.he.net 600 599 0.2% 87.0 116.1 116.7 145.7 12.4 10.|-- 100gigabitethernet7-2.core1.chi1.he.net 600 597 0.5% 106.6 135.4 136.1 192.0 13.3 11.|-- ??? 600 0 100.0 0.0 0.0 0.0 0.0 0.0 12.|-- et-11-0-0.945.rtr.ictc.indiana.gigapop.net 600 594 1.0% 113.3 139.3 139.7 166.1 11.4 13.|-- xe-0-3-0.11.br2.ictc.net.uits.iu.edu 600 596 0.7% 113.2 139.8 140.3 177.3 12.0 14.|-- ae-0.0.br2.bldc.net.uits.iu.edu 600 595 0.8% 114.2 140.1 140.6 183.2 11.8 15.|-- ae-10.0.cr3.bldc.net.uits.iu.edu 600 597 0.5% 114.3 140.3 140.8 165.0 11.5 16.|-- ????c????????.indiana.edu 600 597 0.5% 114.7 140.7 141.1 161.6 11.4
Fri Nov 29 21:08:52 PST 2013


Cns# unbuffer hping --icmp-ts ????c????????.indiana.edu | \
perl -ne 'if (/icmp_seq=(\d+) rtt=(\d+\.\d)/) {($s, $p) = ($1, $2);} \
if (/ate=(\d+) Receive=(\d+) Transmit=(\d+)/) {($o, $r, $t) = ($1, $2, $3);} \
if (/tsrtt=(\d+)/) { \
print $s, "\t", $p, "\t", $1, " = ", $r - $o, " + ", $o + $1 - $t, "\n"; }'
0       143.5   144 = 87 + 57
1       125.5   126 = 69 + 57
2       143.6   144 = 87 + 57
3       157.9   158 = 102 + 56
4       122.0   122 = 66 + 56
5       141.6   142 = 85 + 57
6       132.2   133 = 76 + 57
7       146.2   146 = 89 + 57
8       145.1   145 = 88 + 57
9       119.9   119 = 63 + 56
10      132.7   132 = 75 + 57
11      140.1   140 = 83 + 57
12      151.0   151 = 94 + 57
13      152.6   152 = 96 + 56
14      129.1   129 = 72 + 57
15      128.5   128 = 71 + 57
^C



Single-homed at he.net:


Cns# date ; mtr --report{,-cycles=600} --interval 0.1 --order "SRL BGAWV" -4 ntp1.yycix.ca ; date
Fri Nov 29 21:16:14 PST 2013
HOST: Cns??????? Snt Rcv Loss% Best Gmean Avg Wrst StDev 1.|-- static.??.???.4.46.client 600 600 0.0% 0.5 1.0 1.3 10.2 1.2 2.|-- hos-tr4.juniper2.rz13.het 600 600 0.0% 0.1 0.2 2.0 153.9 9.8 3.|-- core22.hetzner.de 600 600 0.0% 0.2 0.2 0.2 10.6 0.6 4.|-- core1.hetzner.de 600 600 0.0% 4.8 4.8 4.8 16.4 0.9 5.|-- juniper1.ffm.hetzner.de 600 600 0.0% 4.8 4.8 4.8 36.4 1.5 6.|-- 30gigabitethernet1-3.core 600 600 0.0% 11.2 13.5 14.0 36.6 4.3 7.|-- 10gigabitethernet1-4.core 600 600 0.0% 18.0 21.5 21.8 43.1 4.0 8.|-- 10gigabitethernet10-4.cor 600 597 0.5% 93.2 128.0 128.3 157.5 8.9 9.|-- 10gigabitethernet1-2.core 600 596 0.7% 103.1 139.4 139.6 157.5 8.2 10.|-- 10gigabitethernet3-1.core 600 597 0.5% 128.2 164.9 165.1 181.9 8.2 11.|-- 10gigabitethernet1-1.core 600 593 1.2% 138.7 175.9 176.1 192.6 7.8 12.|-- sebo-systems-inc.gigabite 600 597 0.5% 139.0 176.4 176.5 187.5 6.9 13.|-- ??? 600 0 100.0 0.0 0.0 0.0 0.0 0.0 14.|-- ntp1.yycix.ca 600 597 0.5% 141.0 176.9 177.0 186.9 6.9
Fri Nov 29 21:18:32 PST 2013
Cns# traceroute -A ntp1.yycix.ca
traceroute to ntp1.yycix.ca (192.75.191.6), 64 hops max, 40 byte packets
1 static.??.???.4.46.clients.your-server.de (46.4.???.??) [AS24940] 0.664 ms 0.648 ms 0.453 ms 2 hos-tr1.juniper1.rz13.hetzner.de (213.239.224.1) [AS24940] 23.985 ms hos-tr2.juniper1.rz13.hetzner.de (213.239.224.33) [AS24940] 0.234 ms hos-tr3.juniper2.rz13.hetzner.de (213.239.224.65) [AS24940] 0.238 ms 3 core22.hetzner.de (213.239.245.121) [AS24940] 0.238 ms core21.hetzner.de (213.239.245.81) [AS24940] 0.234 ms 0.236 ms 4 core1.hetzner.de (213.239.245.177) [AS24940] 4.811 ms 4.809 ms core22.hetzner.de (213.239.245.162) [AS24940] 0.248 ms 5 core1.hetzner.de (213.239.245.177) [AS24940] 4.831 ms juniper1.ffm.hetzner.de (213.239.245.5) [AS24940] 4.842 ms 4.826 ms 6 juniper1.ffm.hetzner.de (213.239.245.5) [AS24940] 4.857 ms 4.864 ms 30gigabitethernet1-3.core1.ams1.he.net (195.69.145.150) [AS1200] 11.233 ms 7 10gigabitethernet1-4.core1.lon1.he.net (72.52.92.81) [AS6939, AS6939] 19.869 ms 30gigabitethernet1-3.core1.ams1.he.net (195.69.145.150) [AS1200] 18.420 ms 11.255 ms 8 10gigabitethernet10-4.core1.nyc4.he.net (72.52.92.241) [AS6939, AS6939] 115.845 ms 101.875 ms 10gigabitethernet1-4.core1.lon1.he.net (72.52.92.81) [AS6939, AS6939] 17.249 ms 9 10gigabitethernet10-4.core1.nyc4.he.net (72.52.92.241) [AS6939, AS6939] 138.302 ms 10gigabitethernet1-2.core1.tor1.he.net (184.105.222.18) [AS6939] 120.449 ms 139.730 ms 10 10gigabitethernet1-2.core1.tor1.he.net (184.105.222.18) [AS6939] 134.755 ms 104.661 ms 10gigabitethernet3-1.core1.ywg1.he.net (184.105.223.221) [AS6939] 167.282 ms 11 10gigabitethernet1-1.core1.yyc1.he.net (184.105.223.214) [AS6939] 139.310 ms 10gigabitethernet3-1.core1.ywg1.he.net (184.105.223.221) [AS6939] 155.983 ms 155.910 ms 12 sebo-systems-inc.gigabitethernet2-23.core1.yyc1.he.net (216.218.214.250) [AS6939] 138.703 ms 178.530 ms 10gigabitethernet1-1.core1.yyc1.he.net (184.105.223.214) [AS6939] 172.423 ms 13 sebo-systems-inc.gigabitethernet2-23.core1.yyc1.he.net (216.218.214.250) [AS6939] 158 ms * *
14  * * ntp1.yycix.ca (192.75.191.6) [AS53339]  181.433 ms
Cns#
Cns#
Cns#
Cns# unbuffer hping --icmp-ts ntp1.yycix.ca | perl -ne \
'if (/icmp_seq=(\d+) rtt=(\d+\.\d)/) {($s, $p) = ($1, $2);} \
if (/ate=(\d+) Receive=(\d+) Transmit=(\d+)/) {($o, $r, $t) = ($1, $2, $3);} \
if (/tsrtt=(\d+)/) { \
print $s, "\t", $p, "\t", $1, " = ", $r - $o, " + ", $o + $1 - $t, "\n"; }'
0       165.0   165 = 95 + 70
1       156.2   156 = 86 + 70
2       178.9   179 = 109 + 70
3       181.0   181 = 111 + 70
4       178.3   179 = 108 + 71
5       163.8   164 = 94 + 70
6       175.7   176 = 106 + 70
7       173.9   174 = 104 + 70
8       172.6   173 = 103 + 70
9       163.5   164 = 94 + 70
10      181.8   182 = 112 + 70
11      161.9   162 = 92 + 70
12      183.1   184 = 113 + 71
13      174.5   174 = 104 + 70
14      181.8   181 = 111 + 70
15      181.7   181 = 111 + 70
^C
Cns#




From indiana.edu to hetzner.de; notice that the mtr by itself gives a false impression of a traffic loss at init7, whereas in reality, it's the reverse path through he.net that's causing the loss, as hping confirms:


m: {5134} date ; sudo mtr --report{,-cycles=600} --interval 0.1 --order "SRL BGAWV" -4 ?????? ; date
Sat Nov 30 00:36:27 EST 2013
HOST: ????c????????.indiana.edu Snt Rcv Loss% Best Gmean Avg Wrst StDev 1.|-- 129.79.???.? 600 600 0.0% 0.4 0.7 0.9 24.7 1.5 2.|-- ae-13.0.br2.bldc.net.uits 600 600 0.0% 0.5 0.7 0.9 22.6 1.8 3.|-- ae-0.0.br2.ictc.net.uits. 600 600 0.0% 1.4 1.7 1.8 20.2 1.6 4.|-- xe-0-1-0.11.rtr.ictc.indi 600 600 0.0% 1.4 2.1 3.8 66.5 8.1 5.|-- 64.57.21.13 600 600 0.0% 6.0 7.2 8.4 72.9 8.0 6.|-- xe-2-2-0.0.ny0.tr-cps.int 600 600 0.0% 32.3 33.9 34.4 81.0 6.9 7.|-- paix-nyc.init7.net 600 600 0.0% 32.5 35.3 35.5 44.7 3.8 8.|-- r1lon1.core.init7.net 600 599 0.2% 100.1 104.7 104.9 146.5 7.5 9.|-- r1nue1.core.init7.net 600 599 0.2% 114.6 115.7 115.7 125.4 2.2 10.|-- gw-hetzner.init7.net 600 594 1.0% 112.4 141.3 142.4 241.9 18.2 11.|-- core12.hetzner.de 600 468 22.0% 112.2 142.7 144.0 203.4 20.3 12.|-- core21.hetzner.de 600 202 66.3% 114.4 143.7 145.0 204.1 20.1 13.|-- juniper1.rz13.hetzner.de 600 594 1.0% 114.7 141.4 142.1 212.2 14.3 14.|-- hos-tr2.ex3k11.rz13.hetzn 600 599 0.2% 113.8 123.9 125.5 218.2 21.8 15.|-- static.88-198-??-??.clien 599 592 1.2% 114.6 137.2 137.9 167.6 13.2
0.244u 1.766s 1:05.52 3.0%      0+0k 0+1io 0pf+0w
Sat Nov 30 00:37:32 EST 2013

m: {5137} sudo script -q /dev/null hping3 --icmp-ts 88.198.??.?? | perl -ne 'if (/icmp_seq=(\d+) rtt=(\d+\.\d)/) {($s, $p) = ($1, $2);} \ if (/ate=(\d+) Receive=(\d+) Transmit=(\d+)/) {($o, $r, $t) = ($1, $2, $3);} \
if (/tsrtt=(\d+)/) { \
print $s, "\t", $p, "\t", $1, " = ", $r - $o, " + ", $o + $1 - $t, "\r\n"; }'
0       131.3   131 = 57 + 74
1       122.4   122 = 56 + 66
2       122.6   123 = 56 + 67
3       127.6   128 = 57 + 71
4       146.5   147 = 57 + 90
5       139.8   140 = 56 + 84
6       131.0   131 = 57 + 74
7       134.6   135 = 57 + 78
8       137.7   138 = 57 + 81
9       148.1   148 = 57 + 91
10      141.2   142 = 57 + 85
11      146.4   146 = 56 + 90
12      153.6   154 = 57 + 97
13      149.4   150 = 57 + 93
14      120.2   121 = 57 + 64
15      120.6   120 = 56 + 64
16      130.7   131 = 57 + 74
17      126.4   126 = 56 + 70
18      117.9   118 = 57 + 61
19      116.9   117 = 57 + 60
20      119.8   119 = 56 + 63
21      132.0   132 = 56 + 76
22      134.2   134 = 56 + 78
23      138.8   139 = 57 + 82



Note the ICMP timestamp data from hping above. From this ICMP timestamping data, it is obvious that the congestion is only happening on one path -- the one over he.net, and init7 is in the clear.

Any further insights are welcome. But finding out about the ICMP timestamp feature has so far been the most useful thing in troubleshooting this issue; I'm surprised it's a rather unknown method to get to the bottom of these problems.

However, even after finding out about the cause and the party responsible, the problem is yet to be exhausted. Any help appreciated.

Best regards,
Constantine.


Current thread: