nanog mailing list archives

Re: Europe-to-US congestion and packet loss on he.net network, and their NOC@ won't even respond


From: Fredy Kuenzler <kuenzler () init7 net>
Date: Sat, 30 Nov 2013 16:10:13 +0100

Constantine,

Please mail me offlist if Init7 can be of any help to resolve the case.

--
Fredy Kuenzler
Init7 (Switzerland) Ltd.
St.-Georgen-Strasse 70
CH-8400 Winterthur
Switzerland

http://www.init7.net/


Am 30.11.2013 um 08:30 schrieb "Constantine A. Murenin" <mureninc () gmail com>:

Dear NANOG@,

I'm not exactly sure how else I can get he.net's attention, because I've been experiencing congestion issues between 
my dedi and Indiana for a couple of months now, all due to he.net's poor transit, as it turns out.  The issue was 
complicated by the fact that the routes are asymmetric, and it appears as if the traffic loss is going on somewhere 
where there is none at all.

I will just provide the data, and people can make their own conclusions, any insights are welcome.

During all of this, since some late September 2013, all 4 networks involved have been contacted -- hetzner, init7, 
he.net, indiana; all except for he.net have responded and did troubleshooting.

After pressing the lack of any kind of response from he.net, all they did was ask for a customer number, and that was 
back in September.  I have not heard from their NOC@ ever since, with requests left unanswered, sans the "we have 
received your request" autoreply.

Interestingly enough, only some of their Europe-to-US routes are blatantly congested and have very obvious packet 
loss (often making ssh unusable), whereas others appear to be doing just fine (at least, not losing packets and not 
experiencing jitter, and the increased latency).  E.g. IPv6 routes don't appear affected, for example.  IPv4 
addresses in North America that are announced directly from AS6939 (e.g. Linode in Fremont) don't appear affected, 
either.  But the multi-homed indiana.edu and wiscnet.net are affected.  The single-homed ntp1.yycix.ca is affected, 
too.  Probably other customers are affected as well.

Where's the end to this?

Or is the ongoing 0.5+% traffic loss, and the 140+ms avg latency on a 114ms route, with random spikes and jitter in 
certain hours of the day (generally around midnight ET), every day for several weeks or even months, an acceptable 
practice?



From hetzner.de through he.net:


Cns# date ; mtr --report{,-wide,-cycles=600} --interval 0.1 --order "SRL BGAWV" -4 ????c????????.indiana.edu ; date
Fri Nov 29 21:06:17 PST 2013
HOST: Cns???????                                   Snt   Rcv Loss% Best Gmean   Avg  Wrst StDev
 1.|-- static.??.???.4.46.clients.your-server.de    600   600  0.0%  0.5   1.0   1.3   4.9   1.1
 2.|-- hos-tr1.juniper1.rz13.hetzner.de             600   600  0.0%  0.1   0.2   1.9  66.0   7.6
 3.|-- core21.hetzner.de                            600   600  0.0%  0.2   0.2   0.2   5.8   0.4
 4.|-- core22.hetzner.de                            600   600  0.0%  0.2   0.2   0.2  19.4   1.2
 5.|-- core1.hetzner.de                             600   600  0.0%  4.8   4.8   4.8  13.2   0.7
 6.|-- juniper1.ffm.hetzner.de                      600   600  0.0%  4.8   4.8   4.8  27.4   1.4
 7.|-- 30gigabitethernet1-3.core1.ams1.he.net       600   600  0.0% 11.2  14.0  14.6  48.7   4.5
 8.|-- 10gigabitethernet1-4.core1.lon1.he.net       600   600  0.0% 18.2  19.6  19.9  53.9   4.1
 9.|-- 10gigabitethernet10-4.core1.nyc4.he.net      600   599  0.2% 87.0 116.1 116.7 145.7  12.4
10.|-- 100gigabitethernet7-2.core1.chi1.he.net      600   597  0.5% 106.6 135.4 136.1 192.0  13.3
11.|-- ???                                          600     0 100.0  0.0   0.0   0.0   0.0   0.0
12.|-- et-11-0-0.945.rtr.ictc.indiana.gigapop.net   600   594  1.0% 113.3 139.3 139.7 166.1  11.4
13.|-- xe-0-3-0.11.br2.ictc.net.uits.iu.edu         600   596  0.7% 113.2 139.8 140.3 177.3  12.0
14.|-- ae-0.0.br2.bldc.net.uits.iu.edu              600   595  0.8% 114.2 140.1 140.6 183.2  11.8
15.|-- ae-10.0.cr3.bldc.net.uits.iu.edu             600   597  0.5% 114.3 140.3 140.8 165.0  11.5
16.|-- ????c????????.indiana.edu                    600   597  0.5% 114.7 140.7 141.1 161.6  11.4
Fri Nov 29 21:08:52 PST 2013


Cns# unbuffer hping --icmp-ts ????c????????.indiana.edu | \
perl -ne 'if (/icmp_seq=(\d+) rtt=(\d+\.\d)/) {($s, $p) = ($1, $2);} \
if (/ate=(\d+) Receive=(\d+) Transmit=(\d+)/) {($o, $r, $t) = ($1, $2, $3);} \
if (/tsrtt=(\d+)/) { \
print $s, "\t", $p, "\t", $1, " = ", $r - $o, " + ", $o + $1 - $t, "\n"; }'
0       143.5   144 = 87 + 57
1       125.5   126 = 69 + 57
2       143.6   144 = 87 + 57
3       157.9   158 = 102 + 56
4       122.0   122 = 66 + 56
5       141.6   142 = 85 + 57
6       132.2   133 = 76 + 57
7       146.2   146 = 89 + 57
8       145.1   145 = 88 + 57
9       119.9   119 = 63 + 56
10      132.7   132 = 75 + 57
11      140.1   140 = 83 + 57
12      151.0   151 = 94 + 57
13      152.6   152 = 96 + 56
14      129.1   129 = 72 + 57
15      128.5   128 = 71 + 57
^C



Single-homed at he.net:


Cns# date ; mtr --report{,-cycles=600} --interval 0.1 --order "SRL BGAWV" -4 ntp1.yycix.ca ; date
Fri Nov 29 21:16:14 PST 2013
HOST: Cns???????                    Snt   Rcv Loss%   Best Gmean   Avg Wrst StDev
 1.|-- static.??.???.4.46.client   600   600  0.0%    0.5   1.0   1.3  10.2   1.2
 2.|-- hos-tr4.juniper2.rz13.het   600   600  0.0%    0.1   0.2   2.0 153.9   9.8
 3.|-- core22.hetzner.de           600   600  0.0%    0.2   0.2   0.2  10.6   0.6
 4.|-- core1.hetzner.de            600   600  0.0%    4.8   4.8   4.8  16.4   0.9
 5.|-- juniper1.ffm.hetzner.de     600   600  0.0%    4.8   4.8   4.8  36.4   1.5
 6.|-- 30gigabitethernet1-3.core   600   600  0.0%   11.2  13.5  14.0  36.6   4.3
 7.|-- 10gigabitethernet1-4.core   600   600  0.0%   18.0  21.5  21.8  43.1   4.0
 8.|-- 10gigabitethernet10-4.cor   600   597  0.5%   93.2 128.0 128.3 157.5   8.9
 9.|-- 10gigabitethernet1-2.core   600   596  0.7%  103.1 139.4 139.6 157.5   8.2
10.|-- 10gigabitethernet3-1.core   600   597  0.5%  128.2 164.9 165.1 181.9   8.2
11.|-- 10gigabitethernet1-1.core   600   593  1.2%  138.7 175.9 176.1 192.6   7.8
12.|-- sebo-systems-inc.gigabite   600   597  0.5%  139.0 176.4 176.5 187.5   6.9
13.|-- ???                         600     0 100.0    0.0   0.0   0.0   0.0   0.0
14.|-- ntp1.yycix.ca               600   597  0.5%  141.0 176.9 177.0 186.9   6.9
Fri Nov 29 21:18:32 PST 2013
Cns# traceroute -A ntp1.yycix.ca
traceroute to ntp1.yycix.ca (192.75.191.6), 64 hops max, 40 byte packets
1  static.??.???.4.46.clients.your-server.de (46.4.???.??) [AS24940] 0.664 ms  0.648 ms  0.453 ms
2  hos-tr1.juniper1.rz13.hetzner.de (213.239.224.1) [AS24940]  23.985 ms hos-tr2.juniper1.rz13.hetzner.de 
(213.239.224.33) [AS24940]  0.234 ms hos-tr3.juniper2.rz13.hetzner.de (213.239.224.65) [AS24940]  0.238 ms
3  core22.hetzner.de (213.239.245.121) [AS24940]  0.238 ms core21.hetzner.de (213.239.245.81) [AS24940]  0.234 ms  
0.236 ms
4  core1.hetzner.de (213.239.245.177) [AS24940]  4.811 ms  4.809 ms core22.hetzner.de (213.239.245.162) [AS24940]  
0.248 ms
5  core1.hetzner.de (213.239.245.177) [AS24940]  4.831 ms juniper1.ffm.hetzner.de (213.239.245.5) [AS24940]  4.842 ms 
 4.826 ms
6  juniper1.ffm.hetzner.de (213.239.245.5) [AS24940]  4.857 ms  4.864 ms 30gigabitethernet1-3.core1.ams1.he.net 
(195.69.145.150) [AS1200] 11.233 ms
7  10gigabitethernet1-4.core1.lon1.he.net (72.52.92.81) [AS6939, AS6939]  19.869 ms 
30gigabitethernet1-3.core1.ams1.he.net (195.69.145.150) [AS1200]  18.420 ms  11.255 ms
8  10gigabitethernet10-4.core1.nyc4.he.net (72.52.92.241) [AS6939, AS6939]  115.845 ms  101.875 ms 
10gigabitethernet1-4.core1.lon1.he.net (72.52.92.81) [AS6939, AS6939]  17.249 ms
9  10gigabitethernet10-4.core1.nyc4.he.net (72.52.92.241) [AS6939, AS6939]  138.302 ms 
10gigabitethernet1-2.core1.tor1.he.net (184.105.222.18) [AS6939]  120.449 ms  139.730 ms
10  10gigabitethernet1-2.core1.tor1.he.net (184.105.222.18) [AS6939] 134.755 ms  104.661 ms 
10gigabitethernet3-1.core1.ywg1.he.net (184.105.223.221) [AS6939]  167.282 ms
11  10gigabitethernet1-1.core1.yyc1.he.net (184.105.223.214) [AS6939] 139.310 ms 
10gigabitethernet3-1.core1.ywg1.he.net (184.105.223.221) [AS6939]  155.983 ms  155.910 ms
12  sebo-systems-inc.gigabitethernet2-23.core1.yyc1.he.net (216.218.214.250) [AS6939]  138.703 ms  178.530 ms 
10gigabitethernet1-1.core1.yyc1.he.net (184.105.223.214) [AS6939] 172.423 ms
13  sebo-systems-inc.gigabitethernet2-23.core1.yyc1.he.net (216.218.214.250) [AS6939]  158 ms * *
14  * * ntp1.yycix.ca (192.75.191.6) [AS53339]  181.433 ms
Cns#
Cns#
Cns#
Cns# unbuffer hping --icmp-ts ntp1.yycix.ca | perl -ne \
'if (/icmp_seq=(\d+) rtt=(\d+\.\d)/) {($s, $p) = ($1, $2);} \
if (/ate=(\d+) Receive=(\d+) Transmit=(\d+)/) {($o, $r, $t) = ($1, $2, $3);} \
if (/tsrtt=(\d+)/) { \
print $s, "\t", $p, "\t", $1, " = ", $r - $o, " + ", $o + $1 - $t, "\n"; }'
0       165.0   165 = 95 + 70
1       156.2   156 = 86 + 70
2       178.9   179 = 109 + 70
3       181.0   181 = 111 + 70
4       178.3   179 = 108 + 71
5       163.8   164 = 94 + 70
6       175.7   176 = 106 + 70
7       173.9   174 = 104 + 70
8       172.6   173 = 103 + 70
9       163.5   164 = 94 + 70
10      181.8   182 = 112 + 70
11      161.9   162 = 92 + 70
12      183.1   184 = 113 + 71
13      174.5   174 = 104 + 70
14      181.8   181 = 111 + 70
15      181.7   181 = 111 + 70
^C
Cns#




From indiana.edu to hetzner.de; notice that the mtr by itself gives a false impression of a traffic loss at init7, 
whereas in reality, it's the reverse path through he.net that's causing the loss, as hping confirms:


m: {5134} date ; sudo mtr --report{,-cycles=600} --interval 0.1 --order "SRL BGAWV" -4 ?????? ; date
Sat Nov 30 00:36:27 EST 2013
HOST: ????c????????.indiana.edu     Snt   Rcv Loss%   Best Gmean   Avg Wrst StDev
 1.|-- 129.79.???.?                600   600  0.0%    0.4   0.7   0.9  24.7   1.5
 2.|-- ae-13.0.br2.bldc.net.uits   600   600  0.0%    0.5   0.7   0.9  22.6   1.8
 3.|-- ae-0.0.br2.ictc.net.uits.   600   600  0.0%    1.4   1.7   1.8  20.2   1.6
 4.|-- xe-0-1-0.11.rtr.ictc.indi   600   600  0.0%    1.4   2.1   3.8  66.5   8.1
 5.|-- 64.57.21.13                 600   600  0.0%    6.0   7.2   8.4  72.9   8.0
 6.|-- xe-2-2-0.0.ny0.tr-cps.int   600   600  0.0%   32.3  33.9  34.4  81.0   6.9
 7.|-- paix-nyc.init7.net          600   600  0.0%   32.5  35.3  35.5  44.7   3.8
 8.|-- r1lon1.core.init7.net       600   599  0.2%  100.1 104.7 104.9 146.5   7.5
 9.|-- r1nue1.core.init7.net       600   599  0.2%  114.6 115.7 115.7 125.4   2.2
10.|-- gw-hetzner.init7.net        600   594  1.0%  112.4 141.3 142.4 241.9  18.2
11.|-- core12.hetzner.de           600   468 22.0%  112.2 142.7 144.0 203.4  20.3
12.|-- core21.hetzner.de           600   202 66.3%  114.4 143.7 145.0 204.1  20.1
13.|-- juniper1.rz13.hetzner.de    600   594  1.0%  114.7 141.4 142.1 212.2  14.3
14.|-- hos-tr2.ex3k11.rz13.hetzn   600   599  0.2%  113.8 123.9 125.5 218.2  21.8
15.|-- static.88-198-??-??.clien   599   592  1.2%  114.6 137.2 137.9 167.6  13.2
0.244u 1.766s 1:05.52 3.0%      0+0k 0+1io 0pf+0w
Sat Nov 30 00:37:32 EST 2013

m: {5137} sudo script -q /dev/null hping3 --icmp-ts 88.198.??.?? | perl -ne 'if (/icmp_seq=(\d+) rtt=(\d+\.\d)/) 
{($s, $p) = ($1, $2);} \
if (/ate=(\d+) Receive=(\d+) Transmit=(\d+)/) {($o, $r, $t) = ($1, $2, $3);} \
if (/tsrtt=(\d+)/) { \
print $s, "\t", $p, "\t", $1, " = ", $r - $o, " + ", $o + $1 - $t, "\r\n"; }'
0       131.3   131 = 57 + 74
1       122.4   122 = 56 + 66
2       122.6   123 = 56 + 67
3       127.6   128 = 57 + 71
4       146.5   147 = 57 + 90
5       139.8   140 = 56 + 84
6       131.0   131 = 57 + 74
7       134.6   135 = 57 + 78
8       137.7   138 = 57 + 81
9       148.1   148 = 57 + 91
10      141.2   142 = 57 + 85
11      146.4   146 = 56 + 90
12      153.6   154 = 57 + 97
13      149.4   150 = 57 + 93
14      120.2   121 = 57 + 64
15      120.6   120 = 56 + 64
16      130.7   131 = 57 + 74
17      126.4   126 = 56 + 70
18      117.9   118 = 57 + 61
19      116.9   117 = 57 + 60
20      119.8   119 = 56 + 63
21      132.0   132 = 56 + 76
22      134.2   134 = 56 + 78
23      138.8   139 = 57 + 82



Note the ICMP timestamp data from hping above.  From this ICMP timestamping data, it is obvious that the congestion 
is only happening on one path -- the one over he.net, and init7 is in the clear.

Any further insights are welcome.  But finding out about the ICMP timestamp feature has so far been the most useful 
thing in troubleshooting this issue; I'm surprised it's a rather unknown method to get to the bottom of these 
problems.

However, even after finding out about the cause and the party responsible, the problem is yet to be exhausted.  Any 
help appreciated.

Best regards,
Constantine.



Current thread: