nanog mailing list archives
TCP and anycast (was Re: ECN)
From: Anoop Ghanwani <anoop () alumni duke edu>
Date: Wed, 13 Nov 2019 22:39:24 -0800
RFC 7094 (https://tools.ietf.org/html/rfc7094) describes the pitfalls & risks of using TCP with an anycast address. It recognizes that there are valid use cases for it, though. Specifically, section 3.1 says this:
Most stateful transport protocols (e.g., TCP), without modification, do not understand the properties of anycast; hence, they will fail probabilistically, but possibly catastrophically, when using anycast addresses in the presence of "normal" routing dynamics. ... This can lead to a protocol working fine in, say, a test lab but not in the global Internet.
On Wed, Nov 13, 2019 at 3:33 PM Warren Kumari <warren () kumari net> wrote:
On Thu, Nov 14, 2019 at 12:25 AM Matt Corallo <nanog () as397444 net> wrote:

This sounds like a bug on Cloudflare’s end (cause trying to do anycast TCP is... out of spec to say the least), not a bug in ECN/ECMP.

Errrrrr. I really don't think that there is any sort of spec that covers that :-P

Using Anycast for TCP is incredibly common - the DNS root servers for one obvious example. More TCP centric well-known examples are Fastly and LinkedIn - LinkedIn in particular did a really good podcast on their experience with this. There is also a good NANOG talk from the ~2000s (?) on people using TCP anycast for long lived (serving ISO files, which were long-lived in those days) flows, and how reliable it is - perhaps that's the talk Todd mentioned?

W

On Nov 13, 2019, at 11:07, Toke Høiland-Jørgensen via NANOG <nanog () nanog org> wrote:

Hello

I have a customer that believes my network has an ECN problem. We do not, we just move packets. But how do I prove it? Is there a tool that checks for ECN trouble? Ideally something I could run on the NLNOG Ring network. I believe it likely that it is the destination that has the problem.

Hi Baldur

I believe I may be that customer :)

First of all, thank you for looking into the issue! We've been having great fun over on the ecn-sane mailing list trying to figure out what's going on. I'll summarise below, but see this thread for the discussion and debugging details:
https://lists.bufferbloat.net/pipermail/ecn-sane/2019-November/000527.html

The short version is that the problem appears to come from a combination of the ECMP routing in your network, and Cloudflare's heavy use of anycast. Specifically, a router in your network appears to be doing ECMP by hashing on the packet header, *including the ECN bits*. This breaks TCP connections with ECN because the TCP SYN (with no ECN bits set) ends up taking a different path than the rest of the flow (which is marked as ECT(0)).
When the destination is anycasted, this means that the data packets go to a different server than the SYN did. This second server doesn't recognise the connection, and so replies with a TCP RST. To fix this, simply exclude the ECN bits (or the whole TOS byte) from your router's ECMP hash.

For a longer exposition, see below. You should be able to verify this from somewhere else in the network, but if there's anything else you want me to test, do let me know.

Also, would you mind sharing the router make and model that does this? We're trying to collect real-world examples of network problems caused by ECN and this is definitely an interesting example.

-Toke

The long version:

From my end I can see that I have two paths to Cloudflare; which is taken appears to be based on a hash of the packet header, as can be seen by varying the source port:

$ traceroute -q 1 --sport=10000 104.24.125.13
traceroute to 104.24.125.13 (104.24.125.13), 30 hops max, 60 byte packets
 1  _gateway (10.42.3.1)  0.357 ms
 2  albertslund-edge1-lo.net.gigabit.dk (185.24.171.254)  4.707 ms
 3  customer-185-24-168-46.ip4.gigabit.dk (185.24.168.46)  1.283 ms
 4  te0-1-1-5.rcr21.cph01.atlas.cogentco.com (149.6.137.49)  1.667 ms
 5  netnod-ix-cph-blue-9000.cloudflare.com (212.237.192.246)  1.406 ms
 6  104.24.125.13 (104.24.125.13)  1.322 ms

$ traceroute -q 1 --sport=10001 104.24.125.13
traceroute to 104.24.125.13 (104.24.125.13), 30 hops max, 60 byte packets
 1  _gateway (10.42.3.1)  0.293 ms
 2  albertslund-edge1-lo.net.gigabit.dk (185.24.171.254)  3.430 ms
 3  customer-185-24-168-38.ip4.gigabit.dk (185.24.168.38)  1.194 ms
 4  10ge1-2.core1.cph1.he.net (216.66.83.101)  1.297 ms
 5  be2306.ccr42.ham01.atlas.cogentco.com (130.117.3.237)  6.805 ms
 6  149.6.142.130 (149.6.142.130)  6.925 ms
 7  104.24.125.13 (104.24.125.13)  1.501 ms

This is fine in itself.
However, the problem stems from the fact that the ECN bits in the IP header are also included in the ECMP hash (-t sets the TOS byte; -t 1 ends up as ECT(0) on the wire and -t 2 is ECT(1)):

$ traceroute -q 1 --sport=10000 104.24.125.13 -t 1
traceroute to 104.24.125.13 (104.24.125.13), 30 hops max, 60 byte packets
 1  _gateway (10.42.3.1)  0.336 ms
 2  albertslund-edge1-lo.net.gigabit.dk (185.24.171.254)  6.964 ms
 3  customer-185-24-168-46.ip4.gigabit.dk (185.24.168.46)  1.056 ms
 4  te0-1-1-5.rcr21.cph01.atlas.cogentco.com (149.6.137.49)  1.512 ms
 5  netnod-ix-cph-blue-9000.cloudflare.com (212.237.192.246)  1.313 ms
 6  104.24.125.13 (104.24.125.13)  1.210 ms

$ traceroute -q 1 --sport=10000 104.24.125.13 -t 2
traceroute to 104.24.125.13 (104.24.125.13), 30 hops max, 60 byte packets
 1  _gateway (10.42.3.1)  0.339 ms
 2  albertslund-edge1-lo.net.gigabit.dk (185.24.171.254)  2.565 ms
 3  customer-185-24-168-38.ip4.gigabit.dk (185.24.168.38)  1.301 ms
 4  10ge1-2.core1.cph1.he.net (216.66.83.101)  1.339 ms
 5  be2306.ccr42.ham01.atlas.cogentco.com (130.117.3.237)  6.570 ms
 6  149.6.142.130 (149.6.142.130)  6.888 ms
 7  104.24.125.13 (104.24.125.13)  1.785 ms

So why is this a problem? The TCP SYN packet first needs to negotiate ECN, so it is sent without any ECN bits set in the header; after negotiation succeeds, the data packets will be marked as ECT(0). But because that becomes part of the ECMP hash, those packets will take another path. And since the destination is anycasted, that means they will also end up at a different endpoint. This second endpoint won't recognise the connection, and will reply with a TCP RST.
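The mechanism described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the Packet class, ecmp_path() and split_flows() are hypothetical names, and SHA-256 stands in for whatever hash the real router uses (which is not known from this thread). The addresses, ports and TOS values are the ones from the traces in this message.

```python
import hashlib
from dataclasses import dataclass, replace

ECN_MASK = 0x03  # the low two bits of the TOS byte carry ECN (RFC 3168)


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str
    proto: int
    sport: int
    dport: int
    tos: int  # full TOS byte: 6 DSCP bits followed by 2 ECN bits


def ecmp_path(pkt: Packet, n_paths: int, hash_ecn: bool) -> int:
    """Pick one of n_paths by hashing header fields.

    Hypothetical hash for illustration only, not any vendor's algorithm.
    With hash_ecn=False the ECN bits are masked out before hashing,
    but the DSCP bits still contribute.
    """
    tos = pkt.tos if hash_ecn else pkt.tos & ~ECN_MASK
    key = f"{pkt.src}|{pkt.dst}|{pkt.proto}|{pkt.sport}|{pkt.dport}|{tos}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % n_paths


def split_flows(sports, n_paths: int, hash_ecn: bool) -> list:
    """Return the source ports whose SYN and data packets pick different paths."""
    broken = []
    for sport in sports:
        # The SYN carries no ECN marking; later data packets are marked ECT(0).
        syn = Packet("10.42.3.130", "104.24.125.13", 6, sport, 80, tos=0x00)
        data = replace(syn, tos=0x02)  # tos 0x2 is ECT(0), as in the tcpdump trace
        if ecmp_path(syn, n_paths, hash_ecn) != ecmp_path(data, n_paths, hash_ecn):
            broken.append(sport)
    return broken


if __name__ == "__main__":
    ports = range(10000, 10200)
    with_ecn = split_flows(ports, 2, hash_ecn=True)
    without_ecn = split_flows(ports, 2, hash_ecn=False)
    print("hash includes ECN bits:", len(with_ecn), "of", len(ports), "flows split")
    print("hash ignores ECN bits: ", len(without_ecn), "of", len(ports), "flows split")
```

With the ECN bits masked, the SYN and the data packets of a flow always produce the same hash input, so no flow can split. With the ECN bits included, some flows send their data packets down a different path than their SYN, which is the split that ends in a TCP RST from the second anycast endpoint.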
The reset is clearly visible in tcpdump; notice the different TOS values, and that the RST packet has a different TTL than the SYN-ACK:

12:21:47.816359 IP (tos 0x0, ttl 64, id 25687, offset 0, flags [DF], proto TCP (6), length 60)
    10.42.3.130.34420 > 104.24.125.13.80: Flags [SEW], cksum 0xf2ff (incorrect -> 0x0853), seq 3345293502, win 64240, options [mss 1460,sackOK,TS val 4248691972 ecr 0,nop,wscale 7], length 0
12:21:47.823395 IP (tos 0x0, ttl 58, id 0, offset 0, flags [DF], proto TCP (6), length 52)
    104.24.125.13.80 > 10.42.3.130.34420: Flags [S.E], cksum 0x9f4a (correct), seq 1936951409, ack 3345293503, win 29200, options [mss 1400,nop,nop,sackOK,nop,wscale 10], length 0
12:21:47.823479 IP (tos 0x0, ttl 64, id 25688, offset 0, flags [DF], proto TCP (6), length 40)
    10.42.3.130.34420 > 104.24.125.13.80: Flags [.], cksum 0xf2eb (incorrect -> 0x503e), seq 1, ack 1, win 502, length 0
12:21:47.823665 IP (tos 0x2,ECT(0), ttl 64, id 25689, offset 0, flags [DF], proto TCP (6), length 117)
    10.42.3.130.34420 > 104.24.125.13.80: Flags [P.], cksum 0xf338 (incorrect -> 0xc1d4), seq 1:78, ack 1, win 502, length 77: HTTP, length: 77
        GET / HTTP/1.1
        Host: 104.24.125.13
        User-Agent: curl/7.66.0
        Accept: */*

12:21:47.825485 IP (tos 0x2,ECT(0), ttl 60, id 0, offset 0, flags [DF], proto TCP (6), length 40)
    104.24.125.13.80 > 10.42.3.130.34420: Flags [R], cksum 0x3a65 (correct), seq 1936951410, win 0, length 0

The fix is to stop hashing on the ECN bits when doing ECMP. You could keep hashing on the diffserv part of the TOS field if you want, but I think it would also be fine to just exclude the TOS field entirely from the hash.

--
I don't think the execution is relevant when it was obviously a bad idea in the first place.
This is like putting rabid weasels in your pants, and later expressing regret at having chosen those particular rabid weasels and that pair of pants.
   ---maf
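The TTL comparison made on the tcpdump trace in this message can also be done mechanically. The following is a rough sketch, assuming the two-line-per-packet text format that tcpdump -v prints; parse_packets() and ttls_by_sender() are hypothetical helper names, and SAMPLE is the trace from this thread with the checksum and option fields trimmed for brevity.

```python
import re
from collections import defaultdict

# Header line printed by `tcpdump -v` for each IPv4 packet (TOS and TTL).
HEADER_RE = re.compile(r"IP \(tos (0x[0-9a-fA-F]+).*?ttl (\d+)")
# Line carrying the addresses and TCP flags.
FLOW_RE = re.compile(r"(\S+) > (\S+): Flags \[([^\]]+)\]")

# Trace from this thread, trimmed (checksums and TCP options removed).
SAMPLE = """\
12:21:47.816359 IP (tos 0x0, ttl 64, id 25687, offset 0, flags [DF], proto TCP (6), length 60)
    10.42.3.130.34420 > 104.24.125.13.80: Flags [SEW], seq 3345293502, win 64240, length 0
12:21:47.823395 IP (tos 0x0, ttl 58, id 0, offset 0, flags [DF], proto TCP (6), length 52)
    104.24.125.13.80 > 10.42.3.130.34420: Flags [S.E], seq 1936951409, ack 3345293503, win 29200, length 0
12:21:47.823479 IP (tos 0x0, ttl 64, id 25688, offset 0, flags [DF], proto TCP (6), length 40)
    10.42.3.130.34420 > 104.24.125.13.80: Flags [.], seq 1, ack 1, win 502, length 0
12:21:47.823665 IP (tos 0x2,ECT(0), ttl 64, id 25689, offset 0, flags [DF], proto TCP (6), length 117)
    10.42.3.130.34420 > 104.24.125.13.80: Flags [P.], seq 1:78, ack 1, win 502, length 77: HTTP, length: 77
12:21:47.825485 IP (tos 0x2,ECT(0), ttl 60, id 0, offset 0, flags [DF], proto TCP (6), length 40)
    104.24.125.13.80 > 10.42.3.130.34420: Flags [R], seq 1936951410, win 0, length 0
"""


def parse_packets(text):
    """Pair each header line (TOS, TTL) with the flow line that follows it."""
    packets = []
    pending = None
    for line in text.splitlines():
        header = HEADER_RE.search(line)
        if header:
            pending = (int(header.group(1), 16), int(header.group(2)))
        flow = FLOW_RE.search(line)
        if flow and pending is not None:
            tos, ttl = pending
            packets.append({
                "src": flow.group(1),
                "dst": flow.group(2),
                "flags": flow.group(3),
                "tos": tos,
                "ttl": ttl,
            })
            pending = None
    return packets


def ttls_by_sender(packets):
    """Collect the set of TTL values observed from each source address.port."""
    seen = defaultdict(set)
    for pkt in packets:
        seen[pkt["src"]].add(pkt["ttl"])
    return dict(seen)


if __name__ == "__main__":
    for sender, ttls in ttls_by_sender(parse_packets(SAMPLE)).items():
        note = "  <- more than one TTL" if len(ttls) > 1 else ""
        print(sender, sorted(ttls), note)
```

More than one TTL from the same remote address within a single connection is only a hint that the replies came from different hosts or over different paths, not proof; it is the same observation made by eye in the trace.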