nanog mailing list archives
RE: BGP and The zero window edge
From: "Jakob Heitz (jheitz) via NANOG" <nanog () nanog org>
Date: Wed, 21 Apr 2021 21:22:57 +0000
I'd like to get some data on what actually happened in the real cases and analyze it. If it's a Cisco router at fault, then we have a bug to fix. Even if it's not a Cisco, there may be ways we can help to avoid the situation. However, before we start on solutions, I'd like to get a good understanding of what actually happened. TCP zero window is possible, but many other things could cause it too. Anyone?

Regards,
Jakob.

-----Original Message-----
From: Job Snijders <job () fastly com>
Sent: Wednesday, April 21, 2021 2:11 PM
To: Jakob Heitz (jheitz) <jheitz () cisco com>
Cc: nanog () nanog org
Subject: Re: BGP and The zero window edge

Dear Jakob, group,

On Wed, Apr 21, 2021 at 08:59:06PM +0000, Jakob Heitz (jheitz) via NANOG wrote:
> Ben's blog details an experiment in which he advertises routes and then withdraws them, but some of them remain stuck for days. I'd like to get to the bottom of this problem.
I think there are *two* problems:

1) some BGP implementations (or multi-node BGP configurations) sometimes end up getting stuck in one way or another.
2) other BGP nodes are not able to disconnect/reconnect to systems suffering from instantiations of problem #1.

While on the one hand it is important to follow up on each and every instantiation of problem #1, I personally think it is also worthwhile exploring whether the BGP FSM itself can be redefined in a way that encourages BGP protocol implementations to be more robust and rely less on the remote peer behaving correctly. Once problem #2 is addressed, finding and isolating instances of problem #1 will become much easier.
> Has anyone else seen this before or can provide data to analyze? On or off list.
From the BGP Default-Free Zone perspective it is hard to differentiate between an entire (multi-vendor) Autonomous System being stuck, or just one router. To test individual router implementations this tool is useful: https://github.com/benjojo/bgp-zerowindow-test - but please keep in mind that the "TCP Recv Wind == 0" trick is just one way to easily get a BGP peer to manifest the problematic behavior.
From a BGP protocol perspective BGP nodes shouldn't inspect the TCP receive window, but rather focus on whether all locally available signals indicate that the remote peer is still progressing data.

Kind regards,

Job
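
As an illustration of the "locally available signals" idea above, here is a minimal sketch of a per-peer watchdog that never looks at the TCP receive window. It only tracks how many bytes the local speaker has queued for the peer and how many the local socket has accepted, and flags the session once a backlog has made no progress for a configured time. The class name, method names and the 90-second value are assumptions made for this example; they are not taken from the thread or from any particular BGP implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SendProgressWatchdog:
    """Hypothetical per-peer tracker of whether queued data is draining."""

    hold_time: float                       # seconds of no progress tolerated (assumed policy)
    pending_bytes: int = 0                 # bytes queued locally but not yet written out
    stalled_since: Optional[float] = None  # time of last progress while a backlog exists

    def on_queued(self, nbytes: int, now: float) -> None:
        # Start the clock when a backlog first appears.
        if self.pending_bytes == 0:
            self.stalled_since = now
        self.pending_bytes += nbytes

    def on_sent(self, nbytes: int, now: float) -> None:
        # Called with the byte count the local socket write accepted.
        if nbytes <= 0:
            return  # a write that accepted nothing is not progress
        self.pending_bytes = max(0, self.pending_bytes - nbytes)
        # Any progress restarts the clock; an empty queue stops it.
        self.stalled_since = now if self.pending_bytes else None

    def should_reset(self, now: float) -> bool:
        # True once a backlog has existed for hold_time without any progress.
        return (
            self.stalled_since is not None
            and now - self.stalled_since >= self.hold_time
        )


# Example: 4096 bytes queued at t=0, partial progress at t=60, nothing after.
w = SendProgressWatchdog(hold_time=90.0)
w.on_queued(4096, now=0.0)
w.on_sent(1024, now=60.0)
print(w.should_reset(120.0))  # 60 s since last progress
print(w.should_reset(150.0))  # 90 s since last progress
```

A speaker using something like this would tear down and re-establish the session when should_reset() returns true, which addresses problem #2 regardless of why the remote side stopped reading.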