nanog mailing list archives
Re: Global Akamai Outage
From: Lukas Tribus <lukas () ltri eu>
Date: Mon, 26 Jul 2021 19:04:41 +0200
Hello!

On Mon, 26 Jul 2021 at 17:50, heasley <heas () shrubbery net> wrote:
> Mon, Jul 26, 2021 at 02:20:39PM +0200, Lukas Tribus:
> > rpki-client 7.1 emits a new per VRP attribute: expires, which makes it
> > possible for RTR servers to stop considering outdated VRP's:
> > https://github.com/rpki-client/rpki-client-openbsd/commit/9e48b3b6ad416f40ac3b5b265351ae0bb13ca925
>
> Since rpki-client removes "outdated" (expired) VRPs, how does an RTR
> server "stop considering" something that does not exist from its PoV?

rpki-client can only remove outdated VRPs if it a) actually runs and b) successfully completes a validation cycle. It also needs to do this BEFORE the RTR server distributes data.

If rpki-client for whatever reason doesn't complete a validation cycle [doesn't start, crashes, cannot write to the file], it will not be able to update the file that stayrtr reads and distributes.

If your VM went down with both rpki-client and stayrtr, and it stays down for 2 days (maybe a nasty storage or virtualization problem, or maybe this is just a PSU failure in a SPOF server), then when the VM comes back up, stayrtr will read and distribute 2-day-old data. After all, rpki-client is a periodic cronjob while stayrtr will start immediately, so there will be plenty of time to distribute obsolete VRPs. Just because you have another validator and RTR server in another region that was always available doesn't mean that the erroneous and obsolete data served by this server will be ignored.

There are more reasons and failure scenarios why this two-piece setup (periodic RPKI validation, separate RTR daemon) can become a "split brain". As you implement more complicated setups (a single global RPKI validation result is distributed to regional RTR servers, the cloudflare approach), things get even more complicated. Generally I prefer the all-in-one approach for these reasons (FORT validator).

At least if it crashes, it takes down the RTR server with it:

https://github.com/NICMx/FORT-validator/issues/40#issuecomment-695054163

But I have to emphasize that all those are just examples. Unknown bugs or corner cases can lead to similar behavior in "all in one" daemons like Fort and Routinator. That's why specific improvements absolutely do not mean we don't have to monitor the RTR servers.

lukas
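
To illustrate the point about the per-VRP expires attribute: a rough sketch of how a consumer of the validator's output could drop outdated VRPs on its own, even if the validator has not run for days. This is not stayrtr's actual code. It assumes a JSON document with a top-level "roas" list whose entries carry an "expires" field in Unix seconds; the function name fresh_vrps and the choice to treat entries without "expires" as expired are hypothetical.

```python
import time


def fresh_vrps(doc, now=None):
    """Return only the VRPs whose 'expires' timestamp is still in the future.

    Assumption: doc is the parsed validator output, with a "roas" list of
    dicts that each carry an "expires" field (Unix seconds). Entries without
    "expires" are treated as expired here, which is a deliberate but
    debatable choice for this sketch.
    """
    now = time.time() if now is None else now
    return [vrp for vrp in doc.get("roas", []) if vrp.get("expires", 0) > now]


if __name__ == "__main__":
    # Hypothetical two-entry file that was written before a long outage.
    stale_file = {
        "roas": [
            {"asn": 64496, "prefix": "192.0.2.0/24", "maxLength": 24, "expires": 2000},
            {"asn": 64497, "prefix": "198.51.100.0/24", "maxLength": 24, "expires": 1000},
        ]
    }
    # At time 1500 only the first entry is still valid.
    print(fresh_vrps(stale_file, now=1500))
```

With a check like this in the RTR server, a file that is 2 days old would be served with its expired entries removed instead of being distributed as-is. It does not replace monitoring the RTR servers.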