nanog mailing list archives

Re: Global Akamai Outage


From: Lukas Tribus <lukas () ltri eu>
Date: Tue, 27 Jul 2021 23:23:09 +0200

Hello,


On Tue, 27 Jul 2021 at 21:02, heasley <heas () shrubbery net> wrote:
>> But I have to emphasize that all those are just examples. Unknown bugs
>> or corner cases can lead to similar behavior in "all in one" daemons
>> like Fort and Routinator. That's why specific improvements absolutely
>> do not remove the need to monitor the RTR servers.

> I am not convinced that I want the RTR server to be any smarter than
> necessary, and I think expiration handling is too smart.  I want it to
> load the VRPs provided and serve them, no more.
>
> Leave expiration to the validator and monitoring of both to the NMS and
> other means.

While I'm all for KISS, the expiration feature makes sure that the
cryptographic validity of the ROAs is respected not only on the
validator, but also on the RTR server. This is necessary because
nothing in the RTR protocol indicates expiration, and this change at
least brings it into the JSON exchange between validator and RTR
server.

It's like TTL in DNS, and it's about respecting the wishes of the
authority (CA and ROA resource holder).
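
For illustration, a minimal sketch of the kind of expiration handling I
mean, assuming a rpki-client/stayrtr-style JSON export where each VRP
carries an "expires" unix timestamp (the field names and the file name
here are illustrative, not any particular tool's exact format):

    package main

    import (
        "encoding/json"
        "fmt"
        "os"
        "time"
    )

    // VRP mirrors one entry of the JSON export; "expires" is derived
    // from the ROA/EE certificate/CRL validity.
    type VRP struct {
        Prefix  string `json:"prefix"`
        ASN     string `json:"asn"`
        MaxLen  int    `json:"maxLength"`
        Expires int64  `json:"expires"`
    }

    type Export struct {
        ROAs []VRP `json:"roas"`
    }

    func main() {
        raw, err := os.ReadFile("export.json") // illustrative file name
        if err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }
        var e Export
        if err := json.Unmarshal(raw, &e); err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }
        // Drop VRPs whose cryptographic validity has already lapsed, so
        // the RTR side never serves data the CA no longer vouches for.
        now := time.Now().Unix()
        var fresh []VRP
        for _, v := range e.ROAs {
            if v.Expires == 0 || v.Expires > now {
                fresh = append(fresh, v)
            }
        }
        fmt.Printf("%d of %d VRPs still within their validity window\n",
            len(fresh), len(e.ROAs))
    }

The point is simply that the RTR side gets enough information to stop
serving stale data, the same way a resolver stops serving a record past
its TTL.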


> The delegations should not be changing quickly[1] enough

How did you come to this conclusion? If I decide I'd like to originate
a /24 out of my aggregate for DDoS mitigation purposes, why shouldn't I
be able to update my ROA and expect quasi-complete convergence within 1
or 2 hours?


> for me to prefer expiration over the grace period to correct a validator
> problem.  That does not prevent an operator from using other means to
> share fate; e.g., if the validator fails completely for 2 hours, stop
> the RTR server.
>
> I perceive this to be choosing stability in the RTR sessions over
> timeliness of updates.  And, if a 15 - 30 minute polling interval is
> reasonable, why isn't 8 - 24 hours?

Well, for one, I'd like my ROAs to propagate in 1 or 2 hours. If I need
to wait 24 hours, that could cause operational issues for me (in the
DDoS mitigation case above, the more-specific /24 stays Invalid at
validating networks until the updated ROA reaches their routers; the
same goes for any other normal routing change).

The entire RPKI system is designed to fail open, so if you have
multiple failures and *all* your RTR servers go down, the worst case is
that the routes on the BGP routers turn NotFound, so you'd lose the
benefit of RPKI validation. It's *way* *way* more harmful to have
obsolete VRPs on your routers. If it's just a few hours, the impact
will probably not be catastrophic. But what if it's 36 hours, 72 hours?
What if RPKI validation started failing 2 weeks ago, when Jerry from IT
("the linux guy") went on vacation?

On the other hand, if only one (of multiple) validator/RTR instances
has a problem and its number of VRPs slowly goes down, nothing will
happen at all on your routers: they use the union of all RTR endpoints,
and the VRPs from the broken RTR server will slowly be withdrawn. Your
router will keep using the healthy RTR servers, as opposed to
considering erroneous data from a poisoned RTR server.

I define stability not as "RTR session uptime and VRP count", but as
whether my BGP routers are making correct or incorrect decisions.
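
To make "correct or wrong decisions" concrete: the router computes the
RFC 6811 origin-validation state from the union of the VRPs it learned
from all of its caches, roughly like the sketch below (prefixes and
ASNs are made up):

    package main

    import (
        "fmt"
        "net/netip"
    )

    type VRP struct {
        Prefix netip.Prefix
        MaxLen int
        ASN    uint32
    }

    // validate returns the RFC 6811 state: NotFound if no VRP covers the
    // route, Valid if a covering VRP matches origin AS and maxLength,
    // Invalid otherwise.
    func validate(route netip.Prefix, origin uint32, vrps []VRP) string {
        covered := false
        for _, v := range vrps {
            if v.Prefix.Bits() <= route.Bits() && v.Prefix.Contains(route.Addr()) {
                covered = true
                if v.ASN == origin && route.Bits() <= v.MaxLen {
                    return "Valid"
                }
            }
        }
        if covered {
            return "Invalid"
        }
        return "NotFound"
    }

    func main() {
        // Union of VRPs learned from two caches (illustrative values).
        vrps := []VRP{
            {netip.MustParsePrefix("198.51.100.0/22"), 24, 64500}, // from cache A
            {netip.MustParsePrefix("203.0.113.0/24"), 24, 64501},  // from cache B
        }
        fmt.Println(validate(netip.MustParsePrefix("198.51.100.0/24"), 64500, vrps)) // Valid
        fmt.Println(validate(netip.MustParsePrefix("198.51.100.0/24"), 64666, vrps)) // Invalid
        fmt.Println(validate(netip.MustParsePrefix("192.0.2.0/24"), 64502, vrps))    // NotFound
    }

If one cache dies and its VRPs age out of the union, routes that only
it covered fall back to NotFound rather than suddenly turning Invalid,
and routes covered by the surviving caches are unaffected.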


> I too prefer an approach where the validator and RTR are separate but
> co-located, but this naturally increases the possibility that the two
> might serve different data due to reachability, validator run-time, ...
> To what extent differences occur, I have not measured.
>
> [1] The NIST ROA graph confirms the rate of change is low, as I would
> expect.  But, I have no statistic for ROA stability, considering only
> the prefix and origin.

I don't see how the rate of global ROA changes is in any way related
to this issue. The operational issue a hung RTR endpoint creates for
other people's networks can't be measured with this.


lukas

