nanog mailing list archives

Re: plea for comcast/sprint handoff debug help


From: Alex Band <alex () nlnetlabs nl>
Date: Sat, 31 Oct 2020 13:02:04 +0100

Hi Tony,

I realise there are quite a few moving parts here, so I'll try to summarise our design choices and reasoning as clearly
as possible.

Rsync was the original transport for RPKI and is still mandatory to implement. RRDP (which uses HTTPS) was introduced 
to overcome some of the shortcomings of rsync. Right now, all five RIRs make their Trust Anchors available over HTTPS, 
all but two RPKI repositories support RRDP and all but one relying party software package supports RRDP. There is
currently an IETF draft to deprecate the use of rsync.

As a result, the bulk of RPKI traffic is currently transported over RRDP and only a small amount relies on rsync. For 
example, our RPKI repository is configured accordingly: rrdp.rpki.nlnetlabs.nl is served by a CDN and 
rsync.rpki.nlnetlabs.nl runs rsyncd on a simple, small VM to deal with the remaining traffic. When operators deploying 
our Krill Delegated RPKI software ask us what to expect and how to provision their services, this is how we explain the 
current state of affairs.
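
For illustration, the rsync side of such a setup can be as simple as a single read-only rsyncd module on that VM. The
paths, names and limits below are made-up examples to show the shape of it, not our actual configuration:

    # /etc/rsyncd.conf -- illustrative values only
    uid = nobody
    gid = nogroup
    # cap simultaneous fetches so a small VM isn't overwhelmed
    max connections = 50

    [repo]
    # published as rsync://rsync.example.net/repo/
    path = /srv/rpki/repo
    comment = RPKI repository
    read only = yes

The heavy lifting is then done by the CDN sitting in front of the RRDP endpoint.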

With this in mind, Routinator currently has the following fetching strategy (a rough code sketch follows the list):

1. It starts by connecting to the Trust Anchors of the RIRs over HTTPS, if possible, and otherwise uses rsync.
2. It walks the certificate tree, following several pointers to publication servers along the way. These pointers can
be rsync only, or there can be two pointers: one to rsync and one to RRDP.
3. If an RRDP pointer is found, Routinator will try to connect to the service and verify that it presents a valid TLS
certificate and that data can be fetched successfully. If so, the server is marked as usable and Routinator will prefer
it. If the initial check fails, Routinator will use rsync, but will check again whether RRDP works on the next
validation run.
4. If RRDP worked before but is unavailable for any reason, Routinator will use cached data and try again on the next
run instead of immediately falling back to rsync.
5. If the RPKI publication server operator takes away the pointer to RRDP to indicate they no longer offer this 
communication protocol, Routinator will use rsync.
6. If Routinator's cache is cleared, the process starts fresh.
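
To make the decision order above concrete, here is a rough, self-contained sketch in Rust. The names and types
(RrdpState, Action, choose_transport) are invented for illustration and are not Routinator's actual internals; the
logic only mirrors steps 3 through 6 as described above.

    // Illustrative sketch only -- not Routinator code. It models the choice
    // between RRDP and rsync for a single publication point.

    /// What a relying party remembers about a publication point's RRDP
    /// service from earlier runs (a cleared cache means NeverChecked, step 6).
    #[derive(Clone, Copy)]
    enum RrdpState {
        NeverChecked, // no successful RRDP fetch recorded yet
        KnownUsable,  // an earlier run fetched data over RRDP successfully
    }

    #[derive(Debug, PartialEq)]
    enum Action {
        FetchRrdp,          // fetch fresh data over RRDP
        FetchRsync,         // fall back to rsync
        UseCacheRetryLater, // serve cached data, retry RRDP next run (step 4)
    }

    fn choose_transport(has_rrdp_pointer: bool, state: RrdpState, rrdp_up_now: bool) -> Action {
        if !has_rrdp_pointer {
            // Step 5: the operator no longer advertises RRDP at all.
            return Action::FetchRsync;
        }
        match state {
            // Step 3: first contact; probe RRDP, otherwise use rsync this run.
            RrdpState::NeverChecked => {
                if rrdp_up_now { Action::FetchRrdp } else { Action::FetchRsync }
            }
            // Step 4: RRDP proved itself before, so treat an outage as
            // transient and avoid stampeding the rsync server.
            RrdpState::KnownUsable => {
                if rrdp_up_now { Action::FetchRrdp } else { Action::UseCacheRetryLater }
            }
        }
    }

    fn main() {
        // Fresh cache and RRDP down: fall back to rsync (steps 3 and 6).
        assert_eq!(choose_transport(true, RrdpState::NeverChecked, false), Action::FetchRsync);
        // RRDP worked before but is down now: stay on cached data (step 4).
        assert_eq!(choose_transport(true, RrdpState::KnownUsable, false), Action::UseCacheRetryLater);
        // RRDP pointer withdrawn: use rsync (step 5).
        assert_eq!(choose_transport(false, RrdpState::KnownUsable, true), Action::FetchRsync);
        println!("fallback sketch behaves as described");
    }

The point is that a publication point which has already proven it speaks RRDP is treated as temporarily down rather
than gone when it fails, which is exactly the behaviour Randy's long-lived outage exercises.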

This strategy was implemented with repository server provisioning in mind. We are assuming that if you actively 
indicate that you offer RRDP, you actually provide a monitored service there. As such, an outage would be assumed to be 
transient in nature. Routinator could fall back immediately, of course. But our thinking was that if the RRDP service
has a small hiccup, 1,000+ Routinator instances would currently be hammering a possibly underprovisioned rsync server,
perhaps causing even more problems for the operator.

"Transient" is currently the focus. In Randy's experiment, he is actively advertising he offers RRDP, but doesn't offer 
a service there for weeks at a time. As I write this, ca.rg.net, cb.rg.net and cc.rg.net have been returning a 404 on
their RRDP endpoints for several weeks and counting. cc.rg.net was unavailable over rsync for several days this week as
well. 

I would assume this is not how operators would run their RPKI publication server normally. Not having an RRDP service 
for weeks when you advertise you do is fine for an experiment but constitutes pretty bad operational practice for a 
production network. If a service becomes unavailable, the operator would swiftly be contacted and the issue would be 
resolved, as Randy and I have done in happier times:

https://twitter.com/alexander_band/status/1209365918624755712
https://twitter.com/enoclue/status/1209933106720829440

On a personal note, I realise the situation has a dumpster fire feel to it. I contacted Randy about his outages months
ago, not knowing they were a research project. I never got a reply. Instead of discussing his research and the
observed effects, it feels like a 'gotcha' to present the findings in this way. It could even be considered 
irresponsible, if the fallout is as bad as he claims. The notion that using our software is quote, "a disaster waiting 
to happen", is disingenuous at best:

https://www.ripe.net/ripe/mail/archives/members-discuss/2020-September/004239.html

Routinator's design tries to deal with outages in a responsible manner for all actors involved. Again, of course we
can change our strategy as a result of this discussion, which I'm happy we're now actually having. In that case I would
advise operators who offer an RPKI publication server to ensure that they provision their rsyncd service so that it is
capable of handling all of the traffic that their RRDP service normally handles, in case RRDP has a glitch. And even if
people scale their rsync service accordingly, they will only ever find out whether it actually copes in a time of
crisis.

Kind regards,

-Alex

On 31 Oct 2020, at 07:17, Tony Tauber <ttauber () 1-4-5 net> wrote:

As I've pointed out to Randy and others, I'll share here.
We had planned, but had not yet performed, an upgrade of our Routinator RP (Relying Party) software to the latest v0.8,
which I knew had some improvements.
I assumed the problems we were seeing would be fixed by the upgrade.
Indeed, when I pulled down the new SW to a test machine, loaded and ran it, I could get both Randy's ROAs.
I figured I was good to go.  
Then we upgraded the prod machine to the new version and the problem persisted.
An hour or two of analysis made me realize that the "stickiness" of a particular PP (Publication Point) is encoded in 
the cache filesystem.
Routinator seems to build entries in its cache directory under either rsync, rrdp, or http. The rg.net PPs weren't
showing under rsync, but moving the cache directory aside and forcing it to rebuild fixed the issue.
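
For reference, the cache layout involved looks roughly like this on my test machine (exact paths may differ by version
and platform):

    ~/.rpki-cache/repository/
        rrdp/     <- data fetched over RRDP
        rsync/    <- modules mirrored over rsync
        http/     <- objects fetched over plain HTTPS

The rg.net PPs only had state under rrdp/, and a rebuilt cache no longer carried that history.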

A couple of points seem to follow:
      • Randy says: "finding the fort rp to be pretty solid!"  I'll say that if you loaded a fresh Fort and fresh 
Routinator install, they would both have your ROAs.
      • The sense of "stickiness" is local only; hence to my mind the protection against "downgrade" attack is 
somewhat illusory. A fresh install knows nothing of history.
Tony

On Fri, Oct 30, 2020 at 11:57 PM Randy Bush <randy () psg com> wrote:
If there is a covering less specific ROA issued by a parent, this will
then result in RPKI invalid routes.

i.e. the upstream kills the customer.  not a wise business model.

The fall-back may help in cases where there is an accidental outage of
the RRDP server (for as long as the rsync servers can deal with the
load)

folk try different software, try different configurations, realize that
having their CA gooey exposed because they wanted to serve rrdp and
block, ...

randy, finding the fort rp to be pretty solid!

