nanog mailing list archives

Re: Towards an RPKI-rich Internet (and the appropriate allocation of responsibility in the event an RIR RPKI CA outage)


From: Job Snijders <job () ntt net>
Date: Mon, 1 Oct 2018 21:19:15 +0000

Dear all,

I'm very happy to see the direction this conversation has taken. It seems
we've moved on towards focussing on solutions and outcomes - this is
encouraging.

On Mon, Oct 01, 2018 at 05:44:17PM +0100, Nick Hilliard wrote:
John Curran wrote on 01/10/2018 00:21:
There are likely some on the nanog mailing list who have a view on
this matter, so I pose the question of "who should be responsible"
for consequences of RPKI RIR CA failure to this list for further
discussion.

other replies in this thread have assumed that RPKI CA failure modes
are restricted to loss of availability, but there are other failure
modes, for example:

- fraud: rogue CA employee / external threat actor signs ROAs
illegitimately

- negligence: CA accidentally signs illegitimate ROAs due to e.g.
software bug

- force majeure: e.g. court orders CA to sign prefix with AS0,
complicated by NIR RPKI delegation in jurisdictions which may have
difficult relations with other parts of the world.

These types of situations are well-trodden territory for other types
of PKI CA, where users

Otherwise, as other people have pointed out, catastrophic systems
failure at the CA is designed to be fail-safe.  I.e. if the CA goes
away, ROAs will be evaluated as "unknown" and life will continue on.
If people misconfigure their networks and do silly things with this
specific failure mode, that's their problem.  You can't stop people
from aiming guns at their feet and pulling the trigger.
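
The fail-safe behaviour described above can be illustrated with a minimal
sketch of route origin validation, assuming RFC 6811 semantics (where the
"unknown" state is called NotFound). The names VRP and validate are
hypothetical and not taken from any particular validator; real
implementations differ in many details.

```python
from dataclasses import dataclass
from ipaddress import ip_network


@dataclass(frozen=True)
class VRP:
    """Validated ROA Payload: one (prefix, max length, origin AS) tuple."""
    prefix: str
    max_length: int
    asn: int


def validate(route_prefix: str, origin_asn: int, vrps) -> str:
    """Classify a route as 'valid', 'invalid' or 'not-found' (RFC 6811 style)."""
    route = ip_network(route_prefix)
    covered = False
    for vrp in vrps:
        vrp_net = ip_network(vrp.prefix)
        # A VRP covers the route only if the route lies inside the VRP prefix.
        if route.version != vrp_net.version or not route.subnet_of(vrp_net):
            continue
        covered = True
        # AS0 can never match, so an AS0 VRP only ever contributes "invalid".
        if (vrp.asn != 0 and vrp.asn == origin_asn
                and route.prefixlen <= vrp.max_length):
            return "valid"
    # Covered but unmatched is invalid; no covering VRP at all is not-found.
    return "invalid" if covered else "not-found"


if __name__ == "__main__":
    vrps = [VRP("192.0.2.0/24", 24, 64496)]
    print(validate("192.0.2.0/24", 64496, vrps))  # valid
    print(validate("192.0.2.0/24", 64511, vrps))  # invalid
    # CA gone and its ROAs no longer available: nothing covers the route.
    print(validate("192.0.2.0/24", 64496, []))    # not-found
    # Illustrative AS0 VRP, as in the force majeure case above.
    as0 = [VRP("198.51.100.0/24", 24, 0)]
    print(validate("198.51.100.0/24", 64496, as0))  # invalid
```

With an empty VRP set the route falls back to not-found rather than
invalid, which is why loss of the CA alone does not cause routes to be
rejected unless the operator has chosen to drop not-found routes. The AS0
case is different: the route is covered but can never match, so it
evaluates as invalid.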

There are a number of failure modes and I believe the operational
community has yet to fully explore how to mitigate most risks. Over time
I expect we'll develop BCPs on how to improve the robustness of the system;
these BCPs can only come into existence driven by actual operational
experience.

A positive development that addresses some aspects of the concerns
raised is Certificate Transparency. Cloudflare set up a CT log
(https://groups.google.com/forum/#!topic/certificate-transparency/_deL5iGB5sY)
and I hope others like Google will also consider doing this. CT is a
great tool to help keep the roots performing in line with community
expectations.

I consider it the operator community's responsibility to figure out how
to deal with outages. I don't intend to hold the RIRs liable - we'll
need to learn to protect ourselves.

Kind regards,

Job

