nanog mailing list archives

Re: BGP route hijack by AS10990


From: Mark Tinka <mark.tinka () seacom com>
Date: Mon, 3 Aug 2020 17:20:05 +0200



On 3/Aug/20 17:09, Baldur Norddahl wrote:


We suffered a series of crashes that led to JTAC recommending
disabling RPKI. We had a core dump which matches PR1332626, which is
confidential, so I have no idea what it is about. Apparently what
happened was that the server running the RPKI validator rebooted
and the service was not configured to restart automatically. We also
did not have it redundant, nor did we monitor the service. So we had no
working RPKI validation server, and that apparently caused the MX204 to
become unstable in various ways. It might run for a day, but it would
show all sorts of symptoms like packet loss and delays, and generally be
"strange". The first crash caused BGP, ssh and subscriber management
to be down, while LDP, OSPF and SNMP stayed up. It became a black hole we
could not log in to. The worst possible kind of crash for a router. We
had to go onsite and pull the power.
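
For illustration, the missing monitoring could start with something as
simple as a TCP probe of the validator's RTR port. This is a minimal
sketch, not a description of any particular setup: the function name
rtr_port_reachable is made up for this example, and port 323 is the
IANA-assigned rpki-rtr port, which may not be the port a given validator
actually listens on.

```python
import socket


def rtr_port_reachable(host, port=323, timeout=3.0):
    """Return True if a TCP connection to the validator's RTR port succeeds.

    Assumption: the validator speaks RTR over plain TCP on `port`.
    323 is the IANA-assigned rpki-rtr port; adjust to the local setup.
    """
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable, or name resolution failed.
        return False
```

A check like this only shows that something is accepting connections on
the port. It does not prove the validator is serving fresh data, so it
would complement, not replace, monitoring of the RTR session state on
the router itself.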

The router appears to run fine after disabling RPKI. I suppose
starting the validation service may also fix the issue. But I am not
going to go there until I know what is in that PR, and I also feel the
RPKI function needs to be failsafe before we can use it. I know we are
at fault for not deploying the validation service in a redundant setup
and for failing to monitor the service. But we did so because we
thought it was not too important, because a failed validation
service should simply lead to no validation, not a crashed router.
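
The "no validation" expectation matches how origin validation is
defined in RFC 6811: a route with no covering VRP is NotFound, so a
router holding no VRPs at all should classify every route as NotFound
rather than Invalid. The sketch below illustrates that logic only. The
function name validate_origin and the tuple layout are assumptions made
for this example, and it says nothing about how Junos implements this
or about when a router discards cached VRPs after losing its RTR
session.

```python
import ipaddress


def validate_origin(prefix, origin_asn, vrps):
    """Classify a route as 'valid', 'invalid' or 'not-found' (RFC 6811).

    vrps: iterable of (vrp_prefix, max_length, asn) tuples (hypothetical
    layout chosen for this sketch).
    """
    route = ipaddress.ip_network(prefix)
    covered = False
    for vrp_prefix, max_length, asn in vrps:
        vrp = ipaddress.ip_network(vrp_prefix)
        # A VRP covers the route if the route lies inside the VRP prefix.
        if route.version != vrp.version or not route.subnet_of(vrp):
            continue
        covered = True
        # A covering VRP matches if the origin AS agrees and the route is
        # no more specific than max_length. AS 0 never matches.
        if asn != 0 and asn == origin_asn and route.prefixlen <= max_length:
            return "valid"
    # Covered but unmatched is invalid; no covering VRP is not-found.
    return "invalid" if covered else "not-found"
```

With an empty VRP list, validate_origin returns "not-found" for any
route, which is the fail-open behaviour argued for above.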

This is on JUNOS 20.1R1.11.

That's a really nasty bug.

Loss of an RTR session shouldn't kill the box, even if you are running
only one validator. If you can share details about why this happens when
you get them, that would be most helpful.

I'd be curious to know whether this is dependent on a specific
validator, or all of them.

Are there bits in Junos 20 that you can't get in fixed versions of 19?

Mark.
