nanog mailing list archives

Re: Did your BGP crash today?


From: Gary Buhrmaster <gary.buhrmaster () gmail com>
Date: Mon, 30 Aug 2010 19:39:16 +0000

On Mon, Aug 30, 2010 at 15:55, Jack Bates <jbates () brightok net> wrote:

...
As good a place to break in on the thread as any, I guess. Randy and others
believe more testing should have been done. I'm not completely sure they
didn't test against XR. They very likely could have tested in a 1 on 1
connection and everything looked fine.

I don't know the full details, but at what point did the corruption appear,
and was it visible? We know that it was corrupt on the output which caused
peer resets, but was it necessarily visible in the router itself?

Do we require a researcher to set up a chain of every vendor BGP speaker in
every possible configuration and order to verify a bug doesn't cause things
to break? In this case, one very likely would need an XR receiving and
transmitting updates to detect the failure, so no less than 3 routers with
the XR in the middle.

What about individual configurations? Perhaps the update is received and
altered by one vendor due to specific configurations, sent to the next
vendor, accepted and altered (due to the first alteration, whereas it
wouldn't be altered if the original update had been received) which causes
the next vendor to reset. Then we add to this that it may pass silently
through several middle vendor routers without problems and we realize the
scope of such problems and why connecting to the Internet is so
unpredictable.

I am not aware that anyone has provided the complete details at
this point which would include any test plans that may have been
performed.  From what I have been able to discern, it does seem
likely that a test plan that would have caught this almost had to
know of the specific issue in advance.  More testing would have
been better, but there is just too much variability out there to
assure you can do a complete test.
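
To put a rough number on that variability, here is a back-of-the-envelope
sketch. The vendor names and the count of four configurations per router
are made-up placeholders, not figures from the incident; the only thing
taken from the thread is the "3 routers with the XR in the middle" shape.

```python
from itertools import permutations

# Hypothetical speaker list: placeholder names, not a real vendor survey.
VENDORS = ["vendor_a", "vendor_b", "vendor_c", "vendor_d", "xr"]


def count_test_chains(vendors, configs_per_vendor, chain_length):
    """Count ordered chains of distinct vendors, where every router in
    the chain can run any one of `configs_per_vendor` configurations."""
    orderings = len(list(permutations(vendors, chain_length)))
    return orderings * configs_per_vendor ** chain_length


def chains_with_middle(vendors, middle):
    """List the 3-router chains that have `middle` in the transit position."""
    return [c for c in permutations(vendors, 3) if c[1] == middle]


if __name__ == "__main__":
    # Assumed: 4 configurations per router, chains of 3 routers.
    print(count_test_chains(VENDORS, 4, 3))
    print(len(chains_with_middle(VENDORS, "xr")))
```

Even with only five speakers and four configurations each, the full matrix
of 3-router chains runs to thousands of cases, while the chains that
actually put XR in the transit position are a small fraction of it. Longer
chains and real-world configuration counts only make it worse.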

I am also not aware that the introduction of the attribute was
announced to the usual operational lists in advance "just in
case" (Ok, in this case, I mean NANOG).  This, in my mind,
is actually the bigger faux pas.  An "Oh S***" moment has
happened to most of us.  It probably will happen again to
many of us.  But letting people know in advance of scheduled
changes is the important thing.

I would hope that in the future researchers will commit to
test plans to (at least) all the major vendor BGP speakers
(which, I admit, would likely not have caught this issue),
and that before introducing such "new" attributes into the
"Internet", they would announce it to the usual operational
lists, again, "just in case".  But my hopes are often dashed.

Gary

