nanog mailing list archives

Re: Time to revise RFC 1771

From: Dave Israel <davei () biohazard demon digex net>
Date: Tue, 26 Jun 2001 17:10:25 -0400


On 6/26/2001 at 13:47:37 -0700, Clayton Fiske said:

On Tue, Jun 26, 2001 at 04:27:49PM -0400, Dave Israel wrote:


This ignores three basic facts:

1) Networks tend to be homogeneous in platform.
2) Platforms tend to accept their own implementation quirks.
3) Networks peer at borders.

Therefore, under the "drop the session" rule, my bad announcement
gets to all my borders fine, and all my external peers who are not
running forgiving/compatible implementations drop their connections
to me and all my traffic to/from them hits the floor.


In this case, vendor C's implementation was neither forgiving nor
compatible. It still dropped the peer(s) in question. It just had
the much more harmful quirk that it forwarded the bad route on to
its peers before doing so. In this case, a homogeneous network would
not only lose its border sessions, it would lose all internal ones
through which the route was advertised.


I'm certainly not defending (or attacking) either vendor's
implementation; in the current environment, I believe following
the RFC is the correct course.  I was more concerned with
future implementations of BGP, and how (I feel) they should handle
problems like this, since, as we add more and more features to 
BGP, how we handle what appears to be a bad route (or a bad
NLRI) is going to become more important.

One CRC error does not make PPP drop.  Why make one route cause
a catastrophic loss of connectivity?  Report the bad route,
drop it, and move on; let layer 8 resolve it.


Because, arguably, we don't know that it's just one route. We just
know that one route set off the alarm. Do you feel safe assuming that
whatever bug caused one corrupted route left all the other routes
alone?


No, but I feel secure that, if it corrupted a large enough number of
routes, the effect will not be worse than dropping the session.
Somebody mentioned what happens if there are 100,000 bad routes and 1
good one.  You keep the good one and drop the 100,000 bad ones.
Dropping routes is even easier than using them.  Besides, which tends
to be harder on a router: dropping bad routes, or tearing down and
restarting a TCP session?
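
As a rough sketch of the two policies being compared here (the Route
type, its valid flag, and both function names are hypothetical and not
taken from any BGP implementation; "valid" simply stands in for whatever
checks the receiver applies):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str
    valid: bool  # hypothetical stand-in for "passed the receiver's checks"


def drop_session(routes):
    """Strict policy: one bad route resets the session, so every route is lost.

    Returns (kept, lost).
    """
    routes = list(routes)
    if any(not r.valid for r in routes):
        return [], routes
    return routes, []


def drop_bad_routes(routes):
    """Permissive policy: discard (and report) bad routes, keep the rest.

    Returns (kept, rejected).
    """
    routes = list(routes)
    kept = [r for r in routes if r.valid]
    rejected = [r for r in routes if not r.valid]
    return kept, rejected


# One good route and five bad ones from the same peer.
routes = [Route("192.0.2.0/24", True)] + [
    Route(f"10.0.{i}.0/24", False) for i in range(5)
]
print(len(drop_session(routes)[0]))     # routes kept under the strict policy
print(len(drop_bad_routes(routes)[0]))  # routes kept under the permissive policy
```

Under these assumptions the permissive policy never keeps fewer routes
than the strict one, which is the point being argued; the sketch says
nothing about whether the surviving routes can be trusted.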

Plus, a CRC error can occur between two valid, compliant, bug-free
implementations. A bad route, by definition, can't. We're not talking
about external faults here, but broken implementations. When one side
of a protocol session simply breaks the rules, I don't think it's
reasonable to say that the other side needs to be "fixed" to accept
that breakage. Fix the broken side.


A "bad route" can happen whenever one implementation differs from
another.  Both can be valid according to some definition of the
standard.  Determining who is wrong, and fixing it, takes time.  If
you're dropping a few of my routes during that time, that's
unavoidable.  If every customer of mine cannot reach every customer of
yours while we fight over whose implementation is wrong and who needs
to change what, then who wins?  And how is this fight more legitimate
than the one you have with your telco provider over how they built
your circuit and where your errors are coming from?

This has everyone's attention because of the unique way in which the
breakage occurred. If all implementations were changed
to drop the single bad route and keep the sessions intact, the damage
would not have been what it was. If all implementations followed the
current specs and dropped the session with the router which first
originated the bad route, the damage would not have been what it was.
To say that one way causes massive damage and the other doesn't is
inaccurate. The damage was caused by the implementation in question
doing something resembling one but with harmful behavior thrown in.
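
The three behaviors in that paragraph can be sketched as a toy model: a
simple chain of routers in which the first one originates a single bad
route. The function name, the behavior labels, and the chain topology
are all assumptions made for illustration; real networks and real
implementations are far less tidy.

```python
def sessions_lost(routers, behavior):
    """Return the (sender, receiver) sessions torn down along a chain.

    routers[0] originates one bad route; every router applies `behavior`:
      "drop_route"        - discard the route, keep the session, do not forward
      "drop_session"      - reset the session with the sender, do not forward
      "forward_then_drop" - forward the route downstream, then reset the session
    """
    if behavior not in ("drop_route", "drop_session", "forward_then_drop"):
        raise ValueError(f"unknown behavior: {behavior}")

    lost = []
    for sender, receiver in zip(routers, routers[1:]):
        if behavior == "drop_route":
            break  # route discarded at the first receiver; nothing is reset
        lost.append((sender, receiver))
        if behavior == "drop_session":
            break  # session reset before the route can travel any further
        # "forward_then_drop": the route keeps moving down the chain
    return lost


chain = ["A", "B", "C", "D"]
for behavior in ("drop_route", "drop_session", "forward_then_drop"):
    print(behavior, sessions_lost(chain, behavior))
```

In this model the first two behaviors contain the damage to at most one
session, while forwarding before resetting takes down every session the
route crosses.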


I think the issue has gone beyond what happened, and into what will
happen.  It's a simple design philosophy question:  Do you build
protocols that are robust and resilient under stress, or do you
build protocols that refuse to interoperate until everything
completely agrees?  Ideally, I can see the beauty of the second,
but realistically, I think you need to be permissive.  


-- 
Dave Israel
Senior Manager, IP Backbone
Intermedia Business Internet
