nanog mailing list archives

Re: Resilience: faults, causes, statistics, open issues


From: David Andersen <dga () lcs mit edu>
Date: Fri, 28 Jan 2005 13:43:51 -0500



On Jan 28, 2005, at 5:30 AM, András Császár (IJ/ETH) wrote:

Just some comments about the root causes of BGP-related problems. Maybe you will find something useful from a research perspective, although this is probably not going to be new to you.

I found a few author groups with closely related and useful papers:

- Tim Griffin and co.
- Nick Feamster and co.
- Jennifer Rexford and co.
- Lixin Gao and co.

  Yup.  That particular group you mentioned has a lot of interplay.

These people often publish jointly, but sometimes separately as well. Also, Craig Labovitz and co. have some very useful papers on routing convergence time.

Yes.  There's also Morley Mao's convergence work.


As I see things now, in the case of BGP, routing divergence, configuration, and policies are very strongly correlated.

A high-level conclusion (what you can probably expect from half a year of reading papers and presentations) is that the first root cause of BGP problems is the absence of a >>widely deployed and practical<< formal language for policies. Since there is no formal language, there is no compiler, and so you get unwanted anomalies resulting from your config.

In a sense. I think that this is one of the root causes, but it's perhaps not the only one. I think we can group the causes into two areas:

  a)  Fundamental BGP problems
(e.g., the convergence/flap damping issues). By "fundamental" I don't mean uncorrectable - I simply mean that they're "features" of the protocol as it exists today. Some may be fundamental trade-offs in global routing; I don't know.

  b)  The above-mentioned policy issue

Some of the issues in (a) can be corrected through (b) - for example, the Gao/Rexford examination of what policies can be permitted if you want to ensure stable routing. Given that BGP is a strongly policy-driven beast, many, many of its problems do arise from this.
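
As a rough illustration of the kind of policy restriction meant here, the sketch below encodes the commonly cited form of the Gao/Rexford guidelines: prefer customer-learned routes over peer and provider routes, and export peer- or provider-learned routes only to customers. The names (`Rel`, `LOCAL_PREF`, `may_export`, `best_route`) and the numeric preference values are assumptions made up for this example, not anything from a real router configuration, and the full stability result also assumes the customer-provider hierarchy has no cycles, which this sketch does not check.

```python
from enum import Enum


class Rel(Enum):
    """Business relationship with the neighbor a route was learned from or is sent to."""
    CUSTOMER = "customer"
    PEER = "peer"
    PROVIDER = "provider"


# Illustrative preference values only (higher wins): customer > peer > provider.
LOCAL_PREF = {Rel.CUSTOMER: 300, Rel.PEER: 200, Rel.PROVIDER: 100}


def may_export(learned_from: Rel, export_to: Rel) -> bool:
    """Customer routes may go to anyone; peer/provider routes only to customers."""
    return learned_from is Rel.CUSTOMER or export_to is Rel.CUSTOMER


def best_route(routes):
    """Pick from (relationship, as_path_length) tuples.

    Relationship preference is compared first; a shorter AS path only
    breaks ties between routes with the same relationship.
    """
    return max(routes, key=lambda r: (LOCAL_PREF[r[0]], -r[1]))


if __name__ == "__main__":
    candidates = [(Rel.PROVIDER, 1), (Rel.PEER, 2), (Rel.CUSTOMER, 4)]
    chosen = best_route(candidates)
    print("chosen:", chosen)
    for neighbor in Rel:
        print("export to", neighbor.value, "->", may_export(chosen[0], neighbor))
```

The point of the example is only that the restriction is local: each AS can check its own preference and export rules without seeing anyone else's policy.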

So, in the end, although we can possibly identify the root causes behind BGP problems, I'm not sure they can ever be fully eliminated. OK, I can imagine a formal language and config compiler, and one can find verification tools as well, but I can hardly imagine, e.g., providers sharing their policies (although some papers describe methods for inferring the necessary knowledge from measurements).

Agreed. I think we'll make progress, though, and I think that groups of collaborating providers can probably implement some of the solutions among themselves in ways that make sense.

p.s. Sorry for the long mail :) :)

No worries - quite interesting.  (to me, at least!)

  -Dave

