nanog mailing list archives

Re: Katrina Network Damage Report


From: Todd Underwood <todd () renesys com>
Date: Sat, 10 Sep 2005 15:04:08 -0400


interesting discussion.  at least we're talking about networking now.
:-)

wrt sean's comment, the only thing i can think he means by 'partition'
is that the networks may have power and may be in some routing table,
just not the routing table of any of renesys's (or routeviews' or ripe's)
peers.  in that case, i guess i would agree.  our use of 'outage' is a
special case of 'partition' where the whole internet is on one side
and it's possible that the networks in question are on the other.
they may route somewhere.  just not to the internet.

quick question below...

There are some inconsistent terms used in computer
dependability research, but I prefer and use two
key definitions: failure (something is offline)
and outage (customer sees the service offline).

not sure i understand these definitions.  i'm happy to use any
well-defined terms (vocabulary never being worth fighting over).
again, when i use 'outage' i mean:  previously in global internet
tables of a consensus of a large peerset and now removed from those
tables.  which is that in your terms?
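
to make that definition concrete, here is a minimal python sketch.  it
is an illustration under stated assumptions, not renesys's actual
code:  each peer's table is modeled as a plain set of prefix strings,
the names `consensus` and `outages` are made up for the example, and
`threshold` (what fraction of peers counts as a 'consensus') is a
hypothetical parameter.  'removed from those tables' is read here as
'absent from every peer's table afterwards', which is only one
possible reading.

```python
from collections import Counter
from typing import Dict, Set

# hypothetical model: peer name -> set of prefixes that peer carries
Tables = Dict[str, Set[str]]


def consensus(tables: Tables, threshold: float = 0.9) -> Set[str]:
    """Prefixes carried by at least `threshold` of the peers (assumed cutoff)."""
    counts = Counter(p for prefixes in tables.values() for p in prefixes)
    needed = threshold * len(tables)
    return {p for p, n in counts.items() if n >= needed}


def outages(before: Tables, after: Tables, threshold: float = 0.9) -> Set[str]:
    """Prefixes in the consensus view before and in no peer's table after."""
    still_seen = set().union(*after.values())
    return consensus(before, threshold) - still_seen


if __name__ == "__main__":
    both = {"192.0.2.0/24", "198.51.100.0/24"}
    before = {"peer_a": set(both), "peer_b": set(both), "peer_c": set(both)}
    after = {name: {"198.51.100.0/24"} for name in before}
    print(outages(before, after))
```

a prefix that even one peer still carries is not reported by this
sketch; that case is a partial failure rather than an outage in the
sense above.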

Looking at the routing tables you see failures.

not necessarily, if i'm understanding your definitions (which i guess
i'm not).  

If a prefix goes away completely and utterly,
and is truly unreachable, then anyone trying to
see it is going to see an outage.  But you can
have a lot of intermediate cases where routes are
mostly down but not completely, or where parts
of the net can see it but other parts can't
due to the vagaries of route propagation
and partial failures.

yes.  we cover all of these by having a large peerset and integrating
our data across them.  the outages that we report are not from a
particular point on the net.  they are from a consensus of a large,
selected peerset.
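
as a rough sketch of what integrating across a peerset could look
like, the python below computes the fraction of peers that carry a
prefix and labels the intermediate cases.  the cutoffs `up` and
`down` and the three labels are assumptions made for the example, not
values or terms from renesys.

```python
from typing import Dict, Set

# hypothetical model: peer name -> set of prefixes that peer carries
Tables = Dict[str, Set[str]]


def visibility(prefix: str, tables: Tables) -> float:
    """Fraction of peers whose table contains `prefix` (0.0 if no peers)."""
    if not tables:
        return 0.0
    seen = sum(1 for prefixes in tables.values() if prefix in prefixes)
    return seen / len(tables)


def classify(prefix: str, tables: Tables, up: float = 0.9, down: float = 0.1) -> str:
    """Label a prefix using assumed cutoffs on its visibility."""
    v = visibility(prefix, tables)
    if v >= up:
        return "visible"
    if v <= down:
        return "outage"
    return "partial"
```

a prefix seen by half of the peers would come back as "partial"
here, which is the 'parts of the net can see it but other parts
can't' case; no single vantage point decides the label.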

And there are situations where the route is
down but the service is still up.

unless you use words differently, this is not true.  by 'service' i
mean 'IP service'.  if the route is down, no one can reach anything
associated with that route, obviously.  do you mean 'service' as local
loop service? 

There are other network monitoring groups
that do end to end connectivity tests from
geographically distributed clients out to
sample systems around the net.  Some for research
and some for hire for network monitoring.

I think what they do is much closer to
identifying true outages than your method.

yes, that may be.  those are good ways of identifying certain kinds of
outages.  the problem is that they only measure what they measure.
frequently these systems measure well-connected sites monitoring
well-connected sites.  this creates a bias in the data, tending to
suggest that no big event ever really impacts the internet.  this is
obviously a false conclusion.

for reference compare the analysis of the 2003 US blackouts from
keynote:

http://www.keynote.com/news_events/releases_2003/03august14.html  

(summary:  nothing to see here, move along)

with those from renesys:

http://www.renesys.com//resource_library/blackout_results.html

(summary:  >4K prefixes disappeared from the global table impacting
connectivity to hospitals, schools, government and lots of
businesses).  

i would agree that our method of routing table analysis has
significant limitations and needs to be combined with other data.  but
it's a fantastic way of showing a lower bound on what was affected:
prefixes without entries in the global table almost certainly have no
service.
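
the lower-bound argument can be written down as simple set
arithmetic.  in this sketch `withdrawn` stands for prefixes missing
from the global table and `probe_failures` for prefixes that some
other data source (end-to-end tests, say) found unreachable; both
names and the function `affected_bounds` are hypothetical.

```python
from typing import Iterable, Tuple


def affected_bounds(withdrawn: Iterable[str],
                    probe_failures: Iterable[str]) -> Tuple[int, int]:
    """Return (routing-only lower bound, count after adding other data)."""
    lower = set(withdrawn)
    # other data can only add prefixes; it never removes a withdrawn one
    combined = lower | set(probe_failures)
    return len(lower), len(combined)
```

the second number can never be smaller than the first, which is all
that 'lower bound' means here:  routing data alone undercounts
whatever was affected while still being routed.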

t.

-- 
_____________________________________________________________________
todd underwood
director of operations & security
renesys - interdomain intelligence
todd () renesys com   www.renesys.com

