Re: Extreme congestion (was Re: inter-domain link recovery)


From: Stephen Wilcox <steve.wilcox () packetrade com>
Date: Wed, 15 Aug 2007 17:12:57 +0100


Hey Sean,

On Wed, Aug 15, 2007 at 11:35:43AM -0400, Sean Donelan wrote:
> On Wed, 15 Aug 2007, Stephen Wilcox wrote:
>> (Check slide 4) - the simple fact was that with something like 7 of 9
>> cables down the redundancy is useless .. even if operators maintained
>> N+1 redundancy, which is unlikely for many operators, that would imply
>> 50% of capacity was actually used with 50% spare.. however we see
>> around 78% of capacity was lost. There was simply too much traffic and
>> not enough capacity.. IP backbones fail pretty badly when faced with
>> extreme congestion.
> 
> Remember the end-to-end principle.  IP backbones don't fail with extreme
> congestion, IP applications fail with extreme congestion.

Hmm, I'm not sure about that... a 100% full link dropping packets causes many problems:
L7: Applications stop working, humans get angry
L4: TCP/UDP drops cause retransmits, connection drops, retries etc
L3: BGP sessions drop, OSPF hellos are lost.. routing fails
L2: STP packets dropped.. switching fails

I believe any or all of the above could occur on a backbone which has just failed massively and now has only 20% of
capacity available, such as occurred in SE Asia.
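
To put a very rough number on the L3 point: OSPF hellos ride directly on IP with no retransmission, so an adjacency
dies as soon as a full dead interval's worth of hellos is dropped. A back-of-the-envelope sketch (my own illustration,
assuming default timers, independent drops, and treating the lost-capacity figure as a stand-in for packet loss):

# Rough model of OSPF adjacency loss on a congested link.
HELLO_INTERVAL = 10   # seconds, OSPF default
DEAD_INTERVAL = 40    # seconds, OSPF default

def p_adjacency_drop(p_loss):
    """Probability that every hello in one dead interval is lost,
    i.e. the neighbour declares the adjacency down."""
    hellos_per_dead_interval = DEAD_INTERVAL // HELLO_INTERVAL   # 4
    return p_loss ** hellos_per_dead_interval

for p in (0.2, 0.5, 0.78):
    print("loss %.0f%%: adjacency lost in a given 40s window with p=%.1f%%"
          % (p * 100, p_adjacency_drop(p) * 100))

At 78% loss that is roughly a one-in-three chance per 40-second window, i.e. adjacencies flapping within a couple of
minutes. BGP holds on a little longer only because its keepalives sit on TCP and get retransmitted.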

> Should IP applications respond to extreme congestion conditions better?
> alert('Connection dropped')
> "Ping timed out"

Kinda icky, but it's not the application's job to manage the network.

> Or should IP backbones have methods to predictably control which IP 
> applications receive the remaining IP bandwidth?  Similar to the telephone
> network special information tone -- All Circuits are Busy.  Maybe we've
> found a new use for ICMP Source Quench.

Yes and no.. for a private network perhaps, but for the Internet backbone, where all traffic is important (right?),
differentiation is difficult unless applied at the edge, and when you have a major failure and congestion I don't see
what you can do that will have any reasonable effect. Perhaps you are a government contractor and you reserve some
capacity for them and drop everything else, but what is really out there as a solution?
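
To be concrete about what "differentiation at the edge" would even mean here - and this is purely my own illustration,
not anything deployed - the edge marks traffic into classes, and when offered load exceeds whatever capacity survived,
the lower classes are shed first:

# Toy strict-priority shedder. Class names, priorities and numbers are
# all made up; units are Gbit/s.

def allocate(offered, capacity):
    """Return how much of each class gets through, highest priority first."""
    served = {}
    remaining = capacity
    for cls in ("reserved", "voice", "best-effort"):
        load = offered.get(cls, 0.0)
        served[cls] = min(load, remaining)
        remaining -= served[cls]
    return served

# Example: 9G of offered traffic onto the 2G that survived the cable cuts.
print(allocate({"reserved": 0.5, "voice": 1.5, "best-effort": 7.0}, 2.0))
# reserved and voice fit, best-effort gets nothing

Which is exactly the problem: on a public backbone nobody agrees on the class order, so the shedding ends up arbitrary.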

FYI, I have seen telephone networks fail badly under extreme congestion. COs have small CPUs that don't do a whole lot -
set up calls, send busy signals .. once a call is in place it doesn't occupy CPU time, as the path is locked in place
elsewhere. However, if something occurs to cause a serious number of busy circuits then CPU usage goes through the roof
and you can cause cascade failures of whole COs.

Telcos look to solutions such as call gapping to intervene when they anticipate major congestion, rather than relying
on the network to handle it.
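
For anyone unfamiliar with it, call gapping is roughly: once a destination code is flagged as congested, admit at most
one setup attempt towards it per gap interval and reject the rest immediately, so the switch CPU never sees the
overload. A minimal sketch of that admission rule (parameters invented for illustration):

# Toy call-gap control: one admitted setup per destination code per
# GAP_SECONDS; everything else gets an immediate reject (fast busy).

GAP_SECONDS = 5.0
last_admitted = {}

def admit_call(destination_code, now):
    """Return True if this setup attempt may proceed, False to reject it."""
    last = last_admitted.get(destination_code)
    if last is None or now - last >= GAP_SECONDS:
        last_admitted[destination_code] = now
        return True
    return False

# A burst of attempts towards one congested code:
for t in (0.0, 0.1, 2.0, 5.1, 5.2):
    print(t, admit_call("0800-disaster", t))
# only the attempts at t=0.0 and t=5.1 get through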

> Even if the IP protocols recover "as designed," does human impatience mean 
> there is a maximum recovery timeout period before humans start making the 
> problem worse?

I'm not sure they were designed to do this.. the ARPANET wasn't intended to be massively congested.. the redundant
links were in place to cope with the loss of a node, and usage was manageable.

Steve
