nanog mailing list archives

Re: Level3 routing issues?


From: "Jack Bates" <jbates () brightok net>
Date: Tue, 28 Jan 2003 09:47:17 -0600


From: <cowie () renesys com>

<snip>
On the other hand, we also know (from private communications and from
other mailing lists.. ahem) that high rate and high src/dst diversity
of scans cause some network devices to fail (devices that cache flows, or
devices that suffer from CPU overload under such conditions).

Some BGP-speaking routers (not all, by any means, but some subpopulation)
found themselves pegged at 100% CPU on Saturday.  Just one example:

   http://noc.ilan.net.il/stats/ILAN-CPU/new-gp-cpu.html

Was it not known that under certain conditions the router would flatline?
What precautionary measures were put into place for such an event to limit
the damage?

Whether you believe "anthropogenic" explanations for the instability
depends on how fast you believe NEs can look, think, and type, compared
to the speed with which the BGP announcement and withdrawal rates are
observed to take off.  For my part, I'd bet that the long slow exponential
decay (with superimposed spiky noise) is people at work.  But the initial
blast is not.

When the crisis is on you, it's too late. You are either prepared and know
exactly what to do at that critical moment or you don't. You either had a <5
minute response time to the crisis or you didn't. We also know (from private
communications and from other mailing lists.. yes, I'm a thief :) that many
NEs were caught with their pants down, a mistake they aren't apt to make
again. It comes down to one's outlook. Do you just configure and maintain or
do you strive to push the envelope? Do you truly know your network?
Remember, it's a living, breathing thing. The complexity of variables makes
complete predictability impossible, and so we must learn to understand it
and how it reacts.

Then again, perhaps I'm a lunatic. :)

Jack Bates
BrightNet Oklahoma

