nanog mailing list archives

RE: Data Center testing


From: Deepak Jain <deepak () ai net>
Date: Wed, 26 Aug 2009 14:22:49 -0400


The idea of regular testing is to essentially detect failures on your time schedule rather than entropy's (or 
Murphy's). There can be flaws in your testing methodology too. This is why generic load bank tests and network load 
simulators rarely tell the whole story.

Customers are rightfully displeased with any testing that affects their normal peace of mind, and doubly so when it 
affects actual operational effectiveness. However, since no system can operate indefinitely without maintenance, 
failover and other items, the question of taking a window is not negotiable. The only thing that is negotiable 
(somewhat) is when, and only in one direction (ahead of the item failing on its own).

So, take this concept to networks. Whether a link or a device will fail is not negotiable; the only question is how 
long you are going to forward bits along the dead path before rerouting, and how long that rerouting will take. SONET 
says about 50 ms, standard BGP about 30-300 seconds. BFD and other things may improve these dramatically in your setup. 
You build your network around your business case and vice versa.
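
To put rough numbers on "how long you forward bits along the dead path", here is a minimal Python sketch. It is back-of-the-envelope arithmetic only. The 1 Gb/s link rate and the BFD timers (300 ms interval, multiplier of 3) are assumptions chosen for illustration, not figures from any particular network; the SONET and BGP times are the approximate ones quoted above. The function names are hypothetical.

```python
# Rough arithmetic only: how much traffic is offered to a dead path
# while the failure is still undetected. All inputs are illustrative.

def bfd_detection_time_s(interval_ms: float, multiplier: int) -> float:
    """BFD declares a session down after `multiplier` missed intervals."""
    return interval_ms * multiplier / 1000.0


def bits_blackholed(link_rate_bps: float, detection_time_s: float) -> float:
    """Bits sent toward the dead path before rerouting can begin."""
    return link_rate_bps * detection_time_s


LINK_RATE_BPS = 1e9  # assumption: a fully loaded 1 Gb/s link

scenarios = {
    "SONET (~50 ms)": 0.050,
    "BFD, assumed 300 ms x 3": bfd_detection_time_s(300, 3),
    "BGP, low end (30 s)": 30.0,
    "BGP, high end (300 s)": 300.0,
}

for name, seconds in scenarios.items():
    megabytes = bits_blackholed(LINK_RATE_BPS, seconds) / 8 / 1e6
    print(f"{name:26s} {seconds:8.3f} s  ~{megabytes:10.1f} MB at risk")
```

The point of the comparison is the spread of several orders of magnitude between the fastest and slowest cases, which is what the business case has to weigh against the cost of tighter timers.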

Clearly, most of the known universe has decided that BGP time is "good enough" for the Internet as a whole right now. 
Most are aware of the costs in terms of overall jitter, CPU and stability if we reduce those times too far. 

It's intellectually dishonest to talk about never losing a packet or never forwarding along a dead path for even a 
nanosecond when the state of the art says something very different indeed.

Deepak Jain
AiNET

-----Original Message-----
From: Dylan Ebner [mailto:dylan.ebner () crlmed com]
Sent: Wednesday, August 26, 2009 11:33 AM
To: Dan Snyder; Ken Gilmour
Cc: NANOG list
Subject: RE: Data Center testing

I would hope that the data center engineers built and ran a suite of
tests to find failure points before the network infrastructure was put
into production. That said, changes are made constantly to the
infrastructure and it can become very difficult very quickly to know if
the failovers are still going to work. This is one place where the
power and network in a datacenter diverge. The power systems may take
on additional load over the course of the life of the facility, but the
transfer switches and generators do not get many changes made to them.
Also, network infrastructure tests are not going to be zero impact if
there is a config problem. Generator tests are much easier. You can
start up the generator and do a load test. You can also load test the
UPS systems as well. Then you can initiate your failover. Network tests
are not going to be zero impact even if there isn't a problem. Let's
say you wanted to power fail an edge router participating in BGP; it can
take 30 seconds for that router's routes to get withdrawn from the BGP
tables of the world. The other problem is that network failures always
seem to come from "unexpected" issues. I always love it when I get an
outage report from my ISPs or datacenter and they say an "unexpected
issue" or "unforeseen issue" caused the problem.


Dylan
-----Original Message-----
From: Dan Snyder [mailto:sliplever () gmail com]
Sent: Monday, August 24, 2009 8:39 AM
To: Ken Gilmour
Cc: NANOG list
Subject: Re: Data Center testing

We have done power tests before and had no problem.  I guess I am
looking for someone who does testing of the network equipment outside
of just power tests.  We had an outage due to a configuration mistake
that became apparent when a switch failed.  It didn't cause a problem,
however, when we did a power test for the whole data center.
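
One way to see what a network failover test actually costs is to run a steady probe (for example, a ping once a second) through the device under test and look at the probe log afterwards. The sketch below is a hypothetical helper, not part of any tool mentioned in this thread. It assumes the log has already been parsed into (timestamp in seconds, success) pairs sorted by time.

```python
from typing import Iterable, Tuple


def longest_outage_s(probes: Iterable[Tuple[float, bool]]) -> float:
    """Return an upper bound on the longest outage seen in a probe log.

    An outage is measured from the last successful probe before a run of
    failures to the first successful probe after it. Failure runs that
    start before any success, or that never recover, are ignored.
    """
    longest = 0.0
    last_ok = None      # timestamp of the most recent successful probe
    in_outage = False   # True while we are inside a run of failures
    for timestamp, ok in probes:
        if ok:
            if in_outage and last_ok is not None:
                longest = max(longest, timestamp - last_ok)
            last_ok = timestamp
            in_outage = False
        else:
            in_outage = True
    return longest


# Made-up log: one probe per second, two probes lost during the test.
log = [(0.0, True), (1.0, True), (2.0, False), (3.0, False), (4.0, True)]
print(longest_outage_s(log))
```

Comparing this number before and after a round of configuration changes gives a simple regression check on failover behavior, with a resolution no better than the probe interval.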

-Dan


On Mon, Aug 24, 2009 at 9:31 AM, Ken Gilmour <ken.gilmour () gmail com>
wrote:

I know Peer1 in Vancouver regularly send out notifications of
"non-impacting" generator load testing, like monthly. Also InterXion
in Dublin, Ireland have occasionally sent me notification that there
was a power outage of less than a minute; however, their backup
successfully took the load.

I only remember one complete outage in Peer1 a few years ago... Never
seen any outage in InterXion Dublin.

Also I don't ever remember any power failure at AiNet (Deepak will
probably elaborate)

2009/8/24 Dan Snyder <sliplever () gmail com>:
Does anyone know of any data centers that do failure testing of
their networking equipment regularly? I mean to verify that
everything fails over properly after changes have been made over
time.  Are there any best-practice guides for doing this?

Thanks,
Dan





