nanog mailing list archives
Re: HE.net, Fremont-2 outage?
From: Seth Mattinen <sethm () rollernet us>
Date: Wed, 04 Nov 2009 12:28:41 -0800
Joe Greco wrote:
> Yup. Related: "100% availability" is a marketing person's dream; it sounds good in theory but is unattainable in practice, and is a reliable sign of non-100%-reliability. The most common way to gain "100% availability" is to avoid testing under load. This surely protects the equipment against a whole slew of failures in the less-used portions of your power systems, but also protects you from detecting them outside your Hour(s) Of Greatest Need.
Not testing under load is silly, IMHO. Does it work? Maybe. If it does something strange during testing, it's attended, expected, and utility is available to fall back on. Starting your generator only means it'll turn over and idle, not that it'll provide power under load all the way to the racks. Some people may prefer a colo that never risks it and therefore never does more than idle the genset to claim 100% uptime. Others may prefer one that won't promise 100% everything but does load tests. I'd rather have a test go wrong while utility is available than have a backup fail with no utility, hoping the power comes back before the UPS dies or the room cooks itself. Both extremes are available to choose from if you do your research before picking a colo.
> And even for those who follow best practices... You can inspect and maintain things until you're blue in the face. One day a contractor will drop a wrench into a PDU or UPS or whatever and spectacular things will happen. Or a battery develops a strange fault. You do live load testing, you'll lose now and then. It's best to simply assume no single circuit is 100% reliable. You should be able to get two circuits from separate power systems and the combination of the two should really closely approximate 100%, but even there... it isn't.
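The two-circuit arithmetic above can be sketched roughly as follows. This is only an illustration: the availability figures are made up, the function names are hypothetical, and the first function assumes the two feeds fail independently, which is exactly the assumption that a shared building or a fire-department shutoff breaks.

```python
HOURS_PER_YEAR = 8760  # non-leap year


def combined_availability(a: float, b: float) -> float:
    """Availability of two feeds where either one alone can carry the load.

    Assumes the feeds fail independently: the pair is down only when
    both are down at the same time.
    """
    return 1.0 - (1.0 - a) * (1.0 - b)


def with_common_mode(a: float, b: float, shared: float) -> float:
    """Same pair, but behind a shared dependency that must also be up.

    `shared` is the availability of whatever both feeds have in common
    (the building, an emergency power-off, and so on).
    """
    return shared * combined_availability(a, b)


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of outage per year at the given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Illustrative numbers only, not measurements from any real facility.
    single = 0.999
    pair = combined_availability(single, single)
    pair_shared = with_common_mode(single, single, shared=0.9999)

    for label, value in [
        ("single circuit", single),
        ("two independent circuits", pair),
        ("two circuits, shared dependency", pair_shared),
    ]:
        print(f"{label}: {value:.6f} "
              f"({downtime_hours_per_year(value):.3f} h/yr down)")
```

Under these assumed numbers the pair gets very close to 100% but never reaches it, and once a shared dependency is included the combined figure is capped by that dependency's own availability.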
Separate power systems are overrated, especially if the fire department ends up being involved for some reason. (Re: the infamous gas leak story.) And of course with increased complexity comes increased risk of failure and longer downtime to diagnose and repair. There is no perfect balance.

~Seth