nanog mailing list archives
Re: HE.net, Fremont-2 outage?
From: Valdis.Kletnieks () vt edu
Date: Wed, 04 Nov 2009 21:57:33 -0500
On Wed, 04 Nov 2009 12:26:15 CST, Joe Greco said:
> With power:
> N+1 is usually better than N
> Best to assume full load when doing math
> Things will go wrong, predict common failures
And uncommon ones. :)

So as part of a major compute-cluster install, we upgraded our UPS and diesel generator one weekend, and breathed a collective sigh of relief that we were now safe from power outages. We mostly dodged a bullet, though we *did* have some scary moments when we discovered that (a) of the 400 or so disks on our Sun E10K, about 10 didn't spin up again, and (b) several of the boot disks on said box weren't mirrored. Fortunately, none of the 10 failures was on a non-mirrored disk. By Tuesday, all the previously non-mirrored boot disks were in fact mirrored.

That Friday, a bozo contractor relocating a doorway managed to set off the Halon. We only lost two disks on the E10K. Guess which two? ;)

And a month later, we discovered that the nice shiny new automatic cutover switch was wired in backwards, necessitating another power outage to re-wire it correctly.

So much for being safe from power outages... :)