nanog mailing list archives

Re: What to expect after a cooling failure


From: Larry Sheldon <LarrySheldon () cox net>
Date: Tue, 09 Jul 2013 23:17:04 -0500

On 7/9/2013 10:28 PM, Erik Levinson wrote:
As some may know, yesterday 151 Front St suffered a cooling failure
after Enwave's facilities were flooded.

One of the suites that we're in recovered quickly but the other took
much longer, and some of our gear shut down automatically due to
overheating. We remotely shut down many redundant and non-essential
systems in the hotter suite, and remotely transferred some others to
the cooler suite, to ensure that we had a minimum of all core systems
running in the hotter suite. We waited until the temperatures
returned to normal, and brought everything back online. The entire
event lasted from approx 18:45 until 01:15. Apparently ambient
temperature was above 43 degrees Celsius at one point on the cool
side of cabinets in the hotter suite.

For those who have gone through such events in the past, what can one
expect in terms of long-term impact...should we expect some premature
component failures? Does anyone have any stats to share?

No stats, but way back in the day of very large computers (one each) in very large facilities, the thing we worried most about at restart was too-rapid cooling and the resulting condensation, if the conditions were right.
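
As a rough illustration of "if the conditions were right": condensation forms when a surface is at or below the dew point of the surrounding air. The sketch below estimates the dew point with the Magnus approximation. The function names, the constants chosen, and the humidity figure in the usage note are assumptions for illustration only; none of them come from the original post or from measurements at the site.

```python
import math

# Magnus approximation constants for water vapour over liquid water.
# This is one commonly cited parameter set (an assumption here);
# other published sets differ slightly.
MAGNUS_B = 17.62
MAGNUS_C = 243.12  # degrees Celsius


def dew_point_c(air_temp_c: float, relative_humidity_pct: float) -> float:
    """Estimate the dew point (deg C) from air temperature and relative humidity."""
    if not 0 < relative_humidity_pct <= 100:
        raise ValueError("relative humidity must be in (0, 100]")
    gamma = (math.log(relative_humidity_pct / 100.0)
             + MAGNUS_B * air_temp_c / (MAGNUS_C + air_temp_c))
    return MAGNUS_C * gamma / (MAGNUS_B - gamma)


def condensation_risk(surface_temp_c: float,
                      air_temp_c: float,
                      relative_humidity_pct: float) -> bool:
    """Return True if a surface at surface_temp_c is at or below the dew point."""
    return surface_temp_c <= dew_point_c(air_temp_c, relative_humidity_pct)
```

For example, with a hypothetical 43 degree Celsius room at 30% relative humidity, `dew_point_c(43, 30)` gives a dew point in the low twenties Celsius, so anything chilled below that while the air is still warm and moist could collect condensation.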

After power-up, the next worry was disk crashes that had occurred on the way down (this was a long time ago; discs and drums are different now).

Last were overheat failures, which were relatively few and always in components with a reputation for weakness.

--
Requiescas in pace o email           Two identifying characteristics
                                        of System Administrators:
Ex turpi causa non oritur actio      Infallibility, and the ability to
                                        learn from their mistakes.
                                          (Adapted from Stephen Pinker)

