nanog mailing list archives

Re: Most energy efficient (home) setup

From: "Luke S. Crawford" <lsc () prgmr com>
Date: Sun, 15 Apr 2012 21:54:14 -0400

On Sun, Apr 15, 2012 at 10:52:51AM -0500, Jimmy Hess wrote:

> Consider that the probability 16GB of SDRAM experiences at least one
> single bit error at sea level, in a given 6 hour period exceeds 66%
> = 1 - (1 - 1.3e-12 * 6)^(16 * 2^30 * 8). In any given 24 hour period,
> the probability of at least one single bit error exceeds 98%. Assuming
> the memory is good and functioning correctly;
>
> It's expected to see on average approximately 3 to 4 1-bit errors
> per day. More are frequently seen.
>
> Now if most of this 16GB of memory is unused, you will never notice
> that over 30 days, 120 or so bits have been flipped from their
> proper value.


I think that is an overestimate, at least if single-bit (corrected)
ECC errors are as common as flipped bits on non-ECC RAM.
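
For reference, the arithmetic behind the quoted figures can be reproduced
in a few lines. This is only a sketch: it assumes, as the quoted formula
appears to, that bit flips are independent and that 1.3e-12 is an error
rate per bit per hour. The function names are made up for illustration.

```python
import math

BITS = 16 * 2**30 * 8   # 16GB of RAM, expressed in bits
RATE = 1.3e-12          # quoted rate, assumed to be errors per bit per hour


def p_at_least_one(hours, bits=BITS, rate=RATE):
    """Probability of one or more bit flips in `hours`, assuming independent bits."""
    # Same as 1 - (1 - rate*hours)**bits, but via log1p/expm1 so the tiny
    # per-bit probability is not lost to floating-point rounding.
    return -math.expm1(bits * math.log1p(-rate * hours))


def expected_flips(hours, bits=BITS, rate=RATE):
    """Mean number of flipped bits in `hours` under the same assumptions."""
    return bits * rate * hours


print(f"P(>=1 flip in  6h) = {p_at_least_one(6):.3f}")
print(f"P(>=1 flip in 24h) = {p_at_least_one(24):.3f}")
print(f"flips per day      = {expected_flips(24):.2f}")
print(f"flips per 30 days  = {expected_flips(24 * 30):.0f}")
```

Under those assumptions this gives roughly 66% for six hours, a bit over
98% for a day, a little more than 4 flips per day, and somewhere near 130
per month, so the quoted numbers do follow from the quoted rate. The
question is whether that rate matches what ECC hardware actually reports.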

Now, first, count me in the "ECC is a must, full stop" crowd. I insist
on ECC even for my customers' dedicated servers, even though most of the
customers don't care that much. "It's not for you, it's for me."
With ECC, if you have EDAC/bluesmoke set up correctly on a supported
motherboard, you get console spew whenever you have a single-bit error.

This means I can run a very simple grep over the conserver logs
and find all the failing RAM modules I am responsible for.
Without ECC, I have no real way of telling the difference between broken
software and broken RAM.
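
A rough sketch of that kind of log scan, for anyone who wants more than a
bare grep. The regular expression is an assumption about what a
corrected-error (CE) line from the kernel's EDAC driver looks like; the
exact wording varies by kernel version and chipset driver, so adjust it to
match your own console output. The one-log-file-per-host layout and the
function names are also hypothetical.

```python
import re
import sys
from collections import Counter
from pathlib import Path

# Assumed shape of an EDAC corrected-error line, e.g. "EDAC MC0: ... CE ...".
# Check this against real console output before trusting the counts.
CE_PATTERN = re.compile(r"EDAC MC\d+:.*\bCE\b")


def count_corrected_errors(logs_by_host):
    """Map host name -> number of log lines that look like EDAC CE reports.

    `logs_by_host` maps a host name to an iterable of console log lines.
    """
    counts = Counter()
    for host, lines in logs_by_host.items():
        counts[host] = sum(1 for line in lines if CE_PATTERN.search(line))
    return counts


def suspects(counts, threshold=1):
    """Hosts with at least `threshold` corrected errors, worst first."""
    return [(host, n) for host, n in counts.most_common() if n >= threshold]


if __name__ == "__main__":
    # Usage: pass one console log file per host; the file name (minus its
    # extension) is taken as the host name.
    logs = {
        Path(p).stem: Path(p).read_text(errors="replace").splitlines()
        for p in sys.argv[1:]
    }
    for host, n in suspects(count_corrected_errors(logs)):
        print(f"{host}\t{n}")
```

Counting per host rather than just listing matches makes it easy to
separate the handful of machines with many errors from the rest.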

That said, I still think the 120-bits-a-month estimate is too large. I
believe that ECC RAM should report correctable errors (assuming a
correctly configured EDAC/bluesmoke module and a supported chipset)
about as often as non-ECC RAM would get a bit flip.

In a past role, I did spend the time grepping through such a properly
configured cluster, with tens of thousands of nodes, looking for failing
hardware. I should have done a proper paper with statistics, but
I did not. The vast majority of servers had zero correctable ECC errors,
while a few had a lot, which is consistent with the theory that ECC errors
are more often caused by bad RAM than by random bit flips.
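
To put a number on why that distribution matters: if random flips at the
quoted rate dominated, corrected errors per node would be roughly Poisson,
and a clean node would be the exception. A small sketch, assuming
independent flips at 1.3e-12 per bit per hour and a hypothetical 16GB per
node (the actual memory size of those nodes is not stated here):

```python
import math

RATE = 1.3e-12  # quoted rate, assumed to be errors per bit per hour


def p_zero_errors(gib, days, rate=RATE):
    """Poisson probability that a node with `gib` GiB sees no flips in `days`."""
    mean = gib * 2**30 * 8 * rate * 24 * days  # expected flips over the period
    return math.exp(-mean)


for days in (1, 7, 30):
    p = p_zero_errors(16, days)
    print(f"{days:2d} days: P(no errors on a 16GB node) = {p:.3g}")
```

Under that model only a percent or two of nodes would get through a single
day without a corrected error, and essentially none would last a month.
A fleet where the vast majority of machines log nothing at all does not
fit that picture.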

(Of course, all these servers were in proper cases in a proper data center,
which probably gives you a fair bit of shielding.)

On my current fleet (well under 100 servers), single-bit errors are so rare
that if I get one, I schedule that machine for removal from production.
