nanog mailing list archives
Re: Most energy efficient (home) setup
From: "Luke S. Crawford" <lsc () prgmr com>
Date: Sun, 15 Apr 2012 21:54:14 -0400
On Sun, Apr 15, 2012 at 10:52:51AM -0500, Jimmy Hess wrote:
Consider that the probability that 16GB of SDRAM experiences at least one single-bit error at sea level in a given 6-hour period is roughly 66% = 1 - (1 - 1.3e-12 * 6)^(16 * 2^30 * 8). In any given 24-hour period, the probability of at least one single-bit error exceeds 98%. Assuming the memory is good and functioning correctly, it's expected to see on average approximately 3 to 4 1-bit errors per day. More are frequently seen. Now if most of this 16GB of memory is unused, you will never notice that over 30 days, 120 or so bits have been flipped from their proper value.
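For reference, the arithmetic in the quoted formula can be reproduced with a short calculation. This is only a sketch: it takes the 1.3e-12 errors per bit per hour figure from the quoted text as given (it is not verified here), and the function names are made up for illustration. Under that assumed rate it gives a probability of roughly 0.66 for 6 hours, above 0.98 for 24 hours, and about 4 expected errors per day.

```python
import math

# Assumed inputs, copied from the quoted formula (not independently verified).
RATE_PER_BIT_HOUR = 1.3e-12   # single-bit errors per bit per hour
BITS = 16 * 2**30 * 8         # 16 GiB expressed in bits


def expected_errors(hours):
    """Expected number of single-bit errors over the given number of hours."""
    return RATE_PER_BIT_HOUR * hours * BITS


def p_at_least_one(hours):
    """1 - (1 - p)^n, evaluated with log1p/expm1 to avoid rounding loss."""
    p = RATE_PER_BIT_HOUR * hours
    return -math.expm1(BITS * math.log1p(-p))


if __name__ == "__main__":
    for hours in (6, 24, 24 * 30):
        print(hours, round(p_at_least_one(hours), 4), round(expected_errors(hours), 1))
```

Whether the assumed per-bit rate matches what real hardware shows is exactly the point in dispute in the reply that follows.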
I think that is an overestimate, at least if single-bit (corrected) ECC errors are as common as flipped bits on non-ECC RAM.

Now, first, count me in the "ECC is a must, full stop." crowd. I insist on ECC even for my customers' dedicated servers, even though most of the customers don't care that much. "It's not for you, it's for me."

With ECC, if you have EDAC/bluesmoke set up correctly on a supported motherboard, you get console spew whenever you have a single-bit error. This means I can do a very simple grep on the box conserver logs and find all the failing RAM modules I am responsible for. Without ECC, I have no real way of telling the difference between broken software and broken RAM.

That said, I still think the 120 bits a month estimate is large; I believe that ECC RAM should report correctable errors (assuming a correctly configured EDAC/bluesmoke module and supported chipset) about as often as non-ECC RAM would get a bit flip.

In a past role, I did spend the time grepping through such a properly configured cluster, with tens of thousands of nodes, looking for failing hardware. I should have done a proper paper with statistics, but I did not. The vast majority of servers had zero correctable ECC errors, while a few had a lot, which is consistent with the theory that ECC errors are more often caused by bad RAM than by random bit flips. (Of course, all these servers were in proper cases in a proper data center, which probably gives you a fair bit of shielding.)

On my current fleet (well under 100 servers), single-bit errors are so rare that if I get one, I schedule that machine for removal from production.
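The "simple grep" over console logs described above could be sketched roughly as follows. This is an illustration, not the actual procedure used: the regular expression is a hypothetical stand-in (real EDAC console messages vary by kernel version and driver), and the function names and the threshold of one error are assumptions chosen to mirror the "one error and the machine gets pulled" policy.

```python
import re
from collections import Counter

# ASSUMPTION: hypothetical pattern for a correctable-error (CE) console line.
# Real EDAC message formats differ between kernels and drivers; adjust to match.
CE_PATTERN = re.compile(r"EDAC\s+MC\d+:.*\bCE\b")


def count_correctable(lines_by_host):
    """Map each host name to the number of log lines that look like CE reports."""
    counts = Counter()
    for host, lines in lines_by_host.items():
        counts[host] = sum(1 for line in lines if CE_PATTERN.search(line))
    return counts


def hosts_to_pull(counts, threshold=1):
    """Hosts with at least `threshold` correctable errors, sorted by name."""
    return sorted(host for host, n in counts.items() if n >= threshold)
```

Feeding it a mapping of host name to console log lines gives a per-host count, which also makes the skew easy to see: most hosts at zero, a few with many.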