nanog mailing list archives

Re: Quick question.


From: "John Underhill" <stepnwlf () magma ca>
Date: Sun, 1 Aug 2004 15:07:30 -0400


If a CPU dies, it's unlikely to come back up without removing the bad
CPU, especially if the CPU has become unreliable rather than dying
completely. Even if CPU 0 is good and the BIOS has no problems
booting the OS, the SMP aware OS will quite probably hit problems
with the bad CPU.

Not necessarily. There have been a number of innovations in recent years in
the area of integrated fault tolerance, including BIOS-level controls over
component monitoring and management. Some of the more upscale Compaq G3
servers, for instance, can remove a processor from operation if it exceeds a
threshold of critical errors (this is also true for memory).
Alphas can boot even if the bootstrap processor fails at system start; they
simply select the next available processor. They also have hot-swap
processor capabilities (again, for the time being, only on upscale models).
Add to this features like hot-swap 'RAID memory' and PCI, redundant power,
fans, and drives, and systems can be made to withstand many common component
failures with little or no interruption in service.
With the advent of technologies like hyperthreading, manufacturers are being
driven by market demands to create more reliable SMP drivers, and I think it
is likely that simultaneous multi-threading will eventually become the
standard.


a duallie will keep the system up when a faulty process hogs 100%
CPU, because the second one is still available. That also increases
availability ratio.

Well, it depends. The real differentiation is whether the system is truly
'symmetric', that is, has dual processor, I/O and memory buses. If both
processors share the same resources, competition between processors for
regions of memory and for locks on the PCI bus severely constrains the
resources available to each processor. So if a process runs amok on a
single-bus architecture, the second processor will not have the resources it
needs to run effectively.

application is not going to take down the machine on any modern OS[2]
and anyway can be dealt with with resource limits, SMP or not,
presuming your OS supports resource limits.
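
To make the resource-limit point concrete, here is a minimal sketch,
assuming a POSIX system and Python's standard `resource` module. The helper
name `run_with_cpu_limit` is hypothetical and not from the thread. It caps
the CPU time of a child process so that a runaway loop is stopped by the
kernel, whether the machine has one CPU or several.

```python
import resource
import subprocess
import sys


def run_with_cpu_limit(argv, cpu_seconds):
    """Run argv as a child process with a cap on its CPU time.

    Hypothetical helper for illustration only. Returns the child's exit
    status; on POSIX a negative value means the child was killed by a signal.
    """
    def apply_limit():
        # Executed in the child after fork() and before exec().
        # When the soft limit is reached the kernel sends SIGXCPU; the hard
        # limit one second later acts as a backstop (SIGKILL on Linux).
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds + 1))

    return subprocess.run(argv, preexec_fn=apply_limit).returncode


if __name__ == "__main__":
    # A deliberately runaway child that spins until the kernel stops it.
    status = run_with_cpu_limit([sys.executable, "-c", "while True: pass"], 1)
    print("child exit status:", status)
```

The same limit can usually be set from a shell with `ulimit -t` before
starting the process; the exact signal delivered may differ between systems.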

The real problem with SMP is kernel complexity. Drivers that are rock
solid in single-processor can have bugs that are only triggered under
SMP. Threaded applications can also become unreliable on SMP systems.
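
As a rough illustration of the class of bug described here, the sketch below
shows an unsynchronized read-modify-write on shared state next to a
lock-protected version. It is an assumption-laden toy, not driver code: the
function names are hypothetical, and Python's interpreter lock may hide the
race on some versions, so the unlocked variant may or may not lose updates on
a given run. The point is only that correctness must come from explicit
locking, not from the accident of running on one processor.

```python
import threading


def count_without_lock(n_threads, n_increments):
    """Increment a shared counter with no synchronization (may lose updates)."""
    counter = 0

    def worker():
        nonlocal counter
        for _ in range(n_increments):
            # Read-modify-write with no lock: two threads can read the same
            # old value and one increment is then lost.
            counter += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


def count_with_lock(n_threads, n_increments):
    """Increment a shared counter under a lock (always exact)."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(n_increments):
            with lock:  # Only one thread at a time performs the update.
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


if __name__ == "__main__":
    print("without lock:", count_without_lock(4, 100_000))
    print("with lock:   ", count_with_lock(4, 100_000))
```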

The extra power of an SMP system might be a bonus, but trying to
argue their benefits on the basis of reliability is misguided.

Michel.

1. Now, they may still be very reliable, and more than reliable
enough for your needs, but they are still not as reliable as the
exact same machine with terminators in all CPU sockets/slots bar one
;) The fault-tolerant systems are outrageously expensive.

2. Unless you're running MacOS 9 or Windows 3.11 on your server -
don't think either supports SMP though ;).

regards,
-- 
Paul Jakma paul () clubi ie paul () jakma org Key ID: 64A2FF6A
Fortune:
A Linux machine! because a 486 is a terrible thing to waste!
(By jjs () wintermute ucr edu, Joe Sloan)

