nanog mailing list archives
Re: availability and resiliency
From: Valdis.Kletnieks () vt edu
Date: Fri, 29 Sep 2000 22:07:47 -0400
On Fri, 29 Sep 2000 18:42:12 EDT, Andrew Brown said:
um...is an smp cpu configuration really going to help your uptime? or are there operating systems or hardware out there that can say to themselves "hmph! cpu 2 seems not to be working correctly...i'd better spin it down."
IBM mainframes have been doing this for decades. I believe that both OS/VS1 and VM/370 for the S/370-158 supported this back in the 1973 timeframe.

About 10 years ago, our 3090-300 blew a TCM and lost one of the 3 CPUs. As I was sitting there diagnosing the problem at the console, I got a popup dialog box from the onboard support processor. Basically, it wanted to phone IBM Hardware Support and tell them to send a guy with a new TCM, but it had detected that the phone number was more than 7 digits and therefore probably a long distance call, was this OK? Yes, it asked permission to rack up the phone bill before it called for repairs itself.

Current mainframe state of the art is described in the IBM Journal of Research and Development - Vol 43, Number 5/6 (Sep/Nov 99), which was devoted to the G5 and G6 chipsets used in current IBM S/390 big iron. The article "RAS strategy for IBM S/390 G5 and G6" (page 875) talks about the system's ability to not only detect a failing CPU, but on detection to latch out the last known good state from the previous instruction and retry the failing machine instruction on a hot spare. That's after a reset-and-retry on the failing processor has proven it's a hard failure and not a soft one.

The mind boggles.... ;)

--
Valdis Kletnieks
Operating Systems Analyst
Virginia Tech
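For readers who want the recovery sequence from the post spelled out, here is a rough sketch of the control flow only: try the instruction, reset and retry once on the same processor, and if it faults again, treat that as a hard failure and replay from the last known good state on a hot spare. Everything in it (`Cpu`, `CpuFault`, `run_with_sparing`, the `faults_remaining` counter, the dict used as machine state) is hypothetical and invented for illustration; the real G5/G6 mechanism is implemented in hardware and is not modeled here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical model: machine state is a dict of register values, and an
# "instruction" is a function that takes a state and returns the new state.
State = Dict[str, int]
Instruction = Callable[[State], State]


class CpuFault(Exception):
    """Raised when a simulated CPU fails while executing an instruction."""


@dataclass
class Cpu:
    name: str
    # How many upcoming executions will fault. 1 models a soft (transient)
    # error; 2 or more models a hard failure for the purposes of this sketch.
    faults_remaining: int = 0

    def execute(self, instr: Instruction, state: State) -> State:
        if self.faults_remaining > 0:
            self.faults_remaining -= 1
            raise CpuFault(self.name)
        # Work on a copy so the caller's checkpoint is never modified.
        return instr(dict(state))


def run_with_sparing(instr: Instruction, checkpoint: State,
                     active: Cpu, spare: Cpu, log: List[str]) -> State:
    """Run instr from checkpoint: first try, one retry, then the spare."""
    try:
        return active.execute(instr, checkpoint)
    except CpuFault:
        log.append(f"{active.name}: fault detected, reset and retry")
    try:
        return active.execute(instr, checkpoint)
    except CpuFault:
        log.append(f"{active.name}: hard failure, replaying on {spare.name}")
    # The spare starts from the last known good state, not from partial work.
    return spare.execute(instr, checkpoint)


def increment_r1(state: State) -> State:
    state["r1"] += 1
    return state


if __name__ == "__main__":
    events: List[str] = []
    final = run_with_sparing(increment_r1, {"r1": 41},
                             Cpu("cpu2", faults_remaining=2),
                             Cpu("spare"), events)
    print(final)
    print(events)
```

The one design point worth noting is that every attempt starts from the same untouched checkpoint, which is what makes the retry on a different processor safe.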