Interesting People mailing list archives
Tracking the blackout bug
From: Dave Farber <dave () farber net>
Date: Fri, 16 Apr 2004 21:27:30 -0400
Delivered-To: dfarber+ () ux13 sp cs cmu edu
Date: Fri, 16 Apr 2004 20:22:30 -0500
From: "Ayyasamy, Senthilkumar (UMKC-Student)" <saq66 () umkc edu>
Subject: Tracking the blackout bug
To: dave () farber net
Hi,
I did not see this interesting report announced in IP: https://reports.energy.gov/
It names software bugs as the reason for the recent blackout. I thought hardware bugs were costly; it seems software errors are more costly. At least a lot of verification tools are available for hardware testing, but software lacks widely standardized verification tools, dominant use of safe languages, and safe execution platforms like virtual machines.
-- senthil
A related article from SecurityFocus: http://www.securityfocus.com/news/8412
A number of factors and failings came together to make the August 14th northeastern blackout the worst outage in North American history. One of them was buried in a massive piece of software compiled from four million lines of C code and running on an energy management computer in Ohio.
To nobody's surprise, the final report on the blackout released by a U.S.-Canadian task force Monday puts most of the blame for the outage on Ohio-based FirstEnergy Corp., faulting poor communications, inadequate training, and the company's failure to trim back trees encroaching on high-voltage power lines. But over a dozen of the task force's 46 recommendations for preventing future outages across North America are focused squarely on cyberspace.
That may have something to do with the timing of the blackout, which came three days after the relentless Blaster worm began wreaking havoc around the Internet -- a coincidence that prompted speculation at the time that the worm, or the traffic it was generating in its efforts to spread, might have triggered or exacerbated the event. When U.S. and Canadian authorities assembled their investigative teams, they included a computer security contingent tasked with looking specifically at any cybersecurity angle on the outage.
In the end, it turned out that a computer snafu actually played a significant role in the cascading blackout -- though it had nothing to do with viruses or cyber terrorists. A silent failure of the alarm function in FirstEnergy's computerized Energy Management System (EMS) is listed in the final report as one of the direct causes of a blackout that eventually cut off electricity to 50 million people in eight states and Canada.
...
The XA/21 isn't based on Windows, so it couldn't have been infected by Blaster, but the company didn't immediately rule out the possibility that the worm somehow played a role in the alarm failure. "In the initial stages, nobody really knew what the root cause was," says Mike Unum, manager of commercial solutions at GE Energy. "We spent a considerable amount of time analyzing that, trying to understand if it was a software problem, or if -- like some had speculated -- something different had happened."
Sometimes working late into the night and the early hours of the morning, the team pored over the approximately one million lines of code that comprise the XA/21's Alarm and Event Processing Routine, written in the C and C++ programming languages. Eventually they were able to reproduce the Ohio alarm crash in GE Energy's Florida laboratory, says Unum. "It took us a considerable amount of time to go in and reconstruct the events." In the end, they had to slow down the system, injecting deliberate delays in the code while feeding alarm inputs to the program.
About eight weeks after the blackout, the bug was unmasked as a particularly subtle incarnation of a common programming error called a "race condition," triggered on August 14th by a perfect storm of events and alarm conditions on the equipment being monitored. The bug had a window of opportunity measured in milliseconds.
"There was a couple of processes that were in contention for a common data structure, and through a software coding error in one of the application processes, they were both able to get write access to a data structure at the same time," says Unum. "And that corruption led to the alarm event application getting into an infinite loop and spinning."