Interesting People mailing list archives

Tracking the blackout bug


From: Dave Farber <dave () farber net>
Date: Fri, 16 Apr 2004 21:27:30 -0400


Delivered-To: dfarber+ () ux13 sp cs cmu edu
Date: Fri, 16 Apr 2004 20:22:30 -0500
From: "Ayyasamy, Senthilkumar  (UMKC-Student)" <saq66 () umkc edu>
Subject: Tracking the blackout bug
To: dave () farber net


Hi,

I did not see this interesting report announced on IP:
https://reports.energy.gov/  It cites software bugs as a reason
for the recent blackout. I thought hardware bugs were costly; it
seems software errors are even more costly. At least a lot of
verification tools are available for hardware testing, but
software lacks widely standardized verification tools, dominant
use of safe languages, and safe execution platforms such as
virtual machines.

   -- senthil


A related article from SecurityFocus:
http://www.securityfocus.com/news/8412

A number of factors and failings came together to make the August 14th
northeastern blackout the worst outage in North American history. One of
them was buried in a massive piece of software compiled from four
million lines of C code and running on an energy management computer in
Ohio.

To nobody's surprise, the final report on the blackout released by a
U.S.-Canadian task force Monday puts most of the blame for the outage
on Ohio-based FirstEnergy Corp., faulting poor communications,
inadequate training, and the company's failure to trim back trees
encroaching on high-voltage power lines. But over a dozen of the task
force's 46 recommendations for preventing future outages across North
America are focused squarely on cyberspace.

That may have something to do with the timing of the blackout, which
came three days after the relentless Blaster worm began wreaking havoc
around the Internet -- a coincidence that prompted speculation at the
time that the worm, or the traffic it was generating in its efforts to
spread, might have triggered or exacerbated the event. When U.S. and
Canadian authorities assembled their investigative teams, they included
a computer security contingent tasked with looking specifically at any
cybersecurity angle on the outage.


In the end, it turned out that a computer snafu actually played a
significant role in the cascading blackout -- though it had nothing to
do with viruses or cyber terrorists. A silent failure of the alarm
function in FirstEnergy's computerized Energy Management System (EMS)
is listed in the final report as one of the direct causes of a blackout
that eventually cut off electricity to 50 million people in eight
states and Canada.

...

The XA/21 isn't based on Windows, so it couldn't have been infected by
Blaster, but the company didn't immediately rule out the possibility
that the worm somehow played a role in the alarm failure. "In the
initial stages, nobody really knew what the root cause was," says Mike
Unum, manager of commercial solutions at GE Energy. "We spent a
considerable amount of time analyzing that, trying to understand if it
was a software problem, or if -- like some had speculated -- something
different had happened."

Sometimes working late into the night and the early hours of the
morning, the team pored over the approximately one million lines of
code that comprise the XA/21's Alarm and Event Processing Routine,
written in the C and C++ programming languages. Eventually they were
able to reproduce the Ohio alarm crash in GE Energy's Florida
laboratory, says Unum. "It took us a considerable amount of time to go
in and reconstruct the events." In the end, they had to slow down the
system, injecting deliberate delays in the code while feeding alarm
inputs to the program. About eight weeks after the blackout, the bug
was unmasked as a particularly subtle incarnation of a common
programming error called a "race condition," triggered on August 14th
by a perfect storm of events and alarm conditions on the equipment
being monitored. The bug had a window of opportunity measured in
milliseconds.
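
The reproduction technique described above, injecting delays to widen
a tiny race window, can be illustrated with a minimal Python sketch.
This is not GE Energy's code (the XA/21 is written in C and C++); the
names record_alarm and run_writers, the shared counter, and the
0.2-second delay are all assumptions made for illustration.

```python
import threading
import time

def record_alarm(state, delay):
    """Unsynchronized read-modify-write on a shared counter (hypothetical)."""
    seen = state["count"]        # read the shared value
    if delay:
        time.sleep(delay)        # injected delay: widens the race window
    state["count"] = seen + 1    # write back, possibly overwriting a peer

def run_writers(delay, writers=2):
    """Start `writers` threads that each record one alarm; return the count."""
    state = {"count": 0}
    threads = [threading.Thread(target=record_alarm, args=(state, delay))
               for _ in range(writers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["count"]

if __name__ == "__main__":
    print("no injected delay:", run_writers(0))
    print("0.2 s injected delay:", run_writers(0.2))
```

With the delay in place, both writers normally read the old value
before either writes, so one update is lost and the count comes back
short. Without it, the window is so small that the loss rarely shows
up, which is what makes bugs of this kind hard to catch in testing.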


"There was a couple of processes that were in contention for a common
data structure, and through a software coding error in one of the
application processes, they were both able to get write access to a
data structure at the same time," says Unum. "And that corruption led
to the alarm event application getting into an infinite loop and
spinning."
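
The failure Unum describes comes down to two writers holding write
access to one data structure at once. A minimal Python sketch of the
usual remedy, mutual exclusion, follows. It is an illustration under
stated assumptions, not the fix GE Energy made: AlarmBuffer,
write_concurrently, and the injected delay are hypothetical.

```python
import threading
import time

class AlarmBuffer:
    """Hypothetical shared alarm log whose writers are serialized by a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self.events = []
        self.count = 0

    def append(self, event, delay=0.0):
        # Only one thread at a time may run the read-modify-write below.
        with self._lock:
            seen = self.count
            time.sleep(delay)          # injected delay; harmless under the lock
            self.events.append(event)
            self.count = seen + 1      # count stays consistent with events

def write_concurrently(buf, events, delay):
    """Append each event from its own thread and wait for all to finish."""
    threads = [threading.Thread(target=buf.append, args=(e, delay))
               for e in events]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

if __name__ == "__main__":
    buf = AlarmBuffer()
    write_concurrently(buf, ["line trip", "overload", "low voltage"], 0.01)
    print(buf.count, len(buf.events))
```

Because every writer must hold the lock for the whole update, the
stored count and the list of events cannot drift apart, even with a
delay injected in the middle of the update.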