Interesting People mailing list archives

more on LA ATC Failure The Risks Digest Volume 23: Issue 54


From: David Farber <dave () farber net>
Date: Tue, 28 Sep 2004 11:16:04 -0400



Begin forwarded message:

  Re: LA ATC Failure (RISKS-23.53)


    <Paul Cox <pcox () eskimo com>>

   Thu, 23 Sep 2004 13:03:40 -0700

I'm an air traffic controller in Seattle Center, which is a facility just
like the one in LA that had the crash.

To do their job, air traffic controllers need one thing above/beyond all: They need the ability to communicate with the aircraft they're controlling.

We can control planes even without radar, because we can get position
reports from the airplanes and provide safe separation via altitude,
spacing, and so forth.  But without comm, we're completely and utterly
hosed.

(Some of the FAA spokesflacks had the audacity to suggest that the system was still safe, because the radar system continued working just fine. Sure,
the controllers could still *see* the airplanes; they just couldn't do
anything about it as they watched them get closer, and closer, and
closer... they'd have had a wonderful view of the targets merging as the
passengers were converted instantly into a thin pink mist had the planes
collided.  But hey, the system was safe.)

The VSCS (Voice Switching Communications System) puts all of our
communications into one spot- ground-to-ground calls to other facilities, calls within our own facility to other controllers, and air-to-ground comm.

It's a purely digital system; all the incoming feeds are converted to bits and bytes and switched through a series of servers and such until they're
turned back into analog and put into the controller's ear through his
headset.

Of course, this means that power to the system is absolutely critical, and
we've had power failures in the past (see past RISKS for that info).

The VSCS system was designed and built by Harris Corporation, but their
contract ran out some time ago. The FAA, coming to the end of the contract, decided to go a much less expensive route- and replace all the servers with
Dell boxes and their own programming.

In theory, there's nothing wrong with this; do the required maintenance, and there's no problem. But the system does have the design flaws referred to
in the RISKS articles.

Basically, the system needs to be reset about once a month- or more
specifically, once every 30 days or so. I heard a rumor that part of the problem in LA was that they'd done the reset at the beginning of August, but had put it off for September... and were planning to do it at the end of the
month.

There's a RISK right there; "once a month" probably means "once every 30 or so days", not "once in a calendar month" which could leave an interval as
long as nearly 60 days in between resets.
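
To make that arithmetic concrete, here is a minimal sketch in Python. The
dates are hypothetical (a reset on the first day of August and the next on
the last day of September), and the function names are illustrative only,
not anything from the actual VSCS maintenance procedures.

```python
from datetime import date, timedelta

# "Once every 30 days or so", per the text above.
MAX_INTERVAL_DAYS = 30


def days_between(earlier: date, later: date) -> int:
    """Days elapsed from one reset to the next."""
    return (later - earlier).days


def next_reset_due(last_reset: date,
                   max_interval_days: int = MAX_INTERVAL_DAYS) -> date:
    """Latest date for the next reset under an interval-based rule."""
    return last_reset + timedelta(days=max_interval_days)


# Hypothetical "once per calendar month" schedule: both resets are
# technically inside their own month, yet far more than 30 days apart.
august_reset = date(2004, 8, 1)
september_reset = date(2004, 9, 30)

gap = days_between(august_reset, september_reset)
print(gap)                           # 60
print(next_reset_due(august_reset))  # 2004-08-31
```

Under the interval reading, a reset on August 1 makes the next one due by
August 31; under the calendar-month reading, the same schedule can stretch
to roughly twice the intended interval.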

(On a side note, the voice recordings are only kept for the past 15 days, and it's done by an entirely separate system. The main reason for the reset
has to do with file and memory buffers overloading.)

Now, there's a backup system for VSCS. It's called VTABS, and is basically a reduced-capability server that normally runs the VSCS system on the ATC
simulator that's used to train students.

The VTABS system, with much less server power, cannot run the entire control
room and all of the frequencies that the control center has, so it's a
hassle to go to VTABS.

When the reset on VSCS is done, you have to run on VTABS for a while, which
usually means it's done on graveyard shifts to reduce the impact on live
traffic. The downside to this is that the VTABS system also doesn't get a
full workout.

So the next RISK pops up: The backup system isn't really fully checked out,
and if/when ATC needs it... it might not work.

Sure enough, that happened.  When VSCS died, LA Center switched to
VTABS... which also didn't work right.  Big trouble, now.

Finally, the FAA (in its infinite wisdom) a while back decided to remove a
last-ditch backup system called EARS.

EARS was basically a hard-wired, all-analog system that only provided the
most crucial thing- air-to-ground communications.

EARS required power to run, but the reason it had a big advantage over VSCS
or VTABS is that if the power died for, say, 20 seconds, as soon as the
power was back on EARS would work with no spool-up startup time. VSCS takes up to 45 minutes to completely start up, and VTABS has a significant delay
in startup time as well.
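
As a rough illustration of why the startup time matters, the sketch below
adds the power interruption to each system's restart delay. It assumes the
worst-case 45 minutes for VSCS and zero delay for EARS, as described above;
VTABS is left out because its delay isn't given here. The function and
constant names are made up for the example.

```python
# Startup delays in seconds, taken from the description above.
VSCS_STARTUP_S = 45 * 60  # "up to 45 minutes" (worst case)
EARS_STARTUP_S = 0        # works as soon as power is back


def comm_outage_seconds(power_loss_s: int, startup_s: int) -> int:
    """Total time without comm: the power gap plus the restart delay."""
    return power_loss_s + startup_s


power_loss_s = 20  # the 20-second power drop used as the example above

print(comm_outage_seconds(power_loss_s, EARS_STARTUP_S))  # 20
print(comm_outage_seconds(power_loss_s, VSCS_STARTUP_S))  # 2720
```

A 20-second power blip costs EARS 20 seconds of comm, but can cost VSCS
over 45 minutes.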

Seattle Center (where I work) is the only facility of its type that still has EARS (our variant is called VEARS). We have it because a fairly wise
manager asked our technicians to keep the system when it was slated for
removal. The tech side agreed, and has kept VEARS going by moving a little
money around in its budget (since the FAA cut VEARS nationally, it doesn't
give the facilities any money to maintain the system).

Fortunately (and perhaps a bit unbelievably) VEARS costs very very little to maintain, because it's just a set of switches that sit there unused the huge
majority of the time.  We test them for functionality about once a week.

The LA failure was both ridiculous and scary. It's ridiculous on several levels; the fact that the system is designed to shut itself down is silly in a way, because from the user's perspective the system basically crashes to
protect itself from crashing.

Well, when suddenly you can't talk to the airplanes, you don't much give a damn whether it's an intentional shutdown or an accidental/buggy shutdown.
Therefore, they might as well remove this intentional design.

It's ridiculous that the technicians weren't doing the reset. This issue is NOT NEW, and has been known for some time... and had any of the 10 airplanes (with 200 passengers each) managed to smack into another plane, you can bet
that the FAA would have been paying the families for a long, long, long
time.

It's ridiculous that the first backup system didn't work right simply
because people were too lazy/unmotivated to test it properly. VTABS is an acceptable backup; it's not perfect, but for the money it cost (essentially nothing for hardware, some reprogramming costs for the servers) it's nearly
ideal.

It's ridiculous that the FAA threw away a perfectly good SECOND backup that cost even less. The technology in EARS has been around since, oh, about as long as there's been radio; it's tried and true, and it's pathetic
that there's only one facility in the nation (out of 21) that still has
EARS.

And it's scary to think that this could've happened in an even busier
facility than LA.  The morning crush of traffic in New York or Boston or
Indy or Cleveland Centers, for example, where there's even more traffic
packed into even less airspace than out west in LA.

The RISKS here are many and silly, because nearly all of them could have
been easily avoided with some diligence and forethought.

RISK 1) programming the system to shut down to try and prevent a shutdown.
If you don't expect it either way, it doesn't matter.

RISK 2) being lazy, or not really understanding that "once a month" actually
means "once every 30 days", and so failing to ensure that a critical job is
done, on time, and correctly.

RISK 3) having a backup system that isn't checked to see if it can actually
do the job.  You rely upon it, it better work, and if/when it doesn't,
you're screwed.

RISK 4) throwing out a perfectly good second backup system because you think it's "old fashioned" and that the primary/secondary system you have now is so much better. Hey, the new stuff is all digital, it's gotta be better,
right?

Finally, on a personal note, the manager at Seattle Center who managed to talk the technical guys into keeping our VEARS system should be considered a
hero and an example for the rest of the FAA.  He's already a hero to me-
he's my father.  :)

Paul Cox, Seattle Center
