Interesting People mailing list archives

more on LA ATC Failure The Risks Digest Volume 23: Issue 54

From: David Farber <dave () farber net>
Date: Tue, 28 Sep 2004 11:16:04 -0400



Begin forwarded message:

  Re: LA ATC Failure (RISKS-23.53)
  Paul Cox <pcox () eskimo com>
  Thu, 23 Sep 2004 13:03:40 -0700

I'm an air traffic controller in Seattle Center, which is a facility
just like the one in LA that had the crash.

To do their job, air traffic controllers need one thing above/beyond
all: They need the ability to communicate with the aircraft they're
controlling.

We can control planes even without radar, because we can get position
reports from the airplanes and provide safe separation via altitude,
spacing, and so forth.  But without comm, we're completely and utterly
hosed.

(Some of the FAA spokesflacks had the audacity to suggest that the
system was still safe, because the radar system continued working just
fine. Sure, the controllers could still *see* the airplanes; they just
couldn't do anything about it as they watched them get closer, and
closer, and closer... they'd have had a wonderful view of the targets
merging as the passengers were converted instantly into a thin pink
mist had the planes collided.  But hey, the system was safe.)

The VSCS (Voice Switching Communications System) puts all of our
communications into one spot: ground-to-ground calls to other
facilities, calls within our own facility to other controllers, and
air-to-ground comm.

It's a purely digital system; all the incoming feeds are converted to
bits and bytes and switched through a series of servers and such until
they're turned back into analog and put into the controller's ear
through his headset.

Of course, this means that power to the system is absolutely critical,
and we've had power failures in the past (see past RISKS for that info).

The VSCS system was designed and built by Harris Corporation, but their
contract ran out some time ago. The FAA, coming to the end of the
contract, decided to go a much less expensive route and replace all the
servers with Dell boxes and their own programming.

In theory, there's nothing wrong with this; do the required
maintenance, and there's no problem. But the system does have the
design flaws referred to in the RISKS articles.

Basically, the system needs to be reset about once a month, or more
specifically, once every 30 days or so. I heard a rumor that part of
the problem in LA was that they'd done the reset at the beginning of
August, but had put it off for September... and were planning to do it
at the end of the month.

There's a RISK right there; "once a month" probably means "once every
30 or so days", not "once in a calendar month", which could leave an
interval as long as nearly 60 days in between resets.
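
To make the arithmetic concrete, here is a minimal sketch comparing the
two readings of "once a month". The dates are hypothetical, chosen only
to match the rumored pattern (reset at the start of August, next reset
at the end of September); they are not taken from any FAA log.

```python
from datetime import date

def worst_case_gap_days(reset_dates):
    """Return the largest number of days between consecutive resets."""
    ordered = sorted(reset_dates)
    return max((b - a).days for a, b in zip(ordered, ordered[1:]))

# Hypothetical schedule: one reset per calendar month, early in August
# and late in September.
calendar_month = [date(2004, 8, 1), date(2004, 9, 30)]

# Same period, but with a reset every 30 days.
every_30_days = [date(2004, 8, 1), date(2004, 8, 31), date(2004, 9, 30)]

print(worst_case_gap_days(calendar_month))  # 60
print(worst_case_gap_days(every_30_days))   # 30
```

Both schedules satisfy "once a month" on paper, but only the second
keeps the interval between resets at 30 days.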

(On a side note, the voice recordings are only kept for the past 15
days, and it's done by an entirely separate system. The main reason for
the reset has to do with file and memory buffers overloading.)

Now, there's a backup system for VSCS. It's called VTABS, and is
basically a reduced-capability server that normally runs the VSCS
system on the ATC simulator that's used to train students.

The VTABS system, with much less server power, cannot run the entire
control room and all of the frequencies that the control center has, so
it's a hassle to go to VTABS.

When the reset on VSCS is done, you have to run on VTABS for a while,
which usually means it's done on graveyard shifts to reduce the impact
on live traffic. The downside to this is that the VTABS system also
doesn't get a full workout.

So the next RISK pops up: The backup system isn't really fully checked
out, and if/when ATC needs it... it might not work.

Sure enough, that happened.  When VSCS died, LA Center switched to
VTABS... which also didn't work right.  Big trouble, now.

Finally, the FAA (in its infinite wisdom) a while back decided to
remove a last-ditch backup system called EARS.

EARS was basically a hard-wired, all-analog system that only provided
the most crucial thing: air-to-ground communications.

EARS required power to run, but the reason it had a big advantage over
VSCS or VTABS is that if the power died for, say, 20 seconds, as soon
as the power was back on EARS would work with no spool-up startup time.
VSCS takes up to 45 minutes to completely start up, and VTABS has a
significant delay in startup time as well.

Seattle Center (where I work) is the only facility of its type that
still has EARS (our variant is called VEARS). We have it because a
fairly wise manager asked our technicians to keep the system when it
was slated for removal. The tech side agreed, and has kept VEARS going
by moving a little money around in their budget (since the FAA cut
VEARS nationally, it doesn't provide the facilities any money to
maintain the system).

Fortunately (and perhaps a bit unbelievably) VEARS costs very, very
little to maintain, because it's just a set of switches that sit there
unused the huge majority of the time.  We test them for functionality
about once a week.

The LA failure was both ridiculous and scary. It's ridiculous on
several levels; the fact that the system is designed to shut itself
down is silly in a way, because from the user's perspective the system
basically crashes to protect itself from crashing.

Well, when suddenly you can't talk to the airplanes, you don't much
give a damn whether it's an intentional shutdown or an accidental/buggy
shutdown. Therefore, they might as well remove this intentional design.

It's ridiculous that the technicians weren't doing the reset. This
issue is NOT NEW, and has been known for some time... and had any of
the 10 airplanes (with 200 passengers each) managed to smack into
another plane, you can bet that the FAA would have been paying the
families for a long, long, long time.

It's ridiculous that the first backup system didn't work right simply
because people were too lazy/unmotivated to test it properly. VTABS is
an acceptable backup; it's not perfect, but for the money it cost
(essentially nothing for hardware, some reprogramming costs for the
servers) it's nearly ideal.

It's ridiculous that a perfectly good SECOND backup that cost even less
was thrown away by the FAA. The technology in EARS has been around
since, oh, about as long as there's been radio; it's tried and true,
and it's pathetic that there's only one facility in the nation (out of
21) that still has EARS.

And it's scary to think that this could've happened in an even busier
facility than LA.  The morning crush of traffic in New York or Boston or
Indy or Cleveland Centers, for example, where there's even more traffic
packed into even less airspace than out west in LA.

The RISKS here are many and silly, because nearly all of them could have
been easily avoided with some diligence and forethought.

RISK 1) programming the system to shut down to try and prevent a
shutdown. If you don't expect it either way, it doesn't matter.

RISK 2) being lazy or not really understanding that "once a month"
actually means "once every 30 days" and ensuring that a critical job is
done, on time, and correctly.

RISK 3) having a backup system that isn't checked to see if it can
actually do the job.  You rely upon it, it better work, and if/when it
doesn't, you're screwed.

RISK 4) throwing out a perfectly good second backup system because you
think it's "old fashioned" and that the primary/secondary system you
have now is so much better. Hey, the new stuff is all digital, it's
gotta be better, right?

Finally, on a personal note, the manager at Seattle Center who managed
to talk the technical guys into keeping our VEARS system should be
considered a hero and an example for the rest of the FAA.  He's already
a hero to me: he's my father.  :)

Paul Cox, Seattle Center
