Interesting People mailing list archives

NASA: DOS Glitch Nearly Killed Mars Rover


From: David Farber <dave () farber net>
Date: Sat, 28 Aug 2004 09:28:50 -0400



Begin forwarded message:


http://www.extremetech.com/article2/0,1558,1638764,00.asp

August 23, 2004
NASA: DOS Glitch Nearly Killed Mars Rover
By Mark Hachman

STANFORD, CALIF. -- A software glitch that paralyzed the Mars "Spirit"
rover earlier this year was caused by an unanticipated characteristic of
a DOS file system, a NASA scientist said Monday.

The flaw, since fixed, was only discovered after days of agonizingly slow
tests complicated by the limited "windows" of communication allowed by
the rotation of Mars, said Robert Denise, a member of the Flight Software
Development Team at NASA's Jet Propulsion Laboratory.

On Jan. 21, the Spirit rover stopped communicating with the teams on
Earth, beginning a cycle where the rover would reboot itself, over and
over. After days of tests, the team finally discovered on Jan. 26 that
the issue was tied to what was originally reported as corruption inside
the rover's onboard flash memory.

In a presentation at the Hot Chips conference here, Denise said that the
real issue was an embedded DOS file system whose directory structure kept
growing and growing. When the rover's embedded operating system then told
the flash memory to mirror the data structure in RAM, the unexpectedly
large file caused a fatal error and an almost continuous reboot cycle, he
said.

Aside from the flash memory error, the recent voyages of Spirit and
Opportunity have gone far better than expected. The mission was
originally funded to last 90 sols, the equivalent of 90 Mars days, and
come to an end last April. (One sol equals 24.65 hours.) Since both
rovers have managed to stay "alive" far longer than anticipated, Denise
said, the current funding will run out on Sept. 13, the beginning of the
"solar conjunction," when Mars disappears behind the Sun and out of radio
range. The lifespan of both rovers is really not known, he said.

On Sol 18, the mood among the JPL ground team was nothing short of
"euphoric," Denise said. "Life was good," he said. "And then we missed a
comms pass," a window in which the JPL team and the rover were supposed
to exchange information.

The team didn't worry, at least initially. The team rechecked that its
instruments were calibrated, and awaited the next pass a few hours later.
Over the next few days, however, nothing went right, Denise said. The
team determined the rover was functional; it could emit a status "beep",
proving it was online. Other passes, however, generated just pseudorandom
noise, indicating that the rover was online and functioning but that no
data was passing through the antenna. The rover, meanwhile, was rebooting
hundreds of times a day.

The problem, Denise said, was in the file system the rover used. In DOS,
a directory structure is actually stored as a file. As that directory
tree grows, the directory file grows, as well. The Achilles' heel, Denise
said, was that deleting files from the directory tree does not reduce the
size of the directory file. Instead, deleted files are represented within
the directory by special characters, which tell the OS that the files can
be replaced with new data.
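
The behavior Denise describes can be illustrated with a small model. The
following is a minimal sketch, not the rover's flight software: the
Directory class, its method names, and the demo() helper are hypothetical,
and the 32-byte entry size and the 0xE5 deleted-entry marker are taken
from the standard DOS FAT directory layout rather than from the talk.

```python
ENTRY_SIZE = 32     # bytes per directory entry in the standard FAT layout
DELETED = object()  # stands in for the 0xE5 byte FAT writes over a deleted name


class Directory:
    """Toy model of a DOS-style directory file (hypothetical, for illustration)."""

    def __init__(self):
        self.slots = []  # each slot holds a file name or the DELETED marker

    def create(self, name):
        # Reuse the first deleted slot if there is one; otherwise grow the file.
        for i, slot in enumerate(self.slots):
            if slot is DELETED:
                self.slots[i] = name
                return
        self.slots.append(name)

    def delete(self, name):
        # Deleting only marks the slot; the directory file never shrinks.
        self.slots[self.slots.index(name)] = DELETED

    def size_bytes(self):
        return len(self.slots) * ENTRY_SIZE

    def live_files(self):
        return sum(1 for slot in self.slots if slot is not DELETED)


def demo():
    """Create 1000 files, delete them all, and report the sizes."""
    d = Directory()
    for i in range(1000):
        d.create(f"F{i:04d}.DAT")
    peak = d.size_bytes()
    for i in range(1000):
        d.delete(f"F{i:04d}.DAT")
    return peak, d.size_bytes(), d.live_files()
```

In this model the directory stays at its peak size after every file has
been deleted, so its size reflects the largest number of files that ever
existed at once, not the number that exist now.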

By itself, the cancerous file might not have been an issue. Combined with
a "feature" of a third-party piece of software used by the onboard Wind
River embedded OS, however, the glitch proved nearly fatal.

According to Denise, the Spirit rover contains 256 Mbytes of flash
memory, a nonvolatile memory that can be written and rewritten thousands
of times. The rover also contains 128 Mbytes of DRAM, 96 Mbytes of which
are used for data, such as buffering image files in preparation for
transmitting them to Earth. The other 32 Mbytes are used for code
storage. An additional 11 Mbytes of EEPROM memory are used for additional
program code storage.

The undisclosed software vendor required that data stored in flash memory
be mirrored in RAM. Since the rover's flash memory was twice the size of
the system RAM, a crash was almost inevitable, Denise said.

Moving an actuator, for example, generates a large number of tiny data
files. After the rover rebooted, the OS's heap memory would be a hair's
breadth away from a crash, as the system RAM would be nearly full, Denise
said. Adding another data file would generate a memory allocation command
to a nonexistent memory address, prompting a fatal error.
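
The reboot loop follows from the fact that the oversized directory lives
in nonvolatile flash and so survives every reset. The following is a
rough sketch under stated assumptions, not the actual Wind River or JPL
code: the Heap class, boot_cycle(), run(), and the 32-byte mirror cost
per directory slot are all hypothetical, and a failed allocation is
modeled as a Python exception rather than a bad memory address.

```python
ENTRY_SIZE = 32  # assumed bytes of RAM needed to mirror one directory slot


class OutOfMemory(Exception):
    """Raised when the simulated heap cannot satisfy a request."""


class Heap:
    """Fixed-size heap that is empty after every reboot (hypothetical model)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0

    def alloc(self, nbytes):
        if self.used + nbytes > self.capacity:
            raise OutOfMemory(f"need {nbytes}, have {self.capacity - self.used}")
        self.used += nbytes


def boot_cycle(flash_slots, heap_capacity, new_files):
    """One boot: mirror the flash directory into RAM, then log new files.

    Returns (status, flash_slots). The slot count persists across boots
    because it models state held in flash.
    """
    heap = Heap(heap_capacity)
    try:
        heap.alloc(flash_slots * ENTRY_SIZE)  # mirror the whole directory
        for _ in range(new_files):
            heap.alloc(ENTRY_SIZE)            # each new file needs one more slot
            flash_slots += 1
    except OutOfMemory:
        return "reset", flash_slots           # fatal error, rover reboots
    return "ok", flash_slots


def run(flash_slots, heap_capacity, boots, new_files=1):
    """Simulate several consecutive boots and collect each outcome."""
    statuses = []
    for _ in range(boots):
        status, flash_slots = boot_cycle(flash_slots, heap_capacity, new_files)
        statuses.append(status)
    return statuses
```

With a directory that exactly fills the heap, for example
run(100, 100 * ENTRY_SIZE, 3), every boot in this model ends in a reset,
because rebooting clears RAM but not the directory in flash. Starting
from an erased directory, run(0, 100 * ENTRY_SIZE, 3), every boot
succeeds.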

Dynamic allocation of memory is considered a no-no in embedded systems,
precisely because of the possibility of a system crash, attendees said.
Denise acknowledged that JPL's tests only allowed for the addition of a
small number of data files, and that the exception slipped by. "We made
an exception and got bit by it," he admitted.

The team finally got the rover up and running by essentially using the
system RAM as simulated flash, discovered the error, and disabled the
dynamic allocation feature, Denise said. The flash memory was erased, and
the JPL engineers installed a utility that monitors the file system and
treats the memory heap as a consumable resource.

Denise's keynote address to the Hot Chips audience lasted about an hour,
with twenty minutes or so dedicated to the flash-memory issue. At the
end, he summed up the issue for the small percentage of the audience who
weren't engineers: "The Spirit was willing, but the flash was weak."





