nanog mailing list archives

Re: OT: Traffic Light Control (was Re: First real-world SCADA attack in US)


From: Thomas Maufer <tmaufer () gmail com>
Date: Wed, 23 Nov 2011 18:41:58 -0800

<unlurks>

I have to jump in on this thread. Traffic light controllers are a fun category of technical artifacts. The weatherproof 
boxes that the relays used to live in have stayed the same size for decades, but now the controllers just take a teeny 
tiny circuit board rattling around in this comparatively huge box. And it's full of software, dontcha know? So why not 
have lots of newfangled features? Curiously, the people who make the insides of the box have a WHOLE DIFFERENT way of 
thinking about "what a traffic light controller should do?" - the "insider" people are in the 21st century, while the 
"outsider" people are in the early 20th century. Lemme splain.

A particular traffic light controller that I tested in 2007 had an FTP server inside it. I have no idea why. So I tried 
fuzzing it. Five minutes in, the test aborted because the DUT wouldn't restart anymore. Upon investigation, we
discovered that a particular FTP sequence had triggered a bug with a rather unfortunate side effect: the flash
file system of the traffic light controller had been formatted or erased. As a bonus, the device had also crashed
and was awaiting a ZMODEM file download, since it no longer had a boot image. We couldn't test anything else because we
didn't have the special serial cable to (re-)install the OS. Fail-safe? Not hardly: Not when it has no software! It's a 
lump of highly refined sand, in a plastic case.
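
If you've never seen what "fuzzing an FTP server" actually looks like, a toy Python sketch is below. This is
emphatically not the fuzzer that was used in that test - the target address and the test cases are made-up
placeholders - but it shows the shape of the exercise: send legal-but-odd command sequences, then check whether the
box still answers.

# A toy illustration of negative testing an FTP service; not the fuzzer
# that was actually used. The target address is a placeholder (TEST-NET-1)
# for a lab device. Never point this at equipment you don't own.
import socket

TARGET = ("192.0.2.10", 21)   # placeholder lab address

# A handful of legal-but-unusual command sequences. A real fuzzer
# generates thousands of mutations of lengths, encodings, and orderings.
CASES = [
    b"USER " + b"A" * 1024 + b"\r\n",                      # absurdly long username
    b"USER anonymous\r\nPASS x\r\nCWD " + b"../" * 200 + b"\r\n",
    b"USER anonymous\r\nPASS x\r\nRETR\r\n",               # command with no argument
    b"PASV\r\nPASV\r\nPASV\r\nQUIT\r\n",                   # out-of-order commands
]

def still_alive():
    """Health check: does the device still answer with its FTP banner?"""
    try:
        with socket.create_connection(TARGET, timeout=5) as s:
            return s.recv(128).startswith(b"220")
    except OSError:
        return False

for i, case in enumerate(CASES):
    try:
        with socket.create_connection(TARGET, timeout=5) as s:
            s.recv(128)        # read the 220 banner
            s.sendall(case)
            s.recv(512)        # whatever the server says back
    except OSError:
        pass                   # a dropped connection is data, too
    if not still_alive():
        print(f"device stopped responding after case {i}")
        break

The point of the health check after every case is exactly the failure mode above: the interesting result isn't the
reply to the weird command, it's whether the box ever comes back.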

There are many lessons here, not least of which is: Ship the device with the smallest possible attack surface! Why the 
heck was FTP enabled? Clearly this device had never been subjected to any negative testing. And these devices are meant 
to be networked, so that FTP bug will be tickled someday, I just don't know when. Yes, it was reported to the vendor, 
and no, I have no idea if they ever fixed it.

Also, in this thread I have seen several references to "fail-safe" or "redundancy" features. In my experience, those 
are often among the weakest aspects of a system. In one case, my testing rendered a multi-million-dollar, highly
redundant VoIP soft switch useless by constantly causing the primary to fail - and while the secondary was being
activated, there was a quiet period of 2-3 seconds during which no calls went through. Shortly after the secondary
had become the primary, it failed again, continuing the cycle. Literally one carefully crafted SIP INVITE (about 100
bytes, IIRC) per second could make this switch completely useless. The bug I found
involved SIP INVITE messages that could not be filtered…unless you didn't want to accept VoIP phone calls at all, which 
calls into question your purchase of the multi-million-dollar highly redundant soft switch. That bug was fixed.
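
For scale, the kind of traffic we're talking about is nothing more exotic than the Python loop below: one SIP INVITE
per second over UDP. This is not the harness that was actually used, all addresses are placeholders for lab gear, and
the particular "careful crafting" that tripped the bug is deliberately left out - a plain, well-formed INVITE is shown
instead.

# A skeleton of a lab harness that sends one SIP INVITE per second over UDP.
# The INVITE below is an ordinary, well-formed one; the specific malformation
# that triggered the failover bug is not reproduced. Placeholder addresses only.
import socket
import time
import uuid

SWITCH = ("192.0.2.20", 5060)                # placeholder soft switch under test
LOCAL_IP, LOCAL_PORT = "192.0.2.99", 5070    # placeholder for the test host

def invite(call_id, branch):
    """Build a minimal, well-formed SIP INVITE with no body."""
    return (
        f"INVITE sip:test@{SWITCH[0]} SIP/2.0\r\n"
        f"Via: SIP/2.0/UDP {LOCAL_IP}:{LOCAL_PORT};branch=z9hG4bK{branch}\r\n"
        "Max-Forwards: 70\r\n"
        f"From: <sip:lab@{LOCAL_IP}>;tag=lab1\r\n"
        f"To: <sip:test@{SWITCH[0]}>\r\n"
        f"Call-ID: {call_id}\r\n"
        "CSeq: 1 INVITE\r\n"
        f"Contact: <sip:lab@{LOCAL_IP}:{LOCAL_PORT}>\r\n"
        "Content-Length: 0\r\n"
        "\r\n"
    ).encode()

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("", LOCAL_PORT))      # listen locally for any responses

while True:                      # one INVITE per second, indefinitely
    sock.sendto(invite(uuid.uuid4().hex, uuid.uuid4().hex[:8]), SWITCH)
    time.sleep(1.0)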

Software is tricky stuff. The number of ways it can fail is practically infinite, but there is generally only a small 
number of ways for it to work correctly. Networked software is particularly challenging to write because the software 
engineers don't get to control their inputs. The intervening network can (does) fold, spindle, mutilate, truncate, 
drop, reorder or duplicate packets and your code on the receiving end has to try to understand what was intended by the 
sender. Oh, and the sender might be following an older version of the standard (if one even exists) or simply have 
included some bugs of their own. Because the coders are so focused on making their code do what the MRD/PRD required - 
on a tight schedule! - they have little time to imagine all the possible ways their code might fail. Their 
error-handling routines are simply never imaginative enough to handle real-world brokenness. It *is* possible to test 
this stuff, but time pressures in release schedules don't leave a lot of breathing room for developers to take on whole 
new classes of tasks that are outside their expertise (security testing). So you end up with a traffic light controller 
that erases its own flash file system when it receives a slightly strange but completely legal FTP command, or a highly 
redundant VoIP soft switch that is only good at ping-ponging from primary to secondary CPUs. Don't even get me started 
on problems I have found in carrier-class routers.
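
To make that concrete, here is a small Python sketch of the defensive posture I'm talking about: reading a single
length-prefixed message off a TCP socket while assuming the peer, or the network in between, will eventually hand you
garbage. The framing (4-byte length plus payload) and the size limit are invented for the example.

# Defensive message framing: never trust a length field, never assume a
# read completes, and treat a mid-message disconnect as an explicit error.
import struct

MAX_MSG = 64 * 1024      # refuse implausible lengths instead of trusting them

class FramingError(Exception):
    """Raised when the input stream violates the (assumed) framing rules."""

def read_message(sock):
    """Read one [4-byte big-endian length][payload] message, defensively."""
    header = _read_exact(sock, 4)
    (length,) = struct.unpack("!I", header)
    if length == 0 or length > MAX_MSG:
        # A syntactically "legal" header carrying a hostile value:
        # reject it rather than blindly allocating or looping.
        raise FramingError(f"implausible message length {length}")
    return _read_exact(sock, length)

def _read_exact(sock, n):
    """Handle short reads and mid-message disconnects explicitly."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise FramingError(f"peer closed after {len(buf)} of {n} bytes")
        buf += chunk
    return buf

Even something this dull has to choose, up front, what to do when the input is wrong - and "fail loudly and close the
connection" is a far better default than whatever an unhandled exception happens to do.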

I don't need to name names: All software has bugs (except possibly the code in the main computers on the Space 
Shuttle). Every engineer I have ever known has tried to write their code well, but automated negative testing has only 
recently caught up to the point where engineers and QA staff can focus on what they do best (writing and testing
code that implements features someone will buy) and let purpose-built tools do the negative testing for them, so their
error-handling routines can be robust, too. Fixing bugs is generally straightforward. Finding them has always been the 
challenge.

~tom

</unlurks>


On 23 Nov 2011, at 17:59, Brett Frankenberger wrote:

On Wed, Nov 23, 2011 at 05:45:08PM -0500, Jay Ashworth wrote:

Yeah.  But at least that's stuff you have a hope of managing.  "Firmware
underwent bit rot" is simply not visible -- unless there's, say, signature 
tracing through the main controller.

I can't speak to traffic light controllers directly, but at least some
vital logical controllers do check signatures of their firmware and
programming and will fail into a safe configuration if the
signatures don't validate.

    -- Brett
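
The check described above is conceptually no more than the Python sketch below: verify the firmware image before
trusting it, and drop into a safe state if the check fails. No particular vendor's scheme is implied; the paths, the
key handling, and the notion of a "safe state" are all invented for illustration.

# A hedged sketch of a boot-time firmware signature check that fails closed.
import hashlib
import hmac
import sys

FIRMWARE_PATH = "/flash/firmware.bin"   # placeholder paths
SIGNATURE_PATH = "/flash/firmware.sig"
KEY = b"device-unique-key"              # in real hardware: not a constant in code

def firmware_ok():
    """Return True only if the stored signature matches the image."""
    try:
        with open(FIRMWARE_PATH, "rb") as img, open(SIGNATURE_PATH, "rb") as sig:
            expected = sig.read().strip()
            actual = hmac.new(KEY, img.read(), hashlib.sha256).hexdigest().encode()
        return hmac.compare_digest(expected, actual)
    except OSError:
        return False                    # missing or unreadable image fails closed

def enter_safe_state():
    # For a traffic controller, "safe" means something like all-red flash,
    # driven by logic that does not depend on the image we just refused to run.
    print("firmware signature check failed: entering fail-safe mode")
    sys.exit(1)

if not firmware_ok():
    enter_safe_state()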



