Politech mailing list archives

FC: What's so bad about Total Information Awareness? by Ben Brunk


From: Declan McCullagh <declan () well com>
Date: Mon, 09 Dec 2002 23:57:16 -0500


---

Date: Mon, 09 Dec 2002 22:34:13 -0500
From: Ben Brunk <brunkb () ils unc edu>
To: declan () well com
Subject: Debunking TIA

Declan,

I'm in the middle of writing a dissertation relating to online privacy, but I have been completely sidetracked by the recent discussion over the Total Information Awareness program authorized by the Homeland Security bill that just passed into law. All I've seen so far are a lot of reactionary editorials written by people who haven't put an ounce of effort into analyzing the proposed system. They seem infatuated with the TIA logo, its slogan, and Poindexter. I have read, with avid fascination, all the dire predictions and scary stories about a new Big Brother system spearheaded by a felon who managed to avoid accountability. What I have yet to see is a rational analysis of the idea itself from someone who knows something about computers, databases and statistics. I hope to fill in that gap as best I can, though I'm sure there are experts out there with even better background in the appropriate research fields.

From what I have been able to find out about the TIA program, it is supposed to be a massive computerized dragnet that culls information from dozens of different sources and is intended to locate potential terrorists so that government agents can scrutinize them more closely. This system will draw data from sources such as credit reports, bank records, airline reservation systems, police records, gun purchase records, and many others.

Many of these sources of information are private databases owned and maintained by the corporations that rely on them. Even if they were all implemented in, say, Oracle, it would be difficult to match up records to any reliable degree. Who knows if the John Poindexter in one database is the same as the Jon Pointdexter in another? The social security number, which is apparently the holy grail of database keys, is not necessarily going to help, since many of these companies did not collect it or use it as a key. Name and address might make a good cross-referencing key, but people move all the time. I get three catalogs from a company that I purchased items from three times; even their internal database is not sophisticated enough to reconcile slight differences in spacing or my apartment number written with a '#' instead of 'apt' or 'apartment'. This is just inside one organization; we're not even trying to connect any dots yet. It will be easier to match records kept by the government, especially if they include SSNs and fingerprints. However, errors in government databases are well documented (although not readily admitted to). Those systems contain large numbers of errors, and even when errors are located and fixed, they have a nasty tendency to recur when data is shared or re-shared. If you fix an error in your Experian credit report but not in your TRW report, oftentimes the Experian error will reappear. Many people play this sort of "whack a mole" game for years.
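The matching problem described above can be sketched in a few lines of Python. This is only an illustration under assumptions of my own: the helper names, the sample addresses, and the 0.85 similarity threshold are all hypothetical, and no real TIA component is being described. It shows that an exact key comparison misses obvious variants, while a fuzzy comparison only produces a guess that can be wrong in either direction.

```python
import re
from difflib import SequenceMatcher


def normalize_address(addr: str) -> str:
    """Lowercase, unify apartment markers ('#', 'apt', 'apartment'), collapse spaces."""
    addr = addr.lower()
    addr = re.sub(r"\b(apartment|apt)\b\.?\s*|#\s*", "apt ", addr)
    addr = re.sub(r"[^a-z0-9 ]", " ", addr)  # drop punctuation such as commas
    return " ".join(addr.split())


def name_similarity(a: str, b: str) -> float:
    """Similarity ratio between 0.0 and 1.0 from the standard library's difflib."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def same_person_guess(a: str, b: str, threshold: float = 0.85) -> bool:
    """A guess only: the threshold is an arbitrary, hypothetical choice."""
    return name_similarity(a, b) >= threshold


# Three spellings of one (made-up) address collapse to a single key...
variants = ["12 Main St  #4", "12 Main St Apt. 4", "12 Main St, Apartment 4"]
keys = {normalize_address(v) for v in variants}

# ...but the two name spellings never match exactly, only approximately.
exact_match = "John Poindexter" == "Jon Pointdexter"
fuzzy_match = same_person_guess("John Poindexter", "Jon Pointdexter")
```

Note that the fuzzy test cuts both ways: it would also link two genuinely different people whose names happen to be spelled alike, and it says nothing about two unrelated people who share the same name outright.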

Another matter that no journalist has touched on, and the one I think is the biggest nail in TIA's coffin, is database error: the errors in these databases are several orders of magnitude more numerous than the terrorists in the world. All databases contain errors. Data culled from multiple, heterogeneous sources is going to have lots of errors. I don't have current estimates on the average expected error rate in a database, but let's suppose it is 5%. That means that in any given database, 95% of the data is right and 5% of it is junk. Garbage in, garbage out. The junk includes misspellings, flipped bits, transposed numbers, and transaction entries that never took place or were unintentionally duplicated or omitted. Five percent isn't a big deal until you look at it on the scale of what TIA is proposing. There are approximately 300 million people in the United States. Those 300 million people are very busy consumers, and their paper trail is enormous. There are trillions of transaction records, log entries, and other records that TIA would have to amass, standardize, and then examine. Even if the government buys all the necessary computing power and hires the very best staff, it can't do anything about randomness. The 5% expected error rate is the monkey wrench in the works. Five percent of 300 million is 15,000,000. Multiply that number by however many data points will be looked at. Say 500 data points for each person. Now we are looking at 300 million times 500, or 150,000,000,000 data points. Five percent of that number leaves us with 7,500,000,000. That is seven and one half billion erroneous data points if they want to look at every American. Worse, this is not a one-time scan. For any hope of success, they would have to look longitudinally. That is, every year, month, day, hour, whatever. Some indications of terrorism are very subtle: People who plan terror don't just run out and buy their entire list of bomb-making ingredients in one day and then book a flight. Terrorists are slow and methodical. They plan over months and years. So what we're looking at here is 7.5 billion bad data points mixed into the data that is examined day in and day out for years and years. With a 5% error rate, the number of false positives is outrageous, no matter what analysis technique is used (and any analysis technique will have its own error rate). There is not enough manpower in the entire federal government to track down every lead generated, even if much of that work is automated. With each passing day, homeland security will drown a little more in a hopeless pile of randomly generated false leads that grows even on weekends and holidays.
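The arithmetic above is easy to check. Here is a minimal Python sketch using the same supposed figures (300 million people, 500 data points each, a 5% error rate); the function name is mine, and the inputs are the guesses stated in the text, not measurements.

```python
POPULATION = 300_000_000       # approximate US population used above
DATA_POINTS_PER_PERSON = 500   # supposed number of data points per person
ERROR_PERCENT = 5              # supposed database error rate, in percent


def erroneous_count(items: int, error_percent: int) -> int:
    """Number of items expected to be wrong, using exact integer arithmetic."""
    return items * error_percent // 100


# Errors if there were only one record per person.
bad_person_records = erroneous_count(POPULATION, ERROR_PERCENT)

# Errors across every data point for every person.
total_data_points = POPULATION * DATA_POINTS_PER_PERSON
bad_data_points = erroneous_count(total_data_points, ERROR_PERCENT)

print(f"{bad_person_records:,} bad records at one record per person")
print(f"{total_data_points:,} data points in total")
print(f"{bad_data_points:,} bad data points")
```

Changing ERROR_PERCENT to 1 still leaves 1.5 billion bad data points, which is why the exact rate matters less than the scale.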

Let's suppose there are 1,000 terrorists hiding out in the USA, waiting to strike, which I personally think is a greatly exaggerated number. We know from the actions taken on 9/11 that these people are fairly cunning. They know how to hide from the system and how to hide in plain sight. They pay in cash, or they buy what they need by proxy, and they don't act any differently from anyone else. Like the millions of illegal immigrants in the US, terrorist operatives are good at using social networks to "fly below the radar" and subvert the system. One thousand people is a lot, but 1,000 out of 300 million is 3.33 * 10^-6, or 0.000333%. In other words, TIA would be looking for a minuscule fraction of 1% of the population in their database, the exact people who are going out of their way to escape detection. With an error rate of even 1%, detecting such a tiny fraction would be impossible. You would not be able to separate the signal from the noise, no matter what techniques were used. Pollsters run into this problem every election season when the 'margin of error' rises to a level greater than the projected differential between the candidates. A 3% margin of error in a race where the candidates differ by 1% is "too close to call." The same problem exists for scanning all airport baggage, but that is fodder for another day. The only way TIA would work is if some high percentage of Americans were terrorists: 20%, 50%, whatever. Only then would there be enough comparison data in both sets to draw testable conclusions and be assured that those conclusions were not just random error phenomena.
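To make the signal-versus-noise point concrete, here is a rough Python sketch. It assumes a hypothetical screening system that wrongly flags 1% of innocent people and correctly flags 99% of actual terrorists; both rates, and the function name, are illustrative assumptions rather than anything published about TIA. The 1,000 terrorists and 300 million people are the figures supposed above.

```python
def screening_outcome(population, terrorists, false_positive_rate, detection_rate):
    """Return (true positives, false positives, share of flagged people who are terrorists)."""
    innocents = population - terrorists
    false_positives = innocents * false_positive_rate
    true_positives = terrorists * detection_rate
    share_real = true_positives / (true_positives + false_positives)
    return true_positives, false_positives, share_real


POPULATION = 300_000_000
TERRORISTS = 1_000

prevalence = TERRORISTS / POPULATION  # about 3.33e-6, i.e. 0.000333%

hits, false_alarms, share_real = screening_outcome(
    POPULATION, TERRORISTS, false_positive_rate=0.01, detection_rate=0.99
)

print(f"prevalence: {prevalence:.6%}")
print(f"innocent people flagged: {false_alarms:,.0f}")
print(f"terrorists flagged: {hits:,.0f}")
print(f"share of flagged people who are terrorists: {share_real:.4%}")
```

Under these assumed rates, roughly three million innocent people are flagged alongside fewer than a thousand real targets, so only about three in every ten thousand flagged people are what the system is looking for.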

Let's look at this on a much smaller scale: Suppose the system worked well enough each day to render a list of 10,000 people, one (1) of whom is an actual terrorist (unbelievably good odds for the government). The government has a 0.01% probability of successfully picking the terrorist each day (using this system alone). Could the FBI/CIA/NSA/whatever even investigate 10,000 people with other techniques carefully enough each day to locate the one terrorist? Could they do it in a month or a year? I suppose the government could err on the side of caution and detain large numbers of people, place them in custody, and hold them indefinitely without due process until certain that they weren't terrorists. But that action presents nightmarish logistical and humanitarian prospects. The US prison population is bursting at the seams with an all-time high of two million. There would have to be enormous concentration camps for the millions of suspected terrorists who would be detained until their innocence is proven. That raises the question: Is it even possible to prove you are innocent in the current legal climate? The Red Scare (and the more recent FBI watch lists) has already taught us the folly of black lists and unsubstantiated accusations.
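The small-scale scenario can be worked through the same way. This Python sketch uses the supposed list of 10,000 names per day containing one real terrorist. The investigative capacity of 100 people per day and the assumption that each day's list contains entirely new names are hypothetical figures of mine, chosen only to show the shape of the problem.

```python
LIST_SIZE = 10_000        # people flagged per day, as supposed above
TERRORISTS_ON_LIST = 1    # real terrorists on that daily list, as supposed above


def chance_of_finding(investigated: int) -> float:
    """Chance the one terrorist is among `investigated` people picked at random from the list."""
    return min(investigated, LIST_SIZE) * TERRORISTS_ON_LIST / LIST_SIZE


single_pick = chance_of_finding(1)     # 1 in 10,000, i.e. 0.01%
with_capacity = chance_of_finding(100)  # hypothetical capacity of 100 per day

# Hypothetical: if every day's list held new names, this many people
# would have been flagged after one year.
flagged_per_year = LIST_SIZE * 365

print(f"one pick: {single_pick:.2%}")
print(f"100 investigations per day: {with_capacity:.0%}")
print(f"people flagged in a year: {flagged_per_year:,}")
```

Even with these generous assumptions, a year of daily lists names more people than the two million currently in US prisons.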

Lastly, data mining as a useful technique has been thoroughly debunked. It never lived up to its promises. This is why you don't hear much about data mining in the CS and IS literature these days; what of it that is left has morphed into the more esoteric "knowledge management" or KD. Like AI, it turned out to be quite a bit more difficult to do than expected and has been largely abandoned. Had anyone in the government actually bothered to read any of the literature, they would already know this.

All in all, I can't see how TIA will do anything except harm innocent people and create new jobs for bureaucrats. Any numerate person who spends five minutes thinking about what is proposed will come to the same conclusion. If our system is going to become this arbitrary, there are going to be an awful lot of lives ruined in this country. I fail to see how the TIA approach could do anything positive for the war on terror or for America in general. It will eat up resources better spent on more proven and acceptable approaches. In fact, such a data-driven approach might actually be more successful if it simply took a random sampling of the population each day.

My hope is that this editorial will awaken those who are even more skilled in computer science, statistics, game theory, etc. and that they find the courage to speak up so we can put the brakes on the wasteful and destructive blind alley called TIA.


Benjamin Brunk




-------------------------------------------------------------------------
POLITECH -- Declan McCullagh's politics and technology mailing list
You may redistribute this message freely if you include this notice.
To subscribe to Politech: http://www.politechbot.com/info/subscribe.html
This message is archived at http://www.politechbot.com/
Declan McCullagh's photographs are at http://www.mccullagh.org/
-------------------------------------------------------------------------
Like Politech? Make a donation here: http://www.politechbot.com/donate/
Recent CNET News.com articles: http://news.search.com/search?q=declan
-------------------------------------------------------------------------

