Interesting People mailing list archives

statement by Tether (Dir. Darpa) before the House subcom


From: Dave Farber <dave () farber net>
Date: Sun, 11 May 2003 13:54:33 -0400


------------------------------------------------------------------------


Statement by

Dr. Tony Tether
Director
Defense Advanced Research Projects Agency

before the
Subcommittee on Technology, Information Policy, Intergovernmental
Relations and the Census
Committee on Government Reform
United States House of Representatives

May 6, 2003 
Mr. Chairman, Subcommittee Members, and staff: I am Tony Tether, Director of
the Defense Advanced Research Projects Agency (DARPA). I am pleased to
appear before you today to talk about data mining and protecting the privacy
of Americans. This is an important issue, and I hope that you will find my
remarks helpful as your subcommittee looks into this complicated topic.

Some of you might be unfamiliar with DARPA. We are, essentially, tool
makers, sponsoring high-payoff research for the Department of Defense (DoD).
This research includes several new software tools that DARPA is developing
to assist the DoD in its counterterrorism mission. We are developing new
data search and pattern recognition technologies, which have little in
common with existing data mining technology, and represent just one element
of DARPA's counterterrorism research. Other critical areas of our research
include secure collaborative problem solving, structured knowledge
discovery, data visualization, and decision making with corporate memory.

It is important to remember that the technologies I will be discussing do
not yet exist in their final form, and, no doubt, they will change. Some
will succeed and some will fail, and we will learn as we go along. That is
the nature of research.

Moreover, unlike some of the other agencies represented by my fellow
panelists today, DARPA is not an agency that will actually use these tools,
if they work. Other agencies in the DoD, Federal government, or Congress
will decide if they want to use the tools we create and how they will use
them.

DARPA's Approach to Data Search and Pattern Recognition

When most people talk about "data mining," they are referring to the use of
clever statistical techniques to comb through large amounts of data to
discover previously unknown, but useful patterns for building predictive
models. This is typically done in the commercial world to better predict
customer purchases, understand supply chains, or find fraud -- or address any
number of other issues where a better understanding of behavior patterns
would be helpful. The basic approach is to find statistical correlations as
a means of discovering unknown behavior patterns, and then build a
predictive model.
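
To give a concrete flavor of what this kind of statistical modeling looks
like, consider the deliberately tiny, purely illustrative Python sketch
below. The transaction features, labels, and "fraud score" are all
hypothetical and are not drawn from any real system; the point is only that
historical correlations are turned into a crude predictive model.

# Illustrative only: learn a simple correlation between transaction features
# and a known outcome (e.g., fraud), then use it to score new transactions.
from collections import Counter

# Hypothetical labeled history: (feature tuple, was_fraud)
history = [
    (("overnight_shipping", "new_account"), True),
    (("overnight_shipping", "old_account"), False),
    (("standard_shipping", "new_account"), False),
    (("overnight_shipping", "new_account"), True),
]

# Count how often each feature co-occurs with fraud, and how often it occurs at all.
fraud_counts, total_counts = Counter(), Counter()
for features, was_fraud in history:
    for f in features:
        total_counts[f] += 1
        if was_fraud:
            fraud_counts[f] += 1

def fraud_score(features):
    # Average fraud rate of the transaction's features -- a crude predictive model.
    rates = [fraud_counts[f] / total_counts[f] for f in features if total_counts[f]]
    return sum(rates) / len(rates) if rates else 0.0

print(fraud_score(("overnight_shipping", "new_account")))  # high score
print(fraud_score(("standard_shipping", "old_account")))   # low score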

At first, one might think that data mining would be very helpful for the
most general attempts to find terrorists. It would appear ideal to have
software that could automatically discover suspicious, but previously
unnoticed patterns in large amounts of data, and which could be used to
create models for "connecting-the-dots" and predicting attacks beforehand.
However, there are fundamental limitations to expanding today's data mining
approaches to the challenge of generally finding and interdicting complex
and meticulously well-planned terrorist plots that involve various
individuals.

Skeptics believe that such techniques are not feasible because it is simply
too difficult to program software to answer the general question, "Is that
activity suspicious?" when terrorist plans are so variable and evidence of
them is so rare. The results, skeptics say, will contain unmanageable
numbers of "false positives" -- activities flagged as suspicious that turn
out to be innocent.

Beyond the skeptics, critics claim that such an approach must inevitably
lead to "fishing expeditions" through massive amounts of personal data and a
wholesale invasion of Americans' privacy that yields, basically, nothing in
terms of finding terrorists. In previous testimony, this approach has been
referred to as "mass dataveillance."

In fact, these objections are among the reasons why DARPA is not pursuing
these techniques, but is developing a different approach in our research.

DARPA is not trying to bring about "mass dataveillance," regardless of what
you have read or heard. We believe that the existing data mining approach of
discovering previously unknown patterns is ill-suited to ferreting out
terrorist plans.

The purpose of data mining is, typically, to find previously unknown but
useful patterns of behavior in large amounts of data on activities that are
narrowly defined and identified, such as credit card usage or book
purchases. These behavior patterns relate to individual transactions or
classes of transactions (but not to individuals themselves), again in
narrowly defined and identified areas of activity.

The counter-terrorism problem is much more difficult than this. To detect
and prevent complex terrorist plots, one must find extremely rare instances
of patterns across an extremely wide variety of activities -- and hidden
relationships among individuals. Data mining is ill-suited to this task
because the domains of potentially interesting activity are so much more
numerous and complex than purchasing behavior.

Accordingly, we believe that better tools and a different approach are
needed for the most general efforts to detect and prevent complicated,
well-planned terrorist plots, particularly if we are to prevent them well
before they can occur and long before they can reach U.S. shores.
Consequently, our research goal to create better counterterrorism tools will
not be realized by surveilling huge piles of data representing a collection
of broad or ill-defined activities in the hope of discovering previously
unknown, unspecified patterns. Instead, we are pursuing an approach of
searching for evidence of specified patterns.

Detecting Data that Fits Specified Patterns

Our approach starts with developing attack scenarios, which are used to find
specific patterns that could indicate terrorist plans or planning. These
scenarios would be based on expert knowledge from previous terrorist
attacks, intelligence analysis, new information about terrorist techniques,
and/or from wargames in which clever people imagine ways to attack the
United States and its deployed forces. The basic approach does not rely on
statistical analysis to discover unknown patterns for creating predictive
models. Instead, we start with expert knowledge to create scenarios in
support of intelligence analysis, rather than taking a data mining approach
that scans databases for previously unknown correlations.

The scenarios would then be reduced to a series of questions about which
data would provide evidence that such attacks were being planned. We call
these scenarios "models," and they are, essentially, hypotheses about
terrorist plans. Our goal is to detect data that supports the hypotheses.

Contrast this approach with trying to discover a suspicious pattern without
having a model as a starting point -- when the pattern is not known in
advance. Consider a truck bomb attack, involving a rental truck filled with
fertilizer and other materials. Trying to get software to discover such an
attack in its planning stages by combing through piles of data -- not knowing
what it was looking for, but trying to flag "suspicious" activities
suggestive of terrorist planning -- is unlikely to work. Terrorist activity
is far too rare, and spotting it across many different activities by broadly
surveilling all available data requires enormous knowledge about the world
in order to identify an activity or individual as being "suspicious."

DARPA's approach, instead, focuses a search on detecting evidence for the
scenario model or hypothesis, "Are there foreign visitors to the United
States who are staying in urban areas, buying large amounts of fertilizer
and renting trucks?" Again, the model or hypothesis is not created by
meandering through vast amounts of data to discover unknown patterns.
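
To make the distinction concrete, the hypothesis above can be thought of as
a pre-specified pattern -- an explicit conjunction of evidence conditions --
evaluated against data, rather than a pattern discovered by scanning the
data. The following Python lines are a minimal, purely illustrative sketch;
the field names, the quantity threshold, and the records are hypothetical.

# Illustrative sketch only: a specified pattern expressed as a conjunction of
# evidence conditions, checked against hypothetical records.
pattern = {
    "is_foreign_visitor": True,
    "residence_type": "urban",
    "fertilizer_purchase_lbs_min": 500,
    "rented_truck": True,
}

def matches_pattern(record, pattern):
    # True only if the record supplies evidence for every condition in the hypothesis.
    return (
        record["is_foreign_visitor"] == pattern["is_foreign_visitor"]
        and record["residence_type"] == pattern["residence_type"]
        and record["fertilizer_purchase_lbs"] >= pattern["fertilizer_purchase_lbs_min"]
        and record["rented_truck"] == pattern["rented_truck"]
    )

records = [
    {"id": "A17", "is_foreign_visitor": True, "residence_type": "urban",
     "fertilizer_purchase_lbs": 900, "rented_truck": True},
    {"id": "B02", "is_foreign_visitor": False, "residence_type": "rural",
     "fertilizer_purchase_lbs": 40, "rented_truck": False},
]

hits = [r["id"] for r in records if matches_pattern(r, pattern)]
print(hits)  # ['A17'] -- evidence for the hypothesis, to be examined further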

Finding the evidence of a suspicious pattern is, of course, not as simple as
I have made it sound. DARPA's counterterrorism research in the areas of data
search and pattern recognition is based on two basic types of queries that,
as a practical matter, would probably be used in combination.

The first type of query is subject-based and begins with an entity, such as
people known to be suspects. Analysts would start with actual suspects'
names and see if there is evidence of links with other suspects or
suspicious activities. Current technology and policy pertaining to
subject-based queries are fairly well developed and understood. One method of
subject-based query with enormous potential is link analysis, which seeks to
discover knowledge based on the relationships in data about people, places,
things, and events. Link analysis makes it possible to understand the
relationships between entities. Properly assembled, these links can provide
a picture of higher-level terrorist networks and activities, which, in turn,
forms a basis for early indications and warning of a terror attack. Data
mining offers little as a tool for investigating such relationships -- it
creates models by finding statistical correlations within databases without
using a starting point, and then applies these models indiscriminately over
entire data sets. Link analysis differs because it detects connectedness
within rare patterns using known starting points, reducing the search space
at the outset.
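
As a purely illustrative sketch of the idea behind link analysis -- expanding
outward from a known starting point rather than scanning everything -- the
few lines of Python below walk a small graph of relationships. The entities,
relationships, and the two-hop limit are all hypothetical.

# Illustrative sketch only: link analysis starting from a known subject.
from collections import deque

# Edges: (entity, related entity, relationship)
links = [
    ("suspect_1", "person_A", "shared address"),
    ("person_A", "person_B", "wired funds"),
    ("suspect_1", "company_X", "employed by"),
    ("person_C", "person_D", "phone contact"),   # unconnected to the suspect
]

def neighbors(entity):
    for a, b, rel in links:
        if a == entity:
            yield b, rel
        elif b == entity:
            yield a, rel

def expand(start, max_hops=2):
    # Breadth-first expansion from a known starting point, which limits the search space.
    seen, frontier, found = {start}, deque([(start, 0)]), []
    while frontier:
        entity, hops = frontier.popleft()
        if hops == max_hops:
            continue
        for other, rel in neighbors(entity):
            if other not in seen:
                seen.add(other)
                found.append((entity, rel, other, hops + 1))
                frontier.append((other, hops + 1))
    return found

for a, rel, b, hops in expand("suspect_1"):
    print(f"{a} --[{rel}]--> {b}  ({hops} hop(s) from the starting point)")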

The second type of query is strictly pattern-based. Analysts would look for
evidence of a specified pattern of activity that might be a threat.

It is crucial to note that both types of queries start with either known,
identified suspects or known, identified patterns. The focus is
investigative as opposed to broad surveillance. In both cases, the data that
one is looking for is likely to be distributed over a large number of very
different databases. Querying distributed, heterogeneous databases is not
easy, particularly if we are trying to detect patterns, and we do not know
how to do it right now. Pattern query technology is a critical element of
our counter-terrorism research; it is rather immature, as are the policies
governing its application.
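
Although the technology for pattern queries over distributed, heterogeneous
databases does not yet exist, one way to picture the problem is a single
pattern decomposed into sub-queries against separately held, differently
structured sources, with the partial results joined afterward. The Python
sketch below is purely illustrative; the sources, schemas, quantities, and
the assumption of a shared identifier across sources are all hypothetical.

# Illustrative sketch only: decompose one pattern into sub-queries over
# separate sources and keep only identifiers returned by every sub-query.
visa_source = [
    {"person_id": "P1", "visitor": True, "city_type": "urban"},
    {"person_id": "P2", "visitor": True, "city_type": "rural"},
]
purchase_source = [
    {"buyer": "P1", "item": "fertilizer", "qty_lbs": 800},
    {"buyer": "P3", "item": "fertilizer", "qty_lbs": 900},
]
rental_source = [
    {"renter": "P1", "vehicle": "truck"},
]

def query_visas():
    return {r["person_id"] for r in visa_source if r["visitor"] and r["city_type"] == "urban"}

def query_purchases(min_lbs=500):
    return {r["buyer"] for r in purchase_source if r["item"] == "fertilizer" and r["qty_lbs"] >= min_lbs}

def query_rentals():
    return {r["renter"] for r in rental_source if r["vehicle"] == "truck"}

# The pattern holds only for identifiers returned by every sub-query.
matches = query_visas() & query_purchases() & query_rentals()
print(matches)  # {'P1'}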

The data that analysts get back in response to a query might not tell them
everything. The response may depend on who is doing the analysis and their
levels of authorization. This brings me to the second aspect of our
approach, detecting in stages.

Detecting in Stages

We envision that analysts will search for evidence of specified patterns in
stages. They will ask questions, get some results, and then refine their
results by asking more questions. This is really just common sense, but it
is worth highlighting that detecting in stages offers a number of
advantages: it uses information more efficiently; it helps limit false
positives; it can conform to legal investigative procedures; and it allows
privacy protection to be built-in.

Detecting in stages helps deal with the crucial challenge of false positives
-- that is, mistakenly flagging as suspicious activities and people that are,
in fact, innocuous. False positives waste investigative resources and, in
the worst cases, can lead to false accusations. Unfortunately, much of the
discussion of false positives and counter-terrorism has tended to emphasize
technology as the key issue by implicitly assuming a caricature of an
investigative process in which a computer program fishes through massive
piles of data, officials press the "print" button, and out pop a bunch of
arrest warrants. Of course, such an approach is unworkable.

We recognize that false positives must be considered as a product of the
whole system. They result from how the data, the technology, the personnel,
and the investigative procedures interact with each other -- they are not
solely the result of the application of less-than-perfect technology.
DARPA's research seeks to provide analysts with powerful tools, not replace
the analysts themselves. Moreover, how we react to positives, and what we
plan to do with the results, matters enormously to this issue.

It is also important to remember that all investigations -- whether they use
databases or not -- will yield false positives. Therefore, the relevant
question is, "Can we improve our overall ability to detect and prevent
terrorist attacks without having an unacceptable false positive rate at the
system level?" That is the key challenge to be answered by our research.

No doubt many of the "positives" found during the first queries that
analysts make will be false ones. The positives must be further examined to
start weeding out the false ones and confirming the real ones, if there are
any. This will require analysis in several stages to find independent,
additional evidence that either refutes or continues to support the
hypothesis represented by the model. Moreover, the level of proof depends,
in part, on the nature of the planned response to a positive. We do not, for
example, arrest everyone who sets off the metal detector when entering this
building.
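
A purely illustrative sketch of detection in stages follows: each stage
applies an independent check, and only the cases that survive move on to the
next, so most false positives are weeded out early. The stage names, checks,
and cases below are hypothetical.

# Illustrative sketch only: staged refinement of initial "positives."
def stage_matches_pattern(case):
    return case["pattern_score"] >= 0.8

def stage_independent_evidence(case):
    return case["corroborating_sources"] >= 2

def stage_analyst_review(case):
    return case["analyst_flagged"]

stages = [
    ("initial pattern match", stage_matches_pattern),
    ("independent corroboration", stage_independent_evidence),
    ("analyst review", stage_analyst_review),
]

cases = [
    {"id": "C1", "pattern_score": 0.9, "corroborating_sources": 3, "analyst_flagged": True},
    {"id": "C2", "pattern_score": 0.85, "corroborating_sources": 0, "analyst_flagged": False},
    {"id": "C3", "pattern_score": 0.2, "corroborating_sources": 5, "analyst_flagged": True},
]

surviving = cases
for name, check in stages:
    surviving = [c for c in surviving if check(c)]
    print(f"after {name}: {[c['id'] for c in surviving]}")
# Most false positives drop out at the early stages; only C1 reaches the final stage.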

An analogy we sometimes use to illustrate this is submarine detection. In
submarine warfare, we do not simply attack something based on first
indications that a single sensor has detected an object. We refine the
object's identification in stages -- from "possible" enemy submarine, to
"probable" enemy submarine, to "certainly" an enemy submarine. To be sure of
our actions, we confirm the identification over time, using different,
independent sensors and sources of information. Our approach to data
searching and pattern recognition would proceed in a similar fashion.

Proceeding in stages also means that the entire process can conform to
required, legal procedures or steps. In fact, many of these steps exist
precisely to protect people's rights and weed out false positives. We
envision hard-wiring many of the required procedures, permissions, or
business rules into the software to ensure that they are actually being
followed at each stage of the process.

Let us go back to the truck bomb example. One might incorporate a process
called "selective revelation" into data queries. In selective revelation,
the amount of information revealed to the analyst depends on who the analyst
is, the status of the investigation, and the specific authorization the
analyst has received. The analyst's credentials would be automatically
included with the query, and the level of information returned would vary
accordingly.

Perhaps the result of the truck bomb query I talked about earlier is that 17
people fit the truck bomber pattern, but no personal information about those
17 is revealed. To retrieve additional personal information, a higher level
of authorization might be required, based on an independent evaluation (by a
court, for example) of the evidence that the analyst is actually "on to"
something suspicious.
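
A minimal, purely illustrative sketch of selective revelation: the same
query returns different levels of detail depending on the authorization
presented with it. The authorization levels, field names, and records below
are hypothetical.

# Illustrative sketch only: the detail returned depends on the analyst's authorization.
records = [
    {"identity": "John Doe", "city": "Springfield", "matched_pattern": True},
    {"identity": "Jane Roe", "city": "Centerville", "matched_pattern": True},
]

def run_query(analyst_authorization):
    matches = [r for r in records if r["matched_pattern"]]
    if analyst_authorization == "initial":
        # First stage: reveal only that some number of records fit the pattern.
        return {"match_count": len(matches)}
    if analyst_authorization == "court_approved":
        # Higher authorization (e.g., after independent review): reveal identities.
        return {"matches": [{"identity": r["identity"], "city": r["city"]} for r in matches]}
    return {}

print(run_query("initial"))         # {'match_count': 2}
print(run_query("court_approved"))  # identities revealed only at this level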

This suggests that there is a special class of business rules and procedures
that could be put into the technology to strengthen privacy protection, so
let me turn to that now.

Built-in Privacy Protection

From the very start of our research, we began looking for ways to build
privacy protection into DARPA's approach to detecting terrorists.

We had two motivations. First, we knew that the American public and their
elected officials would have to be confident that their liberties would not
be violated before they would accept this kind of technology.

Second, much of what Federal agencies need to share is intelligence data.
Historically, agencies have been reluctant to share intelligence data for
fear of exposing their sources and methods. Accordingly, protecting privacy
and intelligence sources and methods are integral to our approach.

We are putting policies into place that will highlight protecting privacy.
As I previously alluded, DARPA does not own or collect any intelligence or
law enforcement databases. Our policies will address the development and
transition of new tools to the agencies authorized by law to use those
databases, reinforcing to everyone the importance of privacy. Moreover, we
are fully aware of, and intend for the tools to be used only in a manner
that complies with, the requirements of the Privacy Act, as well as the
privacy provisions of the E-Government Act regarding a Privacy Impact
Assessment where such an assessment is required. And we recognize that under
Office of
Management and Budget policy, major agency information systems employing the
technology will have to be justified by a business case that addresses how
privacy and security are built into the technology.

To further assist agencies that have collected the data for analysis, we are
developing other tools that will help them protect the integrity of the
information -- even during searches. I previously mentioned "selective
revelation" as one way to protect privacy, and we are looking at other
related techniques as well, such as separating identity information from
transaction information. These separate pieces of information could only be
reassembled after the analyst has received the proper authorizations.
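
The separation of identity information from transaction information might be
pictured as in the purely illustrative sketch below, in which the linking key
is held apart from the transaction data and re-associated only when
authorization is granted. The stores, field names, and pseudonym are
hypothetical.

# Illustrative sketch only: identity data and transaction data held separately,
# linked by a pseudonym that can be resolved only with explicit authorization.
identity_store = {"K-4821": {"name": "John Doe", "passport": "X1234567"}}
transaction_store = [
    {"pseudonym": "K-4821", "purchase": "fertilizer", "qty_lbs": 800},
]

def analyze():
    # Analysts normally work on the pseudonymized transactions alone.
    return [t for t in transaction_store if t["qty_lbs"] >= 500]

def reveal_identity(pseudonym, authorization_granted):
    # Re-association requires an explicit, separately granted authorization.
    if not authorization_granted:
        raise PermissionError("identity may not be re-associated without authorization")
    return identity_store[pseudonym]

hits = analyze()
print(hits)                                            # facts only, no identity
print(reveal_identity("K-4821", authorization_granted=True))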

Until then, an analyst might only know the basic facts but not the identity
of who was involved. We are also looking at ways to anonymize data before it
is analyzed. We are evaluating methods for filtering out irrelevant
information from the analysis, such as the use of "software agents" that
utilize experience-based rules. These software agents would automatically
remove data that appears to be irrelevant before the analyst even sees it.

Going beyond privacy protection, we are also looking into building in
indelible audit technology that makes it exceedingly difficult to abuse the
data search and pattern recognition technology without the abuse being
detected. This audit technology would answer the question, "Who used the
system to retrieve what data?"

Some ideas that we are pursuing include cryptographically protecting audit
information and perhaps even broadcasting it to outside parties, where it
cannot be tampered with. We are also looking into software agents that would
watch what analysts are doing to ensure that their searches and procedures
are appropriate and that they are following established guidelines.
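
One way to picture a cryptographically protected, tamper-evident audit trail
is a hash chain, in which each audit entry incorporates a hash of the entry
before it, so altering any earlier record invalidates everything that
follows. The Python sketch below is purely illustrative, and the entry
fields are hypothetical.

# Illustrative sketch only: a hash-chained audit log recording who queried what.
import hashlib, json

audit_log = []

def append_audit(analyst, query):
    prev_hash = audit_log[-1]["hash"] if audit_log else "0" * 64
    entry = {"analyst": analyst, "query": query, "prev_hash": prev_hash}
    entry["hash"] = hashlib.sha256(json.dumps(entry, sort_keys=True).encode()).hexdigest()
    audit_log.append(entry)

def verify_log():
    # Any alteration of an earlier entry breaks every hash that follows it.
    prev_hash = "0" * 64
    for entry in audit_log:
        expected = dict(entry)
        stored_hash = expected.pop("hash")
        if expected["prev_hash"] != prev_hash:
            return False
        if hashlib.sha256(json.dumps(expected, sort_keys=True).encode()).hexdigest() != stored_hash:
            return False
        prev_hash = stored_hash
    return True

append_audit("analyst_7", "pattern query: fertilizer + truck rental")
append_audit("analyst_7", "follow-up: reveal identities (court approved)")
print(verify_log())                 # True
audit_log[0]["query"] = "tampered"
print(verify_log())                 # False -- tampering is detectable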

Another interesting idea is data that reports its location back to the
system. One might even include a unique identifier for each copy (a "digital
watermark"), so that if unauthorized copies were distributed, their source
could be traced. Still another concept is giving control of database
querying to a trusted third party, who could not be subject to
organizational pressure to provide unauthorized access.

We take privacy issues very seriously. DARPA is, in fact, one of the few
Federal agencies sponsoring significant research in the area of privacy
protection technologies.

You will often hear talk in this debate about trade-offs -- for instance,
that we may need to accept less privacy in exchange for more security.
People may disagree about the proper balance, but DARPA's efforts in
developing privacy protection technology are designed, in fact, to improve
the prospects for providing both better privacy protection and better
security by the legally relevant agencies.

In closing, I would like to emphasize two points:

First, remember that what I have been describing here today is research, and
exactly how the technology will work -- indeed, if it works -- will only be
shown over time.

Second, because of the high profile of DARPA's research in this area,
in February 2003 the Department of Defense announced the establishment of
two boards to provide oversight of our Information Awareness programs,
including our data search and pattern recognition technologies. These two
boards, an internal oversight board and an outside advisory committee, will
work with DARPA as we proceed with our research to ensure full compliance
with U.S. constitutional law, U.S. statutory law, and American values
related to privacy.

This concludes my remarks. I would be happy to answer any questions.
