Educause Security Discussion mailing list archives

Re: Sensitive data detection -- vendor response


From: Gary Golomb <gary () PROVENTSURE COM>
Date: Sat, 21 Apr 2007 16:41:16 -0400

 

Disclaimer: First of all, please don't view this as negativity toward the plethora of free tools available for
searching for sensitive data. This post starts by focusing on regular expressions, which all of those tools happen to
be based on, but more importantly, most commercial applications are based solely on regexes under the hood as well. In
that sense, this post isn't taking a negative view of the free tools; it applauds them. If a vendor expects you to pay
for something that is ultimately just regexes behind a pretty interface, you're better off using something free.

 

I wanted to address the discussion of searching for 9-digit SSNs without spaces or dashes. Even numbers in the format
of nnn-nn-nnnn generate a lot of false positives when searched for with regexes. Some tools try to mitigate this by
creating a "white list" of file types to search, but that simply increases your chances of missing real data
dramatically, as we've seen in many cases. You'd be surprised what types of files have this data in them, and looking
only for .doc, .xls, .csv, .txt, etc. barely scratches the surface.
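
To make that concrete, here is a minimal sketch (in Python, not taken from any of the tools mentioned) of the kind of
naive 9-digit search these tools run, and why it flags data that merely looks like an SSN:

    import re

    # Naive patterns of the kind a regex-based scanner uses (illustrative only).
    formatted_ssn = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")   # nnn-nn-nnnn
    unformatted_ssn = re.compile(r"\b\d{9}\b")             # bare 9-digit run

    samples = [
        "Order 123456789 shipped today",    # 9-digit order number, not an SSN
        "Part no. 987-65-4321 in stock",    # dashed part number, not an SSN
    ]

    for s in samples:
        if formatted_ssn.search(s) or unformatted_ssn.search(s):
            print("flagged:", s)            # both lines get flagged

Both samples get flagged even though neither contains an SSN; the pattern alone has no way to tell the difference.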

 

While "Regular Expressions" are a wonderful language for defining text search parameters, they are designed for a very 
specific problem. As the name implies, they are good at looking for "regular" patterns. A good example of their 
limitations is shown by looking for email addresses. While this seems like an easy problem to address with regexs, 
email addresses can have a number of different formats and allowed characters. It is commonly known that regular 
expressions are a poor choice for finding data that has variance in it, and most web developers will attest to the 
lackluster performance of trying to validate email addresses with regexs. The relationship between false positives and 
false negatives with regex's (as with most things) is almost always inverse. That is, if you are not getting many false 
positives, it is because you are surely experiencing false negatives - ie: missing real data. This problem is further 
compounded when searching for "sensitive" data, which is far more variable in format than only email addresses. 
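
As a quick illustration (a sketch of a commonly seen pattern, not code from any specific tool), a typical "simple"
email regex quietly misses perfectly legal addresses:

    import re

    # A typical "simple" email pattern (illustrative; nowhere near RFC 5322).
    simple_email = re.compile(r"[A-Za-z0-9._%-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

    legal_addresses = [
        "o'brien@example.com",         # apostrophes are legal in the local part
        "user+tag@example.co.uk",      # "+" addressing is legal but often excluded
        '"odd name"@example.com',      # quoted local parts are legal, rarely matched
    ]

    for addr in legal_addresses:
        print(addr, "->", "matched" if simple_email.fullmatch(addr) else "missed")

Every address above is legal per the RFCs, and every one is missed. Tighten the pattern to reduce false positives and
the false negatives climb; loosen it and the false positives climb.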

 

You can compensate for some of the above problems, since a powerful feature of regexes is the ability to add AND, OR,
and NOT logic to an expression. This is also the largest failure in engineering solutions around regexes, since the
computational overhead that this type of logic places on the regular expression engine grows rapidly with each
additional clause. And that doesn't even come close to the cost of quantifiers within the regexes (something even some
of the free tools use), regardless of whether they are lazy or greedy. Engineers commonly, and mistakenly, assume this
is acceptable because computers are so powerful these days. Indeed, in testing on lab and development machines the
impact seems small; however, when these solutions are moved to real-world systems it is not uncommon to see these types
of applications consume all the computational resources of the system. I know people on this list have already
experienced this on live systems with some of the free tools mentioned in this thread.
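
For the curious, this is roughly what AND/NOT logic bolted onto a regex looks like (an illustrative Python sketch of
the style, not any particular tool's rule); every added lookahead is another pass the engine makes at each candidate
position:

    import re

    # AND/NOT logic expressed with lookaheads (a sketch of the style only).
    ssn_like = re.compile(
        r"(?=\b\d{3}-\d{2}-\d{4}\b)"     # AND: must look like nnn-nn-nnnn
        r"(?!000-|666-|9\d{2}-)"         # NOT: invalid area numbers
        r"(?!\d{3}-00-)"                 # NOT: invalid group number
        r"(?!\d{3}-\d{2}-0000)"          # NOT: invalid serial number
        r"\b\d{3}-\d{2}-\d{4}\b"
    )

    print(bool(ssn_like.search("Ref 219-09-9999")))  # True: passes every clause
    print(bool(ssn_like.search("Ref 000-12-3456")))  # False: rejected by a NOT clause

Multiply that work across every byte of every file on a disk and the overhead shows up quickly.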

 

One of the largest problems is that regexes ultimately find patterns. They do not find valid data; they only find
patterns, valid and invalid. If the purpose of the application is to meet industry fiduciary requirements or to
maintain a legally auditable chain of custody for information, regexes are a very poor choice because of the
variability of confidence (or lack thereof) they introduce into the audit process. The need to add many validation
functions on top of already taxing regex-based functions just makes matters worse. The fact is, if you are happy with
regexes, there are a number of free applications available that use regular expressions, have very decent file
decoding capabilities, and can even be scripted, so many universities would be better off using one of these free
programs as opposed to paying for a solution that is itself based on regexes.
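
To show what "validation on top of the regex" means in practice, here is a sketch of the usual pattern for credit card
numbers: a broad regex finds candidates, then a Luhn checksum throws most of them out (illustrative Python, not any
particular tool's code):

    import re

    def luhn_ok(digits: str) -> bool:
        """Luhn checksum: the kind of validation bolted on top of regex hits."""
        total = 0
        for i, ch in enumerate(reversed(digits)):
            d = int(ch)
            if i % 2 == 1:        # double every second digit from the right
                d *= 2
                if d > 9:
                    d -= 9
            total += d
        return total % 10 == 0

    # Broad candidate pattern; most of what it finds is not a card number.
    cc_candidate = re.compile(r"\b\d{13,16}\b")

    text = "Ref 4111111111111111 vs ticket 1234567890123456"
    for m in cc_candidate.finditer(text):
        print(m.group(), "->", "passes Luhn" if luhn_ok(m.group()) else "rejected")

The checksum rescues the card-number case, but there is no equivalent check digit for SSNs, which is a big part of why
bare nine-digit runs are so hard to validate.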

 

How this relates to Asarium, as originally described....

 

Asarium uses a layered approach to detection, allowing for the best possible balance of accuracy and performance.
First, Asarium statistically analyzes files to determine information that is used later in selecting file decoding
algorithms. Asarium dynamically makes decisions about files, their structures, and their encodings as analysis
progresses through a file (so you don't have to bother with whitelisting or blacklisting file types - it figures out
whether there is decodable data within the file and works on that data).
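
As a rough, generic illustration of what deciding decodability from byte statistics (rather than file extensions) can
look like - this is a simplified sketch for illustration only, not Asarium's actual algorithm:

    import math
    from collections import Counter

    def byte_stats(data: bytes):
        """Crude per-file statistics a scanner could use to pick a decoder."""
        counts = Counter(data)
        total = len(data) or 1
        entropy = -sum(c / total * math.log2(c / total) for c in counts.values())
        printable = sum(c for b, c in counts.items()
                        if 32 <= b < 127 or b in (9, 10, 13))
        return entropy, printable / total

    def looks_decodable(data: bytes) -> bool:
        entropy, printable_ratio = byte_stats(data)
        # Mostly printable, mid-entropy data is worth handing to a text decoder;
        # high-entropy blobs (compressed/encrypted) rarely yield anything.
        return printable_ratio > 0.85 and entropy < 6.5

    print(looks_decodable(b"SSN on file: 123-45-6789\n"))  # True
    print(looks_decodable(bytes(range(256)) * 16))         # False

The point is simply that the decision comes from the bytes themselves rather than from a list of file extensions.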

 

Asarium uses a number of mathematically advanced algorithms for analyzing data in a file. At GWU, I had originally
focused on computational analysis of human proteins, using similar algorithms to find "hidden" and unknown patterns
through the analysis of observable and known patterns. These algorithms are statistical and probabilistic in nature,
and are known for their applications in temporal pattern recognition such as speech, handwriting, gesture recognition,
and musical score following, as well as in bioinformatics.

 

As it turns out, the problem of accurately detecting "sensitive" information (including SSNs and, more specifically,
nine-digit numbers without formatting) is extremely similar to problems in bioinformatics and genetic analysis, since
long strings of numbers and letters are very common in files - especially if the software is thorough enough to examine
all files on a computer. By not examining all files, users end up ignoring data based on assumptions.

 

For the geeks in the crowd, one of the two main methods Asarium uses to analyze file data involves modeling the data
with Hidden Markov Models. A Hidden Markov Model is a finite set of states, each of which is associated with a
(generally multidimensional) probability distribution. Transitions among the states are governed by a set of
probabilities called transition probabilities. In a particular state, an outcome or observation can be generated
according to the associated probability distribution. In effect, they are a method of walking from one observed
outcome (a series of bytes in a file) to the next, making a calculation at each point, and using only the calculated
probability from the previous point to determine the most likely probability at the current point. The problem can be
summarized as computing the probability of observing a sequence of length k,

Y = y(0), y(1), ..., y(k - 1),

which is given by

P(Y) = Σ P(Y | X) P(X),

where the sum runs over all possible hidden state sequences

X = x(0), x(1), ..., x(k - 1).
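
For anyone who wants to see that recursion in code, here is a toy forward-algorithm sketch for P(Y). The states,
symbols, and probabilities are invented purely for illustration; this is not Asarium's model or implementation:

    # Toy forward-algorithm sketch for P(Y); all values here are hypothetical.
    states = ["digit_run", "other"]                # hidden states (made up)
    start = {"digit_run": 0.1, "other": 0.9}       # P(x(0))
    trans = {                                      # P(x(t) | x(t-1))
        "digit_run": {"digit_run": 0.8, "other": 0.2},
        "other":     {"digit_run": 0.1, "other": 0.9},
    }
    emit = {                                       # P(y(t) | x(t))
        "digit_run": {"digit": 0.9, "nondigit": 0.1},
        "other":     {"digit": 0.2, "nondigit": 0.8},
    }

    def forward_prob(symbols):
        """P(Y) = sum over hidden sequences X of P(Y | X) P(X), built one step at a time."""
        alpha = {s: start[s] * emit[s][symbols[0]] for s in states}
        for y in symbols[1:]:
            alpha = {s: sum(alpha[p] * trans[p][s] for p in states) * emit[s][y]
                     for s in states}
        return sum(alpha.values())

    # Nine consecutive "digit" observations, like an unformatted SSN candidate.
    print(forward_prob(["digit"] * 9))

Each step needs only the probabilities carried forward from the previous step, which is what keeps walking through
every byte of a file tractable.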

 

Long story short....

 

Asarium is able to dynamically step through every byte in a file and determine whether it contains "sensitive" data
based on probabilistic modeling of the bytes around each examined step. As it turns out, this sort of analysis is not
only far more accurate than regular expressions (allowing Asarium to search inside ALL files, not just certain ones
chosen by vendor assumptions), but it also performs these searches faster than regular-expression-based tools, with
lower impact on the computer running the search. Internal tests have shown Asarium running over 1,600% faster than
other tools, and external tests have shown far more dramatic results, which some people on this list can validate.

 

We just released Beta 2 of our 2.0 version and will be working with some institutions to test it. I expect 2.0 to be
GA within the next couple of weeks. If you'd like more information, check out our website or contact us.

 

-Gary

 

 



Gary Golomb
Founder, Managing Partner
443-536-5757
http://www.proventsure.com

----------------------------

 

Date:         Fri, 20 Apr 2007 17:56:34 -0400
From:         Wyman Miles
Subject:      Re: Sensitive data detection
Content-Type: text/plain;charset=iso-8859-1

Runs of 9 are an extremely difficult problem.  You can bracket them with a \D (nondigit) or \b (word break), which
sometimes helps.  Validation against the SSA area/group data helps a little.  If your institution draws heavily from a
predictable population, you can use the approach Colorado employs and write geographically dependent regexes.  But no,
there is no silver bullet.  SINs are a far easier problem as they're Luhn-derived, like CC#s.

 

 

 

I'd be interested in hearing people's feedback about the issues with high false positive rates and 9-digit SSNs in
evaluating these tools.  Most of the datastores I come across here store SSNs without hyphens, and creating regexes
for any combination of 9-digit numbers has always returned high false positives, so much so that it's borderline
useless.  There are some special rules for SSNs, but nothing like credit card Luhn checks.



At 11:15 AM 4/20/2007, Harold Winshel wrote:

We're also looking to use Cornell's Spider program for Rutgers-Camden Arts & Sciences faculty and staff.



At 01:52 PM 4/20/2007, you wrote:

On 4/20/07, Curt Wilson <[log in to unmask]> wrote:

Dear Educause security community,

For those that are currently working on a project involving the identification of sensitive data across campus, I have
some items of potential interest. I know that Tenable (Nessus) recently announced a module that can check (with host
credentials) a host for the presence of selected types of sensitive data, but what we have chosen is Proventsure's
Asarium software. We are in the early stages of testing, but it looks to be a tremendously helpful tool for such a
large task (depending upon the size of your institution).

Thanks Curt.  A freeware package that works in this same area is the Cornell Spider:

http://www.cit.cornell.edu/computer/security/tools/
http://www.cit.cornell.edu/computer/security/tools/spider-cap.html



--
Peter N. Wan ([log in to unmask])     258 Fourth Street, Rich 244
Senior Information Security Engineer        Atlanta, Georgia 30332-0700 USA
OIT, Information Security                   +1 (404) 894-7766  AIM: oitispnw
Georgia Institute of Technology             GT FIRST Team Representative



Harold Winshel
Computing and Instructional Technologies
Faculty of Arts & Sciences
Rutgers University, Camden Campus
311 N. 5th Street, Room B10 Armitage Hall
Camden NJ 08102
(856) 225-6669 (O)



---------------------------------------------------------------------------------------------------



Josh Drummond
Security Architect
Administrative Computing Services, University of California - Irvine
[log in to unmask]
949.824.9574



 

 

Wyman Miles
Senior Security Engineer
Cornell University

 

