Educause Security Discussion mailing list archives

Re: Sensitive data detection -- vendor response


From: Brad Judy <Brad.Judy () COLORADO EDU>
Date: Sat, 21 Apr 2007 17:02:31 -0600

I just wanted to note that I think most of us using Spider use a 'skip' list of file types that would not contain text-based data strings, rather than a 'check' list of certain files to look at.  Obviously, a 'check' list is much more likely than a 'skip' list to miss valid data, and I'd be rather surprised to see any software focus on that approach.  
 
In the end, NO tool can detect ALL sensitive data because the scope is much greater than SSNs and credit card numbers.  Any decent tool will identify most items that are easy to find as text strings (like SSNs and CC numbers), and the quality of the tool's technique will affect the number of false positives/negatives.  But the key is that it points out areas of interest, not just files.  Anyone who looks only at the files identified, and not their context, will likely miss other files that may be undetectable by any tool.  Where there's smoke, there's fire.  
 
BTW: I find this response leans a bit toward sales.  At the very least, I think the last two paragraphs should have been dropped.  
 
Brad Judy
 
IT Security Office
Information Technology Services
University of Colorado at Boulder
 
 


________________________________

        From: Gary Golomb [mailto:gary () PROVENTSURE COM] 
        Sent: Saturday, April 21, 2007 2:41 PM
        To: SECURITY () LISTSERV EDUCAUSE EDU
        Subject: Re: [SECURITY] Sensitive data detection -- vendor response
        
        

         

        Disclaimer: First of all, please don't view this as negativity toward the plethora of free tools available for searching for sensitive data. This post starts by focusing on regular expressions, which all of those tools happen to be based on, but more importantly, most commercial applications are based solely on regexes under the hood as well. In that sense, this post isn't taking a negative view of the free tools; it applauds them. If a vendor expects you to pay for something that is ultimately a regex engine with a pretty interface, you're better off using something free. 

         

        I wanted to address the discussion of searching for 9-digit SSNs without spaces or dashes. Even numbers in the format nnn-nn-nnnn produce a lot of false positives when searched for with regexes. Some tools have tried to mitigate this by creating a "white list" of file types to search, but that simply increases your chances of missing real data, dramatically so in many of the cases we've seen. You'd be surprised what types of files contain this data, and looking only at .doc, .xls, .csv, .txt, and the like barely scratches the surface. 
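
        To make the false-positive problem concrete, here is a minimal sketch (the sample strings and the pattern are made up for illustration) showing how a bare nine-digit regex happily matches order numbers and phone fragments alongside a real-looking SSN:

        import re

        # A naive pattern for unformatted SSNs: any run of exactly nine digits.
        # \b word boundaries keep it from matching inside longer digit runs.
        NINE_DIGITS = re.compile(r"\b\d{9}\b")

        sample = """
        Order 482617359 shipped on 20070421.
        Call 303555019 x12 for support.
        Employee SSN: 521849637
        """

        # Every nine-digit run matches, whether or not it is actually an SSN.
        for match in NINE_DIGITS.finditer(sample):
            print(match.group())
        # Prints 482617359, 303555019, and 521849637 -- only one of which
        # was meant to be an SSN.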

         

        While "Regular Expressions" are a wonderful language for defining text search parameters, they are designed for 
a very specific problem. As the name implies, they are good at looking for "regular" patterns. A good example of their 
limitations is shown by looking for email addresses. While this seems like an easy problem to address with regexs, 
email addresses can have a number of different formats and allowed characters. It is commonly known that regular 
expressions are a poor choice for finding data that has variance in it, and most web developers will attest to the 
lackluster performance of trying to validate email addresses with regexs. The relationship between false positives and 
false negatives with regex's (as with most things) is almost always inverse. That is, if you are not getting many false 
positives, it is because you are surely experiencing false negatives - ie: missing real data. This problem is further 
compounded when searching for "sensitive" data, which is far more variable in format than only email addresses. 
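
        The email-address point is easy to demonstrate. A quick sketch (the pattern and the addresses below are only illustrative) showing a typical "quick and dirty" regex rejecting perfectly valid addresses:

        import re

        # A typical hand-rolled email pattern of the kind many of us have written.
        SIMPLE_EMAIL = re.compile(r"^\w+@\w+\.\w{2,4}$")

        addresses = [
            "jane.doe+lists@cs.example.edu",  # valid, but rejected (dots, plus sign, subdomain)
            "o'brien@example.com",            # valid, but rejected (apostrophe in local part)
            "bob@example.com",                # valid and accepted
        ]

        for addr in addresses:
            verdict = "match" if SIMPLE_EMAIL.match(addr) else "no match"
            print(addr, "->", verdict)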

         

        You can compensate for these problems, because a powerful feature of regexes is the ability to add AND, OR, and NOT logic to a pattern. This is also the largest failure in engineering solutions around regexes, since the computational overhead on the regular expression engine grows dramatically with each additional piece of that logic. And that is before even considering quantifiers within the regexes (something some of the free tools use), whether lazy or greedy. Engineers commonly, and mistakenly, assume this is acceptable because computers are so powerful these days. Indeed, in testing on lab and development machines the impact seems small; however, when these solutions are moved onto real-world systems, it is not uncommon to see this type of application consume all the computational resources of the machine. I know people on this list have already experienced this on live systems with some of the free tools mentioned in this thread. 
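
        The overhead is easy to reproduce. The classic illustration is a pattern with nested quantifiers: it looks harmless, but its backtracking cost explodes on input that almost matches. (This is a generic illustration of regex-engine behavior, not a pattern taken from any of the tools mentioned here.)

        import re
        import time

        # Nested quantifiers: (a+)+ followed by a character that never appears.
        # A backtracking engine tries exponentially many ways to split the run
        # of 'a's before giving up.
        PATHOLOGICAL = re.compile(r"^(a+)+b$")

        for n in (18, 20, 22, 24):
            text = "a" * n + "c"          # almost matches, but never does
            start = time.perf_counter()
            PATHOLOGICAL.match(text)
            elapsed = time.perf_counter() - start
            # The time roughly quadruples with every two extra characters.
            print(n, round(elapsed, 3), "seconds")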

         

        One of the largest problems is that regexes ultimately find patterns. They do not find valid data; they find patterns, valid and invalid. If the purpose of the application is to meet industry fiduciary requirements or to maintain a legally auditable chain of custody for information, regexes are a poor choice because of the variability of confidence (or lack thereof) they introduce into the audit process. The need to add many validation functions on top of already taxing regex matching only makes matters worse. The fact is, if you are happy with regexes, there are a number of free applications that use regular expressions, have very decent file-decoding capabilities, and can even be scripted, so many universities would be better off using one of those free programs than paying for a solution that is itself based on regexes. 
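
        For completeness, "validation on top of the match" usually looks something like the sketch below, which layers the SSA's structural rules for SSNs (area not 000, 666, or 900-999; group not 00; serial not 0000) over a raw regex hit. This is illustrative glue code, not taken from any particular tool:

        import re

        CANDIDATE = re.compile(r"\b(\d{3})-?(\d{2})-?(\d{4})\b")

        def plausible_ssn(area: str, group: str, serial: str) -> bool:
            """Reject structurally impossible SSNs."""
            a = int(area)
            if a == 0 or a == 666 or a >= 900:
                return False
            if int(group) == 0 or int(serial) == 0:
                return False
            return True

        def find_ssn_candidates(text: str):
            for m in CANDIDATE.finditer(text):
                if plausible_ssn(*m.groups()):
                    yield m.group()

        # The regex alone flags both numbers; validation discards the impossible one.
        print(list(find_ssn_candidates("ids: 000-12-3456, 521-84-9637")))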

         

        How this relates to Asarium, as originally described....

         

        Asarium uses a layered approach to detection, allowing for the best possible balance of accuracy and performance. First, Asarium statistically analyzes files to determine information that is used later in selecting file-decoding algorithms. Asarium dynamically makes decisions about files, their structures, and their encodings as analysis progresses through a file (so you don't have to bother with whitelisting or blacklisting file types; it figures out whether there is decodable data within the file and works on that data). 
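
        For readers curious what "deciding from the content rather than the extension" can look like in principle, here is a toy sketch of generic content sniffing (magic bytes plus a printable-text ratio). It is purely illustrative and is not a description of Asarium's actual algorithm:

        # Toy content sniffing: decide how to treat a file from its bytes,
        # not its extension. Purely illustrative.
        MAGIC = {
            b"PK\x03\x04":       "zip container (docx/xlsx/odt, ...)",
            b"%PDF":             "pdf",
            b"\xd0\xcf\x11\xe0": "ole2 (legacy .doc/.xls)",
        }

        def classify(path: str) -> str:
            with open(path, "rb") as fh:
                head = fh.read(4096)
            for magic, kind in MAGIC.items():
                if head.startswith(magic):
                    return kind
            # Fall back to a printable-byte ratio to guess "decodable text".
            printable = sum(32 <= b < 127 or b in (9, 10, 13) for b in head)
            return "text-like" if head and printable / len(head) > 0.85 else "binary/other"

        # classify("some_attachment")  # hypothetical path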

         

        Asarium uses a number of mathematically advanced algorithms for analyzing the data in a file. At GWU, I originally focused on computational analysis of human proteins, using similar algorithms to find "hidden" and unknown patterns through the analysis of observable and known patterns. These algorithms are probabilistic in nature and are known for their applications in temporal pattern recognition, such as speech, handwriting, gesture recognition, and musical score following, as well as in bioinformatics. 

         

        As it turns out, the problem of accurately detecting "sensitive" information (including SSNs, and more specifically nine-digit numbers without formatting) is extremely similar to problems in bioinformatics and genetic analysis, since long strings of numbers and letters are very common in files, especially if the software is thorough enough to examine every file on a computer. Software that does not examine all files leaves its users ignoring data based on assumptions. 

         

        For the geeks in the crowd: one of the two main methods Asarium uses to analyze file data models the data with Hidden Markov Models. A Hidden Markov Model is a finite set of states, each of which is associated with a (generally multidimensional) probability distribution. Transitions among the states are governed by a set of probabilities called transition probabilities. In a particular state, an outcome or observation can be generated according to the associated probability distribution. In effect, this amounts to walking from one observed outcome (a series of bytes in a file) to the next, making a calculation at each point, and using only the calculated probability from the previous point to determine the most likely probability at the current point. The problem can be summarized as computing the probability of observing a sequence of length k,

        Y = y(0), y(1), ..., y(k − 1),

        which is given by

        P(Y) = Σ P(Y | X) P(X),

        where the sum runs over all possible hidden state sequences

        X = x(0), x(1), ..., x(k − 1).
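
        For anyone who wants to see the shape of that computation, the standard way to evaluate P(Y) without enumerating every hidden sequence is the forward algorithm, which does exactly the walk described above: carry the probability forward one observation at a time, using only the previous step. The sketch below is the textbook algorithm with made-up toy parameters; it is not Asarium's model.

        import numpy as np

        def forward(obs, start_p, trans_p, emit_p):
            """Forward algorithm: evaluates P(Y) = sum over hidden paths of
            P(Y|X)P(X), incrementally instead of enumerating every path."""
            alpha = start_p * emit_p[:, obs[0]]      # probability after y(0)
            for y in obs[1:]:
                # carry forward only the previous step's probabilities
                alpha = (alpha @ trans_p) * emit_p[:, y]
            return alpha.sum()                        # P(Y)

        # Toy two-state model over a binary observation alphabet (made-up numbers).
        start_p = np.array([0.6, 0.4])
        trans_p = np.array([[0.7, 0.3],
                            [0.4, 0.6]])
        emit_p  = np.array([[0.9, 0.1],               # state 0 mostly emits symbol 0
                            [0.2, 0.8]])              # state 1 mostly emits symbol 1
        print(forward([0, 1, 1, 0], start_p, trans_p, emit_p))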

         

        Long story short....

         

        Asarium is able to dynamically step through every byte in a file and determine whether it contains "sensitive" data based on probabilistic modeling of the bytes around each examined position. As it turns out, this sort of analysis is not only far more accurate than regular expressions (allowing Asarium to search inside ALL files, not just certain ones chosen by vendor assumptions), it also performs these searches faster than regular-expression-based tools, with lower impact on the computer running the search. Internal tests have shown Asarium running over 1,600% faster than other tools, and external tests have shown far more dramatic results, which some people on this list can validate. 

         

        We just released Beta 2 of our 2.0 version and will be working with some institutions to test it. I expect 2.0 to be GA in the next couple of weeks. If you'd like more information, definitely check out our website or contact us. 

         

        -Gary

         

         

         

        Gary Golomb 

        Founder, Managing Partner

         

        443-536-5757

         

        http://www.proventsure.com

         

         

         

        ----------------------------

         

        Date:         Fri, 20 Apr 2007 17:56:34 -0400
        From:         Wyman Miles
        Subject:      Re: Sensitive data detection
        Content-Type: text/plain;charset=iso-8859-1

         

        Runs of 9 are an extremely difficult problem.  You can bracket them with a
        \D (nondigit) or \b (word break), which sometimes helps.  Validation
        against the SSA area/group data helps a little.  If your institution draws
        heavily from a predictable population, you can use the approach Colorado
        employs and write geographically dependent regexes.  But no, there is no
        silver bullet.  SINs are a far easier problem as they're Luhn-derived,
        like CC#s.

         

         

         

        > I'd be interested in hearing people's feedback about the issues with
        > high false positive rates and 9 digit SSNs in evaluating these
        > tools.  Most of the datastores I come across here store SSNs without
        > hyphens, and creating regexes for any combination of 9 digit numbers
        > has always returned high false positives, so much so it's borderline
        > useless.  There are some special rules for SSNs, but nothing like
        > credit card Luhn checks.

        >
        > At 11:15 AM 4/20/2007, Harold Winshel wrote:
        >>We're also looking to use Cornell's Spider program for
        >>Rutgers-Camden Arts & Sciences faculty and staff.
        >>
        >>At 01:52 PM 4/20/2007, you wrote:

        >>>On 4/20/07, Curt Wilson <[log in to unmask]> wrote:
        >>>>Dear Educause security community,
        >>>>
        >>>>For those that are currently working on a project involving the
        >>>>identification of sensitive data across campus, I have some items of
        >>>>potential interest. I know that Tenable (Nessus) recently announced a
        >>>>module that can check (with host credentials) a host for the presence of
        >>>>selected types of sensitive data, but what we have chosen is
        >>>>Proventsure's Asarium software. We are in the early stages of testing,
        >>>>but it looks to be a tremendously helpful tool for such a large task
        >>>>(depending upon the size of your institution).

        >>>
        >>>Thanks Curt.  A freeware package that works in this same area is
        >>>the Cornell Spider:
        >>>
        >>>http://www.cit.cornell.edu/computer/security/tools/
        >>>http://www.cit.cornell.edu/computer/security/tools/spider-cap.html
        >>>
        >>>--
        >>>Peter N. Wan ([log in to unmask])     258 Fourth Street, Rich 244
        >>>Senior Information Security Engineer  Atlanta, Georgia 30332-0700 USA
        >>>OIT, Information Security             +1 (404) 894-7766 AIM: oitispnw
        >>>Georgia Institute of Technology       GT FIRST Team Representative

        >>
        >>Harold Winshel
        >>Computing and Instructional Technologies
        >>Faculty of Arts & Sciences
        >>Rutgers University, Camden Campus
        >>311 N. 5th Street, Room B10 Armitage Hall
        >>Camden NJ 08102
        >>(856) 225-6669 (O)

        >
        > ---------------------------------------------------------------------------------------------------
        >
        > Josh Drummond
        > Security Architect
        > Administrative Computing Services, University of California - Irvine
        > [log in to unmask]
        > 949.824.9574
        >

         

         

        Wyman Miles
        Senior Security Engineer
        Cornell University

         

