Educause Security Discussion mailing list archives
Re: Sensitive data detection -- vendor response
From: Brad Judy <Brad.Judy () COLORADO EDU>
Date: Sat, 21 Apr 2007 17:02:31 -0600
I just wanted to note that I think most of us using Spider use a "skip" list of file types that would not contain text-based data strings, rather than a "check" list of certain files to look at. A "check" list is obviously much more likely to miss valid data, and I'd be rather surprised to see any software focus on that approach.

In the end, NO tool can detect ALL sensitive data, because the scope is much greater than SSNs and credit card numbers. Any decent tool will identify most easily identifiable text-string items (like SSNs and CCs), and the quality of the tool's technique will affect the number of false positives and negatives. But the key is that it points out areas of interest, not just files. Anyone who looks only at the identified files and not their context will likely miss other files that may be undetectable by any tool. Where there's smoke, there's fire.

BTW: I find this response to lean a bit into sales. At the very least, I think the last two paragraphs should have been dropped.

Brad Judy
IT Security Office
Information Technology Services
University of Colorado at Boulder

________________________________
From: Gary Golomb [mailto:gary () PROVENTSURE COM]
Sent: Saturday, April 21, 2007 2:41 PM
To: SECURITY () LISTSERV EDUCAUSE EDU
Subject: Re: [SECURITY] Sensitive data detection -- vendor response

Disclaimer: First of all, please don't view this post as negative toward the plethora of free tools available for searching for sensitive data. It starts by focusing on regular expressions, which all of those tools happen to be based on, but more importantly, most commercial applications are based solely on regexes under the hood as well. In that sense, this post isn't taking a negative view of the free tools; it applauds them. If a vendor expects you to pay for something that is ultimately a regex engine with a pretty interface, you're better off using something free.
I wanted to address the discussion of searching for nine-digit SSNs without spaces or dashes. Even numbers in the format nnn-nn-nnnn will produce many false positives when searched for with regexes. Some tools mitigate this by creating a "white list" of file types to search, but that simply increases your chances of missing real data - dramatically, as we've seen in many cases. You'd be surprised what types of files contain this data; looking only at .doc, .xls, .csv, .txt, etc. barely scratches the surface.

While regular expressions are a wonderful language for defining text search parameters, they were designed for a very specific problem. As the name implies, they are good at looking for "regular" patterns. A good example of their limitations is searching for email addresses. While this seems like an easy problem to address with regexes, email addresses can have a number of different formats and allowed characters. Regular expressions are well known to be a poor choice for finding data with variance in it, and most web developers will attest to the lackluster results of validating email addresses with regexes. The relationship between false positives and false negatives with regexes (as with most things) is almost always inverse: if you are not getting many false positives, you are surely experiencing false negatives - i.e., missing real data. The problem is further compounded when searching for "sensitive" data, which is far more variable in format than email addresses.

Some of these problems can be compensated for, since a powerful feature of regexes is the ability to add AND, OR, and NOT logic. This is also the largest failure in engineering solutions around regexes, because the computational overhead this logic imposes on the regular expression engine grows exponentially with each use.
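To illustrate the false-positive problem with unformatted nine-digit SSNs, here is a minimal Python sketch. The pattern and sample strings are invented for illustration and are not taken from any of the tools discussed in this thread:

```python
import re

# A naive pattern for unformatted SSNs: any run of exactly nine digits.
ssn_pattern = re.compile(r"\b\d{9}\b")

text = (
    "Employee SSN: 219099999. "   # an SSN-shaped string
    "Invoice #123456789. "        # a nine-digit invoice number
    "Ref: 303555019 ext. 4."      # a nine-digit reference number
)

# The pattern cannot distinguish sensitive data from any other
# nine-digit number, so all three runs match.
matches = ssn_pattern.findall(text)
print(matches)
```

Every nine-digit number in a file, whatever its meaning, matches the same pattern, which is exactly the inverse false-positive/false-negative trade-off described above.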
And that doesn't even touch the cost of using quantifiers within the regexes (something even some of these free tools do), regardless of whether they are lazy or greedy quantifiers. Engineers commonly, and mistakenly, assume this is acceptable because computers are so powerful these days. Indeed, in testing on lab and development machines the impact seems small; however, when these solutions are moved to real-world systems, it is not uncommon to see these types of applications consume all the computational resources of the system. I know people on this list have already experienced this on live systems with some of the free tools mentioned in this thread.

One of the largest problems is that regexes ultimately find patterns. They do not find valid data - they find patterns, valid and invalid. If the purpose of the application is to meet industry fiduciary requirements or to maintain a legally auditable chain of custody for information, regexes are a very poor choice because of the variability of confidence (or lack thereof) they introduce into the audit process. The need to layer many validation functions on top of already taxing regex-based functions only makes matters worse. The fact is, if you are happy with regexes, there are a number of free applications available that use regular expressions, have very decent file decoding capabilities, and can even be scripted - so many universities would be better off using one of these free programs than paying for a solution based on regexes.

How this relates to Asarium, as originally described: Asarium uses a layered approach to detection, allowing for the best possible balance of accuracy and performance. First, Asarium statistically analyzes files to determine information that is used later in selecting file decoding algorithms.
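The "validation functions on top of regexes" idea above can be sketched concretely. The function below applies the published SSA structural rules to a nine-digit candidate (area not 000, 666, or 900-999; group not 00; serial not 0000); it is an illustrative post-filter, not how any tool in this thread actually works:

```python
def is_plausible_ssn(digits: str) -> bool:
    """Structural validity check for a nine-digit SSN candidate.

    Applies the published SSA rules: the area (first three digits)
    may not be 000, 666, or in 900-999; the group (middle two digits)
    may not be 00; the serial (last four digits) may not be 0000.
    """
    if len(digits) != 9 or not digits.isdigit():
        return False
    area, group, serial = digits[:3], digits[3:5], digits[5:]
    if area in ("000", "666") or area >= "900":
        return False
    if group == "00" or serial == "0000":
        return False
    return True

# A bare nine-digit regex would flag all of these; only the first
# survives structural validation.
print(is_plausible_ssn("219099999"))  # plausible
print(is_plausible_ssn("666123456"))  # area 666 is never issued
print(is_plausible_ssn("123004567"))  # group 00 is invalid
```

Note that even a candidate passing these rules is only *structurally* plausible - the check narrows the pattern, but still cannot prove the number is a real, assigned SSN, which is the author's core point about regex-based auditing.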
Asarium dynamically makes decisions about files, their structures, and their encodings as analysis progresses through a file (so you don't have to bother with white-listing or black-listing file types - it figures out whether there is decodable data within the file and works on that data). Asarium uses a number of mathematically advanced algorithms for analyzing data in a file. At GWU, I had originally focused on computational analysis of human proteins, using similar algorithms to find "hidden" and unknown patterns through the analysis of observable and known patterns. These algorithms are statistical and probabilistic in nature, and are known for their applications in temporal pattern recognition such as speech, handwriting, gesture recognition, and musical score following, as well as bioinformatics.

As it turns out, the problem of accurately detecting "sensitive" information (including SSNs, and more specifically nine-digit numbers without formatting) is extremely similar to problems in bioinformatics and genetic analysis, since long strings of numbers and letters are very common in files - especially if the software is thorough enough to examine all files on a computer. A tool that does not examine all files forces its users to ignore data based on assumptions.

For the geeks in the crowd: one of the two main methods Asarium uses to analyze file data involves modeling the data with Hidden Markov Models. A Hidden Markov Model is a finite set of states, each of which is associated with a (generally multidimensional) probability distribution. Transitions among the states are governed by a set of probabilities called transition probabilities. In a particular state, an outcome or observation can be generated according to the associated probability distribution.
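For concreteness, the textbook forward algorithm for an HMM computes exactly the quantity described here - P(Y), summed over hidden state sequences, one observation at a time. The two-state model and all probabilities below are invented for illustration; this is a standard construction, not a description of Asarium's internals:

```python
# Minimal forward-algorithm sketch for a two-state HMM.
# States and probabilities are illustrative only.
states = ("sensitive", "benign")

start = {"sensitive": 0.1, "benign": 0.9}          # initial probabilities
trans = {                                          # transition probabilities
    "sensitive": {"sensitive": 0.8, "benign": 0.2},
    "benign":    {"sensitive": 0.1, "benign": 0.9},
}
emit = {                                           # emission probabilities
    "sensitive": {"digit": 0.9, "other": 0.1},
    "benign":    {"digit": 0.3, "other": 0.7},
}

def forward(observations):
    """Return P(Y): the probability of the observation sequence Y,
    summed over all hidden state sequences X."""
    # alpha[s] = probability of the prefix seen so far, ending in state s
    alpha = {s: start[s] * emit[s][observations[0]] for s in states}
    for obs in observations[1:]:
        # Each step uses only the previous step's probabilities,
        # as the email describes.
        alpha = {
            s: sum(alpha[p] * trans[p][s] for p in states) * emit[s][obs]
            for s in states
        }
    return sum(alpha.values())

print(forward(["digit", "digit", "digit"]))
```

The per-step update is why the cost grows linearly in the file length rather than exponentially in the number of hidden sequences.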
In effect, they are a method of walking from one observed outcome (a series of bytes in a file) to the next, making a calculation at each point, and using only the calculated probability from the previous point in determining the most likely probability at the current point. The problem can be summarized as computing the probability of observing a sequence of length k,

    Y = y(0), y(1), ..., y(k - 1)

where the sum runs over all possible hidden node sequences

    X = x(0), x(1), ..., x(k - 1)

and is given by:

    P(Y) = Σ P(Y | X) P(X)

Long story short: Asarium is able to dynamically step through every byte in a file and determine whether it contains "sensitive" data based on probabilistic modeling of the bytes around each examined step. As it turns out, this sort of analysis is not only far more accurate than regular expressions (allowing Asarium to search inside ALL files, not just certain ones chosen by vendor assumptions), but it also performs these searches faster than regular-expression-based tools, with lower impact on the computer running the search. Internal tests have shown Asarium running over 1,600% faster than other tools, and external tests have shown far more dramatic results - which some people on this list can validate.

We just released Beta-2 of our 2.0 version and will work with some institutions in testing it. I expect 2.0 to be GA in the next couple of weeks. If you'd like more information, definitely check out our website, or contact us.

-Gary

Gary Golomb
Founder, Managing Partner
443-536-5757
http://www.proventsure.com

----------------------------
Date: Fri, 20 Apr 2007 17:56:34 -0400
From: Wyman Miles
Subject: Re: Sensitive data detection

Runs of 9 are an extremely difficult problem. You can bracket them with a \D (non-digit) or \b (word break), which sometimes helps. Validation against the SSA area/group data helps a little.
If your institution draws heavily from a predictable population, you can use the approach Colorado employs and write geographically dependent regexes. But no, there is no silver bullet. SINs are a far easier problem, as they're Luhn-derived, like CC#s.

> I'd be interested in hearing people's feedback about the issues with
> high false-positive rates and nine-digit SSNs in evaluating these
> tools. Most of the datastores I come across here store SSNs without
> hyphens, and creating regexes for any combination of nine-digit numbers
> has always returned high false positives, so much so that it's
> borderline useless. There are some special rules for SSNs, but nothing
> like credit card Luhn checks.
>
> At 11:15 AM 4/20/2007, Harold Winshel wrote:
>> We're also looking to use Cornell's Spider program for
>> Rutgers-Camden Arts & Sciences faculty and staff.
>>
>> At 01:52 PM 4/20/2007, you wrote:
>>> On 4/20/07, Curt Wilson <[log in to unmask]> wrote:
>>>> Dear Educause security community,
>>>>
>>>> For those that are currently working on a project involving the
>>>> identification of sensitive data across campus, I have some items of
>>>> potential interest. I know that Tenable (Nessus) recently announced a
>>>> module that can check (with host credentials) a host for the presence
>>>> of selected types of sensitive data, but what we have chosen is
>>>> Proventsure's Asarium software. We are in the early stages of testing,
>>>> but it looks to be a tremendously helpful tool for such a large task
>>>> (depending upon the size of your institution).
>>>
>>> Thanks Curt. A freeware package that works in this same area is
>>> the Cornell Spider:
>>>
>>> http://www.cit.cornell.edu/computer/security/tools/
>>> http://www.cit.cornell.edu/computer/security/tools/spider-cap.html
>>>
>>> --
>>> Peter N. Wan ([log in to unmask])          AIM: oitispnw
>>> Senior Information Security Engineer
>>> OIT, Information Security
>>> Georgia Institute of Technology            GT FIRST Team Representative
>>> 258 Fourth Street, Rich 244
>>> Atlanta, Georgia 30332-0700 USA
>>> +1 (404) 894-7766
>>
>> Harold Winshel
>> Computing and Instructional Technologies
>> Faculty of Arts & Sciences
>> Rutgers University, Camden Campus
>> 311 N. 5th Street, Room B10 Armitage Hall
>> Camden NJ 08102
>> (856) 225-6669 (O)
>
> ---------------------------------------------------------------------
> Josh Drummond
> Security Architect
> Administrative Computing Services, University of California - Irvine
> [log in to unmask]
> 949.824.9574

Wyman Miles
Senior Security Engineer
Cornell University
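For reference, the Luhn check Wyman mentions (the reason SINs and credit card numbers are easier to validate than SSNs) is the standard mod-10 checksum. A minimal sketch of the textbook algorithm, not taken from any tool in this thread:

```python
def luhn_valid(number: str) -> bool:
    """Standard Luhn (mod-10) checksum test, as used for credit card
    numbers and Canadian SINs."""
    digits = [int(c) for c in number if c.isdigit()]
    checksum = 0
    # Double every second digit from the right; subtract 9 if the
    # doubled value exceeds 9, then sum everything.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

print(luhn_valid("4111111111111111"))  # True: a well-known test card number
print(luhn_valid("4111111111111112"))  # False: checksum broken
```

Because SSNs carry no such check digit, a scanner must fall back on the weaker structural rules and context discussed earlier in the thread.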
Current thread:
- Re: Sensitive data detection -- vendor response Gary Golomb (Apr 21)
- <Possible follow-ups>
- Re: Sensitive data detection -- vendor response Brad Judy (Apr 21)
- Re: Sensitive data detection -- vendor response Russell Fulton (Apr 21)
- Re: Sensitive data detection -- vendor response Roger Safian (Apr 23)
- Re: Sensitive data detection -- vendor response Gary Golomb (Apr 24)