Educause Security Discussion mailing list archives

Re: Sensitive data detection


From: Brad Judy <Brad.Judy () COLORADO EDU>
Date: Sat, 21 Apr 2007 17:04:15 -0600

I definitely recommend a selective SSN regex like the one we are using for
non-delimited SSNs.  If you aren't sure which numbers to use, see if you
can dump the first three digits of your student SSNs and do a quick
analysis to find the most common.  While select schools may be pretty
balanced, most of us have a heavy weighting toward local residents.
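
A minimal sketch of the prefix analysis described above, assuming the SSNs are available as nine-digit strings. The function names and the sample values are hypothetical, made up for illustration; this is not the regex actually in use at Colorado.

```python
import re
from collections import Counter

NINE_ASCII_DIGITS = re.compile(r"[0-9]{9}")

def common_prefixes(ssns, top_n=5):
    """Return the top_n most frequent three-digit prefixes among the SSNs."""
    counts = Counter(s[:3] for s in ssns if NINE_ASCII_DIGITS.fullmatch(s))
    return [prefix for prefix, _ in counts.most_common(top_n)]

def selective_ssn_regex(prefixes):
    """Build a regex for undelimited nine-digit runs starting with a known prefix."""
    alternation = "|".join(sorted(re.escape(p) for p in prefixes))
    # The lookarounds reject runs that sit inside a longer string of digits.
    return re.compile(r"(?<![0-9])(?:" + alternation + r")[0-9]{6}(?![0-9])")

# Made-up sample values, for illustration only.
sample = ["521001234", "521119876", "522334455",
          "522000111", "123456789", "521000001"]
prefixes = common_prefixes(sample, top_n=2)
pattern = selective_ssn_regex(prefixes)
```

With this sample, `pattern` matches nine-digit runs beginning with the two most common prefixes and ignores the rest. The cost is that SSNs with uncommon prefixes are missed.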

I also wouldn't touch regex without using boundary statements on those
expressions.  Just adding a \b or \D to the start and end makes a huge
difference in reducing false positives if you aren't already using it.  
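
To illustrate the effect of boundaries, here is a small sketch comparing an unbounded nine-digit pattern with two bounded variants. The sample text is made up. One assumption to note: instead of a literal \D at each end, the third pattern uses lookarounds, which express "no digit on either side" without consuming a character and so also work at the very start or end of the input.

```python
import re

unbounded = re.compile(r"\d{9}")
word_bounded = re.compile(r"\b\d{9}\b")
digit_bounded = re.compile(r"(?<!\d)\d{9}(?!\d)")

# Made-up text: a 13-digit order number, an underscore-prefixed
# reference, and a standalone nine-digit run.
text = "order 1234567890123 ref_123456789 ssn 987654321."

unbounded_hits = unbounded.findall(text)
word_hits = word_bounded.findall(text)
digit_hits = digit_bounded.findall(text)
```

The unbounded pattern picks nine digits out of the longer order number. The \b version drops that hit, and also drops the run after the underscore, because an underscore counts as a word character. The digit-boundary version keeps the underscore case and still rejects the longer number.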

Interestingly, one of the false positives we can't reliably shake is
actually hyphenated: Japanese telephone numbers, which appear to be
formatted nnn-nn-nnnn.

Brad Judy

IT Security Office
Information Technology Services
University of Colorado at Boulder

 

Date:         Fri, 20 Apr 2007 17:56:34 -0400
From:         Wyman Miles
Subject:      Re: Sensitive data detection


Runs of 9 are an extremely difficult problem.  You can bracket them with a
\D (nondigit) or \b (word break), which sometimes helps.  Validation
against the SSA area/group data helps a little.  If your institution draws
heavily from a predictable population, you can use the approach Colorado
employs and write geographically dependent regexes.  But no, there is no
silver bullet.  SINs are a far easier problem as they're Luhn-derived,
like CC#s.
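
For reference, a minimal sketch of the Luhn check that makes card numbers and SINs easier to validate, assuming "SINs" refers to Canadian Social Insurance Numbers. The function name is hypothetical, and the inputs in the usage note are commonly published test values, not real accounts.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    cleaned = number.replace(" ", "").replace("-", "")
    if not (cleaned.isascii() and cleaned.isdigit()):
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for index, char in enumerate(reversed(cleaned)):
        digit = int(char)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return total % 10 == 0
```

For example, `luhn_valid("4111 1111 1111 1111")` and `luhn_valid("046 454 286")` both return True, while changing the last digit of either makes the check fail. Running a candidate match through a check like this discards most random digit runs, which is the option undelimited SSNs lack.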

 

 

 

I'd be interested in hearing people's feedback about the issues with
high false positive rates and 9-digit SSNs in evaluating these
tools.  Most of the datastores I come across here store SSNs without
hyphens, and creating regexes for any combination of 9-digit numbers
has always returned high false positives, so much so that it's borderline
useless.  There are some special rules for SSNs, but nothing like
credit card Luhn checks.
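
The special rules are not spelled out in the thread. As an assumption, the sketch below applies the widely documented structural ones: the area (first three digits) is never 000, 666, or 900-999, the group (next two) is never 00, and the serial (last four) is never 0000. The function name and the sample text are hypothetical.

```python
import re

# Nine ASCII digits with no digit on either side, split into
# area (3 digits), group (2 digits), and serial (4 digits).
NINE_DIGITS = re.compile(r"(?<![0-9])([0-9]{3})([0-9]{2})([0-9]{4})(?![0-9])")

def plausible_ssns(text):
    """Return nine-digit runs in `text` that pass the structural SSN rules."""
    hits = []
    for match in NINE_DIGITS.finditer(text):
        area, group, serial = match.groups()
        if area in ("000", "666") or area[0] == "9":
            continue  # areas that are never assigned
        if group == "00" or serial == "0000":
            continue  # all-zero group or serial is never assigned
        hits.append(match.group(0))
    return hits

# Made-up text, for illustration only.
sample_text = "a 000123456 b 666123456 c 912345678 d 123006789 e 123450000 f 123456789"
found = plausible_ssns(sample_text)
```

These rules only remove a small share of random nine-digit runs, which is consistent with the point above that they are far weaker than a Luhn check.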



At 11:15 AM 4/20/2007, Harold Winshel wrote:

We're also looking to use Cornell's Spider program for
Rutgers-Camden Arts & Sciences faculty and staff.



At 01:52 PM 4/20/2007, you wrote:

On 4/20/07, Curt Wilson <[log in to unmask]> wrote:

Dear Educause security community,



For those that are currently working on a project involving the
identification of sensitive data across campus, I have some items of
potential interest. I know that Tenable (Nessus) recently announced a
module that can check (with host credentials) a host for the presence of
selected types of sensitive data, but what we have chosen is
Proventsure's Asarium software. We are in the early stages of testing,
but it looks to be a tremendously helpful tool for such a large task
(depending upon the size of your institution).



Thanks, Curt.  A freeware package that works in this same area is
the Cornell Spider:



http://www.cit.cornell.edu/computer/security/tools/

http://www.cit.cornell.edu/computer/security/tools/spider-cap.html



--

Peter N. Wan ([log in to unmask])           258 Fourth Street, Rich 244
Senior Information Security Engineer        Atlanta, Georgia 30332-0700 USA
OIT, Information Security                   +1 (404) 894-7766 AIM: oitispnw
Georgia Institute of Technology             GT FIRST Team Representative



Harold Winshel
Computing and Instructional Technologies
Faculty of Arts & Sciences
Rutgers University, Camden Campus
311 N. 5th Street, Room B10 Armitage Hall
Camden NJ 08102
(856) 225-6669 (O)




------------------------------------------------------------------------



Josh Drummond
Security Architect
Administrative Computing Services, University of California - Irvine
[log in to unmask]
949.824.9574



 

 

Wyman Miles
Senior Security Engineer
Cornell University

 
