Educause Security Discussion mailing list archives

Re: Spider scripts?


From: randy marchany <marchany () VT EDU>
Date: Thu, 30 Apr 2009 16:30:43 -0400

I would recommend using both Spider and our tool, Find_SSN. We've done
a lot of work making sure SSNs are checked for validity (we eliminate
deceased SSNs, etc.). We have a test site that you can use to verify
your numbers.

The Online Number Validation Test is at https://black.cirt.vt.edu/valid_ssn
Generate some test numbers and paste them into the input window.
This process helps users understand why some numbers are valid while
others are not. Cornell and VT did a preconference seminar on the PII
search tools. You can find the presentation on the Educause website.
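
As a rough illustration of what a structural validity check can look
like, here is a minimal Python sketch. It is not the Find_SSN
implementation; the function name is_plausible_ssn is hypothetical, and
it applies only the well-known structural rules (area 000, 666, and
900-999 are not used for SSNs; group 00 and serial 0000 are never
issued). Checks against issued ranges or deceased records need external
data and are not attempted here.

```python
import re


def is_plausible_ssn(candidate: str) -> bool:
    """Structural check only; says nothing about whether the number was issued."""
    # Drop hyphens and whitespace so delimited and undelimited input are treated alike.
    digits = re.sub(r"[-\s]", "", candidate)
    if not re.fullmatch(r"[0-9]{9}", digits):
        return False

    area, group, serial = digits[:3], digits[3:5], digits[5:]

    # Area numbers 000, 666, and 900-999 are not used for SSNs.
    if area == "000" or area == "666" or area >= "900":
        return False
    # Group 00 and serial 0000 are never issued.
    if group == "00" or serial == "0000":
        return False
    return True


if __name__ == "__main__":
    for sample in ("123-45-6789", "000-12-3456", "666123456", "123-45-0000"):
        print(sample, is_plausible_ssn(sample))
```

A check like this only discards numbers that cannot be SSNs; anything
it accepts still needs the stricter tests a dedicated tool applies.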

-Randy Marchany


On Thu, Apr 30, 2009 at 3:56 PM, Brad Judy <win-hied () bradjudy com> wrote:
I'll second some of Mike's recommendations, but double the warning on
searching for only delimited items.

I can say (from an unfortunate amount of personal experience) that large
repositories of private data are almost always non-delimited.  It makes
perfect sense too - people big on data entry can probably quote productivity
percentages off the top of their head for the difference between typing nine
characters versus eleven (or 16 versus 19).  The delimitation doesn't serve
a purpose except for display back to human eyes, where hyphens can be added
programmatically.

Delimited numbers are most common in individual communication (e-mail, Word
docs, etc) where only one or two elements are present.

My focus on such searching has generally been on finding the larger
spreadsheets, databases, text output files, etc.  So, I have promoted the
use of Mike's #3 if you want to reduce false positives.  Find the most
common starting numbers for your organization (for SSNs, it's likely
whatever your state issued 18-22 years ago) and keep checking for
non-delimited data.  I used the same approach for campus-issued credit cards
- they likely all start with the same first six digits (the issuing bank
code).

Combining this approach with the concept in #1 (scanning only certain file
types), you can have a very good detection rate for lists of 10+ elements
with minimal false positives.  Don't forget to include some old file types
in your boundaries.  It's likely some of the forgotten private information
DBs are in an old FoxPro database (or something similar).
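
The approach described above (known prefixes, non-delimited numbers, a
limited set of file types, and a threshold of 10+ elements) can be
sketched in Python roughly as follows. This is only an illustration
under stated assumptions: the prefix values, the extension list, and
all function names are hypothetical placeholders, not Spider settings.
It works on text that has already been read from a file; extracting
text from binary document formats is out of scope.

```python
import re
from pathlib import Path

# Hypothetical placeholders: substitute the prefixes common in your own data.
SSN_AREA_PREFIXES = ("123", "124")   # assumed three-digit SSN area numbers
CARD_BIN_PREFIXES = ("412345",)      # assumed six-digit issuing bank code
# Assumed whitelist; .dbf is included to cover old FoxPro-style databases.
SCAN_EXTENSIONS = {".txt", ".csv", ".dbf"}
MIN_HITS = 10                        # flag files that look like lists of 10+ elements


def _pattern(prefixes, total_len):
    """Match an undelimited digit run of total_len that starts with a known prefix.

    Assumes every prefix in the tuple has the same length.
    """
    alternatives = "|".join(re.escape(p) for p in prefixes)
    rest = total_len - len(prefixes[0])
    # The lookarounds keep the match from landing inside a longer digit run.
    return re.compile(rf"(?<![0-9])(?:{alternatives})[0-9]{{{rest}}}(?![0-9])")


SSN_RE = _pattern(SSN_AREA_PREFIXES, 9)
CARD_RE = _pattern(CARD_BIN_PREFIXES, 16)


def should_scan(path):
    """Apply the file-type whitelist by extension (case-insensitive)."""
    return Path(path).suffix.lower() in SCAN_EXTENSIONS


def count_hits(text):
    """Count prefix-matching undelimited SSN and card candidates in text."""
    return {"ssn": len(SSN_RE.findall(text)), "card": len(CARD_RE.findall(text))}


def is_suspect(text):
    """True when either candidate type reaches the list-size threshold."""
    return max(count_hits(text).values()) >= MIN_HITS
```

Requiring both a prefix and a minimum count is what keeps false
positives down here: a stray nine-digit number will rarely start with
the right prefix, and it will almost never appear ten times in a file.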

Brad Judy

--------------------------------------------------
From: "Mike Lococo" <mike.lococo () NYU EDU>
Sent: Wednesday, April 29, 2009 5:10 PM
To: <SECURITY () LISTSERV EDUCAUSE EDU>
Subject: Re: [SECURITY] Spider scripts?

We are currently working with Cornell's Spider to develop a process and
tool for our IT techs and others to scan computers for confidential data
(SSNs and CC#s).  Has anyone refined the scripts that Spider uses to help
lower the incidence of false positives?  If so, would you be willing to
share them with us?  You can reply to me offline if you like.  Thanks for
your help in this.

I haven't heard too many complaints about FP's recently, but the standard
tuning list I've heard mentioned in the past is...

0) Utilize the "Mark as false positive" feature so bad hits don't show up
in future scans.

1) Try Spider 4 (2008), it moves from a filetype blacklist to a filetype
whitelist.  By scanning only likely document types (word, excel, pdf, email,
etc), FP's are cut way down.

2) Scan the user's profile directory instead of the whole drive.  You'll
miss stuff stored in temp directories, but will cut down on FP's
significantly.  (Spider 2008 does this by default)

3) If you have a common area prefix for your organization, require it. You
might miss emails with 1-2 SSN's in them (if those 1-2 aren't from the right
prefixes) but you'll find the big spreadsheets with 2000 SSN's in them
(because at least some of those will be from the right prefixes).

4) Use a custom regex that requires delimiters.  False positives will go
WAY down, at the cost of missing undelimited strings which several folks
have found to be very common.  Older spiders had several regex options, some
of which required delimiters and some of which didn't.  The latest version
seems to have standardized on matching with or without delimiters and
offering no option for alternate behavior.
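
To make the trade-off in item 4 concrete, here is a small Python sketch
comparing a pattern that requires delimiters with one that accepts
either form. These are illustrative patterns written for this example,
not the expressions Spider ships with; the names DELIMITED_SSN,
EITHER_SSN, and count_matches are hypothetical.

```python
import re

# Requires a delimiter, and the same one (hyphen or space) in both positions.
DELIMITED_SSN = re.compile(r"(?<![0-9])[0-9]{3}([- ])[0-9]{2}\1[0-9]{4}(?![0-9])")

# Accepts delimited or undelimited forms.
EITHER_SSN = re.compile(r"(?<![0-9])[0-9]{3}[- ]?[0-9]{2}[- ]?[0-9]{4}(?![0-9])")


def count_matches(pattern, text):
    """Count non-overlapping matches of pattern in text."""
    return sum(1 for _ in pattern.finditer(text))


if __name__ == "__main__":
    sample = "memo: 123-45-6789\nexport: 123456789\norder id 9876543210\n"
    print("delimited only:", count_matches(DELIMITED_SSN, sample))
    print("either form:   ", count_matches(EITHER_SSN, sample))
```

The delimited pattern ignores the undelimited export line entirely,
which is the miss described above; the permissive pattern catches it
but will also fire on any unrelated nine-digit number.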

Thanks,
Mike Lococo


