Educause Security Discussion mailing list archives

Re: Spider scripts?


From: Brad Judy <win-hied () BRADJUDY COM>
Date: Thu, 30 Apr 2009 15:56:57 -0400

I'll second some of Mike's recommendations, but double the warning on
searching for only delimited items.

I can say (from an unfortunate amount of personal experience) that large
repositories of private data are almost always non-delimited.  It makes
perfect sense, too: people who do a lot of data entry can probably quote,
off the top of their heads, the productivity difference between typing nine
characters versus eleven (or 16 versus 19).  The delimiters serve no purpose
except display back to human eyes, where hyphens can be added
programmatically.

Delimited numbers are most common in individual communications (e-mail, Word
docs, etc.) where only one or two elements are present.

My focus on such searching has generally been on finding the larger
spreadsheets, databases, text output files, etc.  So, I have promoted the
use of Mike's #3 if you want to reduce false positives.  Find the most
common starting numbers for your organization (for SSNs, it's likely
whatever your state issued 18-22 years ago) and keep checking for
non-delimited data.  I used the same approach for campus-issued credit
cards - they likely all start with the same first six digits (the issuing
bank code).
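Brad's prefix-anchored approach can be sketched in a couple of regexes.  This
is a minimal illustration, not Spider's actual patterns; the area numbers
521-524 and the BIN 448915 are made-up placeholders you would replace with
your institution's real values:

```python
import re

# Hypothetical examples: substitute your state's common SSN area numbers
# and your campus card's six-digit BIN (the issuing-bank code).
AREA_PREFIXES = r"(?:521|522|523|524)"  # assumed state-issued area numbers
CARD_BIN = r"448915"                    # assumed issuing-bank prefix

# Non-delimited 9-digit SSNs starting with a known area number.
# The digit-boundary lookarounds keep the match from landing mid-number.
ssn_re = re.compile(r"(?<!\d)" + AREA_PREFIXES + r"\d{6}(?!\d)")

# Non-delimited 16-digit card numbers starting with the campus BIN.
card_re = re.compile(r"(?<!\d)" + CARD_BIN + r"\d{10}(?!\d)")

text = "id 521446789 order 4489151234567890 phone 3035551212"
print(ssn_re.findall(text))   # ['521446789']
print(card_re.findall(text))  # ['4489151234567890']
```

Note that the 10-digit phone number is ignored: requiring both the prefix and
an exact digit count is what keeps the false-positive rate down.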

Combining this approach with the concept in #1 (scanning only certain file
types), you can get a very good detection rate for lists of 10+ elements
with minimal false positives.  Don't forget to include some old file types
in your whitelist.  It's likely some of the forgotten private-information
DBs are in an old FoxPro database (or something similar).
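Putting the two ideas together (a file-type whitelist plus prefix-anchored
matching, reporting only files with many hits) might look like the sketch
below.  The extension list, area numbers, and the 10-hit threshold are all
illustrative assumptions:

```python
import os
import re

# Illustrative whitelist -- note the old formats (.dbf is FoxPro/dBase).
SCAN_EXTENSIONS = {".xls", ".xlsx", ".csv", ".txt", ".mdb", ".dbf"}

# Assumed area-number prefixes; see the earlier discussion.
ssn_re = re.compile(rb"(?<!\d)(?:521|522|523|524)\d{6}(?!\d)")

def scan_tree(root, min_hits=10):
    """Yield (path, hit_count) for whitelisted files with >= min_hits matches."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if os.path.splitext(name)[1].lower() not in SCAN_EXTENSIONS:
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as fh:
                    hits = len(ssn_re.findall(fh.read()))
            except OSError:
                continue  # unreadable file; skip rather than abort the scan
            if hits >= min_hits:
                yield path, hits
```

Reporting only files that cross the threshold is what surfaces the big
spreadsheets and exports while leaving one-off matches in e-mail alone.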

Brad Judy

--------------------------------------------------
From: "Mike Lococo" <mike.lococo () NYU EDU>
Sent: Wednesday, April 29, 2009 5:10 PM
To: <SECURITY () LISTSERV EDUCAUSE EDU>
Subject: Re: [SECURITY] Spider scripts?

We are currently working with Cornell's Spider to develop a process and tool
for our IT techs and others to scan computers for confidential data (SSNs
and CC#s).  Has anyone refined the scripts that Spider uses to help lower
the incidence of false positives?  If so, would you be willing to share them
with us?  You can reply to me offline if you like.  Thanks for your help in
this.

I haven't heard too many complaints about FP's recently, but the standard
tuning list I've heard mentioned in the past is...

0) Utilize the "Mark as false positive" feature so bad hits don't show up
in future scans.

1) Try Spider 4 (2008); it moves from a filetype blacklist to a filetype
whitelist.  By scanning only likely document types (Word, Excel, PDF,
email, etc.), FP's are cut way down.

2) Scan the user's profile directory instead of the whole drive.  You'll
miss stuff stored in temp directories, but will cut down on FP's
significantly.  (Spider 2008 does this by default)

3) If you have a common area prefix for your organization, require it. You
might miss emails with 1-2 SSN's in them (if those 1-2 aren't from the
right prefixes) but you'll find the big spreadsheets and 2000 SSN's in
them (because at least some of those will be from the right prefixes).

4) Use a custom regex that requires delimiters.  False positives will go
WAY down, at the cost of missing undelimited strings which several folks
have found to be very common.  Older spiders had several regex options,
some of which required delimiters and some of which didn't.  The latest
version seems to have standardized on matching with or without delimiters
and offering no option for alternate behavior.
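The tradeoff in #4 can be shown with two side-by-side patterns.  These are
illustrative regexes, not the ones Spider ships; the sample SSNs are fake:

```python
import re

# Strict: requires hyphen delimiters -> few false positives, but misses
# the undelimited numbers that dominate bulk data stores.
strict_ssn = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# Permissive: optional delimiters -> catches both forms, at the cost of
# matching any bare 9-digit run (more noise).
permissive_ssn = re.compile(r"(?<!\d)\d{3}[- ]?\d{2}[- ]?\d{4}(?!\d)")

sample = "memo: 521-44-6789  export: 521446789"
print(strict_ssn.findall(sample))      # ['521-44-6789']
print(permissive_ssn.findall(sample))  # ['521-44-6789', '521446789']
```

The strict pattern finds only the delimited number from the memo; the
permissive one also catches the undelimited number in the export, which is
exactly the case Brad warns about above.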

Thanks,
Mike Lococo
