Educause Security Discussion mailing list archives

Re: Spider scripts?


From: Mike Lococo <mike.lococo () NYU EDU>
Date: Wed, 29 Apr 2009 17:10:28 -0400

> We are currently working with Cornell's Spider to develop a process and tool
> for our IT techs and others to scan computers for confidential data (SSNs
> and CC#s).  Has anyone refined the scripts that Spider uses to help lower
> the incidence of false positives?  If so, would you be willing to share them
> with us?  You can reply to me offline if you like.  Thanks for your help in
> this.

I haven't heard too many complaints about false positives (FPs) recently,
but the standard tuning list I've heard mentioned in the past is...

0) Utilize the "Mark as false positive" feature so bad hits don't show
up in future scans.

1) Try Spider 4 (2008); it moves from a filetype blacklist to a filetype
whitelist.  By scanning only likely document types (Word, Excel, PDF,
email, etc.), FPs are cut way down.
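The whitelist idea can be sketched in a few lines. This is a hypothetical
filter, not Spider's actual code, and the extension list is an assumption
about which document types are "likely":

```python
import os

# Hypothetical whitelist of likely document types, in the spirit of
# Spider 2008's approach (the real tool's list may differ).
DOC_EXTENSIONS = {".doc", ".docx", ".xls", ".xlsx", ".pdf",
                  ".txt", ".eml", ".msg"}

def should_scan(path):
    """Scan only files whose extension is on the whitelist."""
    return os.path.splitext(path)[1].lower() in DOC_EXTENSIONS

print(should_scan("budget.xlsx"))   # a likely document type
print(should_scan("kernel32.dll"))  # skipped: binaries are mostly FPs
```

A blacklist has to anticipate every noisy filetype; a whitelist only has
to name the handful of formats where real SSNs actually live.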

2) Scan the user's profile directory instead of the whole drive.  You'll
miss stuff stored in temp directories, but will cut down on FPs
significantly.  (Spider 2008 does this by default)

3) If your organization's SSNs share a common set of area-number
prefixes, require them in the match.  You might miss emails containing
only 1-2 SSNs (if those 1-2 aren't from the right prefixes), but you'll
still find the big spreadsheets with 2000 SSNs in them (because at least
some of those will be from the right prefixes).
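A prefix-restricted pattern might look like the sketch below. The area
prefixes here are placeholders, not real assignments; you would substitute
the prefixes common at your institution:

```python
import re

# Hypothetical area-number prefixes for the organization (placeholders).
AREA_PREFIXES = ("110", "111", "112")

# Only match SSNs whose first three digits are in the prefix set.
SSN_WITH_PREFIX = re.compile(
    r"\b(?:%s)-\d{2}-\d{4}\b" % "|".join(AREA_PREFIXES)
)

text = "ids: 110-23-4567, 987-65-4321"
print(SSN_WITH_PREFIX.findall(text))  # only the 110-prefixed number
```

Any nine-digit string with an out-of-range prefix, which is the bulk of
the false positives, simply never matches.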

4) Use a custom regex that requires delimiters.  False positives will go
WAY down, at the cost of missing undelimited strings, which several folks
have found to be very common.  Older Spider versions had several regex
options, some of which required delimiters and some of which didn't.  The
latest version seems to have standardized on matching with or without
delimiters, offering no option for alternate behavior.
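For reference, a delimiter-requiring pattern is a one-liner. This is a
minimal sketch, not the regex any Spider version actually ships:

```python
import re

# Match 123-45-6789 or 123 45 6789, but not the undelimited 123456789.
DELIMITED_SSN = re.compile(r"\b\d{3}[- ]\d{2}[- ]\d{4}\b")

samples = ["123-45-6789", "123 45 6789", "123456789"]
print([s for s in samples if DELIMITED_SSN.search(s)])
```

The trade-off is exactly as described above: arbitrary nine-digit values
(order numbers, phone numbers, timestamps) stop matching, but so does any
real SSN someone typed without separators.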

Thanks,
Mike Lococo
