Educause Security Discussion mailing list archives
Re: Spider scripts?
From: Brad Judy <win-hied () BRADJUDY COM>
Date: Thu, 30 Apr 2009 15:56:57 -0400
I'll second some of Mike's recommendations, but double the warning on searching for only delimited items. I can say (from an unfortunate amount of personal experience) that large repositories of private data are almost always non-delimited. It makes perfect sense, too: people who do a lot of data entry can probably quote productivity percentages off the top of their head for the difference between typing nine characters versus eleven (or 16 versus 19). The delimiters don't serve any purpose except display back to human eyes, where hyphens can be added programmatically. Delimited numbers are most common in individual communication (e-mail, Word docs, etc.) where only one or two elements are present. My focus in such searching has generally been on finding the larger spreadsheets, databases, text output files, etc.

So, I have promoted the use of Mike's #3 if you want to reduce false positives. Find the most common starting numbers for your organization (for SSNs, it's likely whatever your state issued 18-22 years ago) and keep checking for non-delimited data. I used the same approach for campus-issued credit cards: they likely all start with the same first six digits (the issuing bank code). Combining this approach with the concept in #1 (scanning only certain file types), you can get a very good detection rate for lists of 10+ elements with minimal false positives.

Don't forget to include some old file types in your boundaries. It's likely some of the forgotten private-information DBs are in an old FoxPro database (or something similar).

Brad Judy

--------------------------------------------------
From: "Mike Lococo" <mike.lococo () NYU EDU>
Sent: Wednesday, April 29, 2009 5:10 PM
To: <SECURITY () LISTSERV EDUCAUSE EDU>
Subject: Re: [SECURITY] Spider scripts?
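A minimal sketch of the prefix-anchored, non-delimited matching Brad describes. The prefix list here is purely illustrative (real prefixes would come from your organization's issuance data), and the function name is mine, not Spider's:

```python
import re

# Hypothetical prefixes: the SSN area numbers your state most commonly
# issued 18-22 years ago. These values are placeholders, not real guidance.
COMMON_PREFIXES = ("521", "522", "523")

# Match exactly 9 consecutive digits starting with a known prefix.
# The lookbehind/lookahead block adjacent digits, so a longer number
# (e.g. a 10-digit phone number) isn't partially matched.
PREFIX_SSN = re.compile(
    r"(?<!\d)(?:" + "|".join(COMMON_PREFIXES) + r")\d{6}(?!\d)"
)

def find_candidate_ssns(text):
    """Return undelimited 9-digit strings that begin with a common prefix."""
    return PREFIX_SSN.findall(text)
```

The same pattern works for campus-issued credit cards by swapping in the issuing bank's six-digit prefix and matching ten more digits.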
> We are currently working with Cornell's Spider to develop a process and
> tool for our IT techs and others to scan computers for confidential data
> (SSNs and CC#s). Has anyone refined the scripts that Spider uses to help
> lower the incidence of false positives? If so, would you be willing to
> share them with us? You can reply to me offline if you like. Thanks for
> your help in this.

I haven't heard too many complaints about FPs recently, but the standard tuning list I've heard mentioned in the past is...

0) Utilize the "Mark as false positive" feature so bad hits don't show up in future scans.

1) Try Spider 4 (2008); it moves from a filetype blacklist to a filetype whitelist. By scanning only likely document types (Word, Excel, PDF, email, etc.), FPs are cut way down.

2) Scan the user's profile directory instead of the whole drive. You'll miss stuff stored in temp directories, but will cut down on FPs significantly. (Spider 2008 does this by default.)

3) If you have a common area prefix for your organization, require it. You might miss emails with 1-2 SSNs in them (if those 1-2 aren't from the right prefixes), but you'll find the big spreadsheets with 2,000 SSNs in them (because at least some of those will be from the right prefixes).

4) Use a custom regex that requires delimiters. False positives will go WAY down, at the cost of missing undelimited strings, which several folks have found to be very common. Older Spiders had several regex options, some of which required delimiters and some of which didn't. The latest version seems to have standardized on matching with or without delimiters, offering no option for alternate behavior.

Thanks,
Mike Lococo
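To make the trade-off in #4 concrete, here is a sketch contrasting a delimiter-required SSN pattern with a delimiter-optional one. These are illustrative patterns of my own, not Spider's actual regexes:

```python
import re

# Strict: requires hyphen delimiters, matching only NNN-NN-NNNN.
# Few false positives, but misses undelimited lists entirely.
SSN_DELIMITED = re.compile(r"(?<!\d)\d{3}-\d{2}-\d{4}(?!\d)")

# Permissive: delimiters optional, matching NNN-NN-NNNN, NNN NN NNNN,
# or NNNNNNNNN. Catches undelimited data at the cost of more FPs
# (any bare 9-digit number now hits).
SSN_ANY = re.compile(r"(?<![\d-])\d{3}[- ]?\d{2}[- ]?\d{4}(?![\d-])")

sample = "ids: 123-45-6789, 987654321, phone 5551234"
```

On `sample`, the strict pattern finds only the hyphenated number, while the permissive one also picks up the bare 9-digit string; neither matches the 7-digit phone fragment.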
Current thread:
- Spider scripts? Theresa Semmens (Apr 29)
- <Possible follow-ups>
- Re: Spider scripts? Mike Lococo (Apr 29)
- Re: Spider scripts? Baumstein,Avi H (Apr 29)
- Re: Spider scripts? Sarazen, Daniel (Apr 29)
- Re: Spider scripts? Curt Wilson (Apr 29)
- Re: Spider scripts? Eric Case (Apr 29)
- Re: Spider scripts? Mike Lococo (Apr 29)
- Re: Spider scripts? Brad Judy (Apr 30)
- Re: Spider scripts? randy marchany (Apr 30)
- Re: Spider scripts? Mike Lococo (Apr 30)