Educause Security Discussion mailing list archives
Re: SSN file scanner (C source available)
From: Graham Toal <gtoal () UTPA EDU>
Date: Fri, 12 May 2006 08:51:56 -0500
OK, you guys are just too great! Three more tools posted since yesterday and I like them all. I'd like to add some comments:
A tool that you can use to examine file contents is available by default on both Mac OS X and Unix systems. That tool is grep. There are versions of grep available for the PC as well.
I may be wrong, but chances are high that grep will only give a meaningful report for text files. It may tell you that it has found a match in a binary file but not what the match is, so you won't be able to eliminate false positives, of which you should expect many if you include simple 9-digit sequences in your patterns (for SSNs without '-'s). It's still useful to look for those patterns in binary files, so you can find spreadsheets, databases, etc. I liked how Wyman's tool included the ability to run things like unzip in order to look inside archives. Basically, what we really want is the ability to hook this sort of search into a decent anti-virus product, which already has the low-level facilities needed to look at all the data on a machine. I wonder if any AV products have user-accessible hooks? Maybe ClamAV, since it is already open source?
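To make the binary-file point concrete, here is a minimal Python sketch (not any of the tools posted in this thread) that searches raw bytes and reports each match together with a little surrounding context, so a person can judge whether a hit is a false positive. The names SSN_BYTES, find_with_context and scan_file are made up for illustration, and the pattern only covers the dashed form. It assumes the data is stored as single-byte ASCII; text stored as UTF-16, for example, would not match.

```python
import re

# Hypothetical pattern: dashed SSN-like sequences, compiled as bytes so it
# can be run over binary data. The lookarounds reject longer digit runs.
SSN_BYTES = re.compile(rb"(?<![0-9])[0-9]{3}-[0-9]{2}-[0-9]{4}(?![0-9])")


def find_with_context(data, pattern=SSN_BYTES, context=16):
    """Return (offset, match, printable_context) for every match in data."""
    hits = []
    for m in pattern.finditer(data):
        start = max(0, m.start() - context)
        end = min(len(data), m.end() + context)
        window = data[start:end]
        # Replace non-printable bytes with '.' so the context is readable.
        printable = "".join(chr(b) if 32 <= b < 127 else "." for b in window)
        hits.append((m.start(), m.group().decode("ascii"), printable))
    return hits


def scan_file(path, context=16):
    """Read a whole file as bytes and scan it (fine for modest file sizes)."""
    with open(path, "rb") as f:
        return find_with_context(f.read(), context=context)
```

Calling scan_file on a spreadsheet or database file would return a list of offsets, matched strings and context windows rather than just a "binary file matches" notice.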
Grep can use regular expressions to look for data within a file. The following strings, when used in grep, will find Social Security and credit card numbers.
SSNs (123-45-6789 or 123 45 6789):
[0-9][0-9][0-9]\-[0-9][0-9]\-[0-9][0-9][0-9][0-9]|[0-9][0-9][0-9]\ [0-9][0-9]\ [0-9][0-9][0-9][0-9]
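For anyone who wants to try the same expression outside grep, here is a small sketch using Python's re module. POSTED is the SSN alternation above with the unnecessary backslashes dropped; COMPACT is an equivalent written with repetition counts. Both names and the helper find_ssn_like are illustrative only.

```python
import re

# The SSN alternation from the post, spelled out digit by digit.
POSTED = (
    r"[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]"
    r"|[0-9][0-9][0-9] [0-9][0-9] [0-9][0-9][0-9][0-9]"
)

# Equivalent pattern using repetition counts.
COMPACT = r"[0-9]{3}-[0-9]{2}-[0-9]{4}|[0-9]{3} [0-9]{2} [0-9]{4}"


def find_ssn_like(text, pattern=COMPACT):
    """Return every substring of text that matches the SSN-like pattern."""
    return re.findall(pattern, text)
```

Note that neither pattern is anchored, so each will also match inside a longer run of digits and separators; that is one source of the false positives discussed in this thread.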
...
Please examine the contents of any files carefully. I know on my system, I found a file containing flow data that matched the social security number format. Just because you get a particular hit does not automatically mean the data is of concern.
That was why I wrote my hack. Searching by regular expression is a useful tool (it's what all three solutions posted do), but if you're just using a generic regexp without any special knowledge of the domain (e.g. doing a check-digit calculation on a credit card number, or validating an apparent SSN), the noise from these tools is going to flood you with data and make it hard to see the signal. (You avoided most of the noise by not allowing 9 consecutive digits as a pattern...)
One other observation: searching for a fixed pattern string can be done *much* faster than searching for an arbitrary regexp of indeterminate length. Even searching for multiple fixed pattern strings at once can be done pretty efficiently. It doesn't have to be a fixed string (like an A/V signature), just a fixed *pattern* (with wild-cards for individual characters).
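As a rough illustration of both points (and not the C code from the original post), the Python sketch below matches fixed-length patterns in which 'd' is a wildcard for any single digit, then filters hits with simple domain checks. All names are hypothetical. The SSN rules used (no area 000, 666 or 900-999, no group 00, no serial 0000) are a conservative subset of what a real validator might apply, and the credit card check is the standard Luhn check-digit calculation. This shows the shape of the technique only; any speed advantage would come from a compiled implementation, not from this loop.

```python
DIGITS = b"0123456789"


def luhn_ok(number):
    """Luhn check-digit test; non-digit characters are ignored."""
    digits = [int(c) for c in number if c in "0123456789"]
    if len(digits) < 13:          # assumption: shorter runs are not card numbers
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:            # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def ssn_plausible(candidate):
    """Reject SSN-shaped strings that could never have been issued."""
    digits = candidate.replace("-", "").replace(" ", "")
    if len(digits) != 9 or any(c not in "0123456789" for c in digits):
        return False
    area, group, serial = digits[:3], digits[3:5], digits[5:]
    if area in ("000", "666") or area >= "900":
        return False
    return group != "00" and serial != "0000"


def match_fixed(data, pos, pattern):
    """True if data at pos matches pattern: 'd' = any digit, else literal."""
    if pos + len(pattern) > len(data):
        return False
    for offset, p in enumerate(pattern):
        b = data[pos + offset]
        if p == "d":
            if b not in DIGITS:
                return False
        elif b != ord(p):
            return False
    return True


def scan(data, patterns=("ddd-dd-dddd", "ddd dd dddd")):
    """Return (offset, text) for fixed-pattern hits that pass validation."""
    hits = []
    for pos in range(len(data)):
        for pattern in patterns:
            if match_fixed(data, pos, pattern):
                text = data[pos:pos + len(pattern)].decode("ascii")
                if ssn_plausible(text):
                    hits.append((pos, text))
    return hits
```

Because every pattern has a known length, each candidate position is checked with a bounded number of byte comparisons, and the validation step discards hits such as 000-12-3456 before a person ever has to look at them.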