Educause Security Discussion mailing list archives

Re: SSN file scanner (C source available)


From: Graham Toal <gtoal () UTPA EDU>
Date: Fri, 12 May 2006 08:51:56 -0500

OK, you guys are just too great!  Three more tools posted since
yesterday and I like them all.

I'd like to add some comments:

A tool that you can use to examine file contents is available 
by default on both Mac OS X and Unix systems.  This tool is grep.
There are versions of grep available for the PC as well.

I may be wrong, but chances are high that grep will only give a
meaningful report for text files.  It may tell you that it has
found a match in a binary file but not what the match is, so you
won't have the ability to eliminate false positives, of which you
should expect many if you include simple 9-digit sequences in
your patterns (for SSNs without '-'s).  It's useful to look for
those patterns in binary files, so you can find spreadsheets
and databases etc.  I liked how Wyman's tool included the ability
to run things like unzip in order to look inside archives.
Basically what we really want is the ability to hook some sort
of search like this into a decent anti-virus product, which
already has the low-level facilities needed to look at all the
data on a machine.  I wonder if any AV products have user-accessible
hooks?  Maybe clamav, since it is already open source?
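
As a rough illustration of the binary-file point above, here is a
minimal Python sketch (not any of the posted tools) that scans raw
bytes and reports the offset, the matched text and some surrounding
bytes, so a hit can be judged by eye.  The names scan_bytes and
scan_file and the 16-byte context window are assumptions made for
this example.

```python
import re

# Dashed SSN-like sequence in raw bytes, not embedded in a longer digit run.
SSN_BYTES = re.compile(rb"(?<![0-9])[0-9]{3}-[0-9]{2}-[0-9]{4}(?![0-9])")

def scan_bytes(data, context=16):
    """Return (offset, matched text, surrounding bytes) for each hit."""
    hits = []
    for m in SSN_BYTES.finditer(data):
        start = max(0, m.start() - context)
        end = min(len(data), m.end() + context)
        hits.append((m.start(), m.group().decode("ascii"), data[start:end]))
    return hits

def scan_file(path, context=16):
    """Read a file as bytes (text or binary alike) and scan it."""
    with open(path, "rb") as f:
        return scan_bytes(f.read(), context)
```

Showing the surrounding bytes is the design choice that matters here:
it is what lets you throw out false positives without opening each
file by hand.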

Grep can use regular expressions to look for data within a file.
The following strings when used in grep will find Social 
Security and credit card numbers.

SSNs 123-45-6789 or 123 45 6789

[0-9][0-9][0-9]\-[0-9][0-9]\-[0-9][0-9][0-9][0-9]|[0-9][0-9][0-9]\ [0-9][0-9]\ [0-9][0-9][0-9][0-9]
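
For anyone who wants to try the SSN pattern above outside grep, here
is a minimal Python sketch of the same expression, written with
repeat counts instead of spelled-out character classes.  The name
find_candidates is made up for this example.  With grep itself, the
'|' alternation needs extended syntax (grep -E or egrep).

```python
import re

# Same two alternatives as the pattern above: dashes or single spaces.
SSN_RE = re.compile(r"[0-9]{3}-[0-9]{2}-[0-9]{4}|[0-9]{3} [0-9]{2} [0-9]{4}")

def find_candidates(text):
    """Return every substring of text that matches the SSN format."""
    return SSN_RE.findall(text)
```

Note that the pattern also matches inside longer digit strings, which
is one source of the false positives discussed in this thread.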
...
Please examine the contents of any files carefully.  I know 
on my system, I found a file containing flow data that 
matched the social security number format.  Just because you 
get a particular hit does not automatically mean the data is 
of concern.


That was why I wrote my hack.  Searching by regular expression
is a useful tool (it's what all three solutions posted do) but
if you're just using a generic regexp without any special knowledge
of the domain (e.g. doing a check-digit calculation on a credit
card number, or a validation of an apparent SSN) the noise from these
tools is going to flood you with data and make it hard to see
the signal.  (You avoided most of the noise by not allowing
9 consecutive digits as a pattern...)
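
To make the domain-knowledge point concrete, here is a hedged Python
sketch of two such filters.  It is not the code from the C tool in
this thread.  luhn_ok applies the standard Luhn check-digit test used
by payment cards; ssn_plausible applies structural rules that are
assumed here from the commonly documented SSN numbering scheme (area
never 000, 666 or 900-999; group never 00; serial never 0000).  The
13-digit minimum card length is likewise an assumption.

```python
def _digits(s):
    """Keep only the ASCII digits of s."""
    return [c for c in s if "0" <= c <= "9"]

def luhn_ok(number):
    """True if the digits of number pass the Luhn check-digit test."""
    digits = [int(c) for c in _digits(number)]
    if len(digits) < 13:  # assumption: shorter runs are not card numbers
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:      # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def ssn_plausible(ssn):
    """True if a 9-digit candidate passes the assumed structural rules."""
    digits = "".join(_digits(ssn))
    if len(digits) != 9:
        return False
    area, group, serial = digits[:3], digits[3:5], digits[5:]
    if area in ("000", "666") or area >= "900":
        return False
    return group != "00" and serial != "0000"
```

Running regexp hits through filters like these discards candidates
that cannot be real numbers, which cuts the noise without hiding
anything valid.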

One other observation: searching for a fixed pattern string can
be done *much* faster than searching for an arbitrary regexp of
indeterminate length.  Even searching for multiple fixed pattern
strings at once can be done pretty efficiently.

It doesn't have to be a fixed string (like an A/V signature), just
a fixed *pattern* (with wild-cards for individual characters).
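
A minimal Python sketch of the fixed-pattern idea follows.  The
template syntax ('#' for any digit, anything else literal) and the
names match_at and scan are hypothetical, invented for this example.
It shows the shape of the technique only: every template has a known
length, so each position is checked with a bounded comparison and no
backtracking.  It makes no claim about speed in Python.

```python
def match_at(data, pos, template):
    """True if data[pos:pos+len(template)] fits template ('#' = any digit)."""
    if pos + len(template) > len(data):
        return False
    for i, t in enumerate(template):
        c = data[pos + i]
        if t == "#":
            if not ("0" <= c <= "9"):
                return False
        elif c != t:
            return False
    return True

def scan(data, templates):
    """Check several fixed-length templates at every position in one pass."""
    hits = []
    for pos in range(len(data)):
        for t in templates:
            if match_at(data, pos, t):
                hits.append((pos, data[pos:pos + len(t)]))
    return hits
```

For example, scan(text, ["###-##-####", "### ## ####"]) looks for both
SSN layouts at once.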
