Educause Security Discussion mailing list archives

Re: SSN file scanner (C source available)


From: Wyman Miles <wm63 () CORNELL EDU>
Date: Fri, 12 May 2006 10:17:41 -0400

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1



- --On Friday, May 12, 2006 8:51 AM -0500 Graham Toal <gtoal () UTPA EDU> wrote:

OK, you guys are just too great!  Three more tools posted since
yesterday and I like them all.

I'd like to add some comments:

A tool that you can use to examine file contents is available
by default on both Mac OS X and Unix systems.  This tool is grep.
There are versions of grep available for the PC as well.

hooks?  Maybe clamav, since it is already open source?

Possibly.  Something we thought of, but never pursued.


Grep can use regular expressions to look for data within a file.
The following strings when used in grep will find Social
Security and credit card numbers.

SSNs 123-45-6789 or 123 45 6789

[0-9][0-9][0-9]\-[0-9][0-9]\-[0-9][0-9][0-9][0-9]|[0-9][0-9][0-9]\ [0-9][0-9]\ [0-9][0-9][0-9][0-9]
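
To make the expression above easier to experiment with, here is a minimal
Python sketch of the same two alternatives (dash-separated or
space-separated).  The names SSN_PATTERN and find_ssn_like are made up for
this illustration.  With grep itself, the | alternation typically needs
extended mode (grep -E).

```python
import re

# Same two alternatives as the grep expression: 123-45-6789 or 123 45 6789.
# SSN_PATTERN and find_ssn_like are illustrative names, not from any tool
# mentioned in this thread.
SSN_PATTERN = re.compile(r"[0-9]{3}-[0-9]{2}-[0-9]{4}|[0-9]{3} [0-9]{2} [0-9]{4}")


def find_ssn_like(text):
    """Return every substring of text that has the SSN shape."""
    return SSN_PATTERN.findall(text)


if __name__ == "__main__":
    sample = "id 123-45-6789, alt 123 45 6789, not 123456789"
    print(find_ssn_like(sample))
```

Like the grep version, this only checks the shape of the text; nine
consecutive digits are deliberately not matched.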
...
Please examine the contents of any files carefully.  I know
on my system, I found a file containing flow data that
matched the social security number format.  Just because you
get a particular hit does not automatically mean the data is
of concern.


Spider's approach to this problem (at least in the Linux variant) is to log
roughly 1K of text on either side of the match.  You can then visually
inspect the log for false positives.  Right now we do a home-grown
encrypted syslog.  We're headed toward HTTPS and XML.
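
The idea of logging context around each match can be sketched in Python as
follows.  This is not Spider's code; the function name, the pattern, and
the default window of 1024 characters (standing in for "roughly 1K") are
assumptions for illustration.

```python
import re

# Illustrative pattern only; a real scanner would use its own expressions.
SSN_PATTERN = re.compile(r"\d{3}-\d{2}-\d{4}")


def matches_with_context(text, window=1024):
    """Yield (match, context) pairs.

    context holds up to `window` characters on either side of the match,
    clipped at the start and end of the text.
    """
    for m in SSN_PATTERN.finditer(text):
        start = max(0, m.start() - window)
        end = min(len(text), m.end() + window)
        yield m.group(), text[start:end]


if __name__ == "__main__":
    line = "flow 10.0.0.1 id 123-45-6789 bytes 512"
    for hit, context in matches_with_context(line, window=8):
        print(hit, "|", context)
```

A reviewer reading the context can usually tell at a glance whether the hit
sits in flow data, a license key, or something that really needs attention.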

Both the win32 and linux flavors also exclude images, executables,
encrypted files, and other things where the chance of a valid match is
nearly zero and/or the chance of a false positive is high.

[ Eudora's license key is a perfectly valid credit card number, damn them.
Japanese phone numbers match \d{3}-\d{2}-\d{4}, etc. ]


That was why I wrote my hack.  Searching by regular expression
is a useful tool (it's what all three solutions posted do) but
if you're just using a generic regexp without any special knowledge
of the domain (e.g. doing a check-digit calculation on a credit
card number, or a validation of an apparent SSN) the noise from these
tools is going to flood you with data and make it hard to see
the signal.  (You avoided most of the noise by not allowing
9 consecutive digits as a pattern...)
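
As an illustration of the check-digit calculation mentioned above, here is
a minimal Python sketch of the Luhn check used for credit card numbers.  It
is not taken from any of the posted tools; the function name and the
minimum length of 13 digits are assumptions.

```python
def luhn_valid(number):
    """Return True if the digits in `number` pass the Luhn check.

    Non-digit characters (spaces, dashes) are ignored.  The minimum length
    of 13 digits is an assumption chosen for this sketch.
    """
    digits = [int(c) for c in number if c in "0123456789"]
    if len(digits) < 13:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit;
    # a doubled value above 9 has 9 subtracted (same as summing its digits).
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


if __name__ == "__main__":
    for candidate in ("4111 1111 1111 1111", "4111 1111 1111 1112"):
        print(candidate, luhn_valid(candidate))
```

A random 16-digit string passes this check about one time in ten, so it
reduces noise without removing it entirely.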

We're using the regex from BleedingSnort, which takes into account that SSNs
with a first-three above 772 haven't been assigned.  This cuts things down
some, at the expense of the valuable bycatch that comes from us assigning
999- to international students.
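
The BleedingSnort expression is not reproduced here.  The following Python
sketch only illustrates the kind of filtering described: it rejects a
first-three above 772, plus 000 and 666 and all-zero middle or last groups,
which are never issued.  The allow_999 switch is a hypothetical way to keep
locally assigned 999- numbers; all names are made up for this example.

```python
import re

# Shape check with the same separator (dash or space) in both positions.
SSN_SHAPE = re.compile(
    r"\b(?P<area>\d{3})(?P<sep>[- ])(?P<group>\d{2})(?P=sep)(?P<serial>\d{4})\b"
)
MAX_AREA = 772  # highest assigned first-three, per the message above


def plausible_ssns(text, allow_999=False):
    """Return SSN-shaped strings whose fields look assignable.

    allow_999 is a hypothetical option that keeps 999-xx-xxxx values, for
    sites that hand those out locally.
    """
    hits = []
    for m in SSN_SHAPE.finditer(text):
        area = int(m.group("area"))
        group = int(m.group("group"))
        serial = int(m.group("serial"))
        if group == 0 or serial == 0:
            continue
        locally_assigned = allow_999 and area == 999
        if not locally_assigned and (area == 0 or area == 666 or area > MAX_AREA):
            continue
        hits.append(m.group())
    return hits


if __name__ == "__main__":
    sample = "123-45-6789 900-12-3456 999-12-3456 666-12-3456"
    print(plausible_ssns(sample))
    print(plausible_ssns(sample, allow_999=True))
```

Doing the range check in code rather than inside the regex keeps the
expression short and makes the local 999- exception easy to switch on.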

For credit card numbers, there are regexes that accurately identify
Visa/MC/Discover (4xxx/5xxx/6011, etc.)
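
A rough Python sketch of prefix-based card matching follows.  The exact
expressions are assumptions for illustration (Visa as 4 plus 13 or 16
digits total, MasterCard as 51-55 plus 16 digits total, Discover as 6011
plus 16 digits total), not the ones used in Spider, and they only match
numbers written without separators.

```python
import re

# Hypothetical per-brand patterns; lengths and prefixes are assumptions
# stated in the text above this block.
CARD_PATTERNS = {
    "visa": re.compile(r"\b4\d{12}(?:\d{3})?\b"),
    "mastercard": re.compile(r"\b5[1-5]\d{14}\b"),
    "discover": re.compile(r"\b6011\d{12}\b"),
}


def classify_cards(text):
    """Return (brand, number) pairs for every card-shaped run of digits."""
    found = []
    for brand, pattern in CARD_PATTERNS.items():
        for m in pattern.finditer(text):
            found.append((brand, m.group()))
    return found


if __name__ == "__main__":
    sample = "a 4111111111111111 b 5500000000000004 c 1234567890123456"
    for brand, number in classify_cards(sample):
        print(brand, number)
```

Prefix matching pairs naturally with a check-digit test, since a number
that has the right prefix and length can still fail the checksum.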


One other observation: searching for a fixed pattern string can
be done *much* faster than searching for an arbitrary regexp of
indeterminate length.  Even searching for multiple fixed pattern
strings at once can be done pretty efficiently.


Precompiling the patterns, as Perl does and as our libPCRE addition to
dd/dcfldd does, speeds things up considerably.
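
The precompilation idea looks like this in Python: compile each pattern
once, outside the scanning loop, and reuse the compiled objects for every
line.  The pattern list and function name are made up for this sketch, and
the actual gain depends on the regex engine (Python's re module also keeps
a small internal cache of compiled patterns).

```python
import re

# Compiled once at import time, not once per line or per file.
PATTERNS = [
    re.compile(p)
    for p in (r"\d{3}-\d{2}-\d{4}", r"\d{3} \d{2} \d{4}")
]


def scan_lines(lines):
    """Return (line_number, match) pairs for every hit in `lines`."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        for pattern in PATTERNS:
            for m in pattern.finditer(line):
                hits.append((lineno, m.group()))
    return hits


if __name__ == "__main__":
    print(scan_lines(["nothing", "a 123-45-6789", "b 123 45 6789"]))
```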

Win32 spider, using the .NET regex matching, is the slowest of them all.
Hunting through 90K files on my pokey laptop for the 4 I've baited only
takes about 30 minutes, though.

It doesn't have to be a fixed string (like an A/V signature), just
a fixed *pattern* (with wild-cards for individual characters).
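
To show what a fixed pattern with per-character wild-cards means, here is a
small Python sketch that uses no regex engine at all.  The template syntax
('d' for any digit, everything else literal) and the function name are
invented for this example, and the scan is deliberately naive; it
illustrates the concept, not the fast multi-pattern search discussed above.

```python
DIGITS = "0123456789"


def find_fixed_pattern(text, pattern="ddd-dd-dddd"):
    """Return (offset, match) pairs where `text` fits the fixed-length pattern.

    In `pattern`, the letter 'd' is a wild-card for any single ASCII digit;
    every other character must match literally.
    """
    n = len(pattern)
    hits = []
    for i in range(len(text) - n + 1):
        window = text[i:i + n]
        if all(
            (c in DIGITS) if p == "d" else (c == p)
            for p, c in zip(pattern, window)
        ):
            hits.append((i, window))
    return hits


if __name__ == "__main__":
    print(find_fixed_pattern("x 123-45-6789 y"))
```

Because the pattern has a known length, each candidate position costs a
bounded number of character comparisons, with no backtracking.  Note that
this sketch does no boundary checking, so it will also match inside a
longer run of digits.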



Wyman Miles
Senior Security Engineer
Cornell University, Ithaca, NY
(607) 255-8421
-----BEGIN PGP SIGNATURE-----
Version: Mulberry PGP Plugin v3.0
Comment: processed by Mulberry PGP Plugin

iQA/AwUBRGSZBcRE6QfTb3V0EQKztQCfR6uvOB+MysNSrIU1AgiXBvgAubwAmwbs
4FDT0ZL7wOPrN/GxueKnl887
=0pms
-----END PGP SIGNATURE-----
