Educause Security Discussion mailing list archives

Re: SSN file scanner (C source available)


From: Steve Lovaas <steven.lovaas () COLOSTATE EDU>
Date: Fri, 12 May 2006 08:32:08 -0600

What a great series of posts! We're working on exactly the same thing at
Colorado State. Guess most state legislatures are too :)

As you think about reducing false positives and becoming more confident
of your hits, keep in mind the following details of numbering:

1) Valid SSNs never start with '8' (and those beginning with '9' are
"Individual Taxpayer Identification Numbers" issued to foreign nationals
and their dependents), so a regex ought to start with [0-79] at the
very least. (No comma inside the brackets, or it matches a literal comma.)

2) Valid credit card numbers are not always 16 digits long. Some older
Visa cards are 13 digits, Diners Club are 14 digits, AMEX are 15 digits,
and most of the rest are 16 digits, although the numbering scheme allows
for numbers as long as 19 digits.
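
As a rough illustration of the two points above, here is a minimal Python sketch of first-pass candidate patterns. The names SSN_CANDIDATE, CARD_CANDIDATE and find_candidates are made up for this example, and the accepted separators (hyphen or space for SSNs, none for card numbers) are assumptions about how the data is laid out. Note the character class is written [0-79], since a comma inside the brackets would match a literal comma.

```python
import re

# Assumption: an SSN shows up as 9 digits, optionally split 3-2-4 by '-' or ' '.
# The first digit is limited to 0-7 or 9, per point 1 (9xx being ITINs).
SSN_CANDIDATE = re.compile(r"(?<!\d)[0-79]\d{2}[- ]?\d{2}[- ]?\d{4}(?!\d)")

# Assumption: a card number shows up as an unbroken run of 13 to 19 digits.
CARD_CANDIDATE = re.compile(r"(?<!\d)\d{13,19}(?!\d)")


def find_candidates(text):
    """Return (ssn_hits, card_hits) as two lists of matched strings."""
    return SSN_CANDIDATE.findall(text), CARD_CANDIDATE.findall(text)
```

These only narrow the field; every hit still needs a closer look (a checksum for card numbers, range checks for SSNs) before it counts as a real find.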

More detailed info:

Several people have coded versions of the "Luhn algorithm" which checks
that the number is a potentially valid credit card number. It's the
algorithm used to determine the last digit of your credit card number,
which is a checksum. One of the most interesting sources talking about
this is at http://www.merriampark.com/anatomycc.htm, and there's also an
article at http://javascript.about.com/library/blccard.htm - both
include Javascript for automating the check.

Basically, (from the merriampark article), "For a card with an even
number of digits, double every odd numbered digit and subtract 9 if the
product is greater than 9. Add up all the even digits as well as the
doubled-odd digits, and the result must be a multiple of 10 or it's not
a valid card. If the card has an odd number of digits, perform the same
addition doubling the even numbered digits instead."
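
For anyone who wants to try it, the check described in that quote can be sketched in a few lines of Python. This is my own illustration, not the code from either article, and the function name luhn_valid is hypothetical. Counting from the rightmost digit and doubling every second one gives the same result as the even/odd wording above for either card length.

```python
def luhn_valid(number):
    """Return True if the digits in `number` pass the Luhn check."""
    # Assumption: spaces or hyphens may be present, so keep only the digits.
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Position 0 is the rightmost digit (the check digit), which is not doubled.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0
```

A passing result only means the number is well formed, not that the card was ever issued, so it cuts false positives rather than confirming a hit.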

Also, as long as I'm being pedantic, there are invalid SSN patterns that
a truly comprehensive search would exclude (this wording from
Wikipedia, http://en.wikipedia.org/wiki/Social_security_number, though
the concept can be found lots of places):

"Currently, a valid SSN cannot have the first three digits (the area
number) above 772, the highest area number which the Social Security
Administration has allocated. There are also special numbers which will
never be allocated:

* Numbers with all zeros in a digit group (000-xx-xxxx, xxx-00-xxxx,
xxx-xx-0000).
* Numbers of the form 666-xx-xxxx, probably due to the potential
controversy (see Number of the Beast). Though the omission of this area
number is not acknowledged by the SSA, it remains unassigned.
* Numbers from 987-65-4320 to 987-65-4329 are reserved for advertising use."
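
The exclusions in that quote can be turned into a small filter. Below is a minimal Python sketch, assuming SSNs are written as nine digits with optional hyphens. The names SSN_PATTERN and ssn_plausible are hypothetical, and the max_area default of 772 is taken straight from the quoted text, so it would need updating as new area numbers are allocated.

```python
import re

# Assumption: input is a single candidate, formatted 3-2-4 with optional hyphens.
SSN_PATTERN = re.compile(r"^(\d{3})-?(\d{2})-?(\d{4})$")


def ssn_plausible(ssn, max_area=772):
    """Return True if `ssn` survives the exclusions in the quoted rules."""
    match = SSN_PATTERN.match(ssn)
    if not match:
        return False
    area, group, serial = (int(part) for part in match.groups())
    # All zeros in any digit group is never allocated.
    if area == 0 or group == 0 or serial == 0:
        return False
    # 666 is unassigned; areas above max_area have not been allocated.
    if area == 666 or area > max_area:
        return False
    # Advertising range; only reachable if the caller raises max_area past 987.
    if area == 987 and group == 65 and 4320 <= serial <= 4329:
        return False
    return True
```

Raising max_area to 999 would let ITIN-style numbers beginning with '9' through, while still dropping the advertising range.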



I know we don't necessarily need to catch EVERY number for the exercise
to be useful, but as long as people are working on custom tools, it
might pay to be as accurate as possible. To be honest, our first pass
will probably use simpler pattern matching to just get the thing done in
a timely fashion, but I'd be interested in working out a complete set of
expressions (incorporated with a Luhn check) to really get the best
coverage. Hey, I'm about to start a CS PhD... sounds like a project ;0

Thanks,
Steve Lovaas

Wyman Miles wrote:
At their heart, all of these tools are one flavor or another of pcregrep.

A somewhat organized "find it and nuke it" movement has started at Cornell,
where the departments are conducting periodic, organized searches for
confidential data and either encrypting, moving, or removing it.

What we're striving to build here are LAN-capable tools with centralized
logging and unattended operation to support that effort.

--On Friday, May 12, 2006 8:17 AM -0500 Roger Safian
<r-safian () NORTHWESTERN EDU> wrote:

If it's of any use, here's a post I made a while back to our local
user group about looking for SSNs and credit card numbers using
grep.

--
==============================================================
Steven Lovaas, MSIA, CISSP
Network & Security Resource Manager
Academic Computing & Network Services
Colorado State University
970-297-3707
Steven.Lovaas () ColoState EDU
==============================================================
