Educause Security Discussion mailing list archives

Re: SSN file scanner (C source available)


From: Graham Toal <gtoal () UTPA EDU>
Date: Fri, 12 May 2006 10:48:13 -0500

> 1) Valid SSNs never start with '8' (and those beginning with
> '9' are "Individual Taxpayer Identification Numbers" issued
> to foreign nationals and their dependents), so a regex ought
> to start with [0-79] at the very least.

The SSA's whole business with the odd/even group numbers and
<10 / >= 10 is unnecessarily complex, but checking against it works
and replaces *all* the other heuristics picked up from random web
pages, many of which are no longer valid as new groups come into use.
The table approach is definitely worth implementing.  I'm not
sure I got it 100% right but the implementation below is close
enough for a rough cut.

If you have things like local fake SSNs for students from
abroad, then you just manually add an entry for them to the
1000-entry table, which is indexed by the first three digits
(the area number).
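
A minimal sketch of what such a table could look like, assuming one
int per area that holds the highest group issued and a negative value
for areas that were never allocated.  The names NUM_AREAS, UNALLOCATED,
init_maxgroup and set_maxgroup are made up for illustration; only the
maxgroup[] array itself appears in my code.

```c
#define NUM_AREAS   1000   /* one slot per possible 3-digit area */
#define UNALLOCATED (-1)   /* assumed marker for "no such area" */

/* Highest group issued so far for each area, or UNALLOCATED. */
static int maxgroup[NUM_AREAS];

/* Start with every area marked as unallocated. */
void init_maxgroup(void)
{
  int area;
  for (area = 0; area < NUM_AREAS; area++) maxgroup[area] = UNALLOCATED;
}

/* Record the highest group for one area, whether it comes from the
   SSA list or is a locally issued fake area.
   Returns 0 on success, -1 if either argument is out of range. */
int set_maxgroup(int area, int group)
{
  if (area < 0 || area >= NUM_AREAS) return -1;
  if (group < 1 || group > 99) return -1;   /* 00 is never a real group */
  maxgroup[area] = group;
  return 0;
}
```

For example, set_maxgroup(899, 99) would open up every group in a
made-up local area 899, since 99 is the last group in the SSA's
assignment order.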


"Currently, a valid SSN cannot have the first three digits (the area
number) above 772, the highest area number which the Social 
Security Administration has allocated. There are also special 
numbers which will never be allocated:

* Numbers with all zeros in a digit group (000-xx-xxxx, 
xxx-00-xxxx, xxx-xx-0000).
* Numbers of the form 666-xx-xxxx, probably due to the 
potential controversy (see Number of the Beast). Though the 
omission of this area number is not acknowledged by the SSA, 
it remains unassigned.
* Numbers from 987-65-4320 to 987-65-4329 are reserved for 
advertising use."
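
For comparison, here is roughly what those special cases look like
when written out as explicit checks.  This is only a sketch of the
quoted rules: the function name and the HIGHEST_AREA constant are
invented for the example, 772 is just the figure quoted above and will
go stale, and it treats the 9xx ITIN range as invalid, which a scanner
may not want.

```c
#define HIGHEST_AREA 772   /* figure from the quoted text; changes over time */

/* Returns 1 if area-group-serial passes the quoted special-case
   rules, 0 otherwise.  Arguments are the three digit groups as ints. */
int passes_special_rules(int area, int group, int serial)
{
  /* Each part must fit in its 3-, 2- or 4-digit field. */
  if (area < 0 || area > 999) return 0;
  if (group < 0 || group > 99) return 0;
  if (serial < 0 || serial > 9999) return 0;

  /* All zeros in any digit group is never allocated. */
  if (area == 0 || group == 0 || serial == 0) return 0;

  /* 666-xx-xxxx remains unassigned. */
  if (area == 666) return 0;

  /* 987-65-4320 to 987-65-4329 are reserved for advertising.  The
     area limit below already rejects these; the check is kept
     explicit in case HIGHEST_AREA is ever raised past 987. */
  if (area == 987 && group == 65 && serial >= 4320 && serial <= 4329)
    return 0;

  /* Nothing above the highest allocated area. */
  if (area > HIGHEST_AREA) return 0;

  return 1;
}
```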

All of that is subsumed by the table at
http://www.ssa.gov/employer/highgroup.txt, which the SSA describes
like this:
"The Group portion of the SSN has no meaning other than to
determine whether or not a number has been assigned. SSA
publishes a list every month of the highest group assigned for
each SSN Area.  The order of assignment for the Groups is: odd
numbers under 10, even numbers over 9, even numbers under 9
except for 00 which is never used, and odd numbers over 10. For
example, if the highest group assigned for area 999 is 72, then
we know that the number 999-04-1234 is an invalid number because
even Groups under 9 have not yet been assigned."

Here's my interpretation of that description:


// maxgroup[] (highest group issued per area, negative if the area is
// unallocated), debug, TRUE and FALSE are assumed to be defined
// elsewhere in the scanner source.
int validgroup(int area, int group)
{
  int cur, even, under10;

  if (area < 0 || area > 999) return FALSE;   // outside the 1000-entry table
  if (group < 1 || group > 99) return FALSE;  // group 00 is never used
  if (maxgroup[area] < 0) return FALSE;       // area not allocated

  cur = maxgroup[area];
  even = ((cur&1) == 0);
  under10 = (cur < 10);

  if (debug) fprintf(stderr, "Our SSN's area is %d and group is %d. "
                             "Max group for %d is %d\n",
                             area, group, area, cur);

  if (!even && under10) {
    if (debug) fprintf(stderr, "max group is odd and < 10\n");
    // our group must therefore also be odd and < 10
    if (group > cur) return FALSE; // range check
    return ((group&1) != 0) && (group < 10);
  }

  if (even && !under10) {
    if (debug) fprintf(stderr, "max group is even and >= 10, "
                               "which also allows odd and < 10\n");
    // our group may be odd and < 10, or even and >= 10
    if (group > cur) return FALSE; // range check
    return (((group&1) != 0) && (group < 10))
        || (((group&1) == 0) && (group >= 10));
  }

  if (even && under10) {
    if (debug) fprintf(stderr, "max group is even and < 10, "
                               "which also allows even and >= 10, "
                               "plus odd and < 10\n");
    // illegal groups are odd and >= 10 (not started yet), or even
    // and < 10 but beyond the current maximum
    if (((group&1) != 0) && (group >= 10)) return FALSE;
    if (((group&1) == 0) && (group < 10) && (group > cur)) return FALSE;
    return TRUE;
  }

  // max group must be odd and >= 10.
  // All groups now allowed, modulo range check if odd && >= 10.
  if (debug) fprintf(stderr, "max group is odd and >= 10, which means "
                             "anything goes (but can be range checked "
                             "if our group is also odd)\n");
  if (((group&1) != 0) && (group >= 10) && (group > cur)) return FALSE;
  return TRUE;
}
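
Another way to read the SSA description, offered only as a sketch and
not as what the scanner actually does: number the 99 usable groups in
the order they are handed out, then a group has been issued exactly
when its position is no later than the position of the area's highest
group.  The names group_rank and group_issued are invented here, and
the caller is assumed to have already looked up the highest group for
the area.

```c
/* Position of a group in the SSA assignment order, 1..99, or 0 if the
   group is never used.  Order: odd 01-09, even 10-98, even 02-08,
   odd 11-99. */
int group_rank(int group)
{
  int odd;
  if (group < 1 || group > 99) return 0;
  odd = group & 1;
  if (odd && group < 10)   return (group + 1) / 2;        /*  1..5  */
  if (!odd && group >= 10) return 6 + (group - 10) / 2;   /*  6..50 */
  if (!odd)                return 50 + group / 2;         /* 51..54 */
  return 55 + (group - 11) / 2;                           /* 55..99 */
}

/* Returns 1 if 'group' has been issued, given that 'highest' is the
   highest group assigned for the area; 0 otherwise. */
int group_issued(int group, int highest)
{
  int r = group_rank(group);
  int h = group_rank(highest);
  return r != 0 && h != 0 && r <= h;
}
```

With the SSA's own example, group_rank(72) is 37 and group_rank(4) is
52, so group_issued(4, 72) comes out as 0, matching the statement that
xxx-04-xxxx is invalid while the highest group is 72.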


> I know we don't necessarily need to catch EVERY number for
> the exercise to be useful, but as long as people are working
> on custom tools, it might pay to be as accurate as possible.
> To be honest, our first pass will probably use simpler
> pattern matching to just get the thing done in a timely
> fashion, but I'd be interested in working out a complete set
> of expressions (incorporated with a Luhn check) to really get
> the best coverage. Hey, I'm about to start a CS PhD.. sounds
> like a project ;0

Sounds like we found our volunteer to construct a 'best of breed'
tool :-)  Mind you I'm not sure if it would be enough to justify
a PhD, unless standards have gone downhill a lot in recent years ;-)


G
