Educause Security Discussion mailing list archives
Re: SSN file scanner (C source available)
From: Graham Toal <gtoal () UTPA EDU>
Date: Fri, 12 May 2006 10:48:13 -0500
1) Valid SSNs never start with '8' (and those beginning with '9' are "Individual Taxpayer Identification Numbers" issued to foreign nationals and their dependents), so a regex ought to start with [0-79] at the very least.
The whole business with the odd/even numbers and <10 / >= 10 is unnecessarily complex, but it works and makes *all* the other heuristics picked up from random web pages redundant, many of which are no longer valid as new groups come into use. The table approach is definitely worth implementing. I'm not sure I got it 100% right, but the implementation below is close enough for a rough cut. If you have things like local fake SSNs for students from abroad, you just manually add an entry to the 1000-entry table, which is indexed by the first three digits (the area number).
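As a rough illustration of the table idea, here is a minimal sketch of a 1000-entry table indexed by area number. The array name maxgroup and the convention that a negative entry means "unallocated" are taken from the validgroup code later in this message; the helper names init_maxgroup and set_maxgroup, and the constants, are hypothetical and only for illustration.

```c
#include <stddef.h>

#define NUM_AREAS   1000   /* one slot per possible 3-digit area number */
#define UNALLOCATED (-1)   /* negative entry: area has never been issued */

/* Highest group issued so far for each area. */
static int maxgroup[NUM_AREAS];

/* Mark every area as unallocated before any real data is loaded. */
void init_maxgroup(void)
{
    for (size_t i = 0; i < NUM_AREAS; i++)
        maxgroup[i] = UNALLOCATED;
}

/* Record the highest group for one area.
   Returns 0 on success, -1 if area or group is out of range. */
int set_maxgroup(int area, int group)
{
    if (area < 0 || area >= NUM_AREAS || group < 1 || group > 99)
        return -1;
    maxgroup[area] = group;
    return 0;
}
```

A locally issued fake area would then be one extra call after the SSA data has been loaded, for example set_maxgroup(899, 99), where 899 is a made-up area number and 99 simply means every group is accepted.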
"Currently, a valid SSN cannot have the first three digits (the area number) above 772, the highest area number which the Social Security Administration has allocated. There are also special numbers which will never be allocated: * Numbers with all zeros in a digit group (000-xx-xxxx, xxx-00-xxxx, xxx-xx-0000). * Numbers of the form 666-xx-xxxx, probably due to the potential controversy (see Number of the Beast). Though the omission of this area number is not acknowledged by the SSA, it remains unassigned. * Numbers from 987-65-4320 to 987-65-4329 are reserved for advertising use."
All subsumed by the table in http://www.ssa.gov/employer/highgroup.txt described here:

"The Group portion of the SSN has no meaning other than to determine whether or not a number has been assigned. SSA publishes a list every month of the highest group assigned for each SSN Area. The order of assignment for the Groups is: odd numbers under 10, even numbers over 9, even numbers under 9 except for 00 which is never used, and odd numbers over 10. For example, if the highest group assigned for area 999 is 72, then we know that the number 999-04-1234 is an invalid number because even Groups under 9 have not yet been assigned."

Here's my interpretation of that description:

    int validgroup(int area, int group)
    {
      int cur, even, under10;

      // reject anything that cannot index the table, and group 00
      if (area < 0 || area > 999) return FALSE;
      if (group < 1 || group > 99) return FALSE;

      if (maxgroup[area] < 0) return FALSE;   // area never allocated
      cur = maxgroup[area];
      even = ((cur&1) == 0);
      under10 = (cur < 10);
      if (debug) fprintf(stderr, "Our SSN's area is %d and group is %d. "
                                 " max group for %d is %d\n",
                                 area, group, area, cur);

      if (!even && under10) {
        if (debug) fprintf(stderr, "group is odd and < 10\n");
        // our group must therefore also be odd and < 10
        if (group > cur) return FALSE; // range check
        return ((group&1) != 0) && (group < 10);
      }

      if (even && !under10) {
        if (debug) fprintf(stderr, "group is even and >= 10, "
                                   "which also allows odd and < 10\n");
        // our group may be odd and < 10, or even and >= 10
        // first range check:
        if (group > cur) return FALSE; // range check
        return (((group&1) != 0) && (group < 10))
            || (((group&1) == 0) && (group >= 10));
      }

      if (even && under10) {
        if (debug) fprintf(stderr, "group is even and < 10, "
                                   "which also allows even and >= 10, "
                                   "plus odd and < 10\n");
        // even groups < 10 are only issued up to the current maximum
        if (((group&1) == 0) && (group < 10) && (group > cur)) return FALSE;
        // otherwise the only illegal group is odd and >= 10 (note reversed logic)
        return (!(((group&1) != 0) && (group >= 10)));
      }

      // group must be odd and >= 10.
      // All groups now allowed, modulo range check if odd && >= 10.
      if (debug) fprintf(stderr, "group is odd and >= 10, which means "
                                 "anything goes (but can be range checked "
                                 "if our group is also odd)\n");
      if (((group&1) != 0) && (group >= 10) && (group > cur)) return FALSE;
      return TRUE;
    }
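Since validgroup takes the area and group as integers, a scanner needs some way of splitting a candidate string into its fields first. The following is only a sketch of how that might look; the names parse_ssn and ssn_fields_nonzero are hypothetical, and it assumes candidates arrive either as "ddd-dd-dddd" or as nine bare digits. The all-zero test follows the rule quoted earlier that no digit group may be all zeros.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: split "ddd-dd-dddd" or "ddddddddd" into fields.
   Returns 1 on success, 0 if the text is in neither format. */
int parse_ssn(const char *s, int *area, int *group, int *serial)
{
    char d[9];
    size_t n = 0, len = strlen(s);

    if (len == 11) {
        if (s[3] != '-' || s[6] != '-') return 0;
    } else if (len != 9) {
        return 0;
    }
    for (size_t i = 0; i < len; i++) {
        if (len == 11 && (i == 3 || i == 6)) continue;  /* skip the dashes */
        if (!isdigit((unsigned char)s[i])) return 0;
        d[n++] = s[i];
    }
    /* exactly nine digits have been collected at this point */
    *area   = (d[0]-'0')*100 + (d[1]-'0')*10 + (d[2]-'0');
    *group  = (d[3]-'0')*10 + (d[4]-'0');
    *serial = (d[5]-'0')*1000 + (d[6]-'0')*100 + (d[7]-'0')*10 + (d[8]-'0');
    return 1;
}

/* True if none of the three digit groups is all zeros. */
int ssn_fields_nonzero(int area, int group, int serial)
{
    return area != 0 && group != 0 && serial != 0;
}
```

A caller would presumably run parse_ssn, then ssn_fields_nonzero, and only then hand area and group to validgroup for the table lookup.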
I know we don't necessarily need to catch EVERY number for the exercise to be useful, but as long as people are working on custom tools, it might pay to be as accurate as possible. To be honest, our first pass will probably use simpler pattern matching to just get the thing done in a timely fashion, but I'd be interested in working out a complete set of expressions (incorporated with a Luhn check) to really get the best coverage. Hey, I'm about to start a CS PhD.. sounds like a project ;0
Sounds like we found our volunteer to construct a 'best of breed' tool :-) Mind you I'm not sure if it would be enough to justify a PhD, unless standards have gone downhill a lot in recent years ;-) G
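For reference, the Luhn check mentioned above is the mod-10 checksum used on payment card numbers. SSNs carry no check digit, so it would presumably apply to a card-number pass of the same scanner rather than to the SSN test. A minimal sketch follows; the function name luhn_valid and the decision to skip spaces and dashes are assumptions for illustration.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the digits in s pass the Luhn
   mod-10 check, 0 otherwise. Spaces and dashes are ignored; any other
   non-digit character makes the string invalid. */
int luhn_valid(const char *s)
{
    int sum = 0, ndigits = 0, dbl = 0;

    /* walk from the rightmost character, doubling every second digit */
    for (size_t i = strlen(s); i > 0; i--) {
        unsigned char c = (unsigned char)s[i - 1];
        if (c == ' ' || c == '-') continue;
        if (!isdigit(c)) return 0;
        int d = c - '0';
        if (dbl) {
            d *= 2;
            if (d > 9) d -= 9;   /* same as adding the two digits of d */
        }
        sum += d;
        dbl = !dbl;
        ndigits++;
    }
    return ndigits >= 2 && sum % 10 == 0;
}
```

Passing the check only means the digits are self-consistent, so a scanner would still want length and prefix tests on top of it to cut down false positives.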