Bugtraq mailing list archives

Re: Defeating CAPTCHAs via Averaging


From: Fred Leeflang <fredl () dutchie org>
Date: Wed, 31 Jan 2007 00:55:41 +0100

Alexander Klimov wrote:
I am not sure I understand how you propose to build an automatic
system to attack it: If you can tell that two images contain the same
number then it is very likely that you can recognize the numbers
themselves (there are only 10 different digits).
Well, if one of the stated conditions holds (namely that the predominant
distortion in the captcha is of a noise-like nature), then you won't
need to find out whether the numbers are identical or not. You will simply
find the number, something you wouldn't be able to do when the
distortion isn't noise-like. So when fetching the same captcha several
times and averaging out the noise-like distortion does not result in
a number which OCR software can recognize, there can be a
(programmatic) conclusion that either 1) the distortion wasn't
noise-like, or 2) the numbers aren't identical in the repeated gets.

So an automatic attack system would scan sites for captchas, try
doing the averaging trick, probably find a lot of negatives, but find
some positives.
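
A minimal sketch of that decision logic, in Python. The names fetch and
recognize are hypothetical callables standing in for "download the
challenge again" and "run OCR on an image"; neither refers to a real
library, and the sample count of 50 is an arbitrary assumption.

```python
from typing import Callable, List, Optional

Image = List[List[float]]  # grayscale pixels, row-major


def average_images(images: List[Image]) -> Image:
    """Pixel-wise mean of equally sized images."""
    n = len(images)
    # zip(*images) groups the rows at the same height across all images;
    # zip(*rows) then groups the pixels at the same column within them.
    return [[sum(px) / n for px in zip(*rows)] for rows in zip(*images)]


def try_averaging_attack(
    fetch: Callable[[], Image],                  # hypothetical: re-request the challenge
    recognize: Callable[[Image], Optional[str]], # hypothetical: OCR, None if unreadable
    samples: int = 50,                           # assumed sample count
) -> Optional[str]:
    """Average repeated fetches and hand the result to the recognizer.

    A None result is the "negative" case: either the distortion was not
    noise-like, or the challenge content changed between fetches.
    """
    if samples < 1:
        raise ValueError("need at least one sample")
    averaged = average_images([fetch() for _ in range(samples)])
    return recognize(averaged)
```

A scanner would call try_averaging_attack once per candidate site and
keep only the sites where the result is not None.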

OTOH, if you have a human in the loop, they can just use gimp to create
the averaged figure images from a single image per figure, and then use
these templates to calculate correlation in different places of a given
challenge.

I don't think your understanding of averaging out noise is quite
the same as mine (or the author's?). There's no 'template' with
which you can filter out noise-like distortion. You need multiple
different images. 'Noise-like' means random and scattered evenly around
a mean, so as many noise values fall below that mean as above it.
Averaging will pull the result toward the mean, as the word implies,
and make the noise 'disappear'.
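
To illustrate the point about averaging, here is a small Python sketch.
Everything in it is an assumption made for the example: the 5x3 glyph,
zero-mean Gaussian noise with sigma 1.0, and 200 samples. It compares the
error of a single noisy copy with the error of the pixel-wise average.

```python
import random

# Hypothetical 5x3 bitmap of the digit "7" (1 = ink, 0 = background).
CLEAN = [
    [1, 1, 1],
    [0, 0, 1],
    [0, 1, 0],
    [0, 1, 0],
    [0, 1, 0],
]


def noisy_copy(clean, sigma, rng):
    """Return the image with independent zero-mean Gaussian noise per pixel."""
    return [[p + rng.gauss(0.0, sigma) for p in row] for row in clean]


def average_images(images):
    """Pixel-wise mean of equally sized images."""
    n = len(images)
    return [[sum(px) / n for px in zip(*rows)] for rows in zip(*images)]


def mean_abs_error(a, b):
    """Mean absolute per-pixel difference between two images."""
    diffs = [abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb)]
    return sum(diffs) / len(diffs)


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [noisy_copy(CLEAN, 1.0, rng) for _ in range(200)]
averaged = average_images(samples)

single_err = mean_abs_error(samples[0], CLEAN)
averaged_err = mean_abs_error(averaged, CLEAN)

# Thresholding the average at 0.5 turns it back into a bitmap.
recovered = [[1 if p > 0.5 else 0 for p in row] for row in averaged]

print(f"one sample: {single_err:.3f}  average of 200: {averaged_err:.3f}")
```

The noise in the average shrinks roughly with the square root of the
number of samples, which is why one image is not enough here but a few
hundred are.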

OTOH, I'm not sure averaging is the best technique to use. It certainly
is a technique that could be done by somebody with average mathematical
skills. Particularly on the sample captchas, the contrast is high enough
that a discrete Fourier transform may be able to recognize it using just
one captcha (no, I'm not volunteering).
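
The post does not say how a DFT would be applied. One possible reading,
sketched below in Python purely as an assumption, is a low-pass filter:
transform one noisy scanline, zero the high-frequency bins where most of
the noise energy sits, and transform back. The 64-pixel scanline, the
noise level, and the cutoff keep=4 are all made up for the example.

```python
import cmath
import math
import random


def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse DFT, returning only the real part of each sample."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def low_pass(signal, keep):
    """Zero every frequency bin above `keep`, preserving the mirrored bins."""
    spectrum = dft(signal)
    n = len(spectrum)
    filtered = [c if (k <= keep or k >= n - keep) else 0j
                for k, c in enumerate(spectrum)]
    return idft(filtered)


def mse(a, b):
    """Mean squared difference between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Assumed test data: one scanline crossing a wide, high-contrast stroke.
rng = random.Random(1)
clean = [1.0 if 20 <= i < 44 else 0.0 for i in range(64)]
noisy = [c + rng.gauss(0.0, 0.5) for c in clean]
smoothed = low_pass(noisy, keep=4)

print(f"noisy MSE: {mse(noisy, clean):.3f}  filtered MSE: {mse(smoothed, clean):.3f}")
```

This only works when the strokes are much wider than the noise grain, so
that the signal lives in the low frequencies; a real attack would still
need a recognition step after the filtering.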

Either way, when simple algorithms like averaging can decipher a captcha,
then it's not really a captcha, is it? :)

Regards,
Fred Leeflang

