Dailydave mailing list archives

Re: approximate string matching


From: Martin Roesch <roesch () sourcefire com>
Date: Fri, 1 Sep 2006 10:06:15 -0400

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Have you taken a look at Bloom Filters?  I'm not sure what your exact  
requirements are but Bloom Filters are fast and do probabilistic  
pattern matching.  Check out:

http://en.wikipedia.org/wiki/Bloom_filter
http://www.arl.wustl.edu/~sarang/analysis.pdf

They're easy to implement and fast, so maybe that will work for you.   
Hope that helps!

      -Marty


On Sep 1, 2006, at 6:52 AM, Mateusz Berezecki wrote:

Hello list readers,

Stumbling upon some filtering problem I realised exact string matching
is not the way to go. Thinking a bit I realised doing an approximate
string matching is a good solution.

What is approximate string matching? It's the way to allow pattern
matching allowing errors, such as typos, character deletions.
It's often called a fuzzy string matching or string matching allowing
errors. (please visit
http://www.nist.gov/dads/HTML/stringMatchwError.html for more
information)

The heart of a good fuzzy string matching algorithm is a
filtering function which unlike in exact string matching doesn't  
tell if
the strings are exact (match / no match answer) but rather tells us
if the strings are similar or not similar.

Each filtering (often called a distance function) function f accepts 2
strings and returns a value which is called differently and the naming
strictly depends on the method but the general term is a distance.

This distance function has the following properties:

1) D(X, X) = 0
2) D(X, Y) > 0  when (X != Y)
3) D(X,Y) = D(Y,X)

The returned match coincidence value is *the* value which tells how
similar are the strings. The value returned by the distance function
is also a distance and tells us how far apart are the 2 strings in the
same metric space.

Having this in mind I started doing a small research (looking it up on
my favorite search engine) and what I found was really surprising.

The only good resource for such algorithms is
which gives a handy list of distance functions

http://www.dcs.shef.ac.uk/~sam/stringmetrics.html

Knowing of Hamming, Levenshtein and Q-grams functions a bit
I know they are too slow for my needs. I have no idea of other  
functions
but looking at execution time comparision graph on the page specified
above I saw that Jaccard Similarity function or Overlap coefficient
are giving quite good results. Then I need this quite quick so
implementing an unknown function is going to be quite painful to me.

So in the era of fuzziness everywhere I thought that this might be the
good place to ask as my favorite search engine is useless in this
case.

Is anyone aware of a good implementation of any of these algorithms
in C or perhaps some opensource C library for that purpose?
Do you have any recommendations?



thanks,
Mateusz Berezecki
_______________________________________________
Dailydave mailing list
Dailydave () lists immunitysec com
http://lists.immunitysec.com/mailman/listinfo/dailydave


- --
Martin Roesch - Founder/CTO, Sourcefire Inc. - +1-410-290-1616
Sourcefire - Security for the Real World - http://www.sourcefire.com
Snort: Open Source IDP - http://www.snort.org


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (Darwin)

iD8DBQFE+D5Xqj0FAQQ3KOARAphFAJsG4pxiugksePQ4FLC8tJ0HH6NegACfbee/
ffzWTEf6dPKxfD9Ao2kDCtU=
=dOBm
-----END PGP SIGNATURE-----
_______________________________________________
Dailydave mailing list
Dailydave () lists immunitysec com
http://lists.immunitysec.com/mailman/listinfo/dailydave


Current thread: