Dailydave mailing list archives
Re: approximate string matching
From: Martin Roesch <roesch () sourcefire com>
Date: Fri, 1 Sep 2006 10:06:15 -0400
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Have you taken a look at Bloom Filters? I'm not sure what your exact requirements are but Bloom Filters are fast and do probabilistic pattern matching. Check out: http://en.wikipedia.org/wiki/Bloom_filter http://www.arl.wustl.edu/~sarang/analysis.pdf They're easy to implement and fast, so maybe that will work for you. Hope that helps! -Marty On Sep 1, 2006, at 6:52 AM, Mateusz Berezecki wrote:
Hello list readers, Stumbling upon some filtering problem I realised exact string matching is not the way to go. Thinking a bit I realised doing an approximate string matching is a good solution. What is approximate string matching? It's the way to allow pattern matching allowing errors, such as typos, character deletions. It's often called a fuzzy string matching or string matching allowing errors. (please visit http://www.nist.gov/dads/HTML/stringMatchwError.html for more information) The heart of a good fuzzy string matching algorithm is a filtering function which unlike in exact string matching doesn't tell if the strings are exact (match / no match answer) but rather tells us if the strings are similar or not similar. Each filtering (often called a distance function) function f accepts 2 strings and returns a value which is called differently and the naming strictly depends on the method but the general term is a distance. This distance function has the following properties: 1) D(X, X) = 0 2) D(X, Y) > 0 when (X != Y) 3) D(X,Y) = D(Y,X) The returned match coincidence value is *the* value which tells how similar are the strings. The value returned by the distance function is also a distance and tells us how far apart are the 2 strings in the same metric space. Having this in mind I started doing a small research (looking it up on my favorite search engine) and what I found was really surprising. The only good resource for such algorithms is which gives a handy list of distance functions http://www.dcs.shef.ac.uk/~sam/stringmetrics.html Knowing of Hamming, Levenshtein and Q-grams functions a bit I know they are too slow for my needs. I have no idea of other functions but looking at execution time comparision graph on the page specified above I saw that Jaccard Similarity function or Overlap coefficient are giving quite good results. Then I need this quite quick so implementing an unknown function is going to be quite painful to me. So in the era of fuzziness everywhere I thought that this might be the good place to ask as my favorite search engine is useless in this case. Is anyone aware of a good implementation of any of these algorithms in C or perhaps some opensource C library for that purpose? Do you have any recommendations? thanks, Mateusz Berezecki _______________________________________________ Dailydave mailing list Dailydave () lists immunitysec com http://lists.immunitysec.com/mailman/listinfo/dailydave
- -- Martin Roesch - Founder/CTO, Sourcefire Inc. - +1-410-290-1616 Sourcefire - Security for the Real World - http://www.sourcefire.com Snort: Open Source IDP - http://www.snort.org -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.1 (Darwin) iD8DBQFE+D5Xqj0FAQQ3KOARAphFAJsG4pxiugksePQ4FLC8tJ0HH6NegACfbee/ ffzWTEf6dPKxfD9Ao2kDCtU= =dOBm -----END PGP SIGNATURE----- _______________________________________________ Dailydave mailing list Dailydave () lists immunitysec com http://lists.immunitysec.com/mailman/listinfo/dailydave
Current thread:
- approximate string matching Mateusz Berezecki (Sep 01)
- Re: approximate string matching Arun Koshy (Sep 01)
- Re: approximate string matching Mateusz Berezecki (Sep 01)
- Re: approximate string matching Tony Carter (Sep 01)
- Re: approximate string matching Martin Roesch (Sep 01)
- Re: approximate string matching Mateusz Berezecki (Sep 01)
- Re: approximate string matching Arun Koshy (Sep 01)