Nmap Development mailing list archives

Re: RFC: patch to skip some service matches


From: David Fifield <david () bamsoftware com>
Date: Fri, 9 Jun 2017 10:24:08 -0700

On Wed, Aug 24, 2016 at 10:23:40AM -0600, David Fifield wrote:
On Wed, Aug 24, 2016 at 12:20:37AM -0500, Daniel Miller wrote:
List,

I've spent probably too much time today enhancing Nmap's service matching
system to try to reduce CPU time spent in regular expression matching.
Unfortunately, I can't tell whether it has improved anything yet, so I'm asking
for help testing.

Since this is a CPU-time enhancement, it would only affect scans which are
CPU-bound. For this reason, I've CC'd Tudor and Brandon, as their GSoC project
resulted in speedups of certain large scans due to algorithmic improvements,
and I hope they can test.

The change involves inspecting each match line in nmap-service-probes to see if
it is part of a contiguous group of match lines that will only ever match a
string starting with a given single byte. Already, this means we are targeting
only the very fastest match lines, so chances are good there won't be
noticeable improvement. The first such match in a group will link to the last
one (by index) so that the entire group can be skipped if the first match fails
because of an incorrect initial byte. There are groups of hundreds of such
contiguous match lines in a few places: FTP matches starting with "2", SSH
matches starting with "S", and HTTP matches starting with "H" for instance.

I've thought about this problem too. I'm intrigued by the Rust
"RegexSet" API:
https://doc.rust-lang.org/regex/regex/struct.RegexSet.html
"A regex set corresponds to the union of two or more regular
expressions... it will also report *which* regular expressions in the
set match."

A RegexSet feels like the "right" way to do our kind of matching. But I
haven't found an equivalent API in PCRE, or in any library other than
Rust's. It presumably requires the RE engine to use a linear-time
matching algorithm (as in https://swtch.com/~rsc/regexp/regexp1.html) so
that matching the union of many REs will be faster than matching each
one individually in turn--the NFA implementation would do your
first-byte optimization implicitly.

It turns out the C++ re2 library also has this regex set feature (the
Rust library got the idea from re2).
https://github.com/google/re2/blob/4bb66d4f2fff9c5d4b0ef72634723180aba9f320/re2/set.h

The idea is that for each Probe in nmap-service-probes, you would have a
data structure consisting of both an RE2::Set and a list of all the
Regexp patterns that compose it (including all fallbacks). You would
then match your data against the RE2::Set to get a list of matching
indices. Take the smallest index (as if you were testing Regexps
sequentially and stopping at the first match) and then re-do the match
against the Regexp at that index, in order to extract captures (the
RE2::Set doesn't give you captures).
_______________________________________________
Sent through the dev mailing list
https://nmap.org/mailman/listinfo/dev
Archived at http://seclists.org/nmap-dev/


Current thread: