Dailydave mailing list archives

Re: [fuzzing] Coverage and a recent paper by L. Suto


From: "matthew wollenweber" <mwollenweber () gmail com>
Date: Wed, 17 Oct 2007 00:28:44 -0400


> The main downside is that I don't know why it is that finding bugs
> should depend linearly on anything. Is it really true that if run A
> explores 10 more blocks than run B then run A has a .10 better chance of
> finding a bug in the program?


I think the above is the crux of the problem. Despite lots of intuition about
where bugs are or how to find them, there hasn't been much analysis (to my
knowledge) of how to model bug prediction. Clearly a lot of bugs occur where
unchecked data is transferred into fixed-size buffers, but from a systems
perspective I've not seen anything quantifiable that suggests where bugs
will be. For your weights a, b, c (or even the linear equation itself) you're
right that we don't have metrics on what the best fit looks like.

I think it would be interesting for an academic or two to put a lot of
"small enough" programs through a lot of fuzzers: fuzz most of the possible
paths, then analyze the common traits to answer basic questions about how
diversity, coverage, and interesting functions relate to finding bugs. (To
any professors out there: I haven't finished my master's yet and would love
to find an interesting program.) The cool research out there (Molnar, Aitel,
DeMott, etc.) seems to focus a lot on the feedback systems used to find more
bugs, but as of yet I've seen no argument for why any particular fitness
function is well suited to the problem.




On 10/16/07, David Molnar <dmolnar () eecs berkeley edu> wrote:

On Mon, 2007-10-15 at 21:38 -0700, J.M. Seitz wrote:

> So, let's try this:
>
>     spankage = code coverage + (path diversity weight * 2) + (num interesting funcs hit * 5)
>
> This stems from many talks with Jared, Pedram, and Charlie. I also
> proposed that a proximity value to dangerous functions be thrown in, but
> let's not get too far ahead of ourselves.

Cool, this is interesting. I wonder if you could also flip it around and
try to estimate the values of those coefficients instead of picking "2"
and "5". I don't think those are bad values, just it'd be interesting if
there is a way to derive them somehow from data about which fuzzing runs
have been successful so far.

One way this could work is as follows: suppose we have a set of fuzzing
runs R_1, R_2,...,R_n, and suppose for each run we know whether the run
found a bug or not. Let's define Y_i to be 1 if the run R_i found at
least one bug and Y_i to be 0 otherwise. Then we could set up a linear
model:

Y_i = a + b*coverage_i + c*path_diversity_i + d*num_int_funcs_i + error_i

This just says we think whether a bug is found on run i depends linearly
on coverage, path diversity, and number of interesting functions, plus
an "intercept term" a and an error term. In your equation above for
"spankage", we have a = 0, b = 1, c = 2, d = 5.
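As a concrete sketch of that mapping, here is a minimal Python version of the "spankage" score with those coefficients. The function name and its inputs are illustrative, not from any real fuzzer:

```python
def spankage(coverage, path_diversity, interesting_hits):
    """Score a fuzzing run with a=0, b=1, c=2, d=5 from the linear model.

    coverage:          basic blocks (or lines) hit during the run
    path_diversity:    some measure of distinct paths exercised
    interesting_hits:  number of "interesting" functions reached
    """
    return 0 + 1 * coverage + 2 * path_diversity + 5 * interesting_hits

# A run covering 120 blocks over 4 distinct paths, hitting 1 interesting
# function, scores 120 + 2*4 + 5*1 = 133.
print(spankage(120, 4, 1))  # → 133
```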

The error represents among other things the effect of all the other
variables we did not measure. Like, there's no variable in this equation
for "did the developers do a lot of threat modeling," so if that has a
lot of impact on whether a specific run finds a bug, then we will see
that as "error" in estimating how important coverage, path diversity,
and #interesting functions is to finding a bug. The error can also
account for things like randomization in the fuzzer, difficulty in
measuring path diversity, and so on.

The main point is that if we assume this model holds we can then
estimate values for a,b,c, and d from the data using any of several
methods. For example, if we believe the errors are independent of the
other variables, and if we believe the errors have a mean value of 0,
then we can use ordinary least squares regression. We can then look at
the resulting estimates for a,b,c, and d to give us some idea of whether
coverage, path diversity, and number of interesting functions in fact
have an impact on finding bugs. Finally, we could feed the resulting
estimates back into your Fuzzing Proxy Coverage Spanker-abob.
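For illustration, such a least-squares fit can be sketched in a few lines of Python. The per-run numbers below are made up, and a real study would want many more runs than coefficients:

```python
import numpy as np

# Hypothetical per-run measurements: [coverage, path diversity,
# num interesting funcs hit], plus Y = 1 if the run found a bug, else 0.
X_raw = np.array([
    [120, 4, 1],
    [300, 9, 3],
    [180, 5, 2],
    [ 90, 2, 0],
    [260, 8, 4],
    [150, 3, 1],
], dtype=float)
Y = np.array([0, 1, 1, 0, 1, 0], dtype=float)

# Prepend a column of ones so the intercept a is estimated too.
X = np.column_stack([np.ones(len(X_raw)), X_raw])

# Ordinary least squares: beta = (a, b, c, d) minimizing ||X @ beta - Y||^2.
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
a, b, c, d = beta
print("a=%.3f  b=%.3f  c=%.3f  d=%.3f" % (a, b, c, d))
```

The estimated (a, b, c, d) could then replace the hand-picked (0, 1, 2, 5) in the scoring formula.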

The main downside is that I don't know why it is that finding bugs
should depend linearly on anything. Is it really true that if run A
explores 10 more blocks than run B then run A has a .10 better chance of
finding a bug in the program? Mainly this came to mind because I'm just
learning about linear regressions now and have to read dozens of papers
where people run, like, regressions of computer use vs. wages and
compare them to regressions of pencil use vs. wages to see whether
computers or pencils better predict high wages. Still, it seems like we
should be able to leverage the data generated on which fuzzer runs
produce bugs and which do not.

-David Molnar




-- 
Matthew  Wollenweber
mwollenweber () gmail com | mjw () cyberwart com
www.cyberwart.com
_______________________________________________
Dailydave mailing list
Dailydave () lists immunitysec com
http://lists.immunitysec.com/mailman/listinfo/dailydave
