Dailydave mailing list archives

Re: The Static Analysis Market and You


From: "Steven M. Christey" <coley () mitre org>
Date: Thu, 16 Oct 2008 19:53:37 -0400 (EDT)


(apologies if this was posted twice)


I supported the NIST effort, primarily as part of my CWE work at
MITRE.  Mostly, I was an evaluator of the tool results, but I've also
assisted in some of the design and interpretation of results.

I'll assume that people have read the SATE page and are somewhat
familiar with how this project was run.

First and foremost:

SATE was an "exposition," not a scientific "experiment."  We were
trying to figure out the things that could be done to evaluate tools,
NOT to actually evaluate tools.  We've learned a lot, but the
resulting human analysis is neither reliable nor repeatable.  I'm not
talking about the raw data that was generated by tools and dumped into
a shareable format - the raw data generated from the tools is what it
is.  It's only our interpretations of the results, and the conclusions
that should be drawn from them, that pose the greatest challenge -
especially because you know damn well that someone somewhere is going
to stack stats against each other and compare tools using data that we
keep saying is not appropriate for that purpose.

Quick definitions:

  - "NIST" - the SATE project leads working at NIST, namely Paul
    Black, Vadim Okun, and Romain Gaucher.

  - "we" - me and other people who evaluated tool results in SATE,
    some of whom do not work at NIST.

  - "tool" - a code scanning tool or service that participated in SATE.

  - "test case" - one of the open source packages that was analyzed.

  - "bug report" - a single item as identified by a tool.

  - "raw data" - raw data generated from the tools, including bug
    reports and supporting documents.

  - "evaluation" - the HUMAN determination as to whether the tool's
    bug report is correct or not (e.g. true/false positive).


OK.  SATE was a big job and we did what we could with the resources we
had.  There were several major factors affecting the analysis:


- Number of bug reports.  We only manually reviewed about 10% of
  47,000 individual bug reports.  Yes, 47 thousand.  There are a lot
  of reasons why the number was that high: running tools in default
  mode, running early-generation tools like flawfinder, running tools
  against software that wasn't written with security in mind, etc.

- After the exposition was underway, we realized that we needed to
  more precisely define what a "true" and "false" positive really
  meant.  Consider a buffer overflow in a command-line argument to an
  application that's only intended to be run by the administrator.
  Sometimes this was evaluated as a false positive, sometimes as a
  true positive.  It's a genuine bug - but how much do you care?  Then
  there's stuff that's true, but it's not necessarily a bug, and you
  don't necessarily care - consider the failure to use symbolic names
  for security-critical constants (CWE-547) or general code-quality
  assessments (not security, but important for many people).  A sketch
  of both cases follows this item.

  That's some of the easy stuff.  Eventually, we (the evaluator team)
  settled on a crude definition for true/false positives, but that was
  *after* we'd already been evaluating bug reports for a while.  Due
  to the scale of the effort, we didn't always go back to fix our
  results.  So, the evaluation data is inconsistent.

  Then there's the question of false negatives.  If tool X doesn't
  even TEST for a specific type of issue, then is it really a false
  negative?
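
  To make the two situations above concrete, here's a made-up C
  fragment - not from any SATE test case; the program, names, and
  buffer sizes are invented:

      #include <stdio.h>
      #include <string.h>
      #include <unistd.h>

      int main(int argc, char **argv)
      {
          char workdir[64];

          /* Hard-coded security-relevant constant (the literal 0)
             instead of a symbolic name - a CWE-547-style report:
             true, but not something most users will act on. */
          if (getuid() != 0) {
              fprintf(stderr, "admin only\n");
              return 1;
          }

          if (argc < 2)
              return 1;

          /* Genuine overflow: argv[1] is copied into a fixed-size
             buffer with no length check.  But the program is only
             meant to be run by the administrator, so evaluators
             disagreed about whether this kind of report should count
             as a true or false positive. */
          strcpy(workdir, argv[1]);

          printf("using %s\n", workdir);
          return 0;
      }

  Both reports are "true" in some sense; the argument was over how
  they should count when you roll up the numbers.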

- NIST designed a database/web interface for importing the results
  from various tools, and evaluating each bug report.  They used a
  simple exchange format that, in retrospect, was not expressive
  enough (a rough sketch of the kind of record involved follows this
  item).  We did NOT run the tools live.  I think NIST did a solid
  job in developing the database and web interface, especially in such
  a short amount of time.  For example, you can look at a line of code
  and quickly find which other tools reported issues for surrounding
  lines of code.

  Since we used the database instead of the live tools, we did not
  have access to the analysis environment that the tools had.  Some of
  these tools have powerful capabilities (e.g. navigation and
  visualization) that help to interpret results.  So, this led to
  increased labor and a lot of ambiguous results (*you* try following
  a logic chain that's 20-deep.  Well OK, *you* can because you're
  special, but I can't.  Not all day anyway.)

  So, while the database was nice for browsing through multiple issues
  from multiple tools, and doing lots of basic data hacking, the lack
  of integration of live tools was really limiting.
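
  To illustrate why a simple format falls short, here's a hypothetical
  record layout, written as a C struct.  This is NOT the actual SATE
  exchange format - the fields and sizes are invented - it just shows
  the general shape of a one-location-per-report record:

      /* Hypothetical single-location bug report record - an
         illustration only, not the actual SATE exchange format. */
      struct bug_report {
          char tool[32];       /* which tool produced the report   */
          char test_case[64];  /* which open source package        */
          char file[256];      /* source file                      */
          int  line;           /* single line number               */
          char cwe[16];        /* weakness type, e.g. "CWE-120"    */
          int  severity;       /* tool-assigned priority           */
          char note[256];      /* free-text message                */
          /* Missing: the multi-step trace through the code that the
             tool's own environment would let you navigate - exactly
             what we needed when judging true vs. false positives. */
      };

  Anything richer than "file, line, weakness, message" had nowhere to
  go, which is a big part of why the human evaluation was so slow.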

- We had a fairly large number of issues that we thought were
  ambiguous, i.e. we couldn't tell if they were correct or not, even
  after accounting for the lack of the tools' live analysis
  environments.  Sometimes, this was because we needed to know more
  about the test case's operating context than we had.  For example, I
  remember one issue where I couldn't be sure whether it was a true or
  false positive.  The issue was a false positive on every platform but
  Solaris.  But for Solaris, I had to know whether a certain field of a
  low-level OS data structure would ever exceed 47 bytes.  In the
  Solaris include files that I looked at, it didn't, but I didn't look
  at every relevant version (then there's SPARC vs. x86).  A sketch of
  the general pattern follows this item.
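
  The general pattern looked something like the following.  The
  structure, field, and sizes here are invented stand-ins, not the
  actual Solaris definitions or the actual test case:

      #include <string.h>

      /* Invented stand-in for a low-level OS structure.  In the
         headers we happened to check, this field was small enough,
         but other OS versions or architectures might declare (or
         fill) it differently. */
      struct os_info {
          char node_name[32];
      };

      void record_node(const struct os_info *info)
      {
          char label[48];   /* room for 47 bytes plus the terminator */

          /* Safe if node_name can never hold more than 47 bytes; an
             overflow otherwise.  That's an operating-context question
             the code itself doesn't answer. */
          strcpy(label, info->node_name);
          (void)label;
      }

  Calling that a true or a false positive is really a statement about
  the deployment environment, not about the code.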

- We did not directly address potential sources of bias.  There was a
  general focus on "high-priority" issues (however THAT's defined -
  tools varied), but there were lots of differences in how we did the
  human evaluation.  For example, during the course of my evaluation
  period (a few weeks), I concentrated mostly on the C programs, not
  Java; sometimes I concentrated on a single test case or one file,
  sometimes on a single type of vulnerability; sometimes just on a
  single tool (maybe I was trolling for new CWE entries, or just
  curious); and sometimes, I just casually browsed through the raw
  results and grabbed whatever I thought looked interesting.
  Generally, I stuck with one particular test case.

  Obviously, my individual approach was very informal.  Others on the
  team were more focused, but there were still different biases at
  play.  So, the evaluation data is NOT representative.

- We did not schedule enough time for a review period, so tool vendors
  didn't have enough time to get back to us on the results of our
  evaluations.  So the evaluation data - already imperfect for reasons
  I've discussed - has not been sufficiently validated.

- As implied by the points above, false positive rates cannot be
  determined, because we weren't always correct in identifying them,
  and our analysis only covered 10% of the data.

- People are very interested in false negatives.  A multi-tool
  evaluation could help estimate these rates (just see which true
  positives weren't mentioned by other tools; a rough sketch of that
  cross-check follows this list) - however, you can't do this with
  incorrect data!  Plus what I mentioned before - is it really a false
  negative if a tool doesn't look for that type of issue?
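
For what it's worth, the cross-check itself is mechanically simple -
something like the sketch below, which assumes you trust the human
"true positive" labels (which, for the reasons above, you shouldn't
yet).  The record layout and names are invented for illustration:

      #define NUM_TOOLS 3

      /* Invented record: one human-confirmed true positive, plus
         flags for which tools reported it and which tools even look
         for this class of issue. */
      struct confirmed_issue {
          const char *file;
          int line;
          int reported_by[NUM_TOOLS];
          int in_scope[NUM_TOOLS];
      };

      /* Count candidate false negatives per tool: confirmed issues
         the tool did not report.  The in_scope check is the open
         question above - do misses outside a tool's coverage count? */
      void count_false_negatives(const struct confirmed_issue *issues,
                                 int n, int fn[NUM_TOOLS])
      {
          for (int t = 0; t < NUM_TOOLS; t++) {
              fn[t] = 0;
              for (int i = 0; i < n; i++) {
                  if (issues[i].in_scope[t] && !issues[i].reported_by[t])
                      fn[t]++;
              }
          }
      }

The arithmetic is trivial; the hard part is that "confirmed" column.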

All that said:

- I'm glad to have participated in SATE.  People will criticize it,
  and/or misinterpret the results, but hopefully everyone will learn
  from it.

- I believe that one of the best outcomes of this exposition, which
  everybody will ignore, is the lessons learned with respect to the
  database, web interface, exchange format, and the improved awareness
  of how tools might generate different reports for the same thing.

- We did find some examples of false negatives.  For my contribution
  to the December release, I plan to include specific code examples to
  help highlight some of these areas; if one tool missed it, then
  maybe others will, too.

- We did find differences between tool results.  In December, I'll
  provide more details on WHY I think some of those differences might
  exist.  In one example, tool X flagged the failure to check the
  result of a malloc() call; tool Y flagged the line of code that did
  the NULL pointer dereference.  They were reporting two links in the
  same chain, but on paper the results look different.  Then consider
  layering issues.  A software security auditor might flag the whole
  test case, saying "you don't have a centralized input validation
  mechanism."  That one design-level finding could correspond to
  numerous XSS or SQL injection reports from a code-auditing tool.  The
  issues are related, and kind of the same, but mostly not.
   
  There are other counting issues, too.  Consider a library function
  that, if called incorrectly, contains a buffer overflow.  One tool
  might flag the library function as one bug; a different tool might
  flag 20 distinct code paths, where each called that library function
  with potentially dangerous input.  Is it one bug or 20?  (Ask that
  question about strcpy() if you don't get my drift.)  Obviously,
  these differences will seriously affect your results.  A sketch of
  both situations follows this item.
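
  Two made-up C fragments - not from the test cases - showing what I
  mean.  First, the malloc() chain: both tools are describing the same
  underlying problem, just at different links:

      #include <stdlib.h>
      #include <string.h>

      void build_record(const char *name)
      {
          char *buf = malloc(128);   /* tool X flags this line: return
                                        value of malloc() not checked */

          strcpy(buf, name);         /* tool Y flags this line: possible
                                        NULL pointer dereference */
          /* ... */
          free(buf);
      }

  And the counting problem, with a vulnerable helper (imagine 20
  different callers instead of one):

      #include <string.h>

      /* If "src" is too long, this helper overflows "dst".  One tool
         might report the helper itself as a single bug ... */
      void copy_name(char *dst, const char *src)
      {
          strcpy(dst, src);
      }

      /* ... while another reports every call site that can reach it
         with dangerous input - potentially 20 separate reports for
         what a human might call one bug. */
      void handle_request(const char *user_input)
      {
          char name[32];
          copy_name(name, user_input);
          (void)name;
      }

  Until you decide how to normalize cases like these, comparing raw
  report counts across tools doesn't mean much.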

- If it was sometimes difficult for me to interpret the results (even
  allowing for the lack of access to live tools), then it will
  probably be a challenge for a lot of developers who don't do
  security all the time.  Maybe the code samples we analyzed were
  particularly complex, but I don't think so.  There will probably be
  significant disagreement with me on this sobering conclusion.

- The SAMATE people at NIST hope to perform a modified "exposition" in
  2009, and this will probably involve important design enhancements.


In closing - we didn't do SATE to compare tools; we wanted to explore
methods for understanding their capabilities.  The evaluation data
that was generated is not suitable for comparing tools because it's
inaccurate and incomplete.  But there was a lot of useful raw data,
the database design and exchange format were a good start, and we
learned a lot that will be covered more extensively in December.


- Steve
_______________________________________________
Dailydave mailing list
Dailydave () lists immunitysec com
http://lists.immunitysec.com/mailman/listinfo/dailydave

