Dailydave mailing list archives
Re: The Static Analysis Market and You
From: "Steven M. Christey" <coley () mitre org>
Date: Thu, 16 Oct 2008 19:53:37 -0400 (EDT)
(apologies if this was posted twice)

I supported the NIST effort, primarily as part of my CWE work at MITRE. Mostly, I was an evaluator of the tool results, but I've also assisted in some of the design and interpretation of results.

I'll assume that people have read the SATE page and are somewhat familiar with how this project was run.

First and foremost: SATE was an "exposition," not a scientific "experiment." We were trying to figure out the things that could be done to evaluate tools, NOT to actually evaluate tools. We've learned a lot, but the resulting human analysis is neither reliable nor repeatable. I'm not talking about the raw data that was generated by tools and dumped into a shareable format - the raw data generated from the tools is what it is. It's only our interpretations of the results, and what conclusions should be drawn, that pose the greatest challenge - especially because you know damn well that someone somewhere is going to stack stats against each other and compare tools using data that we keep saying is not appropriate for that purpose.

Quick definitions:

- "NIST" - the SATE project leads working at NIST, namely Paul Black, Vadim Okun, and Romain Gaucher.
- "we" - me and other people who evaluated tool results in SATE, some of whom do not work at NIST.
- "tool" - a code scanning tool or service that participated in SATE.
- "test case" - one of the open source packages that was analyzed.
- "bug report" - a single item as identified by a tool.
- "raw data" - raw data generated from the tools, including bug reports and supporting documents.
- "evaluation" - the HUMAN determination as to whether the tool's bug report is correct or not (e.g. true/false positive).

OK.

SATE was a big job and we did what we could with the resources we had. There were several major factors affecting the analysis:

- Number of bug reports. We only manually reviewed about 10% of 47,000 individual bug reports. Yes, 47 thousand.
  There are a lot of reasons why the number was that high: running tools in default mode, running early-generation tools like flawfinder, running tools against software that wasn't written with security in mind, etc.

- After the exposition was underway, we realized that we needed to more precisely define what a "true" and "false" positive really meant. Consider a buffer overflow in a command line argument to an application that's only intended to be run by the administrator. Sometimes this was evaluated as a false positive, sometimes as a true positive. It's a genuine bug - but how much do you care? Then there's stuff that's true, but it's not necessarily a bug, and you don't necessarily care - consider the failure to use symbolic names in security-critical constants (CWE-547) or general code-quality assessments (not security, but important for many people). That's some of the easy stuff.

  Eventually, we (the evaluator team) settled on a crude definition for true/false positives, but that was *after* we'd already been evaluating bug reports for a while. Due to the scale of the effort, we didn't always go back to fix our results. So, the evaluation data is inconsistent.

  Then there's the question of false negatives. If tool X doesn't even TEST for a specific type of issue, then is it really a false negative?

- NIST designed a database/web interface for importing the results from various tools, and evaluating each bug report. They used a simple exchange format that, in retrospect, was not expressive enough. We did NOT run the tools live.

  I think NIST did a solid job in developing the database and web interface, especially in such a short amount of time. For example, you can look at a line of code and quickly find which other tools reported issues for surrounding lines of code.

  Since we used the database instead of the live tools, we did not have access to the analysis environment that the tools had. Some of these tools have powerful capabilities (e.g. navigation and visualization) that help to interpret results. So, this led to increased labor and a lot of ambiguous results (*you* try following a logic chain that's 20-deep. Well OK, *you* can because you're special, but I can't. Not all day anyway.)

  So, while the database was nice for browsing through multiple issues from multiple tools, and doing lots of basic data hacking, the lack of integration of live tools was really limiting.

- We had a fairly large number of issues that we thought were ambiguous, i.e. we couldn't tell if they were correct or not, even accounting for the lack of live analysis environment from the tools. Sometimes, this was because we needed to know more about the test case's operating context than we had.

  For example, I remember one issue where I couldn't be sure if it was a true/false positive. The issue was a false positive on every platform but Solaris. But for Solaris, I had to know whether a certain field of a low-level OS data structure would ever exceed 47 bytes. In the Solaris include files that I looked at, the field didn't exceed 47 bytes, but I didn't look at every relevant version either (then there's SPARC vs. x86).

- We did not directly address potential sources of bias. There was a general focus on "high-priority" issues (however THAT's defined - tools varied), but there were lots of differences in how we did the human evaluation.

  For example, during the course of my evaluation period (a few weeks), I concentrated mostly on the C programs, not Java; sometimes I concentrated on a single test case or one file, sometimes on a single type of vulnerability; sometimes just on a single tool (maybe I was trolling for new CWE entries, or just curious); and sometimes, I just casually browsed through the raw results and grabbed whatever I thought looked interesting. Generally, I stuck with one particular test case. Obviously, my individual approach was very informal. Others on the team were more focused, but there were still different biases at play.
  So, the evaluation data is NOT representative.

- We did not schedule enough time for a review period, so tool vendors didn't have enough time to get back to us on the results of our evaluations. So the evaluation data - already imperfect for reasons I've discussed - has not been sufficiently validated.

- As implied by previous results, false positive rates cannot be determined because we weren't always correct in identifying them, and our analysis only covered 10% of the data.

- People are very interested in false negatives. A multi-tool evaluation could help estimate these rates (just see which true positives weren't mentioned by other tools) - however, you can't do this with incorrect data! Plus what I mentioned before - is it really a false negative if a tool doesn't look for that type of issue?

All that said:

- I'm glad to have participated in SATE. People will criticize it, and/or misinterpret the results, but hopefully everyone will learn from it.

- I believe that one of the best outcomes of this exposition, which everybody will ignore, is the lessons learned with respect to the database, web interface, exchange format, and the improved awareness of how tools might generate different reports for the same thing.

- We did find some examples of false positives. For my contribution to the December release, I plan to include specific code examples to help highlight some of these areas; if one tool missed it, then maybe others will, too.

- We did find differences between tool results. In December, I'll provide more details on WHY I think some of those differences might exist.

  In one example, tool X flagged the failure to check the result of a malloc() call; tool Y flagged the line of code that did the NULL pointer dereference. They were reporting two links in the same chain, but it looks like the results are different.

  Consider layering issues.
  A software security auditor might flag the whole test case, saying "you don't have a centralized input validation mechanism." That design-level feature could translate into numerous XSS or SQL injection errors by a code-auditing tool. The issues are related, and kind of the same, but mostly not.

  There are other counting issues, too. Consider a library function that, if called incorrectly, contains a buffer overflow. One tool might flag the library function as one bug; a different tool might flag 20 distinct code paths, where each called that library function with potentially dangerous input. Is it one bug or 20? (Ask that question about strcpy() if you don't get my drift). Obviously, these differences will seriously affect your results.

- If it was sometimes difficult for me to interpret the results (even allowing for the lack of access to live tools), then it will probably be a challenge for a lot of developers who don't do security all the time. Maybe the code samples we analyzed were particularly complex, but I don't think so. There will probably be significant disagreement with me on this sobering conclusion.

- The SAMATE people at NIST hope to perform a modified "exposition" in 2009, and this will probably involve important design enhancements.

In closing - we didn't do SATE to compare tools; we wanted to explore methods for understanding their capabilities. The evaluation data that was generated is not suitable for comparing tools because it's inaccurate and incomplete. But there was a lot of useful raw data, the database design and exchange format were a good start, and we learned a lot that will be covered more extensively in December.

- Steve
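
The counting question ("one bug or 20?") and the false-negative question ("is it a miss if the tool never looks for that type of issue?") can be made concrete with a small sketch. Everything in it is an assumption made for illustration: the `BugReport` record, the function names, the file names, and the per-tool list of weakness classes are hypothetical and are not taken from the SATE exchange format or from any participating tool.

```python
from typing import NamedTuple


class BugReport(NamedTuple):
    """Hypothetical, simplified report record (not the SATE exchange format)."""
    tool: str       # which tool produced the report
    weakness: str   # weakness class, e.g. a CWE identifier
    sink: str       # function that contains the flaw
    call_site: str  # "file:line" where the report was anchored


def count_by_sink(reports):
    """One bug per (weakness, sink): 20 paths into one function count once."""
    return len({(r.weakness, r.sink) for r in reports})


def count_by_call_site(reports):
    """One bug per distinct (weakness, sink, call site): every path counts."""
    return len({(r.weakness, r.sink, r.call_site) for r in reports})


def classify_unreported(confirmed_by_tool, checks_by_tool):
    """Split each tool's unreported confirmed issues into two groups.

    confirmed_by_tool: tool -> set of (weakness, sink) judged true positives.
    checks_by_tool:    tool -> set of weakness classes the tool looks for
                       (an assumed input for this sketch).
    Returns tool -> (missed, out_of_scope).
    """
    all_confirmed = set().union(*confirmed_by_tool.values())
    result = {}
    for tool, found in confirmed_by_tool.items():
        missed, out_of_scope = set(), set()
        for issue in all_confirmed - found:
            weakness = issue[0]
            if weakness in checks_by_tool.get(tool, set()):
                missed.add(issue)        # tool looks for this class but said nothing
            else:
                out_of_scope.add(issue)  # tool never tests for this class
        result[tool] = (missed, out_of_scope)
    return result


# Tool X flags the library function once; tool Y flags 20 callers of it.
tool_x = [BugReport("tool_x", "CWE-120", "copy_name", "lib.c:42")]
tool_y = [BugReport("tool_y", "CWE-120", "copy_name", f"caller.c:{n}")
          for n in range(1, 21)]

print(count_by_sink(tool_x), count_by_sink(tool_y))            # prints: 1 1
print(count_by_call_site(tool_x), count_by_call_site(tool_y))  # prints: 1 20
```

Neither count is "the" right one; the sketch only shows that the counting rule has to be stated before any totals are put side by side. Likewise, `classify_unreported` is only as good as the evaluations fed into it, which is exactly the caveat about incorrect data raised above.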