Dailydave mailing list archives

DARPA AI Challenge!


From: Dave Aitel via Dailydave <dailydave () lists aitelfoundation org>
Date: Tue, 29 Aug 2023 11:36:17 -0400

I've been working with LLMs for a bit, and also looking at the DARPA Cyber
AI Challenge <https://www.darpa.mil/news-events/2023-08-09>. To that end I
put together CORVIDCALL, which uses various LLMs to find and patch
essentially 100% of the bug examples I can throw at it from the various
GitHub repos that collect these things (see below).

[image: image.png]

So I learned a lot of things doing this. One article that came out recently
(https://www.semianalysis.com/p/google-gemini-eats-the-world-gemini) talks
about the future of LLMs, and if you're doing this challenge you really are
building for future LLMs, not the ones available right now.

One thing they pointed out in that article (which I highly recommend
reading) is that Hugging Face is basically doing a disservice with their
leaderboard, but the truth is more complicated. It's nice to know which
models do better than other models, but the comparison between them is not
a simple number, any more than the comparison between people is a simple
number. There's no useful IQ score for models or for people.

For example, one of the hardest things to measure is how well a model
handles interleaved and recursive problems. If you have an SQL query inside
your Python code being sent to a server, does the model notice errors in
that query, or do they fly under the radar as "just a string"?

Can the LLM handle optimization problems, indicating it understands
performance implications of a system?

Can the LLM handle LARGER problems? People are obsessed with context window
sizes, but what you find is a huge degradation in instruction-following
accuracy once you hit even 1/8th of the context window size on any of the
leading models. This means you have to know how to compress your tasks to
fit, basically, into a teacup. And for smaller models, the degradation is
even more severe.
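
As a rough sketch of what fitting into a teacup means in practice (the 1/8
budget and the chars-per-token ratio below are my own working numbers, not
anything a vendor publishes):

    # Rough sketch: keep the prompt well under the advertised context window.
    # Assumptions (mine): ~4 characters per token, and a usable budget of about
    # 1/8 of the nominal window before instruction-following starts to degrade.

    CONTEXT_WINDOW_TOKENS = 8192     # nominal window of a hypothetical model
    USABLE_FRACTION = 1 / 8          # where accuracy starts to fall off
    CHARS_PER_TOKEN = 4              # crude average for English text and code

    def fits_in_teacup(text: str) -> bool:
        budget_tokens = int(CONTEXT_WINDOW_TOKENS * USABLE_FRACTION)
        approx_tokens = len(text) / CHARS_PER_TOKEN
        return approx_tokens <= budget_tokens

    def chunk_for_teacup(text: str) -> list[str]:
        """Split oversized input into chunks that each fit the usable budget."""
        budget_chars = int(CONTEXT_WINDOW_TOKENS * USABLE_FRACTION * CHARS_PER_TOKEN)
        return [text[i:i + budget_chars] for i in range(0, len(text), budget_chars)]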

People in the graph database world are obsessed with getting "Knowledge
graphs" out of unstructured data plus a graph database. I think "Knowledge
graphs" are pretty useless, but what is not useless is connecting
unstructured data by topic in your graph database and using that to make
larger community-detection-based decisions. And the easiest way to do that
is to pass your data into an LLM and ask it to generate the topics for you,
typically in the form of a Twitter hashtag. Code is unstructured data.
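
A minimal sketch of that pipeline, assuming a call_llm() placeholder for
whatever model client you actually use (the prompt and the helper names are
mine, not any particular vendor's API):

    # Sketch: tag documents (or code chunks) with LLM-generated hashtags, connect
    # documents that share a tag, then run community detection over the result.
    import networkx as nx
    from networkx.algorithms.community import greedy_modularity_communities

    def call_llm(prompt: str) -> str:
        raise NotImplementedError("plug in your model client here")

    def topics_for(doc: str) -> set[str]:
        reply = call_llm(f"Give me 5 Twitter hashtags, comma separated, for:\n{doc}")
        return {tag.strip().lower() for tag in reply.split(",") if tag.strip()}

    def communities(docs: dict[str, str]):
        g = nx.Graph()
        tagged = {name: topics_for(text) for name, text in docs.items()}
        g.add_nodes_from(tagged)
        for a in tagged:
            for b in tagged:
                if a < b and tagged[a] & tagged[b]:   # shared hashtag -> edge
                    g.add_edge(a, b, weight=len(tagged[a] & tagged[b]))
        return greedy_modularity_communities(g)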

If you want to measure your LLM you can do some fun things. Asking a good
LLM for 5 Twitter hashtags in comma-separated format will work MOST of the
time. But the smaller and worse the LLM, the more likely it is to go off
the rails and fail when faced with larger data, more complicated data, or
data in a different language that it first has to translate. To be fair,
most of them will fail to produce the right number of hashtags. You can try
this yourself on various models that otherwise sit at the top of a
leaderboard, within "striking distance" of Bard, Claude, or GPT-4 on the
benchmarks. (#theyarenowhereclose, #lol)
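
If you want a dumb but revealing harness for that test, something like this
is all it takes (call_llm again stands in for whichever model you're poking
at):

    # Minimal compliance check: did the model actually return exactly 5
    # comma-separated hashtags? Smaller models fail this surprisingly often.
    def looks_like_five_hashtags(reply: str) -> bool:
        parts = [p.strip() for p in reply.split(",")]
        return len(parts) == 5 and all(p.startswith("#") and len(p) > 1 for p in parts)

    def compliance_rate(call_llm, samples: list[str]) -> float:
        """Fraction of inputs for which the model obeyed the format instruction."""
        prompt = "Give me exactly 5 Twitter hashtags for this, comma separated: "
        hits = sum(looks_like_five_hashtags(call_llm(prompt + s)) for s in samples)
        return hits / len(samples)

Run that across a few dozen samples per model and the leaderboard ordering
stops looking so authoritative.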

Obviously, the more neurons you spend making sure you don't say naughty
things, the worse you are at doing anything useful, and you can see that in
the difference between StableBeluga and LLAMA2-chat, for example, with
these simple manual evaluations.

And this matters a lot when you need your LLM to output structured data
<https://twitter.com/RLanceMartin/status/1696231512029777995?s=20> based on
your input.
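
In practice that means wrapping every call in defensive parsing. A sketch
of what I mean, with an arbitrary retry count and no particular schema:

    # Defensive parsing for structured output: ask for JSON, validate it, retry
    # with the error appended if the model went off the rails.
    import json

    def structured_call(call_llm, prompt: str, required_keys: set[str], retries: int = 1):
        ask = prompt + "\nReply with a single JSON object only."
        for _ in range(retries + 1):
            reply = call_llm(ask)
            try:
                data = json.loads(reply)
                if isinstance(data, dict) and required_keys <= set(data):
                    return data
                ask = (prompt + "\nThat was not the JSON object I asked for; reply "
                       "with a single JSON object containing keys: "
                       + ", ".join(sorted(required_keys)))
            except json.JSONDecodeError as err:
                ask = prompt + f"\nYour last reply was not valid JSON ({err}); reply with JSON only."
        return None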

So we can divide up the problem of automatically finding and patching bugs
in source code in a lot of ways, but one way is to notice the process real
auditors follow and just replicate it by passing data flow diagrams and
other summaries into the models. Right now hundreds of academics are
"inventing" new ways to use LLMs, for example "Reason and Act
<https://blog.research.google/2022/11/react-synergizing-reasoning-and-acting.html>".
I've never seen so much hilarity as people putting obvious computing
patterns into papers and trying to invent terminology to hang their careers
on.
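
Not to pretend this is literally how CORVIDCALL is wired up, but the shape
of that loop is roughly this (every prompt and helper below is an
illustration, not the real thing):

    # Sketch of the "replicate what a human auditor does" loop: summarize each
    # function, stitch the summaries into a dataflow picture, then ask for bugs.
    # call_llm is a placeholder for your model client; the prompts are mine.

    def summarize_function(call_llm, name: str, source: str) -> str:
        return call_llm(f"Summarize what `{name}` does, its inputs, and where "
                        f"untrusted data could enter:\n{source}")

    def audit(call_llm, functions: dict[str, str]) -> str:
        summaries = {n: summarize_function(call_llm, n, src) for n, src in functions.items()}
        dataflow = "\n".join(f"- {n}: {s}" for n, s in summaries.items())
        return call_llm("Given this dataflow summary, list the most likely "
                        "memory-safety or injection bugs and a candidate patch "
                        "for each:\n" + dataflow)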

And of course when it comes to a real codebase, say libjpeg, or a real web
app, following the data through the system is important. Understanding code
flaws is important. But building test triggers and doing debugging to test
your assumptions is also important. And coalescing all of this information
in, for example, the big graph database that is your head is how you make
it pay off.

But what you want with bug finding is not to mechanistically re-invent
source-sink static analysis with LLMs. You want intuition. You want flashes
of insight.

It's a hard and fun problem at the bigger end of the scale. We may have to
give our bug finding systems the machine equivalent of serotonin. :)

-dave
