Dailydave mailing list archives

Knowledge Graph + AI = ?


From: Dave Aitel via Dailydave <dailydave () lists aitelfoundation org>
Date: Mon, 8 May 2023 12:00:16 -0400

*Logical Operations on Knowledge Graphs:*

So if you've spent enough time looking at graph databases, you will
invariably run into people who are obsessed with "ontologies". The basic
theory is that if you've already organized your data in some sort of
directed graph, maybe you can then apply a logical ruleset to those
relationships and infer knowledge about things you didn't already know,
which could be useful? There are a lot of people trying to do this with
SBOMs and I...wish them the best.

In real life I think it's basically impossible to create a large, useful
graph database+inference engine with data clean enough for the inferences
to mean anything. Also, the word *ontology* is itself very annoying.

But philosophically, while any sufficiently complex data set will have to
embrace paradoxes, you can get a lot of value out of imposing some higher
level structure based on the text in your data.

And this is where modern AI comes in - specifically, the branch of "Natural
Language Understanding" that broke off from Chomsky-esque-and-wrong
"Natural Language Processing" some time ago.

One article covering this topic is here
<https://medium.com/@anthony.mensier/gpt-4-for-defense-specific-named-entity-extraction-47895b7fed6d>,
which combines entity extraction and classification to find military
topics in an article.

But these techniques can be abstracted and broadened into a general-purpose
and very useful algorithm: essentially, you extract keywords from text
fields within your graph data, then relate those keywords to each other,
which gives you conceptual groupings and lets you make further queries
that uncover insights about those groups.

*Our Solution:*

One of the team members over at Margin Research working on SocialCyber
<https://www.darpa.mil/program/hybrid-ai-to-protect-integrity-of-open-source-code>
with me, Matt Filbert, came up with the idea of using OpenAI's GPT to get
hashtags from text, which it does very, very well. If you store these as
nodes you get something like the picture below (note that hashtags are
BASED on the text, but they are NOT keywords and may not appear in the
text itself):
[image: image.png]
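As a sketch of what that step looks like in practice - the prompt wording and client wiring here are my own assumptions, not the actual SocialCyber code - you ask the model for hashtags and then parse whatever comes back:

```python
import re

# Hypothetical prompt; the exact wording used on SocialCyber is not public.
HASHTAG_PROMPT = (
    "Summarize the following text as 5-10 hashtags, one per line. "
    "Hashtags should capture concepts, not just literal keywords:\n\n{text}"
)

def parse_hashtags(model_output: str) -> list[str]:
    """Pull #Tag tokens out of the model's free-form reply."""
    return re.findall(r"#\w+", model_output)

def hashtags_for(text: str, client, model: str = "gpt-3.5-turbo") -> list[str]:
    """Ask an OpenAI-style chat client for hashtags describing `text`.

    `client` is assumed to expose the chat-completions interface; the
    returned tags become Topic nodes linked to the thing the text came from.
    """
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": HASHTAG_PROMPT.format(text=text)}],
    )
    return parse_hashtags(reply.choices[0].message.content)
```

Because the model answers in free text, parsing defensively (rather than trusting the line format) saves you from the occasional chatty reply.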

Next you want to figure out how topics are related to each other! You can
do this in a thousand different ways - the code words to search on are
"Node Similarity" - but almost all of those ways will either not work or
produce bad results, because you have a very limited directed graph of
just "Things->Topics".

In the end we used a modified Jaccard algorithm (i.e., you are similar if
you have shared parents), which I call Daccardian because it creates
weighted directed graphs (which comes in handy later):
[image: image.png]
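The post doesn't spell out the exact formula, but one plausible reading of "similar if you share parents, weighted and directed" is an asymmetric Jaccard: divide the shared-parent count by one side's parent count only, so the score from A to B differs from B to A. A minimal sketch under that assumption:

```python
def daccard(parents: dict[str, set[str]], a: str, b: str) -> float:
    """Directed 'shared parents' similarity from topic a to topic b.

    parents[t] is the set of things (repos, posts, ...) tagged with topic t.
    Dividing by |parents[a]| instead of the union makes the score
    asymmetric - daccard(a, b) != daccard(b, a) in general - and that
    asymmetry is what gives you a weighted *directed* edge a -> b.
    """
    pa, pb = parents[a], parents[b]
    if not pa:
        return 0.0
    return len(pa & pb) / len(pa)

def topic_edges(parents: dict[str, set[str]], threshold: float = 0.25):
    """Yield (src, dst, weight) edges for every topic pair above threshold."""
    topics = list(parents)
    for a in topics:
        for b in topics:
            if a != b:
                w = daccard(parents, a, b)
                if w >= threshold:
                    yield (a, b, w)
```

Note how a niche topic points strongly at a broad one (everything tagged #UI is also tagged #Frontend) while the broad topic points back only weakly, which matches the intuition you want for "closely related".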

So once you've done that, you get this kind of directed graph:

[image: image.png]

From here you could build communities of related topics using any community
detection algorithm, but even just being able to query against them is
extremely useful. In theory you could query just against one topic at a
time, but because of the way your data is probably structured, you want
both that topic, and any closely related topics to be included.
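As a crude stand-in for a real community detection algorithm, even connected components over the strong topic edges gives you usable groupings. A union-find sketch (the edge format and the threshold are my assumptions, matching the Daccardian output above):

```python
def topic_communities(edges, min_weight: float = 0.5) -> list[set[str]]:
    """Group topics into communities: connected components over the
    strong (weight >= min_weight) edges, via union-find. A proper
    community detection algorithm (Louvain, label propagation, ...)
    would do better, but this shows the shape of the result."""
    parent: dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a: str, b: str) -> None:
        parent[find(a)] = find(b)

    for src, dst, w in edges:
        find(src), find(dst)       # register both endpoints
        if w >= min_weight:
            union(src, dst)        # merge only across strong edges

    groups: dict[str, set[str]] = {}
    for t in parent:
        groups.setdefault(find(t), set()).add(t)
    return list(groups.values())
```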

So for example, repos with topics that either are "#UI" or are closely
related to it can be queried like this (not all topics are shown because
of the LIMIT clause):
[image: image.png]
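The actual query runs against the graph database itself, but the same idea can be illustrated in plain Python - expand the seed topic along strong outgoing edges, then collect everything tagged with any topic in the expanded set (all names and the weight threshold here are made up):

```python
def related_topics(edges, seed: str, min_weight: float = 0.5) -> set[str]:
    """The seed topic plus every topic one strong hop away.

    edges: iterable of (src, dst, weight) directed topic edges,
    e.g. the output of a Daccardian-style similarity pass.
    """
    related = {seed}
    for src, dst, w in edges:
        if src == seed and w >= min_weight:
            related.add(dst)
    return related

def repos_for(tagged: dict[str, set[str]], topics: set[str]) -> set[str]:
    """All repos tagged with any topic in `topics`.

    tagged maps repo name -> set of topic hashtags on that repo.
    """
    return {repo for repo, tags in tagged.items() if tags & topics}
```

Querying the expanded set rather than the single seed topic is what makes this useful: a repo tagged only #Frontend still shows up when you ask about #UI.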

*Some notes on AI Models:*

OpenAI's APIs are a bit slow and often throw errors at random, which is fun
to handle. And of course, when doing massive amounts of inference, it's
probably cheaper to run your own equipment, which will leave you casting
about on Huggingface like a glow worm in a dark cave, looking for an open
source model that can do this. I've tried basically all of them, and they
all start out promising but end up a lot worse than ChatGPT 3.5.

You do want one that is multilingual, and Bard might be an option when they
open up their API access. There's a significant difference between the
results from the big models and the little ones, in contrast to the paper
that just "leaked" from Google
<https://www.semianalysis.com/p/google-we-have-no-moat-and-neither> about
how small tuned models are going to be just as good as bigger models (they
are not!).

One minor exception is the new Mosaic model (
https://huggingface.co/spaces/mosaicml/mpt-7b-instruct), which is
multilingual and four times cheaper than OpenAI, but also about a quarter
as good. It may be the first "semi-workable" open model, though, which is a
promising sign, and it may be worth moving to if you have data you can't
run through an open API for some reason.

*Conclusion:*

If you have a big pile of structured data, you almost certainly have
natural language data that you want to capture as PART of that structure -
something that would have been literally impossible six months ago, before
LLMs. Using this AI-tagging technique and some basic graph algorithms can
really open up the power of your queries to take the most valuable part of
your information into account, while keeping the performance and
scalability of having it in a database in the first place.

Thanks for reading,
Dave Aitel
PS: I have a dream, and that dream is to convert Ghidra to use a graph
database as its native format, so we can use some of these techniques (and
code embeddings) as a native feature. If you sit next to the Ghidra team,
and you read this whole post, give them a poke for me. :)
