Strwythura library/tutorial, based on a presentation about GraphRAG for GraphGeeks on 2024-08-14
How to construct a knowledge graph (KG) from unstructured data sources using state of the art (SOTA) models for named entity recognition (NER), then implement an enhanced GraphRAG approach, and curate semantics for optimizing AI app outcomes within a specific domain.
- videos: https://youtu.be/B6_NfvQL-BE, https://senzing.com/gph-graph-rag-llm-knowledge-graphs/
- slides: https://derwen.ai/s/2njz#1
Motivation for this tutorial comes from the stark fact that the term "GraphRAG" means many things, based on multiple conflicting definitions. Several popular implementations reveal a relatively cursory understanding about either natural language processing (NLP) or graph algorithms, plus a vendor bias toward their own query language.
See this article for more details and history: "Unbundling the Graph in GraphRAG".
Instead of delegating KG construction to a large language model
(LLM), this tutorial shows the use of sophisticated NLP pipelines
based on `spaCy`, `GLiNER`, TextRank, and related libraries.
Results are better/faster/cheaper, plus this provides more control
and oversight for intentional arrangement of the KG. Then for
downstream usage in a question/answer chat bot, an enhanced GraphRAG
approach leverages graph algorithms (e.g., semantic random walk)
to optimize retrieval of text chunks which ultimately get presented
to an LLM for summarization to produce responses.
For more detailed discussions, see:
- enhanced GraphRAG: "GraphRAG to enhance LLM-based apps"
- ontology pipeline: "Intentional Arrangement" by Jessica Talisman
- Ontology Engineering by Elisa Kendall, Deborah McGuiness, Ying Ding
- `spaCy`: https://spacy.io/
- `GLiNER`: https://huggingface.co/urchade/gliner_base
- TextRank: https://www.derwen.ai/docs/ptr/explain_algo/
Some key issues regarding KG construction with LLMs which don't get addressed much by the graph community and AI community in general:
- LLMs tend to mangle cross-domain semantics when used for building graphs; see Mai2024 referenced in the "GraphRAG to enhance LLM-based apps" talk above.
- You need to introduce a semantic layer for representing the domain context, which follows more of a neurosymbolic AI approach.
- Most all LLMs perform question rewriting in ways which cannot be disabled, even when the `temperature` parameter is set to zero; this leads to relative degrees of "hallucinated questions" for which there are no clear workarounds.
- Any model used for prediction introduces reasoning based on generalization, even more so when the model uses a loss function for training; this tends to be the point where KG structure and semantics turn into crap; see the "Let's talk about ..." articles linked below.
- The approach outlined here is faster and less expensive, and produces better results than if you'd delegated KG construction to an LLM.
Of course, YMMV.
This approach leverages neurosymbolic AI methods, combining practices from:
- natural language processing
- graph data science
- entity resolution
- ontology pipeline
- context engineering
- human-in-the-loop
Overall, this illustrates a reference implementation for entity-resolved retrieval-augmented generation (ER-RAG).
This runs with Python 3.11, though the range of versions may be extended soon.
To `pip` install from PyPI:
python3 -m pip install strwythura
python3 -m spacy download en_core_web_md
Then to integrate this library within an application:
- Copy settings in `config.toml` into a custom configuration file.
- Subclass `DomainContext` to extend it for the use case.
- Define semantics in `domain.ttl` for the domain context.
- Run entity resolution to merge the structured datasets.
- Run `Ollama` with the Gemma3 LLM already downloaded, as described below.
- Instantiate new `DomainContext`, `Strwythura`, `VisHTML`, and `GraphRAG` objects or their subclassed extensions.
- ...
- Profit
Follow the patterns in the `build.py` and `errag.py` example scripts.
Feel free to swap in different `spaCy` models, different LLMs, etc.
If you're working with documents in a language other than English, well that's absolutely fantastic, though you will need to:
- Update model settings in the `config.toml` file.
- Change the `spaCy` model downloaded here.
- Also change the language tags used in `domain.ttl` as needed.
This library uses `poetry` for package management and you need to install it to get things running:
poetry update
poetry run python3 -m spacy download en_core_web_md
Run entity resolution (ER) to produce entities and relations from structured data sources, which tend to be more reliable than those extracted from unstructured content.
What does this ER step buy us? ER allows us to merge multiple structured data sets, even without consistent foreign keys being available, producing an overlay of entities and relations among them. This is quite useful as a "backbone" for constructing a KG. Moreover, when there are judgements being made from the KG about people or organizations, ER provides accountability for the merge decisions.
This approach becomes especially important in public sector, healthcare, banking, insurance -- i.e., in use cases where you might need to "send flowers" when automated judgements about identity go wrong. For example, someone gets denied a loan, has a medical insurance claim blocked, gets a tax audit, has their voter registration voided, becomes the subject of an arrest warrant, and so on.
In other words, people and organizations tend to take legal actions when someone else causes them harm by mangling identity management. You'll want an audit trail of decisions based on evidence, whenever your software systems make these kinds of judgements.
For the domain context in this tutorial, say we have two hypothetical datasets which provide business directory listings:
- `sz_er/acme_biz.json` -- "ACME Business Directory"
- `sz_er/corp_home.json` -- "Corporates Home UK"
Plus we have slices from datasets which provide listings about researchers and scientific authors:
- `sz_er/orcid.json` -- ORCID
- `sz_er/scopus.json` -- Scopus
The JSONL format of these datasets is based on data mapping, i.e., providing the entity resolution process with heuristics about features available in the structured dataset.
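To make the data mapping idea concrete, here is a small sketch which writes one business listing as a single JSONL line. The attribute names (`DATA_SOURCE`, `RECORD_ID`, `RECORD_TYPE`, `NAME_ORG`, `ADDR_FULL`) are assumed to follow the Senzing entity specification, while the helper `to_mapped_record` and the sample values are hypothetical, not taken from the actual datasets.

```python
import json


def to_mapped_record(data_source: str, record_id: str, name: str, address: str) -> str:
    """Serialize one organization listing as a single JSONL line.

    The keys are feature names which tell the ER process how to
    interpret each value (assumed attribute names, see above).
    """
    record = {
        "DATA_SOURCE": data_source,
        "RECORD_ID": record_id,
        "RECORD_TYPE": "ORGANIZATION",
        "NAME_ORG": name,
        "ADDR_FULL": address,
    }
    return json.dumps(record)


# Hypothetical sample values, for illustration only.
line = to_mapped_record("ACME_BIZ", "1001", "Example Widgets Ltd", "1 High Street, London")
print(line)
```

Each input file then holds one such JSON object per line, and the `DATA_SOURCE` value matches a data source name registered with the ER configuration.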
These four datasets get merged using ER, where the results produce a domain-specific thesaurus. This thesaurus generates instances of graph elements: entities, relations, properties. We'll blend this into our semantic layer used for organizing the KG later.
The following steps are optional, since these ER results have already
been pre-computed and provided in the `sz_er/export.json` file.
If you want to run `Senzing` to produce these ER results, use the following steps.
Senzing SDK runs in Python or Java, and can also be run as batch using a container from DockerHub:
docker pull senzing/demo-senzing
Once this container is available, run:
docker run -it --rm --volume ./sz_er:/tmp/data senzing/demo-senzing
This brings up a Linux command line prompt `I have no name!` and the
local subdirectory `sz_er` will be mapped to the `/tmp/data` directory.
Type the following commands for batch ER into the command line prompt.
Set up the Senzing configuration for merging these datasets:
G2ConfigTool.py
Within the configuration tool, register the names of the data sources being used:
addDataSource ACME_BIZ
addDataSource CORP_HOME
addDataSource ORCID
addDataSource SCOPUS
save
exit
Load each file and run ER on its data records:
G2Loader.py -f /tmp/data/acme_biz.json
G2Loader.py -f /tmp/data/corp_home.json
G2Loader.py -f /tmp/data/orcid.json
G2Loader.py -f /tmp/data/scopus.json
Export the ER results to the `sz_er/export.json` file, then exit the
container:
G2Export.py -F JSON -o /tmp/data/export.json
exit
This later gets parsed to produce the `data/thesaurus.ttl` file
(as RDF in "Turtle" format) during the next part of the demo to
augment the semantic layer.
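As a rough sketch of what parsing the export involves, the following groups source records under each resolved entity. It assumes each line of the export is a JSON object with a `RESOLVED_ENTITY` element holding an `ENTITY_ID` and a `RECORDS` list; that layout and the sample data are assumptions for illustration, so check them against the actual `sz_er/export.json` file. This is not the library's own parser.

```python
import json
from collections import defaultdict

# Hypothetical sample in the assumed export layout: one JSON object per line.
SAMPLE_EXPORT = "\n".join([
    json.dumps({"RESOLVED_ENTITY": {"ENTITY_ID": 1, "RECORDS": [
        {"DATA_SOURCE": "ACME_BIZ", "RECORD_ID": "1001"},
        {"DATA_SOURCE": "CORP_HOME", "RECORD_ID": "A7"},
    ]}}),
    json.dumps({"RESOLVED_ENTITY": {"ENTITY_ID": 2, "RECORDS": [
        {"DATA_SOURCE": "ORCID", "RECORD_ID": "0000-0001"},
    ]}}),
])


def group_records(export_text: str) -> dict[int, list[tuple[str, str]]]:
    """Map each resolved entity ID to the (data source, record ID) pairs merged into it."""
    merged = defaultdict(list)
    for line in export_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        entity = json.loads(line)["RESOLVED_ENTITY"]
        for rec in entity["RECORDS"]:
            merged[entity["ENTITY_ID"]].append((rec["DATA_SOURCE"], rec["RECORD_ID"]))
    return dict(merged)


entities = group_records(SAMPLE_EXPORT)
print(entities)
```

An entity which lists records from more than one data source is where ER has merged listings across datasets, which is the evidence trail worth keeping.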
Given as input:
- `domain.ttl` -- semantics for the domain context
- `sz_er/export.json` -- a domain-specific thesaurus based on ER
- a list of structured datasets used in ER
- a list of URLs from which to scrape content
The `domain.ttl` file provides a basis for iterating with an ontology
pipeline process, to represent the semantics for the given domain.
It specifies metadata in terms of vocabulary, taxonomy, and
thesaurus -- to use in representing the core entities and relations
in the KG.
The `curate.py` script described below then will introduce the
human-in-the-loop part of this process, where you can review
entities extracted from documents. Based on this analysis, decide
where to refine the domain context to be able to extract,
classify, and connect more of what gets extracted from
unstructured data sources and linked into the KG. Overall, this
process distills elements of the lexical graph, linking them with
elements from the data graph, to produce a more abstracted (i.e.,
less noisy) semantic layer as the resulting KG.
Meanwhile, let's get started. The `build.py` script scrapes text
sources and constructs a knowledge graph plus entity embeddings,
with nodes linked to chunks in a vector store:
poetry run python3 build.py
Demo data used in this case includes articles about the linkage between eating processed red meat frequently and the risks of dementia later in life, based on long-term studies.
The approach in this tutorial iterates through multiple steps to produce the assets needed for GraphRAG downstream:
- Scrape each URL using `requests` and `BeautifulSoup`
- Split the text into chunks
- Build vector embeddings for each chunk, in `LanceDB`
- Parse each text chunk using `spaCy`, iterating per sentence
- Extract entities from each sentence using `GLiNER`
- Build a lexical graph from the parse trees in `NetworkX`
- Run a textrank algorithm to rank important entities
- Build an embedding model for entities using `gensim.Word2Vec`
- Generate an interactive visualization using `PyVis`
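The chunking step in the list above can be sketched in a few lines. This is a simplified stand-in, assuming a regular expression for sentence boundaries and a character budget per chunk; the `chunk_text` helper and its `max_chars` parameter are hypothetical, and the library may well split text differently.

```python
import re


def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Pack whole sentences into chunks of at most roughly max_chars characters.

    A single sentence longer than max_chars becomes its own chunk.
    """
    # Naive sentence split on terminal punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sent in sentences:
        if current and len(current) + 1 + len(sent) > max_chars:
            chunks.append(current)
            current = sent
        else:
            current = f"{current} {sent}".strip()
    if current:
        chunks.append(current)
    return chunks


print(chunk_text("One. Two. Three.", max_chars=9))
```

Keeping sentences intact matters here, since the later parsing and entity extraction steps iterate per sentence within each chunk.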
Note: processing may take a few extra minutes the first time it runs
since `PyTorch` must download a large (~2GB) file.
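To illustrate the ranking step, here is a TextRank-style sketch in plain Python: it links entities which co-occur in a sentence, then runs PageRank by power iteration. Note the simplification: the tutorial builds its lexical graph from parse trees in `NetworkX`, whereas this stand-in uses sentence co-occurrence only, and the function names and sample entities are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(sentences: list[list[str]]) -> dict[str, set[str]]:
    """Link every pair of entities which co-occur in the same sentence."""
    graph: dict[str, set[str]] = defaultdict(set)
    for entities in sentences:
        for a, b in combinations(sorted(set(entities)), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def pagerank(graph: dict[str, set[str]], damping: float = 0.85, iterations: int = 50) -> dict[str, float]:
    """Rank nodes of an undirected graph by power iteration."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {}
        for n in nodes:
            # Each neighbor shares its rank equally among its own neighbors.
            incoming = sum(rank[m] / len(graph[m]) for m in graph[n])
            new_rank[n] = (1.0 - damping) / len(nodes) + damping * incoming
        rank = new_rank
    return rank


# Hypothetical entities extracted per sentence.
sentences = [["red meat", "dementia"], ["red meat", "risk"], ["red meat", "study"]]
rank = pagerank(build_graph(sentences))
print(max(rank, key=rank.get))
```

Entities which connect many parts of the text accumulate rank, which is why the hub entity in this toy example comes out on top.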
The assets get serialized into these files:
- `data/lancedb` -- vector database tables in `LanceDB`
- `data/kg.json` -- serialization of `NetworkX` graph
- `data/sem.csv` -- entity semantics from `curate.py`
- `data/entity.w2v` -- entity embeddings in `Gensim`
- `data/url_cache.sqlite` -- URL cache in `SQLite`
- `kg.html` -- interactive graph visualization in `PyVis`
A good downstream use case for exploring a newly constructed KG is GraphRAG, used for grounding the responses by an LLM in a question/answer chat.
This implementation uses `DSPy` https://dspy.ai/ and leverages
the KG for enhanced GraphRAG by using semantic expansion and
semantic random walks.
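As a hedged illustration of how a random walk over the KG can drive retrieval, the sketch below runs a plain random walk with restart from the entities mentioned in a question, then scores text chunks by how often their linked entities get visited. The graph, the chunk links, and both function names are hypothetical, and this is not the library's own implementation, which may weight edges by their semantics.

```python
import random
from collections import Counter

# Hypothetical KG adjacency and entity-to-chunk links, for illustration only.
KG = {
    "processed red meat": ["dementia", "nitrites"],
    "dementia": ["processed red meat", "cognitive decline"],
    "nitrites": ["processed red meat"],
    "cognitive decline": ["dementia"],
}
ENTITY_CHUNKS = {
    "processed red meat": [0, 1],
    "dementia": [1, 2],
    "nitrites": [3],
    "cognitive decline": [2],
}


def random_walk(graph, seeds, steps=1000, restart=0.2, rng=None):
    """Count node visits during a random walk which restarts at the seed entities."""
    rng = rng or random.Random(0)  # fixed seed keeps the sketch repeatable
    visits = Counter()
    node = rng.choice(seeds)
    for _ in range(steps):
        visits[node] += 1
        if rng.random() < restart or not graph.get(node):
            node = rng.choice(seeds)  # jump back to a seed entity
        else:
            node = rng.choice(graph[node])  # step to a random neighbor
    return visits


def rank_chunks(visits, entity_chunks, top_k=3):
    """Score each chunk by the visit counts of the entities linked to it."""
    scores = Counter()
    for entity, count in visits.items():
        for chunk_id in entity_chunks.get(entity, []):
            scores[chunk_id] += count
    return [chunk_id for chunk_id, _ in scores.most_common(top_k)]


visits = random_walk(KG, seeds=["dementia"])
print(rank_chunks(visits, ENTITY_CHUNKS, top_k=2))
```

The restart probability keeps the walk close to the question's entities, so chunks about nearby concepts rank above chunks which are merely well connected.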
With a bit of imagination, an iteration on this approach could
leverage `DSPy` plus its sister project `MLflow` https://mlflow.org/
to develop much more sophisticated agentic workflows downstream.
To set up, download/install `Ollama` https://ollama.com/ and pull
the `gemma3:12b` model https://huggingface.co/google/gemma-3-12b-it
ollama pull gemma3:12b
Then run the `errag.py` script for an interactive GraphRAG example:
poetry run python3 errag.py
This approach uses a semantic layer -- in other words, a "backbone" for the KG -- to organize the entities and relations which get abstracted from the lexical graph.
For now, run the `curate.py` script to generate a view of the ranked
NER results, serialized as the `data/sem.csv` file. This can be
viewed in a spreadsheet to understand how to iterate on the semantic
definitions for more effective graph organization in the domain of the
scraped documents.
poetry run python3 curate.py
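Besides a spreadsheet, the same review can be scripted. The sketch below lists frequently mentioned entities which have no semantic class yet, as candidates for refining the domain context. The column names (`rank`, `label`, `text`, `count`) and the sample rows are hypothetical; inspect the header of the generated `data/sem.csv` file for the actual columns.

```python
import csv
import io

# Hypothetical sample with assumed column names, for illustration only.
SAMPLE = """rank,label,text,count
0.12,Condition,dementia,14
0.09,,processed red meat,11
0.03,,nitrites,2
"""


def unclassified(csv_text: str, min_count: int = 5) -> list[dict]:
    """Return frequent entities lacking a semantic label, highest rank first."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [
        row for row in reader
        if not row["label"] and int(row["count"]) >= min_count
    ]
    return sorted(rows, key=lambda row: float(row["rank"]), reverse=True)


for row in unclassified(SAMPLE):
    print(row["text"], row["rank"])
```

Entities surfaced this way are the ones where adding a vocabulary term or taxonomy entry to the domain semantics would likely pay off first.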
Objective:
Construct a knowledge graph (KG) using open source libraries where deep learning models provide narrowly-focused point solutions to generate components for a graph: nodes, edges, properties.
These steps define a generalized process, where this tutorial picks up at the lexical graph, without the entity linking (EL) part yet:
Semantic layer:
- Load any semantics for domain context from pre-defined controlled vocabularies, taxonomies, thesauri, ontologies, etc., directly into the KG.
Data graph:
- Load the structured data sources or updates into a data graph.
- Perform entity resolution (ER) on PII extracted from the data graph.
- Blend the ER results into the semantic layer as a "backbone" for structuring the KG.
Lexical graph:
- Parse the text chunks, using lemmatization to normalize token spans.
- Construct a lexical graph from parse trees, e.g., using a textgraph algorithm.
- Analyze named entity recognition (NER) to extract candidate entities from noun phrase spans.
- Analyze relation extraction (RE) to extract relations between pairwise entities.
- Perform entity linking (EL) leveraging the ER results.
- Promote the extracted entities and relations up to the semantic layer.
Of course many vendors suggest using a large language model (LLM) as a one-size-fits-all (OSFA) "black box" approach for extracting entities and generating an entire graph automagically.
However, the business process of resolution -- for both entities and relations -- requires judgements. If the entities getting resolved are low-risk, low-effort in nature, then yeah knock yourself out. If the entities represent people or organizations, these have agency and may take actions when misrepresented in applications which have consequences.
Whenever judgements get delegated to model-based approaches, generalization becomes a form of reasoning employed. When the technology within the model is based on loss functions, then generalization becomes dominant -- regardless of any marketing claims about "AI reasoning" made by tech firms.
Fortunately, decisions can be made without models, even in AI applications. Shock, horror!!! Please, say it isn't so!?! Brace yourselves, using models is a thing, but not the only thing. For more detailed discussion, see:
- Part 1: Let's talk about "Today's AI" https://www.linkedin.com/pulse/lets-talk-todays-ai-paco-nathan-co60c/
- Part 2: Let's talk about "Resolution" https://www.linkedin.com/pulse/lets-talk-resolution-paco-nathan-ryjhc/
Also keep in mind that black box approaches don't work especially well for regulated environments, where audits, explanations, evidence, data provenance, etc., are required.
Moreover, KGs used in mission-critical apps, such as investigations, generally require periodic data updates, so construction isn't a one-step process. By producing a KG based on the approach sketched above, updates can be handled more effectively. Any downstream use cases, such as AI applications, also benefit from improved quality of semantics and representation.
- Q: "Have you tried this with `langextract` yet?"
- A: "I'll take `How does an instructor know a student ignored the README?` from the What is FAFO? category, for $200 ... but yes of course, it's an interesting package, which builds on other interesting work used here. Except that key parts of it miss the point entirely, in ways that only a hyperscaler could possibly fuck up so badly."
- Q: "What the hell is the name of this repo about?"
- A: "As you may have noticed, many open source projects published in this GitHub organization are named in a beautiful language Gymraeg, which English speakers call 'Welsh', where this word `strwythura` translates as the verb structure in English."
- Q: "Why aren't you using an LLM to build the graph instead?"
- A: "I promise to visit you in jail."
- Q: "Um, yeah, like, didn't Karpathy say to use vibe coding, or something? #justsayin"
- A: "Piss the eff off tech bro. Srsly, like yesterday -- you're embarrassing our entire industry with your overly exuberant ignorance."
Experimental: Relation Extraction evaluation
Current Python libraries for relation extraction (RE) are probably best characterized as "experimental research projects".
Their tokenization approaches tend to make the mistake of "throwing the baby out with the bath water" by not leveraging other available information, e.g., what we have in the textgraph representation of the parsed documents. Also, they tend to ignore the semantic constraints of the domain context, while computationally boiling the ocean.
RE libraries which have been evaluated:
- `GLiREL`: https://github.com/jackboyla/GLiREL
- `ReLIK`: https://github.com/SapienzaNLP/relik
- `OpenNRE`: https://github.com/thunlp/OpenNRE
- `mREBEL`: https://github.com/Babelscape/rebel
This project had used `GLiREL` although its results were quite sparse.
The relation extraction will be replaced by `DSPy` workflows in the
near future.
There is some experimental code which illustrates `OpenNRE` evaluation.
Use the `archive/nre.sh` script to load OpenNRE pre-trained models
before running the `archive/opennre.ipynb` notebook.
This may not work in many environments, depending on how well the
`OpenNRE` library is being maintained.
Experimental: Tutorial notebooks
A collection of Jupyter notebooks were used to prototype code. These help illustrate important intermediate steps within these workflows:
.venv/bin/jupyter-lab
- `archive/construct.ipynb` -- detailed KG construction using a lexical graph
- `archive/chunk.ipynb` -- simple example of how to scrape and chunk text
- `archive/vector.ipynb` -- query LanceDB table for text chunk embeddings (after running `build.py`)
- `archive/embed.ipynb` -- query the entity embedding model (after running `build.py`)
These are now archived, though kept available for study.
License and Copyright
Source code for Strwythura plus its logo, documentation, and examples have an MIT license which is succinct and simplifies use in commercial applications.
All materials herein are Copyright © 2024-2025 Senzing, Inc.
Kudos and Attribution
Please use the following BibTeX entry for citing Strwythura if you use it in your research or software. Citations are helpful for the continued development and maintenance of this library.
@software{strwythura,
author = {Paco Nathan},
title = {{Strwythura: construct a knowledge graph from unstructured data sources, organized by results from entity resolution, implementing an enhanced GraphRAG approach, and also implementing an ontology pipeline plus context engineering for optimizing AI application outcomes within a specific domain}},
year = 2024,
publisher = {Senzing},
doi = {10.5281/zenodo.16934079},
url = {https://github.com/DerwenAI/strwythura}
}
Kudos to @louisguitton, @cj2001, @prrao87, @hellovai, @docktermj, @jbutcher21, @brianmacy, and the kind folks at GraphGeeks for their support.