
Strwythura


Strwythura is a library and tutorial, based on a presentation about GraphRAG for GraphGeeks on 2024-08-14.

Overview

How to construct a knowledge graph (KG) from unstructured data sources using state-of-the-art (SOTA) models for named entity recognition (NER), then implement an enhanced GraphRAG approach, and curate semantics for optimizing AI app outcomes within a specific domain.

Motivation for this tutorial comes from the stark fact that the term "GraphRAG" means many things, based on multiple conflicting definitions. Several popular implementations reveal a relatively cursory understanding of either natural language processing (NLP) or graph algorithms, plus a vendor bias toward their own query language.

See this article for more details and history: "Unbundling the Graph in GraphRAG".

Instead of delegating KG construction to a large language model (LLM), this tutorial shows how to use sophisticated NLP pipelines based on spaCy, GLiNER, TextRank, and related libraries. Results are better/faster/cheaper, and this provides more control and oversight for intentional arrangement of the KG. Then for downstream usage in a question/answer chat bot, an enhanced GraphRAG approach leverages graph algorithms (e.g., semantic random walks) to optimize retrieval of the text chunks which ultimately get presented to an LLM for summarization to produce responses.

For more detailed discussions, see:

Some key issues regarding KG construction with LLMs don't get addressed much by the graph community, or by the AI community in general:

  1. LLMs tend to mangle cross-domain semantics when used for building graphs; see Mai2024 referenced in the "GraphRAG to enhance LLM-based apps" talk above.
  2. You need to introduce a semantic layer for representing the domain context, which follows more of a neurosymbolic AI approach.
  3. Nearly all LLMs perform question rewriting in ways which cannot be disabled, even when the temperature parameter is set to zero; this leads to relative degrees of "hallucinated questions" for which there are no clear workarounds.
  4. Any model used for prediction introduces reasoning based on generalization, even more so when the model uses a loss function for training; this tends to be the point where KG structure and semantics turn into crap; see the "Let's talk about ..." articles linked below.
  5. The approach outlined here is faster and less expensive, and produces better results than if you'd delegated KG construction to an LLM.

Of course, YMMV.

This approach leverages neurosymbolic AI methods, combining practices from:

  • natural language processing
  • graph data science
  • entity resolution
  • ontology pipeline
  • context engineering
  • human-in-the-loop

Overall, this illustrates a reference implementation for entity-resolved retrieval-augmented generation (ER-RAG).



Usage in applications

This runs with Python 3.11, though the range of versions may be extended soon.

To install from PyPI using pip:

python3 -m pip install strwythura
python3 -m spacy download en_core_web_md

Then to integrate this library within an application:

  1. Copy settings in config.toml into a custom configuration file.
  2. Subclass DomainContext to extend it for the use case.
  3. Define semantics in domain.ttl for the domain context.
  4. Run entity resolution to merge the structured datasets.
  5. Run Ollama, having already downloaded the Gemma3 LLM as described below.
  6. Instantiate new DomainContext, Strwythura, VisHTML, and GraphRAG objects or their subclassed extensions.
  7. ...
  8. Profit

Follow the patterns in the build.py and errag.py example scripts.

Feel free to swap in different spaCy models, different LLMs, etc.
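
As a rough sketch of the integration steps above -- the class names come from this library, while the constructor arguments and method calls shown here are hypothetical, so follow build.py and errag.py for the real signatures:

# a minimal sketch, assuming hypothetical signatures --
# build.py and errag.py show the actual patterns

from strwythura import DomainContext, GraphRAG, Strwythura, VisHTML

class MyDomainContext (DomainContext):
    """Extend the domain context for this use case."""
    pass

ctx = MyDomainContext()                  # hypothetical: loads a custom config.toml
kg = Strwythura(ctx)                     # hypothetical constructor
kg.build()                               # hypothetical: scrape, parse, extract, link
VisHTML(kg).render("kg.html")            # hypothetical: interactive visualization
bot = GraphRAG(kg)                       # hypothetical: DSPy-based chat bot
print(bot.ask("What links red meat and dementia?"))  # hypothetical method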

If you're working with documents in a language other than English, well, that's absolutely fantastic, though you will need to:

  • Update model settings in the config.toml file.
  • Change the spaCy model downloaded here.
  • Also change the language tags used in domain.ttl as needed.
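
For example, a hypothetical excerpt from a customized configuration -- the key names shown here are illustrative, so check the shipped config.toml for the actual settings:

# hypothetical key names -- see the shipped config.toml for the real ones
[models]
spacy_model = "de_core_news_md"    # swapped in for German-language documents
gliner_labels = [ "person", "organisation", "ort" ]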

Set up for demo or development

This library uses poetry for package management, and you'll need to install it to get things running:

poetry update
poetry run python3 -m spacy download en_core_web_md

Demo Part 1: Entity Resolution

Run entity resolution (ER) to produce entities and relations from structured data sources, which tend to be more reliable than those extracted from unstructured content.

What does this ER step buy us? ER allows us to merge multiple structured datasets, even without consistent foreign keys being available, producing an overlay of entities and relations among them. This is quite useful as a "backbone" for constructing a KG. Moreover, when judgements get made from the KG about people or organizations, ER provides accountability for the merge decisions.

This approach becomes especially important in the public sector, healthcare, banking, insurance -- i.e., in use cases where you might need to "send flowers" when automated judgements about identity go wrong. For example, someone gets denied a loan, has a medical insurance claim blocked, gets a tax audit, has their voter registration voided, becomes the subject of an arrest warrant, and so on.

In other words, people and organizations tend to take legal action when someone else causes them harm by mangling identity management. You'll want an audit trail of decisions based on evidence, whenever your software systems make these kinds of judgements.

For the domain context in this tutorial, say we have two hypothetical datasets which provide business directory listings:

  • sz_er/acme_biz.json -- "ACME Business Directory"
  • sz_er/corp_home.json -- "Corporates Home UK"

Plus we have slices from datasets which provide listings about researchers and scientific authors:

  • sz_er/orcid.json -- ORCID
  • sz_er/scopus.json -- Scopus

The JSONL format of these datasets is based on data mapping, i.e., providing the entity resolution process with heuristics about the features available in each structured dataset.

These four datasets get merged using ER, and the results produce a domain-specific thesaurus. This thesaurus generates instances of graph elements: entities, relations, properties. We'll blend this into the semantic layer used for organizing the KG later.
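
For instance, a single record in this JSONL format might look like the following -- the attribute names follow the Senzing generic entity specification, though the exact fields used in these files may differ:

{"DATA_SOURCE": "ACME_BIZ", "RECORD_ID": "1001", "RECORD_TYPE": "ORGANIZATION", "NAME_ORG": "Acme Widgets Ltd", "ADDR_FULL": "10 High Street, London, UK"}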

The following steps are optional, since these ER results have already been pre-computed and provided in the sz_er/export.json file. To reproduce them, run Senzing using the following steps.

The Senzing SDK runs in Python or Java, and can also be run in batch using a container from DockerHub:

docker pull senzing/demo-senzing

Once this container is available, run:

docker run -it --rm --volume ./sz_er:/tmp/data senzing/demo-senzing

This brings up a Linux command-line prompt (I have no name!) with the local subdirectory sz_er mapped to the /tmp/data directory. Type the following commands for batch ER into the command-line prompt.

Set up the Senzing configuration for merging these datasets:

G2ConfigTool.py

Within the configuration tool, register the names of the data sources being used:

addDataSource ACME_BIZ
addDataSource CORP_HOME
addDataSource ORCID
addDataSource SCOPUS
save
exit

Load each file and run ER on its data records:

G2Loader.py -f /tmp/data/acme_biz.json
G2Loader.py -f /tmp/data/corp_home.json
G2Loader.py -f /tmp/data/orcid.json
G2Loader.py -f /tmp/data/scopus.json

Export the ER results to the sz_er/export.json file, then exit the container:

G2Export.py -F JSON -o /tmp/data/export.json
exit

This export later gets parsed during the next part of the demo to produce the data/thesaurus.ttl file (RDF in "Turtle" format), augmenting the semantic layer.
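
As a hedged sketch of that parsing step -- this assumes the typical shape of Senzing JSON export records, and the SKOS predicates chosen here are illustrative rather than what the actual parsing emits:

import json
from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import SKOS

ENT = Namespace("https://example.org/entity/")   # hypothetical namespace
g = Graph()
g.bind("skos", SKOS)

with open("sz_er/export.json", encoding = "utf-8") as fp:
    for line in fp:   # one resolved entity per line
        ent = json.loads(line)["RESOLVED_ENTITY"]
        iri = ENT[str(ent["ENTITY_ID"])]
        g.add((iri, RDF.type, SKOS.Concept))
        g.add((iri, SKOS.prefLabel, Literal(ent.get("ENTITY_NAME", ""), lang = "en")))

        # keep provenance for each merged source record
        for rec in ent.get("RECORDS", []):
            g.add((iri, SKOS.note, Literal(f'{rec["DATA_SOURCE"]}:{rec["RECORD_ID"]}')))

g.serialize("data/thesaurus.ttl", format = "turtle")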

Demo Part 2: Build Assets

Given as input:

  • domain.ttl -- semantics for the domain context
  • sz_er/export.json -- a domain-specific thesaurus based on ER
  • a list of structured datasets used in ER
  • a list of URLs from which to scrape content

The domain.ttl file provides a basis for iterating with an ontology pipeline process, to represent the semantics for the given domain. It specifies metadata in terms of a vocabulary, taxonomy, and thesaurus -- used to represent the core entities and relations in the KG.
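
To give a sense of what this looks like, here is a tiny hypothetical fragment in Turtle using SKOS, loosely matching the demo domain -- the actual domain.ttl defines its own vocabulary:

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:   <https://example.org/domain#> .

ex:ProcessedMeat a skos:Concept ;
    skos:prefLabel "processed red meat"@en ;
    skos:altLabel "processed meat"@en .

ex:Dementia a skos:Concept ;
    skos:prefLabel "dementia"@en ;
    skos:related ex:ProcessedMeat .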

The curate.py script described below then introduces the human-in-the-loop part of this process, where you can review entities extracted from documents. Based on this analysis, decide where to refine the domain context, so that more of what gets pulled from unstructured data sources can be extracted, classified, connected, and linked into the KG. Overall, this process distills elements of the lexical graph, linking them with elements from the data graph, to produce a more abstract (i.e., less noisy) semantic layer as the resulting KG.

Meanwhile, let's get started. The build.py script scrapes text sources and constructs a knowledge graph plus entity embeddings, with nodes linked to chunks in a vector store:

poetry run python3 build.py

Demo data used in this case includes articles about the linkage between frequently eating processed red meat and the risks of dementia later in life, based on long-term studies.

The approach in this tutorial iterates through multiple steps to produce the assets needed for GraphRAG downstream:

  1. Scrape each URL using requests and BeautifulSoup
  2. Split the text into chunks
  3. Build vector embeddings for each chunk, in LanceDB
  4. Parse each text chunk using spaCy, iterating per sentence
  5. Extract entities from each sentence using GLiNER
  6. Build a lexical graph from the parse trees in NetworkX
  7. Run a textrank algorithm to rank important entities
  8. Build an embedding model for entities using gensim.Word2Vec
  9. Generate an interactive visualization using PyVis
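
As a condensed sketch of steps 4 through 8 above -- the GLiNER model name and the entity labels here are assumptions, and build.py remains the authoritative version:

import networkx as nx
import spacy
from gensim.models import Word2Vec
from gliner import GLiNER

nlp = spacy.load("en_core_web_md")
ner = GLiNER.from_pretrained("urchade/gliner_medium-v2.1")   # assumed model
labels = [ "person", "organization", "food", "disease" ]     # assumed labels

g = nx.Graph()
chunk = "Eating processed red meat frequently may raise the risk of dementia."

for sent in nlp(chunk).sents:   # step 4: iterate per sentence
    ents = ner.predict_entities(sent.text, labels, threshold = 0.5)   # step 5
    names = [ ent["text"].lower() for ent in ents ]
    g.add_nodes_from(names)   # step 6: lexical graph (greatly simplified)
    g.add_edges_from(   # connect entities which co-occur within a sentence
        (a, b) for i, a in enumerate(names) for b in names[i + 1:]
    )

ranks = nx.pagerank(g)   # step 7: textrank-style ranking via PageRank
w2v = Word2Vec(   # step 8: entity embeddings from graph neighborhoods
    [ list(g.neighbors(n)) + [n] for n in g.nodes ],
    vector_size = 32,
    min_count = 1,
)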

Note: processing may take a few extra minutes the first time it runs, since PyTorch must download a large (~2GB) file.

The assets get serialized into these files:

  • data/lancedb -- vector database tables in LanceDB
  • data/kg.json -- serialization of NetworkX graph
  • data/sem.csv -- entity semantics from curate.py
  • data/entity.w2v -- entity embeddings in Gensim
  • data/url_cache.sqlite -- URL cache in SQLite
  • kg.html -- interactive graph visualization in PyVis

Demo Part 3: Enhanced GraphRAG chat bot

A good downstream use case for exploring a newly constructed KG is GraphRAG, used for grounding the responses of an LLM in a question/answer chat.

This implementation uses DSPy (https://dspy.ai/) and leverages the KG for enhanced GraphRAG through semantic expansion and semantic random walks.

With a bit of imagination, an iteration on this approach could leverage DSPy plus its sister project MLflow (https://mlflow.org/) to develop much more sophisticated agentic workflows downstream.
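
To illustrate what a semantic random walk means in practice, here is a toy sketch of the general technique -- not the implementation in errag.py, and the node attributes "sem" and "chunks" are hypothetical:

import random
import networkx as nx

def semantic_random_walk (g: nx.Graph, seeds: list, steps: int = 20) -> set:
    """Collect chunk IDs near seed entities, biasing hops toward semantic-layer links."""
    chunks: set = set()
    node = random.choice(seeds)

    for _ in range(steps):
        neigh = list(g.neighbors(node))
        if not neigh:
            break
        # hypothetical attributes: "sem" marks nodes promoted to the semantic
        # layer, while "chunks" lists the vector-store chunks linked to a node
        weights = [ 2.0 if g.nodes[n].get("sem") else 1.0 for n in neigh ]
        node = random.choices(neigh, weights = weights, k = 1)[0]
        chunks.update(g.nodes[node].get("chunks", []))

    return chunks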

To set up, download/install Ollama (https://ollama.com/) then pull the gemma3:12b model (https://huggingface.co/google/gemma-3-12b-it):

ollama pull gemma3:12b

Then run the errag.py script for an interactive GraphRAG example:

poetry run python3 errag.py
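
Under the hood, wiring DSPy to a local Ollama model looks roughly like the following minimal sketch, where retrieve_chunks() is a hypothetical stand-in for the KG-based retrieval described above:

import dspy

def retrieve_chunks (question: str) -> str:
    # hypothetical stand-in: in errag.py, this is where semantic expansion
    # and random walks over the KG select text chunks from LanceDB
    return "...text chunks retrieved from the KG and vector store..."

lm = dspy.LM("ollama_chat/gemma3:12b", api_base = "http://localhost:11434", api_key = "")
dspy.configure(lm = lm)

qa = dspy.ChainOfThought("context, question -> answer")
question = "Does eating processed red meat affect dementia risk?"
print(qa(context = retrieve_chunks(question), question = question).answer)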

Demo Part 4: Curating an Ontology Pipeline

This approach uses a semantic layer -- in other words, a "backbone" for the KG -- to organize the entities and relations which get abstracted from the lexical graph.

For now, run the curate.py script to generate a view of the ranked NER results, serialized as the data/sem.csv file. This can be viewed in a spreadsheet to understand how to iterate on the semantic definitions for more effective graph organization in the domain of the scraped documents.

poetry run python3 curate.py
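
A spreadsheet works fine; alternately, a couple of lines of pandas will do -- the column name used for sorting here is hypothetical, so check the actual header row in data/sem.csv:

import pandas as pd

df = pd.read_csv("data/sem.csv")
# hypothetical column name -- adjust to whatever curate.py actually writes
print(df.sort_values("rank", ascending = False).head(20))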


Unbundling GraphRAG

Objective:

Construct a knowledge graph (KG) using open source libraries where deep learning models provide narrowly-focused point solutions to generate components for a graph: nodes, edges, properties.

These steps define a generalized process, where this tutorial picks up at the lexical graph, without the entity linking (EL) part yet:

Semantic layer:

  1. Load any semantics for domain context from pre-defined controlled vocabularies, taxonomies, thesauri, ontologies, etc., directly into the KG.

Data graph:

  1. Load the structured data sources or updates into a data graph.
  2. Perform entity resolution (ER) on PII extracted from the data graph.
  3. Blend the ER results into the semantic layer as a "backbone" for structuring the KG.

Lexical graph:

  1. Parse the text chunks, using lemmatization to normalize token spans.
  2. Construct a lexical graph from parse trees, e.g., using a textgraph algorithm.
  3. Analyze named entity recognition (NER) to extract candidate entities from noun phrase spans.
  4. Analyze relation extraction (RE) to extract relations between pairwise entities.
  5. Perform entity linking (EL) leveraging the ER results.
  6. Promote the extracted entities and relations up to the semantic layer.
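
As a hedged sketch of step 5 above, linking extracted surface forms against the ER-derived thesaurus by label matching -- an illustration of the idea, not this repo's implementation:

from rdflib import Graph
from rdflib.namespace import SKOS

thesaurus = Graph().parse("data/thesaurus.ttl", format = "turtle")

# map each preferred label back to its resolved-entity IRI
label_map = {
    str(label).lower(): iri
    for iri, label in thesaurus.subject_objects(SKOS.prefLabel)
}

def link_entity (surface_form: str):
    """Return the resolved-entity IRI for an extracted span, if known."""
    return label_map.get(surface_form.lower())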

Of course many vendors suggest using a large language model (LLM) as a one-size-fits-all (OSFA) "black box" approach for extracting entities and generating an entire graph automagically.

However, the business process of resolution -- for both entities and relations -- requires judgements. If the entities getting resolved are low-risk and low-effort in nature, then yeah, knock yourself out. But when the entities represent people or organizations, they have agency and may take action when misrepresented in applications which carry real consequences.

Whenever judgements get delegated to model-based approaches, generalization becomes a form of reasoning employed. When the technology within the model is based on loss functions, then generalization becomes dominant -- regardless of any marketing claims about "AI reasoning" made by tech firms.

Fortunately, decisions can be made without models, even in AI applications. Shock, horror!!! Please, say it isn't so!?! Brace yourselves, using models is a thing, but not the only thing. For more detailed discussion, see:

Also keep in mind that black box approaches don't work especially well for regulated environments, where audits, explanations, evidence, data provenance, etc., are required.

Moreover, KGs used in mission-critical apps, such as investigations, generally require periodic data updates, so construction isn't a one-step process. By producing a KG based on the approach sketched above, updates can be handled more effectively. Any downstream use cases, such as AI applications, also benefit from improved quality of semantics and representation.

FAQ

Q:
"Have you tried this with langextract yet?"
A:
"I'll take How does an instructor know a student ignored the README? from the What is FAFO? category, for $200 ... but yes of course, it's an interesting package, which builds on other interesting work used here. Except that key parts of it miss the point entirely, in ways that only a hyperscaler could possibly fuck up so badly."
Q:
"What the hell is the name of this repo about?"
A:
"As you may have noticed, many open source projects published in this GitHub organization are named in a beautiful language Gymraeg, which English speakers call 'Welsh', where this word strwythura translates as the verb structure in English."
Q:
"Why aren't you using an LLM to build the graph instead?"
A:
"I promise to visit you in jail."
Q:
"Um, yeah, like, didn't Karpathy say to use vibe coding, or something? #justsayin"
A:
"Piss the eff off tech bro. Srsly, like yesterday -- you're embarrassing our entire industry with your overly exuberant ignorance."
Experimental: Relation Extraction evaluation

Current Python libraries for relation extraction (RE) are probably best characterized as "experimental research projects".

Their tokenization approaches tend to make the mistake of "throwing the baby out with the bath water" by not leveraging other available information, e.g., what we have in the textgraph representation of the parsed documents. Also, they tend to ignore the semantic constraints of the domain context, while computationally boiling the ocean.

RE libraries which have been evaluated so far include GLiREL and OpenNRE.

This project had used GLiREL, although its results were quite sparse. The relation extraction step will be replaced by DSPy workflows in the near future.

There is some experimental code which illustrates OpenNRE evaluation. Use the archive/nre.sh script to load OpenNRE pre-trained models before running the archive/opennre.ipynb notebook.

This may not work in many environments, depending on how well the OpenNRE library is being maintained.

Experimental: Tutorial notebooks

A collection of Jupyter notebooks was used to prototype code. These help illustrate important intermediate steps within these workflows:

.venv/bin/jupyter-lab
  • `archive/construct.ipynb` -- detailed KG construction using a lexical graph
  • `archive/chunk.ipynb` -- simple example of how to scrape and chunk text
  • `archive/vector.ipynb` -- query LanceDB table for text chunk embeddings (after running `build.py`)
  • `archive/embed.ipynb` -- query the entity embedding model (after running `build.py`)

These are now archived, though kept available for study.

License and Copyright

Source code for Strwythura, plus its logo, documentation, and examples, is available under an MIT license, which is succinct and simplifies use in commercial applications.

All materials herein are Copyright © 2024-2025 Senzing, Inc.

Kudos and Attribution

Please use the following BibTeX entry for citing Strwythura if you use it in your research or software. Citations are helpful for the continued development and maintenance of this library.

@software{strwythura,
  author = {Paco Nathan},
  title = {{Strwythura: construct a knowledge graph from unstructured data sources, organized by results from entity resolution, implementing an enhanced GraphRAG approach, and also implementing an ontology pipeline plus context engineering for optimizing AI application outcomes within a specific domain}},
  year = 2024,
  publisher = {Senzing},
  doi = {10.5281/zenodo.16934079},
  url = {https://github.com/DerwenAI/strwythura}
}

Kudos to @louisguitton, @cj2001, @prrao87, @hellovai, @docktermj, @jbutcher21, @brianmacy, and the kind folks at GraphGeeks for their support.
