
Creating high-quality knowledge graphs using Kùzu and Senzing

This repo contains example code that shows how to create high-quality knowledge graphs from heterogeneous data sources using Kùzu, an embedded, open-source graph database, and Senzing, an SDK for entity resolution.

KGC 2025 workshop

For the KGC 2025 workshop, jointly hosted by Senzing and Kùzu, you can download the slides from the following links:

Background

The workshop demonstrates an investigative graph analysis based on patterns of bad-actor tradecraft. By connecting "risk" data and "link" data within a graph, we can show patterns of tradecraft such as money laundering, tax evasion, money mules, and so on. We'll use "slices" of datasets from the following open data providers:

OpenSanctions

OpenSanctions provides the "risk" category of data: people and organizations who are known risks for financial crime (FinCrime). There is also the yente API, which provides HTTP endpoints based on the FollowTheMoney data model used for investigations and OSINT.

Open Ownership

Open Ownership provides the "link" category of data. This describes ultimate beneficial ownership (UBO) details: "Who owns how much of what, and who actually has controlling interest?" There's also the Beneficial Ownership Data Standard (BODS), an open standard providing guidance for collecting, sharing, and using high-quality beneficial ownership data to support corporate ownership transparency.

Recently, Open Ownership has partnered with GLEIF to launch the Global Open Data Integration Network (GODIN), which promotes open standards worldwide for interoperability among the kinds of datasets used to investigate transnational corruption.

There is also a repository of these datasets, already formatted for use in Senzing, at https://www.opensanctions.org/docs/bulk/senzing/, although the full sources are quite large to download.

Dataset

For the purposes of this tutorial, we've selected "slices" of data from OpenSanctions and Open Ownership which connect to produce interesting subgraphs that illustrate patterns of bad-actor tradecraft.

Follow the instructions in the data/README.md file to download the required data and inspect the JSON files to get an idea of their contents.
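A quick way to inspect the downloaded files is to pretty-print the first few records. This is a minimal sketch assuming the files are JSON Lines (one JSON object per line), as Senzing's loader expects; the sample record and its field names below are invented for illustration, so compare against your actual downloads.

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

def peek(path, n=1):
    """Parse and return the first n records of a JSON Lines file."""
    records = []
    with open(path, encoding="utf-8") as f:
        for _, line in zip(range(n), f):
            records.append(json.loads(line))
    return records

# Demo on a throwaway file; swap in data/open-sanctions.json after downloading.
with TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.json"
    sample.write_text('{"DATA_SOURCE": "OPEN-SANCTIONS", "RECORD_ID": "Q42"}\n')
    first = peek(sample)[0]
    print(json.dumps(first, indent=2))
```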

Tools

We will be using Kùzu as the graph database and Senzing as the entity resolution engine. Docker is used to run both the Senzing SDK and Kùzu Explorer, a web-based UI for Kùzu. Visit the websites to see further instructions for each tool:

Important: You will need to have Docker downloaded and installed on your laptop to run this tutorial. Then we will run the Senzing SDK within a Docker container and load Kùzu as a Python package.

Setup

Set up a local Python environment in order to run the workshop steps.

Option 1: uv (recommended)

Use these instructions to install uv for your OS.

Next clone the GitHub repo to your laptop:

git clone https://github.com/kuzudb/kgc-2025-workshop-high-quality-graphs.git
cd kgc-2025-workshop-high-quality-graphs

Then use uv to install the Python library dependencies:

uv sync

Or use uv to install based on the requirements.txt file:

uv pip install -r requirements.txt

Option 2: pip (fallback)

If you don't want to use uv, you can use pip to install the dependencies through the requirements.txt file:

pip install -r requirements.txt

Running the Senzing container

To run the entity resolution pipeline, we will launch Senzing in Docker, with the data directory mounted as an external volume, and attach to a shell prompt inside the container:

docker run -it --rm --volume ./data:/tmp/data senzing/demo-senzing

This image uses https://github.com/Senzing/senzingapi-tools as its Docker base layer, which includes a set of Python utilities sourced from the https://github.com/senzing-garage/ public repos on GitHub. These are located in the /opt/senzing/g2/python directory within the container.

First among these, we'll run the Senzing configuration tool to create a namespace for the data sources which we'll load later:

G2ConfigTool.py

When you get a (g2cfg) prompt, register the two data sources you downloaded above. Each record in these datasets carries an identifier naming its source, either "OPEN-SANCTIONS" or "OPEN-OWNERSHIP":

addDataSource OPEN-SANCTIONS
addDataSource OPEN-OWNERSHIP
save

When the tool prompts save changes? (y/n), reply y and hit enter, then type exit to return to the shell prompt.

Now we load the two datasets, which are mounted from your laptop file system:

G2Loader.py -f /tmp/data/open-sanctions.json
G2Loader.py -f /tmp/data/open-ownership.json

Senzing runs entity resolution as records are loaded. Then we can export the entity resolution results as a JSON file:

G2Export.py -F JSON -o /tmp/data/export.json
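The export is line-delimited JSON, one resolved entity per line. The exact schema varies by Senzing version, so the RESOLVED_ENTITY, RECORDS, DATA_SOURCE, and RECORD_ID field names in this sketch are assumptions to check against your own export.json; the two sample lines are invented for illustration. The sketch maps each resolved entity to the source records Senzing merged into it:

```python
import json

# Two invented export lines for illustration; real exports from G2Export.py
# contain many more fields per entity.
SAMPLE_EXPORT = """\
{"RESOLVED_ENTITY": {"ENTITY_ID": 1, "RECORDS": [{"DATA_SOURCE": "OPEN-SANCTIONS", "RECORD_ID": "os-1"}, {"DATA_SOURCE": "OPEN-OWNERSHIP", "RECORD_ID": "oo-9"}]}}
{"RESOLVED_ENTITY": {"ENTITY_ID": 2, "RECORDS": [{"DATA_SOURCE": "OPEN-OWNERSHIP", "RECORD_ID": "oo-3"}]}}
"""

def entity_records(lines):
    """Map each resolved entity ID to the (source, record) pairs merged into it."""
    merged = {}
    for line in lines:
        if not line.strip():
            continue
        entity = json.loads(line)["RESOLVED_ENTITY"]
        merged[entity["ENTITY_ID"]] = [
            (r["DATA_SOURCE"], r["RECORD_ID"]) for r in entity["RECORDS"]
        ]
    return merged

merged = entity_records(SAMPLE_EXPORT.splitlines())
# Entity 1 resolved records from both data sources into a single identity.
print(merged)
```

An entity whose record list spans both data sources is exactly the kind of cross-dataset link that makes the graph interesting.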

Finally, exit the container to return to your laptop environment:

exit

Running the workflow

The workshop steps are implemented in the create_graph.ipynb notebook. A Python script version is also provided in the create_graph.py file if you want to run the workflow without the Jupyter notebook.

The following files contain utility functions for the sequence of preprocessing steps required to create the graph:

  • open_sanctions.py: Handles the processing of the OpenSanctions data.
  • open_ownership.py: Handles the processing of the Open Ownership data.
  • process_senzing.py: Handles the processing of the entity resolution export from Senzing.

The preprocessing, graph creation, and exploration steps are run from the following files:

  • create_graph.ipynb: Runs the preprocessing steps, creates the graph, and performs some basic exploration and visualization.
  • create_graph.py: Contains the same functionality as the notebook above, though as a Python script.

To launch the create_graph.ipynb notebook in JupyterLab, run the following commands from the root directory of this repo:

source .venv/bin/activate
.venv/bin/jupyter-lab

Further visual exploration of the graph can be done using the Kùzu Explorer UI, whose steps are described below.

Graph visualization in Kùzu Explorer

To visualize the graph in Kùzu using its browser-based UI, Kùzu Explorer, run the following commands from this root directory where the docker-compose.yml file is:

docker compose up

Alternatively, you can type in the following command in your terminal:

docker run -p 8000:8000 \
           -v ./db:/database \
           -e MODE=READ_WRITE \
           --rm kuzudb/explorer:latest

This will download and run the Kùzu Explorer image, after which you can access the UI at http://localhost:8000.

Make sure that the mounted database path matches the Kùzu database directory created by the code!

In the Explorer UI, enter the following Cypher query in the shell editor to visualize the graph:

MATCH (a:Entity)-[b*1..3]->(c)
RETURN *
LIMIT 100

Optional: NetworkX

The create_graph.ipynb notebook also contains an optional step to convert the Kùzu graph to a NetworkX graph. We run a NetworkX graph algorithm called Betweenness Centrality to find the most important nodes in the graph.

Victor Nyland Poulsen is the entity in the graph with the highest betweenness centrality.

id         descrip                  betweenness_centrality
sz_100036  Victor Nyland Poulsen    0.002753
sz_100225  Daniel Symmons           0.002251
sz_100003  Kenneth Kurt Hansen      0.001314
sz_100092  Daniel Lee Symons        0.001273
sz_100023  Rudolf Esser             0.001176
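Betweenness centrality counts how often a node lies on shortest paths between other nodes, so nodes that bridge otherwise-separate clusters score highest. In the notebook this is a single NetworkX call; the standard-library sketch below implements the same idea (Brandes' algorithm, unweighted and unnormalized) on an invented toy graph where node "b" is the only bridge, just to make the intuition concrete.

```python
from collections import deque

def betweenness(graph):
    """Brandes' algorithm: unweighted, unnormalized betweenness centrality.

    graph: dict mapping node -> list of neighbour nodes.
    """
    bc = {v: 0.0 for v in graph}
    for s in graph:
        stack = []
        preds = {v: [] for v in graph}   # predecessors on shortest paths
        sigma = {v: 0 for v in graph}    # number of shortest paths from s
        dist = {v: -1 for v in graph}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:                     # BFS from s
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in graph}
        while stack:                     # accumulate dependencies in reverse
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc

# Toy undirected graph: "b" bridges "a" and the {"c", "d"} cluster.
edges = [("a", "b"), ("b", "c"), ("b", "d"), ("c", "d")]
graph = {v: [] for e in edges for v in e}
for u, v in edges:
    graph[u].append(v)
    graph[v].append(u)

scores = betweenness(graph)
print(max(scores, key=scores.get))  # the bridge node, "b"
```

In the workshop graph, the top-ranked entities above play the same bridging role across the merged OpenSanctions and Open Ownership records.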

The visualization shown uses the circular layout in yFiles to represent a large number of relationships more compactly. Check out the notebook and try more graph visualizations and algorithms to further analyze the data!