This notebook illustrates how to use knowledge graphs (KGs) to understand an unfamiliar codebase.
KGs are ideally suited for codebases because they are designed to piece together connected data.
Using a teardown of Zotero, a popular open-source reference management application, as the example, the resulting KGs are split into 3 separate sections:
- Data KG - created by ingesting an RDBMS schema
  - easily identifies the shortest join path between any 2 (or more) tables
  - an approach that works with a database of any size
- Application KG - created by using:
  - the Abstract Syntax Tree to extract function and parameter names, and
  - lexical search to connect different file types in the repo
  - files of interest are then sent to an LLM for a natural language explanation
- Business Domain KG - illustrates how to ingest a public ontology to tie business concepts to content
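
The "shortest join path" idea behind the Data KG can be illustrated without a graph database. The sketch below is not the notebook's Neo4j implementation: it swaps in a plain breadth-first search over foreign keys read from SQLite with `PRAGMA foreign_key_list`. The schema, the table names, and the helper functions `build_fk_graph` and `shortest_join_path` are hypothetical and chosen only for illustration; they are not Zotero's actual schema.

```python
import sqlite3
from collections import defaultdict, deque

# Hypothetical, simplified schema for illustration only (not Zotero's real schema).
SCHEMA = """
CREATE TABLE items (itemID INTEGER PRIMARY KEY);
CREATE TABLE creators (creatorID INTEGER PRIMARY KEY);
CREATE TABLE collections (collectionID INTEGER PRIMARY KEY);
CREATE TABLE itemCreators (
    itemID INTEGER REFERENCES items(itemID),
    creatorID INTEGER REFERENCES creators(creatorID)
);
CREATE TABLE collectionItems (
    collectionID INTEGER REFERENCES collections(collectionID),
    itemID INTEGER REFERENCES items(itemID)
);
"""


def build_fk_graph(conn):
    """Return an undirected graph: tables are nodes, foreign keys are edges."""
    graph = defaultdict(set)
    tables = [
        row[0]
        for row in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")
    ]
    for table in tables:
        # Each row is (id, seq, referenced_table, from_col, to_col, ...).
        for fk in conn.execute(f'PRAGMA foreign_key_list("{table}")'):
            parent = fk[2]
            graph[table].add(parent)
            graph[parent].add(table)
    return graph


def shortest_join_path(graph, start, goal):
    """Breadth-first search; returns the list of tables to join, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in sorted(graph[path[-1]]):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
graph = build_fk_graph(conn)
path = shortest_join_path(graph, "creators", "collections")
print(" -> ".join(path))
# creators -> itemCreators -> items -> collectionItems -> collections
```

Because the search only follows foreign-key edges, the cost grows with the number of tables and relationships rather than with the number of rows, which is consistent with the claim that the approach works for a database of any size.
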

To run the notebook, set up the following:

- Install Zotero's desktop application
  - for access to the SQLite RDBMS
- Install Neo4j Community edition
- Access to an LLM
  - the example uses deepseek-coder-v2:16B running locally via Ollama; the author has used OpenAI APIs in previous iterations
- Generate a classic GitHub Personal Access Token
  - to use the GitHub code search API for its lexical search capability
- Sample `.env`:
NEO4J_URI=bolt://localhost:7687
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=pw-for-your-neo4j-dbms
GITHUB_TOKEN=from-github
- Install Python packages from `Pipfile`:
>pip install pipenv
>pipenv install
>pipenv shell
>pipenv graph
- Install the NodeJS `babel` packages from `package.json`
  - NodeJS is needed to traverse the JavaScript ASTs and extract functions, params, etc.
>npm install
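
Before running the notebook, it can help to confirm that the `.env` file contains every key from the sample above. The following is a minimal, standard-library-only sketch; the helpers `parse_env` and `missing_keys` are hypothetical names introduced here, and the notebook itself may load these values differently (for example, through a dotenv library).

```python
from pathlib import Path

# Keys taken from the sample .env shown above.
REQUIRED_KEYS = ("NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD", "GITHUB_TOKEN")


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may contain "=".
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")
    values = parse_env(env_path.read_text()) if env_path.exists() else {}
    missing = missing_keys(values)
    print("missing keys:", ", ".join(missing) if missing else "none")
```

Run it from the repository root, next to the `.env` file; any key it reports as missing should be added before connecting to Neo4j or calling the GitHub API.
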
For questions, suggestions, or collaborations, feel free to:
- Open an Issue
- Email me: george@mcmc-capital.com
- Connect on LinkedIn