This project is a hands-on exploration of graph databases using Neo4j.
It focuses on modeling, loading, evolving, and querying large-scale research article data, inspired by the DBLP dataset.
The main objectives are:
- Modeling research papers, authors, conferences, journals, keywords, and reviews as a property graph.
- Loading real-world or synthetic data into Neo4j using Cypher and bulk loading techniques.
- Evolving the database model by introducing changes such as reviewer feedback and author affiliations.
- Querying the graph using Cypher queries to extract insights like citation counts, author communities, h-indexes, and impact factors.
- Applying Graph Algorithms (PageRank, Community Detection, etc.) using the Neo4j Graph Data Science library to analyze graph structures.
The project emphasizes clean data modeling, scalable graph instantiation, and meaningful domain-specific graph analysis.
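For a flavor of what this looks like in practice, here is a minimal sketch that creates a tiny example graph and counts citations per paper using the official Neo4j Python driver. The connection details, node labels (`Author`, `Paper`), relationship types (`WROTE`, `CITES`), and sample data are illustrative assumptions, not the project's final schema.

```python
# Minimal sketch: assumed labels/relationships and a local Neo4j instance.
from neo4j import GraphDatabase

URI = "bolt://localhost:7687"          # assumed local instance
AUTH = ("neo4j", "your-password")      # placeholder credentials

CREATE_SAMPLE = """
MERGE (a:Author {name: $author})
MERGE (cited:Paper {title: $cited})
MERGE (citing:Paper {title: $citing})
MERGE (a)-[:WROTE]->(cited)
MERGE (citing)-[:CITES]->(cited)
"""

CITATION_COUNTS = """
MATCH (p:Paper)<-[:CITES]-(citing:Paper)
RETURN p.title AS paper, count(citing) AS citations
ORDER BY citations DESC
"""

with GraphDatabase.driver(URI, auth=AUTH) as driver:
    with driver.session() as session:
        # Create a tiny sample graph: one author, one cited paper, one citing paper.
        session.run(CREATE_SAMPLE,
                    author="Ada Lovelace",
                    cited="Notes on the Analytical Engine",
                    citing="A Follow-up Study")
        # Query citation counts per paper.
        for record in session.run(CITATION_COUNTS):
            print(record["paper"], record["citations"])
```

In the project itself the graph is instantiated in bulk from the DBLP-derived CSV files rather than node by node; the per-node `MERGE` above is only meant to show the shape of the model and of a typical Cypher query.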
Contributors:
Environment handling with uv
`uv` is a new, ultra-fast Python package and virtual environment manager from Astral.

- Install `uv` (via Homebrew):

  `brew install uv`

  After installing, confirm it works:

  `uv --version`
- Initiate the `uv` project (only at the start of the project) or sync it:

  - If there is no `pyproject.toml` and `uv.lock` yet, start a new environment. `uv` replaces both `virtualenv` and `pip`; to create and manage an environment:

    `uv init`

    To activate it:

    `source .venv/bin/activate`

  - If the `uv` project has already been created:

    `uv sync`

    This command creates a `.venv` and installs all required dependencies listed in `pyproject.toml`.
- To install new dependencies or packages:

  `uv add <package-name>==<version>`

  This command automatically adds the new requirement to `pyproject.toml` and `uv.lock` and syncs your dependencies (installs it).

- To remove packages:

  `uv remove <package-name>`

  This command automatically removes the requirement from `pyproject.toml` and `uv.lock` and syncs your dependencies (uninstalls it).
- At the same level as this repository, create a folder called `data`. Inside `data`, create another folder called `files`.
- From the DBLP website, download the raw XML data into the `data` folder. Only the files `dblp.dtd` and `dblp.xml.gz` are needed.
- Extract the `dblp.xml` file from `dblp.xml.gz`.
- Clone this repository and execute the following command from the terminal to convert the `.xml` into `.csv` format for preprocessing:

  `python dblp-to-csv/XMLToCSV.py --annotate --neo4j dblp.xml dblp.dtd files/dblp.csv --relations author:authored_by journal:published_in`
To create and load the final database:
- Make sure the environment is activated (`source .venv/bin/activate`).
- To create the final database, execute and follow the instructions inside `main.ipynb`.
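The actual loading logic lives in `main.ipynb`; as a rough idea of what a bulk-load step can look like, the sketch below streams one of the generated CSVs into Neo4j with `LOAD CSV`. The connection details, CSV file name, column names, and node label are assumptions — adapt them to the files actually produced by `XMLToCSV.py`, and note that `LOAD CSV` reads files relative to the Neo4j server (typically its import directory).

```python
# Rough sketch, not the project's actual loader: main.ipynb drives the real load.
# Assumed file name, columns, and label; the CSV must be readable by the Neo4j
# server (e.g. placed in its import directory) for LOAD CSV to find it.
from neo4j import GraphDatabase

URI = "bolt://localhost:7687"        # assumed local instance
AUTH = ("neo4j", "your-password")    # placeholder credentials

CONSTRAINT = """
CREATE CONSTRAINT paper_key IF NOT EXISTS
FOR (p:Paper) REQUIRE p.key IS UNIQUE
"""

LOAD_PAPERS = """
LOAD CSV WITH HEADERS FROM 'file:///dblp_article.csv' AS row
MERGE (p:Paper {key: row.key})
SET p.title = row.title, p.year = toInteger(row.year)
"""

with GraphDatabase.driver(URI, auth=AUTH) as driver:
    with driver.session() as session:
        session.run(CONSTRAINT)   # uniqueness constraint speeds up MERGE lookups
        session.run(LOAD_PAPERS)  # very large files would be batched instead
```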