Graph Database Modeling and Analysis with Neo4j — Modeling, loading, and querying research article data using Cypher and graph algorithms.

romj99/big_data_management_01

BDM - Lab 01 - Graph Database using Neo4j

This project is a hands-on exploration of graph databases using Neo4j.
It focuses on modeling, loading, evolving, and querying large-scale research article data, inspired by the DBLP dataset.
The main objectives are:

  • Modeling research papers, authors, conferences, journals, keywords, and reviews as a property graph.
  • Loading real-world or synthetic data into Neo4j using Cypher and bulk loading techniques.
  • Evolving the database model by introducing changes such as reviewer feedback and author affiliations.
  • Querying the graph using Cypher queries to extract insights like citation counts, author communities, h-indexes, and impact factors.
  • Applying Graph Algorithms (PageRank, Community Detection, etc.) using the Neo4j Graph Data Science library to analyze graph structures.

The project emphasizes clean data modeling, scalable graph instantiation, and meaningful domain-specific graph analysis.
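One of the metrics listed above, the h-index, can also be stated outside the database. The following is a minimal Python sketch, assuming the per-paper citation counts of an author have already been retrieved (for example with a Cypher query). The function name `h_index` is illustrative and not taken from the project.

```python
def h_index(citation_counts):
    """Return the largest h such that h papers have at least h citations each."""
    # Rank papers from most cited to least cited.
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, citations in enumerate(ranked, start=1):
        if citations >= rank:
            h = rank  # the top `rank` papers all have >= rank citations
        else:
            break
    return h
```

For an author whose papers have 10, 8, 5, 4, and 3 citations, `h_index([10, 8, 5, 4, 3])` returns 4, because four papers have at least four citations each.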

Environment handling with uv

uv is a fast Python package and virtual environment manager from Astral.

  1. Install uv (macOS/Linux)

    brew install uv

    After installing, confirm it works:

    uv --version
  2. Initialize the uv project (only needed once, at the start of the project) or sync it:

    • If the project has neither a pyproject.toml nor a uv.lock, start a new project:

      uv covers the roles of both virtualenv and pip. To initialize the project:

      uv init

      To activate the virtual environment once .venv exists:

      source .venv/bin/activate
    • If the uv project has already been created:

      uv sync

      This command creates a .venv (if one does not exist yet) and installs all dependencies listed in pyproject.toml.

  3. To install new dependencies or packages:

    uv add <package-name>==<version>

    This command adds the new requirement to pyproject.toml and uv.lock, then syncs the environment so that the package is installed.

  4. Remove packages

    uv remove <package-name>

    This command removes the requirement from pyproject.toml and uv.lock, then syncs the environment so that the package is uninstalled.
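The choice in step 2 between initializing and syncing can be sketched as a small check. This is only an illustration, assuming the rule described above (no pyproject.toml and no uv.lock means a new project). The helper name `suggest_uv_command` is hypothetical and is not part of uv or of this repository.

```python
from pathlib import Path


def suggest_uv_command(project_dir="."):
    """Suggest which uv command to run first in the given project directory."""
    project = Path(project_dir)
    has_pyproject = (project / "pyproject.toml").is_file()
    has_lock = (project / "uv.lock").is_file()
    # Neither file present: the uv project has not been created yet.
    if not has_pyproject and not has_lock:
        return "uv init"
    # Otherwise the project already exists and only needs its environment synced.
    return "uv sync"
```

Calling `suggest_uv_command()` from the repository root returns the command as a string; it does not run anything.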

DBLP raw data transformation and loading

  1. At the same level as this repository, create a folder called data. Inside data, create another folder called files.
  2. From the DBLP website, download the raw XML data into the data folder. Only the files dblp.dtd and dblp.xml.gz are needed.
  3. Extract the dblp.xml file from dblp.xml.gz.
  4. Clone the dblp-to-csv repository (which provides XMLToCSV.py) into the data folder, then run the following command from the data folder to convert the .xml file into .csv files for preprocessing:
python dblp-to-csv/XMLToCSV.py --annotate --neo4j dblp.xml dblp.dtd files/dblp.csv --relations author:authored_by journal:published_in
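Steps 1 and 3 above can also be done from Python instead of by hand. The following is a minimal sketch using only the standard library. The function names `prepare_data_dir` and `extract_gzip` are illustrative and do not come from this repository, and the location of the data folder is passed in by the caller rather than assumed.

```python
import gzip
import shutil
from pathlib import Path


def prepare_data_dir(data_dir):
    """Create the data folder and its files subfolder if they are missing."""
    data_dir = Path(data_dir)
    (data_dir / "files").mkdir(parents=True, exist_ok=True)
    return data_dir


def extract_gzip(archive_path, output_path):
    """Decompress a .gz archive to output_path, streaming to limit memory use."""
    with gzip.open(archive_path, "rb") as src, open(output_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return Path(output_path)
```

With the downloads in place, `extract_gzip(data / "dblp.xml.gz", data / "dblp.xml")`, where `data` is the path returned by `prepare_data_dir`, produces the dblp.xml file that the conversion command expects.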

Database Creation

To create and load the final database:

  1. Make sure the environment is activated (source .venv/bin/activate).
  2. To create the final database, open main.ipynb and follow the instructions inside it.
