- Docker.
- Python 3.8 or higher.
- OMDb API key (register at OMDb API).
- Clone the Repository
  git clone https://github.com/ezebellver/gdCourseProject
  cd gdCourseProject
- Create a venv and activate it
  - Windows:
    python -m venv .venv
    call .venv\Scripts\activate
  - Linux:
    python -m venv .venv
    source .venv/bin/activate
- Install Dependencies
  pip install -r requirements.txt
- Set Up Neo4j
  - Create the folders neo4j/data and neo4j/import in the root of the repository.
  - Place the courseProject2024.db file in the neo4j/import directory.
- Start Neo4j using Docker
  docker compose up -d
- Create a .env file at src/.env
  - Windows
    set NEO4J_URI=bolt://localhost:7687
    set NEO4J_USER=neo4j
    set NEO4J_PASSWORD=password
    set NEO4J_DATABASE=courseproject2024.db
    set OMDB_API_KEY=<OMDB_API_KEY>
  - Linux
    NEO4J_URI=bolt://localhost:7687
    NEO4J_USER=neo4j
    NEO4J_PASSWORD=password
    NEO4J_DATABASE=courseproject2024.db
    OMDB_API_KEY=<OMDB_API_KEY>
- Source the .env file
  - Windows
    call src/.env
  - Linux
    source src/.env
- Export the PYTHONPATH variable
  - Windows
    set PYTHONPATH=%cd%
  - Linux
    export PYTHONPATH="$PWD"
- Run the Graph Preparation script
  python src/part1/graph_preparation.py
- Run the Rate Movies script
  python src/part1/rate_movies.py
- Run the Recommendation System
  python src/part2/recommendations.py
- Perform Community Detection
  python src/part2/community_detection.py
- Export Neo4j data
  python src/part3/export_neo4j.py
- Export RDF Graph and Validate
  python src/part3/knowledge_graph.py
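Before running the scripts, it can help to confirm that the variables from src/.env are actually visible to Python. The following is a minimal sketch, assuming the scripts read their configuration from the process environment; the helper name `missing_env_vars` is hypothetical and not part of the repository.

```python
import os

# Variable names taken from the .env step above.
REQUIRED_VARS = [
    "NEO4J_URI",
    "NEO4J_USER",
    "NEO4J_PASSWORD",
    "NEO4J_DATABASE",
    "OMDB_API_KEY",
]


def missing_env_vars(environ=None):
    """Return the required variables that are unset or empty.

    `environ` defaults to os.environ; passing a dict makes the check testable.
    """
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

If variables are reported missing on Linux right after `source src/.env`, they may have been set in the shell without being exported to child processes; running `set -a; source src/.env; set +a` is one way to export them.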
- Loaded the courseProject.db graph into Neo4j.
- Enriched missing imdbRating properties for over 5,000 nodes using the OMDb API.
- Added a user node (Sancho Panza) with 200 rated movies.
- Computed similarity scores based on:
  - Numeric properties: imdbRating, year, and duration.
  - Non-numeric properties: Genre overlap.
- Created similarity edges for the 10% most similar movies.
- Detected communities using Louvain and k-means clustering.
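The similarity step can be sketched as follows. The exact formula used in the project is not stated here, so this is an illustration under assumptions: numeric properties are min-max normalized, genre overlap is measured with the Jaccard index, both parts are weighted equally, and only the top 10% of pairs are kept as edges. The function names and the dictionary layout of a movie are hypothetical.

```python
from itertools import combinations

NUMERIC_KEYS = ("imdbRating", "year", "duration")


def jaccard(a, b):
    """Genre overlap: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0


def normalize(movies, key):
    """Min-max normalize one numeric property to [0, 1], keyed by title.

    Assumes titles are unique within `movies`.
    """
    values = [m[key] for m in movies]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero when all values are equal
    return {m["title"]: (m[key] - lo) / span for m in movies}


def similarity_edges(movies, top_fraction=0.10):
    """Score every movie pair and keep the most similar fraction of pairs."""
    if len(movies) < 2:
        return []
    norm = {key: normalize(movies, key) for key in NUMERIC_KEYS}
    scored = []
    for a, b in combinations(movies, 2):
        # Numeric similarity: 1 minus the mean absolute normalized difference.
        diffs = [abs(norm[k][a["title"]] - norm[k][b["title"]]) for k in NUMERIC_KEYS]
        numeric_sim = 1.0 - sum(diffs) / len(diffs)
        genre_sim = jaccard(a["genres"], b["genres"])
        # Equal weighting of both parts is an assumption of this sketch.
        scored.append((0.5 * numeric_sim + 0.5 * genre_sim, a["title"], b["title"]))
    scored.sort(reverse=True)
    keep = max(1, round(len(scored) * top_fraction))
    return scored[:keep]
```

Each returned tuple is `(score, title_a, title_b)`; in the project these pairs would correspond to the similarity edges written back to Neo4j.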
The Louvain algorithm is a community detection method that identifies clusters within a graph by optimizing modularity. Modularity measures the density of edges within clusters compared to edges between clusters. Louvain works iteratively: it moves individual nodes between communities whenever doing so increases modularity, then merges each community into a single node and repeats the process until modularity stops improving.
In our application, we used Louvain clustering to group users into communities based on their interactions (ratings) with movies. This provides insight into user preferences and helps to identify clusters of users with similar movie tastes.
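To make the modularity objective concrete, here is a minimal sketch that computes the modularity of a given partition for an undirected, unweighted graph. It is not the Louvain algorithm itself and not the project's code, only the quantity Louvain tries to maximize; the function name and the edge-list input format are assumptions.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community: dict mapping every node to its community label.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    internal = defaultdict(int)  # edges with both endpoints in the community
    degree = defaultdict(int)    # sum of node degrees per community
    for u, v in edges:
        degree[community[u]] += 1
        degree[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    # Q = sum over communities of (internal edge share - expected share).
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)
```

A partition that places densely connected nodes together scores higher than one that lumps everything into a single community, which always scores 0.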
K-Means is a machine learning algorithm that partitions data points into a predefined number of clusters (k) by minimizing the variance within each cluster. It iteratively assigns points to the nearest cluster center and recalculates the centers until convergence.
In our application, we performed K-Means clustering on users using the ratingCount property (number of rated movies). This allowed us to segment users into clusters based on their activity levels, providing another dimension for understanding user behavior and tailoring recommendations.
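The assign-and-recompute loop can be sketched for a single numeric feature such as ratingCount. This is an illustrative implementation with caller-supplied initial centers, not the project's code; the function name and the sample values in the usage note are hypothetical.

```python
def kmeans_1d(values, centers, max_iter=100):
    """Cluster 1-D values with Lloyd's algorithm, starting from given centers.

    Returns (final_centers, labels), where labels[i] is the index of the
    center closest to values[i].
    """
    centers = list(centers)

    def nearest(v):
        # Index of the closest center; ties go to the lowest index.
        return min(range(len(centers)), key=lambda i: abs(v - centers[i]))

    for _ in range(max_iter):
        # Assignment step: group each value with its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            clusters[nearest(v)].append(v)
        # Update step: move each center to the mean of its cluster.
        # An empty cluster keeps its previous center.
        new_centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers

    return centers, [nearest(v) for v in values]
```

For example, calling `kmeans_1d([1, 2, 3, 50, 52, 54, 200, 210], [0, 50, 200])` separates the made-up rating counts into low, medium, and high activity groups.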
Both approaches complement each other, with Louvain leveraging graph-based community detection and K-Means focusing on numeric feature clustering.
- Exported 100 movies, 30 users, 200 actors, and relevant relationships (IN_GENRE and ACTED_IN) to RDF.
- Created CSV files for RDF population in the data/ folder:
  - movies.csv
  - users.csv
  - actors.csv
  - in_genre.csv
  - acted_in.csv
- Validated the RDF graph with SPARQL queries placed in the sparql_queries/ folder:
  - top_movies.sparql: List the ten movies with the highest IMDb rating.
  - movies_by_actor.sparql: Find movies acted in by "Buster Keaton."
  - genres_of_movie.sparql: List genres of the movie "Seven Samurai."
  - most_common_genres_in_top_movies.sparql: Count the genres appearing in the top 50 rated movies, grouped and ordered by frequency in descending order.
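As an illustration of how a CSV file such as movies.csv can populate an RDF graph, the sketch below turns CSV rows into N-Triples lines using only the standard library. The column names (`movieId`, `title`, `imdbRating`), the `http://example.org/` namespace, and the function names are assumptions made for this example; the project's actual schema and vocabulary may differ.

```python
import csv
import io

BASE = "http://example.org/"  # hypothetical namespace
XSD_DECIMAL = "http://www.w3.org/2001/XMLSchema#decimal"


def escape_literal(text):
    """Escape backslashes, quotes, and line breaks for an N-Triples literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def movies_csv_to_ntriples(csv_text):
    """Convert CSV text with movieId,title,imdbRating columns to N-Triples."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}movie/{row['movieId']}>"
        title = escape_literal(row["title"])
        lines.append(f'{subject} <{BASE}title> "{title}" .')
        # Rows without a rating produce no rating triple.
        if row.get("imdbRating"):
            rating = row["imdbRating"]
            lines.append(f'{subject} <{BASE}imdbRating> "{rating}"^^<{XSD_DECIMAL}> .')
    return "\n".join(lines)
```

The resulting text can be loaded by any RDF library that reads N-Triples, after which SPARQL queries like the ones listed above can be run against the graph.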
- Ezequiel Bellver
- Santiago Lo Coco