Analyzing a citation graph using Neo4j and the SimRank algorithm
This project focuses on analyzing a citation graph using Neo4j, a graph database, and the SimRank algorithm for finding similarities between research papers.
The main objectives of this project are:
- Data Preprocessing for Neo4j: The JSON dataset containing information about research papers, including their IDs, venues, text, and references, is preprocessed and converted into a format suitable for ingestion into Neo4j.
- Generating Citation Graph in Neo4j: The preprocessed data is then used to create a directed citation graph in Neo4j, where each paper is represented as a node, and the citations between papers are represented as directed edges.
- Running SimRank Algorithm using Apache Spark: The citation graph is then exported from Neo4j and analyzed using the SimRank algorithm, which is implemented in Python using Apache Spark. The SimRank algorithm computes the similarity between nodes (research papers) based on their citation patterns.
- Insights and Findings: The project provides insights into the citation patterns and identifies the most similar research papers based on the SimRank analysis.
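To make the similarity computation named in the objectives more concrete, here is a minimal pure-Python sketch of the standard (naive) SimRank iteration on a toy citation graph. It is not the memory-efficient Spark implementation from this repository. The function name `simrank`, the decay factor of 0.8, the iteration count, and the toy edges are all assumptions chosen for illustration.

```python
from itertools import product


def simrank(edges, decay=0.8, iterations=10):
    """Naive SimRank over a directed graph given as (citing, cited) pairs."""
    nodes = sorted({node for edge in edges for node in edge})

    # in_neighbors[v] lists the papers that cite v.
    in_neighbors = {node: [] for node in nodes}
    for citing, cited in edges:
        in_neighbors[cited].append(citing)

    # Start from the identity: each paper is fully similar only to itself.
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, repeat=2)}

    for _ in range(iterations):
        new_sim = {}
        for a, b in product(nodes, repeat=2):
            if a == b:
                new_sim[(a, b)] = 1.0
            elif not in_neighbors[a] or not in_neighbors[b]:
                # A paper that nobody cites has no evidence of similarity.
                new_sim[(a, b)] = 0.0
            else:
                # Average similarity of all pairs of citing papers, damped by decay.
                total = sum(
                    sim[(i, j)]
                    for i in in_neighbors[a]
                    for j in in_neighbors[b]
                )
                new_sim[(a, b)] = decay * total / (
                    len(in_neighbors[a]) * len(in_neighbors[b])
                )
        sim = new_sim
    return sim


# Toy graph (made up): P1 cites both P2 and P3, so they share their only citer.
toy_edges = [("P1", "P2"), ("P1", "P3")]
scores = simrank(toy_edges)
print(scores[("P2", "P3")])  # 0.8
```

This version stores a score for every pair of nodes, so memory grows quadratically with the number of papers. That cost is the usual motivation for a memory-efficient or distributed variant on larger graphs.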
- Neo4j: A graph database used to store and manage the citation graph.
- Apache Spark: A distributed computing framework used to implement the memory-efficient version of the SimRank algorithm.
- Python: The primary programming language used for data preprocessing, graph creation, and SimRank algorithm implementation.
- NetworkX: A Python library used for working with the citation graph data.
To get started with the project, please follow these steps:
git clone https://github.com/akshatrajsaxena/Working-with-Neo4j-and-SimRank.git
Ensure you have Python and the necessary packages installed, such as pandas, networkx, and pyspark.
pip install numpy
pip install pandas
pip install networkx
pip install pyspark
pip install scipy
Install and set up a Neo4j instance on your local machine or a remote server.
Execute the Python scripts to preprocess the JSON data, upload it to Neo4j, and export the citation graph to a CSV file.
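As a rough illustration of the preprocessing step, the sketch below splits paper records into node rows and citation edge rows and renders them as CSV text. It is not the code in `data_preprocessing.py`. The one-JSON-object-per-line layout, the field names `id`, `venue`, and `references`, the CSV column names, and the sample records are all assumptions, so adjust them to match the actual dataset.

```python
import csv
import io
import json


def records_to_rows(json_lines):
    """Split paper records into node rows and (citing, cited) edge rows."""
    nodes, edges = [], []
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        nodes.append((record["id"], record.get("venue", "")))
        for cited in record.get("references", []):
            edges.append((record["id"], cited))
    return nodes, edges


def to_csv(header, rows):
    """Render rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample records using the assumed field names.
sample = [
    '{"id": "P1", "venue": "VenueA", "text": "...", "references": ["P2", "P3"]}',
    '{"id": "P2", "venue": "VenueB", "text": "...", "references": []}',
]
nodes, edges = records_to_rows(sample)
nodes_csv = to_csv(["paper_id", "venue"], nodes)
edges_csv = to_csv(["citing_id", "cited_id"], edges)
print(edges_csv)
```

In this sample, P3 is cited but has no record of its own. An import step would need to create nodes for such cited-only papers as well, for example by using `MERGE` instead of `CREATE` when loading the edge rows with Cypher.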
Run the Python script that implements the SimRank algorithm using Apache Spark to analyze the citation graph.
Review the output of the SimRank analysis, which includes the top-k most similar research papers for each query node.
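The top-k selection mentioned in the last step could look roughly like the sketch below. It assumes the similarity scores are available as a dictionary keyed by `(query, other)` pairs. That layout, the helper name `top_k_similar`, and the toy scores are hypothetical and do not reflect the actual output format of `simrank.py`.

```python
import heapq


def top_k_similar(scores, query, k=3):
    """Return the k highest-scoring papers for `query`, excluding the query itself."""
    candidates = [
        (other, score)
        for (source, other), score in scores.items()
        if source == query and other != query
    ]
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])


# Made-up scores for illustration only.
toy_scores = {
    ("P1", "P1"): 1.0,
    ("P1", "P2"): 0.42,
    ("P1", "P3"): 0.17,
    ("P1", "P4"): 0.65,
    ("P2", "P1"): 0.42,
}
print(top_k_similar(toy_scores, "P1", k=2))  # [('P4', 0.65), ('P2', 0.42)]
```

The self-pair is excluded because a paper always has similarity 1 with itself, which would otherwise take the first slot in every ranking.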
The project is organized into the following main components:
- data_preprocessing.py: Contains the code for preprocessing the JSON data and creating the Neo4j graph.
- simrank.py: Implements the memory-efficient version of the SimRank algorithm using Apache Spark.
- output.png: A sample output image showing the SimRank similarity results between selected papers.
This project is licensed under the MIT License.
If you have any questions or would like to get in touch, you can reach me at Akshat Raj Saxena.