This project compares two PDF documents for similarity, typically for academic plagiarism detection or research content comparison. It produces both a quantitative similarity score and a qualitative summary of the overlapping content, using transformer-based language models.
Given two PDF documents, the system:
- Extracts and preprocesses the text.
- Chunks the text and generates sentence embeddings using a transformer model.
- Computes similarity using cosine similarity between vector representations.
- Identifies overlapping chunks.
- Summarizes those chunks using a T5-based transformer summarizer.
- Outputs a user-friendly similarity report with actionable suggestions.
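The chunking and overlap-detection steps above can be sketched without any model dependencies. This is a minimal illustration, not the project's implementation: `chunk_text`, `toy_embed`, `overlapping_chunks`, the 50-word chunk size, and the 0.8 threshold are all assumptions made for the example, and `toy_embed` is a word-count stand-in for the transformer embeddings so the sketch runs with the standard library only.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List, Tuple

Vector = Dict[str, float]


def chunk_text(text: str, max_words: int = 50) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def toy_embed(chunk: str) -> Vector:
    """Stand-in embedding: lowercase word counts (NOT the transformer model)."""
    return dict(Counter(re.findall(r"[a-z0-9]+", chunk.lower())))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def overlapping_chunks(
    chunks_a: List[str],
    chunks_b: List[str],
    embed: Callable[[str], Vector] = toy_embed,
    threshold: float = 0.8,  # assumed cut-off, not taken from the project
) -> List[Tuple[int, int, float]]:
    """Return (index_a, index_b, score) for chunk pairs at or above threshold."""
    vectors_a = [embed(chunk) for chunk in chunks_a]
    vectors_b = [embed(chunk) for chunk in chunks_b]
    pairs = []
    for i, vec_a in enumerate(vectors_a):
        for j, vec_b in enumerate(vectors_b):
            score = cosine(vec_a, vec_b)
            if score >= threshold:
                pairs.append((i, j, score))
    return pairs


doc_a = "the cat sat on the mat while the dog slept near the door"
doc_b = "the cat sat on the mat and a bird sang in the old tree"
pairs = overlapping_chunks(chunk_text(doc_a, 6), chunk_text(doc_b, 6))
```

In this toy run only the first chunk of each document is flagged, since both begin with the same six words. With real sentence embeddings the vectors would be dense arrays, but the pairwise comparison follows the same pattern.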
Technology | Purpose |
---|---|
Python | Programming language |
fitz (PyMuPDF) | PDF text extraction |
nltk | Tokenization, stopword removal, lemmatization |
sentence-transformers | Sentence embeddings (all-mpnet-base-v2) |
transformers (Hugging Face) | Summarization using t5-small |
scikit-learn | Cosine similarity computation |
NumPy | Numerical operations |
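As a rough illustration of how the extraction and preprocessing libraries in the table fit together, here is a sketch under stated assumptions. The function names (`extract_text`, `clean_tokens`, `preprocess`) are hypothetical and not taken from the repository. The PyMuPDF and NLTK imports are deferred into the functions that need them, and the NLTK path assumes the tokenizer, stopword, and WordNet resources have already been downloaded with `nltk.download`.

```python
from typing import Callable, Iterable, List, Set


def extract_text(pdf_path: str) -> str:
    """Concatenate the plain text of every page in a PDF using PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so the helpers below work without it

    with fitz.open(pdf_path) as document:
        return "\n".join(page.get_text() for page in document)


def clean_tokens(
    tokens: Iterable[str],
    stop_words: Set[str],
    lemmatize: Callable[[str], str],
) -> List[str]:
    """Lowercase tokens, drop non-alphabetic tokens and stopwords, then lemmatize."""
    cleaned = []
    for token in tokens:
        lowered = token.lower()
        if lowered.isalpha() and lowered not in stop_words:
            cleaned.append(lemmatize(lowered))
    return cleaned


def preprocess(text: str) -> List[str]:
    """Tokenize with NLTK, remove English stopwords, and lemmatize with WordNet."""
    # Assumes the required NLTK data (tokenizer, stopwords, wordnet) is installed.
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize

    lemmatizer = WordNetLemmatizer()
    return clean_tokens(
        word_tokenize(text),
        set(stopwords.words("english")),
        lemmatizer.lemmatize,
    )
```

Keeping `clean_tokens` free of library calls makes the filtering logic easy to check on its own, for example with a small stopword set and a trivial lemmatizer.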
- Clone the repository:
git clone https://github.com/yourusername/pdf_similarity_checker.git
- Launch the Jupyter Notebook:
jupyter notebook notebook/Document_Similarity_Detection.ipynb
- Upload your PDFs and run the cells in sequence to get results.
If you prefer using Python scripts directly, follow the structured flow below.
- Clone the repository:
git clone https://github.com/yourusername/pdf_similarity_checker.git
cd pdf_similarity_checker
- Install dependencies:
pip install -r requirements.txt
- Run the scripts:
- Place the two PDF files you want to compare in the data/ directory.
- Run src/pdf_preprocessing.py.
- Run src/similarity_checker.py.
- Finally, run output/result_summary.py to view the results.
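If running the three scripts by hand becomes tedious, a small runner along these lines could chain them. This is only a sketch: `build_commands` and `run_pipeline` are hypothetical helpers, and it assumes each script takes no command-line arguments and is run from the repository root.

```python
import subprocess
import sys
from pathlib import Path
from typing import List

# Script paths as listed in the steps above, relative to the repository root.
PIPELINE_SCRIPTS = [
    "src/pdf_preprocessing.py",
    "src/similarity_checker.py",
    "output/result_summary.py",
]


def build_commands(python: str = sys.executable) -> List[List[str]]:
    """Build one interpreter command per pipeline script, in execution order."""
    return [[python, script] for script in PIPELINE_SCRIPTS]


def run_pipeline(repo_root: str = ".") -> None:
    """Run each script in order; check=True stops at the first failing script."""
    for command in build_commands():
        subprocess.run(command, cwd=Path(repo_root), check=True)
```

Calling `run_pipeline()` from the repository root would then execute preprocessing, similarity checking, and result summarization in sequence.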
🔍 Similarity Score: 0.8124
✅ Your submitted research is highly similar to an existing paper.
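A report like the one above could be produced by mapping the score to a verdict. The sketch below is illustrative only: `describe_similarity` is a hypothetical name, the 0.8 and 0.5 thresholds are assumptions, and only the "highly similar" message comes from the sample output; the other two messages are placeholders.

```python
def describe_similarity(score: float, high: float = 0.8, moderate: float = 0.5) -> str:
    """Format a cosine similarity score as a two-line report with a verdict."""
    if not -1.0 <= score <= 1.0:
        raise ValueError("cosine similarity must lie in [-1, 1]")
    if score >= high:
        # Message taken from the sample output above.
        verdict = "✅ Your submitted research is highly similar to an existing paper."
    elif score >= moderate:
        # Placeholder wording (assumption).
        verdict = "⚠️ Your submitted research partially overlaps with an existing paper."
    else:
        # Placeholder wording (assumption).
        verdict = "✅ No significant overlap with the compared paper was found."
    return f"🔍 Similarity Score: {score:.4f}\n{verdict}"


report = describe_similarity(0.8124)
```

Keeping the thresholds as parameters makes it easy to tune what counts as "highly similar" for a given corpus.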
- Sentence Embedding: all-mpnet-base-v2 from sentence-transformers
- Summarization: t5-small from Hugging Face transformers
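For orientation, the two models could be loaded and called roughly as follows. This is a sketch rather than the repository's code: the helper names (`build_t5_prompt`, `embed_chunks`, `summarize_chunks`) and the generation settings are assumptions. The library imports are deferred into the functions, and the model weights are downloaded on first use.

```python
from typing import List

EMBEDDING_MODEL = "all-mpnet-base-v2"
SUMMARY_MODEL = "t5-small"


def build_t5_prompt(chunks: List[str]) -> str:
    """Join non-empty chunks and add the 'summarize: ' task prefix T5 expects."""
    return "summarize: " + " ".join(chunk.strip() for chunk in chunks if chunk.strip())


def embed_chunks(chunks: List[str]):
    """Encode chunks into dense vectors (one row per chunk)."""
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(EMBEDDING_MODEL)
    return model.encode(chunks)


def summarize_chunks(chunks: List[str], max_new_tokens: int = 60) -> str:
    """Summarize the given chunks with t5-small; max_new_tokens is an assumed limit."""
    from transformers import AutoTokenizer, T5ForConditionalGeneration

    tokenizer = AutoTokenizer.from_pretrained(SUMMARY_MODEL)
    model = T5ForConditionalGeneration.from_pretrained(SUMMARY_MODEL)
    # Truncate to 512 tokens, the input length t5-small was trained with.
    inputs = tokenizer(
        build_t5_prompt(chunks),
        return_tensors="pt",
        truncation=True,
        max_length=512,
    )
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

The arrays returned by `embed_chunks` for two documents can be passed to scikit-learn's `cosine_similarity` to obtain a chunk-by-chunk similarity matrix.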
- Web-based UI for easy uploads and visualization.
- Batch processing of multiple document comparisons.
- Integration with plagiarism APIs or LMS platforms.
Faizan Khan
AI/ML Enthusiast | Portfolio Project
This project is licensed under the MIT License. Feel free to use and modify it.