🧠 PDF Document Similarity Checker

This project is an AI/ML-powered tool that measures the similarity between two PDF documents, typically for academic plagiarism detection or research content comparison. It produces both a quantitative similarity score and a qualitative summary of the overlapping content, using pretrained transformer models.

🔍 Project Overview

Given two PDF documents, the system:

1. Extracts and preprocesses the text.
2. Chunks the text and generates sentence embeddings using a transformer model.
3. Computes cosine similarity between the vector representations.
4. Identifies overlapping chunks.
5. Summarizes those chunks using a T5-based transformer summarizer.
6. Outputs a user-friendly similarity report with actionable suggestions.
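
The chunking, similarity, and overlap steps above can be sketched in plain Python. This is a minimal illustration, not the project's actual code: `toy_embed` is a word-count stand-in for the real sentence embeddings, `cosine` is hand-written instead of calling scikit-learn, and `chunk_size`, `threshold`, and the "mean of best matches" aggregation are all assumptions made for the example.

```python
import math
from collections import Counter


def chunk_text(text, chunk_size=50):
    """Split text into consecutive chunks of at most `chunk_size` words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def toy_embed(chunk, vocabulary):
    """Stand-in embedding: word counts over a shared vocabulary.

    The project uses all-mpnet-base-v2; this placeholder only keeps the
    sketch runnable without downloading a model.
    """
    counts = Counter(chunk.lower().split())
    return [float(counts[word]) for word in vocabulary]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def compare(text_a, text_b, chunk_size=50, threshold=0.8):
    """Return (score, overlaps); overlaps lists chunk pairs at or above threshold.

    Assumption: the document score is the mean, over chunks of A, of the best
    match found in B. The original project may aggregate differently.
    """
    chunks_a = chunk_text(text_a, chunk_size)
    chunks_b = chunk_text(text_b, chunk_size)
    if not chunks_a or not chunks_b:
        return 0.0, []
    vocabulary = sorted(set((text_a + " " + text_b).lower().split()))
    vecs_a = [toy_embed(c, vocabulary) for c in chunks_a]
    vecs_b = [toy_embed(c, vocabulary) for c in chunks_b]

    best_scores, overlaps = [], []
    for i, vec_a in enumerate(vecs_a):
        sims = [cosine(vec_a, vec_b) for vec_b in vecs_b]
        best_j = max(range(len(sims)), key=sims.__getitem__)
        best_scores.append(sims[best_j])
        if sims[best_j] >= threshold:
            overlaps.append((chunks_a[i], chunks_b[best_j], sims[best_j]))
    return sum(best_scores) / len(best_scores), overlaps
```

Swapping `toy_embed` for real model embeddings would leave the rest of the flow unchanged, which is the main reason the embedding step is kept separate here.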

🧰 Tech Stack & Libraries Used

| Technology | Purpose |
| --- | --- |
| Python | Programming language |
| `fitz` (PyMuPDF) | PDF text extraction |
| `nltk` | Tokenization, stopword removal, lemmatization |
| `sentence-transformers` | Sentence embeddings (`all-mpnet-base-v2`) |
| `transformers` (Hugging Face) | Summarization using `t5-small` |
| `scikit-learn` | Cosine similarity computation |
| `NumPy` | Numerical operations |
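
As a rough illustration of how the extraction and preprocessing libraries might fit together, here is a minimal sketch. The function names (`extract_text`, `normalize_tokens`, `preprocess`) are hypothetical and not taken from the repository; the library imports are done lazily so the pure token-cleaning logic can be tried without PyMuPDF or NLTK installed.

```python
import re


def extract_text(pdf_path):
    """Concatenate the plain text of every page in a PDF using PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so the rest of the sketch runs without it

    with fitz.open(pdf_path) as doc:
        return "\n".join(page.get_text() for page in doc)


def normalize_tokens(tokens, stop_words, lemmatize):
    """Lowercase tokens, drop non-alphabetic tokens and stopwords, then lemmatize."""
    cleaned = []
    for token in tokens:
        token = token.lower()
        if token.isalpha() and token not in stop_words:
            cleaned.append(lemmatize(token))
    return cleaned


def preprocess(text):
    """Tokenize with NLTK, then remove stopwords and lemmatize.

    Assumes the NLTK data for the tokenizer, the stopword list, and WordNet
    has already been downloaded (for example with nltk.download).
    """
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize

    text = re.sub(r"\s+", " ", text)  # collapse line breaks left over from PDF layout
    lemmatizer = WordNetLemmatizer()
    return normalize_tokens(
        word_tokenize(text),
        set(stopwords.words("english")),
        lemmatizer.lemmatize,
    )
```

Passing the stopword set and lemmatizer into `normalize_tokens` is a design choice for this sketch: it keeps the cleaning rules easy to check with small hand-made inputs.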

🚀 How to Use

🧪 Option 1: Run the Notebook in Google Colab (Recommended)

Clone the repository:

git clone https://github.com/yourusername/pdf_similarity_checker.git
cd pdf_similarity_checker

Launch the Jupyter notebook:

jupyter notebook notebooks/Document_Similarity_Detection.ipynb

Upload your PDFs and run the cells in sequence to get results.

🧪 Option 2: Run via Python Scripts

If you prefer using Python scripts directly, follow the structured flow below.

🚀 Steps to Run

Clone the repository:

git clone https://github.com/yourusername/pdf_similarity_checker.git
cd pdf_similarity_checker

Install dependencies:
```
pip install -r requirement.txt
```
Run the scripts:

Place the two PDF files you want to compare in the data/ directory.
Run src/pdf_preprocessing.py.
Run src/similarity_checker.py.
Finally, run outputs/result_summary.py to view the results.

📝 Sample Output

🔍 Similarity Score: 0.8124

✅ Your submitted research is highly similar to an existing paper.
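
A report like the one above could be produced by a small formatting function. The sketch below is illustrative only: the function name `format_report`, the `high` and `moderate` thresholds, and the two lower-similarity messages are assumptions, since the README shows only the high-similarity case.

```python
def format_report(score, high=0.8, moderate=0.5):
    """Render a similarity score and a verdict line as plain text.

    The thresholds and the non-"high" messages are hypothetical placeholders.
    """
    lines = [f"🔍 Similarity Score: {score:.4f}", ""]
    if score >= high:
        lines.append("✅ Your submitted research is highly similar to an existing paper.")
    elif score >= moderate:
        lines.append("Your submitted research partially overlaps with an existing paper.")
    else:
        lines.append("No substantial overlap with the compared paper was found.")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_report(0.8124))
```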

🤖 Models Used

Sentence Embedding: all-mpnet-base-v2 from sentence-transformers
Summarization: t5-small from Hugging Face transformers
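
For orientation, the two models might be loaded and called roughly as follows. This is a sketch under assumptions, not the repository's code: the helper names are hypothetical, the `max_length` and `min_length` values are arbitrary, and whether the `"summarization"` pipeline task is available depends on the installed `transformers` version. Model weights are downloaded on first use.

```python
def load_models():
    """Load the embedding model and the summarizer named above."""
    from sentence_transformers import SentenceTransformer
    from transformers import pipeline

    embedder = SentenceTransformer("all-mpnet-base-v2")
    summarizer = pipeline("summarization", model="t5-small")
    return embedder, summarizer


def similarity_matrix(embedder, chunks_a, chunks_b):
    """Pairwise cosine similarity between two lists of text chunks."""
    from sklearn.metrics.pairwise import cosine_similarity

    # encode() returns one embedding row per chunk
    return cosine_similarity(embedder.encode(chunks_a), embedder.encode(chunks_b))


def summarize_overlaps(summarizer, chunks, max_length=60, min_length=10):
    """Summarize the overlapping chunks as a single passage."""
    if not chunks:
        return ""
    passage = " ".join(chunks)
    result = summarizer(
        passage,
        max_length=max_length,
        min_length=min_length,
        do_sample=False,
        truncation=True,  # t5-small accepts only a limited input length
    )
    return result[0]["summary_text"].strip()
```

Typical usage would be `embedder, summarizer = load_models()`, followed by `similarity_matrix(embedder, chunks_a, chunks_b)` and `summarize_overlaps(summarizer, overlapping_chunks)`.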

📌 Future Improvements

Web-based UI for easy uploads and visualization.
Batch processing of multiple document comparisons.
Integration with plagiarism APIs or LMS platforms.

👤 Author

Faizan Khan
AI/ML Enthusiast | Portfolio Project

📄 License

This project is licensed under the MIT License. Feel free to use and modify it.

