Tired of manually checking student assignments for copying?
This Python-based Plagiarism Detector uses code analysis techniques to automatically identify suspicious similarities between student submissions. It goes beyond simple text matching: it analyzes code structure, so basic tricks like renaming variables or reformatting are much less likely to fool it.
Perfect for teachers, TAs, or coding bootcamps looking to maintain academic integrity with minimal effort.
Here are real examples of what the tool generates after analysis:
- Network graph: Shows connections between students with suspiciously similar code.
- Dashboard: Six detailed charts for in-depth analysis of similarity patterns.
- Multi-Method Detection Engine: Combines 5 different analysis techniques for maximum accuracy:
  - AST (Abstract Syntax Tree) Analysis: Understands real code structure
  - Jaccard & Cosine Similarity: Quantifies structural overlap
  - TF-IDF + Machine Learning: Smart text pattern recognition
  - Sequence Matching: Catches character-level copying
  - Structural Fingerprinting: Ignores variable names, focuses on logic
- Smart Normalization: Automatically converts all variable and function names to `var_0`, `func_1`, etc. This means renaming variables won't fool the system!
- Visual Network Graphs: See clusters of students with suspiciously similar code, perfect for identifying group copying.
- Comprehensive Dashboard: 6 detailed charts showing score distributions, method correlations, and top suspicious pairs.
- Clear Risk Levels: Instantly see which cases need attention:
  - Very High Risk (80%+): Almost certain plagiarism
  - High Risk (70-79%): Strong evidence of copying
  - Moderate Risk (60-69%): Worth reviewing
  - Low Risk (50-59%): Minor similarities
- CSV Export: All results are saved to `advanced_plagiarism_results.csv` for documentation and review.
While this tool is designed for code plagiarism detection, it leverages several core techniques from Natural Language Processing (NLP), applied not to human language but to programming language. This reflects a modern trend in AI: treating code as a form of language.
Programming languages share structural similarities with natural languages:
- Syntax ↔ Grammar
- Variables/Functions ↔ Nouns/Verbs
- Logic Flow ↔ Sentence Meaning
This project applies NLP-inspired methods to analyze code the way we analyze text, focusing on structure and patterns rather than surface-level text.
Technique | Used In NLP For | Used Here For |
---|---|---|
TF-IDF + Cosine Similarity | Document similarity, search engines | Detecting similar coding patterns across submissions |
Sequence Matching | Plagiarism detection in essays | Finding character-level copying in code |
Feature Vectorization | Text classification | Converting ASTs into comparable numerical features |
Jaccard & Cosine Metrics | Set similarity in NLP tasks | Measuring overlap in code structure fingerprints |
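
To make the TF-IDF + Cosine Similarity row more concrete, here is a minimal, dependency-free sketch. It hand-rolls TF-IDF weights with `collections.Counter` instead of calling scikit-learn, so its numbers will not match what the project's `sklearn` pipeline produces. The tokenizer regex and the function names (`tokenize`, `tfidf_vectors`, `cosine`) are illustrative assumptions, not the tool's actual code.

```python
import math
import re
from collections import Counter


def tokenize(source: str) -> list[str]:
    """Split source code into identifiers, numbers, and single symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", source)


def tfidf_vectors(docs: list[str]) -> list[dict[str, float]]:
    """Return one sparse TF-IDF vector (token -> weight) per document."""
    token_lists = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many submissions each token appears.
    df = Counter(tok for toks in token_lists for tok in set(toks))
    vectors = []
    for toks in token_lists:
        tf = Counter(toks)
        # Smoothed IDF, so tokens shared by every document keep a non-zero weight.
        vectors.append({
            tok: count * (math.log((1 + n_docs) / (1 + df[tok])) + 1)
            for tok, count in tf.items()
        })
    return vectors


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up submissions: the first two share a loop pattern, the third does not.
submissions = [
    "total = 0\nfor x in items:\n    total += x",
    "s = 0\nfor v in items:\n    s += v",
    "print('hello world')",
]
vecs = tfidf_vectors(submissions)
close = cosine(vecs[0], vecs[1])  # similar coding pattern
far = cosine(vecs[0], vecs[2])    # unrelated code
print(close > far)
```

Rare tokens get a higher weight than tokens that appear in every submission, which is why shared boilerplate contributes less to the score than shared unusual patterns.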
This is part of a growing field, "NLP for Code", which also underlies tools and models like GitHub Copilot, CodeBERT, and Amazon CodeWhisperer.
The architecture is designed to support state-of-the-art NLP models for code, such as:
- CodeBERT: Deep learning model trained on code
- Graph Neural Networks (GNNs): For AST-based similarity
- LLM-based explanations: e.g., "Why are these two codes similar?"
Because it builds on both code analysis and NLP concepts, the tool is functional today and can scale toward tomorrow's AI-powered education tools.
"You don't need to process English to do NLP.
When you model any structured language, including Python, using vectorization, similarity, and pattern recognition...
You're doing NLP in spirit, even if not in name."
Loading submissions:
- Scans the `homeworks/` folder
- Reads every `.py` file
- Stores student names and their code
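
The loading step above could look roughly like the sketch below. The function name `load_submissions` is hypothetical, and the sketch assumes that the student name is simply the file name without its `.py` extension; the README does not say how names are actually derived.

```python
from pathlib import Path


def load_submissions(folder: str = "homeworks") -> dict[str, str]:
    """Map each student name (file name without .py) to that file's source code."""
    submissions = {}
    for path in sorted(Path(folder).glob("*.py")):
        # errors="replace" keeps one badly encoded file from aborting the scan.
        submissions[path.stem] = path.read_text(encoding="utf-8", errors="replace")
    return submissions
```

Calling `load_submissions("homeworks")` would return something like `{"alice": "...", "bob": "..."}`, ready to be compared pair by pair.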
AST normalization:
- Converts each Python file into an Abstract Syntax Tree (AST)
- Think of this as turning code into a LEGO model: same structure, different colors
- Renames all variables/functions to generic labels (`var_0`, `func_1`)
- Counts key elements: loops, conditionals, function calls, etc.
- Creates a unique "fingerprint" for each submission
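
The normalization and fingerprinting steps above can be sketched with the standard `ast` module. This is an illustration under assumptions, not the project's implementation: the class name `Normalizer`, the helper `fingerprint`, and the choice to leave built-ins such as `print` and `range` untouched are all hypothetical, and the sketch ignores classes, attributes, and imports.

```python
import ast
import builtins
from collections import Counter

BUILTINS = set(dir(builtins))


class Normalizer(ast.NodeTransformer):
    """Rename user-defined names to var_N / func_N in order of first appearance."""

    def __init__(self):
        self.names = {}

    def _alias(self, name: str, prefix: str) -> str:
        # One shared counter, so labels come out as var_0, func_1, var_2, ...
        if name not in self.names:
            self.names[name] = f"{prefix}_{len(self.names)}"
        return self.names[name]

    def visit_FunctionDef(self, node):
        node.name = self._alias(node.name, "func")
        self.generic_visit(node)  # also normalizes arguments and body
        return node

    def visit_arg(self, node):
        node.arg = self._alias(node.arg, "var")
        return node

    def visit_Name(self, node):
        if node.id not in BUILTINS:
            node.id = self._alias(node.id, "var")
        return node


def fingerprint(source: str) -> tuple[str, Counter]:
    """Return the normalized AST dump and a count of each node type."""
    tree = Normalizer().visit(ast.parse(source))
    node_counts = Counter(type(n).__name__ for n in ast.walk(tree))
    return ast.dump(tree), node_counts


original = "def add(a, b):\n    total = a + b\n    return total\n"
renamed = "def plus(x, y):\n    result = x + y\n    return result\n"
print(fingerprint(original)[0] == fingerprint(renamed)[0])  # True
```

Because `ast.dump` omits line and column information by default, reformatting or re-indenting a submission leaves the normalized dump unchanged as well.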
Each pair of students is analyzed using:
Method | What It Catches |
---|---|
Jaccard | Direct copy-paste |
Cosine | Slight modifications |
Structural | Same logic, different names |
TF-IDF | ML-powered text similarity |
Sequence | Character-by-character copying |
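
Two rows of the table above, Jaccard and Sequence, are easy to illustrate with the standard library alone. The sketch below is a generic version of each metric; the function names and the sample inputs are made up for illustration, and what the tool actually feeds into each metric is not specified in this README.

```python
from difflib import SequenceMatcher


def jaccard(a: set, b: set) -> float:
    """Shared items divided by all distinct items across both sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def sequence_ratio(a: str, b: str) -> float:
    """Character-level similarity in the range 0..1, via difflib."""
    return SequenceMatcher(None, a, b).ratio()


# Hypothetical sets of AST node types found in two submissions.
features_a = {"For", "Assign", "Name", "BinOp"}
features_b = {"For", "Assign", "Name", "Call"}
print(jaccard(features_a, features_b))   # 3 shared / 5 distinct = 0.6
print(sequence_ratio("abcd", "bcde"))    # 2 * 3 matching chars / 8 total = 0.75
```

Jaccard ignores how often a feature occurs and only asks whether it is present, while the sequence ratio is sensitive to the exact order of characters, which is why the two methods catch different kinds of copying.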
Scoring:
- Weighted average creates a Combined Score
- Risk level assigned based on threshold
- Results sorted from most to least suspicious
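
A possible shape for the scoring step is sketched below. The weights in `WEIGHTS` are invented for illustration, since the README does not state the real ones, and the label for scores under 50% is likewise an assumption. The risk thresholds themselves follow the levels listed earlier (80%+, 70-79%, 60-69%, 50-59%).

```python
# Hypothetical weights: the README does not state the real ones.
WEIGHTS = {
    "jaccard": 0.20,
    "cosine": 0.20,
    "structural": 0.30,
    "tfidf": 0.15,
    "sequence": 0.15,
}


def combined_score(scores: dict[str, float]) -> float:
    """Weighted average of per-method scores, each in the range 0..1."""
    total_weight = sum(WEIGHTS[method] for method in scores)
    return sum(WEIGHTS[m] * s for m, s in scores.items()) / total_weight


def risk_level(score: float) -> str:
    """Map a combined score (0..1) to the risk levels used in the report."""
    pct = score * 100
    if pct >= 80:
        return "Very High Risk"
    if pct >= 70:
        return "High Risk"
    if pct >= 60:
        return "Moderate Risk"
    if pct >= 50:
        return "Low Risk"
    return "Below reporting threshold"  # assumption: not defined in the README


# Made-up per-method scores for two student pairs.
pairs = {
    ("alice", "bob"): {"jaccard": 0.9, "cosine": 0.95, "structural": 1.0,
                       "tfidf": 0.8, "sequence": 0.7},
    ("carol", "dave"): {"jaccard": 0.3, "cosine": 0.4, "structural": 0.2,
                        "tfidf": 0.5, "sequence": 0.1},
}
# Sort pairs from most to least suspicious.
ranked = sorted(pairs, key=lambda p: combined_score(pairs[p]), reverse=True)
for pair in ranked:
    score = combined_score(pairs[pair])
    print(pair, round(score, 2), risk_level(score))
```

Dividing by the sum of the weights actually used means a pair can still be scored if one method fails to produce a value for it.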
Network graph:
- Students = nodes
- Suspicious pairs = edges
- Node color = number of connections
- Edge thickness = similarity score
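
The mapping above boils down to an edge list plus a connection count per student. The following dependency-free sketch builds just that data; `build_graph` and the 0.5 cutoff are assumptions (the cutoff mirrors the lowest listed risk level), and the drawing itself is left to the plotting libraries.

```python
from collections import defaultdict


def build_graph(results, threshold=0.5):
    """Build graph data from (student_a, student_b, score) tuples.

    Returns (edges, degree): edges maps each suspicious pair to its score,
    degree maps each student to their number of suspicious connections.
    """
    edges = {}
    degree = defaultdict(int)
    for a, b, score in results:
        if score >= threshold:
            edges[(a, b)] = score  # edge weight would drive line thickness
            degree[a] += 1         # connection count would drive node color
            degree[b] += 1
    return edges, dict(degree)


# Made-up results for illustration.
results = [
    ("alice", "bob", 0.91),
    ("alice", "carol", 0.74),
    ("bob", "dave", 0.32),
]
edges, degree = build_graph(results)
print(edges)   # {('alice', 'bob'): 0.91, ('alice', 'carol'): 0.74}
print(degree)  # {'alice': 2, 'bob': 1, 'carol': 1}
```

With `networkx`, each entry in `edges` would typically become a call like `G.add_edge(a, b, weight=score)`, and a student connected to several others shows up as the hub of a cluster.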
Six insightful plots:
- Combined score distribution
- Method vs. combined score scatter
- Correlation heatmap
- Risk level pie chart
- Box plots for all methods
- Top 10 most suspicious pairs
Reporting:
- Detailed table printed to console
- Full report saved as CSV
- Immediate alerts for high-risk cases
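
As a rough idea of the CSV step, the sketch below writes sorted results with the standard `csv` module. The column names and the `write_report` helper are hypothetical; the actual columns in `advanced_plagiarism_results.csv` are not documented here.

```python
import csv
import io

# Hypothetical column layout for the report.
FIELDNAMES = ["student_1", "student_2", "combined_score", "risk_level"]


def write_report(rows, out) -> None:
    """Write result rows to a file-like object, most suspicious pair first."""
    writer = csv.DictWriter(out, fieldnames=FIELDNAMES)
    writer.writeheader()
    for row in sorted(rows, key=lambda r: r["combined_score"], reverse=True):
        writer.writerow(row)


# Made-up rows for illustration.
rows = [
    {"student_1": "carol", "student_2": "dave",
     "combined_score": 0.55, "risk_level": "Low Risk"},
    {"student_1": "alice", "student_2": "bob",
     "combined_score": 0.91, "risk_level": "Very High Risk"},
]
buffer = io.StringIO()
write_report(rows, buffer)
print(buffer.getvalue())
```

To write a real file instead of an in-memory buffer, open it with `open("advanced_plagiarism_results.csv", "w", newline="")` and pass the file object as `out`.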
- Python 3.x: Core language
- `ast` module: Code parsing and analysis
- `networkx` + `matplotlib`: Interactive network visualization
- `seaborn` + `pandas`: Beautiful statistical plots
- `sklearn` (TF-IDF): Machine learning text analysis
- `difflib`: Sequence similarity detection
- `collections.Counter`: Feature frequency tracking
```bash
git clone https://github.com/hrnrxb/Advanced-Code-Plagiarism-Detection-Tool.git
cd Advanced-Code-Plagiarism-Detection-Tool
pip install -r requirements.txt
```

Note: Create a folder named `homeworks/` and place all student `.py` files inside.
```bash
python main.py
```

Running the script will:
- Analyze all code pairs
- Print a full report
- Save results to `advanced_plagiarism_results.csv`
- Display interactive plots
We welcome improvements! Feel free to:
- Add new similarity detection methods
- Improve AST normalization
- Support other languages (Java, C++, etc.)
- Enhance visualization aesthetics
- Add command-line arguments
Just open an issue or submit a pull request.
- This project is licensed under the MIT License; see the LICENSE file for details.
- Run this after every major assignment
- Use results as conversation starters, not automatic penalties
- Look for clusters: they may indicate group work gone too far
- Combine with oral exams for strongest evidence
- Keep CSV reports for academic records
This tool doesn't just catch cheaters: it helps you teach integrity by providing clear, objective evidence of code similarity.
Let the machine do the grunt work. You focus on teaching.