
This Python-based Plagiarism Detector uses advanced code analysis techniques to automatically identify suspicious similarities between student submissions.

hrnrxb/Advanced-Code-Plagiarism-Detection-Tool

🔍 Advanced Code Plagiarism Detection Tool

Tired of manually checking student assignments for copying? 😥

This Python-based Plagiarism Detector uses advanced code analysis techniques to automatically identify suspicious similarities between student submissions. It goes beyond simple text matching: it analyzes code structure, so basic tricks such as renaming variables or reformatting are unlikely to fool it.

Perfect for teachers, TAs, or coding bootcamps looking to maintain academic integrity with minimal effort.


🖼️ Sample Outputs

Here are real examples of what the tool generates after analysis:

Comprehensive Statistics Dashboard

[Image: dashboard]

Shows connections between students with suspiciously similar code.

Plagiarism Network Graph

[Image: network plot]

Six detailed charts for in-depth analysis of similarity patterns.

[Image: 3 up plot]

[Image: 3 down plot]


✨ Key Features: Why This Tool Stands Out

  • Multi-Method Detection Engine: Combines 5 different analysis techniques for maximum accuracy:

    • 🌲 AST (Abstract Syntax Tree) Analysis – Understands real code structure
    • πŸ”’ Jaccard & Cosine Similarity – Quantifies structural overlap
    • 🧠 TF-IDF + Machine Learning – Smart text pattern recognition
    • πŸ”€ Sequence Matching – Catches character-level copying
    • πŸ—οΈ Structural Fingerprinting – Ignores variable names, focuses on logic
  • Smart Normalization: Automatically converts all variable and function names to var_0, func_1, etc.
    This means renaming variables won’t fool the system!

  • Visual Network Graphs: See clusters of students with suspiciously similar code β€” perfect for identifying group copying.

  • Comprehensive Dashboard: 6 detailed charts showing score distributions, method correlations, and top suspicious pairs.

  • Clear Risk Levels: Instantly see which cases need attention:

    • πŸ”΄ Very High Risk (80%+) – Almost certain plagiarism
    • 🟠 High Risk (70–79%) – Strong evidence of copying
    • 🟑 Moderate Risk (60–69%) – Worth reviewing
    • 🟒 Low Risk (50–59%) – Minor similarities
  • CSV Export: All results are saved to advanced_plagiarism_results.csv for documentation and review.
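As a rough illustration of the risk bands listed above, the sketch below maps a combined score to a label. It assumes scores are on a 0.0 to 1.0 scale and treats the bands as continuous (so 79.5% counts as High Risk). The function name `risk_level` and the "Below threshold" label are hypothetical and not taken from the tool's source.

```python
def risk_level(score: float) -> str:
    """Map a combined similarity score (0.0-1.0) to the README's risk bands."""
    if score >= 0.80:
        return "Very High Risk"
    if score >= 0.70:
        return "High Risk"
    if score >= 0.60:
        return "Moderate Risk"
    if score >= 0.50:
        return "Low Risk"
    # Assumption: pairs scoring under 50% are simply not flagged.
    return "Below threshold"


for s in (0.85, 0.72, 0.65, 0.55, 0.30):
    print(f"{s:.0%} -> {risk_level(s)}")
```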


🧠 Connection to NLP: Beyond Simple Text Matching

While this tool is designed for code plagiarism detection, it leverages several core techniques from Natural Language Processing (NLP), applied to programming language rather than human language. This reflects a modern trend in AI: treating code as a form of language.

🔗 Why This Matters: Code as Language

Programming languages share structural similarities with natural languages:

  • Syntax ↔ Grammar
  • Variables/Functions ↔ Nouns/Verbs
  • Logic Flow ↔ Sentence Meaning

This project applies NLP-inspired methods to analyze code the way we analyze text, focusing on structure and patterns rather than surface-level text alone.


🛠️ NLP Techniques Used in This Project

| Technique | Used in NLP For | Used Here For |
|---|---|---|
| TF-IDF + Cosine Similarity | Document similarity, search engines | Detecting similar coding patterns across submissions |
| Sequence Matching | Plagiarism detection in essays | Finding character-level copying in code |
| Feature Vectorization | Text classification | Converting ASTs into comparable numerical features |
| Jaccard & Cosine Metrics | Set similarity in NLP tasks | Measuring overlap in code structure fingerprints |
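To make the first row of the table concrete, here is a minimal sketch of TF-IDF plus cosine similarity over raw source text using scikit-learn. The sample submissions and the `token_pattern` choice are assumptions made for illustration; the tool's actual vectorizer settings are not documented here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical submissions: bob's code is alice's with every name changed.
submissions = {
    "alice": "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n",
    "bob": "def add_all(values):\n    acc = 0\n    for v in values:\n        acc += v\n    return acc\n",
    "carol": "import math\nprint(math.sqrt(16))\n",
}

names = list(submissions)
# This token_pattern keeps one-character identifiers such as `x`,
# which the default pattern would drop.
vectorizer = TfidfVectorizer(token_pattern=r"[A-Za-z_]\w*")
matrix = vectorizer.fit_transform([submissions[n] for n in names])
similarity = cosine_similarity(matrix)  # similarity[i][j] compares names[i] and names[j]

print(f"alice vs bob:   {similarity[0][1]:.2f}")
print(f"alice vs carol: {similarity[0][2]:.2f}")
```

On raw text, alice and bob share only keywords, so their score stays well below 1.0 even though the logic is identical. That gap is what name normalization is meant to close before text-based methods are applied.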

💡 This is part of a growing field, "NLP for Code", which underpins tools and models such as GitHub Copilot, CodeBERT, and Amazon CodeWhisperer.


🚀 Future-Proof Design: Ready for Advanced NLP

The architecture is designed to support state-of-the-art NLP models for code, such as:

  • CodeBERT – Deep learning model trained on code
  • Graph Neural Networks (GNNs) – For AST-based similarity
  • LLM-based explanations – e.g., "Why are these two codes similar?"

The aim is a tool that is functional today and can grow into tomorrow’s AI-powered education tools.


📚 Key Insight

"You don’t need to process English to do NLP.
When you model any structured language, including Python, using vectorization, similarity, and pattern recognition...
You're doing NLP in spirit, even if not in name."


🚀 How It Works: Simple 1–100% Explanation

📥 Step 1: Load Student Code (10%)

  • Scans the homeworks/ folder
  • Reads every .py file
  • Stores student names and their code
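A loading step along these lines could look like the sketch below. It assumes the student's name is the file name without the `.py` extension; the function name `load_submissions` is hypothetical and the tool's real loader may differ.

```python
from pathlib import Path


def load_submissions(folder: str = "homeworks") -> dict[str, str]:
    """Return {student_name: source_code} for every .py file in `folder`."""
    submissions = {}
    for path in sorted(Path(folder).glob("*.py")):
        # Assumption: the file stem (e.g. "alice" for alice.py) names the student.
        submissions[path.stem] = path.read_text(encoding="utf-8", errors="replace")
    return submissions
```

Reading with `errors="replace"` keeps one badly encoded submission from aborting the whole run.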

🧱 Step 2: Build Code Blueprints (20%)

  • Converts each Python file into an Abstract Syntax Tree (AST)
  • Think of this as turning code into a LEGO model: same structure, different colors
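The "same structure, different colors" idea can be seen directly with the standard `ast` module. In this small sketch, the helper name `shape` is made up for illustration: two functions that differ only in their names produce the same sequence of AST node types.

```python
import ast


def shape(source: str) -> list[str]:
    """List the AST node type names of `source` in ast.walk order."""
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]


original = "def add(a, b):\n    return a + b\n"
renamed = "def plus(x, y):\n    return x + y\n"

# Names differ, but the tree of node types is identical.
print(shape(original) == shape(renamed))
```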

🔍 Step 3: Normalize & Fingerprint (30%)

  • Renames all variables/functions to generic labels (var_0, func_1)
  • Counts key elements: loops, conditionals, function calls, etc.
  • Creates a unique "fingerprint" for each submission
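One way to sketch this step is with `ast.NodeTransformer` and `collections.Counter`. The class and function names below are hypothetical, and the renaming is deliberately simplified: for example, built-ins such as `print` are renamed like any other identifier. It relies on `ast.unparse`, so it needs Python 3.9 or later.

```python
import ast
from collections import Counter


class Normalizer(ast.NodeTransformer):
    """Rename identifiers to var_N / func_N in order of first appearance."""

    def __init__(self):
        self.names = {}

    def _alias(self, name, prefix):
        if name not in self.names:
            used = sum(1 for v in self.names.values() if v.startswith(prefix))
            self.names[name] = f"{prefix}_{used}"
        return self.names[name]

    def visit_FunctionDef(self, node):
        node.name = self._alias(node.name, "func")
        self.generic_visit(node)  # also rename arguments and body
        return node

    def visit_arg(self, node):
        node.arg = self._alias(node.arg, "var")
        return node

    def visit_Name(self, node):
        node.id = self._alias(node.id, "var")
        return node


def fingerprint(source: str):
    """Return (node-type counts, normalized source) for one submission."""
    tree = Normalizer().visit(ast.parse(source))
    counts = Counter(type(n).__name__ for n in ast.walk(tree))
    return counts, ast.unparse(tree)


a = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
b = "def add_all(values):\n    acc = 0\n    for v in values:\n        acc += v\n    return acc\n"
print(fingerprint(a)[1] == fingerprint(b)[1])
```

Both samples normalize to the same text, so a later comparison sees them as identical regardless of the names chosen.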

🕵️ Step 4: Run 5 Detection Methods (50%)

Each pair of students is analyzed using:

| Method | What It Catches |
|---|---|
| Jaccard | Direct copy-paste |
| Cosine | Slight modifications |
| Structural | Same logic, different names |
| TF-IDF | ML-powered text similarity |
| Sequence | Character-by-character copying |
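Three of these measures are small enough to sketch with the standard library alone. The versions below assume Jaccard is computed over sets of features and cosine over feature counts (such as AST node-type counts); exactly which features the tool feeds into each method is not stated here, and the function names are illustrative.

```python
from collections import Counter
from difflib import SequenceMatcher
from math import sqrt


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def sequence_ratio(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] from difflib."""
    return SequenceMatcher(None, a, b).ratio()


fp1 = Counter({"For": 1, "If": 2})
fp2 = Counter({"For": 1, "If": 2, "While": 1})
print(jaccard(set(fp1), set(fp2)), cosine(fp1, fp2), sequence_ratio("x = 1", "x = 2"))
```

Jaccard ignores how often a feature occurs while cosine takes counts into account, which is one reason the two can disagree on lightly edited code.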

📊 Step 5: Combine & Score (70%)

  • Weighted average creates a Combined Score
  • Risk level assigned based on threshold
  • Results sorted from most to least suspicious
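The combination step can be sketched as a weighted average followed by a sort. The weights below are invented for illustration, since the actual values are not given here; `WEIGHTS`, `combined_score`, and the sample pairs are all hypothetical.

```python
# Hypothetical weights: the real values used by the tool are not documented here.
WEIGHTS = {
    "jaccard": 0.20,
    "cosine": 0.20,
    "structural": 0.30,
    "tfidf": 0.15,
    "sequence": 0.15,
}


def combined_score(scores: dict) -> float:
    """Weighted average of the five per-method scores (each in 0.0-1.0)."""
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[m] * scores[m] for m in WEIGHTS) / total_weight


pairs = [
    {"pair": ("carol", "dave"),
     "scores": {"jaccard": 0.3, "cosine": 0.3, "structural": 0.3, "tfidf": 0.3, "sequence": 0.3}},
    {"pair": ("alice", "bob"),
     "scores": {"jaccard": 0.9, "cosine": 0.95, "structural": 1.0, "tfidf": 0.8, "sequence": 0.7}},
]

# Most suspicious pairs first.
ranked = sorted(pairs, key=lambda p: combined_score(p["scores"]), reverse=True)
for p in ranked:
    print(p["pair"], round(combined_score(p["scores"]), 3))
```

Dividing by the sum of the weights keeps the result in the 0.0 to 1.0 range even if the weights are later changed and no longer add up to 1.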

🖼️ Step 6: Generate Visual Reports (90%)

📈 1. Network Graph

  • Students = nodes
  • Suspicious pairs = edges
  • Node color = number of connections
  • Edge thickness = similarity score
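The data behind such a graph can be prepared without any plotting library. The following standard-library sketch only derives the edge list, per-student connection counts, and edge widths that a networkx/matplotlib drawing would consume. The threshold, the scaling factor, and the sample results are assumptions.

```python
from collections import Counter

THRESHOLD = 0.5   # assumption: pairs below the lowest risk band are not drawn
WIDTH_SCALE = 5   # arbitrary factor turning a score into a line width

# Hypothetical (student, student, combined score) results.
results = [("alice", "bob", 0.91), ("alice", "carol", 0.74), ("bob", "dave", 0.32)]

# Suspicious pairs become edges.
edges = [(a, b, score) for a, b, score in results if score >= THRESHOLD]

# Node color is driven by the number of connections (the node's degree).
degree = Counter()
for a, b, _ in edges:
    degree[a] += 1
    degree[b] += 1

# Edge thickness is driven by the similarity score.
widths = [WIDTH_SCALE * score for _, _, score in edges]

print(dict(degree), widths)
```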

📊 2. Statistics Dashboard

Six insightful plots:

  1. Combined score distribution
  2. Method vs. combined score scatter
  3. Correlation heatmap
  4. Risk level pie chart
  5. Box plots for all methods
  6. Top 10 most suspicious pairs

📤 Step 7: Final Output (100%)

  • Detailed table printed to console
  • Full report saved as CSV
  • Immediate alerts for high-risk cases
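A minimal version of the export and alert step, using only the standard `csv` module, might look like this. The column names and the 0.70 alert cutoff are assumptions; only the output file name comes from this README.

```python
import csv

# Hypothetical column names for the report.
FIELDNAMES = ["student_1", "student_2", "combined_score", "risk_level"]


def save_results(rows, path="advanced_plagiarism_results.csv"):
    """Write one row per student pair to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)


def high_risk_alerts(rows, cutoff=0.70):
    """Return a console message for every pair at or above `cutoff`."""
    return [
        f"ALERT: {r['student_1']} / {r['student_2']} ({r['combined_score']:.0%})"
        for r in rows
        if r["combined_score"] >= cutoff
    ]
```

A typical call would pass the same list of row dictionaries to both functions: first `save_results(rows)`, then print each line returned by `high_risk_alerts(rows)`.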

🛠️ Technologies Used

  • Python 3.x – Core language
  • ast module – Code parsing and analysis
  • networkx + matplotlib – Interactive network visualization
  • seaborn + pandas – Statistical plots
  • scikit-learn (sklearn, TF-IDF) – Machine learning text analysis
  • difflib – Sequence similarity detection
  • collections.Counter – Feature frequency tracking

⚙️ Setup & Usage

1. Prepare Your Environment

git clone https://github.com/hrnrxb/Advanced-Code-Plagiarism-Detection-Tool.git
cd Advanced-Code-Plagiarism-Detection-Tool
pip install -r requirements.txt

📝 Note: Create a folder named homeworks/ and place all student .py files inside.


2. Run the Detector

python main.py

That’s it! The tool will:

  • ✅ Analyze all code pairs
  • 📋 Print a full report
  • 💾 Save results to advanced_plagiarism_results.csv
  • 📊 Display interactive plots

🤝 Contribution Guidelines

We welcome improvements! Feel free to:

  • Add new similarity detection methods
  • Improve AST normalization
  • Support other languages (Java, C++, etc.)
  • Enhance visualization aesthetics
  • Add command-line arguments

Just open an issue or submit a pull request 😁

📄 License

  • This project is licensed under the MIT License – see the LICENSE file for details.

💡 Pro Tips for Teachers

  • 📅 Run this after every major assignment
  • 💬 Use results as conversation starters, not automatic penalties
  • 🔍 Look for clusters, since they may indicate group work gone too far
  • 🗣️ Combine with oral exams for strongest evidence
  • 📝 Keep CSV reports for academic records

🔎 Bottom Line

This tool doesn’t just catch cheaters: it helps you teach integrity by providing clear, objective evidence of code similarity.

Let the machine do the grunt work. You focus on teaching. 🤓


🌟 Stay ahead of plagiarism. Stay fair. Stay informed.
