A Python-based plagiarism detection tool that analyzes text documents for similarity and potential plagiarism using natural language processing techniques.
PlagAi is designed to detect plagiarism between documents by analyzing lexical similarities and word overlap patterns. The system provides configurable similarity thresholds and can process multiple documents simultaneously for comprehensive plagiarism analysis.
- Text Preprocessing: Automatic cleaning, tokenization, and stop word removal
- Multiple Similarity Metrics:
  - Jaccard similarity
  - Overlap coefficient
  - Simple word overlap analysis
  - TF-IDF cosine similarity (newly implemented)
  - N-gram similarity analysis (2-gram, 3-gram, etc.)
- Configurable Detection: Adjustable similarity thresholds (0.0 - 1.0)
- Batch Processing: Analyze multiple documents simultaneously
- Comprehensive Testing: Edge case handling and limitation analysis
- Robust Input Validation: Error handling for various input types and edge cases
- Direct text copying with minor modifications
- Word-for-word plagiarism
- Documents with high lexical overlap
- Structural similarities in text patterns
- Sequential phrase similarities using n-grams
- Weighted term importance using TF-IDF
```bash
pip install -r requirements.txt
```
- NLTK: Natural language processing and tokenization
- scikit-learn: TF-IDF vectorization and machine learning utilities
- NumPy/Pandas: Data manipulation and analysis
```python
import nltk
nltk.download('punkt')
nltk.download('stopwords')
```
```python
from src.detector import BasicPlagiarismDetector

# Initialize detector with custom threshold
detector = BasicPlagiarismDetector(threshold=0.5)

# Compare two documents
doc1 = "Artificial intelligence is transforming the world quickly."
doc2 = "AI is changing the world significantly and rapidly."

result = detector.detect(doc1, doc2)
print(f"Similarity: {result['similarity']:.2f}")
print(f"Plagiarized: {result['is_plagiarized']}")

# Compare multiple documents simultaneously
documents = [
    "Machine learning algorithms are powerful tools",
    "ML algorithms are very powerful instruments",
    "Deep learning is a subset of machine learning",
]

results = detector.analyze_documents(documents)
for result in results:
    print(f"Docs {result['doc1_index']}-{result['doc2_index']}: {result['similarity']:.2f}")
```
```python
from src.similarity import ngram_similarity, tfidf_cosine_similarity
from src.preprocessing import preprocess_doc

# N-gram similarity for phrase detection
tokens1 = preprocess_doc("The quick brown fox jumps")
tokens2 = preprocess_doc("Quick brown fox leaps over")

bigram_sim = ngram_similarity(tokens1, tokens2, n=2)
trigram_sim = ngram_similarity(tokens1, tokens2, n=3)

print(f"Bigram similarity: {bigram_sim:.2f}")
print(f"Trigram similarity: {trigram_sim:.2f}")
```
```python
from src.preprocessing import preprocess_doc

text = "Hello, world! This is a test document."
tokens = preprocess_doc(text)
print(f"Processed tokens: {tokens}")
```
```
PlagAi/
├── src/
│   ├── detector.py             # Main plagiarism detection logic
│   ├── preprocessing.py        # Text cleaning and tokenization
│   └── similarity.py           # Similarity calculation algorithms
├── tests/
│   ├── test_detector.py        # Basic functionality tests
│   ├── test_preprocessing.py   # Preprocessing pipeline tests
│   ├── test_similarity.py      # Similarity metric tests
│   ├── tf_idf_testing.py       # TF-IDF implementation tests
│   ├── ngrams_testing.py       # N-gram similarity tests
│   └── limitation_testing.py   # Edge cases and system limits
└── requirements.txt            # Project dependencies
```
- Basic plagiarism detection with Jaccard similarity
- Text preprocessing pipeline with NLTK
- Input validation and error handling
- Multiple document batch analysis
- TF-IDF calculation and cosine similarity
- N-gram similarity analysis (bigrams, trigrams, etc.)
- Comprehensive edge case testing
- Performance benchmarking for large documents
- Enhanced Similarity Metrics: TF-IDF cosine similarity for weighted term analysis
- N-gram Analysis: Sequential pattern detection using bigrams and trigrams
- Robust Error Handling: Input validation for None, empty, and non-string inputs
- Performance Testing: Benchmarks for documents up to 10,000 words
- Unicode Support: Handling of special characters and international text
- Semantic Blindness: Cannot detect paraphrased content with identical meaning
- Word Order Insensitive: May miss some structural plagiarism patterns
- Language Specific: Currently optimized for English text only
- Performance: Processing time increases significantly with very large documents
- File Format: Only processes plain text strings (no PDF/DOCX support yet)
- Empty and None document inputs
- Non-string input validation
- Unicode and special character processing
- Threshold boundary validation (0.0-1.0)
- NLTK dependency management
- Large document memory optimization
Run the comprehensive test suite:
```bash
# Basic functionality tests
python tests/test_detector.py
python tests/test_preprocessing.py
python tests/test_similarity.py

# Advanced feature testing
python tests/tf_idf_testing.py       # TF-IDF implementation verification
python tests/ngrams_testing.py       # N-gram similarity testing
python tests/limitation_testing.py   # System boundaries and edge cases
```
- Input validation and error handling
- Performance testing with large documents (up to 10,000 words)
- Unicode and special character handling
- Threshold edge cases (negative, >1.0, non-numeric)
- NLTK dependency verification
- Memory usage optimization
- N-gram similarity accuracy
- TF-IDF calculation correctness
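To give a flavor of these checks, the following pytest-style sketch covers threshold boundaries and invalid inputs. It is illustrative only: the assumption that `BasicPlagiarismDetector` raises `ValueError`/`TypeError` on bad input may not match the actual behavior in `src/detector.py`, and the shipped test files are run as plain scripts rather than through pytest.

```python
# Hypothetical edge-case checks; the exception types are assumptions
# about the detector's API, not taken from the project's test suite.
import pytest
from src.detector import BasicPlagiarismDetector

def test_threshold_out_of_range():
    # Thresholds outside 0.0-1.0 are expected to be rejected
    with pytest.raises(ValueError):
        BasicPlagiarismDetector(threshold=1.5)

def test_non_string_input():
    detector = BasicPlagiarismDetector(threshold=0.5)
    # None and non-string documents should fail validation
    with pytest.raises((TypeError, ValueError)):
        detector.detect(None, "a valid document")

def test_identical_documents():
    detector = BasicPlagiarismDetector(threshold=0.5)
    result = detector.detect("same text here", "same text here")
    # Identical documents should score above any threshold below 1.0
    assert result["is_plagiarized"] is True
```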
| Document Size | Processing Time | Memory Usage | Accuracy Notes |
|---|---|---|---|
| 100 words | ~0.01s | ~1KB | High accuracy |
| 1,000 words | ~0.05s | ~10KB | Good performance |
| 5,000 words | ~0.2s | ~50KB | Acceptable for batch |
| 10,000 words | ~0.8s | ~200KB | Edge of practical use |
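Exact figures depend on hardware. A rough measurement of this kind can be reproduced with a small script like the sketch below; the synthetic vocabulary and word counts are illustrative and not part of the project.

```python
# Illustrative benchmark sketch: times detect() on synthetic documents.
# Word counts mirror the table above; absolute numbers vary by machine.
import random
import time

from src.detector import BasicPlagiarismDetector

detector = BasicPlagiarismDetector(threshold=0.5)
vocabulary = ["analysis", "similarity", "document", "research", "method",
              "result", "model", "text", "data", "system"]

for size in (100, 1_000, 5_000, 10_000):
    doc_a = " ".join(random.choices(vocabulary, k=size))
    doc_b = " ".join(random.choices(vocabulary, k=size))
    start = time.perf_counter()
    detector.detect(doc_a, doc_b)
    elapsed = time.perf_counter() - start
    print(f"{size:>6} words: {elapsed:.3f}s")
```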
- Semantic Analysis: Integration of sentence transformers for meaning-based detection
- Multi-format Support: PDF, DOCX, and HTML document processing
- Advanced N-gram Optimization: Sliding window improvements and skip-grams
- Citation Detection: Distinguish between proper attribution and plagiarism
- Web Interface: User-friendly web application with file upload
- Database Integration: Document storage and historical comparison
- Multi-language Support: Extend beyond English text processing
- Real-time Processing: Stream processing for large document sets
- Phase 1: Semantic similarity using transformer models
- Phase 2: Web interface and REST API development
- Phase 3: Multi-format document support and cloud deployment
- Phase 4: Advanced analytics dashboard and reporting
This project is in active development. Current focus areas:
- Semantic similarity algorithms (transformer-based)
- Performance optimization for large datasets
- Additional file format support (PDF, DOCX)
- Advanced preprocessing techniques
- User interface development
```bash
git clone https://github.com/mal0101/PlagAi.git
cd PlagAi
pip install -r requirements.txt
python -m nltk.downloader punkt stopwords
```
The system now supports multiple similarity approaches:
- Jaccard Similarity: `|A ∩ B| / |A ∪ B|` (basic set overlap)
- Overlap Coefficient: `|A ∩ B| / min(|A|, |B|)` (normalized overlap)
- TF-IDF Cosine Similarity: Weighted term importance with cosine distance
- N-gram Similarity: Sequential pattern matching (configurable n-size)
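For reference, the set-based formulas above can be written out directly, and a TF-IDF cosine score computed with scikit-learn. This is a minimal standalone sketch, with its own hypothetical helper names, not the implementation in `src/similarity.py`.

```python
# Standalone sketch of the set-based metrics and TF-IDF cosine similarity;
# not the project's code, just the formulas above written out.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def jaccard_similarity(tokens_a, tokens_b):
    """|A ∩ B| / |A ∪ B| over unique tokens."""
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def overlap_coefficient(tokens_a, tokens_b):
    """|A ∩ B| / min(|A|, |B|) over unique tokens."""
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / min(len(a), len(b)) if a and b else 0.0

def tfidf_cosine(text_a, text_b):
    """Cosine similarity between TF-IDF vectors of two raw texts."""
    matrix = TfidfVectorizer().fit_transform([text_a, text_b])
    return float(cosine_similarity(matrix[0], matrix[1])[0, 0])

tokens1 = "artificial intelligence is transforming the world".split()
tokens2 = "ai is changing the world rapidly".split()
print(jaccard_similarity(tokens1, tokens2))
print(overlap_coefficient(tokens1, tokens2))
print(tfidf_cosine(" ".join(tokens1), " ".join(tokens2)))
```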
- Text normalization (lowercase, punctuation removal)
- Tokenization using NLTK word_tokenize
- Stop word removal (English corpus)
- Token filtering and validation
- Unicode and special character handling
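The steps above correspond roughly to an NLTK pipeline like the following. This is an illustrative sketch of the approach with a hypothetical `preprocess` function, not the exact code in `src/preprocessing.py`.

```python
# Illustrative sketch of the preprocessing steps described above;
# src/preprocessing.py may differ in details.
import string

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def preprocess(text: str) -> list[str]:
    # Normalize: lowercase and strip punctuation
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    # Tokenize using NLTK word_tokenize
    tokens = word_tokenize(text)
    # Remove English stop words and keep alphabetic tokens only
    stop_words = set(stopwords.words("english"))
    return [tok for tok in tokens if tok.isalpha() and tok not in stop_words]

print(preprocess("Hello, world! This is a test document."))
# -> ['hello', 'world', 'test', 'document']
```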
- Input type validation (string, None, empty checks)
- Threshold boundary enforcement (0.0-1.0)
- NLTK dependency verification
- Memory usage monitoring for large documents
- Graceful degradation for edge cases
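A hedged sketch of what this validation layer might look like is shown below; the hypothetical `validate_inputs` helper and its exact checks and error messages may differ from what `src/detector.py` actually does.

```python
# Hedged sketch of the validation described above; the actual checks in
# src/detector.py may be organized differently.
def validate_inputs(doc1, doc2, threshold):
    # Input type validation: documents must be non-empty strings
    for name, doc in (("doc1", doc1), ("doc2", doc2)):
        if doc is None:
            raise ValueError(f"{name} must not be None")
        if not isinstance(doc, str):
            raise TypeError(f"{name} must be a string, got {type(doc).__name__}")
        if not doc.strip():
            raise ValueError(f"{name} must not be empty")
    # Threshold boundary enforcement: 0.0 <= threshold <= 1.0
    if not isinstance(threshold, (int, float)) or not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be a number between 0.0 and 1.0")
```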
This project is developed for educational and research purposes.
For questions, suggestions, or contributions, please refer to the project repository.
Note: PlagAi is currently in active development, with a focus on expanding similarity detection methods and improving performance. The system is optimized for academic and research use cases; production deployment would require additional security and scalability work.