An advanced command-line tool that compares text files and detects plagiarism using multiple similarity algorithms. Features intelligent analysis, confidence scoring, and sentence-level detection for comprehensive academic integrity checking.
- Cosine Similarity: Traditional vector-space model comparison
- TF-IDF Cosine Similarity: Weighted comparison emphasizing rare terms
- Jaccard Similarity (Character): Character-level n-gram comparison
- Jaccard Similarity (Word): Word-level n-gram comparison
- W-Shingling: Overlapping token-sequence (shingle) comparison, a standard plagiarism-detection technique
- Stopwords filtering to focus on meaningful content
- Customizable stopwords files for domain-specific filtering
- Case-insensitive text processing with automatic punctuation handling
- Configurable shingle sizes for n-gram analysis
- Confidence Level Assessment: Automatic categorization with interpretation guidelines
- Document Statistics: Word count, sentence analysis, lexical diversity
- Sentence-Level Detection: Identifies paraphrasing and structural similarities
- Plagiarism Indicators: Specific flags and recommendations for educators
- Weighted Scoring: Intelligent combination of multiple algorithms
- Multiple output formats: simple, detailed, JSON
- Performance timing measurements
- Similarity threshold filtering
- Batch processing of multiple files
- Comprehensive analysis summaries
- Optimized C++17 implementation with compiler optimizations
- C++17 compatible compiler (GCC 7+, Clang 5+, MSVC 2017+)
- CMake 3.10 or higher
# Clone the repository
git clone https://github.com/HaidanP/SimText.git
cd SimText
# Build the project
mkdir build && cd build
cmake ..
make
# Run with test files
./simtext --ignore-stopwords ../test1.txt ../test2.txt
./simtext document1.txt document2.txt
# Use TF-IDF for better rare term emphasis
./simtext --algorithm tfidf essay1.txt essay2.txt
# Use Jaccard similarity for structural comparison
./simtext --algorithm jaccard-word --shingle-size 4 doc1.txt doc2.txt
# Compare using all algorithms
./simtext --algorithm all --output detailed paper1.txt paper2.txt
# Comprehensive plagiarism detection with confidence levels
./simtext --algorithm all --analysis essay1.txt essay2.txt
# Detect paraphrasing with sentence-level analysis
./simtext --analysis --sentence-check document1.txt document2.txt
# Professional analysis for educators
./simtext --algorithm all --output detailed --analysis --sentence-check paper1.txt paper2.txt
# Simple output (default)
./simtext doc1.txt doc2.txt
# Detailed analysis
./simtext --algorithm all --output detailed --timing doc1.txt doc2.txt
# JSON output for integration
./simtext --algorithm all --output json doc1.txt doc2.txt
# Compare multiple files (all pairs)
./simtext --threshold 0.5 *.txt
# Filter results above threshold
./simtext --algorithm jaccard-char --threshold 0.8 --output simple essay*.txt
| Option | Description | Default |
|---|---|---|
| --algorithm ALGO | Algorithm: cosine, tfidf, jaccard-char, jaccard-word, all | cosine |
| --ignore-stopwords | Filter out common words | false |
| --stopwords-file FILE | Use custom stopwords file | none |
| --output FORMAT | Output format: simple, detailed, json | simple |
| --shingle-size N | N-gram size for Jaccard similarity | 3 |
| --threshold N | Only show results above threshold (0.0-1.0) | 0.0 |
| --timing | Show execution times | false |
| --analysis | Show detailed plagiarism analysis and confidence levels | false |
| --sentence-check | Show sentence-level similarity analysis | false |
Traditional vector-space model approach:
similarity = (A · B) / (||A|| × ||B||)
Where A and B are term frequency vectors.
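As a minimal sketch (not the project's actual API; the function name and map-based vector representation are illustrative), the formula above translates to C++ like this:

```cpp
#include <cmath>
#include <string>
#include <unordered_map>

// Cosine similarity between two term-frequency vectors, stored as
// term -> frequency maps: (A . B) / (||A|| * ||B||).
double cosine_similarity(const std::unordered_map<std::string, double>& a,
                         const std::unordered_map<std::string, double>& b) {
    double dot = 0.0, norm_a = 0.0, norm_b = 0.0;
    for (const auto& [term, freq] : a) {
        norm_a += freq * freq;
        auto it = b.find(term);
        if (it != b.end()) dot += freq * it->second;  // A . B over shared terms
    }
    for (const auto& [term, freq] : b) norm_b += freq * freq;
    if (norm_a == 0.0 || norm_b == 0.0) return 0.0;   // guard empty documents
    return dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
}
```

Identical vectors score 1.0; documents with no shared terms score 0.0.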
Emphasizes rare terms by weighting with inverse document frequency:
TF-IDF(t,d) = TF(t,d) × log(N/DF(t))
More effective for larger document collections.
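A sketch of the weighting step (the helper name is illustrative, not the project's actual function):

```cpp
#include <cmath>

// TF-IDF weight for one term: TF(t,d) * log(N / DF(t)), where N is the
// number of documents and DF(t) the number of documents containing t.
// DF must be at least 1 (the term appears somewhere).
double tfidf_weight(double tf, int doc_count, int df) {
    return tf * std::log(static_cast<double>(doc_count) / df);
}
```

Note that when comparing only two files, any term present in both gets weight log(2/2) = 0, so TF-IDF similarity can legitimately report 0% on a two-file run; the algorithm shines with larger collections.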
Compares n-gram sets using Jaccard coefficient:
Jaccard(A,B) = |A ∩ B| / |A ∪ B|
- Character-level: Catches near-verbatim copying that survives minor character edits
- Word-level: Identifies structural similarities
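A minimal sketch of both steps, assuming word-level shingles of size n (function names are illustrative, not the project's actual API):

```cpp
#include <set>
#include <string>
#include <vector>

// Jaccard coefficient |A intersect B| / |A union B| over two shingle sets.
double jaccard(const std::set<std::string>& a, const std::set<std::string>& b) {
    if (a.empty() && b.empty()) return 0.0;
    std::size_t inter = 0;
    for (const auto& s : a)
        if (b.count(s)) ++inter;
    // |A union B| = |A| + |B| - |A intersect B|
    return static_cast<double>(inter) / (a.size() + b.size() - inter);
}

// Word-level shingles of size n, e.g. for n = 3:
// "the quick brown fox" -> {"the quick brown", "quick brown fox"}.
std::set<std::string> word_shingles(const std::vector<std::string>& words,
                                    std::size_t n) {
    std::set<std::string> out;
    for (std::size_t i = 0; n > 0 && i + n <= words.size(); ++i) {
        std::string sh = words[i];
        for (std::size_t j = 1; j < n; ++j) sh += " " + words[i + j];
        out.insert(sh);
    }
    return out;
}
```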
- Normalization: Converts to lowercase
- Tokenization: Splits into words/characters
- Cleaning: Removes punctuation and special characters
- Filtering: Optionally removes stopwords
- N-gram Generation: Creates shingles for Jaccard analysis
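The first four stages above can be sketched as a single pass (an illustrative outline under the stated assumptions, not the project's actual implementation):

```cpp
#include <cctype>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Lowercase the text, replace punctuation with spaces, tokenize on
// whitespace, then drop any token found in the stopword set.
std::vector<std::string> preprocess(const std::string& text,
                                    const std::set<std::string>& stopwords) {
    std::string cleaned;
    for (unsigned char c : text) {
        if (std::isalnum(c) || std::isspace(c))
            cleaned += static_cast<char>(std::tolower(c));  // normalization
        else
            cleaned += ' ';                                 // cleaning
    }
    std::vector<std::string> tokens;
    std::istringstream iss(cleaned);
    std::string word;
    while (iss >> word)
        if (!stopwords.count(word)) tokens.push_back(word); // filtering
    return tokens;
}
```

The resulting token list then feeds the shingle generator for Jaccard analysis.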
$ ./simtext --ignore-stopwords original.txt suspected_copy.txt
Similarity between files:
original.txt and suspected_copy.txt
Similarity score: 87.3%
- 0-30%: Minimal similarity - likely original content
- 30-60%: Moderate similarity - some shared concepts
- 60-80%: High similarity - significant overlap detected
- 80-100%: Very high similarity - potential plagiarism
- Academic integrity checking
- Document comparison and revision tracking
- Legal document analysis
- Research paper comparison
- Content originality verification
SimText/
├── CMakeLists.txt # Build configuration
├── README.md # This file
├── stopwords.txt # Default stopwords list
├── include/ # Header files
│ ├── text_processor.hpp
│ └── similarity_calculator.hpp
├── src/ # Source code
│ ├── main.cpp
│ ├── text_processor.cpp
│ └── similarity_calculator.cpp
└── build/ # Build directory (generated)
cd build
./test_simtext
Create test files and compare them:
# Create sample files
echo "The quick brown fox jumps over the lazy dog." > test1.txt
echo "A quick brown fox is jumping over a lazy dog." > test2.txt
# Basic comparison
./simtext test1.txt test2.txt
# Advanced comparison with all algorithms
./simtext --algorithm all --output detailed --timing test1.txt test2.txt
=== Similarity Analysis ===
File 1: test1.txt
File 2: test2.txt
Cosine Similarity: 68.35%
TF-IDF Similarity: 0.00%
Jaccard (Character): 50.00%
Jaccard (Word): 7.50%
=== ANALYSIS SUMMARY ===
Document Comparison:
Document 1: 25 words, 3 sentences
Document 2: 22 words, 3 sentences
Similarity Assessment:
Confidence Level: Low
Overall Score: 43.8%
Interpretation: Low similarity - minimal content overlap
Key Indicators:
• Likely original content
• Normal similarity for same topic
=== HIGH SIMILARITY SENTENCES ===
Similarity: 78.3%
Sentence: "This is a test document for plagiarism detection"
Create a text file with one stopword per line:
the
and
is
are
was
were
#!/bin/bash
# Extract the first similarity percentage from simtext's output.
SIMILARITY=$(./simtext --ignore-stopwords "$1" "$2" | grep -o '[0-9.]*%' | head -1)
echo "Documents are $SIMILARITY similar"
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
- Fork the repository
- Create a feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
For questions or issues:
- Open an issue on GitHub
- Check existing issues for solutions
- Review the documentation