A high-performance Change Data Capture (CDC) tool for comparing CSV files and detecting differences. Built with Python and optimized for speed using Polars, NumPy, and xxHash.
- Lightning Fast: Uses Polars for CSV reading and xxHash for efficient comparisons
- Large File Support: Chunked processing for files of any size with memory optimization
- Flexible Primary Keys: Support for single or composite primary keys
- Auto-Detection: Automatically detect primary keys by analyzing data patterns
- Multiple Output Formats: diff, JSON, rowmark, and word-diff formats
- Column Selection: Include/exclude specific columns from comparison
- Progress Tracking: Built-in progress bars for large files
- Memory Efficient: Optimized for handling large CSV files with configurable chunk processing
- Cross-Platform: Works on Windows, macOS, and Linux
pip install csv-cdc
csvcdc old_file.csv new_file.csv
git clone https://github.com/maurohkcba/csv-cdc.git
cd csv-cdc
pip install -r requirements.txt
python setup.py install
git clone https://github.com/maurohkcba/csv-cdc.git
cd csv-cdc
pip install -e .
Compare two CSV files using the first column as primary key:
python csvcdc.py old_file.csv new_file.csv
For very large files that cause memory issues:
python csvcdc.py huge_file1.csv huge_file2.csv --largefiles 1
# Additions (2)
+ 4,New Product,99.99,Electronics
+ 5,Another Item,45.00,Books
# Modifications (1)
- 2,Laptop,999.99,Electronics
+ 2,Laptop,899.99,Electronics
# Deletions (1)
- 3,Old Product,25.99,Discontinued
Create sample files:
base.csv
id,name,price,category
1,Widget,10.99,Tools
2,Gadget,25.50,Electronics
3,Book,15.99,Education
delta.csv
id,name,price,category
1,Widget,12.99,Tools
2,Gadget,25.50,Electronics
4,Magazine,8.99,Education
Compare the files:
python csvcdc.py base.csv delta.csv --primary-key 0
Output:
# Additions (1)
+ 4,Magazine,8.99,Education
# Modifications (1)
- 1,Widget,10.99,Tools
+ 1,Widget,12.99,Tools
# Deletions (1)
- 3,Book,15.99,Education
For files that are too large to fit in memory (multi-GB files):
# Enable large file mode with default chunk size (500,000 rows)
python csvcdc.py large_base.csv large_delta.csv --largefiles 1 --time
# Custom chunk size for very large files
python csvcdc.py huge_base.csv huge_delta.csv --largefiles 1 --chunk-size 100000
# Large file with JSON output
python csvcdc.py massive_file1.csv massive_file2.csv \
  --largefiles 1 \
  --chunk-size 250000 \
  --format json \
  --time > changes.json
Use multiple columns as primary key:
python csvcdc.py base.csv delta.csv --primary-key 0,1
Let the tool automatically detect the best primary key:
python csvcdc.py base.csv delta.csv --autopk 1
For large files with auto-detection:
python csvcdc.py large_base.csv large_delta.csv --autopk 1 --largefiles 1
Compare only specific columns:
# Compare only columns 0, 1, and 2
python csvcdc.py base.csv delta.csv --columns 0,1,2
# Ignore column 3 (category) from comparison
python csvcdc.py base.csv delta.csv --ignore-columns 3
JSON Format:
python csvcdc.py base.csv delta.csv --format json
{
  "Additions": [
    "4,Magazine,8.99,Education"
  ],
  "Modifications": [
    {
      "Original": "1,Widget,10.99,Tools",
      "Current": "1,Widget,12.99,Tools"
    }
  ],
  "Deletions": [
    "3,Book,15.99,Education"
  ]
}
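Because the JSON output uses the keys shown above ("Additions", "Modifications" with "Original"/"Current" pairs, and "Deletions"), it is easy to post-process from a script. A small example that reads a changes.json produced with --format json:

import json

with open('changes.json') as f:
    changes = json.load(f)

# Additions and deletions are full CSV row strings; modifications pair the old and new rows
for row in changes['Additions']:
    print(f"added:   {row}")
for change in changes['Modifications']:
    print(f"updated: {change['Original']} -> {change['Current']}")
for row in changes['Deletions']:
    print(f"removed: {row}")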
Rowmark Format:
python csvcdc.py base.csv delta.csv --format rowmark
ADDED,4,Magazine,8.99,Education
MODIFIED,1,Widget,12.99,Tools
Word Diff Format:
python csvcdc.py base.csv delta.csv --format word-diff
For tab-separated files:
python csvcdc.py base.tsv delta.tsv --separator '\t'
For pipe-separated files:
python csvcdc.py base.csv delta.csv --separator '|'
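The separator setting is also available through the Python API described below; for example, a pair of tab-separated files can be compared with:

from csvcdc import CSVCDC

# Tab-separated files via the Python API (mirrors --separator '\t')
cdc = CSVCDC(separator='\t', primary_key=[0])
result = cdc.compare('base.tsv', 'delta.tsv')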
Track execution time and show progress:
python csvcdc.py large_base.csv large_delta.csv --time --progressbar 1
For files with millions of rows:
# Auto-detect primary key, show progress, time execution, large file mode
python csvcdc.py huge_base.csv huge_delta.csv \
  --autopk 1 \
  --progressbar 1 \
  --time \
  --largefiles 1 \
  --chunk-size 200000 \
  --format json > changes.json
If you encounter memory allocation errors like:
Error: Unable to allocate 203. GiB for an array with shape (5196564, 42)
Use large file mode:
python csvcdc.py problematic_file1.csv problematic_file2.csv \
  --largefiles 1 \
  --chunk-size 50000 \
  --progressbar 1 \
  --time
Option | Description | Default |
---|---|---|
base_csv | Base CSV file path | Required |
delta_csv | Delta CSV file path | Required |
-p, --primary-key | Primary key column positions (comma-separated) | 0 |
-s, --separator | Field separator | , |
--columns | Columns to compare (comma-separated) | All columns |
--ignore-columns | Columns to ignore (comma-separated) | None |
--include | Columns to include in output | All columns |
-o, --format | Output format: diff, json, rowmark, word-diff | diff |
--time | Show execution time | False |
--progressbar | Show progress bar (0 or 1) | 1 |
--autopk | Auto-detect primary key (0 or 1) | 0 |
--largefiles | Enable large file optimization with chunked processing (0 or 1) | 0 |
--chunk-size | Chunk size for large file processing | 500000 |
--version | Show version | - |
Enable --largefiles 1 when:
- Files are larger than available RAM
- You get memory allocation errors
- Files have millions of rows
- You want to minimize memory usage
File Size | Recommended Chunk Size | Memory Usage |
---|---|---|
< 100MB | Default (no chunking) | Full file in RAM |
100MB - 1GB | 500,000 rows | ~500MB RAM |
1GB - 10GB | 200,000 rows | ~200MB RAM |
> 10GB | 50,000 - 100,000 rows | ~50-100MB RAM |
# For 5GB+ files
python csvcdc.py massive1.csv massive2.csv --largefiles 1 --chunk-size 100000
# For extreme cases (50GB+ files)
python csvcdc.py extreme1.csv extreme2.csv --largefiles 1 --chunk-size 25000
# Balanced performance and memory
python csvcdc.py large1.csv large2.csv --largefiles 1 --chunk-size 250000
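Conceptually, large file mode avoids holding both files' raw rows in memory at once: each file is reduced to a compact primary key → row hash mapping, and the three change classes fall out of set operations on those mappings. The sketch below shows the idea only; it streams row by row with the standard library csv module and xxHash, whereas the tool itself reads Polars chunks of --chunk-size rows:

import csv
import xxhash

def keyed_hashes(path, pk_col=0):
    """Stream a CSV and keep only primary key -> 64-bit row hash in memory."""
    hashes = {}
    with open(path, newline='') as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        for row in reader:
            hashes[row[pk_col]] = xxhash.xxh64(','.join(row)).intdigest()
    return hashes

old = keyed_hashes('large_base.csv')
new = keyed_hashes('large_delta.csv')

# Set arithmetic on the keys yields the three change classes
additions = new.keys() - old.keys()
deletions = old.keys() - new.keys()
modifications = {k for k in old.keys() & new.keys() if old[k] != new[k]}
print(f"added={len(additions)} modified={len(modifications)} deleted={len(deletions)}")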
from csvcdc import CSVCDC

# Create CDC instance
cdc = CSVCDC(separator=',', primary_key=[0])

# Compare files
result = cdc.compare('base.csv', 'delta.csv')

# Access results
print(f"Additions: {len(result.additions)}")
print(f"Modifications: {len(result.modifications)}")
print(f"Deletions: {len(result.deletions)}")

# Process individual changes
for addition in result.additions:
    print(f"Added: {addition}")

for modification in result.modifications:
    print(f"Changed from: {modification['Original']}")
    print(f"Changed to: {modification['Current']}")

for deletion in result.deletions:
    print(f"Deleted: {deletion}")
from csvcdc import CSVCDC

# Large file configuration
cdc = CSVCDC(
    separator=',',
    primary_key=[0],
    largefiles=1,       # Enable chunked processing
    chunk_size=100000,  # Process 100k rows at a time
    progressbar=1
)

# Compare large files
result = cdc.compare('huge_base.csv', 'huge_delta.csv')

# Process results normally
print(f"Found {len(result.additions)} additions")
print(f"Found {len(result.modifications)} modifications")
print(f"Found {len(result.deletions)} deletions")
from csvcdc import CSVCDC, OutputFormatter

# Advanced configuration with large file support
cdc = CSVCDC(
    separator=',',
    primary_key=[0, 1],     # Composite primary key
    ignore_columns=[3, 4],  # Ignore columns 3 and 4
    progressbar=1,
    autopk=0,
    largefiles=1,           # Enable for large files
    chunk_size=200000       # Custom chunk size
)

# Compare files
result = cdc.compare('data/products_old.csv', 'data/products_new.csv')

# Use different formatters
diff_output = OutputFormatter.format_diff(result)
json_output = OutputFormatter.format_json(result)
rowmark_output = OutputFormatter.format_rowmark(result)

print("Diff format:")
print(diff_output)

# Save JSON output
with open('changes.json', 'w') as f:
    f.write(json_output)
from csvcdc import CSVCDC
import json

def process_large_changes(base_file, delta_file):
    # Optimized for large files
    cdc = CSVCDC(
        autopk=1,           # Auto-detect primary key
        largefiles=1,       # Enable chunked processing
        chunk_size=150000,  # Custom chunk size
        progressbar=1
    )

    result = cdc.compare(base_file, delta_file)

    # Custom processing
    changes_summary = {
        'total_additions': len(result.additions),
        'total_modifications': len(result.modifications),
        'total_deletions': len(result.deletions),
        'change_rate': (len(result.additions) + len(result.modifications) + len(result.deletions)) / 100
    }

    # Process specific types of changes
    price_changes = []
    for mod in result.modifications:
        orig_parts = mod['Original'].split(',')
        curr_parts = mod['Current'].split(',')

        # Assuming price is in column 2
        if len(orig_parts) > 2 and len(curr_parts) > 2:
            try:
                old_price = float(orig_parts[2])
                new_price = float(curr_parts[2])
                if old_price != new_price:
                    price_changes.append({
                        'id': orig_parts[0],
                        'old_price': old_price,
                        'new_price': new_price,
                        'change': new_price - old_price
                    })
            except ValueError:
                pass

    changes_summary['price_changes'] = price_changes
    return changes_summary

# Usage
summary = process_large_changes('old_products.csv', 'new_products.csv')
print(json.dumps(summary, indent=2))
The auto primary key detection feature analyzes your data to find the best column(s) to use as primary key:
# Enable auto-detection
cdc = CSVCDC(autopk=1)
result = cdc.compare('file1.csv', 'file2.csv')
# Auto-detection with large files
cdc = CSVCDC(autopk=1, largefiles=1)
result = cdc.compare('large_file1.csv', 'large_file2.csv')
The algorithm considers:
- Uniqueness: How unique values are in each column
- Match Rate: How well values match between files
- Composite Keys: Tests combinations of columns
Auto-detecting primary key...
Testing single columns: 100%|██████████| 5/5
Testing column combinations: 100%|███| 3/3
Auto-detected primary key: columns [0, 1] (score: 0.943)
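As a rough illustration of that scoring (not the tool's exact formula), a candidate column can be ranked by combining its uniqueness in the base file with how well its values match the delta file. The column range here assumes the four-column sample files from the example above:

import csv

def load_column(path, col):
    with open(path, newline='') as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        return [row[col] for row in reader]

def pk_score(base_path, delta_path, col):
    base_vals = load_column(base_path, col)
    delta_vals = load_column(delta_path, col)
    # Uniqueness: fraction of distinct values in the base column
    uniqueness = len(set(base_vals)) / max(len(base_vals), 1)
    # Match rate: fraction of distinct base values that also appear in the delta file
    match_rate = len(set(base_vals) & set(delta_vals)) / max(len(set(base_vals)), 1)
    return uniqueness * match_rate

# Rank single columns (composite keys would score column combinations the same way)
scores = {col: pk_score('base.csv', 'delta.csv', col) for col in range(4)}
best = max(scores, key=scores.get)
print(f"Best single-column candidate: column {best} (score: {scores[best]:.3f})")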
Performance comparison on different file sizes:
Tool | Time | Memory |
---|---|---|
csv-cdc | 12.3s | 150MB |
Traditional diff | 45.2s | 400MB |
Manual Python | 38.7s | 320MB |
Mode | File Size | Time | Peak Memory |
---|---|---|---|
Regular | 1GB | 45s | 2.1GB |
Large File Mode | 1GB | 52s | 350MB |
Large File Mode | 10GB | 8.5min | 450MB |
Large File Mode | 50GB | 42min | 500MB |
- Polars Integration: Ultra-fast CSV reading
- xxHash: High-speed hashing algorithm
- Vectorized Operations: NumPy-based processing
- Chunked Processing: Memory-efficient large file handling
- Progressive Loading: Streaming for huge files
- Garbage Collection: Automatic memory cleanup between chunks
Run the test suite:
# Install test dependencies
pip install pytest pytest-cov
# Run tests
pytest tests/
# Run with coverage
pytest --cov=csvcdc tests/
# Test large file functionality
pytest tests/test_large_files.py
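A minimal test of your own might look like the following; it uses pytest's tmp_path fixture and the CSVCDC API shown earlier, with expected counts following the addition/modification/deletion semantics described above:

from csvcdc import CSVCDC

def test_detects_basic_changes(tmp_path):
    base = tmp_path / 'base.csv'
    delta = tmp_path / 'delta.csv'
    base.write_text('id,name,price\n1,Widget,10.99\n2,Gadget,25.50\n')
    delta.write_text('id,name,price\n1,Widget,12.99\n3,Book,15.99\n')

    result = CSVCDC(separator=',', primary_key=[0]).compare(str(base), str(delta))

    assert len(result.additions) == 1      # id 3 is new
    assert len(result.modifications) == 1  # id 1 changed price
    assert len(result.deletions) == 1      # id 2 was removed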
Contributions are welcome! Please feel free to submit a Pull Request.
git clone https://github.com/maurohkcba/csv-cdc.git
cd csv-cdc
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt
pip install -e .
pytest tests/
This project is licensed under the MIT License. See the LICENCE file for details.
- Bug Reports: GitHub Issues
- Feature Requests: GitHub Issues
- Documentation: Wiki
- Large file chunked processing
- Memory optimization for huge datasets
- Support for Excel files
- Database output integration
- Web UI interface
- Docker containerization
- Cloud storage support (S3, GCS, Azure)
- Parallel processing for multi-core systems
- Configuration file support
- Scheduled comparison jobs
If you find this tool useful, please consider giving it a star!
See CHANGELOG.md for a list of changes and version history.