CSV CDC (Change Data Capture) Tool

Python 3.7+ | License: MIT

A high-performance Change Data Capture (CDC) tool for comparing CSV files and detecting differences. Built with Python and optimized for speed using Polars, NumPy, and xxHash.

πŸš€ Features

  • Lightning Fast: Uses Polars for CSV reading and xxHash for efficient comparisons
  • Large File Support: Chunked processing for files of any size with memory optimization
  • Flexible Primary Keys: Support for single or composite primary keys
  • Auto-Detection: Automatically detect primary keys by analyzing data patterns
  • Multiple Output Formats: diff, JSON, rowmark, and word-diff formats
  • Column Selection: Include/exclude specific columns from comparison
  • Progress Tracking: Built-in progress bars for large files
  • Memory Efficient: Optimized for handling large CSV files with configurable chunk processing
  • Cross-Platform: Works on Windows, macOS, and Linux

πŸ“¦ Installation

From PyPI

pip install csv-cdc
csvcdc old_file.csv new_file.csv

From Source

git clone https://github.com/maurohkcba/csv-cdc.git
cd csv-cdc
pip install -r requirements.txt
python setup.py install

Development Installation

git clone https://github.com/maurohkcba/csv-cdc.git
cd csv-cdc
pip install -e .

πŸƒβ€β™‚οΈ Quick Start

Basic Usage

Compare two CSV files using the first column as the primary key:

python csvcdc.py old_file.csv new_file.csv

Large File Usage

For very large files that cause memory issues:

python csvcdc.py huge_file1.csv huge_file2.csv --largefiles 1

Example Output

# Additions (2)
+ 4,New Product,99.99,Electronics
+ 5,Another Item,45.00,Books

# Modifications (1)
- 2,Laptop,999.99,Electronics
+ 2,Laptop,899.99,Electronics

# Deletions (1)
- 3,Old Product,25.99,Discontinued

πŸ“š Detailed Examples

1. Basic File Comparison

Create sample files:

base.csv

id,name,price,category
1,Widget,10.99,Tools
2,Gadget,25.50,Electronics
3,Book,15.99,Education

delta.csv

id,name,price,category
1,Widget,12.99,Tools
2,Gadget,25.50,Electronics
4,Magazine,8.99,Education

Compare the files:

python csvcdc.py base.csv delta.csv --primary-key 0

Output:

# Additions (1)
+ 4,Magazine,8.99,Education

# Modifications (1)
- 1,Widget,10.99,Tools
+ 1,Widget,12.99,Tools

# Deletions (1)
- 3,Book,15.99,Education

2. Large File Processing

For files that are too large to fit in memory (multi-GB files):

# Enable large file mode with default chunk size (500,000 rows)
python csvcdc.py large_base.csv large_delta.csv --largefiles 1 --time

# Custom chunk size for very large files
python csvcdc.py huge_base.csv huge_delta.csv --largefiles 1 --chunk-size 100000

# Large file with JSON output
python csvcdc.py massive_file1.csv massive_file2.csv \
  --largefiles 1 \
  --chunk-size 250000 \
  --format json \
  --time > changes.json

3. Custom Primary Key

Use multiple columns as primary key:

python csvcdc.py base.csv delta.csv --primary-key 0,1

4. Auto-Detect Primary Key

Let the tool automatically detect the best primary key:

python csvcdc.py base.csv delta.csv --autopk 1

For large files with auto-detection:

python csvcdc.py large_base.csv large_delta.csv --autopk 1 --largefiles 1

5. Column Selection

Compare only specific columns:

# Compare only columns 0, 1, and 2
python csvcdc.py base.csv delta.csv --columns 0,1,2

# Ignore column 3 (category) from comparison
python csvcdc.py base.csv delta.csv --ignore-columns 3
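
The same column filtering is available from the Python API via the ignore_columns parameter (shown in the API section below):

from csvcdc import CSVCDC

# Ignore column 3 (category) when comparing rows
cdc = CSVCDC(primary_key=[0], ignore_columns=[3])
result = cdc.compare('base.csv', 'delta.csv')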

6. Different Output Formats

JSON Format:

python csvcdc.py base.csv delta.csv --format json
{
  "Additions": [
    "4,Magazine,8.99,Education"
  ],
  "Modifications": [
    {
      "Original": "1,Widget,10.99,Tools",
      "Current": "1,Widget,12.99,Tools"
    }
  ],
  "Deletions": [
    "3,Book,15.99,Education"
  ]
}
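
Because the JSON structure is stable (as shown above), downstream scripts can consume it directly. A minimal example that reads output saved with --format json:

import json

# Produced earlier with: python csvcdc.py base.csv delta.csv --format json > changes.json
with open('changes.json') as f:
    changes = json.load(f)

for mod in changes.get('Modifications', []):
    print(f"{mod['Original']}  ->  {mod['Current']}")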

Rowmark Format:

python csvcdc.py base.csv delta.csv --format rowmark
ADDED,4,Magazine,8.99,Education
MODIFIED,1,Widget,12.99,Tools

Word Diff Format (highlights the changed fields within each modified row):

python csvcdc.py base.csv delta.csv --format word-diff

7. Custom Separators

For tab-separated files:

python csvcdc.py base.tsv delta.tsv --separator '\t'

For pipe-separated files:

python csvcdc.py base.csv delta.csv --separator '|'

8. Performance Monitoring

Track execution time and show progress:

python csvcdc.py large_base.csv large_delta.csv --time --progressbar 1

9. Large File Example

For files with millions of rows:

# Auto-detect primary key, show progress, time execution, large file mode
python csvcdc.py huge_base.csv huge_delta.csv \
  --autopk 1 \
  --progressbar 1 \
  --time \
  --largefiles 1 \
  --chunk-size 200000 \
  --format json > changes.json

10. Memory Error Scenarios

If you encounter memory allocation errors like:

Error: Unable to allocate 203. GiB for an array with shape (5196564, 42)

Use large file mode:

python csvcdc.py problematic_file1.csv problematic_file2.csv \
  --largefiles 1 \
  --chunk-size 50000 \
  --progressbar 1 \
  --time

πŸ”§ Command Line Options

| Option | Description | Default |
| --- | --- | --- |
| base_csv | Base CSV file path | Required |
| delta_csv | Delta CSV file path | Required |
| -p, --primary-key | Primary key column positions (comma-separated) | 0 |
| -s, --separator | Field separator | , |
| --columns | Columns to compare (comma-separated) | All columns |
| --ignore-columns | Columns to ignore (comma-separated) | None |
| --include | Columns to include in output | All columns |
| -o, --format | Output format: diff, json, rowmark, word-diff | diff |
| --time | Show execution time | False |
| --progressbar | Show progress bar (0 or 1) | 1 |
| --autopk | Auto-detect primary key (0 or 1) | 0 |
| --largefiles | Enable large file optimization with chunked processing (0 or 1) | 0 |
| --chunk-size | Chunk size for large file processing | 500000 |
| --version | Show version | - |

πŸ“ Large File Processing

When to Use Large File Mode

Enable --largefiles 1 when:

  • Files are larger than available RAM
  • You get memory allocation errors
  • Files have millions of rows
  • You want to minimize memory usage

Chunk Size Guidelines

| File Size | Recommended Chunk Size | Memory Usage |
| --- | --- | --- |
| < 100MB | Default (no chunking) | Full file in RAM |
| 100MB - 1GB | 500,000 rows | ~500MB RAM |
| 1GB - 10GB | 200,000 rows | ~200MB RAM |
| > 10GB | 50,000 - 100,000 rows | ~50-100MB RAM |

Large File Examples

# For 5GB+ files
python csvcdc.py massive1.csv massive2.csv --largefiles 1 --chunk-size 100000

# For extreme cases (50GB+ files)
python csvcdc.py extreme1.csv extreme2.csv --largefiles 1 --chunk-size 25000

# Balanced performance and memory
python csvcdc.py large1.csv large2.csv --largefiles 1 --chunk-size 250000
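
Under the hood, large file mode reads and hashes the files chunk by chunk instead of materializing them whole. The tool's internals aren't shown here, but a minimal sketch of that idea, assuming a recent Polars with read_csv_batched and rows keyed by the first column, looks like this (hash_rows_in_chunks is illustrative, not part of the csvcdc API):

import polars as pl
import xxhash

def hash_rows_in_chunks(path, chunk_size=500_000, separator=','):
    # Build a {first_column: row_hash} map without loading the whole file
    hashes = {}
    reader = pl.read_csv_batched(path, separator=separator, batch_size=chunk_size)
    while True:
        batches = reader.next_batches(1)  # pull one chunk at a time
        if not batches:
            break
        for df in batches:
            for row in df.iter_rows():
                line = separator.join(str(v) for v in row)
                hashes[str(row[0])] = xxhash.xxh64(line.encode()).intdigest()
    return hashes

Memory stays bounded by the chunk size because each batch is discarded once its hashes have been recorded.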

🐍 Python API Usage

Basic API Usage

from csvcdc import CSVCDC

# Create CDC instance
cdc = CSVCDC(separator=',', primary_key=[0])

# Compare files
result = cdc.compare('base.csv', 'delta.csv')

# Access results
print(f"Additions: {len(result.additions)}")
print(f"Modifications: {len(result.modifications)}")
print(f"Deletions: {len(result.deletions)}")

# Process individual changes
for addition in result.additions:
    print(f"Added: {addition}")

for modification in result.modifications:
    print(f"Changed from: {modification['Original']}")
    print(f"Changed to: {modification['Current']}")

for deletion in result.deletions:
    print(f"Deleted: {deletion}")

Large File API Usage

from csvcdc import CSVCDC

# Large file configuration
cdc = CSVCDC(
    separator=',',
    primary_key=[0],
    largefiles=1,  # Enable chunked processing
    chunk_size=100000,  # Process 100k rows at a time
    progressbar=1
)

# Compare large files
result = cdc.compare('huge_base.csv', 'huge_delta.csv')

# Process results normally
print(f"Found {len(result.additions)} additions")
print(f"Found {len(result.modifications)} modifications")
print(f"Found {len(result.deletions)} deletions")

Advanced API Usage

from csvcdc import CSVCDC, OutputFormatter

# Advanced configuration with large file support
cdc = CSVCDC(
    separator=',',
    primary_key=[0, 1],  # Composite primary key
    ignore_columns=[3, 4],  # Ignore columns 3 and 4
    progressbar=1,
    autopk=0,
    largefiles=1,  # Enable for large files
    chunk_size=200000  # Custom chunk size
)

# Compare files
result = cdc.compare('data/products_old.csv', 'data/products_new.csv')

# Use different formatters
diff_output = OutputFormatter.format_diff(result)
json_output = OutputFormatter.format_json(result)
rowmark_output = OutputFormatter.format_rowmark(result)

print("Diff format:")
print(diff_output)

# Save JSON output
with open('changes.json', 'w') as f:
    f.write(json_output)

Custom Processing

from csvcdc import CSVCDC
import json

def process_large_changes(base_file, delta_file):
    # Optimized for large files
    cdc = CSVCDC(
        autopk=1,  # Auto-detect primary key
        largefiles=1,  # Enable chunked processing
        chunk_size=150000,  # Custom chunk size
        progressbar=1
    )
    
    result = cdc.compare(base_file, delta_file)
    
    # Custom processing
    changes_summary = {
        'total_additions': len(result.additions),
        'total_modifications': len(result.modifications),
        'total_deletions': len(result.deletions),
        'change_rate': (len(result.additions) + len(result.modifications) + len(result.deletions)) / 100
    }
    
    # Process specific types of changes
    price_changes = []
    for mod in result.modifications:
        orig_parts = mod['Original'].split(',')
        curr_parts = mod['Current'].split(',')
        
        # Assuming price is in column 2
        if len(orig_parts) > 2 and len(curr_parts) > 2:
            try:
                old_price = float(orig_parts[2])
                new_price = float(curr_parts[2])
                if old_price != new_price:
                    price_changes.append({
                        'id': orig_parts[0],
                        'old_price': old_price,
                        'new_price': new_price,
                        'change': new_price - old_price
                    })
            except ValueError:
                pass
    
    changes_summary['price_changes'] = price_changes
    return changes_summary

# Usage
summary = process_large_changes('old_products.csv', 'new_products.csv')
print(json.dumps(summary, indent=2))

πŸ” Auto Primary Key Detection

The auto primary key detection feature analyzes your data to find the best column(s) to use as primary key:

# Enable auto-detection
cdc = CSVCDC(autopk=1)
result = cdc.compare('file1.csv', 'file2.csv')

# Auto-detection with large files
cdc = CSVCDC(autopk=1, largefiles=1)
result = cdc.compare('large_file1.csv', 'large_file2.csv')

The algorithm considers the following factors; a sketch of the scoring idea appears after the list:

  • Uniqueness: How unique values are in each column
  • Match Rate: How well values match between files
  • Composite Keys: Tests combinations of columns
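
The tool's exact scoring formula isn't documented here; the sketch below shows the general idea with hypothetical equal weights (score_candidate_key and the candidate list are illustrative, not part of the csvcdc API):

import polars as pl

def score_candidate_key(base, delta, cols):
    # Score a candidate key: fraction of distinct values in the base file,
    # combined with the fraction of base keys that also appear in the delta file
    base_keys = set(base[:, cols].iter_rows())
    delta_keys = set(delta[:, cols].iter_rows())
    uniqueness = len(base_keys) / max(base.height, 1)
    match_rate = len(base_keys & delta_keys) / max(len(base_keys), 1)
    return 0.5 * uniqueness + 0.5 * match_rate  # hypothetical equal weighting

base_df = pl.read_csv('base.csv')
delta_df = pl.read_csv('delta.csv')
candidates = [[0], [1], [0, 1]]  # single columns plus one combination
best = max(candidates, key=lambda c: score_candidate_key(base_df, delta_df, c))
print(f'Best candidate key: columns {best}')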

Example of Auto-Detection Output

Auto-detecting primary key...
Testing single columns: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 5/5
Testing column combinations: 100%|β–ˆβ–ˆβ–ˆ| 3/3
Auto-detected primary key: columns [0, 1] (score: 0.943)

πŸ“Š Performance Benchmarks

Performance comparison on different file sizes:

Small Files (< 100MB)

| Tool | Time | Memory |
| --- | --- | --- |
| csv-cdc | 12.3s | 150MB |
| Traditional diff | 45.2s | 400MB |
| Manual Python | 38.7s | 320MB |

Large Files (1GB+)

| Mode | File Size | Time | Peak Memory |
| --- | --- | --- | --- |
| Regular | 1GB | 45s | 2.1GB |
| Large File Mode | 1GB | 52s | 350MB |
| Large File Mode | 10GB | 8.5min | 450MB |
| Large File Mode | 50GB | 42min | 500MB |

Optimization Features

  1. Polars Integration: Ultra-fast CSV reading
  2. xxHash: High-speed hashing algorithm (see the sketch after this list)
  3. Vectorized Operations: NumPy-based processing
  4. Chunked Processing: Memory-efficient large file handling
  5. Progressive Loading: Streaming for huge files
  6. Garbage Collection: Automatic memory cleanup between chunks
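
For intuition about how the hashing pieces fit together, here is a simplified illustration (not the tool's actual internals) of classifying changes by comparing {primary_key: row_hash} maps built with xxHash:

import xxhash

def diff_hash_maps(base, delta):
    # Keys only in delta are additions; keys only in base are deletions;
    # shared keys whose row hashes differ are modifications
    additions = [k for k in delta if k not in base]
    deletions = [k for k in base if k not in delta]
    modifications = [k for k in base if k in delta and base[k] != delta[k]]
    return additions, modifications, deletions

h = lambda row: xxhash.xxh64(row.encode()).intdigest()  # fast 64-bit row digest
base = {'1': h('1,Widget,10.99,Tools'), '3': h('3,Book,15.99,Education')}
delta = {'1': h('1,Widget,12.99,Tools'), '4': h('4,Magazine,8.99,Education')}
print(diff_hash_maps(base, delta))  # (['4'], ['1'], ['3'])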

πŸ§ͺ Testing

Run the test suite:

# Install test dependencies
pip install pytest pytest-cov

# Run tests
pytest tests/

# Run with coverage
pytest --cov=csvcdc tests/

# Test large file functionality
pytest tests/test_large_files.py

🀝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Development Setup

git clone https://github.com/maurohkcba/csv-cdc.git
cd csv-cdc
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
pip install -e .

Running Tests

pytest tests/

πŸ“œ License

This project is licensed under the MIT License. See the LICENCE file for details.

πŸ› Issues and Support

πŸš€ Roadmap

  • Large file chunked processing
  • Memory optimization for huge datasets
  • Support for Excel files
  • Database output integration
  • Web UI interface
  • Docker containerization
  • Cloud storage support (S3, GCS, Azure)
  • Parallel processing for multi-core systems
  • Configuration file support
  • Scheduled comparison jobs

⭐ Star History

If you find this tool useful, please consider giving it a star!

πŸ“ˆ Changelog

See CHANGELOG.md for a list of changes and version history.
