A high-performance Python script for uploading multi-split image datasets to Hugging Face Hub with streaming processing, resume functionality, and incremental processing capabilities.
🔄 Resume Functionality: Never lose progress again! The script now automatically tracks completion at split and shard levels, allowing you to resume interrupted uploads exactly where they left off.
📊 Incremental Processing: New `--amount` flag lets you process only the first N classes, perfect for testing pipelines, building datasets incrementally, and managing large uploads in controlled batches.
🛡️ Enhanced Reliability: Combined resume + incremental processing makes this tool incredibly robust for datasets of any size, from small tests to massive production datasets.
The script expects a nested dataset structure where each class/card has its own directory:
data/
├── train/
│   ├── card-001/
│   │   ├── 0000.jpg
│   │   ├── 0001.jpg
│   │   └── 0002.jpg
│   └── card-002/
│       └── 0000.jpg
├── test/
│   └── card-001/
│       └── 0000.jpg
└── validation/
    ├── card-001/
    │   ├── 0000.jpg
    │   └── 0001.jpg
    └── card-002/
        └── 0000.jpg
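In other words, each split directory contains one folder per class/card, and every image inside that folder belongs to that class. As a minimal sketch of how such a tree could be walked into (image path, class name) records, here is an illustrative snippet; the function and variable names are assumptions, not the script's actual internals:

```python
from pathlib import Path

def scan_split(split_dir: Path) -> list[tuple[Path, str]]:
    """Collect (image_path, class_name) pairs from one split directory."""
    records = []
    # Each immediate subdirectory is one class/card; sort for stable label IDs.
    for class_dir in sorted(p for p in split_dir.iterdir() if p.is_dir()):
        for image_path in sorted(class_dir.glob("*.jpg")):
            records.append((image_path, class_dir.name))
    return records

data_root = Path("./data")
splits = {name: scan_split(data_root / name)
          for name in ("train", "test", "validation")
          if (data_root / name).is_dir()}
```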
- Streaming processing: Memory-efficient batch processing for all dataset sizes (see the sketch after this list)
- Multi-processing: Parallel image processing using all CPU cores
- Smart sharding: Automatic sharding based on file sizes for optimal Hub performance
- Resume functionality: Continue interrupted uploads from where they left off
- Incremental processing: Process only the first N classes with the `--amount` flag for testing and incremental builds
- Comprehensive logging: Detailed progress tracking and system resource monitoring
- Robust error handling: Graceful handling of corrupted images and processing errors
- Works with any size: From 100 images to 500k+ images efficiently
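To make the streaming and multi-processing claims concrete, here is a hedged sketch of the general pattern (batching plus a worker pool); the helper names and batch handling are illustrative assumptions, not the script's actual code:

```python
from multiprocessing import Pool
from pathlib import Path

from PIL import Image  # assumed to be listed in requirements.txt

def load_image(path: str):
    """Decode one image; return None for corrupted files instead of crashing."""
    try:
        with Image.open(path) as img:
            return img.convert("RGB")
    except Exception:
        return None

def iter_batches(paths, batch_size=1000):
    """Yield fixed-size batches so only one batch is held in memory at a time."""
    for start in range(0, len(paths), batch_size):
        yield paths[start:start + batch_size]

if __name__ == "__main__":
    paths = sorted(str(p) for p in Path("./data/train").rglob("*.jpg"))
    with Pool() as pool:  # defaults to all CPU cores
        for batch in iter_batches(paths, batch_size=1000):
            images = [img for img in pool.map(load_image, batch) if img is not None]
            # ...convert the batch to records and append to the current shard...
```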
# Create and activate virtual environment
python -m venv ojo
# On Windows: ojo\Scripts\activate
# On Linux/macOS: source ojo/bin/activate
# Install dependencies
pip install -r requirements.txt
# Login to Hugging Face
huggingface-cli login
python upload_dataset.py ./data your-username/dataset-name
python upload_dataset.py ./data your-username/dataset-name \
--batch-size 2000 \
--max-shard-size-mb 500 \
--num-workers 8 \
--token your_hf_token \
--resume \
--amount 100
| Argument | Required | Default | Description |
|---|---|---|---|
| `data_path` | ✅ | - | Path to the data directory |
| `repo_id` | ✅ | - | Repository ID (e.g., `username/dataset-name`) |
| `--token` | ❌ | None | Hugging Face token |
| `--output-dir` | ❌ | Temp | Directory to save the processed dataset |
| `--batch-size` | ❌ | 1000 | Images per processing batch |
| `--max-shard-size-mb` | ❌ | 420 | Maximum shard size in MB |
| `--num-workers` | ❌ | Auto | Number of worker processes |
| `--resume` | ❌ | False | Resume from a previous incomplete run |
| `--amount` | ❌ | All | Maximum number of classes/cards to process |
from upload_dataset import upload_dataset

# Basic usage
upload_dataset(
    data_path="./data",
    repo_id="your-username/dataset-name"
)

# Advanced usage
upload_dataset(
    data_path="./data",
    repo_id="your-username/large-dataset",
    batch_size=2000,
    max_shard_size_mb=500,
    num_workers=8,
    resume=True,
    amount=100
)
The script generates:
- Sharded parquet files - `split-00000-of-00010.parquet` format for optimal Hub loading (see the sketch below)
- Dataset card - Comprehensive README.md with statistics and metadata
- Label mappings - JSON file mapping label IDs to class names
- Progress tracking - `progress.json` file for resume functionality (local only, not uploaded)
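For illustration only, this sketch shows how the shard names and a label-mapping file could be produced; the exact filenames and JSON layout used by the script may differ:

```python
import json

# Hypothetical layout: label IDs assigned from alphabetically sorted class names.
class_names = sorted(["card-002", "card-001"])
label_map = {i: name for i, name in enumerate(class_names)}
with open("label_mappings.json", "w") as f:
    json.dump(label_map, f, indent=2)

# Shard files follow the split-XXXXX-of-XXXXX.parquet pattern, e.g. 10 train shards:
num_shards = 10
shard_names = [f"train-{i:05d}-of-{num_shards:05d}.parquet" for i in range(num_shards)]
print(shard_names[0])  # train-00000-of-00010.parquet
```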
The script now supports resuming interrupted uploads, making it much more reliable for large datasets:
- Automatically tracks progress at the split and shard level
- Skips already completed shards when resuming
- Validates existing files before skipping them
- Handles configuration changes between runs
# Start initial upload
python upload_dataset.py ./large-data username/my-dataset --batch-size 2000
# If interrupted, resume from where it left off
python upload_dataset.py ./large-data username/my-dataset --batch-size 2000 --resume
The script creates a `progress.json` file in the output directory that tracks:
- Completed splits and shards
- Configuration hash for validation
- Upload completion status
- Timestamps for debugging
Note: The progress.json file is automatically excluded from the Hugging Face upload and remains local only for resume functionality.
- Validates configuration hasn't changed between runs
- Confirms existing shard files are complete and valid
- Prompts user if configuration changes are detected
- Preserves progress file even after successful completion
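A rough sketch of what this resume check could look like follows; the `progress.json` field names and the hashing scheme here are assumptions for illustration, not the file's documented schema:

```python
import hashlib
import json
from pathlib import Path

def config_hash(config: dict) -> str:
    """Stable hash of the (JSON-serializable) run configuration, used to detect changes."""
    return hashlib.sha256(json.dumps(config, sort_keys=True).encode()).hexdigest()

def load_completed_shards(output_dir: Path, config: dict) -> set[str]:
    """Return shard names recorded as complete, if the saved config still matches."""
    progress_file = output_dir / "progress.json"
    if not progress_file.exists():
        return set()
    progress = json.loads(progress_file.read_text())
    if progress.get("config_hash") != config_hash(config):
        return set()  # configuration changed; the script would prompt the user here
    # Only trust shards whose files still exist on disk.
    return {name for name in progress.get("completed_shards", [])
            if (output_dir / name).exists()}
```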
Process only a subset of classes for testing or incremental dataset building:
- Testing: Validate your pipeline with just a few classes first
- Incremental builds: Add classes to your dataset progressively
- Resource management: Process large datasets in smaller chunks
- Quick validation: Check data quality with a sample
# Process only the first 10 classes (alphabetically sorted)
python upload_dataset.py ./data username/dataset-name --amount 10
# Test with 5 classes first, then expand
python upload_dataset.py ./data username/test-dataset --amount 5
python upload_dataset.py ./data username/full-dataset --amount 50
- Classes are processed alphabetically for consistent ordering
- Applied across all splits (train/validation/test)
- Perfect for incremental dataset expansion
- Compatible with resume functionality
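As an illustration of the alphabetical first-N selection (a sketch under the assumption that class folders are simply sorted and sliced; the script's internals may differ):

```python
from pathlib import Path

def select_classes(split_dir: Path, amount=None):
    """Pick the class directories to process: all of them, or the first N alphabetically."""
    classes = sorted(p.name for p in split_dir.iterdir() if p.is_dir())
    return classes if amount is None else classes[:amount]

# The same selection would be applied to train/, test/ and validation/
# so the label set stays consistent across splits.
print(select_classes(Path("./data/train"), amount=10))
```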
# 1. Test with 10 classes first
python upload_dataset.py ./magic-cards username/magic-test --amount 10
# 2. Expand to 100 classes
python upload_dataset.py ./magic-cards username/magic-100 --amount 100
# 3. Finally process all classes
python upload_dataset.py ./magic-cards username/magic-complete
- Increase `--batch-size` to 2000 for better throughput
- Use `--max-shard-size-mb 500` to reduce shard count (see the estimate sketch after this list)
- Ensure SSD storage for faster I/O
- Use `--resume` for reliability during long uploads
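As a back-of-the-envelope way to see how `--max-shard-size-mb` affects shard count (purely illustrative arithmetic, not the script's sizing logic):

```python
import math
from pathlib import Path

def estimate_shard_count(data_dir: str, max_shard_size_mb: int) -> int:
    """Rough shard count: total image size divided by the shard size limit."""
    total_bytes = sum(p.stat().st_size for p in Path(data_dir).rglob("*.jpg"))
    return max(1, math.ceil(total_bytes / (max_shard_size_mb * 1024 * 1024)))

# e.g. a 10 GB dataset needs about 25 shards at the default 420 MB limit,
# but only about 21 shards at 500 MB.
print(estimate_shard_count("./data", 420))
```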
- RAM: 4GB minimum (streaming uses minimal memory)
- CPU: Multi-core recommended for parallel processing
- Storage: SSD recommended for better I/O performance
# Magic card dataset with 50k images
python upload_dataset.py ./magic-data acidtib/magic-cards \
--batch-size 2000 \
--max-shard-size-mb 500
# Test with first 20 classes only
python upload_dataset.py ./magic-data acidtib/magic-test \
--amount 20
# Small test dataset
python upload_dataset.py ./test-data acidtib/test-cards
# Resume interrupted upload with amount limit
python upload_dataset.py ./magic-data acidtib/magic-cards \
--batch-size 2000 \
--amount 100 \
--resume
Memory Issues: Reduce `--batch-size` to 500-1000
Upload Failures: Check internet connection and HF authentication. Use `--resume` to continue.
Processing Errors: Verify directory structure and image file integrity
Resume Issues:
- Ensure same output directory is used when resuming
- Check that `progress.json` exists in the output directory
- Configuration changes will prompt for confirmation
Corrupted Progress: Delete `progress.json` and start fresh if needed
The script automatically handles dataset processing, sharding, and uploading with detailed progress information throughout the process.