A unified data-ingestion CLI that auto-detects and converts text, image, audio and tabular sources into standardized training datasets with schema validation, sampling, and augmentation capabilities.
- Multi-modal Data Detection: Automatically detects and processes text, image, audio, and tabular data formats
- Schema Validation: Validates output datasets against custom or default schemas
- Data Augmentation: Built-in augmentation techniques for each data type
- Flexible Sampling: Control dataset size with a sampling ratio between 0 and 1
- Multiple Output Formats: Export to JSON, JSONL, or CSV formats
- Batch Processing: Processes large datasets in batches of configurable size (default: 100)
- Configuration Management: Customize the processing pipeline through a unimodaly.config.json file
- Comprehensive Metadata: Rich metadata and feature extraction for each data type
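As a rough illustration of how extension-based detection could work, the sketch below maps the file extensions listed later in this README to the four data types. It is not the tool's actual implementation: `EXTENSIONS` and `detect_types` are hypothetical names. Because `.json` appears under both text and tabular data, the function returns every matching type rather than guessing.

```python
from pathlib import Path

# Extension sets copied from the supported-formats lists in this README.
EXTENSIONS = {
    "text": {".txt", ".md", ".json", ".xml", ".html"},
    "image": {".jpg", ".jpeg", ".png", ".gif", ".webp", ".svg", ".bmp", ".tiff"},
    "audio": {".mp3", ".wav", ".flac", ".ogg", ".m4a", ".aac"},
    "tabular": {".csv", ".tsv", ".xlsx", ".json"},
}


def detect_types(path):
    """Return the sorted list of data types whose extension set contains the file suffix."""
    suffix = Path(path).suffix.lower()
    return sorted(name for name, exts in EXTENSIONS.items() if suffix in exts)


print(detect_types("photo.JPG"))
print(detect_types("data.json"))
```

An ambiguous result such as the one for `data.json` would need a second step (content inspection, or an explicit `--type` flag) to resolve.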
npm install -g unimodaly-ingest
# Process all data in a directory
unimodaly-ingest ingest ./data --output ./processed
# Process specific data types with augmentation
unimodaly-ingest ingest ./images --type image --augment --output ./processed
# Sample 50% of data and export to CSV
unimodaly-ingest ingest ./data --sample 0.5 --format csv
# Initialize configuration
unimodaly-ingest config --init
Text formats: .txt, .md, .json, .xml, .html
- Encoding detection and validation
- Language detection
- Text augmentation (synonym replacement and random operations such as insertion)
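To make the two augmentation techniques named above concrete, here is a minimal Python sketch. It assumes that the rates are per-word probabilities or proportions, which this README does not confirm. The `SYNONYMS` table and both function names are hypothetical and only serve the illustration.

```python
import random

# Hypothetical toy synonym table; a real pipeline would use a proper lexical resource.
SYNONYMS = {"quick": ["fast", "rapid"], "happy": ["glad", "cheerful"]}


def synonym_replace(words, rate, rng):
    """Replace each word that has a known synonym with probability `rate`."""
    out = []
    for word in words:
        options = SYNONYMS.get(word.lower())
        if options and rng.random() < rate:
            out.append(rng.choice(options))
        else:
            out.append(word)
    return out


def random_insert(words, rate, rng):
    """Insert round(rate * len(words)) copies of existing words at random positions."""
    out = list(words)
    if not out:
        return out
    for _ in range(round(rate * len(words))):
        out.insert(rng.randrange(len(out) + 1), rng.choice(words))
    return out


rng = random.Random(0)  # fixed seed so repeated runs give the same result
words = "the quick dog is happy today".split()
augmented = random_insert(synonym_replace(words, 0.5, rng), 0.5, rng)
print(" ".join(augmented))
```

Passing an explicit `random.Random` instance keeps augmentation reproducible, which matters when a generated dataset has to be rebuilt later.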
Image formats: .jpg, .jpeg, .png, .gif, .webp, .svg, .bmp, .tiff
- Metadata extraction (dimensions, color space, etc.)
- Feature extraction (intensity statistics, aspect ratio)
- Image augmentation (rotation, brightness, contrast, flipping)
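The feature and augmentation items above can be sketched on a plain grayscale pixel grid. This is an illustrative Python example, not the tool's code. The feature key names (`aspectRatio`, `meanIntensity`, and so on) are assumptions, as is treating a brightness setting as a multiplicative factor.

```python
from statistics import mean, pstdev


def image_features(pixels):
    """Compute aspect ratio and intensity statistics for a grayscale image.

    `pixels` is a non-empty list of equal-length rows of 0-255 intensities.
    """
    height, width = len(pixels), len(pixels[0])
    flat = [value for row in pixels for value in row]
    return {
        "width": width,
        "height": height,
        "aspectRatio": width / height,  # hypothetical key names
        "meanIntensity": mean(flat),
        "stdIntensity": pstdev(flat),
        "minIntensity": min(flat),
        "maxIntensity": max(flat),
    }


def flip_horizontal(pixels):
    """Mirror each row left to right."""
    return [list(reversed(row)) for row in pixels]


def adjust_brightness(pixels, delta):
    """Scale intensities by (1 + delta) and clamp the result to 0-255."""
    return [[max(0, min(255, round(v * (1 + delta)))) for v in row] for row in pixels]


image = [[0, 100, 200], [50, 150, 250]]
print(image_features(image))
print(flip_horizontal(image))
print(adjust_brightness(image, 0.2))
```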
Audio formats: .mp3, .wav, .flac, .ogg, .m4a, .aac
- Audio metadata extraction
- Duration, sample rate, channel analysis
- Audio augmentation capabilities
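For a sense of what duration, sample rate, and channel analysis involve, the following Python sketch reads those values from an uncompressed WAV file with the standard-library `wave` module. It covers WAV only; compressed formats such as .mp3 or .flac need a decoder that the standard library does not provide. The function name and the returned key names are hypothetical.

```python
import os
import tempfile
import wave


def wav_metadata(path):
    """Read sample rate, channel count, sample width, and duration from a WAV file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "sampleRate": rate,
            "channels": wav.getnchannels(),
            "sampleWidthBytes": wav.getsampwidth(),
            "durationSeconds": frames / rate,
        }


# Demo: write one second of 16-bit stereo silence, then read it back.
demo_path = os.path.join(tempfile.mkdtemp(), "silence.wav")
with wave.open(demo_path, "wb") as out:
    out.setnchannels(2)
    out.setsampwidth(2)  # bytes per sample
    out.setframerate(8000)
    out.writeframes(b"\x00" * (8000 * 2 * 2))  # frames * channels * bytes per sample

meta = wav_metadata(demo_path)
print(meta)
```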
Tabular formats: .csv, .tsv, .xlsx, .json
- Schema inference
- Statistical analysis
- Data type detection
- Duplicate and null value analysis
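The tabular checks listed above (schema inference, data type detection, duplicate and null analysis) can be approximated in a few lines. The sketch below is a simplified Python illustration for CSV text only. The type names, the treatment of empty strings as nulls, and the function names are all assumptions.

```python
import csv
import io


def infer_type(values):
    """Infer a column type from its non-empty string values."""
    present = [v for v in values if v != ""]
    if not present:
        return "empty"
    for name, cast in (("integer", int), ("float", float)):
        try:
            for v in present:
                cast(v)
            return name
        except ValueError:
            continue
    return "string"


def analyze_csv(text):
    """Return inferred column types, per-column null counts, and the duplicate row count."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, [])
    # Pad or trim each row to the header width so ragged rows do not break the analysis.
    rows = [(row + [""] * len(header))[: len(header)] for row in reader if row]
    columns = list(zip(*rows)) if rows else [() for _ in header]
    return {
        "rows": len(rows),
        "schema": {name: infer_type(col) for name, col in zip(header, columns)},
        "nulls": {name: sum(v == "" for v in col) for name, col in zip(header, columns)},
        "duplicates": len(rows) - len({tuple(r) for r in rows}),
    }


sample = "id,score,name\n1,0.5,ann\n2,,bob\n1,0.5,ann\n"
report = analyze_csv(sample)
print(report)
```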
Main command for processing data sources.
unimodaly-ingest ingest <input> [options]
Options:
- -o, --output <path> - Output directory (default: ./output)
- -f, --format <format> - Output format: json, jsonl, csv (default: json)
- -s, --sample <ratio> - Sampling ratio 0-1 (default: 1.0)
- -a, --augment - Enable data augmentation
- --schema <path> - Custom schema validation file
- --config <path> - Configuration file path
- -v, --verbose - Verbose output
- -t, --type <types...> - Specific data types: text, image, audio, tabular
- --batch-size <size> - Batch processing size (default: 100)
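The `--sample` and `--batch-size` options can be pictured with the Python sketch below. It assumes sampling keeps a fixed fraction of the inputs chosen uniformly at random, and that batching splits the remainder into consecutive groups. Whether the CLI works exactly this way is not stated here, and the function names are hypothetical.

```python
import random


def sample_items(items, ratio, seed=None):
    """Keep round(ratio * len(items)) items chosen at random, preserving input order."""
    if not 0 <= ratio <= 1:
        raise ValueError("ratio must be between 0 and 1")
    count = round(ratio * len(items))
    chosen = sorted(random.Random(seed).sample(range(len(items)), count))
    return [items[i] for i in chosen]


def batches(items, size=100):
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


files = [f"file_{i}.txt" for i in range(250)]
kept = sample_items(files, 0.5, seed=42)
sizes = [len(batch) for batch in batches(kept, size=100)]
print(len(kept), sizes)
```

With 250 inputs and a ratio of 0.5, 125 files are kept and processed as one batch of 100 followed by one of 25.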
Manage configuration settings.
unimodaly-ingest config [options]
Options:
- --init - Initialize default configuration
- --show - Show current configuration
- --set <key=value> - Set configuration value
Validate dataset against schema.
unimodaly-ingest validate <dataset> [options]
Options:
- --schema <path> - Schema file path
Initialize a configuration file to customize processing behavior:
unimodaly-ingest config --init
This creates unimodaly.config.json with settings for:
- Data type specific processing options
- Augmentation parameters
- Output formats and compression
- Performance settings
- Schema validation rules
Example configuration:
{
"text": {
"encoding": "utf8",
"maxSize": "10MB",
"augmentation": {
"enabled": false,
"synonymReplacement": 0.1,
"randomInsertion": 0.1
}
},
"image": {
"maxSize": "50MB",
"augmentation": {
"enabled": false,
"rotation": 15,
"brightness": 0.2,
"flip": true
}
}
}
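A consumer of a configuration like this needs to interpret size strings such as "10MB" and to layer user settings over defaults. The Python sketch below shows one way to do both. Reading MB as 1024 * 1024 bytes is an assumption, and `parse_size` and `deep_merge` are hypothetical helpers rather than part of the tool.

```python
import json
import re

# Assumption: units are binary multiples (1 MB = 1024 * 1024 bytes).
UNITS = {"B": 1, "KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}


def parse_size(text):
    """Convert a size string such as "10MB" into a byte count."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([KMG]?B)\s*", text.upper())
    if not match:
        raise ValueError(f"unrecognized size: {text!r}")
    number, unit = match.groups()
    return int(float(number) * UNITS[unit])


def deep_merge(defaults, overrides):
    """Return a copy of `defaults` with `overrides` applied recursively to nested dicts."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


defaults = json.loads(
    '{"text": {"maxSize": "10MB", "augmentation": {"enabled": false, "synonymReplacement": 0.1}}}'
)
user = {"text": {"augmentation": {"enabled": True}}}
config = deep_merge(defaults, user)
print(config)
print(parse_size(config["text"]["maxSize"]))
```

Merging recursively means a user file only has to name the values it changes, and the untouched defaults survive.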
The CLI generates standardized datasets with rich metadata:
[
{
"type": "text",
"source": "/path/to/file.txt",
"timestamp": "2025-01-27T10:30:00.000Z",
"content": "processed content...",
"metadata": {
"originalLength": 1500,
"fileSize": 1024,
"lines": 25,
"words": 200
},
"features": {
"wordCount": 200,
"sentenceCount": 12,
"language": "en"
}
}
]
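To show how a record of this shape could be produced and written in the three output formats, here is a hedged Python sketch. The counting rules (whitespace-separated words, sentences split on ".", "!" and "?") and the choice to store nested objects as JSON strings in CSV are assumptions. Language detection is left out because it needs an external library.

```python
import csv
import io
import json
import re
from datetime import datetime, timezone


def build_text_record(source, content):
    """Assemble a record shaped like the example above from raw text."""
    words = content.split()
    sentences = [s for s in re.split(r"[.!?]+", content) if s.strip()]
    stamp = datetime.now(timezone.utc).isoformat(timespec="milliseconds")
    return {
        "type": "text",
        "source": source,
        "timestamp": stamp.replace("+00:00", "Z"),
        "content": content,
        "metadata": {
            "originalLength": len(content),
            "fileSize": len(content.encode("utf-8")),
            "lines": len(content.splitlines()),
            "words": len(words),
        },
        "features": {"wordCount": len(words), "sentenceCount": len(sentences)},
    }


def serialize(records, fmt):
    """Render records as a JSON array, JSON Lines, or CSV text."""
    if fmt == "json":
        return json.dumps(records, indent=2)
    if fmt == "jsonl":
        return "\n".join(json.dumps(r) for r in records)
    if fmt == "csv":
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=list(records[0]) if records else [])
        writer.writeheader()
        for r in records:
            # CSV is flat, so nested objects are stored as JSON strings (an assumption).
            writer.writerow(
                {k: json.dumps(v) if isinstance(v, (dict, list)) else v for k, v in r.items()}
            )
        return buffer.getvalue()
    raise ValueError(f"unsupported format: {fmt}")


record = build_text_record("/path/to/file.txt", "Hello world. How are you?\nFine!")
print(serialize([record], "jsonl"))
```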
Define custom schemas for validation:
{
"type": "array",
"items": {
"type": "object",
"required": ["type", "source", "content"],
"properties": {
"type": {
"type": "string",
"enum": ["text", "image", "audio", "tabular"]
},
"source": {
"type": "string"
},
"content": {
"type": ["string", "object"]
}
}
}
}
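The schema above uses only a handful of JSON Schema keywords (`type`, `items`, `required`, `properties`, `enum`), so its meaning can be shown with a small hand-written checker. The Python sketch below handles just that subset and is not the validator the CLI uses. A complete implementation would rely on a full JSON Schema library.

```python
TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    "object": lambda v: isinstance(v, dict),
    "array": lambda v: isinstance(v, list),
    "boolean": lambda v: isinstance(v, bool),
    "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "null": lambda v: v is None,
}


def validate(value, schema, path="$"):
    """Return a list of error messages; an empty list means the value conforms."""
    expected = schema.get("type")
    if expected is not None:
        names = expected if isinstance(expected, list) else [expected]
        if not any(TYPE_CHECKS[name](value) for name in names):
            return [f"{path}: expected type {' or '.join(names)}"]
    errors = []
    if "enum" in schema and value not in schema["enum"]:
        errors.append(f"{path}: {value!r} is not one of {schema['enum']}")
    if isinstance(value, dict):
        for key in schema.get("required", []):
            if key not in value:
                errors.append(f"{path}: missing required property '{key}'")
        for key, sub in schema.get("properties", {}).items():
            if key in value:
                errors.extend(validate(value[key], sub, f"{path}.{key}"))
    if isinstance(value, list) and "items" in schema:
        for index, item in enumerate(value):
            errors.extend(validate(item, schema["items"], f"{path}[{index}]"))
    return errors


SCHEMA = {
    "type": "array",
    "items": {
        "type": "object",
        "required": ["type", "source", "content"],
        "properties": {
            "type": {"type": "string", "enum": ["text", "image", "audio", "tabular"]},
            "source": {"type": "string"},
            "content": {"type": ["string", "object"]},
        },
    },
}

good = [{"type": "text", "source": "a.txt", "content": "hi"}]
bad = [{"type": "video", "source": "b.mp4"}]
print(validate(good, SCHEMA))
print(validate(bad, SCHEMA))
```

The second dataset fails twice: its `type` is outside the allowed enum and the required `content` property is missing.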
unimodaly-ingest ingest ./media_folder \
--output ./datasets \
--format json \
--augment \
--sample 0.8 \
--verbose
unimodaly-ingest ingest ./documents \
--type text \
--schema ./text_schema.json \
--output ./text_dataset \
--format jsonl
unimodaly-ingest ingest ./images \
--type image \
--augment \
--batch-size 50 \
--output ./image_dataset
License: MIT