
🖼️ Image Dataset Preparation Tools


This project provides several practical tools for image dataset preparation. The scripts are designed for pre-processing datasets before AI training; I use them specifically for Stable Diffusion LoRA training.

Tip

Add this project's /src/ directory to your PATH to execute scripts from anywhere and process files in your current working directory.

Caution

All tools overwrite original files. Always back up data first!


📝 Overview

  • process_txt_files.zsh: Batch cleans and standardizes all .txt tag files in the current working directory, removing noise tags and unifying their format around a trigger word and an optional class word. Also supports preserving specific tags from alias conversion and removal.
  • resize_images.zsh: Automatically resizes all images in the current working directory so the short side is 1024px, skipping images that are already smaller than 1024px.
  • fetch_tags.py: Fetches tags from Danbooru (with a Gelbooru fallback) using the MD5 hash embedded in each image filename in the current working directory, writing them to a corresponding .txt file.
  • validate_dataset.zsh: Validates image dataset completeness and quality by checking image files and corresponding tag files.
  • scrape_danbooru_aliases.zsh: Scrapes all Danbooru tag aliases from the API and saves them to a CSV file for dataset tag normalization.

🛠️ Tool Usage & Requirements

⚙️ Setup

First, add the /src/ directory of this project to your PATH so you can run scripts from anywhere:

# Add to your shell configuration file (.bashrc, .zshrc, etc.)
export PATH="/path/to/image-dataset-prep-tools/src:$PATH"

After setup, navigate to any directory containing your dataset files and run the scripts directly.

💡 Dependencies

  • All zsh scripts require the zsh shell.
  • resize_images.zsh requires ImageMagick.
  • validate_dataset.zsh requires ImageMagick; czkawka_cli is optional (used for similar-image detection).
  • fetch_tags.py requires Python 3.12+ and the requests package; running it with uv run is recommended.
  • scrape_danbooru_aliases.zsh requires curl, jq, and bc.

1️⃣ process_txt_files.zsh

Requirements:

  • zsh shell

Function:

  • Batch processes all .txt tag files in the current working directory: cleans content based on the trigger word and optional class word, removes noise tags, and prepends the appropriate prefix to each line.
  • Supports both single trigger word and trigger + class word formats for enhanced dataset labeling control.
  • Provides tag preservation functionality to protect specific tags from alias conversion and removal.
  • Automatically applies Danbooru tag aliases from data/danbooru_tag_aliases.csv to standardize tag names.
  • Removes duplicate tags from each file after alias processing.

Caution

Original files will be overwritten. Back up files first!

Usage:

# Navigate to your dataset directory first
cd /path/to/your/dataset

# Auto-detect trigger (and class word) from directory name
process_txt_files.zsh

# Or specify trigger word manually (class word will be empty)
process_txt_files.zsh "my_trigger"

# Preserve specific tags from alias conversion
process_txt_files.zsh "my_trigger" -p "iris_(character)"
process_txt_files.zsh "my_trigger" --preserve "iris,hydrangeas"

Directory Name Formats:

The script supports the following directory naming formats for auto-detection:

  1. Single trigger word: 1_hydrangea → trigger: "hydrangea", no class word

    • Output format: "hydrangea, {processed_content}"
  2. Trigger + class word: 1_hydrangea flower → trigger: "hydrangea", class: "flower"

    • Output format: "flower, hydrangea, {processed_content}"
  3. Three or more words: 1_hydrangea flower plant → uses first two words

    • trigger: "hydrangea", class: "flower"
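
A minimal zsh sketch of this auto-detection, assuming directory names follow the "{number}_{trigger}[ {class}]" pattern above (illustrative only; the script's actual parsing may differ):

# Derive trigger and class word from the current directory name
dir="${PWD:t}"            # e.g. "1_hydrangea flower"
name="${dir#*_}"          # drop the numeric prefix -> "hydrangea flower"
words=(${=name})          # split on whitespace
trigger="${words[1]}"     # "hydrangea"
class="${words[2]:-}"     # "flower" (empty when only one word is present; extra words are ignored)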

Examples:

# Example 1: Single trigger word
# Directory: 3_hydrangea
# Input content: "blue_flower, nature, garden, flower_crown"
# Output: "hydrangea, blue_flower, nature, garden, head_wreath"

# Example 2: Trigger + class word  
# Directory: 3_hydrangea flower
# Input content: "blue_flower, nature, garden, flower_crown"  
# Output: "flower, hydrangea, blue_flower, nature, garden, head_wreath"
# Note: Standalone "flower" removed, compound "blue_flower" preserved

Tag Preservation

Use -p or --preserve to protect specific tags from Danbooru alias conversion and removal:

# Preserve specific tag variations (short form)
process_txt_files.zsh cornflower flower -p iris_(character)

# Preserve specific tag variations (long form)
process_txt_files.zsh cornflower flower --preserve iris_(character)

Processing details:

  • Replaces all ( with \( and ) with \) (except for preserved tags).
  • Removes standalone trigger and class keywords while preserving compound words (e.g., keeps blue_flower when class word is flower).
  • Preserved tags (-p/--preserve) are protected from alias conversion and removal.
  • Removes commentary/commission-related noise tags.
  • Cleans up redundant commas and spaces.
  • Applies Danbooru tag aliases to standardize tag names (except preserved tags).
  • Removes duplicate tags from each file after alias processing.
  • Prepends appropriate prefix based on whether class word exists:
    • With class word: "class_word, trigger_word, {content}"
    • Without class word: "trigger_word, {content}"
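
A condensed zsh sketch of how one tag line could be cleaned, reusing the trigger and class variables from the detection sketch above (illustrative ordering only; the real script also applies alias conversion, noise-tag removal, and -p/--preserve protection):

line="blue_flower,flower,nature,nature,flower_crown"
line="${line//\(/\\(}"; line="${line//\)/\\)}"      # escape ( and )
tags=("${(@s:,:)line}")                             # split on commas
tags=(${tags:#$trigger}); tags=(${tags:#$class})    # drop standalone trigger/class words, keep compounds like blue_flower
tags=(${(u)tags})                                   # remove duplicate tags
print -r -- "${class:+$class, }${trigger}, ${(j:, :)tags}"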

2️⃣ resize_images.zsh

Requirements:

  • zsh shell
  • ImageMagick

Function:

  • Resizes all .jpg and .png images in the current working directory so the short side is 1024px, keeping aspect ratio.
  • Images with any side smaller than 1024px are skipped.

Caution

Original files will be overwritten. Back up files first!

Usage:

# Navigate to your dataset directory first
cd /path/to/your/dataset
resize_images.zsh

Processing details:

  • Automatically detects landscape or portrait orientation and resizes the short side.
  • Only processes .jpg and .png files.
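
A minimal sketch of the resizing rule using ImageMagick (illustrative; the actual script may differ in details):

for img in *.jpg(N) *.png(N); do
  w=$(magick identify -format '%w' "$img")
  h=$(magick identify -format '%h' "$img")
  (( w < 1024 || h < 1024 )) && continue    # already smaller than 1024px: skip
  if (( w <= h )); then
    magick "$img" -resize 1024x "$img"      # portrait: width is the short side
  else
    magick "$img" -resize x1024 "$img"      # landscape: height is the short side
  fi
done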

3️⃣ fetch_tags.py

Requirements:

  • Python 3.12+
  • requests
    • If you use uv run, all requirements are managed automatically; no manual installation is needed.
    • If you do not use uv, you must manually install dependencies with pip install requests.

Function:

  • Scans the current working directory for images named {id}_{md5}.{ext} and fetches tags from Danbooru by MD5. If not found, falls back to Gelbooru.
  • Tags are written to a .txt file with the same name as the image, comma-separated.

Usage:

# Navigate to your dataset directory first
cd /path/to/your/dataset
uv run fetch_tags.py

  • No extra parameters needed; just run the script with uv run.
  • Note: uv is recommended because it manages the Python dependencies automatically; see the requirements above for running without it.

Filename pattern:

  • Only processes files named {id}_{md5}.{ext} (supports jpg, jpeg, png, gif).
  • The generated tag file will have the same name as the image, with a .txt extension.

Notes:

  • 1-second delay between each image query to avoid being rate-limited.
  • If neither site returns tags, an error will be shown in the logs.
  • API rate limits: Fetching tags may encounter rate limiting; do not run the script in parallel.
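
The lookup is roughly equivalent to the following shell calls (a sketch based on the description above, using the public Danbooru and Gelbooru post APIs; this is not code taken from fetch_tags.py, which is implemented in Python with requests):

img="12345_0123456789abcdef0123456789abcdef.jpg"
md5="${${img:r}##*_}"      # strip the extension, keep the part after the last "_"
tags=$(curl -s "https://danbooru.donmai.us/posts.json?tags=md5:${md5}&limit=1" | jq -r '.[0].tag_string // empty')
if [[ -z "$tags" ]]; then  # fall back to Gelbooru when Danbooru has no match
  tags=$(curl -s "https://gelbooru.com/index.php?page=dapi&s=post&q=index&json=1&tags=md5:${md5}" | jq -r '.post[0].tags // empty')
fi
[[ -n "$tags" ]] && print -r -- "${tags// /, }" > "${img:r}.txt"   # space-separated tags -> comma-separated .txt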

4️⃣ validate_dataset.zsh

Requirements:

  • zsh shell
  • ImageMagick
  • Optional: czkawka_cli (for similar-image detection)

Function:

  • Validates image dataset completeness and quality by checking image files and corresponding tag files
  • Automatically extracts trigger word from directory path or accepts it as parameter
  • Detects duplicate tags within each .txt file using efficient comma-separated parsing
  • Provides comprehensive validation report with color-coded output

Usage:

# Navigate to your dataset directory first
cd /path/to/your/dataset

# Auto-detect trigger word from path
validate_dataset.zsh

# Or specify trigger word manually
validate_dataset.zsh "your_trigger_word"

Validation checks:

  • Image files have corresponding .txt files
  • Image dimensions are at least 500px on both sides
  • Trigger word is present in tag files
  • Tag count is between 5 and 100 per file
  • No duplicate tags within each file
  • No orphaned .txt files exist
  • Similar images detection (High similarity preset) - requires czkawka_cli
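
A rough zsh sketch of a few of these checks (illustrative only, not the actual implementation):

for img in *.jpg(N) *.png(N); do
  txt="${img:r}.txt"
  [[ -f "$txt" ]] || { print -r -- "ERROR: missing tag file for $img"; continue; }
  w=$(magick identify -format '%w' "$img"); h=$(magick identify -format '%h' "$img")
  (( w < 500 || h < 500 )) && print -r -- "ERROR: $img is below 500px on one side"
  tags=("${(@s:,:)$(<"$txt")}")              # split the tag file on commas
  (( ${#tags} < 5 || ${#tags} > 100 )) && print -r -- "WARNING: $txt has ${#tags} tags"
done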

Output colors:

  • Red: Errors that must be fixed
  • Yellow: Warnings that should be reviewed
  • Default: Informational messages
  • Gray: Verbose details

5️⃣ scrape_danbooru_aliases.zsh

Requirements:

  • zsh shell
  • curl for HTTP requests
  • jq for JSON parsing
  • bc for rate limiting calculations
  • Optional: DANBOORU_LOGIN and DANBOORU_APIKEY environment variables for authentication

Function:

  • Scrapes all Danbooru tag aliases from the API and saves them to a CSV file
  • Supports pagination to fetch the complete dataset (up to 1000 pages)
  • Data is sorted by tag count (most popular aliases first) for better relevance
  • Implements proper rate limiting (10 requests/second max)
  • CSV data validation: a page is accepted as long as the API returns valid JSON and the CSV conversion succeeds
  • Designed for danbooru.donmai.us, easily configurable for test environments
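
Each paginated request looks roughly like the following (a sketch; the jq column subset is illustrative, and authenticated runs would add the login and api_key parameters):

page=1
curl -s --get "https://danbooru.donmai.us/tag_aliases.json" \
  --data-urlencode "limit=1000" \
  --data-urlencode "page=${page}" \
  | jq -r '.[] | [.id, .antecedent_name, .consequent_name, .status] | @csv'
sleep 0.1   # stay under the 10 requests/second limit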

Usage:

# Navigate to your working directory
cd /path/to/your/workspace

# Optional: Set authentication credentials
export DANBOORU_LOGIN="your_username"
export DANBOORU_APIKEY="your_api_key"

# Run the scraper
scrape_danbooru_aliases.zsh

Output:

  • Creates data/ directory in current working directory
  • Generates CSV file: danbooru_tag_aliases.csv
  • Data sorted by tag count for better relevance (most popular aliases first)
  • Maximum 1000 pages to prevent excessive API usage
  • CSV columns: id, antecedent_name, consequent_name, creator_id, forum_topic_id, status, created_at, updated_at, approver_id, forum_post_id, reason

Safety:

  • Uses only GET requests (no DELETE or modification operations)
  • Implements strict rate limiting to comply with API limits (10 requests/second)
  • Authentication via environment variables only
  • Proper error handling for network issues and API errors

🧪 Testing

This project uses ShellSpec for comprehensive BDD testing of all zsh scripts.

Note

All test cases involving the magick command must mock magick to avoid failures on CI runners without ImageMagick installed.
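
A minimal illustration of such a mock using ShellSpec's Mock block (the spec content below is illustrative and not copied from this repository's tests):

Describe 'resize_images.zsh'
  Mock magick
    echo "magick $*"    # record the call instead of invoking the real ImageMagick
  End

  It 'runs without a real ImageMagick installation'
    When run zsh src/resize_images.zsh
    The status should be success
  End
End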

📋 Test Coverage

  • 🎯 Target: 75% minimum coverage for all zsh scripts
  • 🧪 Total Tests: 100+ examples across all scripts
  • 📊 Framework: ShellSpec with BDD approach
  • 🔄 CI/CD: Automated testing on every commit

🛠️ Quick Start

# Install ShellSpec
curl -fsSL https://git.io/shellspec | sh

# Run all tests
shellspec

# Install kcov for coverage reporting
# https://github.com/SimonKagstrom/kcov/blob/master/INSTALL.md

# Run with coverage
shellspec --kcov

📖 Writing Tests

For detailed guidelines on writing effective BDD tests for zsh scripts, see our comprehensive Testing Guideline.

🤝 Contributing

When adding new features:

  1. Write tests first (TDD approach)
  2. Follow our Testing Guideline
  3. Ensure 75%+ coverage
  4. Verify all existing tests pass

🤖 Automated Data Updates

This repository includes automated weekly updates for the Danbooru tag aliases dataset via GitHub Actions.

Workflow Features

  • Schedule: Runs every Sunday at 02:00 UTC
  • Branch Management: Uses ci/update-data branch for changes
  • Safe Operations: Atomic file updates with temporary file handling
  • Automated PRs: Creates pull requests for review before merging
  • Manual Trigger: Can be run manually via GitHub Actions UI

Automated Process

  1. Checks out or creates the ci/update-data branch
  2. Runs scrape_danbooru_aliases.zsh to fetch latest data
  3. Commits changes with meaningful commit messages
  4. Opens a pull request for review if changes are detected
  5. Includes detailed PR description with update information

The automation ensures the dataset stays current while maintaining proper review processes.


📜 License


GNU GENERAL PUBLIC LICENSE Version 3

Copyright (C) 2025 Jim Chen Jim@ChenJ.im.

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.
