Raw CSV → Training Dataset Converter (Tkinter)

A simple desktop app that converts a raw CSV into a training dataset format of your choice, with optional train/val/test splits and stratification.

Features

Browse to the input CSV or paste its path.
Choose output format from many exporters (see below).
Select output folder and base filename.
Optional column filtering (keep only specified columns).
Optional train/val/test split with optional stratification column.
Neural processing mode (optional) with multi-engine support for auto-labeling:
- Tasks: Detection or Classification
- Engines: Ultralytics YOLO (Det/Cls), TorchVision (Classification)
- Model path/name or preset (e.g., yolo11n.pt, yolov8n.pt, resnet18)
- Confidence threshold (YOLO)
- Overwrite existing labels toggle
- Download Models button to prefetch common weights for offline use, with a progress bar
Logging panel for progress and validation messages.
Preview first 50 rows of the CSV.
Settings auto-save to reload your last used paths/options.
About menu with version.

Requirements

Python 3.9+
Core packages (installed via requirements.txt):
- pandas
- pyarrow (for Parquet)
- Pillow (image IO for some exporters)

Optional extras (install only if you need the feature):

openpyxl — Excel (XLSX) export
ultralytics — Neural mode (YOLO detection/classification)
torch, torchvision — TorchVision Classification engine (CPU wheels recommended)
tensorflow — TFRecord export
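
To see which optional features are usable in the current environment, a small check like the one below may help. It is only a sketch: the feature-to-module mapping is copied from the list above, and the function names (`missing_modules`, `feature_report`) are hypothetical, not part of app.py.

```python
from importlib.util import find_spec

# Feature -> importable module names, taken from the optional extras list.
OPTIONAL_FEATURES = {
    "Excel (XLSX) export": ["openpyxl"],
    "Neural mode (YOLO)": ["ultralytics"],
    "TorchVision classification": ["torch", "torchvision"],
    "TFRecord export": ["tensorflow"],
}


def missing_modules(modules):
    """Return the names in `modules` that cannot be found for import."""
    return [name for name in modules if find_spec(name) is None]


def feature_report(features=OPTIONAL_FEATURES):
    """Map each feature to the list of modules still missing for it."""
    return {feature: missing_modules(mods) for feature, mods in features.items()}
```

Calling `feature_report()` returns a dict such as `{"TFRecord export": ["tensorflow"], ...}`, where an empty list means the feature's dependencies are installed.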

Install dependencies:

pip install -r requirements.txt

# Optional (CPU-only wheels for TorchVision via PyTorch index):
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

Run the app

python app.py

Fresh machine quickstart (Windows)

You can either run the portable EXE or run from source.

Run the EXE (no install):
- Build or download dist/DatasetConverter/DatasetConverter.exe
- Double-click to launch. If SmartScreen warns: click “More info” → “Run anyway”.

Run from source:

# 1) Install Python 3.10 or 3.11 (64-bit)

# 2) Create a virtual environment
py -m venv .venv
.\.venv\Scripts\activate
python -m pip install --upgrade pip

# 3) Install all dependencies
pip install -r requirements.txt
# Optional: on CPU-only machines, install the smaller CPU wheels from the PyTorch index
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

# 4) Run the app
python app.py

Notes:

Model weights (Ultralytics/TorchVision) download on first use. If the machine will be offline, prefetch them with the app's Download Models button while you are still online, or place the .pt files next to the EXE or in your working directory.
TensorFlow is only required for TFRecord export; if it fails to install on your Python/Windows version, remove it from requirements.txt and other features will still work.
Some systems may prompt you to install the Visual C++ runtime; accept the prompt once if it appears.

Supported output formats

Tabular: CSV, JSONL, Parquet, Feather, Excel (XLSX), SQLite
Detection: COCO (Detection), YOLO TXT (Detection), Pascal VOC (XML), YOLO Dataset (images+labels)
Classification: ImageFolder (class-per-subdir)
Segmentation: COCO (Segmentation), YOLO TXT (Segmentation)
ML/Hub friendly: Hugging Face Dataset (JSONL), WebDataset (tar shards)
Other: TFRecord (requires TensorFlow), Audio Manifest (JSONL), TimeSeries Windows (Parquet)

Notes

Parquet requires pyarrow.
Excel export requires openpyxl.
TFRecord export requires tensorflow.
Neural mode requires ultralytics and/or torch+torchvision. See below.
Stratified split requires enough samples per class; otherwise the app falls back to a random split.
Column list should be comma-separated without quotes, e.g.: feature1, feature2, label.
Your settings are stored at %APPDATA%/DatasetConverter/settings.json on Windows.
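
The stratified-split fallback mentioned in these notes could look roughly like the sketch below. It uses only the standard library on a list of row dicts; the app's actual logic may differ. The function name `split_rows` and the minimum of 3 rows per class are assumptions made for illustration.

```python
import random
from collections import defaultdict


def split_rows(rows, fractions=(0.8, 0.1, 0.1), stratify_key=None, seed=0):
    """Split a list of row dicts into (train, val, test) lists.

    With stratify_key, each class is shuffled and cut separately so class
    proportions are roughly preserved. If any class has fewer than 3 rows
    (an assumed threshold), fall back to a plain random split.
    """
    rng = random.Random(seed)

    def cut(items):
        items = list(items)
        rng.shuffle(items)
        n_train = int(len(items) * fractions[0])
        n_val = int(len(items) * fractions[1])
        return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]

    if stratify_key is not None:
        groups = defaultdict(list)
        for row in rows:
            groups[row[stratify_key]].append(row)
        if groups and min(len(g) for g in groups.values()) >= 3:
            train, val, test = [], [], []
            for group in groups.values():
                part_train, part_val, part_test = cut(group)
                train.extend(part_train)
                val.extend(part_val)
                test.extend(part_test)
            return train, val, test

    # Fallback: random split that ignores class labels.
    return cut(rows)
```

Rows loaded with `csv.DictReader` can be passed in directly, for example `split_rows(rows, stratify_key="label")`.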

Output naming

Without split: <output_folder>/<base>.{csv|jsonl|parquet}
With split: <output_folder>/<base>_train.*, <base>_val.*, <base>_test.*
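
As a minimal sketch of this naming scheme (the helper name `output_paths` and the `"all"` key are hypothetical, chosen here for illustration):

```python
from pathlib import Path


def output_paths(output_folder, base, ext, split=False):
    """Return output file paths following the naming scheme above."""
    folder = Path(output_folder)
    ext = ext.lstrip(".")
    if not split:
        return {"all": folder / f"{base}.{ext}"}
    return {part: folder / f"{base}_{part}.{ext}" for part in ("train", "val", "test")}
```

For example, `output_paths("data/processed", "dataset", "jsonl", split=True)` yields paths ending in `dataset_train.jsonl`, `dataset_val.jsonl`, and `dataset_test.jsonl`.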

Example

Click "Browse..." to pick data/raw.csv (or "Paste Path").
Choose output folder, e.g., data/processed/.
Select format: JSONL (or any from the dropdown).
Set base filename: dataset.
(Optional) Keep columns: text,label.
Enable split: Train=0.8, Val=0.1, Test=0.1, Stratify: label.
Click Convert.
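
For readers who want to script the column-filter and JSONL steps above without the GUI, a standard-library sketch follows. It is not the app's code (the app relies on pandas), and the function names `parse_columns` and `csv_to_jsonl` are hypothetical.

```python
import csv
import json


def parse_columns(text):
    """Turn 'text, label' into ['text', 'label'], ignoring blank entries."""
    return [c.strip() for c in text.split(",") if c.strip()]


def csv_to_jsonl(csv_path, jsonl_path, keep_columns=""):
    """Write csv_path as JSONL, keeping only the named columns (all if empty).

    Returns the number of rows written.
    """
    keep = parse_columns(keep_columns)
    with open(csv_path, newline="", encoding="utf-8") as src, \
         open(jsonl_path, "w", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        missing = [c for c in keep if c not in (reader.fieldnames or [])]
        if missing:
            raise ValueError(f"Columns not found in CSV: {missing}")
        count = 0
        for row in reader:
            record = {c: row[c] for c in keep} if keep else dict(row)
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Note that `csv.DictReader` reads every value as a string, so numeric columns are written as quoted strings in this sketch.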

Neural example (optional)

Set Processing Mode to Neural.
Choose Task Detection or Classification.
Select Engine: Ultralytics YOLO (Det/Cls) or TorchVision (Cls).
Pick a Preset or type a model/arch (e.g., yolo11n.pt, yolov8n.pt, resnet18).
Adjust Confidence and Overwrite as needed (YOLO only).
Click Convert — predictions are applied before any split/export.
During inference, a progress dialog shows per-image progress.

Troubleshooting

If CSV reading fails, ensure the file is not open/locked and is a valid CSV.
If saving Parquet fails, install pyarrow.
Large files: run under 64-bit Python and make sure the machine has enough free RAM.

Build a standalone Windows EXE

You can package this app as a single-folder Windows executable using PyInstaller.

Quick build (PowerShell)

./build.ps1
# or to clean previous builds
./build.ps1 -Clean

The EXE will be at dist/DatasetConverter/DatasetConverter.exe.

Manual build

# Create venv if needed
py -m venv .venv
.\.venv\Scripts\python -m pip install --upgrade pip
.\.venv\Scripts\python -m pip install -r requirements.txt
.\.venv\Scripts\python -m pip install pyinstaller

# Build (PowerShell: a trailing backtick continues the line)
.\.venv\Scripts\python -m PyInstaller `
  --name "DatasetConverter" `
  --noconfirm `
  --windowed `
  --clean `
  app.py

If Windows SmartScreen warns when running the EXE, click "More info" → "Run anyway". If you distribute the binary, consider code-signing it.

Distribute the EXE

Zip the entire folder dist/DatasetConverter/ into DatasetConverter-win.zip.
Upload the zip to a GitHub Release.
In the release notes, mention first-run behavior (the SmartScreen warning, model downloads, and the Download Models button for offline use).

Optional: publish a SHA256 checksum:

certutil -hashfile DatasetConverter-win.zip SHA256
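
People downloading the zip on a machine without certutil can compute the same digest in Python. This is a small sketch using hashlib; the function name `sha256_of_file` is hypothetical.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Compare the result of `sha256_of_file("DatasetConverter-win.zip")` with the published checksum, ignoring letter case and any spaces.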

Neural processing mode

Neural mode auto-annotates your data using Ultralytics YOLO or TorchVision (classification-only).

Install

# In your virtual environment (pick what you need)
# Ultralytics (YOLO engines)
pip install ultralytics

# TorchVision Classification (CPU wheels example)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

Usage

Switch Processing Mode to Neural.
Choose Task (Detection or Classification).
Select Engine: Ultralytics YOLO (Det/Cls) or TorchVision (Cls).
Pick a Preset or type a model/arch (e.g., yolo11n.pt, yolov8n.pt, resnet18).
Adjust Confidence and Overwrite as needed (YOLO only).
Click Convert — predictions are applied before any split/export.
During inference, a progress dialog shows per-image progress.

Offline models

Use the Download Models button to prefetch: yolo11n.pt, yolov8n.pt, yolov8n-cls.pt, resnet18, resnet50, mobilenet_v3_small, efficientnet_b0.
The downloader runs PowerShell and shows logs with a green progress bar and percentage.
If you attempt inference without network and weights are missing, the app will prompt you to download common models.
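
Before going offline, it can be useful to confirm which .pt files are already in place. The sketch below assumes the weights are looked up by file name in a set of directories (for example the working directory or the EXE folder); the function name `find_missing_weights` is hypothetical and the app's own lookup may differ.

```python
from pathlib import Path


def find_missing_weights(names, search_dirs):
    """Return the weight file names not found in any of search_dirs."""
    dirs = [Path(d) for d in search_dirs]
    return [n for n in names if not any((d / n).is_file() for d in dirs)]
```

For example, `find_missing_weights(["yolo11n.pt", "yolov8n.pt"], ["."])` lists the files that are still missing from the current directory.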


sahir247/ML-Dataset-Converter
