MNIST Dataset Loader

A uniform interface to the MNIST handwritten digits (default) and Fashion-MNIST datasets, independent of any machine learning framework or external library other than NumPy. It downloads, extracts, and loads the dataset for you.

Features

Pure Python + NumPy: No dependencies on deep learning frameworks.
Automatic Download & Extraction: Fetches and prepares the dataset automatically.
Supports Raw MNIST Format: Loads images and labels directly from binary files.
ARFF Format Support: Provides an option to load data from an ARFF file.
Custom Storage Location: Allows specifying a custom directory for storing dataset files.

MNIST Dataset Structure

The MNIST dataset consists of four binary files:

File	Description	Count
train-images-idx3-ubyte.gz	Training images	60,000
train-labels-idx1-ubyte.gz	Training labels	60,000
t10k-images-idx3-ubyte.gz	Test images	10,000
t10k-labels-idx1-ubyte.gz	Test labels	10,000

Note: The original MNIST site does not provide detailed information about the dataset files.
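
The four files listed above are gzip-compressed, so they must be decompressed before the raw bytes can be parsed. The following is a minimal sketch of that extraction step using only the standard library. The helper name `gunzip` is hypothetical and is not part of this package's API; it only illustrates the idea.

```python
import gzip
import shutil
from pathlib import Path


def gunzip(src, dst=None):
    """Decompress a .gz file and return the path of the decompressed copy.

    Hypothetical helper for illustration; not part of mnist_datasets.
    """
    src = Path(src)
    if dst is None:
        if src.suffix != ".gz":
            raise ValueError(f"cannot derive an output name from {src}")
        # Default target: same path without the trailing ".gz"
        dst = src.with_suffix("")
    dst = Path(dst)
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        # Stream in chunks so the whole file is never held in memory
        shutil.copyfileobj(fin, fout)
    return dst
```

For example, `gunzip('/tmp/t10k-labels-idx1-ubyte.gz')` would write `/tmp/t10k-labels-idx1-ubyte` next to the archive, assuming the archive exists at that location.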

File Format Breakdown

Image File Format (`*-images-idx3-ubyte`)

Offset (Bytes)	Content	Description
0 - 3	Magic number	2051 (0x803 in hex)
4 - 7	Number of images	Total images in the dataset
8 - 11	Rows	Should be 28
12 - 15	Columns	Should be 28
16 - ***	Pixel data	Each pixel is an unsigned value (0-255)
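
To make the layout above concrete, here is a minimal sketch of a reader for an uncompressed image file. The header fields are 32-bit big-endian unsigned integers, followed by one byte per pixel in row-major order. The function name `read_idx_images` is an assumption made for this illustration and may differ from how `MNISTLoader.load_images` is actually implemented.

```python
import struct

import numpy as np


def read_idx_images(path):
    """Read an uncompressed IDX3 image file into a (count, rows, cols) uint8 array.

    Illustrative sketch only; not the package's own implementation.
    """
    with open(path, "rb") as f:
        # Bytes 0-15: magic number, image count, rows, columns (big-endian)
        magic, count, rows, cols = struct.unpack(">IIII", f.read(16))
        if magic != 2051:
            raise ValueError(f"unexpected magic number {magic}, expected 2051")
        # Bytes 16 onward: one unsigned byte per pixel
        pixels = np.frombuffer(f.read(count * rows * cols), dtype=np.uint8)
    if pixels.size != count * rows * cols:
        raise ValueError("file is shorter than its header claims")
    return pixels.reshape(count, rows, cols)
```

Called on a decompressed test-image file, this should return an array of shape `(10000, 28, 28)`.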

Label File Format (`*-labels-idx1-ubyte`)

Offset (Bytes)	Content	Description
0 - 3	Magic number	2049 (0x801 in hex)
4 - 7	Number of labels	Total labels in the dataset
8 - ***	Label data	Each label is a single byte (0-9)
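
The label file follows the same pattern with a shorter header. As a sketch under the same assumptions (big-endian 32-bit header fields, hypothetical function name `read_idx_labels` that is not taken from this package), it can be read like this:

```python
import struct

import numpy as np


def read_idx_labels(path):
    """Read an uncompressed IDX1 label file into a 1-D uint8 array.

    Illustrative sketch only; not the package's own implementation.
    """
    with open(path, "rb") as f:
        # Bytes 0-7: magic number and label count (big-endian)
        magic, count = struct.unpack(">II", f.read(8))
        if magic != 2049:
            raise ValueError(f"unexpected magic number {magic}, expected 2049")
        # Bytes 8 onward: one unsigned byte per label
        labels = np.frombuffer(f.read(count), dtype=np.uint8)
    if labels.size != count:
        raise ValueError("file is shorter than its header claims")
    return labels
```

A quick sanity check after loading is `np.bincount(labels, minlength=10)`, which gives the number of samples per class.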

Installation

Install the package via pip:

pip install mnist_datasets

Usage

Load MNIST Dataset

from mnist_datasets import MNISTLoader
loader = MNISTLoader()
images, labels = loader.load()
assert len(images) == 60000 and len(labels) == 60000

# Load test dataset
test_images, test_labels = loader.load(train=False)
assert len(test_images) == 10000 and len(test_labels) == 10000

Specify a Custom Folder

loader = MNISTLoader(folder='/tmp')

Load Data from an ARFF File

images_from_arff, labels_from_arff = MNISTLoader.from_arff()

Note: The default ARFF source (for handwritten digits) is https://www.openml.org/data/download/52667/mnist_784.arff. This method is provided for educational purposes and is extremely slow.

Verify Consistency Between ARFF and MNIST Binary Format

import numpy as np
from mnist_datasets import MNISTLoader

images_from_arff, labels_from_arff = MNISTLoader.from_arff(train=False)
images, labels = MNISTLoader().load(train=False)
# Both values should be True if the two sources agree
np.all(images_from_arff == images), np.all(labels_from_arff == labels)

Load Images and Labels from Local Storage

images = MNISTLoader.load_images('/tmp/t10k-images-idx3-ubyte')
labels = MNISTLoader.load_labels('/tmp/t10k-labels-idx1-ubyte')
assert len(images) == 10000 and len(labels) == 10000

Note: All of the examples above also work for Fashion-MNIST with one change:

loader = MNISTLoader('fashion')

Additional steps that may be required or helpful

Install virtual environment support (Ubuntu/Debian)

You can skip this if `python3 -m venv` works

sudo apt update && sudo apt install -y python3-venv

# 1. Create a virtual environment in `.venv` folder
python3 -m venv .venv

# 2. Activate the virtual environment
source .venv/bin/activate

# 3. Upgrade pip (recommended)
pip install --upgrade pip

# 4. Install PyTorch (CPU-only build)
pip install torch==2.7.0 --index-url https://download.pytorch.org/whl/cpu

Why use this?

This project is designed for those who want a simple way to load the MNIST dataset with no dependency beyond NumPy, while understanding its raw format in depth.

Contributions & Issues:

Found a bug? Want to contribute? Feel free to open an issue or submit a PR!
