A compact, simple, and efficient post-training quantization (PTQ) module built on top of PyTorch modules (nn.Module).
- PTQ Focus: quantization of all linear layers (nn.Linear)
- Quantization Methods: W8A32 (8-bit weights, 32-bit activations), W8A16 (8-bit weights, 16-bit activations), W8A8 (Coming soon!)
- Model Support: PyTorch models from the Hugging Face Hub
- Offline-first approach: no automatic downloads from the cloud
- Built-in benchmarking: Memory footprint vs latency tracking (Coming soon)
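For intuition, W8A32 stores each linear layer's weights as int8 with a floating-point scale while activations stay in fp32 (W8A16 keeps them in fp16). The snippet below is a minimal sketch of that idea using symmetric per-channel quantization; it is illustrative only, not TinyQ's internal implementation.

import torch

# Illustrative sketch of symmetric per-channel 8-bit weight quantization (not TinyQ internals)
w = torch.randn(128, 64)                                    # fp32 weights of an nn.Linear
scales = w.abs().max(dim=-1, keepdim=True).values / 127.0   # one scale per output channel
w_int8 = torch.round(w / scales).to(torch.int8)             # 8-bit weights (the "W8" part)
w_dq = w_int8.to(torch.float32) * scales                    # rescaled for use with fp32 activations (the "A32" part)
print((w - w_dq).abs().max())                               # quantization error stays small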
git clone https://github.com/afondiel/TinyQ.git
cd TinyQ
# Create and activate conda environment
conda create -n tinyq python=3.8
conda activate tinyq
# Install requirements
pip install -r requirements.txt
Important: This version operates in offline-only mode. Download your model first:
# Example: Download OPT-125M
huggingface-cli download --resume-download facebook/opt-125m --local-dir ./models/facebook/opt-125m
See the full Model Setup Guide for detailed instructions.
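If you prefer to do this from Python instead of the CLI, the same download can be done with huggingface_hub (shown here as an alternative, not a TinyQ requirement):

from huggingface_hub import snapshot_download

# Fetch OPT-125M once; TinyQ then runs fully offline against the local copy
snapshot_download(repo_id="facebook/opt-125m", local_dir="./models/facebook/opt-125m")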
from tinyq import Quantizer
from utils import load_model, get_generation
# Load model
model, tokenizer = load_model("./models/facebook/opt-125m")
# Initialize quantizer
quantizer = Quantizer(model)
# Quantize model (W8A32 or W8A16)
quantized_model = quantizer.quantize(q_method="w8a32")
# Test inference
prompt = "Hello, world!"
result = get_generation(quantized_model, tokenizer, prompt)
print(result)
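To keep the quantized weights around, you can save them with the usual Transformers API, assuming the returned object is still a Hugging Face PreTrainedModel (reloading may require re-applying TinyQ's layer replacements first):

# Save the quantized weights and tokenizer (assumes a Hugging Face PreTrainedModel)
quantized_model.save_pretrained("./quantized_model")
tokenizer.save_pretrained("./quantized_model")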
python bench.py --model_path "./models/facebook/opt-125m"
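Since bench.py is still marked as coming soon, a rough memory and latency check can also be done by hand with standard Transformers/PyTorch calls (an illustrative sketch, reusing the objects from the quickstart above and assuming the quantized model is still a Hugging Face PreTrainedModel):

import time

# Rough, illustrative benchmark: memory footprint and single-prompt latency
print(f"Memory footprint: {quantized_model.get_memory_footprint() / 1e6:.1f} MB")
start = time.perf_counter()
_ = get_generation(quantized_model, tokenizer, "Hello, world!")
print(f"Latency: {time.perf_counter() - start:.3f} s")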
python examples.py \
--model_path "./models/facebook/opt-125m" \
--qm w8a32 \
--test_inference \
--qmodel_path "./quantized_model"
Arguments:
- --model_path: Path to local model directory (required)
- --qm: Quantization method [w8a32, w8a16] (default: w8a32)
- --test_inference: Run inference test after quantization
- --qmodel_path: Save path for quantized model (default: ./quantized_model)
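For example, to quantize with 16-bit activations instead and skip the inference test:

python examples.py --model_path "./models/facebook/opt-125m" --qm w8a16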
TinyQ/
├── logs/ # Benchmark and training logs
├── models/ # Local model storage
├── tinyq.py # Core quantization library
├── utils.py # Utility functions
├── examples.py # Usage examples
└── bench.py # Benchmarking tools (Coming soon)
- W8A32 implementation
- W8A16 implementation
- Documentation and examples
- Unit tests
- W8A8 Quantization Support
- Model Support Extensions
- Additional Layer Support
- Performance Optimization
The example below shows a PyTorch model printout before and after applying W8A32 quantization.
Before:
After:
You can also use a tool like Netron to get more in-depth insight and compare both models.
(Still to Come)
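To reproduce such a printout yourself, print the module tree before and after quantization (the exact names of the replaced layers depend on TinyQ's implementation):

# Inspect the module tree before and after quantization
print(model)            # original model: nn.Linear layers throughout
print(quantized_model)  # after quantizer.quantize(q_method="w8a32"): linear layers replaced by 8-bit variants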
Contributions are welcome! Please see the Contributing Guidelines.
This project is licensed under the MIT License - see the LICENSE file for details.
This project started as a learning exercise from the Quantization Fundamentals course by DeepLearning.AI and Hugging Face, helping me understand the core concepts behind model quantization.
Special thanks to:
- Younes Belkada & Marc Sun for their excellent instruction and course content
- Andrew Ng and the DeepLearning.AI team for making AI education accessible and practical
- kaushikacharya for his detailed course notes that provided valuable guidance