🤗 AutoQuantNX

title	app_file	sdk	sdk_version
AutoQuantNX	app.py	gradio	4.44.1

Overview

AutoQuantNX is a Gradio-based web application that simplifies optimizing and deploying Hugging Face models. It supports quantization, ONNX conversion, and integration with the Hugging Face Hub. With AutoQuantNX, you can convert models to ONNX format, apply quantization, and push the optimized models to your Hugging Face account, all through a single user interface.

Note: In the deployed UI, only 16-bit quantization works, because BitsAndBytes requires a GPU and the free Hugging Face Space does not provide one.

Features

Supported Tasks

AutoQuantNX supports the following tasks:

Text Classification
Named Entity Recognition (NER)
Question Answering
Causal Language Modeling
Masked Language Modeling
Sequence-to-Sequence Language Modeling
Multiple Choice
Whisper (Speech-to-Text)
Embedding Fine-Tuning
Image Classification (Placeholder for future implementation)

Quantization Options

None (default)
4-bit
8-bit
16-bit-float
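
As an illustration of how these options could translate into model-loading arguments, here is a minimal sketch. The helper names `quantization_settings` and `load_model` and the exact option strings are assumptions made for this example, not the project's actual code; `BitsAndBytesConfig`, `load_in_4bit`, `load_in_8bit`, and `torch_dtype` are existing Transformers names. The 4-bit and 8-bit paths rely on bitsandbytes and therefore need a GPU.

```python
def quantization_settings(option):
    """Map a UI option string to (bitsandbytes flags, dtype name).

    Hypothetical helper: the option strings mirror the list above.
    """
    table = {
        "None": ({}, None),
        "4-bit": ({"load_in_4bit": True}, None),
        "8-bit": ({"load_in_8bit": True}, None),
        "16-bit-float": ({}, "float16"),
    }
    if option not in table:
        raise ValueError(f"Unsupported quantization option: {option!r}")
    return table[option]


def load_model(model_name, option="None"):
    """Load a model with the chosen option (needs torch and transformers)."""
    # Imported lazily so the mapping above works without the heavy dependencies.
    import torch
    from transformers import AutoModel, BitsAndBytesConfig

    bnb_flags, dtype_name = quantization_settings(option)
    kwargs = {}
    if bnb_flags:
        # 4-bit and 8-bit loading goes through bitsandbytes (GPU required).
        kwargs["quantization_config"] = BitsAndBytesConfig(**bnb_flags)
    if dtype_name:
        # 16-bit float is a plain dtype cast and does not need bitsandbytes.
        kwargs["torch_dtype"] = getattr(torch, dtype_name)
    return AutoModel.from_pretrained(model_name, **kwargs)
```

Keeping the option-to-argument mapping separate from the loading call makes it easy to see why only the 16-bit-float option can run on a CPU-only machine.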

ONNX Conversion

Converts models to ONNX format for optimized deployment.

Supports optional ONNX quantization:

8-bit
16-bit-int
16-bit-float
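
The 8-bit case can be sketched with ONNX Runtime's dynamic quantization. This is only an assumption about how such a step might look, not the project's `quantize_onnx_model`; the function names and the `.int8.onnx` naming scheme are hypothetical, while `quantize_dynamic` and `QuantType.QInt8` come from `onnxruntime.quantization`. The 16-bit variants are not covered here.

```python
from pathlib import Path


def quantized_output_path(onnx_path, suffix="int8"):
    """Return a sibling path such as model.int8.onnx (hypothetical naming)."""
    path = Path(onnx_path)
    return path.with_name(f"{path.stem}.{suffix}{path.suffix}")


def quantize_to_int8(onnx_path):
    """Apply dynamic 8-bit weight quantization to an exported ONNX model."""
    # Lazy import: this step requires the onnxruntime package.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    output_path = quantized_output_path(onnx_path)
    quantize_dynamic(
        model_input=str(onnx_path),
        model_output=str(output_path),
        weight_type=QuantType.QInt8,  # store weights as signed 8-bit integers
    )
    return output_path
```

Dynamic quantization needs no calibration data, which is why it suits a point-and-click workflow.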

Hugging Face Hub Integration

Automatically pushes optimized models to your Hugging Face Hub repository
Tags models with metadata for easy identification (e.g., onnx, quantized, task type)
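
A rough sketch of tagging and pushing, under stated assumptions: the Hub reads tags from the YAML front matter of a repository's README.md, so the example writes a small card before uploading. The function names `build_tags`, `write_card`, and `push_folder` are hypothetical and do not describe the project's `push_to_hub`; `HfApi`, `create_repo`, and `upload_folder` are existing `huggingface_hub` calls.

```python
from pathlib import Path


def build_tags(task, onnx=False, quantization=None):
    """Collect Hub tags for an optimized model (hypothetical helper)."""
    tags = [task]
    if onnx:
        tags.append("onnx")
    if quantization and quantization != "None":
        tags.extend(["quantized", quantization])
    return tags


def write_card(folder, tags):
    """Write a README.md whose YAML front matter carries the tags."""
    lines = ["---", "tags:"] + [f"- {tag}" for tag in tags] + ["---", ""]
    card = Path(folder) / "README.md"
    card.write_text("\n".join(lines), encoding="utf-8")
    return card


def push_folder(folder, repo_id, token, tags):
    """Create the repository if needed and upload the folder contents."""
    # Lazy import: requires the huggingface_hub package and a write token.
    from huggingface_hub import HfApi

    write_card(folder, tags)
    api = HfApi(token=token)
    api.create_repo(repo_id, exist_ok=True)
    api.upload_folder(folder_path=str(folder), repo_id=repo_id)
```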

Performance Testing

Compares original and quantized models using metrics like:

Mean Squared Error (MSE)
Spearman Correlation
Cosine Similarity
Inference Time
Model Size

File Structure

AutoQuantNX/
├── src/
│   ├── handlers/
│   │   ├── audio_models/
│   │   │   └── whisper_handler.py
│   │   ├── img_models/
│   │   │   └── image_classification_handler.py
│   │   ├── nlp_models/
│   │   │   ├── causal_lm_handler.py
│   │   │   ├── embedding_model_handler.py
│   │   │   ├── masked_lm_handler.py
│   │   │   ├── multiple_choice_handler.py
│   │   │   ├── question_answering_handler.py
│   │   │   ├── seq2seq_lm_handler.py
│   │   │   ├── sequence_classification_handler.py
│   │   │   └── token_classification_handler.py
│   │   ├── __init__.py
│   │   └── base_handler.py
│   ├── optimizations/
│   │   ├── onnx_conversion.py
│   │   └── quantize.py
│   └── utilities/
│       ├── push_to_hub.py
│       └── resources.py
├── README.md
├── app.py
├── poetry.lock
├── pyproject.toml
└── requirements.txt

Prerequisites

Using requirements.txt (not the preferred method)

Python 3.8 or higher
Install dependencies:
```
pip install -r requirements.txt
```

Using Poetry

Install Poetry (if not already installed):

Linux:
```
curl -sSL https://install.python-poetry.org | python3 -
```
Other platforms: Follow the official instructions.
Install dependencies:
```
poetry install
```
Activate the virtual environment:
```
poetry shell
```

Usage

Launch the App

Run the following command to start the Gradio web application:

python app.py

The app will be accessible at http://localhost:7860 by default.

Steps to Use the App

Enter Model Details:
- Provide the Hugging Face model name
- Select the task type (e.g., text classification, question answering)
Select Optimization Options:
- Choose quantization type (e.g., 4-bit, 8-bit)
- Enable ONNX conversion and select quantization options if needed
Provide Hugging Face Token:
- Enter your Hugging Face token for accessing and pushing models to the Hub
Start Conversion:
- Click the "Start Conversion" button to process the model
Monitor Progress:
- View real-time status updates, resource usage, and results directly in the app
Push to Hub:
- Optimized models are automatically pushed to your specified Hugging Face repository

Example

For a model like bert-base-uncased performing text classification:

Select text_classification as the task
Enable quantization (e.g., 8-bit)
Enable ONNX conversion with optimization
Click "Start Conversion" and monitor progress

Key Functions

app.py

process_model: Main function handling model quantization, ONNX conversion, and Hugging Face Hub integration
update_memory_info: Monitors and displays system resource usage

optimizations/onnx_conversion.py

convert_to_onnx: Converts models to ONNX format
quantize_onnx_model: Quantizes ONNX models for optimized inference

optimizations/quantize.py

ModelQuantizer: Handles quantization of PyTorch models and performance testing

utilities/push_to_hub.py

push_to_hub: Pushes models to the Hugging Face Hub

utilities/resources.py

ResourceManager: Manages temporary files and memory usage
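
The idea of managing temporary files can be sketched with a context manager that creates a scratch directory and always removes it afterwards. This is not the project's `ResourceManager`; the name `temporary_workspace` and the directory prefix are assumptions for illustration, built only on the standard library.

```python
import gc
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def temporary_workspace(prefix="autoquantnx_"):
    """Yield a scratch directory and clean it up, even if conversion fails."""
    workdir = Path(tempfile.mkdtemp(prefix=prefix))
    try:
        yield workdir
    finally:
        # Remove intermediate files and prompt Python to release unused objects.
        shutil.rmtree(workdir, ignore_errors=True)
        gc.collect()
```

Usage would look like `with temporary_workspace() as workdir:` followed by writing intermediate model files under `workdir`.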

Notes

Ensure you have sufficient system resources for model conversion and quantization
Use a Hugging Face Hub token with proper write permissions for pushing models

Troubleshooting

Model Conversion Fails: Ensure the model and task are supported
Insufficient Resources: Free up memory or reduce optimization levels
ONNX Quantization Errors: Verify that the selected quantization type is supported for the model

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contributions

Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.

Acknowledgments

Hugging Face Transformers
Optimum Library
Gradio
ONNX Runtime

smokeyScraper/AutoQuantNX