This is a multimodal version of GPT-Fast that adds support for vision-language models, allowing the framework to process both text and images.
Featuring:
- Very low latency
- <1000 lines of Python
- No dependencies other than PyTorch and Transformers
- int8/int4/fp8 quantization
- Speculative decoding
- Tensor parallelism
- Supports AMD GPUs
This is NOT intended to be a "framework" or "library" - it is intended to show off what kind of performance you can get with native PyTorch :) Please copy-paste and fork as you desire.
For an in-depth walkthrough of what's in this codebase, see this blog post.
Supported models:
- LLaMA family models (Llama-2, Llama-3, Llama-3.1, Llama-3.2, AMD-Llama) (Examples: 🤗, 🤗, ...)
- Qwen family models (Qwen-2, Qwen-2.5) (Examples: 🤗, ...)
This version adds support for several vision-language models:
- lmms-lab/Llava-One-Vision-Qwen2-0.5B-Si 🤗
- lmms-lab/Llava-One-Vision-Qwen2-7B-Si 🤗
- lmms-lab/Llava-One-Vision-Qwen2-72B-Si 🤗
First, install PyTorch according to the instructions for your operating system. For AMD GPUs, we strongly recommend using a ROCm software Docker image such as rocm/pytorch. You can then install the remaining required packages with the command below, which avoids reinstalling PyTorch from scratch:
pip install -r requirements.txt -c constraints.txt
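If you start from one of the recommended ROCm containers, an illustrative way to launch one is to pull the rocm/pytorch image and run it with GPU access; the image tag and device flags below are common defaults, so adjust them to your system:
docker pull rocm/pytorch:latest
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 16G rocm/pytorch:latest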
To download and convert any of the supported models listed above, use the following command to fetch the HF model checkpoints:
bash scripts/prepare.sh <HF_model/repo_id> <download_dir>
where <HF_model/repo_id> is the model ID from the Hugging Face Hub. This script downloads the model weights from Hugging Face and converts them to the format used by this GPT-Fast repo. For gated models, you need your Hugging Face token available in the environment; if you have not set that up yet, you can log in with:
huggingface-cli login
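Alternatively, huggingface_hub also reads the token from the HF_TOKEN environment variable, so you can export it instead of logging in interactively (the value below is a placeholder):
export HF_TOKEN=<your_hf_access_token>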
To save memory and potentially improve performance, you can quantize models to int8, int4, or fp8:
python quantize.py --checkpoint_path <download_dir>/<HF_model/repo_id>/model.pth --mode int8
You can also directly apply quantization when preparing models by adding the quantization mode as a third parameter:
bash scripts/prepare.sh <HF_model/repo_id> <download_dir> int8
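For instance, an illustrative end-to-end preparation of the 7B Llava-One-Vision model with int8 quantization into a local checkpoints/ directory (assuming the repo id and directory layout shown above) would be:
bash scripts/prepare.sh lmms-lab/Llava-One-Vision-Qwen2-7B-Si checkpoints int8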
To run vanilla decoding benchmarks, use the evaluate.py script as below:
python evaluate.py --bench_name MMMU --checkpoint_path <download_dir>/<HF_model/repo_id>/model.pth
To run speculative decoding, add the draft model's arguments as below:
python evaluate.py --bench_name MMMU --checkpoint_path <download_dir>/<HF_model_target/repo_id>/model.pth --draft_checkpoint_path <download_dir>/<HF_model_draft/repo_id>/model.pth --speculate_k <num_of_draft_tokens>
- To compile the model forward passes using torch.compile(), you can use the --compile flag. Since compilation benefits from a fixed-size kv-cache, it is recommended to use a cache size large enough for both the target and the draft models by setting the --max_cache_size and --draft_max_cache_size arguments, as below:
python evaluate.py --bench_name MMMU --checkpoint_path <download_dir>/<HF_model_target/repo_id>/model.pth --draft_checkpoint_path <download_dir>/<HF_model_draft/repo_id>/model.pth --speculate_k <num_of_draft_tokens> --compile --max_cache_size <target_model_cache_size> --draft_max_cache_size <draft_model_cache_size>
- For the Llama 3.2 vision models, it is also recommended to set --cross_attention_seq_length to fix the kv-cache size of the cross-attention layers.
- To leverage the draft model's visual token compression for faster speculative decoding, you can use --mm_prune_method='random' or --mm_prune_method='structured' along with --mm_prune_ratio=<prune_ratio>; see the combined example after this list.
- For speculative decoding on very large models such as Llama 3.2 90B, you can run the drafter on a separate GPU with the --draft_device argument.
- To use the Tensor Parallel distributed strategy for large multimodal models, you can use the following command. Note that models such as Qwen 0.5B/7B and Llava 0.5B/7B may not work with this approach on 8 GPUs, as their attention head counts are not evenly divisible by 8.
ENABLE_INTRA_NODE_COMM=1 torchrun --standalone --nproc_per_node=<num_gpus> evaluate.py --bench_name MMMU --checkpoint_path <download_dir>/<HF_model_target/repo_id>/model.pth --draft_checkpoint_path <download_dir>/<HF_model_draft/repo_id>/model.pth --speculate_k <num_of_draft_tokens>
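As referenced in the options above, the optional flags compose with the basic speculative decoding command. The invocation below is only an illustrative sketch with placeholder values; the prune method, prune ratio, cache sizes, and draft device shown here are examples, not recommendations:
python evaluate.py --bench_name MMMU --checkpoint_path <download_dir>/<HF_model_target/repo_id>/model.pth --draft_checkpoint_path <download_dir>/<HF_model_draft/repo_id>/model.pth --speculate_k <num_of_draft_tokens> --compile --max_cache_size <target_model_cache_size> --draft_max_cache_size <draft_model_cache_size> --mm_prune_method='random' --mm_prune_ratio=<prune_ratio> --draft_device=<draft_device>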
To interact with the model through a web UI, you can run the Gradio app. If you have not installed the Gradio library, install it first:
pip install gradio
Now you can run the app with the following command:
python app.py --checkpoint_path <download_dir>/<HF_model/repo_id>/model.pth
To use speculative decoding, add the following arguments:
python app.py --checkpoint_path <download_dir>/<HF_model/repo_id>/model.pth --speculate_k <num_of_draft_tokens>
The web UI automatically detects if your model is multimodal and displays an image upload interface if it is. You can:
- Upload images
- Adjust temperature and other sampling parameters
- Toggle speculative decoding on/off
- Stream generated text in real-time
AMD Multimodal gpt-fast is released under the same license as the original GPT-Fast, the BSD 3-Clause license.
This project builds upon the original GPT-Fast by the PyTorch team and extends it with multimodal capabilities.