
autopack makes your Hugging Face models easy to run, share, and ship. It quantizes once and exports to multiple runtimes, with sensible defaults and an automatic flow that produces a readable summary. It supports HF, ONNX, and GGUF (llama.cpp) formats and can publish to the Hugging Face Hub in one shot.
About · Requirements · Setup · Building Instructions · Running · Detailed Usage · Q&A
autopack is a CLI that helps you quantize and package Hugging Face models into multiple useful formats in a single pass, with an option to publish artifacts to the Hub.
Have a 120B LLM and want to optimize it so that people (not just corporations with clusters of B200s) can run it on an 8 GB 2060? All you need to do is run:
autopack sentence-transformers/all-MiniLM-L6-v2
- Fast: generate multiple variants in one command.
- Practical: built on Transformers, bitsandbytes, ONNX, and llama.cpp.
- Portable: CPU- and GPU-friendly artifacts, good defaults.
- Python 3.9+
- PyTorch, Transformers, Hugging Face Hub
- Optional: bitsandbytes (4/8-bit), optimum[onnxruntime] (ONNX), llama.cpp (GGUF tools)
- GGUF export requires a built llama.cpp and llama-quantize in PATH (see the quick check below).
- Set HUGGINGFACE_HUB_TOKEN to publish, or pass --token.
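If you want to confirm the optional pieces are in place before a run, a quick check along these lines can help (the tool and package names are the ones mentioned above; adapt to your environment):

command -v llama-quantize || echo "llama-quantize not in PATH (needed for GGUF)"
command -v llama-cli || echo "llama-cli not in PATH (needed for GGUF benchmarks)"
python -c "import bitsandbytes" 2>/dev/null || echo "bitsandbytes not installed (4/8-bit)"
python -c "import optimum.onnxruntime" 2>/dev/null || echo "optimum[onnxruntime] not installed (ONNX)"
test -n "$HUGGINGFACE_HUB_TOKEN" || echo "HUGGINGFACE_HUB_TOKEN not set (needed to publish)"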
pip install autopack-grn
# ONNX export support
pip install 'autopack-grn[onnx]'
# GGUF export helpers (converter deps)
pip install 'autopack-grn[gguf]'
# llama.cpp runtime bindings (llama-cpp-python)
pip install 'autopack-grn[llama]'
# Everything for llama.cpp functionality (GGUF export + runtime)
pip install 'autopack-grn[gguf,llama]'
Note: for GGUF and llama.cpp functionality you also need the llama.cpp tools (llama-quantize, llama-cli) available on your PATH. You can build the vendored copy and export PATH as shown in Vendored llama.cpp quick build.
pip install -e .
# Optional extras while developing
pip install -e '.[onnx]'
pip install -e '.[gguf]'
pip install -e '.[llama]'
pip install -e '.[gguf,llama]'
python -m build
autopack meta-llama/Llama-3-8B --output-format hf
Add ONNX and GGUF:
autopack meta-llama/Llama-3-8B --output-format hf onnx gguf --summary-json --skip-existing
GGUF only (with default presets Q4_K_M, Q5_K_M, Q8_0):
autopack meta-llama/Llama-3-8B --output-format gguf --skip-existing
Publish to Hub:
autopack publish out/llama3-4bit your-username/llama3-4bit --private \
--commit-message "Add 4-bit quantized weights"
Inspect a model id or local folder and print metadata (config, sizes, quantization hints) with suggestions for next steps. The report includes human-readable sizes, a file summary (config/tokenizer presence), weight file counts, and the top 5 largest files.
autopack scan <model_id_or_path> \
[--revision <rev>] [--trust-remote-code] [--local-files-only] \
[--resolve-cache] [--json] [--show-files] [--limit-files 50]
Examples:
# Remote model (lightweight, no weight download). Prints human-readable summary
autopack scan meta-llama/Llama-3-8B
# JSON output suitable for scripting
autopack scan meta-llama/Llama-3-8B --json
# Resolve a local snapshot to list files and sizes
autopack scan meta-llama/Llama-3-8B --resolve-cache --show-files --limit-files 20
# Scan a local folder
autopack scan ./tiny-gpt2
Run common HF quantization variants and optional ONNX/GGUF exports in one go, with a summary table and generated README in the output folder.
autopack [auto] <model_id_or_path> [-o <out_dir>] \
--output-format hf [onnx] [gguf] \
[--eval-dataset <dataset>[::<config>]] \
[--revision <rev>] [--trust-remote-code] [--device auto|cpu|cuda] \
[--no-bench] [--bench-prompt "..."] [--bench-max-new-tokens 16] \
[--bench-warmup 0] [--bench-runs 1] \
[--hf-variant bnb-4bit|bnb-8bit|int8-dynamic|bf16] \
[--hf-variants bnb-4bit bnb-8bit int8-dynamic bf16]
Key points:
- Default HF variants: bnb-4bit, bnb-8bit, int8-dynamic, bf16
- Add ONNX and/or GGUF via --output-format.
- If -o/--output-dir is omitted, the output folder defaults to the last path segment of the model id/path (e.g., user/model -> model).
- Benchmarking is enabled by default in auto; use --no-bench to disable.
- If --eval-dataset is provided, perplexity is computed for each HF variant.
- If benchmarking is enabled, autopack measures actual Tokens/s per backend and replaces heuristic speedups with real Tokens/s and speedup vs bf16 in the summary and the generated README.
- For very large models, use --hf-variant bf16 (single) or --hf-variants bf16 int8-dynamic (subset) to reduce loads (see the example below).
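For instance, a run that combines a variant subset with a few benchmark repetitions (all flags taken from the synopsis above) might look like:

autopack meta-llama/Llama-3-8B --output-format hf \
  --hf-variants bf16 int8-dynamic \
  --bench-warmup 1 --bench-runs 3 --skip-existing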
Produce specific formats with a chosen quantization strategy.
autopack quantize <model_id_or_path> [-o <out_dir>] \
--output-format hf [onnx] [gguf] \
[--quantization bnb-4bit|bnb-8bit|int8-dynamic|none] \
[--dtype auto|float16|bfloat16|float32] \
[--device-map auto|cpu] [--prune <0..0.95>] \
[--revision <rev>] [--trust-remote-code]
Upload an exported model folder to the Hugging Face Hub.
autopack publish <folder> <user_or_org/repo> \
[--private] [--token $HUGGINGFACE_HUB_TOKEN] \
[--branch <rev>] [--commit-message "..."] [--no-create]
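For example, with the token taken from the environment and an explicit target branch (the repo name and branch here are placeholders):

export HUGGINGFACE_HUB_TOKEN=hf_xxx   # or pass --token
autopack publish out/llama3-4bit your-username/llama3-4bit \
  --branch main --commit-message "Add quantized weights"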
Run standalone benchmarks on existing models/artifacts.
autopack bench <target> \
--backend hf [onnx] [gguf] \
[--prompt "Hello"] [--max-new-tokens 64] \
[--device auto] [--num-warmup 1] [--num-runs 3] \
[--trust-remote-code] [--llama-cli /path/to/llama-cli]
Notes:
- For HF, target can be a Hub id or local folder. For ONNX, pass the exported folder. For GGUF, pass a .gguf file or a folder containing one.
- ONNX benchmarking requires optimum[onnxruntime]. GGUF benchmarking requires llama-cli (see the sketch below).
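As a sketch, benchmarking an HF variant and a GGUF file produced by the earlier tiny-gpt2 examples could look like this (paths follow the output layout used elsewhere in this README):

# HF backend: Hub id or exported variant folder
autopack bench tiny-gpt2/bf16 --backend hf --prompt "Hello" --max-new-tokens 32 --num-runs 3
# GGUF backend: a .gguf file plus the llama-cli binary if it is not on PATH
autopack bench tiny-gpt2/gguf/model-Q4_K_M.gguf --backend gguf \
  --llama-cli ./third_party/llama.cpp/build/bin/llama-cli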
- --trust-remote-code: enable loading custom modeling code from Hub repos
- --revision: branch/tag/commit to load
- --device-map: set to cpu to force CPU; defaults to auto
- --dtype: compute dtype for non-INT8 layers (applies to HF exports)
- --prune: global magnitude pruning ratio across Linear layers (0..0.95); see the example below
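A hedged example combining several of these shared options with the quantize command shown above (the model id and option values are placeholders):

autopack quantize your-username/your-model --output-format hf \
  --quantization int8-dynamic --dtype float32 \
  --device-map cpu --prune 0.1 --trust-remote-code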
- hf: Transformers checkpoint with tokenizer and config
- onnx: ONNX export using optimum[onnxruntime] for CausalLM
- gguf: llama.cpp GGUF via convert_hf_to_gguf.py and llama-quantize
- Converter resolution order:
  1. --gguf-converter if provided
  2. $LLAMA_CPP_CONVERT env var
  3. Vendored script: third_party/llama.cpp/convert_hf_to_gguf.py
  4. ~/llama.cpp/convert_hf_to_gguf.py or ~/src/llama.cpp/convert_hf_to_gguf.py
- Quant presets: uppercase (e.g., Q4_K_M). If omitted, autopack generates Q4_K_M, Q5_K_M, Q8_0 by default.
- Isolation: by default, conversion runs in an isolated .venv inside the output dir. Disable with --gguf-no-isolation (see the example after these notes).
- Architecture checks: pass --gguf-force to bypass the basic architecture guard.
- Ensure llama-quantize is in PATH (typically in third_party/llama.cpp/build/bin).
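Putting these knobs together, an explicit GGUF run that pins the converter and skips the isolated venv might look like this (the converter path is one of the fallback locations listed above; adjust it to where your copy lives):

autopack your-username/your-model --output-format gguf \
  --gguf-quant Q4_K_M Q8_0 \
  --gguf-converter ~/llama.cpp/convert_hf_to_gguf.py \
  --gguf-no-isolation --skip-existing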
- Requires: pip install 'optimum[onnxruntime]'
- Uses ORTModelForCausalLM; non-CausalLM models may not be supported in this version (see the sketch below).
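As a sketch of consuming the exported folder from Python, assuming the ONNX files land in an onnx/ subfolder of the output dir (the exact layout is an assumption; adjust the path to your run):

python - <<'PY'
# Load an ONNX-exported CausalLM with optimum's ONNX Runtime wrapper
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

path = 'tiny-gpt2/onnx'  # assumed location of the ONNX export
tok = AutoTokenizer.from_pretrained(path)
model = ORTModelForCausalLM.from_pretrained(path)
inputs = tok('Hello world', return_tensors='pt')
out = model.generate(**inputs, max_new_tokens=8)
print(tok.decode(out[0]))
PY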
- --eval-dataset accepts dataset or dataset:config (e.g., wikitext-2-raw-v1); see the example below
- --eval-text-key controls which dataset column is used for text (default: text)
- Device selection is automatic (cuda if available, else cpu)
- Only CausalLM architectures are supported for perplexity computation
- Uses a bounded sample count and expects a text field in the dataset
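For example, a perplexity run during auto, reusing the dataset name from the notes above (treat the exact dataset/config spelling as an assumption to adapt to your dataset):

autopack sshleifer/tiny-gpt2 --output-format hf \
  --eval-dataset wikitext-2-raw-v1 --eval-text-key text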
Single-variant run (bf16 only):
autopack meta-llama/Llama-3-8B --output-format hf --hf-variant bf16
Subset of variants:
autopack meta-llama/Llama-3-8B --output-format hf --hf-variants bf16 int8-dynamic
CPU-friendly int8 dynamic with pruning:
autopack quantize meta-llama/Llama-3-8B \
--output-format hf --quantization int8-dynamic --prune 0.2 --device-map cpu
BF16 only (no quantization):
autopack quantize meta-llama/Llama-3-8B \
--output-format hf --quantization none --dtype bfloat16
Override GGUF presets:
autopack meta-llama/Llama-3-8B \
--output-format gguf --gguf-quant Q5_K_M Q8_0
Auto with benchmarking (reports Tokens/s and real speedup vs bf16):
autopack sshleifer/tiny-gpt2 --output-format hf
Hello World (Transformers on CPU):
pip install autopack-grn
autopack sshleifer/tiny-gpt2 --output-format hf
python - <<'PY'
from transformers import AutoTokenizer, AutoModelForCausalLM
tok = AutoTokenizer.from_pretrained('tiny-gpt2/bf16')
m = AutoModelForCausalLM.from_pretrained('tiny-gpt2/bf16', device_map='cpu')
ids = tok('Hello world', return_tensors='pt').input_ids
out = m.generate(ids, max_new_tokens=8)
print(tok.decode(out[0]))
PY
Hello World (GGUF with llama.cpp):
autopack sshleifer/tiny-gpt2 --output-format gguf
./third_party/llama.cpp/build/bin/llama-cli -m tiny-gpt2/gguf/model-Q4_K_M.gguf -p "Hello world" -n 16
Vendored llama.cpp quick build:
cd third_party/llama.cpp
cmake -S . -B build -DGGML_NATIVE=ON
cmake --build build -j
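After building, you can put the resulting binaries on PATH so llama-quantize and llama-cli are found (this is the build/bin location referenced in the notes above):

export PATH="$PWD/build/bin:$PATH"
# or, from the repository root:
# export PATH="$PWD/third_party/llama.cpp/build/bin:$PATH"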
- llama-quantize not found: build llama.cpp and ensure build/bin is in PATH.
- BitsAndBytes on Windows: currently not installed by default; prefer CPU/int8-dynamic flows.
- Custom code prompt: pass --trust-remote-code to avoid the interactive confirmation.
- HUGGINGFACE_HUB_TOKEN: token to publish to the Hub
- LLAMA_CPP_CONVERT: path to convert_hf_to_gguf.py
- PATH: should include the directory with llama-quantize
By default, autopack generates HF variants (4-bit, 8-bit, int8-dynamic, bf16) and prints a summary; GGUF/ONNX are opt-in.
For GGUF, autopack creates multiple useful presets by default (Q4_K_M, Q5_K_M, Q8_0).
For very large models (tens of GBs), prefer a minimal, resumable flow:
- Single variant: use --hf-variant bf16 (or another) to avoid multiple loads
- Avoid extra runs: add --no-bench
- Resume-friendly: keep --skip-existing
- CPU-safe: add --device cpu to skip GPU-only paths
Examples:
# BF16 only, no benchmarking, resume if partial outputs exist
autopack user/model-giant --output-format hf \
--hf-variant bf16 --no-bench --skip-existing
# CPU-focused subset
autopack user/model-giant --output-format hf \
--hf-variants bf16 int8-dynamic --device cpu --no-bench --skip-existing
License: Apache-2.0