- Dockerfiles (CUDA, CPU): https://github.com/EricLBuehler/mistral.rs/pkgs/container/mistral.rs
- PyPI packages (no features, cuda, mkl, metal, accelerate)
Highlights from v0.6.0
Major Features
- Llama 4 support and the Qwen 3 / Qwen 3 MoE / Qwen VL models, plus DeepSeek and DeepCoder integrations
- Multimodal prefix caching, paged attention scheduler improvements, and faster Metal/CUDA backends
- Web chat app with chat history, file uploads, speech generation, and revamped tool-calling/search
- Fast sampler and CPU FlashAttention with improved performance and accuracy
- Metal and CUDA: major improvements in quantization (AFQ, ISQ), UQFF handling, and memory optimizations (a quick-start sketch follows this list)
- MCP (Model Context Protocol): new server endpoints, docs, and integrated client
- Vision and audio expansion: support for SIGLIP, Dia 1.6b TTS, conformer backbone (Phi-4MM), auto loaders, and vision tool prefixes
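As a concrete entry point for these features, here is a minimal quick-start sketch of loading a model with in-situ quantization (ISQ) and sending one chat request through the mistralrs crate's builder-style Rust API. The model ID and ISQ level are illustrative placeholders, and the builder methods are assumed to match the crate's published examples rather than being prescribed by these notes.

```rust
// Minimal sketch, assuming the mistralrs crate's builder-style API; the model ID
// and ISQ level are placeholders, not recommendations from these release notes.
use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, TextMessages, TextModelBuilder};

#[tokio::main]
async fn main() -> Result<()> {
    // Load a text model and quantize it at load time via ISQ.
    let model = TextModelBuilder::new("Qwen/Qwen3-4B")
        .with_isq(IsqType::Q4K)
        .with_logging()
        .build()
        .await?;

    // Build an OpenAI-style message list and run a single chat request.
    let messages = TextMessages::new()
        .add_message(TextMessageRole::System, "You are a helpful assistant.")
        .add_message(TextMessageRole::User, "Explain what ISQ does in one sentence.");

    let response = model.send_chat_request(messages).await?;
    println!(
        "{}",
        response.choices[0]
            .message
            .content
            .as_deref()
            .unwrap_or_default()
    );
    Ok(())
}
```

Assumed Cargo dependencies for this sketch: `mistralrs`, `tokio`, and `anyhow`.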
Inference Optimizations
- Lightning-fast AFQ on CPU, optimized Qwen 3 MoE on Metal, and paged attention fixes
- Unified FlashAttention backend and automatic method selection for ISQ
- Metal precompilation support and reduced autorelease thrashing
Dev Improvements
- Refactored engine architecture, KV cache, attention backends, and device mapping logic
- Centralized dependency management and cleaner internal abstractions
- Streamlined and faster LoRA support
Other
- Revamped README, AGENTS.md, and new benchmarking scripts
- Interactive mode now shows throughput, supports Gumbel sampling, and offers better runtime sampling controls
- Expanded quant and GGUF support: AWQ, Qwen3 GGUF, and prequantized MLX compatibility (a GGUF loading sketch follows this list)
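To illustrate the expanded GGUF support, the sketch below loads a quantized GGUF file through the crate's GGUF builder. It assumes `GgufModelBuilder` mirrors the text-model builder shown above; the repository and file names are placeholders only.

```rust
// Minimal sketch, assuming a `GgufModelBuilder` that mirrors `TextModelBuilder`;
// the repo and file names below are placeholders, not tested artifacts.
use anyhow::Result;
use mistralrs::{GgufModelBuilder, TextMessageRole, TextMessages};

#[tokio::main]
async fn main() -> Result<()> {
    // Point the builder at a GGUF repo (or local path) and the quantized file to load.
    let model = GgufModelBuilder::new(
        "Qwen/Qwen3-8B-GGUF",                     // placeholder model repo
        vec!["Qwen3-8B-Q4_K_M.gguf".to_string()], // placeholder GGUF file
    )
    .with_logging()
    .build()
    .await?;

    let messages =
        TextMessages::new().add_message(TextMessageRole::User, "Hello from a GGUF model!");
    let response = model.send_chat_request(messages).await?;
    println!(
        "{}",
        response.choices[0]
            .message
            .content
            .as_deref()
            .unwrap_or_default()
    );
    Ok(())
}
```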
What's Changed
- Fix handling of Metal fused attn head dims by @EricLBuehler in #1234
- Support paged attn for vision model rust api by @EricLBuehler in #1235
- [Breaking] Support setting HF cache path by @EricLBuehler in #1237
- Support tool calling for DeepSeek models by @EricLBuehler in #1239
- Server image processing refactor by @EricLBuehler in #1244
- Optimized CUDA RoPE kernels by @EricLBuehler in #1247
- Typo fix (add_speial_tokens to add_special_tokens) by @edwko in #1246
- Fixes for UQFF + distributed layers by @EricLBuehler in #1250
- Automatic agentic search integration (`web_search_options`) by @EricLBuehler in #1243 (a request sketch follows this changelog)
- Format kernels by @EricLBuehler in #1251
- Add quantize guards for UQFF deserialize by @EricLBuehler in #1252
- Refactor cuBLASlt-related code by @EricLBuehler in #1253
- Update deps, bump pyo3 version by @EricLBuehler in #1259
- Faster cuda FP8 performance by @EricLBuehler in #1257
- Rust 1.86 clippy by @EricLBuehler in #1260
- Refactor engine arch by @EricLBuehler in #1262
- Revamped LoRA support - removing the Ordering system by @EricLBuehler in #1263
- Fast Metal-specific quantization method: AFQ by @EricLBuehler in #1264
- Support prequantized models from MLX by @EricLBuehler in #1265
- Automatic ISQ to select fastest & most accurate method by @EricLBuehler in #1266
- Improved usage metrics by @EricLBuehler in #1267
- Bump tokio from 1.44.1 to 1.44.2 by @dependabot in #1270
- Gather MM ops in mistralrs-quant by @EricLBuehler in #1272
- Improve performance of deepseek models by @guoqingbao in #1274
- Implement Llama 4 by @EricLBuehler in #1268
- Fixes for Llama 4 UQFF loading by @EricLBuehler in #1275
- Support sharding for UQFF by @EricLBuehler in #1276
- Fix bug for group-topk (group_limited_greedy) in deepseek models by @guoqingbao in #1278
- Support the DeepCoder model by @EricLBuehler in #1279
- Improved PagedAttn scheduling accuracy by @EricLBuehler in #1282
- Fixes for scheduling image seqs with pagedattn by @EricLBuehler in #1283
- update to llguidance 0.7.16 by @mmoskal in #1284
- Update dependencies by @EricLBuehler in #1286
- Much faster image inputs processing by @EricLBuehler in #1289
- Add more SDPA head dims for much faster SIGLIP by @EricLBuehler in #1290
- Show throughput in interactive mode by @EricLBuehler in #1291
- Unify bitwise operations by @EricLBuehler in #1288
- Multimodal prefix caching support! by @EricLBuehler in #1298
- Interactive mode improvements by @EricLBuehler in #1299
- Add the Qwen 3 and Qwen 3 MoE models by @EricLBuehler in #1285
- Revamped and streaming web search support by @EricLBuehler in #1301
- Handle vision messages or different tool call prefixes by @EricLBuehler in #1302
- Simplify prefix cacher by @EricLBuehler in #1305
- Use rustyline to handle non-ascii in interactive mode by @beeender in #1306
- Add more tools for automatic search by @EricLBuehler in #1307
- Fix CPU hogging in interactive mode by @beeender in #1309
- Add Metal precompilation support by @EricLBuehler in #1311
- Reduce thrashing of Metal autorelease by @EricLBuehler in #1313
- Make `AdapterPaths` and `LoraAdapterPaths` public by @Slowki in #1314
- Refactor KV cache manager by @EricLBuehler in #1315
- Add `Audio` and `Speech` model categories by @Slowki in #1317
- Remove has_conv2d from vision model API by @EricLBuehler in #1318
- Unified/automatic flash attention enabler by @EricLBuehler in #1319
- Fix cublaslt 4d mask by @EricLBuehler in #1320
- Qwen VL models fixes by @EricLBuehler in #1322
- Fixes for all vision models by @EricLBuehler in #1323
- Improved+faster LRU prefix cacher and sampler! by @EricLBuehler in #1321
- Inplace ISQ support and default to mmap by @EricLBuehler in #1277
- Fix typos by @omahs in #1329
- Fix Idefics 3 arch chat templating by @EricLBuehler in #1330
- Remove two spaces from PR comment by @szepeviktor in #1331
- Add automatic vision loader type by @EricLBuehler in #1332
- Add the Dia 1.6b TTS model! by @EricLBuehler in #1304
- Update `llguidance` to `0.7.20` by @Slowki in #1334
- Add model category <> messages check by @EricLBuehler in #1335
- Improve normalization integration test by @EricLBuehler in #1340
- Fix streaming example print statement by @EricLBuehler in #1339
- Fix normalization formula in comment by @EricLBuehler in #1338
- Fix image_to_pixels for non-RGB images by @EricLBuehler in #1337
- Fix typo in expect messages by @EricLBuehler in #1342
- Don't use mmap on cuda by @EricLBuehler in #1336
- Support AWQ format models by @guoqingbao in #1350
- Fix uqff dummy layer ISQ application by @EricLBuehler in #1351
- Disable immediate isq if write_uqff by @EricLBuehler in #1352
- Fixes for cuda UQFF by @EricLBuehler in #1354
- Refactor Option references for model paths by @EricLBuehler in #1347
- Add a script for server benchmarking by @EricLBuehler in #1355
- Optimized Metal `qmv_fast` path by @EricLBuehler in #1356
- Fast sampler by @EricLBuehler in #1327
- Fix metal parallel sampling by @EricLBuehler in #1357
- Add immediate isq predicates for qwen3 by @EricLBuehler in #1358
- Regressions fixes by @EricLBuehler in #1359
- Revamped and smaller readme by @EricLBuehler in #1360
- Add a web chat app by @EricLBuehler in #1362
- Add chat history support to web chat app by @EricLBuehler in #1363
- Refactor web chat, fix multichat image restore by @EricLBuehler in #1364
- Fix repeated immediate isq init by @EricLBuehler in #1365
- Fix missing vision weights in Mistral3 UQFF by @EricLBuehler in #1366
- Rolling shard creation for uqff files by @EricLBuehler in #1367
- Fix instability during isq of afq by @EricLBuehler in #1368
- Support web chat file uploading by @EricLBuehler in #1370
- Add speech generation support to the web chat! by @EricLBuehler in #1373
- Prefix caching for PagedAttention by @EricLBuehler in #1369
- Metal PagedAttention accuracy improvements by @EricLBuehler in #1374
- Handle images in paged attn scheduler by @EricLBuehler in #1375
- Include schemas needed for chatcompletions endpoint by @matthewhaynesonline in #1353
- Fix case where prefix cacher returns no toks by @EricLBuehler in #1377
- Faster UQFF serialization by @EricLBuehler in #1379
- Experimental AFQ on CPU support by @EricLBuehler in #1380
- Add CPU flash attention by @EricLBuehler in #1382
- Refactor attention backends by @EricLBuehler in #1384
- Set MacOS thread affinity for cpu attn by @EricLBuehler in #1385
- Faster Qwen 3 MoE support on Metal by @EricLBuehler in #1387
- Fix PagedAttention block leaks by @EricLBuehler in #1388
- Fix cuda build again by @EricLBuehler in #1389
- Bump version to 0.6.0 by @EricLBuehler in #1390
- Fewer .contiguous calls for qwen3 moe by @EricLBuehler in #1391
- Allow speech models to accept batched inputs by @EricLBuehler in #1393
- Ring distributed backend for Metal by @EricLBuehler in #1238
- Add auto loader for vision/text detection by @EricLBuehler in #1402
- Proposal: Create Mistral.rs Server Core Lib by @matthewhaynesonline in #1346
- Support linear rope for llama3 by @EricLBuehler in #1408
- Fix vllama4 uqff loading by @EricLBuehler in #1409
- Handle receiver disconnects by @EricLBuehler in #1410
- Fix Qwen3 MoE device mapping irregularities by @EricLBuehler in #1411
- Fix interactive mode URL parsing by @EricLBuehler in #1412
- Refactor auto device map by @EricLBuehler in #1413
- Enable runtime sampling tweaks in interactive mode by @EricLBuehler in #1414
- Gumbel sampling for fast sampler by @EricLBuehler in #1416
- Improved CPU flash attention accuracy & performance by @EricLBuehler in #1417
- Provide chat_templates to container users by @sempervictus in #1419
- Faster cpu flash attn by @EricLBuehler in #1418
- Web search improvements (bm25, web chat) by @EricLBuehler in #1420
- Properly handle consecutive searches by @EricLBuehler in #1421
- Update docs by @matthewhaynesonline in #1422
- Better tool call detection logic by @EricLBuehler in #1424
- Add web search hook callbacks by @EricLBuehler in #1426
- Fix CUDA context switching, bind thread on CudaStorage drop by @EricLBuehler in #1428
- Conditionally build seqlens tensors by @EricLBuehler in #1429
- Add AGENTS.md by @EricLBuehler in #1430
- Support QWen3 GGUF model by @guoqingbao in #1432
- Improved paged attn prefix caching by @EricLBuehler in #1434
- Temporary fix for qwen3 gguf tokenizer by @guoqingbao in #1433
- Add tool callback support by @EricLBuehler in #1427
- centralize crate dependencies by @EricLBuehler in #1438
- Fix bug in tokenizer created with gguf metadata by @guoqingbao in #1440
- Update deps by @EricLBuehler in #1441
- Doc fixes by @EricLBuehler in #1442
- Downgrade rustyline 16.0.0 -> 15.0.0 by @EricLBuehler in #1444
- Support max_completion_tokens alias by @EricLBuehler in #1451
- Add the conformer backbone (phi4mm audio) by @EricLBuehler in #1448
- Fix offline cache issue for gguf models by @guoqingbao in #1452
- Add MCP server endpoints by @EricLBuehler in #1453
- MCP documentation pass by @EricLBuehler in #1455
- Integrate an MCP client by @EricLBuehler in #1456
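The agentic search work (#1243, #1301, #1420) surfaces an OpenAI-style `web_search_options` field on the chat completions endpoint. The sketch below sends such a request to a locally running `mistralrs-server`; the port, model name, and empty options object are illustrative assumptions, not documented defaults.

```rust
// Hedged sketch: POST an OpenAI-compatible chat completion request that opts in
// to web search. The port, model name, and option contents are assumptions.
use anyhow::Result;
use serde_json::json;

#[tokio::main]
async fn main() -> Result<()> {
    let body = json!({
        "model": "default",      // placeholder model name
        "messages": [
            { "role": "user", "content": "What is new in mistral.rs v0.6.0?" }
        ],
        "web_search_options": {} // request the agentic search path
    });

    // Assumes `mistralrs-server` is listening on localhost:1234.
    let text = reqwest::Client::new()
        .post("http://localhost:1234/v1/chat/completions")
        .json(&body)
        .send()
        .await?
        .error_for_status()?
        .text()
        .await?;
    println!("{text}");
    Ok(())
}
```

Assumed Cargo dependencies: `reqwest` (with the `json` feature), `serde_json`, `tokio`, and `anyhow`.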
New Contributors
- @edwko made their first contribution in #1246
- @beeender made their first contribution in #1306
- @Slowki made their first contribution in #1314
- @omahs made their first contribution in #1329
- @szepeviktor made their first contribution in #1331
- @matthewhaynesonline made their first contribution in #1353
- @sempervictus made their first contribution in #1419
Full Changelog: v0.5.0...v0.6.0