- Dockerfiles (CUDA, CPU): https://github.com/EricLBuehler/mistral.rs/pkgs/container/mistral.rs
- PyPI packages (no features, cuda, mkl, metal, accelerate)
Highlights from v0.6.0
Major Features
- Llama 4 support and the Qwen 3 / Qwen 3 MoE / Qwen VL models, plus DeepSeek and DeepCoder integrations
- Multimodal prefix caching, paged attention scheduler improvements, and faster Metal/CUDA backends
- Web chat app with chat history, file uploads, speech generation, and revamped tool-calling/search
- Fast sampler and CPU FlashAttention with improved performance and accuracy
- Metal and CUDA: major improvements in quantization (AFQ, ISQ), UQFF handling, and memory optimizations (a quick-start sketch follows this list)
- MCP (Model Context Protocol): new server endpoints, docs, and integrated client
- Vision and audio expansion: support for SIGLIP, Dia 1.6b TTS, conformer backbone (Phi-4MM), auto loaders, and vision tool prefixes
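As a concrete entry point for these features, here is a minimal quick-start sketch of loading a model with in-situ quantization (ISQ) and sending one chat request through the mistralrs crate's builder-style Rust API. The model ID and ISQ level are illustrative placeholders, and the builder methods are assumed to match the crate's published examples rather than being prescribed by these notes.

```rust
// Minimal sketch, assuming the mistralrs crate's builder-style API; the model ID
// and ISQ level are placeholders, not recommendations from these release notes.
use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, TextMessages, TextModelBuilder};

#[tokio::main]
async fn main() -> Result<()> {
    // Load a text model and quantize it at load time via ISQ.
    let model = TextModelBuilder::new("Qwen/Qwen3-4B")
        .with_isq(IsqType::Q4K)
        .with_logging()
        .build()
        .await?;

    // Build an OpenAI-style message list and run a single chat request.
    let messages = TextMessages::new()
        .add_message(TextMessageRole::System, "You are a helpful assistant.")
        .add_message(TextMessageRole::User, "Explain what ISQ does in one sentence.");

    let response = model.send_chat_request(messages).await?;
    println!(
        "{}",
        response.choices[0]
            .message
            .content
            .as_deref()
            .unwrap_or_default()
    );
    Ok(())
}
```

Assumed Cargo dependencies for this sketch: `mistralrs`, `tokio`, and `anyhow`.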
Inference Optimizations
- Lightning-fast AFQ on CPU, optimized Qwen 3 MoE on Metal, and paged attention fixes
- Unified FlashAttention backend and automatic method selection for ISQ
- Metal precompilation support and reduced autorelease thrashing
Dev Improvements
- Refactored engine architecture, KV cache, attention backends, and device mapping logic
- Centralized dependency management and cleaner internal abstractions
- Streamlined and faster LoRA support
Other
- Revamped README, AGENTS.md, and new benchmarking scripts
- Interactive mode now shows throughput, supports Gumbel sampling, and offers better runtime sampling controls
- Expanded quant and GGUF support: AWQ, Qwen3 GGUF, and prequantized MLX compatibility (a GGUF loading sketch follows this list)
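To illustrate the expanded GGUF support, the sketch below loads a quantized GGUF file through the crate's GGUF builder. It assumes `GgufModelBuilder` mirrors the text-model builder shown above; the repository and file names are placeholders only.

```rust
// Minimal sketch, assuming a `GgufModelBuilder` that mirrors `TextModelBuilder`;
// the repo and file names below are placeholders, not tested artifacts.
use anyhow::Result;
use mistralrs::{GgufModelBuilder, TextMessageRole, TextMessages};

#[tokio::main]
async fn main() -> Result<()> {
    // Point the builder at a GGUF repo (or local path) and the quantized file to load.
    let model = GgufModelBuilder::new(
        "Qwen/Qwen3-8B-GGUF",                     // placeholder model repo
        vec!["Qwen3-8B-Q4_K_M.gguf".to_string()], // placeholder GGUF file
    )
    .with_logging()
    .build()
    .await?;

    let messages =
        TextMessages::new().add_message(TextMessageRole::User, "Hello from a GGUF model!");
    let response = model.send_chat_request(messages).await?;
    println!(
        "{}",
        response.choices[0]
            .message
            .content
            .as_deref()
            .unwrap_or_default()
    );
    Ok(())
}
```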
What's Changed
- Fix handling of Metal fused attn head dims by @EricLBuehler in #1234
- Support paged attn for vision model rust api by @EricLBuehler in #1235
- [Breaking] Support setting HF cache path by @EricLBuehler in #1237
- Support tool calling for DeepSeek models by @EricLBuehler in #1239
- Server image processing refactor by @EricLBuehler in #1244
- Optimized CUDA RoPE kernels by @EricLBuehler in #1247
- Typo fix (add_speial_tokens to add_special_tokens) by @edwko in #1246
- Fixes for UQFF + distributed layers by @EricLBuehler in #1250
- Automatic agentic search integration (`web_search_options`) by @EricLBuehler in #1243 (a request sketch follows this changelog)
- Format kernels by @EricLBuehler in #1251
- Add quantize guards for UQFF deserialize by @EricLBuehler in #1252
- Refactor cuBLASlt-related code by @EricLBuehler in #1253
- Update deps, bump pyo3 version by @EricLBuehler in #1259
- Faster cuda FP8 performance by @EricLBuehler in #1257
- Rust 1.86 clippy by @EricLBuehler in #1260
- Refactor engine arch by @EricLBuehler in #1262
- Revamped LoRA support - removing the Ordering system by @EricLBuehler in #1263
- Fast Metal-specific quantization method: AFQ by @EricLBuehler in #1264
- Support prequantized models from MLX by @EricLBuehler in #1265
- Automatic ISQ to select fastest & most accurate method by @EricLBuehler in #1266
- Improved usage metrics by @EricLBuehler in #1267
- Bump tokio from 1.44.1 to 1.44.2 by @dependabot in #1270
- Gather MM ops in mistralrs-quant by @EricLBuehler in #1272
- Improve performance of deepseek models by @guoqingbao in #1274
- Implement Llama 4 by @EricLBuehler in #1268
- Fixes for Llama 4 UQFF loading by @EricLBuehler in #1275
- Support sharding for UQFF by @EricLBuehler in #1276
- Fix bug for group-topk (group_limited_greedy) in deepseek models by @guoqingbao in #1278
- Support the DeepCoder model by @EricLBuehler in #1279
- Improved PagedAttn scheduling accuracy by @EricLBuehler in #1282
- Fixes for scheduling image seqs with pagedattn by @EricLBuehler in #1283
- update to llguidance 0.7.16 by @mmoskal in #1284
- Update dependencies by @EricLBuehler in #1286
- Much faster image inputs processing by @EricLBuehler in #1289
- Add more SDPA head dims for much faster SIGLIP by @EricLBuehler in #1290
- Show throughput in interactive mode by @EricLBuehler in #1291
- Unify bitwise operations by @EricLBuehler in #1288
- Multimodal prefix caching support! by @EricLBuehler in #1298
- Interactive mode improvements by @EricLBuehler in #1299
- Add the Qwen 3 and Qwen 3 MoE models by @EricLBuehler in #1285
- Revamped and streaming web search support by @EricLBuehler in #1301
- Handle vision messages or different tool call prefixes by @EricLBuehler in #1302
- Simplify prefix cacher by @EricLBuehler in #1305
- Use rustyline to handle non-ascii in interactive mode by @beeender in #1306
- Add more tools for automatic search by @EricLBuehler in #1307
- Fix CPU hogging in interactive mode by @beeender in #1309
- Add Metal precompilation support by @EricLBuehler in #1311
- Reduce thrashing of Metal autorelease by @EricLBuehler in #1313
- Make `AdapterPaths` and `LoraAdapterPaths` public by @Slowki in #1314
- Refactor KV cache manager by @EricLBuehler in #1315
- Add `Audio` and `Speech` model categories by @Slowki in #1317
- Remove has_conv2d from vision model API by @EricLBuehler in #1318
- Unified/automatic flash attention enabler by @EricLBuehler in #1319
- Fix cublaslt 4d mask by @EricLBuehler in #1320
- Qwen VL models fixes by @EricLBuehler in #1322
- Fixes for all vision models by @EricLBuehler in #1323
- Improved+faster LRU prefix cacher and sampler! by @EricLBuehler in #1321
- Inplace ISQ support and default to mmap by @EricLBuehler in #1277
- Fix typos by @omahs in #1329
- Fix Idefics 3 arch chat templating by @EricLBuehler in #1330
- Remove two spaces from PR comment by @szepeviktor in #1331
- Add automatic vision loader type by @EricLBuehler in #1332
- Add the Dia 1.6b TTS model! by @EricLBuehler in #1304
- Update `llguidance` to `0.7.20` by @Slowki in #1334
- Add model category <> messages check by @EricLBuehler in #1335
- Improve normalization integration test by @EricLBuehler in #1340
- Fix streaming example print statement by @EricLBuehler in #1339
- Fix normalization formula in comment by @EricLBuehler in #1338
- Fix image_to_pixels for non-RGB images by @EricLBuehler in #1337
- Fix typo in expect messages by @EricLBuehler in #1342
- Don't use mmap on cuda by @EricLBuehler in #1336
- Support AWQ format models by @guoqingbao in #1350
- Fix uqff dummy layer ISQ application by @EricLBuehler in #1351
- Disable immediate isq if write_uqff by @EricLBuehler in #1352
- Fixes for cuda UQFF by @EricLBuehler in #1354
- Refactor Option references for model paths by @EricLBuehler in #1347
- Add a script for server benchmarking by @EricLBuehler in #1355
- Optimized Metal `qmv_fast` path by @EricLBuehler in #1356
- Fast sampler by @EricLBuehler in #1327
- Fix metal parallel sampling by @EricLBuehler in #1357
- Add immediate isq predicates for qwen3 by @EricLBuehler in #1358
- Regressions fixes by @EricLBuehler in #1359
- Revamped and smaller readme by @EricLBuehler in #1360
- Add a web chat app by @EricLBuehler in #1362
- Add chat history support to web chat app by @EricLBuehler in #1363
- Refactor web chat, fix multichat image restore by @EricLBuehler in #1364
- Fix repeated immediate isq init by @EricLBuehler in #1365
- Fix missing vision weights in Mistral3 UQFF by @EricLBuehler in #1366
- Rolling shard creation for uqff files by @EricLBuehler in #1367
- Fix instability during isq of afq by @EricLBuehler in #1368
- Support web chat file uploading by @EricLBuehler in #1370
- Add speech generation support to the web chat! by @EricLBuehler in #1373
- Prefix caching for PagedAttention by @EricLBuehler in #1369
- Metal PagedAttention accuracy improvements by @EricLBuehler in #1374
- Handle images in paged attn scheduler by @EricLBuehler in #1375
- Include schemas needed for chatcompletions endpoint by @matthewhaynesonline in #1353
- Fix case where prefix cacher returns no toks by @EricLBuehler in #1377
- Faster UQFF serialization by @EricLBuehler in #1379
- Experimental AFQ on CPU support by @EricLBuehler in #1380
- Add CPU flash attention by @EricLBuehler in #1382
- Refactor attention backends by @EricLBuehler in #1384
- Set MacOS thread affinity for cpu attn by @EricLBuehler in #1385
- Faster Qwen 3 MoE support on Metal by @EricLBuehler in #1387
- Fix PagedAttention block leaks by @EricLBuehler in #1388
- Fix cuda build again by @EricLBuehler in #1389
- Bump version to 0.6.0 by @EricLBuehler in #1390
- Fewer .contiguous calls for qwen3 moe by @EricLBuehler in #1391
- Allow speech models to accept batched inputs by @EricLBuehler in #1393
- Ring distributed backend for Metal by @EricLBuehler in #1238
- Add auto loader for vision/text detection by @EricLBuehler in #1402
- Proposal: Create Mistral.rs Server Core Lib by @matthewhaynesonline in #1346
- Support linear rope for llama3 by @EricLBuehler in #1408
- Fix vllama4 uqff loading by @EricLBuehler in #1409
- Handle receiver disconnects by @EricLBuehler in #1410
- Fix Qwen3 MoE device mapping irregularities by @EricLBuehler in #1411
- Fix interactive mode URL parsing by @EricLBuehler in #1412
- Refactor auto device map by @EricLBuehler in #1413
- Enable runtime sampling tweaks in interactive mode by @EricLBuehler in #1414
- Gumbel sampling for fast sampler by @EricLBuehler in #1416
- Improved CPU flash attention accuracy & performance by @EricLBuehler in #1417
- Provide chat_templates to container users by @sempervictus in #1419
- Faster cpu flash attn by @EricLBuehler in #1418
- Web search improvements (bm25, web chat) by @EricLBuehler in #1420
- Properly handle consecutive searches by @EricLBuehler in #1421
- Update docs by @matthewhaynesonline in #1422
- Better tool call detection logic by @EricLBuehler in #1424
- Add web search hook callbacks by @EricLBuehler in #1426
- Fix CUDA context switching, bind thread on CudaStorage drop by @EricLBuehler in #1428
- Conditionally build seqlens tensors by @EricLBuehler in #1429
- Add AGENTS.md by @EricLBuehler in #1430
- Support QWen3 GGUF model by @guoqingbao in #1432
- Improved paged attn prefix caching by @EricLBuehler in #1434
- Temporary fix for qwen3 gguf tokenizer by @guoqingbao in #1433
- Add tool callback support by @EricLBuehler in #1427
- centralize crate dependencies by @EricLBuehler in #1438
- Fix bug in tokenizer created with gguf metadata by @guoqingbao in #1440
- Update deps by @EricLBuehler in #1441
- Doc fixes by @EricLBuehler in #1442
- Downgrade rustyline 16.0.0 -> 15.0.0 by @EricLBuehler in #1444
- Support max_completion_tokens alias by @EricLBuehler in #1451
- Add the conformer backbone (phi4mm audio) by @EricLBuehler in #1448
- Fix offline cache issue for gguf models by @guoqingbao in #1452
- Add MCP server endpoints by @EricLBuehler in #1453
- MCP documentation pass by @EricLBuehler in #1455
- Integrate an MCP client by @EricLBuehler in #1456
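The agentic search work (#1243, #1301, #1420) surfaces an OpenAI-style `web_search_options` field on the chat completions endpoint. The sketch below sends such a request to a locally running `mistralrs-server`; the port, model name, and empty options object are illustrative assumptions, not documented defaults.

```rust
// Hedged sketch: POST an OpenAI-compatible chat completion request that opts in
// to web search. The port, model name, and option contents are assumptions.
use anyhow::Result;
use serde_json::json;

#[tokio::main]
async fn main() -> Result<()> {
    let body = json!({
        "model": "default",      // placeholder model name
        "messages": [
            { "role": "user", "content": "What is new in mistral.rs v0.6.0?" }
        ],
        "web_search_options": {} // request the agentic search path
    });

    // Assumes `mistralrs-server` is listening on localhost:1234.
    let text = reqwest::Client::new()
        .post("http://localhost:1234/v1/chat/completions")
        .json(&body)
        .send()
        .await?
        .error_for_status()?
        .text()
        .await?;
    println!("{text}");
    Ok(())
}
```

Assumed Cargo dependencies: `reqwest` (with the `json` feature), `serde_json`, `tokio`, and `anyhow`.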
New Contributors
- @edwko made their first contribution in #1246
- @beeender made their first contribution in #1306
- @Slowki made their first contribution in #1314
- @omahs made their first contribution in #1329
- @szepeviktor made their first contribution in #1331
- @matthewhaynesonline made their first contribution in #1353
- @sempervictus made their first contribution in #1419
Full Changelog: v0.5.0...v0.6.0