
v0.6.0

@EricLBuehler released this 10 Jun 23:28
· 26 commits to master since this release

πŸ”₯ Highlights from v0.6.0

πŸš€ Major Features

  • Llama 4 support and Qwen 3 / MoE / VL models, including DeepSeek and DeepCoder integrations
  • Multimodal prefix caching, paged attention scheduler improvements, and faster Metal/CUDA backends
  • Web chat app with chat history, file uploads, speech generation, and revamped tool-calling/search
  • Fast sampler and CPU FlashAttention with improved performance and accuracy
  • Metal and CUDA: major improvements in quantization (AFQ, ISQ), UQFF handling, and memory optimizations
  • MCP (Model Context Protocol): new server endpoints, docs, and integrated client
  • Vision and audio expansion: support for SigLIP, Dia 1.6b TTS, conformer backbone (Phi-4MM), auto loaders, and vision tool prefixes
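
The MCP server mentioned above speaks JSON-RPC 2.0, as the Model Context Protocol specifies. As a rough, hand-rolled illustration (stdlib-only, not the mistral.rs client API), a minimal `tools/list` request body can be assembled like this:

```rust
// Hypothetical sketch: MCP messages are JSON-RPC 2.0. This builds a
// minimal `tools/list` request by hand, with no JSON library.
fn mcp_request(id: u64, method: &str) -> String {
    format!(
        "{{\"jsonrpc\":\"2.0\",\"id\":{},\"method\":\"{}\"}}",
        id, method
    )
}

fn main() {
    // A client would send this over the server's MCP transport.
    let req = mcp_request(1, "tools/list");
    println!("{}", req);
}
```

In practice a real client would use a JSON serializer and the transport documented for the server; this only shows the message shape.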

🧠 Inference Optimizations

  • Lightning-fast AFQ on CPU, optimized Qwen 3 MoE on Metal, and paged attention fixes
  • Unified FlashAttention backend and automatic method selection for ISQ
  • Metal precompilation support and reduced autorelease thrashing
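
As background on the paged-attention items above: the core idea of a paged KV cache is to indirect logical token positions through a block table into fixed-size physical blocks, so cache memory need not be contiguous. A minimal addressing sketch (all names hypothetical, not the mistral.rs internals):

```rust
// Paged-attention style KV addressing: logical token positions map to
// (physical block, offset) pairs via a per-sequence block table.
const BLOCK_SIZE: usize = 16;

/// Translate a logical token index into (physical_block, offset_in_block).
fn locate(block_table: &[usize], token_idx: usize) -> (usize, usize) {
    let logical_block = token_idx / BLOCK_SIZE;
    let offset = token_idx % BLOCK_SIZE;
    (block_table[logical_block], offset)
}

fn main() {
    // Logical blocks 0..3 of this sequence live in physical blocks 7, 2, 9.
    let block_table = vec![7, 2, 9];
    println!("{:?}", locate(&block_table, 17)); // token 17 -> block 2, slot 1
}
```

The scheduler improvements in this release concern how such blocks are allocated and reused across requests; the lookup itself stays this cheap.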

🧰 Dev Improvements

  • Refactored engine architecture, KV cache, attention backends, and device mapping logic
  • Centralized dependency management and cleaner internal abstractions
  • Streamlined and faster LoRA support

πŸŽ‰ Other

  • Revamped README, AGENTS.md, and new benchmarking scripts
  • Interactive mode now shows throughput, supports Gumbel sampling, and better runtime sampling controls
  • Expanded quant and GGUF support: AWQ, Qwen3 GGUF, and prequantized MLX compatibility
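
For the Gumbel sampling mentioned above, the underlying Gumbel-max trick is: pick `argmax(logit_i + g_i)` where each `g_i` is Gumbel(0,1) noise, which is equivalent to sampling from `softmax(logits)`. A stdlib-only sketch (using a tiny LCG for uniforms so it is self-contained; not the mistral.rs sampler):

```rust
// Gumbel-max sampling sketch: argmax over logits perturbed by
// Gumbel(0,1) noise g = -ln(-ln(u)), u ~ Uniform(0,1).
fn gumbel_argmax(logits: &[f64], seed: &mut u64) -> usize {
    let mut best = (0usize, f64::NEG_INFINITY);
    for (i, &logit) in logits.iter().enumerate() {
        // LCG step, then map the top 53 bits to a uniform in (0, 1].
        *seed = seed
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        let u = ((*seed >> 11) as f64 + 1.0) / (1u64 << 53) as f64;
        let g = -(-u.ln()).ln(); // Gumbel(0,1) noise
        if logit + g > best.1 {
            best = (i, logit + g);
        }
    }
    best.0
}

fn main() {
    let logits = [2.0, 0.5, -1.0];
    let mut seed = 42u64;
    let mut counts = [0usize; 3];
    for _ in 0..10_000 {
        counts[gumbel_argmax(&logits, &mut seed)] += 1;
    }
    // The largest logit should be sampled most often.
    println!("{:?}", counts);
}
```

One practical appeal of this formulation is that it needs no normalization pass over the logits, which is why it suits a fast sampler.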

⸻

What's Changed

New Contributors

Full Changelog: v0.5.0...v0.6.0