Still a work in progress and at a very early stage. A tutorial on LLM serving using MLX for system engineers. The codebase is based (almost!) solely on MLX array/matrix APIs, without any high-level neural network APIs, so that we can build the model serving infrastructure from scratch and dig into the optimizations.
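To give a flavor of what "array APIs only" means, here is a minimal sketch of scaled dot-product attention (the Week 1.1 topic) written with nothing but `mlx.core` operations. The function name and shapes are illustrative assumptions, not the book's exact interface.

```python
import mlx.core as mx

def scaled_dot_product_attention(q: mx.array, k: mx.array, v: mx.array) -> mx.array:
    # q, k, v: (..., seq_len, head_dim); illustrative sketch, not the book's API
    scale = q.shape[-1] ** -0.5
    scores = mx.matmul(q, mx.swapaxes(k, -1, -2)) * scale  # (..., seq_len, seq_len)
    weights = mx.softmax(scores, axis=-1)
    return mx.matmul(weights, v)                            # (..., seq_len, head_dim)

q = k = v = mx.random.normal((1, 8, 64))  # batch=1, seq_len=8, head_dim=64
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # (1, 8, 64)
```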
The goal is to learn the techniques behind efficiently serving large language models (specifically, the Qwen2 models).
Why MLX: nowadays it is easier to get a macOS-based local development environment than to set up an NVIDIA GPU.
Why Qwen2: it was the first LLM I interacted with -- it's the go-to example in the vLLM documentation. I spent some time reading the vLLM source code and built up some knowledge around it.
The tiny-llm book is available at https://skyzh.github.io/tiny-llm/. You can follow the guide and start building.
You may join skyzh's Discord server and study with the tiny-llm community.
Week + Chapter | Topic | Code | Test | Doc |
---|---|---|---|---|
1.1 | Attention | ✅ | ✅ | ✅ |
1.2 | RoPE | ✅ | ✅ | ✅ |
1.3 | Grouped Query Attention | ✅ | 🚧 | 🚧 |
1.4 | RMSNorm and MLP | ✅ | 🚧 | 🚧 |
1.5 | Transformer Block | ✅ | 🚧 | 🚧 |
1.6 | Load the Model | ✅ | 🚧 | 🚧 |
1.7 | Generate Responses (aka Decoding) | ✅ | ✅ | 🚧 |
2.1 | KV Cache | ✅ | 🚧 | 🚧 |
2.2 | Quantized Matmul and Linear - CPU | ✅ | 🚧 | 🚧 |
2.3 | Quantized Matmul and Linear - GPU | ✅ | 🚧 | 🚧 |
2.4 | Flash Attention and Other Kernels | 🚧 | 🚧 | 🚧 |
2.5 | Continuous Batching | 🚧 | 🚧 | 🚧 |
2.6 | Speculative Decoding | 🚧 | 🚧 | 🚧 |
2.7 | Prompt/Prefix Cache | 🚧 | 🚧 | 🚧 |
3.1 | Paged Attention - Part 1 | 🚧 | 🚧 | 🚧 |
3.2 | Paged Attention - Part 2 | 🚧 | 🚧 | 🚧 |
3.3 | Prefill-Decode Separation | 🚧 | 🚧 | 🚧 |
3.4 | Scheduler | 🚧 | 🚧 | 🚧 |
3.5 | Parallelism | 🚧 | 🚧 | 🚧 |
3.6 | AI Agent | 🚧 | 🚧 | 🚧 |
3.7 | Streaming API Server | 🚧 | 🚧 | 🚧 |
Other topics not covered: quantized/compressed KV cache.