Still a work in progress and at a very early stage. A tutorial on LLM serving using MLX for system engineers. The codebase is based (almost!) solely on MLX array/matrix APIs, without any high-level neural network APIs, so that we can build the model serving infrastructure from scratch and dig into the optimizations.
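To give a flavor of what "array APIs only" means, here is a minimal sketch of scaled dot-product attention (the Week 1.1 topic) written with nothing but `mlx.core` operations. The function name and shapes are illustrative assumptions, not the book's exact interface.

```python
import mlx.core as mx

def scaled_dot_product_attention(q: mx.array, k: mx.array, v: mx.array) -> mx.array:
    # q, k, v: (..., seq_len, head_dim); illustrative sketch, not the book's API
    scale = q.shape[-1] ** -0.5
    scores = mx.matmul(q, mx.swapaxes(k, -1, -2)) * scale  # (..., seq_len, seq_len)
    weights = mx.softmax(scores, axis=-1)
    return mx.matmul(weights, v)                            # (..., seq_len, head_dim)

q = k = v = mx.random.normal((1, 8, 64))  # batch=1, seq_len=8, head_dim=64
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # (1, 8, 64)
```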
The goal is to learn the techniques behind efficiently serving large language models (specifically, the Qwen2 models).
Why MLX: nowadays it is easier to get a macOS-based local development environment than to set up an NVIDIA GPU.
Why Qwen2: it was the first LLM I interacted with -- it's the go-to example in the vLLM documentation. I spent some time reading the vLLM source code and built up some knowledge around it.
The tiny-llm book is available at https://skyzh.github.io/tiny-llm/. You can follow the guide and start building.
You may join skyzh's Discord server and study with the tiny-llm community.
Week + Chapter | Topic | Code | Test | Doc |
---|---|---|---|---|
1.1 | Attention | ✅ | ✅ | ✅ |
1.2 | RoPE | ✅ | ✅ | ✅ |
1.3 | Grouped Query Attention | ✅ | 🚧 | 🚧 |
1.4 | RMSNorm and MLP | ✅ | 🚧 | 🚧 |
1.5 | Transformer Block | ✅ | 🚧 | 🚧 |
1.6 | Load the Model | ✅ | 🚧 | 🚧 |
1.7 | Generate Responses (aka Decoding) | ✅ | ✅ | 🚧 |
2.1 | KV Cache | ✅ | 🚧 | 🚧 |
2.2 | Quantized Matmul and Linear - CPU | ✅ | 🚧 | 🚧 |
2.3 | Quantized Matmul and Linear - GPU | ✅ | 🚧 | 🚧 |
2.4 | Flash Attention and Other Kernels | 🚧 | 🚧 | 🚧 |
2.5 | Continuous Batching | 🚧 | 🚧 | 🚧 |
2.6 | Speculative Decoding | 🚧 | 🚧 | 🚧 |
2.7 | Prompt/Prefix Cache | 🚧 | 🚧 | 🚧 |
3.1 | Paged Attention - Part 1 | 🚧 | 🚧 | 🚧 |
3.2 | Paged Attention - Part 2 | 🚧 | 🚧 | 🚧 |
3.3 | Prefill-Decode Separation | 🚧 | 🚧 | 🚧 |
3.4 | Scheduler | 🚧 | 🚧 | 🚧 |
3.5 | Parallelism | 🚧 | 🚧 | 🚧 |
3.6 | AI Agent | 🚧 | 🚧 | 🚧 |
3.7 | Streaming API Server | 🚧 | 🚧 | 🚧 |
Other topics not covered: quantized/compressed KV cache.