# tiny-llm - LLM Serving in a Week


Still a work in progress and at a very early stage. A tutorial on LLM serving using MLX for system engineers. The codebase is based (almost!) solely on MLX array/matrix APIs, without any high-level neural network APIs, so that we can build the model serving infrastructure from scratch and dig into the optimizations.
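
As a taste of that from-scratch style, here is a minimal sketch of scaled dot-product attention (the topic of chapter 1.1) written only against `mlx.core` array ops. It is an illustration of the approach under assumed names and toy shapes, not the tiny-llm reference solution:

```python
import math
import mlx.core as mx

def scaled_dot_product_attention(q, k, v, mask=None):
    # softmax(q @ k^T / sqrt(head_dim)) @ v, with an optional additive mask
    scale = 1.0 / math.sqrt(q.shape[-1])
    scores = (q * scale) @ mx.swapaxes(k, -2, -1)  # (..., seq_len, seq_len)
    if mask is not None:
        scores = scores + mask  # e.g. -inf above the diagonal for causal decoding
    return mx.softmax(scores, axis=-1) @ v

# Toy inputs: (batch, seq_len, head_dim)
q = k = v = mx.random.normal((1, 8, 64))
print(scaled_dot_product_attention(q, k, v).shape)  # (1, 8, 64)
```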

The goal is to learn the techniques behind efficiently serving a large language model (in this case, the Qwen2 family of models).

Why MLX: nowadays it is easier to get a macOS-based local development environment than to set up an NVIDIA GPU.

Why Qwen2: it was the first LLM I interacted with, and it's the go-to example in the vLLM documentation. I spent some time reading the vLLM source code and built some knowledge around it.

## Book

The tiny-llm book is available at https://skyzh.github.io/tiny-llm/. You can follow the guide and start building.

## Community

You may join skyzh's Discord server and study with the tiny-llm community.


## Roadmap

| Week + Chapter | Topic | Code | Test | Doc |
|----------------|-------|------|------|-----|
| 1.1 | Attention | | | |
| 1.2 | RoPE | | | |
| 1.3 | Grouped Query Attention | | 🚧 | 🚧 |
| 1.4 | RMSNorm and MLP | | 🚧 | 🚧 |
| 1.5 | Transformer Block | | 🚧 | 🚧 |
| 1.6 | Load the Model | | 🚧 | 🚧 |
| 1.7 | Generate Responses (aka Decoding) | | | 🚧 |
| 2.1 | KV Cache | | 🚧 | 🚧 |
| 2.2 | Quantized Matmul and Linear - CPU | | 🚧 | 🚧 |
| 2.3 | Quantized Matmul and Linear - GPU | | 🚧 | 🚧 |
| 2.4 | Flash Attention and Other Kernels | 🚧 | 🚧 | 🚧 |
| 2.5 | Continuous Batching | 🚧 | 🚧 | 🚧 |
| 2.6 | Speculative Decoding | 🚧 | 🚧 | 🚧 |
| 2.7 | Prompt/Prefix Cache | 🚧 | 🚧 | 🚧 |
| 3.1 | Paged Attention - Part 1 | 🚧 | 🚧 | 🚧 |
| 3.2 | Paged Attention - Part 2 | 🚧 | 🚧 | 🚧 |
| 3.3 | Prefill-Decode Separation | 🚧 | 🚧 | 🚧 |
| 3.4 | Scheduler | 🚧 | 🚧 | 🚧 |
| 3.5 | Parallelism | 🚧 | 🚧 | 🚧 |
| 3.6 | AI Agent | 🚧 | 🚧 | 🚧 |
| 3.7 | Streaming API Server | 🚧 | 🚧 | 🚧 |

Other topics not covered: quantized/compressed KV cache.