
Commit 9bbd859

authored
Add the doc of mlp block (#13)
* add the doc of mlp Signed-off-by: Connor1996 <zbk602423539@gmail.com>
* update index Signed-off-by: Connor1996 <zbk602423539@gmail.com>
---------
Signed-off-by: Connor1996 <zbk602423539@gmail.com>
1 parent e4aba84 commit 9bbd859

File tree: 9 files changed, +136 -15 lines changed


README.md

Lines changed: 1 addition & 1 deletion
@@ -29,7 +29,7 @@ You may join skyzh's Discord server and study with the tiny-llm community.
 | 1.1 | Attention ||||
 | 1.2 | RoPE ||||
 | 1.3 | Grouped Query Attention ||||
-| 1.4 | RMSNorm and MLP || 🚧 | 🚧 |
+| 1.4 | RMSNorm and MLP || | |
 | 1.5 | Transformer Block || 🚧 | 🚧 |
 | 1.6 | Load the Model || 🚧 | 🚧 |
 | 1.7 | Generate Responses (aka Decoding) ||| 🚧 |

book/src/glossary.md

Lines changed: 5 additions & 0 deletions
@@ -4,5 +4,10 @@
 - [Multi Head Attention](./week1-01-attention.md)
 - [Linear](./week1-01-attention.md)
 - [Rotary Positional Encoding](./week1-02-positional-encodings.md)
+- [Grouped Query Attention](./week1-03-gqa.md)
+- [RMSNorm](./week1-04-rmsnorm-and-mlp.md)
+- [SiLU](./week1-04-rmsnorm-and-mlp.md)
+- [SwiGLU](./week1-04-rmsnorm-and-mlp.md)
+- [MLP](./week1-04-rmsnorm-and-mlp.md)
 
 {{#include copyright.md}}

book/src/week1-01-attention.md

Lines changed: 3 additions & 3 deletions
@@ -100,7 +100,7 @@ src/tiny_llm/attention.py
 Implement `MultiHeadAttention`. The layer takes a batch of vectors, maps it through the K, V, Q weight matrices, and uses the attention function we implemented in task 1 to compute the result. The output needs to be mapped using the O
 weight matrix.
 
-You will also need to implement the `linear` function first. For `linear`, it takes a tensor of the shape `N.. x I`, a weight matrix of the shape `O x I`, and a bias vector of the shape `O`. The output is of the shape `N.. x O`. `I` is the input dimension and `O` is the output dimension.
+You will also need to implement the `linear` function in `basics.py` first. For `linear`, it takes a tensor of the shape `N.. x I`, a weight matrix of the shape `O x I`, and a bias vector of the shape `O`. The output is of the shape `N.. x O`. `I` is the input dimension and `O` is the output dimension.
 
 For the `MultiHeadAttention` layer, the input tensors `query`, `key`, `value` have the shape `N x L x E`, where `E` is the dimension of the
 embedding for a given token in the sequence. The `K/Q/V` weight matrices will map the tensor into key, value, and query
@@ -123,9 +123,9 @@ H is num_heads
 D is head_dim
 L is seq_len, in PyTorch API it's S (source len)
 
-W_q/W_k/W_v: E x (H x D)
+w_q/w_k/w_v: E x (H x D)
 output/input: N x L x E
-W_o: (H x D) x E
+w_o: (H x D) x E
 ```
 
 At the end of the day, you should be able to pass the following tests:
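
For illustration, here is a minimal sketch of the `linear` contract described in the hunk above, assuming `mlx.core` as the array library (the same one the tests in this commit use); the actual `basics.py` implementation may differ:

```python
from typing import Optional

import mlx.core as mx


def linear(x: mx.array, w: mx.array, bias: Optional[mx.array] = None) -> mx.array:
    # x: N.. x I, w: O x I, bias: O  ->  output: N.. x O
    y = x @ w.T  # contract the trailing I dimension of x with the I dimension of w
    if bias is not None:
        y = y + bias
    return y
```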

book/src/week1-04-rmsnorm-and-mlp.md

Lines changed: 60 additions & 2 deletions
@@ -12,7 +12,7 @@ In day 4, we will implement two crucial components of the Qwen2 Transformer arch
 
 ## Task 1: Implement `RMSNorm`
 
-You will need to implement the `RMSNorm` layer in:
+In this task, we will implement the `RMSNorm` layer.
 
 ```
 src/tiny_llm/layer_norm.py
@@ -55,6 +55,64 @@ pdm run test -k week_1_day_4_task_1 -v
 
 ## Task 2: Implement the MLP Block
 
-TBD...
+In this task, we will implement the MLP block named `Qwen2MLP`.
+
+```
+src/tiny_llm/qwen2_week1.py
+```
+
+The original Transformer model utilized a simple Feed-Forward Network (FFN) within each block. This FFN typically consisted of two linear transformations with a ReLU activation in between, applied position-wise.
+
+Modern Transformer architectures, including Qwen2, often employ more advanced FFN variants for improved performance. Qwen2 uses a specific type of Gated Linear Unit (GLU) called SwiGLU.
+
+**📚 Readings**
+* [Attention is All You Need (Transformer Paper, Section 3.3 "Position-wise Feed-Forward Networks")](https://arxiv.org/abs/1706.03762)
+* [GLU Paper (Language Modeling with Gated Convolutional Networks)](https://arxiv.org/pdf/1612.08083)
+* [SiLU (Swish) activation function](https://arxiv.org/pdf/1710.05941)
+* [SwiGLU Paper (GLU Variants Improve Transformer)](https://arxiv.org/abs/2002.05202v1)
+* [PyTorch SiLU documentation](https://pytorch.org/docs/stable/generated/torch.nn.SiLU.html)
+* [Qwen2 layers implementation in mlx-lm (includes MLP)](https://github.com/ml-explore/mlx-lm/blob/main/mlx_lm/models/qwen2.py)
+
+Essentially, SwiGLU is a combination of GLU and the SiLU (Sigmoid Linear Unit) activation function:
+- GLU is a gating mechanism that allows the model to learn which parts of the input to focus on. It typically involves an element-wise product of two linear projections of the input, one of which might be passed through an activation function. Compared to the ReLU used in the original FFN, GLU can help the model learn more complex relationships in the data, deciding which features to keep and which to discard.
+- SiLU (Sigmoid Linear Unit) is a smooth, non-monotonic activation function that has been shown to perform well in various deep learning tasks. Compared to the ReLU and sigmoid used in GLU, it is fully differentiable without zero-gradient "dead zones" and retains a non-zero output even for negative inputs.
+
+You need to implement the `silu` function in `basics.py` first. For `silu`, it takes a tensor of the shape `N.. x I` and returns a tensor of the same shape.
+The `silu` function is defined as:
+$$
+\text{SiLU}(x) = x * \text{sigmoid}(x) = \frac{x}{1 + e^{-x}}
+$$
+
+
+Then implement `Qwen2MLP`. The structure for Qwen2's MLP block is:
+* A gate linear projection ($W_{gate}$).
+* An up linear projection ($W_{up}$).
+* A SiLU activation function applied to the output of $W_{gate}$.
+* An element-wise multiplication of the SiLU-activated $W_{gate}$ output and the $W_{up}$ output. This forms the "gated" part.
+* A final down linear projection ($W_{down}$).
+
+This can be expressed as:
+$$
+\text{MLP}(x) = W_{down}(\text{SiLU}(W_{gate}(x)) \odot W_{up}(x))
+$$
+where $\odot$ denotes element-wise multiplication. All linear projections in Qwen2's MLP are typically implemented without bias.
+
+```
+N.. is zero or more dimensions for batches
+E is hidden_size (embedding dimension of the model)
+I is intermediate_size (dimension of the hidden layer in MLP)
+L is the sequence length
+
+input: N.. x L x E
+w_gate: I x E
+w_up: I x E
+w_down: E x I
+output: N.. x L x E
+```
+
+You can test your implementation by running:
+```bash
+pdm run test -k week_1_day_4_task_2 -v
+```
 
 {{#include copyright.md}}
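
To make the `silu` definition in the doc above concrete, here is a minimal sketch, assuming `mlx.core` as the array library (as used by the tests in this commit); the reference `basics.py` implementation may differ:

```python
import mlx.core as mx


def silu(x: mx.array) -> mx.array:
    # SiLU(x) = x * sigmoid(x) = x / (1 + exp(-x)); the input shape N.. x I is preserved
    return x * mx.sigmoid(x)
```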

src/tiny_llm/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -4,6 +4,6 @@
 from .layer_norm import *
 from .positional_encoding import *
 from .quantize import *
-from .qwen2_week1 import Qwen2ModelWeek1
+from .qwen2_week1 import *
 from .generate import *
 from .qwen2_week2 import Qwen2ModelWeek2

src/tiny_llm_ref/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -6,5 +6,5 @@
 from .quantize import *
 from .generate import *
 from .kv_cache import *
-from .qwen2_week1 import Qwen2ModelWeek1
+from .qwen2_week1 import *
 from .qwen2_week2 import Qwen2ModelWeek2

tests/test_layer_norm.py

Lines changed: 3 additions & 1 deletion
@@ -9,7 +9,9 @@
 @pytest.mark.parametrize("target", ["torch", "mlx"])
 @pytest.mark.parametrize("stream", AVAILABLE_STREAMS, ids=AVAILABLE_STREAMS_IDS)
 @pytest.mark.parametrize("precision", PRECISIONS, ids=PRECISION_IDS)
-def test_rms_norm_week_1_day_4_task_1(stream: mx.Stream, precision: np.dtype, target: str):
+def test_rms_norm_week_1_day_4_task_1(
+    stream: mx.Stream, precision: np.dtype, target: str
+):
     SIZE = 100
     SIZE_Y = 111
     with mx.stream(stream):
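
For context on what this test exercises, here is a minimal sketch of an RMSNorm layer: normalize by the root mean square over the last dimension, then scale by a learned weight. The constructor signature and the float32 accumulation are assumptions for illustration only; the layer in `src/tiny_llm/layer_norm.py` may differ:

```python
import mlx.core as mx


class RMSNorm:
    def __init__(self, weight: mx.array, eps: float = 1e-6):
        self.weight = weight  # one scale per feature in the last dimension
        self.eps = eps

    def __call__(self, x: mx.array) -> mx.array:
        # y = x / sqrt(mean(x^2, axis=-1) + eps) * weight
        orig_dtype = x.dtype
        xf = x.astype(mx.float32)  # accumulate in float32 for numerical stability
        inv_rms = mx.rsqrt(mx.mean(mx.square(xf), axis=-1, keepdims=True) + self.eps)
        return (xf * inv_rms).astype(orig_dtype) * self.weight
```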

tests/test_qwen2_mlp.py

Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
+import mlx.core as mx
+import pytest
+from mlx_lm.models import qwen2
+import numpy as np
+
+from .tiny_llm_base import *
+from .utils import *
+
+# Define different dimension parameters for testing
+DIM_PARAMS = [
+    {"batch_size": 1, "seq_len": 5, "dim": 4, "hidden_dim": 8, "id": "small_dims"},
+    {"batch_size": 2, "seq_len": 16, "dim": 32, "hidden_dim": 64, "id": "large_dims"},
+    {
+        "batch_size": 1,
+        "seq_len": 1,
+        "dim": 128,
+        "hidden_dim": 256,
+        "id": "single_token",
+    },
+]
+DIM_PARAMS_IDS = [d["id"] for d in DIM_PARAMS]
+
+
+@pytest.mark.parametrize("stream", AVAILABLE_STREAMS, ids=AVAILABLE_STREAMS_IDS)
+@pytest.mark.parametrize("precision", PRECISIONS, ids=PRECISION_IDS)
+@pytest.mark.parametrize("dims", DIM_PARAMS, ids=DIM_PARAMS_IDS)
+def test_qwen2_mlp_week_1_day_4_task_2(
+    stream: mx.Stream, precision: np.dtype, dims: dict
+):
+    BATCH_SIZE, SEQ_LEN, DIM, HIDDEN_DIM = (
+        dims["batch_size"],
+        dims["seq_len"],
+        dims["dim"],
+        dims["hidden_dim"],
+    )
+
+    with mx.stream(stream):
+        mx_precision = np_type_to_mx_type(precision)
+        x = mx.random.uniform(shape=(BATCH_SIZE, SEQ_LEN, DIM)).astype(mx_precision)
+        w_gate = mx.random.uniform(shape=(HIDDEN_DIM, DIM)).astype(mx_precision)
+        w_up = mx.random.uniform(shape=(HIDDEN_DIM, DIM)).astype(mx_precision)
+        w_down = mx.random.uniform(shape=(DIM, HIDDEN_DIM)).astype(mx_precision)
+
+        user_mlp = Qwen2MLP(
+            dim=DIM, hidden_dim=HIDDEN_DIM, w_gate=w_gate, w_up=w_up, w_down=w_down
+        )
+        user_output = user_mlp(x)
+
+        reference_mlp = qwen2.MLP(dim=DIM, hidden_dim=HIDDEN_DIM)
+        reference_mlp.gate_proj.weight = w_gate
+        reference_mlp.up_proj.weight = w_up
+        reference_mlp.down_proj.weight = w_down
+        reference_output = reference_mlp(x)
+
+        assert_allclose(user_output, reference_output, precision)
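
The test above fixes the `Qwen2MLP` interface (`dim`, `hidden_dim`, `w_gate`, `w_up`, `w_down`, no biases). Here is a minimal sketch consistent with that interface and with the SwiGLU formula from the doc; the reference implementation in `src/tiny_llm/qwen2_week1.py` may structure this differently, for example by reusing the `linear` and `silu` helpers from `basics.py`:

```python
import mlx.core as mx


class Qwen2MLP:
    def __init__(
        self,
        dim: int,
        hidden_dim: int,
        w_gate: mx.array,  # hidden_dim x dim (I x E)
        w_up: mx.array,    # hidden_dim x dim (I x E)
        w_down: mx.array,  # dim x hidden_dim (E x I)
    ):
        self.w_gate = w_gate
        self.w_up = w_up
        self.w_down = w_down

    def __call__(self, x: mx.array) -> mx.array:
        # MLP(x) = W_down(SiLU(W_gate(x)) * W_up(x)), element-wise product, no biases
        gate = x @ self.w_gate.T              # N.. x L x I
        up = x @ self.w_up.T                  # N.. x L x I
        gated = gate * mx.sigmoid(gate) * up  # SiLU(gate) gated by up
        return gated @ self.w_down.T          # back to N.. x L x E
```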

tests/utils.py

Lines changed: 7 additions & 6 deletions
@@ -31,12 +31,13 @@ def assert_allclose(
     atol = atol or 1.0e-5
     assert a.shape == b.shape, f"shape mismatch: {a.shape} vs {b.shape}"
     if not np.allclose(a, b, rtol=rtol, atol=atol):
-        print("a=", a)
-        print("b=", b)
-        diff = np.invert(np.isclose(a, b, rtol=rtol, atol=atol))
-        print("diff_a=", a * diff)
-        print("diff_b=", b * diff)
-        assert False, f"result mismatch"
+        with np.printoptions(precision=3, suppress=True):
+            print("a=", a)
+            print("b=", b)
+            diff = np.invert(np.isclose(a, b, rtol=rtol, atol=atol))
+            print("diff_a=", a * diff)
+            print("diff_b=", b * diff)
+            assert False, f"result mismatch"
 
 
 def np_type_to_mx_type(np_type: np.dtype) -> mx.Dtype:

0 commit comments
