This repo is a constant WIP.
It started as a "package-like" version following the great
LLM-from-scratch repo/book structure from @rasbt.
Little by little, it became a base for verbose re-implementations of different architectures and research papers: put simply, LLM stuff that piques my interest, for experiments and learning.
- WIP: Reinforcement Pretraining (RPT) from scratch
- Qwen GSPO (Group Sequence Policy Optimization)
- Moonshot.ai's standalone QK-Clip technique (from MuonClip) and own Magnitude-QK-Clip variant
- RLVR Reasoning with GRPO from scratch (currently slow (no KV cache) + unfinished README)
- Vision Transformer (ViT) from scratch
- RLHF with GRPO from scratch
- Gemma 3 architecture from scratch
- DeepSeek V3, R1 architecture from scratch
- Mixture of Experts (MoE) from scratch
(did I mention "from scratch"?)
More details in each subfolder's README.md
- GPT* (modified for attention masks$^1$):
- MHA
- Layer Norm
- FFN
- GeLU
- GPT to Llama 3.2 from scratch*:
- GQA
- RoPE + YaRN
- RMS Norm
- SwiGLU
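As a taste of what the Llama notebooks cover, here is a minimal, dependency-free sketch of plain RoPE applied to a single token vector (the hypothetical function name and the standard base of 10000 are my choices, not the repo's; YaRN's NTK-aware/by-part scaling is omitted):

```python
import math

def rope_rotate(x, pos, theta_base=10000.0):
    """Apply RoPE to one token vector x (even length) at position `pos`.

    Each pair (x[2i], x[2i+1]) is rotated by angle pos * theta_i,
    with theta_i = theta_base ** (-2i / d), so relative positions
    become rotation differences between queries and keys.
    """
    d = len(x)
    out = [0.0] * d
    for i in range(d // 2):
        angle = pos * theta_base ** (-2 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        out[2 * i] = x[2 * i] * c - x[2 * i + 1] * s
        out[2 * i + 1] = x[2 * i] * s + x[2 * i + 1] * c
    return out
```

Note that the rotation is norm-preserving and is the identity at position 0, which is a handy sanity check for any RoPE implementation.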
- Llama 3.2 to DeepSeek V3, R1 from scratch:
- MLA
- MTP
- DeepSeek MoE
- Llama 3.2 to Gemma 3 from scratch (text-only):
- GeGLU
- Local/Global attention
- SWA (sliding window attention)
- QK norm
- Logit softcapping (Gemma 2, kept for reference)
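Logit softcapping (the Gemma 2 leftover above) is one line; a sketch of the commonly cited form, with the cap value as an illustrative default rather than anything taken from this repo:

```python
import math

def softcap(logit, cap=50.0):
    # Gemma 2-style soft capping: smoothly bounds a logit to (-cap, cap)
    # while staying differentiable everywhere, unlike a hard clip.
    return cap * math.tanh(logit / cap)
```

For small logits this is close to the identity, so it only "kicks in" when attention or output logits start to explode.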
- GPT to Vision Transformer (ViT) from scratch:
- Image encoding: Image patches + learnable CLS token + positional encoding
- Full Attention
- Image Classification head
- ViT↔LLM adapter for multimodal alignment/fine-tuning

- Mixture of Experts (MoE) from scratch:
- Sparse MoE with classic auxiliary load-balancing loss + router z-loss
- DeepSeek MoE variant: fine-grained + shared expert isolation + auxiliary loss-free load balancing
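To make the two router losses above concrete, here is a small pure-Python sketch under common assumptions (Switch-style auxiliary loss `E * sum_e f_e * p_e` and z-loss as the mean squared log-sum-exp of the router logits); the function names and shapes are illustrative, not this repo's API:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def router_losses(router_logits, top_k=1):
    """Auxiliary load-balancing loss and router z-loss from raw router logits.

    router_logits: per-token logit lists, shape [T][E].
    aux = E * sum_e f_e * p_e, where f_e is the fraction of top-k
    assignments going to expert e and p_e is its mean router probability.
    z   = mean over tokens of logsumexp(logits)^2, discouraging huge logits.
    """
    T, E = len(router_logits), len(router_logits[0])
    probs = [softmax(l) for l in router_logits]
    counts = [0] * E
    for l in router_logits:
        top = sorted(range(E), key=lambda e: l[e], reverse=True)[:top_k]
        for e in top:
            counts[e] += 1
    f = [c / (T * top_k) for c in counts]
    p = [sum(pr[e] for pr in probs) / T for e in range(E)]
    aux = E * sum(fe * pe for fe, pe in zip(f, p))
    z = sum(math.log(sum(math.exp(x) for x in l)) ** 2 for l in router_logits) / T
    return aux, z
```

With perfectly balanced routing the auxiliary loss sits at 1.0, and it grows as tokens pile onto a few experts, which is exactly the signal the load balancer trains against.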
- GPT Fine-tuning (SFT):
- classifier (uses the hidden state of the last real token)
- instruction*
- Alignment:
- DPO* (with cDPO for noisy labels), step by step
- RLHF with GRPO from scratch
- RLVR Reasoning with GRPO from scratch (working but slow)
- Qwen GSPO (transition from the GRPO implementation)
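The core trick shared by the GRPO-based items above is replacing a value network with group-relative advantages; a minimal sketch of the usual normalization (the epsilon and function name are my own illustrative choices):

```python
def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages as used in GRPO-style RL: each sampled
    completion's reward is normalized by its group's mean and std,
    so no learned value function (critic) is needed."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

GSPO then differs mainly in applying the importance ratio at the sequence level rather than per token, while the advantage computation stays group-relative.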
- Common:
- QK-Clip (Query-Key clipping) from Moonshot.ai's MuonClip, alternative to logit softcapping and QK norm.
- DyT (Dynamic Tanh), a normalization-free alternative to RMSNorm/LayerNorm (Zhu et al., 2025)
- RoPE + YaRN (NTK aware + by-part/wavelength scaling)
- LoRA*
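Of the common pieces above, DyT is the simplest to show in isolation; a sketch of the published formula `DyT(x) = gamma * tanh(alpha * x) + beta` on a plain Python vector (parameter defaults here are illustrative; in training, `alpha` is a learnable scalar and `gamma`/`beta` are learnable per-channel vectors):

```python
import math

def dyt(x, alpha=1.0, gamma=None, beta=None):
    """Dynamic Tanh: replaces LayerNorm/RMSNorm without computing any
    activation statistics; tanh alone squashes outliers."""
    d = len(x)
    gamma = gamma if gamma is not None else [1.0] * d
    beta = beta if beta is not None else [0.0] * d
    return [g * math.tanh(alpha * xi) + b for xi, g, b in zip(x, gamma, beta)]
```

Unlike LayerNorm, there is no mean/variance reduction across the channel dimension, which is the whole point of the technique.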
- `[prefix]_engine.py`, `engine.py`: functions for training logic
- `dataset.py`: functions for preprocessing data
* Already covered by @rasbt; my code is similar.
$^1$ I implemented it mainly for SFT- and RLHF-related tasks, to ensure the model doesn't attend from/to padding tokens; it can also serve as a mask for custom losses (for CE loss, PyTorch's built-in `ignore_index=-100` is faster).
This isn't a problem for pretraining or inference (unless batching is desired), which were the main use cases of the original GPT-2.
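To illustrate the footnote, here is a minimal sketch of building a combined causal + padding mask for one sequence (boolean convention "True = may attend" and the function name are my own; real code would build this as a batched tensor):

```python
def build_attn_mask(pad_flags):
    """Combined causal + padding attention mask for one sequence.

    pad_flags[t] is True if token t is padding. Returns mask[q][k]:
    True where query q may attend to key k, i.e. k is not in the
    future and neither position is a padding token.
    """
    T = len(pad_flags)
    return [
        [(k <= q) and not pad_flags[q] and not pad_flags[k] for k in range(T)]
        for q in range(T)
    ]
```

Rows for padding queries come out all-False, which is why padding positions also need to be excluded from the loss (or handled via `ignore_index`).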
- non-hardcoded CUDA devices
- vectorize MoE dispatching while keeping the code readable
- reorganize activation and normalization functions in dedicated modules
- nested TODOs
- Rename confusing arguments: the model's `attn_mask` (padding tokens only) vs. `attention_mask` used as a loss mask for alignment
- GRPO:
- add process supervision
- GRPO iterative RL variant (continuous learning of $r_{\phi}$)