This repo is a constant WIP.
It started as a "package-like" version following the great
LLM-from-scratch repo/book structure from @rasbt.
Little by little, it became a base for verbose re-implementations of different architectures and research papers: put simply, LLM stuff that piques my interest, for experiments and learning.
- WIP: Reinforcement Pretraining (RPT) from scratch
- Qwen GSPO (Group Sequence Policy Optimization)
- Moonshot.ai's standalone QK-Clip technique (from MuonClip) and own Magnitude-QK-Clip variant
- RLVR Reasoning with GRPO from scratch (currently slow (no KV cache) + unfinished README)
- Vision Transformer (ViT) from scratch
- RLHF with GRPO from scratch
- Gemma 3 architecture from scratch
- DeepSeek V3, R1 architecture from scratch
- Mixture of Experts (MoE) from scratch
(did I mention "from scratch"?)
More details in each subfolder's README.md
- GPT* (modified for attention masks$^1$):
- MHA
- Layer Norm
- FFN
- GeLU
- GPT to Llama 3.2 from scratch*:
- GQA
- RoPE + YaRN
- RMS Norm
- SwiGLU
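As a taste of what the Llama notebooks cover, here is a minimal, dependency-free sketch of plain RoPE applied to a single token vector (the hypothetical function name and the standard base of 10000 are my choices, not the repo's; YaRN's NTK-aware/by-part scaling is omitted):

```python
import math

def rope_rotate(x, pos, theta_base=10000.0):
    """Apply RoPE to one token vector x (even length) at position `pos`.

    Each pair (x[2i], x[2i+1]) is rotated by angle pos * theta_i,
    with theta_i = theta_base ** (-2i / d), so relative positions
    become rotation differences between queries and keys.
    """
    d = len(x)
    out = [0.0] * d
    for i in range(d // 2):
        angle = pos * theta_base ** (-2 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        out[2 * i] = x[2 * i] * c - x[2 * i + 1] * s
        out[2 * i + 1] = x[2 * i] * s + x[2 * i + 1] * c
    return out
```

Note that the rotation is norm-preserving and is the identity at position 0, which is a handy sanity check for any RoPE implementation.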
- Llama 3.2 to DeepSeek V3, R1 from scratch:
- MLA
- MTP
- DeepSeek MoE
- Llama 3.2 to Gemma 3 from scratch (text-only):
- GeGLU
- Local/Global attention
- SWA (sliding window attention)
- QK norm
- Logit softcapping (Gemma 2, kept for reference)
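Logit softcapping (the Gemma 2 leftover above) is one line; a sketch of the commonly cited form, with the cap value as an illustrative default rather than anything taken from this repo:

```python
import math

def softcap(logit, cap=50.0):
    # Gemma 2-style soft capping: smoothly bounds a logit to (-cap, cap)
    # while staying differentiable everywhere, unlike a hard clip.
    return cap * math.tanh(logit / cap)
```

For small logits this is close to the identity, so it only "kicks in" when attention or output logits start to explode.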
- GPT to Vision Transformer (ViT) from scratch:
- Image encoding: Image patches + learnable CLS token + positional encoding
- Full Attention
- Image Classification head
- ViT↔LLM adapter for multimodal alignment/fine-tuning

- Mixture of Experts (MoE) from scratch:
- Sparse MoE with classic auxiliary load-balancing loss + router z-loss
- DeepSeek MoE variant: fine-grained + shared expert isolation + auxiliary loss-free load balancing
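To make the two router losses above concrete, here is a small pure-Python sketch under common assumptions (Switch-style auxiliary loss `E * sum_e f_e * p_e` and z-loss as the mean squared log-sum-exp of the router logits); the function names and shapes are illustrative, not this repo's API:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def router_losses(router_logits, top_k=1):
    """Auxiliary load-balancing loss and router z-loss from raw router logits.

    router_logits: per-token logit lists, shape [T][E].
    aux = E * sum_e f_e * p_e, where f_e is the fraction of top-k
    assignments going to expert e and p_e is its mean router probability.
    z   = mean over tokens of logsumexp(logits)^2, discouraging huge logits.
    """
    T, E = len(router_logits), len(router_logits[0])
    probs = [softmax(l) for l in router_logits]
    counts = [0] * E
    for l in router_logits:
        top = sorted(range(E), key=lambda e: l[e], reverse=True)[:top_k]
        for e in top:
            counts[e] += 1
    f = [c / (T * top_k) for c in counts]
    p = [sum(pr[e] for pr in probs) / T for e in range(E)]
    aux = E * sum(fe * pe for fe, pe in zip(f, p))
    z = sum(math.log(sum(math.exp(x) for x in l)) ** 2 for l in router_logits) / T
    return aux, z
```

With perfectly balanced routing the auxiliary loss sits at 1.0, and it grows as tokens pile onto a few experts, which is exactly the signal the load balancer trains against.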
- GPT Fine-tuning (SFT):
- classifier (uses the hidden state of the last real token)
- instruction*
- Alignment:
- DPO* (with cDPO for noisy labels), step by step
- RLHF with GRPO from scratch
- RLVR Reasoning with GRPO from scratch (working but slow)
- Qwen GSPO (transition from the GRPO implementation)
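The core trick shared by the GRPO-based items above is replacing a value network with group-relative advantages; a minimal sketch of the usual normalization (the epsilon and function name are my own illustrative choices):

```python
def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages as used in GRPO-style RL: each sampled
    completion's reward is normalized by its group's mean and std,
    so no learned value function (critic) is needed."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

GSPO then differs mainly in applying the importance ratio at the sequence level rather than per token, while the advantage computation stays group-relative.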
- Common:
- QK-Clip (Query-Key clipping) from Moonshot.ai's MuonClip, alternative to logit softcapping and QK norm.
- DyT (Dynamic Tanh), a normalization-free alternative to RMSNorm/LayerNorm (Zhu et al., 2025)
- RoPE + YaRN (NTK aware + by-part/wavelength scaling)
- LoRA*
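Of the common pieces above, DyT is the simplest to show in isolation; a sketch of the published formula `DyT(x) = gamma * tanh(alpha * x) + beta` on a plain Python vector (parameter defaults here are illustrative; in training, `alpha` is a learnable scalar and `gamma`/`beta` are learnable per-channel vectors):

```python
import math

def dyt(x, alpha=1.0, gamma=None, beta=None):
    """Dynamic Tanh: replaces LayerNorm/RMSNorm without computing any
    activation statistics; tanh alone squashes outliers."""
    d = len(x)
    gamma = gamma if gamma is not None else [1.0] * d
    beta = beta if beta is not None else [0.0] * d
    return [g * math.tanh(alpha * xi) + b for xi, g, b in zip(x, gamma, beta)]
```

Unlike LayerNorm, there is no mean/variance reduction across the channel dimension, which is the whole point of the technique.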
- `[prefix]_engine.py`, `engine.py`: functions for training logic
- `dataset.py`: functions for preprocessing data
* Already covered by @rasbt; my code is similar.
$^1$ I implemented it mainly for SFT- and RLHF-related tasks, to ensure the model doesn't attend from/to padding tokens; it can also serve as a mask for custom losses (for CE loss, PyTorch's built-in `ignore_index=-100` is faster).
This isn't a problem for pretraining or inference (unless batching is desired), which were the main use cases of the original GPT-2.
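To illustrate the footnote, here is a minimal sketch of building a combined causal + padding mask for one sequence (boolean convention "True = may attend" and the function name are my own; real code would build this as a batched tensor):

```python
def build_attn_mask(pad_flags):
    """Combined causal + padding attention mask for one sequence.

    pad_flags[t] is True if token t is padding. Returns mask[q][k]:
    True where query q may attend to key k, i.e. k is not in the
    future and neither position is a padding token.
    """
    T = len(pad_flags)
    return [
        [(k <= q) and not pad_flags[q] and not pad_flags[k] for k in range(T)]
        for q in range(T)
    ]
```

Rows for padding queries come out all-False, which is why padding positions also need to be excluded from the loss (or handled via `ignore_index`).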
- non-hardcoded CUDA devices
- vectorize MoE dispatching while keeping the code readable
- reorganize activation and normalization functions in dedicated modules
- nested TODOs
- Rename confusing arguments: the model's `attn_mask` (padding tokens only) vs. `attention_mask` used as a loss mask for alignment
- GRPO:
- add process supervision
- GRPO iterative RL variant (continuous learning of $r_{\phi}$)