Verbose implementations of LLM architectures, techniques, and research papers from scratch: RLHF, DeepSeek V3, ViT, Gemma 3, MoE...

LLM Quest: Architectures, techniques, research papers from scratch

This repo is a constant WIP.
It was initially a "package-like" version following the structure of the great LLM-from-scratch repo/book from @rasbt.

Little by little, it has served as a base for verbose re-implementations of different architectures and research papers. Put simply: LLM stuff that piques my interest, for experiments and learning.

Latest

  • WIP: Reinforcement Pretraining (RPT) from scratch
  • Qwen GSPO (Group Sequence Policy Optimization)
  • Moonshot.ai's standalone QK-Clip technique (from MuonClip) and my own Magnitude-QK-Clip variant
  • RLVR Reasoning with GRPO from scratch (currently slow: no KV cache; README unfinished)
  • Vision Transformer (ViT) from scratch
  • RLHF with GRPO from scratch
  • Gemma 3 architecture from scratch
  • DeepSeek V3, R1 architecture from scratch
  • Mixture of Experts (MoE) from scratch

(did I mention "from scratch"?)

 

Content

More details in each subfolder's README.md

  • GPT* (modified for attention masks $^1$):
    • MHA
    • Layer Norm
    • FFN
    • GeLU
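
A minimal sketch of the GeLU item above, using the tanh approximation from the original GPT-2 (my own illustration, not necessarily this repo's exact code):

```python
import math

import torch
import torch.nn as nn


class GELU(nn.Module):
    """Tanh approximation of GeLU, as used in GPT-2."""

    def forward(self, x):
        return 0.5 * x * (1.0 + torch.tanh(
            math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)
        ))
```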

 

  • GPT to Llama 3.2 from scratch*:
    • GQA
    • RoPE + YaRN
    • RMS Norm
    • SwiGLU
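
A minimal sketch of the SwiGLU feed-forward block listed above (dimensions are illustrative defaults, not necessarily Llama 3.2's actual config or this repo's exact code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLUFFN(nn.Module):
    """Llama-style FFN: down_proj( SiLU(gate_proj(x)) * up_proj(x) )."""

    def __init__(self, emb_dim=2048, hidden_dim=8192):
        super().__init__()
        self.gate_proj = nn.Linear(emb_dim, hidden_dim, bias=False)
        self.up_proj = nn.Linear(emb_dim, hidden_dim, bias=False)
        self.down_proj = nn.Linear(hidden_dim, emb_dim, bias=False)

    def forward(self, x):
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
```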

 

  • Llama 3.2 to DeepSeek V3, R1 from scratch:
    • MLA
    • MTP
    • DeepSeek MoE

 

  • Llama 3.2 to Gemma 3 from scratch (text-only):
    • GeGLU
    • Local/Global attention
    • SWA
    • QK norm
    • Logit softcapping (Gemma 2, kept for reference)
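
For reference, a minimal sketch of the logit softcapping listed above: logits are smoothly squashed into (-cap, cap) via a scaled tanh. The cap values below are the ones publicly reported for Gemma 2; treat them and the tensor shapes as assumptions:

```python
import torch


def soft_cap(logits: torch.Tensor, cap: float) -> torch.Tensor:
    """Smoothly bound logits to (-cap, cap): cap * tanh(logits / cap)."""
    return cap * torch.tanh(logits / cap)


attn_scores = soft_cap(torch.randn(2, 8, 16, 16), cap=50.0)    # attention logits
final_logits = soft_cap(torch.randn(2, 16, 32_000), cap=30.0)  # output logits
```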

 

  • GPT to Vision Transformer (ViT) from scratch:
    • Image encoding: image patches + learnable CLS token + positional encoding (see the sketch after this list)
    • Full Attention
    • Image Classification head
    • ViT↔LLM adapter for multimodal alignment/fine-tuning
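
A minimal sketch of the image-encoding step above: patchify via a strided conv, prepend a learnable CLS token, add positional embeddings (learnable ones here, which is an assumption; hyperparameters are illustrative, not this repo's exact code):

```python
import torch
import torch.nn as nn


class PatchEmbedding(nn.Module):
    """(B, C, H, W) -> (B, 1 + num_patches, emb_dim) with CLS token + positional encoding."""

    def __init__(self, img_size=224, patch_size=16, in_channels=3, emb_dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # A strided conv is equivalent to slicing patches + a linear projection
        self.proj = nn.Conv2d(in_channels, emb_dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, emb_dim))
        self.pos_emb = nn.Parameter(torch.zeros(1, 1 + num_patches, emb_dim))

    def forward(self, x):
        patches = self.proj(x).flatten(2).transpose(1, 2)       # (B, num_patches, emb_dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)         # (B, 1, emb_dim)
        return torch.cat([cls, patches], dim=1) + self.pos_emb  # (B, 1 + num_patches, emb_dim)
```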

 

  • Mixture of Experts (MoE) from scratch:
    • Sparse MoE with the classic auxiliary load-balancing loss + router z-loss (sketched after this list)
    • DeepSeek MoE variant: fine-grained + shared expert isolation + auxiliary loss-free load balancing
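
A minimal sketch of the two router losses mentioned above (Switch-Transformer-style auxiliary load-balancing loss and ST-MoE router z-loss); a condensed illustration under my own naming, with loss coefficients omitted, not this repo's exact code:

```python
import torch
import torch.nn.functional as F


def router_losses(router_logits: torch.Tensor, top_k: int = 2):
    """router_logits: (num_tokens, num_experts) raw scores from the router."""
    num_tokens, num_experts = router_logits.shape
    probs = F.softmax(router_logits, dim=-1)
    _, topk_idx = probs.topk(top_k, dim=-1)

    # f_i: fraction of tokens dispatched to expert i (averaged over the k slots)
    dispatch = F.one_hot(topk_idx, num_experts).float().sum(dim=1)  # (num_tokens, num_experts)
    f = dispatch.mean(dim=0) / top_k
    # P_i: mean router probability assigned to expert i
    p = probs.mean(dim=0)
    aux_loss = num_experts * torch.sum(f * p)

    # z-loss: penalizes large router logits for numerical/training stability
    z_loss = torch.logsumexp(router_logits, dim=-1).pow(2).mean()
    return aux_loss, z_loss
```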

 

  • GPT Fine-tuning (SFT):
    • classifier (method: retrieval of the hidden state of the last real token, see the sketch after this list)
    • instruction*
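
A minimal sketch of the classifier trick mentioned above: with right padding, grab the hidden state of the last non-padding token of each sequence and feed it to the classification head (tensor names are illustrative, not this repo's exact code):

```python
import torch


def last_real_hidden(hidden_states: torch.Tensor, attn_mask: torch.Tensor) -> torch.Tensor:
    """hidden_states: (batch, seq_len, emb_dim) from the transformer backbone.
    attn_mask: (batch, seq_len), 1 for real tokens, 0 for (right) padding."""
    last_idx = attn_mask.sum(dim=1) - 1  # (batch,) index of the last real token
    batch_idx = torch.arange(hidden_states.shape[0], device=hidden_states.device)
    return hidden_states[batch_idx, last_idx]  # (batch, emb_dim) -> classification head
```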

 

  • Alignment:
    • DPO* (with cDPO for noisy labels), step by step
    • RLHF with GRPO from scratch (the group-relative advantage is sketched after this list)
    • RLVR Reasoning with GRPO from scratch (working but slow)
    • Qwen GSPO (transition from the GRPO implementation)
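
For the GRPO/GSPO items above, the core shared idea is the group-relative advantage: sample a group of G completions per prompt, then normalize each scalar reward against its own group. A minimal sketch (outcome supervision only, no process rewards), not this repo's exact code:

```python
import torch


def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """rewards: (num_prompts, group_size), one scalar reward per sampled completion.

    A_i = (r_i - mean(group)) / (std(group) + eps), computed within each group.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)
```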

 

  • Common:
    • QK-Clip (Query-Key clipping) from Moonshot.ai's MuonClip, an alternative to logit softcapping and QK norm.
    • DyT (Dynamic Tanh; Zhu et al., 2025), a normalization-free alternative to RMSNorm/LayerNorm (sketched after this list)
    • RoPE + YaRN (NTK aware + by-part/wavelength scaling)
    • LoRA*
    • [prefix]_engine.py, engine.py functions for training logic
    • dataset.py functions for preprocessing data
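
A minimal sketch of DyT as a drop-in replacement for RMSNorm/LayerNorm: an element-wise tanh with a learnable scalar α plus the usual affine parameters, no statistics computed. The init value of α follows the paper's default as I recall it, so treat it as an assumption:

```python
import torch
import torch.nn as nn


class DyT(nn.Module):
    """Dynamic Tanh: y = weight * tanh(alpha * x) + bias."""

    def __init__(self, dim: int, alpha_init: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1) * alpha_init)  # learnable scalar
        self.weight = nn.Parameter(torch.ones(dim))             # per-channel scale
        self.bias = nn.Parameter(torch.zeros(dim))              # per-channel shift

    def forward(self, x):
        return self.weight * torch.tanh(self.alpha * x) + self.bias
```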

 

* Already covered by @rasbt, my code is similar.

$^1$ The original GPT-2 implementation didn't have attention (padding) masks, only causal masks (in OpenAI's code the causal mask itself is called "attention mask", which adds confusion to the terminology).
I implemented it mainly for SFT- and RLHF-related tasks, to ensure the model doesn't attend from/to padding tokens; the same mask can also be reused for custom losses (for CE loss, PyTorch's built-in ignore_index=-100 handling is faster).
The lack of a padding mask isn't a problem for pretraining or inference (unless batching is desired), which were the main use cases of the original GPT-2.
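
A minimal sketch of the two mechanisms in this note (names and shapes are illustrative, not this repo's exact code): combining the causal mask with a padding mask for attention, and masking padded targets in the CE loss via ignore_index=-100:

```python
import torch
import torch.nn.functional as F

B, T, V = 2, 6, 50257  # batch, seq len, vocab size (GPT-2)
attn_mask = torch.tensor([[1, 1, 1, 1, 0, 0],
                          [1, 1, 1, 1, 1, 1]])  # 1 = real token, 0 = (right) padding

# Attention: causal mask AND padding mask -> which keys each query may attend to
causal = torch.tril(torch.ones(T, T, dtype=torch.bool))             # (T, T)
pad = attn_mask.bool()[:, None, None, :]                            # (B, 1, 1, T) keys to keep
allowed = causal[None, None, :, :] & pad                            # (B, 1, T, T)
scores = torch.randn(B, 1, T, T).masked_fill(~allowed, float("-inf"))
weights = torch.softmax(scores, dim=-1)

# Loss: the built-in ignore_index skips padded targets (faster than a manual loss mask)
logits = torch.randn(B, T, V)
targets = torch.randint(0, V, (B, T))
targets[attn_mask == 0] = -100                                      # ignored by cross_entropy
loss = F.cross_entropy(logits.view(-1, V), targets.view(-1), ignore_index=-100)
```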

 

Potential TODOs

  • non-hardcoded CUDA devices
  • vectorize MoE dispatching while keeping the code readable
  • reorganize activation and normalization functions into dedicated modules
  • nested TODOs
  • confusing names: the model's attn_mask arg (padding tokens only) vs attention_mask used as a loss mask for alignment
  • GRPO:
    • add process supervision
    • GRPO iterative RL variant (continuous learning of $r_{\phi}$)
