My collection of paper implementations and experiments, built to be modular, easy to extend, and easy to experiment with.
- Dynamic Tanh
- Multi-head Latent Attention
- nGPT: Normalized Transformer
- Differential Transformer
- Rotary Embeddings
- Attention with Linear Biases
- Llama
- GPT
- TODO: switch the config structure to importing modules instead?
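
One possible reading of the config question above, sketched with only the standard library: each experiment's config lives in a plain Python module, and a small loader imports it by name and collects its public values. This is an illustration under assumptions, not how the repo currently works. `load_config`, the module name `demo_gpt_config`, and every field in it are hypothetical.

```python
import importlib
import sys
import types


def load_config(module_name: str) -> dict:
    """Import a config module by name and return its public plain values as a dict."""
    module = importlib.import_module(module_name)
    return {
        name: value
        for name, value in vars(module).items()
        if not name.startswith("_")                     # skip dunders and private names
        and not callable(value)                         # skip helper functions and classes
        and not isinstance(value, types.ModuleType)     # skip modules the config imported
    }


# Stand-in for a real config file (hypothetical name and fields), registered in
# sys.modules so the example runs without writing anything to disk.
_demo = types.ModuleType("demo_gpt_config")
_demo.model = "gpt"
_demo.n_layer = 4
_demo.n_head = 4
_demo.d_model = 128
_demo.positional = "rotary"
sys.modules["demo_gpt_config"] = _demo

config = load_config("demo_gpt_config")
print(config)
```

With this shape, a config can use ordinary Python (imports, arithmetic, shared base modules), at the cost of executing arbitrary code whenever a config is loaded.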