I'm developing this course to contain everything you need to:
- Join elite labs like OpenAI, Google, or MIT
- Independently publish groundbreaking open-source research
- Build world-class models
Follow along with course walkthroughs, video tutorials, and explanations.
For advanced learners, check out the Speedruns: research, engineering, and optimization challenges that help you:
- Contribute to open-source
- Build real skills by doing
Start with the Beginner Python Course to get up to speed.
Make a copy of the notebooks: Open the notebook → File → Save a copy in Drive
- Intro course - Deep Learning by Professor Bryce - YouTube
- PyTorch Fundamentals: From Linear Layers & Weight Intuition to LayerNorm, Variance, and Custom ML Blocks - Google Colab - YouTube - Bilibili
- Code Softmax, Cross-Entropy, and Gradients — From Scratch (No Torch) (in development) - Google Colab
- Chain Rule & Backpropagation From Scratch - Google Colab
- Comparing MatMul: PyTorch Native vs Tiling vs Quantization (in development) - Google Colab
- Make Matrix Multiply 3x Faster by Padding Size to Power of 2 - Google Colab
- How Matrix Shape Affects Performance on Nvidia T4 Tensor Cores (in development) - Google Colab
- TODO: how to optimize matmuls on specific GPUs
- Experimenting With Small Character-Level LLM: Hyperparameters, Optimization, and Model Scaling - Paper - Google Colab
- Train a Small LLM From Scratch In 50 Min - Google Colab
- Simplest diffusion model to generate points on a circle - Google Colab
- Code & train a small diffusion model to calculate A mod B - Google Colab
- Understand Simple Autoencoder - Google Colab
I had no idea autoencoders were so quick to train: a few seconds for an autoencoder of numbers (0-10,000).
The encoder takes a number (56) -> vector embedding [0.3, 0.7, 0.42, ...] -> the decoder aims to predict the encoded number (56) from the vector embedding. These vector embeddings contain a rich representation of the encoded input (number, token, sentence, ...) that can be used in models like LLMs, diffusion models, etc.
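To make that concrete, here's a minimal sketch (my own toy example, not the course notebook) of a number autoencoder in PyTorch; the 8-dimensional embedding and the small MLP sizes are arbitrary illustrative choices:

```python
# Toy number autoencoder: number -> embedding -> reconstructed number.
# All sizes and hyperparameters here are illustrative assumptions, not the notebook's.
import torch
import torch.nn as nn

EMB_DIM = 8  # assumed embedding size

class NumberAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, EMB_DIM))
        self.decoder = nn.Sequential(nn.Linear(EMB_DIM, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):
        z = self.encoder(x)      # number -> vector embedding
        return self.decoder(z)   # vector embedding -> reconstructed number

model = NumberAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(2000):
    # sample numbers in [0, 10000) and scale to [0, 1] to keep training stable
    x = torch.randint(0, 10_000, (256, 1)).float() / 10_000
    loss = nn.functional.mse_loss(model(x), x)
    opt.zero_grad()
    loss.backward()
    opt.step()
```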
I'm figuring out autoencoders because I think LLMs should process sentences, not tokens: sentences can represent an infinite number of concepts, as opposed to a limited token vocabulary (usually around 150K tokens).
Predicting over an infinite distribution requires diffusion models (think of the seemingly infinite number of possible images); an autoregressive model would just predict the blurry average of the image or sentence, without any meaning.
A diffusion model also lets us do truly unified training in the same latent space for visual and text data.
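The "blurry average" point is easy to see with the points-on-a-circle example from the diffusion notebook above. The snippet below is my own toy illustration (not code from the notebooks): the single MSE-optimal prediction for a multimodal target is its mean, which here is the circle's center rather than any valid point on the circle; a diffusion model instead learns to sample from the full distribution.

```python
# Toy illustration of the "blurry average" failure mode (my own example).
import torch

# 10k points uniformly distributed on the unit circle (a multimodal target set)
theta = torch.rand(10_000) * 2 * torch.pi
data = torch.stack([torch.cos(theta), torch.sin(theta)], dim=1)

# The MSE-optimal single prediction is the mean of the data...
mean_pred = data.mean(dim=0)
print(mean_pred)          # ~[0, 0]: the circle's center
print(mean_pred.norm())   # ~0, far from the true radius of 1
# ...so a deterministic regressor collapses to a point that is not on the circle at all,
# whereas a diffusion model samples points that actually lie on the circle.
```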
- TMA (Tensor Memory Accelerator) alignment for fast memory on Hopper GPUs (DeepSeek's speed) - Google Colab
- High-Performance GPU Matrix Multiplication on H800, H100 & H200 from Scratch - Google Colab
- Looking for patterns in trained neural network weights - Google Colab - Preview PDF Analysis (in development)