A 39M (lil) parameter model trained on ~8B tokens, on 2xA100 for approximately 2 hours. More details below.
What I cannot create, I do not understand - Richard Feynman
Simply understanding the model architecture is not enough to fully grasp how these models are trained. This project is the outcome of that realization, and of the frustration with how abstractions (e.g. Hugging Face Transformers) limit our learning process, at least when we are starting out. The best thing to do is to implement everything from scratch, with minimal abstraction. Well, that is what this project does. With this project, I plan to add everything (code + my notes) from training tokenizers to the post-training phases. One may consider it a roadmap, but it might not be enough, and in the end you will have your own roadmap, so just consider it an outline or introduction to training Large Language Models.
You should have a basic understanding of how the transformer model works. A great way to start is by watching and implementing Karpathy's Zero to Hero series yourself, up to part 5. Afterwards, you can take a look at Jay Alammar's The Illustrated Transformer, and then visit Karpathy's Let's build GPT: from scratch, in code, spelled out. This is just my recommendation; feel free to visit them in any order as per your need.
The architecture differs from the original transformer architecture in that it uses:
- RMSNorm instead of LayerNorm
- Rotary Positional Embedding instead of Absolute Positional Embedding
- SwiGLU activations instead of ReLU
- Grouped Query Attention instead of Multi-head Attention
Finally, the architecture becomes similar to the one used in the Llama 3 models.
Attribute | vocab_size | d_model | n_layers | max_seq_len | q_heads | kv_heads | max_batch_size
---|---|---|---|---|---|---|---
Default Value | 2**13 | 512 | 12 | 512 | 16 | 8 | 32
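For reference, a config holding these defaults might look like the minimal sketch below. The class and field names here are illustrative (they mirror the table above); the actual model/config.py may be organized differently.

```python
from dataclasses import dataclass

@dataclass
class LilLMConfig:
    # hypothetical dataclass mirroring the default values in the table above
    vocab_size: int = 2**13      # 8192-token vocabulary
    d_model: int = 512           # embedding / hidden dimension
    n_layers: int = 12           # number of transformer blocks
    max_seq_len: int = 512       # maximum context length
    q_heads: int = 16            # query heads
    kv_heads: int = 8            # key/value heads (grouped query attention)
    max_batch_size: int = 32
```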
This is the first step in training an LM. Since LMs can't take text as input, we need to convert text to numbers, so we build our own vocabulary that maps tokens to numbers. A great way to understand the whole concept is to watch Karpathy's Let's build the GPT Tokenizer. You might need some knowledge about Unicode and UTF-8 to grasp the concept in detail, for which you can look at my notes on Tokenizers. In this project, we use the Hugging Face tokenizers library (the only abstraction we use) to train our tokenizer. It was trained on 0.1% of OpenWebText. The recommended way would be to train the tokenizer on a large and diverse dataset to get the best compression rate. For simplicity, and since I only wanted my model to be able to converse well, I opted for this small subset of the dataset, which you can find here
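As a rough illustration, training a byte-level BPE tokenizer with the tokenizers library looks something like the sketch below. The JSONL path, the "text" field, and the special token are assumptions here; see train_custom_tokenizer.py for the actual script.

```python
import json
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

# assumed input: a JSONL file where each line is {"text": "..."}
def iter_texts(path):
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)["text"]

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=2**13,                  # matches the model's vocab_size
    special_tokens=["<|endoftext|>"],  # assumed special token
)
tokenizer.train_from_iterator(iter_texts("path/to/your_jsonl_file.jsonl"), trainer=trainer)
tokenizer.save("model/tokenizer/tokenizer.json")
```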
As described above, the architecture deviates from the original transformer model. The main changes are:
Please read the paper Root Mean Square Layer Normalization. A simple conclusion from the paper is that we don't need to subtract the mean across the feature dimension while normalizing, as we do in Layer Normalization; maintaining the variance (re-scaling) is sufficient.
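A minimal PyTorch sketch of RMSNorm is shown below; the project's own implementation may differ in naming, epsilon value, and dtype handling.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root Mean Square LayerNorm: rescales by the RMS of the features,
    skipping the mean-subtraction step of standard LayerNorm."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learnable gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # inverse RMS over the last (feature) dimension
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * inv_rms)
```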
Instead of adding extra positional embeddings to our token embeddings, we simply rotate the query and key vectors. I would recommend first watching this video, RoPE (Rotary positional embeddings) explained, then reading the ROFORMER paper, and finally looking at my notes on RoPE, where I explain RoPE with respect to the code that we use in this project.
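The sketch below shows a Llama-style complex-number formulation of RoPE, applied to the query and key tensors before attention. The function names and tensor layout are illustrative assumptions; the project's code may organize this differently.

```python
import torch

def precompute_freqs_cis(head_dim: int, max_seq_len: int, theta: float = 10000.0) -> torch.Tensor:
    # one rotation frequency per pair of dimensions within a head
    freqs = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.outer(torch.arange(max_seq_len).float(), freqs)  # (max_seq_len, head_dim // 2)
    return torch.polar(torch.ones_like(angles), angles)             # complex e^{i * m * theta_j}

def apply_rope(x: torch.Tensor, freqs_cis: torch.Tensor) -> torch.Tensor:
    # x: (batch, seq_len, n_heads, head_dim) query or key tensor;
    # treat consecutive dimension pairs as complex numbers and rotate them by position
    x_complex = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    rot = freqs_cis[: x.shape[1]].unsqueeze(0).unsqueeze(2)          # (1, seq_len, 1, head_dim // 2)
    return torch.view_as_real(x_complex * rot).flatten(-2).type_as(x)
```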
Take a look at this simple and straightforward blog on SwiGLU: GLU Variants Improve Transformer (2020)
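In code, SwiGLU replaces the usual ReLU MLP with a gated feed-forward block, as in the minimal sketch below. The layer names w1/w2/w3 follow the common Llama convention and are assumptions here, not necessarily the names used in this repo.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: SiLU(x W1) * (x W3), projected back down by W2."""
    def __init__(self, d_model: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, hidden_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(d_model, hidden_dim, bias=False)  # value projection
        self.w2 = nn.Linear(hidden_dim, d_model, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```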
Instead of giving every query head its own K and V heads, we split K and V into fewer groups (kv_heads), repeat each of them q_heads/kv_heads times, and then perform attention. Why? Because K and V are shared across groups of query heads, far fewer K/V values need to be stored and moved around, and data movement within the GPU is minimized; memory movement is the most expensive operation and a bottleneck for our training. To understand this better, take a look at the video Variants of Multi-head attention and then read my notes on Grouped Query Attention.
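A minimal sketch of the K/V repetition step is shown below. With the defaults above (q_heads=16, kv_heads=8), each K/V head is repeated n_rep = 2 times. The function name mirrors common Llama-style code and is an assumption here.

```python
import torch

def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    """Repeat K/V heads so they match the number of query heads.
    x: (batch, seq_len, kv_heads, head_dim) -> (batch, seq_len, kv_heads * n_rep, head_dim)."""
    bsz, seq_len, kv_heads, head_dim = x.shape
    if n_rep == 1:
        return x
    x = x[:, :, :, None, :].expand(bsz, seq_len, kv_heads, n_rep, head_dim)
    return x.reshape(bsz, seq_len, kv_heads * n_rep, head_dim)
```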
The model was trained on OpenWebText, which is close to 10 billion tokens according to our tokenizer, but the model was only trained on ~8B tokens (credits ran out :( ).
It was trained on 2x A100 GPUs for approximately 2.5 hours.
This is the specification of the machine that I used. The GPU was rented from Tensordock.
Category | Details
---|---
Storage | 300 GB
vCPUs | 40 AMD EPYC 7513 vCPUs
RAM | 80 GB
GPU | 2x A100 SXM4 80 GB
Compute Price | $2.40/hour
Storage Price | $0.075/hour
Total Price | $2.475/hour
Input
Bernie Sanders of Vermont would seek a recount. The delay postponed a definitive answer to whether Clinton had made a clean sweep of five big primaries on
Output
Bernie Sanders of Vermont would seek a recount. The delay postponed a definitive answer to whether Clinton had made a clean sweep of five big primaries on opening day of the Democratic nominating process.\n\nIn response, Sanders theorized that it was possible for her to then-choice’s hand to escalate into a “unprecedented vote” to take the nomination. However, his standing to refrain from carrying coal companies in the wilderness at the time of her Nov. 8 pick-ing defeat surprised the race.\n\nTrump said Wednesday morning that he will back Clinton to replace the incumbent U.S. senator who ran against Bernie Sanders on Tuesday, in a 33-16 historic win over Hillary Clinton. Though, given that both Hillary Clinton and Bernie Sanders enjoyed a restricted number of fallen out of the race on the set of their Iowa primary
Input
The latest Marvel movie has shattered box office records, grossing over $1 billion worldwide in just two weeks. Fans have praised the
Output
The latest Marvel movie has shattered box office records, grossing over $1 billion worldwide in just two weeks. Fans have praised the icons like Roc Nation and Meet The Press for the budgetary ramifications Ain’t Not a breakout promotion.\n\nIn the second week of December, Marvel announced Monday that various Marvel games and Daredevil: The Desolation of holding off it would leave Friday to Evil Geniuses. The Daredevil announced Monday that The Clone Wars is now open and ready for release in late June.
git clone https://github.com/CohleM/lilLM.git
pip install -r requirements.txt
I plan to make this more straightforward by adding command-line arguments, but for now please follow the steps described below.
Download the data from here and convert it to JSONL format. Open the train_custom_tokenizer.py file, replace file_path with your path/to/your_jsonl_file, and then run:
python train_custom_tokenizer.py
The tokenizer will be stored in /model/tokenizer.
python data/pretraining/process.py --tokenizer_path='/home/user/lilLM/model/tokenizer'
Make sure to replace tokenizer_path with the correct path.
It will download the OpenWebText dataset from Hugging Face, tokenize the whole dataset using our tokenizer saved in /model/tokenizer, and save the tokenized files as train.bin and val.bin. These are the binary files for our tokenized dataset; train.bin comes out to ~20 GB. The reason for tokenizing beforehand is to maximize GPU utilization: tokenization is a CPU-bound task, so doing it ahead of time lets the GPU spend training time processing more tokens.
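As a rough illustration of how these .bin files can be consumed during pretraining (nanoGPT-style), the sketch below memory-maps train.bin and samples random training windows. The file path, the uint16 dtype (which assumes the vocab fits in 16 bits, as 2**13 does), and the get_batch name are assumptions; see pretrain.py for the actual loading code.

```python
import numpy as np
import torch

# memory-map the pretokenized dataset so it is never loaded fully into RAM
data = np.memmap("data/pretraining/train.bin", dtype=np.uint16, mode="r")

def get_batch(batch_size: int = 32, seq_len: int = 512, device: str = "cuda"):
    # sample random windows of seq_len tokens; y is x shifted right by one position
    ix = torch.randint(len(data) - seq_len - 1, (batch_size,))
    x = torch.stack([torch.from_numpy(data[i : i + seq_len].astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy(data[i + 1 : i + 1 + seq_len].astype(np.int64)) for i in ix])
    return x.to(device), y.to(device)
```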
If you have N GPUs per node, run (setting --nproc_per_node to your GPU count):
torchrun --standalone --nproc_per_node=2 pretrain.py
If you only have one GPU, run:
python pretrain.py
Please also take a look at the default config parameters in model/config.py and in pretrain.py.
- Finetune using SFT and DPO
- Add Mixture of Experts (MoE)
- Add Inference file