~ tinyllama ~

Model classes and pre-training utilities for a tiny version of Llama in PyTorch.

Installation

pip install tinyllama

Parsing

# ".txt" files
from tinyllama.readers import get_text
corpus = get_text("./txt_path")

# ".pdf" files
from tinyllama.readers import get_pdf_text
corpus = get_pdf_text("./pdf_path")

Pre-training a model

Initializing a tokenizer

With a simple character-level tokenizer:

from tinyllama.tokenizers import CharacterTokenizer
tokenizer = CharacterTokenizer()

To turn a corpus into tokens:

tokens = tokenizer.tokenize(corpus)
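To make "character-level" concrete, here is a minimal sketch of how such a tokenizer could work. The class ToyCharTokenizer and its encode/decode methods are hypothetical names used for illustration; this is not tinyllama's CharacterTokenizer implementation, whose internals are not shown here.

```python
from typing import List


class ToyCharTokenizer:
    """Hypothetical character-level tokenizer (illustration only)."""

    def __init__(self, corpus: str):
        # Vocabulary: every distinct character in the corpus, in sorted order.
        self.chars = sorted(set(corpus))
        self.char_to_id = {ch: i for i, ch in enumerate(self.chars)}

    @property
    def vocab_size(self) -> int:
        return len(self.chars)

    def encode(self, text: str) -> List[int]:
        # Raises KeyError for characters that never appeared in the corpus.
        return [self.char_to_id[ch] for ch in text]

    def decode(self, ids: List[int]) -> str:
        return "".join(self.chars[i] for i in ids)


toy = ToyCharTokenizer("hello world")
ids = toy.encode("hello")
text = toy.decode(ids)
```

With this scheme the vocabulary size equals the number of distinct characters in the corpus, which is why the model below takes vocab_size from the tokenizer.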

Initializing a Llama model

from tinyllama import Llama
model = Llama(context_window=500, emb_dim=10, n_heads=2, n_blocks=2, vocab_size=tokenizer.vocab_size)

Multi-Query attention

Multi-query attention reduces the number of queries and keys inside a multi-head attention block: the heads share queries and keys instead of each holding its own, which also reduces the number of parameters.

model = Llama(context_window=500, emb_dim=10, n_heads=2, n_blocks=2, gq_ratio=1/2, vocab_size=tokenizer.vocab_size)

The gq_ratio parameter is the ratio $\frac{\text{number of queries/keys}}{\text{number of heads}}$. A value of 1/2 divides the number of queries and keys by 2. The default value is 1.
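As a quick check of the arithmetic, the sketch below computes how many queries/keys a given gq_ratio implies. The helper shared_projection_count is a hypothetical name, and the validation it performs is an assumption; tinyllama may handle invalid ratios differently.

```python
def shared_projection_count(n_heads: int, gq_ratio: float = 1.0) -> int:
    """Number of queries/keys implied by gq_ratio (hypothetical helper)."""
    if not 0 < gq_ratio <= 1:
        raise ValueError("gq_ratio must be in (0, 1]")
    count = n_heads * gq_ratio
    # Assumption: the ratio must yield a whole number of queries/keys.
    if count != int(count):
        raise ValueError("n_heads * gq_ratio must be a whole number")
    return int(count)


n_shared = shared_projection_count(n_heads=2, gq_ratio=1 / 2)
n_default = shared_projection_count(n_heads=2)
```

With n_heads=2, a ratio of 1/2 leaves a single set of queries/keys shared by both heads, while the default ratio of 1 keeps one set per head.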

Launching a pre-training job

from tinyllama import TrainConfig, Trainer
# Use lowercase instance names so the TrainConfig and Trainer classes are not shadowed
train_config = TrainConfig(batch_size=32, epochs=64, lr=1e-3, log_interval=50)
trainer = Trainer(train_config)
trainer.run(model, tokens)

Logs are disabled by default. To enable them, set the environment variable DISABLE_LOGS to 0, for example: DISABLE_LOGS=0 python3 file.py.
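For readers unfamiliar with this pattern, the sketch below shows one way a script could interpret such a variable. The function logs_enabled is hypothetical, and how tinyllama actually parses DISABLE_LOGS is an assumption; only the behaviour "disabled unless set to 0" comes from the text above.

```python
import os


def logs_enabled(environ=os.environ) -> bool:
    """Hypothetical check: logging stays off unless DISABLE_LOGS is exactly "0"."""
    return environ.get("DISABLE_LOGS", "1") == "0"


by_default = logs_enabled({})
when_zero = logs_enabled({"DISABLE_LOGS": "0"})
when_one = logs_enabled({"DISABLE_LOGS": "1"})
```

Environment variables are always strings, so the comparison is against "0" rather than the integer 0.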

Insight

The Insight class runs a training job on a clone of the model and returns information about the training state.

To disable cloning, set tune_on_clone to False. You can also set a custom training configuration for tuning with the argument TUNE_CONFIG = TrainConfig(..).

Gradients

Returns a histogram of the gradient distribution, along with its mean, standard deviation, and saturation.

High saturation indicates that the model is not learning. Very low saturation (≈0%) indicates that it is learning far too aggressively, which is not good either.
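The three statistics can be sketched in a few lines. Note the assumption: the text above does not define saturation, so this sketch treats a gradient as saturated when its magnitude is below a small threshold eps. The function gradient_summary, the eps parameter, and that definition are all hypothetical and may differ from what tinyllama computes.

```python
from statistics import mean, pstdev


def gradient_summary(grads, eps=1e-5):
    """Mean, standard deviation, and saturation (%) of a flat list of gradients.

    Assumption: a gradient counts as saturated when |g| < eps.
    """
    saturated = sum(1 for g in grads if abs(g) < eps)
    return {
        "mean": mean(grads),
        "std": pstdev(grads),
        "saturation_pct": 100.0 * saturated / len(grads),
    }


summary = gradient_summary([0.0, 0.0, 0.0, 0.02, -0.02])
```

Under this definition, a model whose gradients are mostly zero reports a high saturation percentage, which matches the "not learning" reading above.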

Activations (SwiGLU layers)

Note that a training job is necessary. Activations are not kept in memory permanently, since that would mean storing the tensors at each forward pass. Instead, the values are hooked before training and retrieved afterwards.

from tinyllama.insight import SwigluInsight, SwigluPath

swiglu_insight = SwigluInsight(track_direction=SwigluPath.BACKWARD)
swiglu_insight.run(model, tokens)

If your model is learning correctly, saturation should stabilize as you go deeper into the layers. There are only three SwiGLU activation functions for the moment, so the effect will be difficult to notice.

The result above could be improved: the last activation layer is still saturated.

By default, track_direction is set to SwigluPath.BACKWARD. If you want to look at the forward activation, set it to SwigluPath.FORWARD.

Parameters

from tinyllama.insight import GradInsight
grad_insight = GradInsight(num_params_to_track=1500)
grad_insight.run(model)

Plot: an example of high saturation, where the distribution is also not well rounded.

Plot: what a good distribution of gradients should approximately look like.

To avoid clutter, the legend is disabled. If you're tracking a small number of parameters, set the show_params_name argument to True.

Gradient over data ratio $\frac{l_r \cdot grad}{data}$

Returns a plot representing the gradient/data ratio in each step of the training.

from tinyllama.insight import GdrInsight
gdr_insight = GdrInsight(num_params_to_track=50, num_iters=1500)
gdr_insight.run(model, tokens)

Ratios should stabilize as training progresses. High values mean the network is learning too fast, while low values mean it is learning too slowly; neither is good. Usually, you want to observe values in the 1e-3 to 1e-2 range.
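The ratio in the heading can be sketched for a single parameter tensor as follows. The formula lr · grad / data comes from the heading above; reducing each tensor to a scalar with the standard deviation is an assumption, and the names update_to_data_ratio and in_target_range are hypothetical.

```python
from statistics import pstdev


def update_to_data_ratio(lr, grads, data):
    """lr * std(grad) / std(data) for one parameter tensor given as flat lists.

    Assumption: tensors are summarized by their standard deviation.
    """
    return lr * pstdev(grads) / pstdev(data)


def in_target_range(ratio, low=1e-3, high=1e-2):
    # Range quoted in the text above.
    return low <= ratio <= high


ratio = update_to_data_ratio(lr=1e-3, grads=[2.0, -2.0], data=[1.0, -1.0])
ok = in_target_range(ratio)
```

Here the gradients are twice as spread out as the parameter values, so with lr=1e-3 each step changes the parameters by roughly 0.2% of their scale.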

Plot: an example of a model that is hardly learning from the data.

Plot: after adjusting some hyperparameters and increasing the volume of data, the learning quality of the model improved.

To avoid clutter, the legend is disabled. If you're tracking a small number of parameters, set the show_params_name argument to True.

Learning rate

Returns a plot of the loss for each learning rate. The start and end arguments are expressed on a logarithmic scale.

from tinyllama.insight import LrInsight
lr_insight = LrInsight(start=-5, end=0, n_lrs=50)
lr_insight.run(model, tokens)

Each learning rate is trained for 1 epoch. Feel free to change this with the epochs_for_each argument.
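To illustrate what a logarithmic scale for start and end means, the sketch below generates candidate learning rates. It assumes that start and end are base-10 exponents and that the rates are evenly spaced in the exponent; both are assumptions about LrInsight, and log_spaced_lrs is a hypothetical helper.

```python
def log_spaced_lrs(start: float, end: float, n_lrs: int) -> list:
    """n_lrs learning rates from 10**start to 10**end, evenly spaced in the exponent."""
    if n_lrs < 2:
        return [10.0 ** start]
    step = (end - start) / (n_lrs - 1)
    return [10.0 ** (start + i * step) for i in range(n_lrs)]


lrs = log_spaced_lrs(start=-5, end=0, n_lrs=50)
```

Under these assumptions, start=-5 and end=0 would sweep learning rates from about 1e-5 up to 1.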

Hyperparameter tuning

Plots and returns a tuple containing (1) training data points and the associated loss (evaluated with training) and (2) testing data points and their estimated loss (evaluated with a Gaussian process).

To disable plots, set the environment variable DISABLE_PLOT to 0.

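from tinyllama import TrainConfig
from tinyllama.gptuner import GPTuneConfig, GPTune
# Use lowercase instance names so the imported classes are not shadowed
train_config = TrainConfig(batch_size=32, epochs=64, lr=1e-3, log_interval=50)
gptune_config = GPTuneConfig(max_num_training_samples=100, hyperparams_to_tune=["emb_dim", "n_heads"], l_bounds=[10, 2], u_bounds=[50, 5], max_num_evaluations=500)
gptune = GPTune(gptune_config)
XY_train, XY_test = gptune.run(model, tokens, train_config)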

GPTune predicts the loss of different hyperparameter configurations without running full training cycles. It uses a Gaussian process model that learns from a small set of evaluated training samples to estimate performance across the entire hyperparameter space.

max_num_training_samples: sets the number of training samples. More training samples give better overall coverage of the space, which leads to better precision. The samples are drawn with a Latin hypercube; depending on how the space is constrained (the intervals in which the hyperparameters lie), there is a maximum number of samples that can fit into the space.

l_bounds: sets the lower bounds of each hyperparameter, following the order of hyperparams_to_tune.

u_bounds: sets the upper bounds of each hyperparameter, following the order of hyperparams_to_tune.

hyperparams_to_tune: sets the hyperparameters to tune; the others are taken from the model.

hyperparams_to_plot: sets the hyperparameters to plot. It must have length <= 2 and be a subset of hyperparams_to_tune.

max_num_evaluation_samples: sets the number of evaluation samples. The same remark about the constrained space applies: the number of integer samples it can hold is finite.
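The Latin hypercube sampling and the finite capacity of a constrained space mentioned in the parameter descriptions can be sketched as follows. This is a generic illustration, not tinyllama's sampler: the helper names are hypothetical, and it assumes that bounds are inclusive and that hyperparameters take integer values.

```python
import random
from math import prod


def max_integer_samples(l_bounds, u_bounds):
    """Count of distinct integer points in the box (bounds assumed inclusive)."""
    return prod(u - l + 1 for l, u in zip(l_bounds, u_bounds))


def latin_hypercube(n_samples, l_bounds, u_bounds, seed=0):
    """Draw n_samples points so that, in every dimension, each of the
    n_samples equal-width strata contains exactly one point."""
    rng = random.Random(seed)
    columns = []
    for l, u in zip(l_bounds, u_bounds):
        width = (u - l) / n_samples
        # One uniform draw inside each stratum, then shuffle the strata order
        # so dimensions are paired at random.
        column = [l + (i + rng.random()) * width for i in range(n_samples)]
        rng.shuffle(column)
        columns.append(column)
    return [tuple(col[i] for col in columns) for i in range(n_samples)]


capacity = max_integer_samples([10, 2], [50, 5])
samples = latin_hypercube(8, [10, 2], [50, 5])
```

For the bounds used in the example above (emb_dim in 10..50, n_heads in 2..5), the box holds 41 × 4 integer points, so under these assumptions no more than that many distinct integer samples can be drawn from it. The sketch returns continuous points; rounding them to integers is what makes the cap apply.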

The number of hyperparameters must be <= 2 to get a plot. If you tune more than that and still want a plot of a subset, set the hyperparams_to_plot argument to the list of hyperparameters that you want to plot.

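from tinyllama import TrainConfig
from tinyllama.gptuner import GPTuneConfig, GPTune
train_config = TrainConfig(batch_size=32, epochs=64, lr=1e-3, log_interval=50)
# hyperparams_to_plot must be a subset of hyperparams_to_tune
gptune_config = GPTuneConfig(max_num_training_samples=100, hyperparams_to_tune=["emb_dim", "n_heads", "context_window"], hyperparams_to_plot=["emb_dim", "n_heads"], l_bounds=[10, 2, 150], u_bounds=[50, 5, 250], max_num_evaluations=500)
gptune = GPTune(gptune_config)
gptune.run(model, tokens, train_config)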

You can also get 1D plots by passing a single hyperparameter to hyperparams_to_plot.

from tinyllama import TrainConfig
from tinyllama.gptuner import GPTuneConfig, GPTune
train_config = TrainConfig(batch_size=32, epochs=64, lr=1e-3, log_interval=50)
gptune_config = GPTuneConfig(max_num_training_samples=100, hyperparams_to_tune=["epochs", "n_heads", "context_window"], hyperparams_to_plot=["n_heads"], l_bounds=[10, 2, 150], u_bounds=[50, 5, 250], max_num_evaluations=500)
gptune = GPTune(gptune_config)
gptune.run(model, tokens, train_config)

Generating

Generates a response to a prompt.

from tinyllama import generate
# kv_cache is set to True by default
generate(model, prompt, max_tokens=900, kv_cache=True)

Repository: miftahmoha/tinyllama
