torchsmith

[Badges: License | Test Coverage]

Understanding Generative AI By Building It From Scratch

🔥 Why Torchsmith?

Torchsmith is a minimalist library that focuses on understanding by building. It builds modern, multimodal generative AI: VAEs, VQVAEs, autoregressive models, and diffusion models trained on image, text, and image-text data.

Torchsmith is built using basic PyTorch operations, without relying on high-level abstractions.

Here you will find bare-bones implementations of the building blocks of modern machine learning: attention, positional encodings, transformers, learning rate schedulers, and various text and image tokenizers, to name a few.
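
To give a flavor of what "built using basic PyTorch operations" means, here is a minimal sketch of causal scaled dot-product attention written with primitive ops (an illustrative sketch, not Torchsmith's exact code):

import math

import torch

def scaled_dot_product_attention(q, k, v, causal=True):
    # q, k, v: (batch, seq_len, d_head)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    if causal:
        # Mask out future positions so each token attends only to the past.
        seq_len = q.shape[-2]
        mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=q.device),
            diagonal=1,
        )
        scores = scores.masked_fill(mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v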

Torchsmith was inspired by Berkeley's CS294-158.

🎬 Torchsmith In Action

Here is a quick highlight reel; for more, see the Experiments section below.

Image Generation

Image-space diffusion with a U-Net and latent diffusion with a Diffusion Transformer (DiT), both trained on the CIFAR 10 dataset.

Fig. (Left) Original images from the CIFAR 10 dataset. (Center) Unconditional samples generated by the UNet using image-space diffusion. (Right) Class-conditional samples generated by the DiT using latent diffusion.
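
As a rough illustration of how image-space diffusion is trained (a simplified DDPM-style sketch under common conventions, not necessarily Torchsmith's exact trainer), each training step noises a clean image to a random timestep and asks the network to predict the added noise:

import torch
import torch.nn.functional as F

def diffusion_training_step(model, x0, alphas_cumprod):
    # x0: clean images (batch, C, H, W); alphas_cumprod: (T,) noise schedule.
    t = torch.randint(0, alphas_cumprod.shape[0], (x0.shape[0],), device=x0.device)
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    # Forward process: blend the clean image with Gaussian noise at level t.
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise
    # The network (e.g. a UNet) predicts the noise that was added.
    pred_noise = model(x_t, t)
    return F.mse_loss(pred_noise, noise)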

Image-Text Generation

A GPT2-style image-text decoder trained on the Colored MNIST dataset. The model can be used to generate (see the sampling sketch after the list):

  1. Unconditional image-text pairs
  2. Text-conditioned images
  3. Image-conditioned texts
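
Conceptually, all three modes come from the same decoder; only the prefix changes: text tokens for text-conditioned images, image tokens for image-conditioned text, or just a start token for unconditional pairs. A hypothetical sketch of the sampling loop (the function and model interface are illustrative assumptions, not Torchsmith's actual API):

import torch

@torch.no_grad()
def sample_conditioned(model, prefix_tokens, num_new_tokens):
    # prefix_tokens: (batch, prefix_len) token ids acting as the condition.
    tokens = prefix_tokens
    for _ in range(num_new_tokens):
        logits = model(tokens)[:, -1, :]              # next-token logits
        probs = torch.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens  # split back into image and text tokens downstream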

Image completion followed by text generation

Fig. For each image in the left-most column, only the non-darkened pixels are given as input to the GPT2 image-text model. To generate a sample, the model performs image completion followed by text generation that captions the contents of the completed image. Each row shows 4 samples, i.e., 4 image-text pairs, generated by the model.

Text Generation

GPT2-style text generation decoder trained on Eminem lyrics:

Sample #1
What, Im gonna get stay back to exam to stay
My dogs are crazy after nice playin to death
You know that wasnt spakin four
Thats my daughter, you wanna get score
Play if youre that videos
Oh, whoa, cause I wanna get so f***** circus
And someone and

Sample #2
And its feelin youse right!
Wouldnt f*** you to your kill: But you talk about me
Its about to have you still smack me
You talk to you think thats the shit you aint got some brains thats even you
When you hear on the floor, Im hardly startin to kill you!

Representation Learning with VAEs

A Variational Autoencoder (VAE) with a convolutional encoder and decoder learns rich latent representations on the Colored MNIST and SVHN datasets.

Fig. Walking the VAE latent space. Samples generated after linearly interpolating between a starting and endpoint in the latent space. Each row represents walking in the latent space with the left-most column as the starting point and the right-most column as the endpoint. (left) Colored MNIST dataset (right) SVHN dataset.
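
Walking the latent space amounts to encoding two images, linearly interpolating between their latents, and decoding each intermediate point. A minimal sketch (the encode/decode method names are assumptions, not Torchsmith's exact API):

import torch

@torch.no_grad()
def interpolate_latents(vae, x_start, x_end, steps=10):
    # Encode the two endpoint images to latent vectors.
    z_start, z_end = vae.encode(x_start), vae.encode(x_end)
    alphas = torch.linspace(0, 1, steps, device=z_start.device)
    # Blend the latents and decode each intermediate point back to an image.
    return torch.stack([vae.decode((1 - a) * z_start + a * z_end) for a in alphas])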

Discrete Representation Learning with VQVAEs and an Autoregressive Prior

Vector-Quantized Variational Autoencoders (VQVAEs) are trained from scratch to learn discrete latent representations of images. An autoregressive transformer is then trained on the learned discrete latent representations as a prior on top of the VQVAE. This prior is used to sample from the latent space, and the samples are decoded to image space using the VQVAE's decoder.

This can be done end-to-end as a 2-step process (the quantization step at its core is sketched after the list):

  • Step 1: Learn the discrete latent representation via the VQVAE.
  • Step 2: Model the prior autoregressively with a GPT2 trained on top of the VQVAE's learned latent representation.
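
At the heart of step 1 is the vector-quantization bottleneck: each encoder output is snapped to its nearest codebook entry, and the resulting indices are exactly the discrete sequence that the GPT2 prior of step 2 is trained on. A minimal sketch of that operation (illustrative, not Torchsmith's exact implementation):

import torch

def vector_quantize(z_e, codebook):
    # z_e: encoder outputs (batch, num_latents, d); codebook: (K, d) embeddings.
    b, n, d = z_e.shape
    flat = z_e.reshape(-1, d)
    distances = torch.cdist(flat, codebook)          # L2 distance to every code
    indices = distances.argmin(dim=-1).view(b, n)    # discrete codes for the prior
    z_q = codebook[indices]                          # quantized latents (batch, n, d)
    # Straight-through estimator: copy gradients from z_q back to the encoder.
    z_q = z_e + (z_q - z_e).detach()
    return z_q, indices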

Fig. SVHN Dataset. Step 1: VQVAE. 50 (original, reconstructed) image pairs at the end of epoch 1 (left), epoch 10 (center), and epoch 20 (right), out of 20 total epochs. The reconstructed image is obtained by passing the original image through the VQVAE (encoder followed by decoder). Notice how the reconstructions get closer to the originals as training progresses.

Fig. SVHN Dataset. Step 2: Autoregressive modeling of the prior using GPT2. 100 unconditional samples generated at the end of epoch 1 (left), epoch 10 (center), and epoch 20 (right), out of 20 total epochs. Sampling is performed by the autoregressive model in latent space; the VQVAE decoder from step 1 then maps the latent samples to image space. Notice how the quality of the generated images improves as training progresses.


✨ Features

  • 🧠 Modality Agnostic Autoregression Trainer

    A modular and minimal model trainer that works seamlessly across:

    • Text generation
    • Image generation
    • Multi-modal tasks: Text-to-Image and Image-to-Text
  • 🎨 Diffusion Trainer For Images

    • Image Diffusion (diffusion in pixel space)
    • Latent Diffusion (diffusion in latent space e.g. using VQ-VAE as encoder)
  • 🔍 VAE and VQVAEs For Images

    • Convolutional Variational Autoencoder to learn a meaningful and compact latent space. The VAE can also be used to generate new samples.
    • VQVAEs learn discrete latent representations, allowing the prior to be modeled with an autoregressive model, which is well suited to generating in the discrete representation space.
    • End-to-end VQVAE with autoregressive prior trained on the learned discrete latent representations.
  • 🤖 Implementation Of Modern Generative Architectures From The Ground Up

    • Modality agnostic GPT2
    • UNet for images
    • DiT (Diffusion Transformer) for images
  • 🧾 Tokenizers For Text

    • Character-level tokenizer
    • Word-level tokenizer
    • Byte Pair Encoding (BPE), sketched after this list
      • Efficiently implemented using Python iterators and joblib
  • 🛠️ Primitive Operations

    • Core components like attention, MLPs, positional encodings, and layer norms are implemented from first principles
    • No reliance on torch.nn.Transformer or other high-level attention blocks
  • 🧪 Test-Driven Development

    • Full suite of unit tests ensuring correctness and stability
    • Easy to extend and experiment without breaking things
  • 📋 Best Practices For Code Development

    • Uses uv for dependency management
    • Enforces style and formatting with pre-commit and ruff
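
To give an idea of what the BPE tokenizer does under the hood, here is a minimal sketch of the classic merge loop (illustrative only; the actual implementation additionally uses Python iterators and joblib for efficiency):

from collections import Counter

def learn_bpe_merges(words, num_merges):
    # words: corpus as a list of symbol tuples, e.g. [("l", "o", "w", "</w>"), ...]
    words = [tuple(w) for w in words]
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the corpus.
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair becomes a new symbol
        merges.append(best)
        merged_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(w[i] + w[i + 1])
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            merged_words.append(tuple(out))
        words = merged_words
    return merges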

Experiments

[Autoregression] Colored MNIST With Text Labels

This experiment uses a GPT2-style decoder trained with autoregression. In addition to the results shown in the highlight reel above, here are more detailed results:

Unconditional image-text pair generation

Fig. Unconditional image-text pair generation.

Image generations conditioned on the given text

Fig. Generated images conditioned on the text "dark red {DIGIT} on light cyan", where DIGIT takes the digit names from 1-9 (e.g. "dark red one on light cyan"). The model conditions the generated image on the text.

Text completion followed by image generation conditioned on the text

Fig. Generated images conditioned on the text "plain red {DIGIT}", where DIGIT takes the digit names from 1-9 (e.g. "plain red one", "plain red two"). Note that the model first completes the text to determine the background color ("plain red one" -> "plain red one on dark green") and then generates the corresponding image.

[Diffusion] CIFAR 10

In addition to the results shown in the highlight reel above, here are more detailed results:

UNet Experiments

Fig. CIFAR 10 samples generated after 4, 16, 64, 256, and 512 denoising steps (from left to right). Notice how the level of detail in the generated samples increases with the number of denoising steps.
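
A simplified deterministic (DDIM-style) sampler illustrates why the step count matters: the reverse process is discretized into num_steps updates, so fewer steps mean coarser denoising. This is a sketch under common conventions, not necessarily Torchsmith's exact sampler:

import torch

@torch.no_grad()
def sample(model, shape, alphas_cumprod, num_steps):
    # Evenly spaced timesteps from T-1 down to 0; more steps = finer denoising.
    timesteps = torch.linspace(alphas_cumprod.shape[0] - 1, 0, num_steps).long()
    x = torch.randn(shape)                            # start from pure Gaussian noise
    for i, t in enumerate(timesteps):
        a_bar = alphas_cumprod[t]
        pred_noise = model(x, t.repeat(shape[0]))     # predict the noise in x_t
        # Estimate the clean image, then re-noise it to the next (smaller) timestep.
        x0_hat = (x - (1 - a_bar).sqrt() * pred_noise) / a_bar.sqrt()
        if i + 1 < len(timesteps):
            a_bar_next = alphas_cumprod[timesteps[i + 1]]
            x = a_bar_next.sqrt() * x0_hat + (1 - a_bar_next).sqrt() * pred_noise
        else:
            x = x0_hat
    return x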

Fig. Unconditional CIFAR 10 samples generated after epochs 1, 5, 15, 30, and 60 (from left to right). After epoch 1 the sampled images consist mostly of high-frequency noise, but details begin to appear and the images become coherent as training progresses.

Diffusion Transformer (DiT) Experiments

Fig. Class-conditioned CIFAR 10 samples generated with no classifier-free guidance (CFG) and with CFG weights of 1.0, 3.0, 5.0, and 7.5 (from left to right). Larger CFG weights improve the faithfulness of the generated images to the class; with too large a weight, the samples begin to lose diversity and collapse to similar-looking images.
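
Classifier-free guidance combines a conditional and an unconditional noise prediction at every denoising step; the weight w controls how strongly the class label steers the sample. A sketch of the standard formulation (the model signature is an assumption, not Torchsmith's exact code):

def cfg_noise_prediction(model, x_t, t, class_labels, null_labels, w):
    # Two forward passes: one conditioned on the class, one on the "null" label.
    eps_cond = model(x_t, t, class_labels)
    eps_uncond = model(x_t, t, null_labels)
    # w = 0 gives the unconditional prediction, w = 1 the conditional one;
    # larger w pushes samples toward the class at the cost of diversity.
    return eps_uncond + w * (eps_cond - eps_uncond)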

Fig. Unconditional CIFAR 10 samples generated after epochs 1, 3, 5, 30, and 60 (from left to right). The sampled images become coherent and detailed as training progresses.

Fig. (left) Cosine learning rate scheduler with warmup. (right) Train and test losses.
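
The schedule in the left plot can be written in a few lines: a linear warmup to the peak learning rate followed by a cosine decay. A generic sketch (hyperparameter names are illustrative, not necessarily Torchsmith's exact scheduler):

import math

def lr_at_step(step, max_lr, warmup_steps, total_steps, min_lr=0.0):
    if step < warmup_steps:
        # Linear warmup from 0 to max_lr.
        return max_lr * (step + 1) / warmup_steps
    # Cosine decay from max_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))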

[Autoregression] Poetry

Here are some samples generated by a GPT2-style model trained on the poetry dataset:

Sample #1
Long Pystand Pelite: 191[Dweven
Pestate known that they recicish or latt.
Untrunter-vain-his and-lifes.

Sample #2
With lover blowdes and stake a the shade unto more,
The gively . This daight that I dispure.
Revaliants stempeted golding of t

Sample #3
Dramters my not be and mensuck'd withat might and of this,
Then with an shike of sturn
Of wift be my kimery arm thing fair dre

[Autoencoders] VAE

Colored MNIST

The Colored MNIST dataset is modeled using a convolutional VAE with a 32-dimensional latent space.
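
The core of the VAE is the reparameterization trick plus the ELBO objective (a reconstruction term plus a KL term). A minimal sketch of both (illustrative, not Torchsmith's exact loss):

import torch
import torch.nn.functional as F

def reparameterize(mu, log_var):
    # z = mu + sigma * eps, so gradients flow through the sampling step.
    return mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)

def vae_loss(x, x_recon, mu, log_var):
    recon = F.mse_loss(x_recon, x, reduction="sum")                  # reconstruction
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())   # KL to N(0, I)
    return recon + kl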

Fig. (left) 50 pairs consisting of the original image (from the Colored MNIST dataset) and reconstructed image. Reconstruction is performed by first encoding and then decoding the original image. (center) Colored MNIST samples generated after linearly interpolating between a starting and endpoint in the latent space. Each row represents walking in the latent space with the left-most column as the starting point and the right-most column as the endpoint. (right) Colored MNIST samples generated by randomly sampling the latent space followed by decoding.

SVHN

The Street View House Numbers dataset is modeled using a convolutional VAE with a 16-dimensional latent space.

Fig. (left) 50 pairs consisting of the original image (from the SVHN dataset) and reconstructed image. Reconstruction is performed by first encoding and then decoding the original image. (center) SVHN samples generated after linearly interpolating between a starting and endpoint in the latent space. Each row represents walking in the latent space with the left-most column as the starting point and the right-most column as the endpoint. (right) SVHN samples generated by randomly sampling the latent space followed by decoding.

[Autoencoders] VQVAE

Read the preliminaries in the VQVAE section of the highlight reel above.

SVHN

See the SVHN results in the highlight reel above.

Fig. SVHN Dataset. (Left) Step 1: VQVAE. 50 (original image, reconstructed) pairs. The reconstructed image is obtained by passing the original image through the VQVAE (encoder followed by decoder). (Right) Step 2: Autoregressive GPT2 trained on VQVAE's learned latent representation. 100 unconditional samples generated by the autoregressive model which are then decoded from the learned discrete latent representations to the image space.

CIFAR-10

Fig. CIFAR-10 Dataset. Step 1: VQVAE. 50 (original, reconstructed) image pairs at the end of epoch 1 (left), epoch 10 (center), and epoch 20 (right), out of 20 total epochs. The reconstructed image is obtained by passing the original image through the VQVAE (encoder followed by decoder). Notice how the reconstructions get closer to the originals as training progresses.

Fig. CIFAR-10 Dataset. Step 2: Autoregressive modeling of the prior using GPT2. 100 unconditional samples generated at the end of epoch 1 (left), epoch 10 (center), and epoch 20 (right), out of 20 total epochs. Sampling is performed by the autoregressive model in latent space; the VQVAE decoder from step 1 then maps the latent samples to image space. Notice how the quality of the generated images improves as training progresses.

Fig. CIFAR-10 Dataset. (Left) Step 1: VQVAE. 50 (original image, reconstructed) pairs. The reconstructed image is obtained by passing the original image through the VQVAE (encoder followed by decoder). (Right) Step 2: Autoregressive GPT2 trained on VQVAE's learned latent representation. 100 unconditional samples generated by the autoregressive model which are then decoded from the learned discrete latent representations to the image space.


📦 Installation

# Clone the repo
git clone https://github.com/ankitdhall/torchsmith.git

# Option 1: Install using uv
uv pip install ./torchsmith

# Option 2: Install using pip
pip install ./torchsmith

🧰 Usage Examples

💪 Training

Training scripts for various datasets and models can be found in the examples/ directory.

For example:

python examples/train_eminem.py

🎨 Generating Samples

import huggingface_hub

from torchsmith.models.gpt2.decoder import GPT2Decoder
from torchsmith.tokenizers.mnist_tokenizer import ColoredMNISTImageAndTextTokenizer
from torchsmith.tokenizers.mnist_tokenizer import (
    colored_mnist_with_text_conditioned_on_text,
)
from torchsmith.utils.plotting import plot_images
from torchsmith.utils.pytorch import get_device

# Text prompt to condition the generated images on, and how many samples to draw.
text = "plain orange seven on dark blue"
num_samples = 4

# Tokenizer for joint image-text sequences on Colored MNIST.
tokenizer = ColoredMNISTImageAndTextTokenizer()
# Download pretrained GPT2 decoder weights from the Hugging Face Hub.
path_to_weights = huggingface_hub.hf_hub_download(
    "ankitdhall/colored_mnist_with_text_gpt2", filename="model.pth"
)
transformer = GPT2Decoder.load_model(path_to_weights).to(get_device())

# Generate images conditioned on the text prompt, then plot them with their captions.
decoded_images, decoded_texts = colored_mnist_with_text_conditioned_on_text(
    num_samples=num_samples,
    text=text,
    tokenizer=tokenizer,
    transformer=transformer,
)
plot_images(
    decoded_images, titles=decoded_texts, max_cols=int(num_samples ** 2)
)

🧪 Test-Driven Development

Torchsmith strives for test-driven development and uses pytest for testing. The tests cover most of the codebase.

First install packages needed for testing:

# Option 1: Install using uv
uv pip install ./torchsmith[testing]

# Option 2: Install using pip
pip install ./torchsmith[testing]

To run the tests:

pytest tests/

To generate the test coverage badge and save it to .badges/, run:

./scripts/generate_badges.sh

🧑‍💻 Contributing

When contributing changes, please:

  • Add tests for new features, improvements and bug-fixes.
  • Follow the existing coding style.

Torchsmith uses pre-commit hooks to ensure clean and consistent code:

pre-commit install
pre-commit run --all-files

🚧 TODOs

  • Experiment with learning rate schedulers
  • No-code way to train using declarative YAML config
  • Experiment with VAEs
  • Experiment with VQ-VAEs
  • Support LoRA and fine-tuning utilities
  • Find bigger GPUs and extend to larger datasets
  • Experiment with different positional embeddings

Acknowledgements

Inspired by Berkeley's CS294-158: Deep Unsupervised Learning.
