
🚀 CoreTransformer - A Transformer Model from Scratch

This repository contains a from-scratch implementation of the Transformer architecture using NumPy. The model is inspired by the paper "Attention is All You Need" and implements multi-head self-attention, positional encoding, feed-forward networks, and masking.


📌 Features

✔️ Built from Scratch – No deep learning frameworks, just NumPy
✔️ Multi-Head Self-Attention – Implements scaled dot-product attention
✔️ Positional Encoding – Adds positional information to embeddings
✔️ Feed-Forward Networks – Fully connected layers with activation
✔️ Encoder-Decoder Architecture – Implements a full Transformer
✔️ Custom Masking Mechanisms – Supports padding & look-ahead masking


⚡ Installation

Clone the repository:

git clone https://github.com/aaditey932/transformer-from-scratch.git
cd transformer-from-scratch

Install dependencies:

pip install numpy

🚀 Usage

1️⃣ Run the Transformer Model

To test the Transformer, run:

python main.py

2️⃣ Example: Creating an Encoder

import numpy as np
from transformer import TransformerEncoder

encoder = TransformerEncoder(vocab_size=10000, d_model=512, num_heads=8, num_layers=6, d_ff=2048)
x = np.random.randint(0, 10000, (2, 10))  # Batch of 2 sentences, 10 words each
encoded_output = encoder.forward(x)
print(encoded_output.shape)  # Expected output: (2, 10, 512)

🏗️ Code Structure

📂 transformer_from_scratch/
┣ 📜 transformer.py – Implements Transformer, Encoder, Decoder
┣ 📜 attention.py – Implements Multi-Head Self-Attention
┣ 📜 feedforward.py – Implements Feed-Forward Network
┣ 📜 positional_encoding.py – Implements Positional Encoding
┣ 📜 masks.py – Implements padding & look-ahead masking
┣ 📜 main.py – Entry point to test the Transformer
┗ 📜 README.md – Documentation


🔍 Transformer Architecture Overview

Encoder Block

1️⃣ Token Embedding → Converts input words to vectors
2️⃣ Positional Encoding → Adds positional information
3️⃣ Multi-Head Self-Attention → Captures dependencies between words
4️⃣ Feed-Forward Network → Processes embeddings independently
5️⃣ Residual Connections & Layer Normalization → Stabilize training around each sub-layer (see the sketch below)
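
The wiring of these five steps inside a single encoder layer can be sketched in NumPy as follows. This is a minimal illustration rather than the code in transformer.py: self_attention and feed_forward stand in for the implementations in attention.py and feedforward.py, and the layer-norm epsilon is an assumed value.

import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's feature vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def encoder_layer(x, self_attention, feed_forward, mask=None):
    # x: (batch, seq_len, d_model), already embedded and positionally encoded
    attn_out = self_attention(x, x, x, mask)   # multi-head self-attention
    x = layer_norm(x + attn_out)               # residual connection + layer norm
    ffn_out = feed_forward(x)                  # position-wise feed-forward network
    return layer_norm(x + ffn_out)             # residual connection + layer norm

Each sub-layer's output is added back to its input before normalization, which is what keeps deep stacks of layers trainable.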

Decoder Block

1️⃣ Masked Multi-Head Attention → Ensures the autoregressive property
2️⃣ Cross-Attention → Attends to the encoder output
3️⃣ Feed-Forward Network → Processes each position independently
4️⃣ Final Linear & Softmax → Projects to the vocabulary and outputs a probability distribution (see the sketch below)
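
The final projection step can be sketched in NumPy as below. W_out and b_out are hypothetical learned parameters of shape (d_model, vocab_size) and (vocab_size,); they are not names taken from this repository.

import numpy as np

def softmax(logits, axis=-1):
    # Subtract the row-wise max for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def output_layer(decoder_out, W_out, b_out):
    # decoder_out: (batch, tgt_len, d_model) -> probabilities: (batch, tgt_len, vocab_size)
    logits = decoder_out @ W_out + b_out
    return softmax(logits)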


🧩 Key Components Explained

1️⃣ Multi-Head Self-Attention

Formula: [ \text{Attention}(Q, K, V) = \text{softmax} \left( \frac{Q K^T}{\sqrt{d_k}} \right) V ]
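
A minimal NumPy sketch of this formula follows. The mask convention used here (0 means "may not attend") is an assumption and may differ from the one in attention.py.

import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q, K, V: (..., seq_len, d_k); mask: 1 = may attend, 0 = blocked (assumed convention)
    d_k = Q.shape[-1]
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)        # (..., seq_len, seq_len)
    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)            # blocked positions get ~ -inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over the key axis
    return weights @ V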

2️⃣ Positional Encoding

Formulas:
[ PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}}) ]
[ PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}}) ]
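
A minimal NumPy sketch of these formulas, assuming an even d_model:

import numpy as np

def positional_encoding(max_len, d_model):
    # pos: (max_len, 1); even dimension indices 0, 2, 4, ...: (1, d_model / 2)
    pos = np.arange(max_len)[:, np.newaxis]
    even_i = np.arange(0, d_model, 2)[np.newaxis, :]
    angles = pos / np.power(10000.0, even_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return pe                      # (max_len, d_model), added to the token embeddings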

3️⃣ Feed-Forward Network

A simple two-layer MLP with ReLU activation:

[ FFN(x) = \max(0, xW_1 + b_1)W_2 + b_2 ]
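
As a sketch, with W1, b1, W2 and b2 as hypothetical weight arrays of shapes (d_model, d_ff), (d_ff,), (d_ff, d_model) and (d_model,):

import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # x: (batch, seq_len, d_model); applied to every position independently
    hidden = np.maximum(0.0, x @ W1 + b1)   # ReLU(x W1 + b1)
    return hidden @ W2 + b2                 # project back to d_model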


🏋️ Training the Transformer

🚧 Training is not implemented in this repository. If you want to extend it for training:

  1. Implement a loss function (e.g., cross-entropy loss – see the sketch below).
  2. Use gradient descent (NumPy-based SGD/Adam) for optimization.
  3. Add a dataset loader for NLP tasks (e.g., machine translation).
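
As a starting point, a masked cross-entropy loss over the decoder's softmax output could look like the sketch below. Padding id 0 is assumed, matching the padding mask described later in this README.

import numpy as np

def cross_entropy_loss(probs, targets, pad_id=0):
    # probs: (batch, seq_len, vocab_size) softmax output; targets: (batch, seq_len) token ids
    batch, seq_len, _ = probs.shape
    picked = probs[np.arange(batch)[:, None], np.arange(seq_len)[None, :], targets]
    mask = (targets != pad_id).astype(np.float32)   # ignore loss at padding positions
    nll = -np.log(picked + 1e-9) * mask
    return nll.sum() / mask.sum()                   # mean negative log-likelihood per real token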

🎥 YouTube Video & Explanation

Check out my YouTube video explaining this Transformer model:
📹 Watch Here

🔹 Topics covered in the video:
✔️ Transformer model architecture
✔️ Self-attention & Multi-head attention
✔️ Building a Transformer from scratch
✔️ Hands-on code implementation

🔹 Resources & Code (if applicable):
📜 GitHub Repository: [Insert Link]
📘 Paper: "Attention is All You Need"

🎥 Don’t forget to like, comment, and subscribe if you found this helpful!

#AI #MachineLearning #DeepLearning #LLM #Transformers #NeuralNetworks


🔍 Masking Mechanism Explained

1️⃣ Padding Mask (src_mask)

  • Prevents attention from focusing on padding tokens.
  • Generated using:
    src_mask = (src != 0).astype(np.float32)

2️⃣ Look-Ahead Mask (tgt_mask)

  • Ensures that each token in the decoder only attends to previous tokens.
  • Created using an upper triangular matrix:
    look_ahead_mask = np.triu(np.ones((seq_length, seq_length)), k=1)

3️⃣ Final Target Mask Combination

  • Combines the padding mask with the look-ahead mask (see the sketch below):
    tgt_mask = tgt_mask[:, np.newaxis, :] * look_ahead_mask
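
Putting the three pieces together, a mask builder could look like the sketch below. It assumes the convention that 1 means "may attend" and 0 means "blocked", which is why the look-ahead matrix is inverted before combining; the attention code in this repository may expect the opposite convention, in which case the final mask should be flipped.

import numpy as np

def make_target_mask(tgt, pad_id=0):
    # tgt: (batch, tgt_len) token ids -> mask: (batch, tgt_len, tgt_len)
    seq_length = tgt.shape[1]
    pad_mask = (tgt != pad_id).astype(np.float32)                      # 1 = real token
    look_ahead_mask = np.triu(np.ones((seq_length, seq_length)), k=1)  # 1 = future position
    allowed = 1.0 - look_ahead_mask                                    # 1 = may attend
    # broadcast (batch, 1, tgt_len) * (tgt_len, tgt_len) -> (batch, tgt_len, tgt_len)
    return pad_mask[:, np.newaxis, :] * allowed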

📝 References

  • Vaswani, A., et al. "Attention Is All You Need." NeurIPS 2017. https://arxiv.org/abs/1706.03762

🌟 Contributing

Want to improve this project? Fork the repo, create a feature branch, and submit a pull request or open an issue! 🚀

git checkout -b feature-branch
git commit -m "Add a new feature"
git push origin feature-branch

🛠️ License

📜 MIT License – Feel free to use and modify this repository!


🔥 Enjoy learning Transformers? Give this repo a ⭐ on GitHub!

