
🚀 CoreTransformer - A Transformer Model from Scratch

This repository contains a from-scratch implementation of the Transformer architecture using NumPy. The model is inspired by the paper "Attention is All You Need" and implements multi-head self-attention, positional encoding, feed-forward networks, and masking.


📌 Features

✔️ Built from Scratch – No deep learning frameworks, just NumPy
✔️ Multi-Head Self-Attention – Implements scaled dot-product attention
✔️ Positional Encoding – Adds positional information to embeddings
✔️ Feed-Forward Networks – Fully connected layers with activation
✔️ Encoder-Decoder Architecture – Implements a full Transformer
✔️ Custom Masking Mechanisms – Supports padding & look-ahead masking


⚡ Installation

Clone the repository:

git clone https://github.com/aaditey932/transformer-from-scratch.git
cd transformer-from-scratch

Install dependencies:

pip install numpy

🚀 Usage

1️⃣ Run the Transformer Model

To test the Transformer, run:

python main.py

2️⃣ Example: Creating an Encoder

import numpy as np
from transformer import TransformerEncoder

encoder = TransformerEncoder(vocab_size=10000, d_model=512, num_heads=8, num_layers=6, d_ff=2048)
x = np.random.randint(0, 10000, (2, 10))  # Batch of 2 sentences, 10 words each
encoded_output = encoder.forward(x)
print(encoded_output.shape)  # Expected output: (2, 10, 512)

🏗️ Code Structure

📂 transformer_from_scratch/
┣ 📜 transformer.py – Implements Transformer, Encoder, Decoder
┣ 📜 attention.py – Implements Multi-Head Self-Attention
┣ 📜 feedforward.py – Implements Feed-Forward Network
┣ 📜 positional_encoding.py – Implements Positional Encoding
┣ 📜 masks.py – Implements padding & look-ahead masking
┣ 📜 main.py – Entry point to test the Transformer
┗ 📜 README.md – Documentation


🔍 Transformer Architecture Overview

Encoder Block

1️⃣ Token Embedding → Converts input words to vectors
2️⃣ Positional Encoding → Adds positional information
3️⃣ Multi-Head Self-Attention → Captures dependencies between words
4️⃣ Feed-Forward Network → Processes embeddings independently
5️⃣ Residual Connections & Layer Normalization → Stabilize training around each sub-layer (see the sketch below)
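
The wiring of these five steps inside a single encoder layer can be sketched in NumPy as follows. This is a minimal illustration rather than the code in transformer.py: self_attention and feed_forward stand in for the implementations in attention.py and feedforward.py, and the layer-norm epsilon is an assumed value.

import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's feature vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def encoder_layer(x, self_attention, feed_forward, mask=None):
    # x: (batch, seq_len, d_model), already embedded and positionally encoded
    attn_out = self_attention(x, x, x, mask)   # multi-head self-attention
    x = layer_norm(x + attn_out)               # residual connection + layer norm
    ffn_out = feed_forward(x)                  # position-wise feed-forward network
    return layer_norm(x + ffn_out)             # residual connection + layer norm

Each sub-layer's output is added back to its input before normalization, which is what keeps deep stacks of layers trainable.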

Decoder Block

1️⃣ Masked Multi-Head Attention → Ensures the autoregressive property
2️⃣ Cross-Attention → Attends to the encoder output
3️⃣ Feed-Forward Network → Processes each position independently
4️⃣ Final Linear & Softmax → Projects to the vocabulary and outputs a probability distribution (see the sketch below)
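
The final projection step can be sketched in NumPy as below. W_out and b_out are hypothetical learned parameters of shape (d_model, vocab_size) and (vocab_size,); they are not names taken from this repository.

import numpy as np

def softmax(logits, axis=-1):
    # Subtract the row-wise max for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def output_layer(decoder_out, W_out, b_out):
    # decoder_out: (batch, tgt_len, d_model) -> probabilities: (batch, tgt_len, vocab_size)
    logits = decoder_out @ W_out + b_out
    return softmax(logits)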


🧩 Key Components Explained

1️⃣ Multi-Head Self-Attention

Formula: [ \text{Attention}(Q, K, V) = \text{softmax} \left( \frac{Q K^T}{\sqrt{d_k}} \right) V ]
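
A minimal NumPy sketch of this formula follows. The mask convention used here (0 means "may not attend") is an assumption and may differ from the one in attention.py.

import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q, K, V: (..., seq_len, d_k); mask: 1 = may attend, 0 = blocked (assumed convention)
    d_k = Q.shape[-1]
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)        # (..., seq_len, seq_len)
    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)            # blocked positions get ~ -inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over the key axis
    return weights @ V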

2️⃣ Positional Encoding

Formulas:
[ PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}}) ]
[ PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}}) ]
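
A minimal NumPy sketch of these formulas, assuming an even d_model:

import numpy as np

def positional_encoding(max_len, d_model):
    # pos: (max_len, 1); even dimension indices 0, 2, 4, ...: (1, d_model / 2)
    pos = np.arange(max_len)[:, np.newaxis]
    even_i = np.arange(0, d_model, 2)[np.newaxis, :]
    angles = pos / np.power(10000.0, even_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return pe                      # (max_len, d_model), added to the token embeddings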

3️⃣ Feed-Forward Network

A simple two-layer MLP with ReLU activation:

[ FFN(x) = \max(0, xW_1 + b_1)W_2 + b_2 ]
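
As a sketch, with W1, b1, W2 and b2 as hypothetical weight arrays of shapes (d_model, d_ff), (d_ff,), (d_ff, d_model) and (d_model,):

import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # x: (batch, seq_len, d_model); applied to every position independently
    hidden = np.maximum(0.0, x @ W1 + b1)   # ReLU(x W1 + b1)
    return hidden @ W2 + b2                 # project back to d_model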


🏋️ Training the Transformer

🚧 Training is not implemented in this repository. If you want to extend it for training:

  1. Implement a loss function (e.g., cross-entropy loss – see the sketch below).
  2. Use gradient descent (NumPy-based SGD/Adam) for optimization.
  3. Add a dataset loader for NLP tasks (e.g., machine translation).
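
As a starting point, a masked cross-entropy loss over the decoder's softmax output could look like the sketch below. Padding id 0 is assumed, matching the padding mask described later in this README.

import numpy as np

def cross_entropy_loss(probs, targets, pad_id=0):
    # probs: (batch, seq_len, vocab_size) softmax output; targets: (batch, seq_len) token ids
    batch, seq_len, _ = probs.shape
    picked = probs[np.arange(batch)[:, None], np.arange(seq_len)[None, :], targets]
    mask = (targets != pad_id).astype(np.float32)   # ignore loss at padding positions
    nll = -np.log(picked + 1e-9) * mask
    return nll.sum() / mask.sum()                   # mean negative log-likelihood per real token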

🎥 YouTube Video & Explanation

Check out my YouTube video explaining this Transformer model:
📹 Watch Here

🔹 Topics covered in the video:
✔️ Transformer model architecture
✔️ Self-attention & Multi-head attention
✔️ Building a Transformer from scratch
✔️ Hands-on code implementation

🔹 Resources & Code (if applicable):
📜 GitHub Repository: [Insert Link]
📘 Paper: "Attention is All You Need"

🎥 Don’t forget to like, comment, and subscribe if you found this helpful!

#AI #MachineLearning #DeepLearning #LLM #Transformers #NeuralNetworks


🔍 Masking Mechanism Explained

1️⃣ Padding Mask (src_mask)

  • Prevents attention from focusing on padding tokens.
  • Generated using:
    src_mask = (src != 0).astype(np.float32)

2️⃣ Look-Ahead Mask (tgt_mask)

  • Ensures that each token in the decoder only attends to previous tokens.
  • Created using an upper triangular matrix:
    look_ahead_mask = np.triu(np.ones((seq_length, seq_length)), k=1)

3️⃣ Final Target Mask Combination

  • Combines the padding mask with the look-ahead mask (see the sketch below):
    tgt_mask = tgt_mask[:, np.newaxis, :] * look_ahead_mask
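
Putting the three pieces together, a mask builder could look like the sketch below. It assumes the convention that 1 means "may attend" and 0 means "blocked", which is why the look-ahead matrix is inverted before combining; the attention code in this repository may expect the opposite convention, in which case the final mask should be flipped.

import numpy as np

def make_target_mask(tgt, pad_id=0):
    # tgt: (batch, tgt_len) token ids -> mask: (batch, tgt_len, tgt_len)
    seq_length = tgt.shape[1]
    pad_mask = (tgt != pad_id).astype(np.float32)                      # 1 = real token
    look_ahead_mask = np.triu(np.ones((seq_length, seq_length)), k=1)  # 1 = future position
    allowed = 1.0 - look_ahead_mask                                    # 1 = may attend
    # broadcast (batch, 1, tgt_len) * (tgt_len, tgt_len) -> (batch, tgt_len, tgt_len)
    return pad_mask[:, np.newaxis, :] * allowed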

📝 References

  • Vaswani, A., et al. "Attention Is All You Need." NeurIPS 2017. https://arxiv.org/abs/1706.03762

🌟 Contributing

Want to improve this project? Fork the repo, create a feature branch, and submit a pull request or open an issue! 🚀

git checkout -b feature-branch
git commit -m "Add a new feature"
git push origin feature-branch

🛠️ License

📜 MIT License – Feel free to use and modify this repository!


🔥 Enjoy learning Transformers? Give this repo a ⭐ on GitHub!

