This project is a pure C++ implementation of the LLaMA 4 model. Its primary goal is educational—helping people understand the architecture and internals of LLaMA 4.
- Educational: Demystify the LLaMA 4 architecture through hands-on C++ implementation.
- Optimization Challenge: Compete in optimizing LLaMA 4 inference on a GPU node.
- Reference: For background, see our previous work on LLaMA 2 (LLAMA2.cpp) and the accompanying LLaMA 2 architecture explanation.
Update Checkpoint: 11 Aug 2025
- Run GPT-OSS inference using CPU only: ongoing
- Implement tokenizer: ongoing
- Implement model loading
- Implement forward pass (see the decode-loop sketch after this checklist): ongoing
- Decide evaluation metric (BLEU, MMLU, etc.)
- Build evaluation scripts
- Compare with baseline models
- Make slide for Output Norm: TODO
- Make common slide: TODO
- Define threshold criteria: ongoing
- Report throughput and latency metrics
- Add logging and debugging utilities: optional
- Create reference outputs for gpt-oss-20b for 128 prompts: TODO
- Create reference outputs for gpt-oss-120b for 128 prompts: TODO
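
As a rough illustration of what the CPU-only inference items above will involve, the sketch below wraps a greedy decode loop around a stubbed forward pass. `StubModel`, its dummy logits, and the hard-coded token ids are placeholders, not this repository's actual interfaces.

```cpp
// Structural sketch of the planned CPU-only decode loop (greedy sampling).
// StubModel is a stand-in, not this repo's real model class.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

struct StubModel {
    int vocab_size = 16;
    // Real implementation: run the transformer forward pass for one token
    // at position `pos`, update the KV cache, and return vocabulary logits.
    std::vector<float> forward(int32_t token, int32_t pos) {
        std::vector<float> logits(vocab_size, 0.0f);
        logits[(token + pos + 1) % vocab_size] = 1.0f;  // dummy logits
        return logits;
    }
};

int main() {
    StubModel model;
    std::vector<int32_t> tokens = {1, 5, 7};  // pretend-tokenized prompt
    const int max_new_tokens = 8;

    // Feed the prompt tokens, then greedily take the argmax each step.
    int32_t pos = 0;
    for (size_t i = 0; i + 1 < tokens.size(); ++i) model.forward(tokens[i], pos++);

    int32_t next = tokens.back();
    for (int i = 0; i < max_new_tokens; ++i) {
        std::vector<float> logits = model.forward(next, pos++);
        next = static_cast<int32_t>(
            std::max_element(logits.begin(), logits.end()) - logits.begin());
        std::cout << next << ' ';
    }
    std::cout << '\n';
    return 0;
}
```

The real loop would additionally run the tokenizer's encode/decode on the prompt and the generated tokens, and the forward pass would maintain a KV cache.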
- General
- 29 accounts
- A shared directory
- Node requirement: 4 physical nodes; the count can be dynamically increased or decreased for other purposes
- 3 TB of disk storage replicated across all physical nodes, used to store models
- CPU/GPU info tools
- Uniform software environment (ROCm, GCC, etc.) on the login node and worker nodes
- User guide document
- Slurm jobs
- Can be sent to worker nodes with the `srun` command (see the allocation sanity-check sketch at the end of this section)
- Time limit: 1 hour per job
- Node limit: 1 node per job
- GPU limit: 8 GPUs per job
- Two jobs cannot use the same GPU
- Two jobs from the same user cannot run at the same time
- Jobs from users with no running jobs have higher priority than jobs from users with running jobs; the Slurm queue therefore needs a loosened FIFO policy: TODO until end of project
- Slurm worker nodes
- Plan A
- 4 physical nodes serve as 4 Slurm worker nodes
- Worker nodes can process multiple jobs at the same time
- Users can define the number of GPUs to be allocated (up to 8); default: 1
- Plan B (in case Plan A fails)
- 11 virtual nodes serve as 11 Slurm worker nodes
- 3 worker nodes have 8 inter-connected GPUs (type X worker node)
- 8 worker nodes have 1 GPU (type Y worker node)
- The number of type X and type Y worker nodes can be dynamically increased or decreased for workload balancing
- Only one job can be processed by a worker node at a time
- No GPU can be accessed by more than 1 worker node
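
Since jobs may request up to 8 GPUs and two jobs can never share a GPU, it can be useful to verify the allocation from inside a running job. Below is a minimal sanity-check sketch assuming the ROCm/HIP toolchain listed above; the `srun` flags in the comment and the `SLURM_GPUS_ON_NODE` variable are illustrative and may differ with this cluster's Slurm configuration.

```cpp
// Sanity check inside a Slurm job: report the Slurm allocation info and the
// GPUs actually visible through HIP. Assumes a ROCm/HIP toolchain (hipcc).
// Example launch (flags may differ on this cluster):
//   srun --nodes=1 --gres=gpu:8 --time=01:00:00 ./gpu_check
#include <hip/hip_runtime.h>
#include <cstdlib>
#include <iostream>

int main() {
    // Slurm exports job metadata as environment variables; names can vary
    // with the Slurm version, so treat these as optional.
    if (const char* job = std::getenv("SLURM_JOB_ID"))
        std::cout << "SLURM_JOB_ID: " << job << '\n';
    if (const char* gpus = std::getenv("SLURM_GPUS_ON_NODE"))
        std::cout << "SLURM_GPUS_ON_NODE: " << gpus << '\n';

    int count = 0;
    if (hipGetDeviceCount(&count) != hipSuccess) {
        std::cerr << "hipGetDeviceCount failed\n";
        return 1;
    }
    std::cout << "Visible HIP devices: " << count << '\n';

    for (int i = 0; i < count; ++i) {
        hipDeviceProp_t prop;
        if (hipGetDeviceProperties(&prop, i) == hipSuccess)
            std::cout << "  device " << i << ": " << prop.name << '\n';
    }
    return 0;
}
```

If the cluster confines device visibility per job, the reported device count should match the number of GPUs requested via `srun`.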