OpenAI MoE (GPT-OSS) Docker Setup for AMD MI300 & MI355

This repository provides step-by-step instructions to set up OpenAI Mixture-of-Experts (MoE) OSS model functionality inside a Docker container on AMD MI300 and MI355 GPUs using ROCm and PyTorch.

This demo leverages Hugging Face’s PEFT LoRA to efficiently fine-tune an OpenAI 20 billion-parameter MoE transformer—with 24 alternating sliding-window/full-attention layers and 32 experts.

Training is performed via Accelerate with FSDP in bfloat16 (gradient checkpointing enabled) on the 200 K-example HuggingFaceH4/ultrachat_200k chat dataset.

UltraChat 200k is a heavily filtered, 200 000-example subset of the original UltraChat pool (≈ 1.4 million ChatGPT-generated multi-turn dialogues). To create it, examples were selected for supervised fine-tuning, true-casing was applied to fix capitalization errors, and any assistant replies that merely disclaim opinions or emotions were removed.

Supported Models:

gpt-oss-120b — for production, general purpose, high reasoning use cases that fits into a single MI300 GPU (117B parameters with 5.1B active parameters)
gpt-oss-20b — for lower latency, and local or specialized use cases (21B parameters with 3.6B active parameters)

The dataset is stored in Parquet format with each entry using the following schema:

prompt
prompt_id
messages: a list of { role, content } pairs

This Flowchart Illustrates Fine-tuning Pipeline for OpenAI MoE Model using UltraChat 200 with Hugging Face Ecosystem.

Prerequisites

Docker installed and working.
ROCm-compatible system with MI300X or MI355X GPUs.
Hugging Face account with a valid token.
Sufficient disk space for model checkpoints (20B model).
Network access to clone repositories and download from Hugging Face.

Step 1. Pull the PyTorch Docker Container

For MI300X

docker pull rocm/pytorch-training:v25.6

For MI355X

docker pull rocm/7.0-preview:rocm7.0_preview_pytorch_training_mi35X_alpha

Step 2. Launch / Run the Docker Container

Replace "/home/USERNAME/" with your actual host path to mount into the container at "/workspace/".
Replace YOUR_DOCKER_NAME with a name you choose.

For MI300X

docker run -it \
  --device /dev/dri \
  --device /dev/kfd \
  --network host \
  --ipc host \
  --group-add video \
  --cap-add SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --privileged \
  -v /home/USERNAME/:/workspace/ \
  --name YOUR_DOCKER_NAME \
  rocm/pytorch-training:v25.6

For MI355X

docker run -it \
  --device /dev/dri \
  --device /dev/kfd \
  --network host \
  --ipc host \
  --group-add video \
  --cap-add SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --privileged \
  -v /home/USERNAME/:/workspace/ \
  --name YOUR_DOCKER_NAME \
  rocm/7.0-preview:rocm7.0_preview_pytorch_training_mi35X_alpha

Step 3. Download the 20B MoE Model Checkpoint

All commands below apply to both MI300X and MI355X.

⚠️ Note on Transformers: Starting from version 4.55.0.dev0 of the Transformers library, built-in OpenAI Mixture of Experts (MoE) support is now available.

# Log in to Hugging Face
huggingface-cli login
# (Enter your HF token when prompted)

# Download the checkpoint
huggingface-cli download HUGGING_FACE_MODEL_DOWNLOAD_LINK --local-dir ./models/MODEL_NAME

#For example, to download 20B model
huggingface-cli download openai/gpt-oss-20b --local-dir ./models/gpt-oss-20b

#For example, to download 120B model
huggingface-cli download openai/gpt-oss-120b --local-dir ./models/gpt-oss-120b

You can rename MODEL_NAME to whatever you prefer.
You need to use official OpenAI mode link instead of HUGGING_FACE_MODEL_DOWNLOAD_LINK.
Ensure the models/MODEL_NAME directory exists or will be created.

Step 4. Clone PEFT Setup, Upgrade Dependencies, and Run LoRA Script

cd /workspace/

# Clone the repository
git clone https://github.com/AMD-AIG-AIMA/HF_PEFT_GPT_OSS.git
cd HF_PEFT_GPT_OSS

# For MI300X, upgrade required libraries
bash requirements_MI300.sh

# For MI355X, upgrade required libraries
bash requirements_MI355.sh

# Run the LoRA fine-tuning script
bash run_peft_lora_openai.sh

Important:
Edit run_peft_lora_openai.sh before running and set the model_name_or_path variable to the path of your downloaded checkpoint (e.g., models/MODEL_NAME).

By default the script runs on a single node with eight GPUs. Modify the script parameters if you need multi-node, different GPU counts, batch sizes, etc.

Troubleshooting

Docker permission errors: Ensure your user has access to /dev/kfd and is part of the video group if required.
ROCm device not visible: Verify ROCm driver installation and that the container has the necessary devices (/dev/kfd, /dev/dri) mounted.
Hugging Face authentication fails: Confirm your token is valid and has the appropriate read access to the model repository.
Dependency/installation issues: Re-run the appropriate requirements_*.sh and inspect their output for missing packages or version conflicts.

Cleanup

To stop and remove the container:

docker stop YOUR_DOCKER_NAME
docker rm YOUR_DOCKER_NAME

To remove pulled images if needed:

docker image rm rocm/pytorch-training:v25.6
docker image rm rocm/7.0-preview:rocm7.0_preview_pytorch_training_mi35X_alpha

Name		Name	Last commit message	Last commit date
Latest commit History 21 Commits
configs		configs
LICENSE		LICENSE
README.md		README.md
requirements_MI300.sh		requirements_MI300.sh
requirements_MI355.sh		requirements_MI355.sh
run_peft_lora_openai.sh		run_peft_lora_openai.sh
train.py		train.py
utils.py		utils.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

OpenAI MoE (GPT-OSS) Docker Setup for AMD MI300 & MI355

Prerequisites

Step 1. Pull the PyTorch Docker Container

For MI300X

For MI355X

Step 2. Launch / Run the Docker Container

For MI300X

For MI355X

Step 3. Download the 20B MoE Model Checkpoint

Step 4. Clone PEFT Setup, Upgrade Dependencies, and Run LoRA Script

Troubleshooting

Cleanup

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 2

Uh oh!

Languages

License

AMD-AGI/HF_PEFT_GPT_OSS

Folders and files

Latest commit

History

Repository files navigation

OpenAI MoE (GPT-OSS) Docker Setup for AMD MI300 & MI355

Prerequisites

Step 1. Pull the PyTorch Docker Container

For MI300X

For MI355X

Step 2. Launch / Run the Docker Container

For MI300X

For MI355X

Step 3. Download the 20B MoE Model Checkpoint

Step 4. Clone PEFT Setup, Upgrade Dependencies, and Run LoRA Script

Troubleshooting

Cleanup

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 2

Uh oh!

Languages

Packages