This repository provides step-by-step instructions to set up OpenAI Mixture-of-Experts (MoE) OSS model functionality inside a Docker container on AMD MI300 and MI355 GPUs using ROCm and PyTorch.
This demo uses Hugging Face's PEFT LoRA to efficiently fine-tune OpenAI's 20-billion-parameter MoE transformer, which has 24 layers alternating between sliding-window and full attention and 32 experts.
Training runs through Accelerate with FSDP in bfloat16, with gradient checkpointing enabled, on the 200K-example HuggingFaceH4/ultrachat_200k chat dataset.
UltraChat 200k is a heavily filtered, 200 000-example subset of the original UltraChat pool (≈ 1.4 million ChatGPT-generated multi-turn dialogues). To create it, examples were selected for supervised fine-tuning, true-casing was applied to fix capitalization errors, and any assistant replies that merely disclaim opinions or emotions were removed.
Supported Models:
- gpt-oss-120b: for production, general-purpose, high-reasoning use cases that fit into a single MI300 GPU (117B parameters with 5.1B active parameters)
- gpt-oss-20b: for lower-latency, local, or specialized use cases (21B parameters with 3.6B active parameters)
The dataset is stored in Parquet format, and each entry uses the following schema:
- prompt
- prompt_id
- messages: a list of { role, content } pairs
This flowchart illustrates the fine-tuning pipeline for the OpenAI MoE model using UltraChat 200k with the Hugging Face ecosystem.
- Docker installed and working.
- ROCm-compatible system with MI300X or MI355X GPUs.
- Hugging Face account with a valid token.
- Sufficient disk space for model checkpoints (20B model).
- Network access to clone repositories and download from Hugging Face.
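The prerequisites above can be checked before pulling any image. The following is a minimal sketch, not part of the repository: the function name `check_prereqs` and its messages are hypothetical, and it only reports what it finds (Docker on the PATH, the `/dev/kfd` and `/dev/dri` device nodes, and free disk space) without judging whether the space is sufficient.

```shell
# Hypothetical preflight helper; not part of the repository.
# Reports which prerequisites appear to be present and returns
# the number of missing items as its exit status.
check_prereqs() {
    target_dir="${1:-.}"   # directory where checkpoints will be stored
    missing=0

    # The Docker CLI must be on PATH.
    if command -v docker >/dev/null 2>&1; then
        echo "docker: found"
    else
        echo "docker: MISSING"
        missing=$((missing + 1))
    fi

    # ROCm exposes the GPUs through these device nodes.
    for dev in /dev/kfd /dev/dri; do
        if [ -e "$dev" ]; then
            echo "$dev: found"
        else
            echo "$dev: MISSING"
            missing=$((missing + 1))
        fi
    done

    # Print free space so it can be compared with the checkpoint size.
    echo "free space in $target_dir: $(df -Ph "$target_dir" | awk 'NR==2 {print $4}')"

    return "$missing"
}

check_prereqs . || echo "Some prerequisites are missing; see the lines marked MISSING."
```

Pass the host directory you plan to mount at /workspace/ as the argument so the reported free space refers to the right filesystem.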
# For MI300X
docker pull rocm/pytorch-training:v25.6
# For MI355X
docker pull rocm/7.0-preview:rocm7.0_preview_pytorch_training_mi35X_alpha
Replace "/home/USERNAME/" with the host path you want to mount into the container at "/workspace/".
Replace YOUR_DOCKER_NAME with a name of your choice.
# For MI300X
docker run -it \
--device /dev/dri \
--device /dev/kfd \
--network host \
--ipc host \
--group-add video \
--cap-add SYS_PTRACE \
--security-opt seccomp=unconfined \
--privileged \
-v /home/USERNAME/:/workspace/ \
--name YOUR_DOCKER_NAME \
rocm/pytorch-training:v25.6
# For MI355X
docker run -it \
--device /dev/dri \
--device /dev/kfd \
--network host \
--ipc host \
--group-add video \
--cap-add SYS_PTRACE \
--security-opt seccomp=unconfined \
--privileged \
-v /home/USERNAME/:/workspace/ \
--name YOUR_DOCKER_NAME \
rocm/7.0-preview:rocm7.0_preview_pytorch_training_mi35X_alpha
All commands below apply to both MI300X and MI355X.
⚠️ Note on Transformers: Built-in support for OpenAI Mixture-of-Experts (MoE) models is available in the Transformers library starting with version 4.55.0.dev0.
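To see whether the Transformers version inside the container meets the requirement in the note above, a check along these lines may help. This is a sketch under stated assumptions: both helper names are hypothetical, `python3` is assumed to be the interpreter in the container, and only the MAJOR.MINOR part of the version is compared, so "4.55.0.dev0" is treated as 4.55.

```shell
# Hypothetical helpers; neither function is part of the repository.

# Print the installed Transformers version, or "not installed".
transformers_version() {
    python3 -c 'import transformers; print(transformers.__version__)' 2>/dev/null \
        || echo "not installed"
}

# Succeed if the given version string is 4.55 or newer.
# Only MAJOR.MINOR is compared; suffixes such as ".0.dev0" are ignored.
supports_gpt_oss() {
    version="$1"
    major="${version%%.*}"
    rest="${version#*.}"
    minor="${rest%%.*}"
    case "$major$minor" in
        ''|*[!0-9]*) return 1 ;;   # not a numeric MAJOR.MINOR prefix
    esac
    [ "$major" -gt 4 ] || { [ "$major" -eq 4 ] && [ "$minor" -ge 55 ]; }
}

installed="$(transformers_version)"
if supports_gpt_oss "$installed"; then
    echo "transformers $installed: built-in MoE support expected"
else
    echo "transformers $installed: upgrade needed"
fi
```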
# Log in to Hugging Face
huggingface-cli login
# (Enter your HF token when prompted)
# Download the checkpoint
huggingface-cli download HUGGING_FACE_MODEL_DOWNLOAD_LINK --local-dir ./models/MODEL_NAME
# For example, to download the 20B model
huggingface-cli download openai/gpt-oss-20b --local-dir ./models/gpt-oss-20b
# For example, to download the 120B model
huggingface-cli download openai/gpt-oss-120b --local-dir ./models/gpt-oss-120b
- You can rename MODEL_NAME to whatever you prefer.
- Use the official OpenAI model link in place of HUGGING_FACE_MODEL_DOWNLOAD_LINK.
- Ensure the models/MODEL_NAME directory exists or will be created.
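After the download finishes, a quick sanity check of the target directory can catch an interrupted or misdirected download. The sketch below is not part of the repository: `check_checkpoint` is a hypothetical helper, and it assumes the usual Hugging Face layout with a `config.json` at the top level of the model directory.

```shell
# Hypothetical helper; assumes a config.json at the top level of the checkpoint.
check_checkpoint() {
    model_dir="$1"
    if [ ! -d "$model_dir" ]; then
        echo "missing directory: $model_dir"
        return 1
    fi
    if [ ! -f "$model_dir/config.json" ]; then
        echo "no config.json in $model_dir (download may be incomplete)"
        return 1
    fi
    # Report the on-disk size so it can be compared with expectations.
    echo "ok: $model_dir ($(du -sh "$model_dir" | cut -f1))"
}

check_checkpoint ./models/gpt-oss-20b || true
```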
cd /workspace/
# Clone the repository
git clone https://github.com/AMD-AIG-AIMA/HF_PEFT_GPT_OSS.git
cd HF_PEFT_GPT_OSS
# For MI300X, upgrade required libraries
bash requirements_MI300.sh
# For MI355X, upgrade required libraries
bash requirements_MI355.sh
# Run the LoRA fine-tuning script
bash run_peft_lora_openai.sh
Important: Edit run_peft_lora_openai.sh before running it and set the model_name_or_path variable to the path of your downloaded checkpoint (e.g., models/MODEL_NAME).
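The exact form of the `model_name_or_path` setting inside the script is not shown here, so the sketch below only locates it rather than rewriting it. The helper name `show_model_path_setting` is hypothetical; it prints each matching line with its line number so you know where to edit.

```shell
# Hypothetical helper: print the line(s) that set model_name_or_path.
show_model_path_setting() {
    script="${1:-run_peft_lora_openai.sh}"
    grep -n 'model_name_or_path' "$script" 2>/dev/null \
        || echo "model_name_or_path not found in $script"
}

show_model_path_setting run_peft_lora_openai.sh
```

Run it from the HF_PEFT_GPT_OSS directory, then open the script in an editor and replace the value on the reported line with your checkpoint path.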
By default, the script runs on a single node with eight GPUs. Modify the script parameters if you need multi-node training, a different GPU count, a different batch size, and so on.
- Docker permission errors: Ensure your user has access to /dev/kfd and is part of the video group if required.
- ROCm device not visible: Verify the ROCm driver installation and that the container has the necessary devices (/dev/kfd, /dev/dri) mounted.
- Hugging Face authentication fails: Confirm your token is valid and has the appropriate read access to the model repository.
- Dependency/installation issues: Re-run the appropriate requirements_*.sh script and inspect its output for missing packages or version conflicts.
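The first three checks above can be gathered into a few diagnostic commands. This is a rough sketch rather than an official procedure: `in_group` is a hypothetical helper, and the script only reports what it finds without changing anything.

```shell
# Hypothetical helper: succeed if the current user belongs to the named group.
in_group() {
    id -nG | tr ' ' '\n' | grep -qxF "$1"
}

if in_group video; then
    echo "user is in the video group"
else
    echo "user is NOT in the video group"
fi

# Show ownership and permissions of the GPU device nodes, if present.
ls -ld /dev/kfd /dev/dri 2>/dev/null || echo "one or both GPU device nodes are missing"

# Show which Hugging Face account the stored token belongs to.
if command -v huggingface-cli >/dev/null 2>&1; then
    huggingface-cli whoami || echo "not logged in to Hugging Face"
else
    echo "huggingface-cli not installed"
fi
```

Run it on the host to check group membership and device permissions, and inside the container to confirm that the devices were passed through.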
To stop and remove the container:
docker stop YOUR_DOCKER_NAME
docker rm YOUR_DOCKER_NAME
To remove pulled images if needed:
docker image rm rocm/pytorch-training:v25.6
docker image rm rocm/7.0-preview:rocm7.0_preview_pytorch_training_mi35X_alpha