Full-stack infrastructure software, from PyTorch to GPUs, for the LLM era.
Decouple AI infrastructure from specific hardware vendors.
Virtualize all GPUs/NPUs in a cluster for higher utilization and failover.
Scale to thousands of GPUs/NPUs with automatic parallelization and optimization.
Support any multi-billion or multi-trillion parameter model for training and serving.
Designed to unlock the full potential of your AI infrastructure!
The moai-examples repository is designed to work with a cluster where the MoAI Platform is installed.
To test these scripts, please contact us.
The optimized MAF version, Torch version, and flavor for each model are as follows:
Model | MAF Version | Torch Version | Python Version | Flavor | Train Batch | Eval Batch | Name in {model} |
---|---|---|---|---|---|---|---|
huggyllama/llama-30b | 25.5.3005 | 2.1.0 | 3.10 | flavor-default-16 | 32 | 16 | llama_30b |
meta-llama/Llama-2-7b-chat-hf | 25.5.3005 | 2.1.0 | 3.10 | flavor-default-8 | 64 | 32 | llama2_7b |
meta-llama/Llama-2-13b-chat-hf | 25.5.3005 | 2.1.0 | 3.10 | flavor-default-8 | 32 | 8 | llama2_13b |
meta-llama/Llama-2-70b-hf | 25.5.3005 | 2.1.0 | 3.10 | flavor-default-32 | 256 | 64 | llama2_70b |
meta-llama/Meta-Llama-3-8B-Instruct | 25.5.3005 | 2.1.0 | 3.10 | flavor-default-8 | 64 | 32 | llama3_8b |
meta-llama/Meta-Llama-3-70B-Instruct | 25.5.3005 | 2.1.0 | 3.10 | flavor-default-32 | 256 | 64 | llama3_70b |
meta-llama/Meta-Llama-3-70B-Instruct (with LoRA) | 25.5.3005 | 2.1.0 | 3.10 | flavor-default-8 | 16 | 16 | llama3_70b_lora |
Qwen/Qwen2-72B-Instruct | 25.5.3005 | 2.1.0 | 3.10 | flavor-default-32 | 32 | 32 | qwen2_72b |
You can set up the fine-tuning environment with the commands below.
cd moai-examples/
uv sync
Note: This repository uses version 25.5.3005 for set A (recommended).
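As a quick sanity check, you can confirm that the synced environment reports the Torch version listed in the table above (a minimal sketch; assumes `uv sync` completed successfully):

```bash
# Print the Torch version from the synced environment; the table above expects 2.1.0.
uv run python -c "import torch; print(torch.__version__)"
```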
To fine-tune the model, run the training script as follows:
cd finetuning_codes
bash scripts/train_{model}.sh
By specifying one of the model names listed in the `Name in {model}` column above, you can run the other examples as well. For example, the `llama3_8b` script looks like the following:
#!/bin/bash

export TOKENIZERS_PARALLELISM=false
export TRANSFORMERS_VERBOSITY=info
# Select the accelerator flavor for this run (see the table above).
export ACCELERATOR_PLATFORM_FLAVOR=flavor-default-8

uv run accelerate launch \
    --config_file config.yaml \
    train.py \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --dataset bitext/Bitext-customer-support-llm-chatbot-training-dataset \
    --lr 0.00001 \
    --train-batch-size 64 \
    --eval-batch-size 32 \
    --block-size 1024 \
    --num-epochs 1 \
    --max-steps -1 \
    --log-interval 20 \
    --save-path $SAVE_DIR  # SAVE_DIR must point to the checkpoint output directory
The script above assumes execution from the `moai-examples/finetuning_codes` directory.
Please contact the owner of the MoAI Platform you wish to use for instructions on how to create an endpoint.
Supported Models | Model Max Length | TP Size (MI250) | TP Size (MI308x) |
---|---|---|---|
huggyllama/llama-30b | 2048 | 1 | 1 |
meta-llama/Llama-2-7b-chat-hf | 4096 | 1 | 1 |
meta-llama/Llama-2-13b-chat-hf | 4096 | 1 | 1 |
meta-llama/Llama-2-70b-hf | 4096 | 4 | 2 |
meta-llama/Meta-Llama-3-8B-Instruct | 8192 | 1 | 1 |
meta-llama/Meta-Llama-3-70B-Instruct | 8192 | 4 | 2 |
Qwen/Qwen2-72B-Instruct | 32768 | 4 | 2 |
Qwen/QwQ-32B | 40960 | 4 | 2 |
deepseek-ai/DeepSeek-R1 | 163840 | - | 8 |
The following is the command to create an endpoint. Most models use the same command, but some may require additional settings. The command includes several environment variables:

- `$MODEL`: the name or path of the model
- `$MODEL_MAX_LEN`: the maximum context length supported by the model
- `$TP_SIZE`: the tensor parallel size, which must match the number of GPUs used

Please note that each model may require different values for these variables. Be sure to adjust them according to the model you are using.
vllm serve $MODEL \
    --max-model-len $MODEL_MAX_LEN \
    --trust-remote-code \
    --tensor-parallel-size $TP_SIZE \
    --gpu-memory-utilization 0.95 \
    --quantization None \
    --block-size 16 \
    --max-num-batched-tokens $MODEL_MAX_LEN \
    --enable-chunked-prefill False
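For example, to serve `meta-llama/Meta-Llama-3-8B-Instruct`, the variables would be set from the table above as follows (a sketch; the `export` style is illustrative):

```bash
# Values taken from the supported-models table above.
export MODEL=meta-llama/Meta-Llama-3-8B-Instruct
export MODEL_MAX_LEN=8192   # model max length for Llama 3 8B
export TP_SIZE=1            # TP size is 1 on both MI250 and MI308x

vllm serve $MODEL \
    --max-model-len $MODEL_MAX_LEN \
    --trust-remote-code \
    --tensor-parallel-size $TP_SIZE \
    --gpu-memory-utilization 0.95 \
    --quantization None \
    --block-size 16 \
    --max-num-batched-tokens $MODEL_MAX_LEN \
    --enable-chunked-prefill False
```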
Commands for these specific models are described separately below:
`deepseek-ai/DeepSeek-R1`: some arguments are unnecessary for DeepSeek.
vllm serve deepseek-ai/DeepSeek-R1 \
    --max-model-len 163840 \
    --trust-remote-code \
    --tensor-parallel-size 8 \
    --gpu-memory-utilization 0.95 \
    --quantization None
The following is an example of a request command using `curl`. This format is compatible with the OpenAI Chat Completions API.
# Note: the JSON body is single-quoted, so $MODEL is spliced in with '"$MODEL"';
# alternatively, replace it with the literal model name.
curl $ENDPOINT_URL/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "'"$MODEL"'",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"}
    ],
    "max_tokens": 200
}'
If the request is sent correctly, you should receive a response similar to the one below.
Note that the exact response content may vary depending on the model and its configuration.
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1744950029,
  "model": "$MODEL",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": null,
        "content": " The winning team of the 2020 World Series was the Los Angeles Dodgers. They beat the Tampa Bay Rays in the seven-game series, which was held in stringent COVID-19 protocols to ensure the safety of players, staff, and fans.",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 35,
    "total_tokens": 91,
    "completion_tokens": 56,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null
}
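To extract just the assistant's reply from the response, you can pipe the result through `jq` (a minimal sketch; assumes `jq` is installed and `$ENDPOINT_URL`/`$MODEL` are set as above):

```bash
# Print only the assistant's message content from the completion response.
curl -s $ENDPOINT_URL/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "'"$MODEL"'",
    "messages": [{"role": "user", "content": "Who won the world series in 2020?"}],
    "max_tokens": 200
}' | jq -r '.choices[0].message.content'
```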
You can check the current MoAI version and flavor through `ap-smi`.
ap-smi
+-----------------------------------------------------------------------------+
| Accelerator platform Version : v0.0.30 |
| MoAI Framework Version : 25.5.3005 Latest Version : 25.5.3005 |
+---------+---------+------------------------+----------------+---------------+
| # | NAME | FLAVOR | MEM USAGE(%) | GPU UTIL(%) |
+---------+---------+------------------------+----------------+---------------+
| 0 | 1779 | flavor-default-8 | 43 | 100 |
+---------+---------+------------------------+----------------+---------------+
The structure of the entire repository is as follows:
moai-examples
├── README.md          # Project overview and instructions
├── checkpoints        # Directory to store model checkpoints during fine-tuning
├── finetuning_codes   # Code related to model fine-tuning
├── git-hooks          # Git hooks for code formatting and other pre-/post-commit tasks
├── pyproject.toml     # Project metadata
└── uv.lock            # Lockfile with exact information about the project's dependencies
The `finetuning_codes` directory contains the training code, model configs, and scripts necessary for fine-tuning.
finetuning_codes
├── config.yaml   # Config file for accelerate
├── logs          # Directory for training logs
├── scripts       # Directory containing shell scripts for different fine-tuning setups
├── train.py      # Main Python script for initiating the fine-tuning process
└── utils.py      # Utility functions for train.py
Section | Description |
---|---|
Portal | Overview of technologies and company |
ModelHub | Chatbot using the MoAI Platform solution |
To train or run inference with ultralytics YOLO models on the MoAI Platform, which uses AMD accelerators, follow the steps below:
# Be sure to use the ROCm version of torch/torchvision
pip3 install torch torchvision --index-url https://download.pytorch.org/whl/rocm6.3
pip3 install ultralytics
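After installation, the `yolo` CLI bundled with the ultralytics package can be used for inference and training. A minimal sketch (the checkpoint, image path, and epoch count are illustrative placeholders):

```bash
# Run inference with a pretrained checkpoint on a local image (placeholder path).
yolo predict model=yolov8n.pt source=path/to/image.jpg

# Fine-tune the same checkpoint on the coco8 example dataset bundled with ultralytics.
yolo train model=yolov8n.pt data=coco8.yaml epochs=3
```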