We previously used run.Partial / run.Config in a Kubernetes environment with Volcano, where each launched pod runs the following (8 GPUs total across 2 nodes, i.e. 4 GPUs per pod):
run.sh
#!/bin/bash
set -ex
torchrun \
--standalone \
--nnodes=1 \
--nproc_per_node=4 \
recipe.py \
--factory 'customizer_recipe()' \
--yes \
--yaml config.yaml \
-v
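Note that --standalone together with --nnodes=1 gives each pod its own local rendezvous, so the two pods run as separate 4-GPU jobs. If the launcher-provided rank information is meant to join them into a single 8-GPU job, a rough sketch of the same entry point would look like the following (this assumes the Kubernetes launcher also exposes MASTER_ADDR / MASTER_PORT and that WORLD_SIZE counts pods, both of which are assumptions, not part of the original setup):
#!/bin/bash
# Hypothetical multi-node variant, not from the original setup: forward the
# launcher-provided rank info so both pods join one 2x4-GPU job.
set -ex
torchrun \
--nnodes="${WORLD_SIZE:-2}" \
--node_rank="${RANK:-0}" \
--nproc_per_node=4 \
--rdzv_backend=c10d \
--rdzv_endpoint="${MASTER_ADDR}:${MASTER_PORT:-29500}" \
recipe.py \
--factory 'customizer_recipe()' \
--yes \
--yaml config.yaml \
-v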
config.yaml
data:
  dataset_root: /mount/models/data/
  seq_length: 4096
  global_batch_size: 8
  micro_batch_size: 1
  dataset_kwargs:
    prompt_template: '{prompt} {completion}'
    label_key: completion
    truncation_field: prompt
trainer:
  accelerator: gpu
  max_epochs: 3
  max_steps: 30
  limit_val_batches: 1.0
  log_every_n_steps: 10
  val_check_interval: 10
  strategy:
    tensor_model_parallel_size: 4
    pipeline_model_parallel_size: 1
    context_parallel_size: 1
    ckpt_async_save: false
  plugins:
    precision: bf16-mixed
log:
  ckpt:
    save_last: link
    save_top_k: 1
    train_time_interval: null
optim:
  lr_scheduler:
    warmup_steps: 50
  config:
    lr: 0.0001
resume:
  restore_config:
    path: /mount/models/llama-3_3-70b-instruct_v0.0.1
peft:
  dim: 8
  alpha: 16
  dropout: 0.1
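For reference, the overrides passed via --yaml are merged onto the recipe returned by the factory; a rough, illustrative Python equivalent of a few of them (not how the nemo_run CLI actually performs the merge) looks like this:
# Illustrative only: the --yaml merge roughly corresponds to setting the same
# dotted attributes on the recipe returned by customizer_recipe().
recipe = customizer_recipe()
recipe.data.dataset_root = "/mount/models/data/"
recipe.data.global_batch_size = 8
recipe.data.micro_batch_size = 1
recipe.trainer.strategy.tensor_model_parallel_size = 4
recipe.trainer.strategy.ckpt_async_save = False
recipe.optim.config.lr = 1e-4
recipe.peft.dim = 8
recipe.peft.alpha = 16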
recipe.py
import nemo_run as run
from nemo.collections import llm
from megatron.core.dist_checkpointing.validation import StrictHandling


def customizer_recipe() -> run.Partial:
    # Base fine-tuning recipe with LoRA enabled (peft_scheme="lora").
    recipe = llm.hf_auto_model_for_causal_lm.finetune_recipe(
        dir="/app/output",
        name="cust-llama33-70b",
        num_nodes=1,
        num_gpus_per_node=4,
        peft_scheme="lora",
    )
    recipe.data = run.Config(
        llm.FineTuningDataModule,
        seq_length=4096,
        dataset_kwargs=run.Config(dict),
    )
    # Log (rather than error on) mismatched keys when restoring the distributed checkpoint.
    recipe.trainer.strategy.ckpt_load_strictness = StrictHandling.LOG_ALL
    recipe.log.extra_loggers = []
    recipe.log.tensorboard = None
    return recipe


def run_finetuning():
    # Expose the recipe through nemo_run's CLI; torchrun invokes this script on every rank.
    run.cli.main(llm.finetune, default_factory=customizer_recipe)


if __name__ == "__main__":
    run_finetuning()
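For completeness, torchrun works as the insertion point here because run.cli.main appears to build the recipe and execute it in the current process on each rank. A minimal sketch of an entry point that keeps that property without the CLI layer, assuming run.run(..., direct=True) still executes the Partial in-process in the new repository (the module name recipe is also an assumption):
# Hypothetical alternative entry point (assumption: run.run(..., direct=True)
# executes the Partial in the current process, i.e. inside each torchrun rank).
import nemo_run as run
from recipe import customizer_recipe

if __name__ == "__main__":
    run.run(customizer_recipe(), direct=True)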
How do we migrate to the new repository while keeping the same torchrun insertion point? WORLD_SIZE / RANK are set by the Kubernetes launcher.