Migration script from Fiddle Config/Partial #419

@soluwalana

We previously used run.Partial / run.Config in a Kubernetes environment using Volcano, where each launched pod runs the following:

8 GPUs total, 2 nodes

run.sh

#!/bin/bash

set -ex

torchrun \
    --standalone \
    --nnodes=1 \
    --nproc_per_node=4 \
    recipe.py \
    --factory 'customizer_recipe()' \
    --yes \
    --yaml config.yaml \
    -v

config.yaml

data:
  dataset_root: /mount/models/data/
  seq_length: 4096
  global_batch_size: 8
  micro_batch_size: 1
  dataset_kwargs:
    prompt_template: '{prompt} {completion}'
    label_key: completion
    truncation_field: prompt
trainer:
  accelerator: gpu
  max_epochs: 3
  max_steps: 30
  limit_val_batches: 1.0
  log_every_n_steps: 10
  val_check_interval: 10
  strategy:
    tensor_model_parallel_size: 4
    pipeline_model_parallel_size: 1
    context_parallel_size: 1
    ckpt_async_save: false
  plugins:
    precision: bf16-mixed
log:
  ckpt:
    save_last: link
    save_top_k: 1
    train_time_interval: null
optim:
  lr_scheduler:
    warmup_steps: 50
  config:
    lr: 0.0001
resume:
  restore_config:
    path: /mount/models/llama-3_3-70b-instruct_v0.0.1
peft:
  dim: 8
  alpha: 16
  dropout: 0.1
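
For reference, each top-level YAML key maps onto the matching attribute path of the recipe object that customizer_recipe() builds, so the same overrides could be set programmatically instead. A minimal sketch (apply_static_overrides is a hypothetical helper; values copied from the YAML above):

def apply_static_overrides(recipe):
    # Sketch only: each dotted YAML path maps 1:1 to an attribute path
    # on the recipe, e.g. optim.config.lr -> recipe.optim.config.lr.
    recipe.trainer.strategy.tensor_model_parallel_size = 4
    recipe.optim.lr_scheduler.warmup_steps = 50
    recipe.optim.config.lr = 0.0001
    recipe.peft.dim = 8
    recipe.peft.alpha = 16
    return recipe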

recipe.py

import nemo_run as run

from nemo.collections import llm
from megatron.core.dist_checkpointing.validation import StrictHandling


def customizer_recipe() -> run.Partial:
    recipe = llm.hf_auto_model_for_causal_lm.finetune_recipe(
        dir="/app/output",
        name="cust-llama33-70b",
        num_nodes=1,
        num_gpus_per_node=4,
        peft_scheme="lora",
    )

    recipe.data = run.Config(
        llm.FineTuningDataModule,
        seq_length=4096,
        dataset_kwargs=run.Config(dict),  # empty dict config; populated from config.yaml
    )
    # Log (rather than fail on) checkpoint key mismatches when restoring.
    recipe.trainer.strategy.ckpt_load_strictness = StrictHandling.LOG_ALL
    # Disable extra loggers and TensorBoard.
    recipe.log.extra_loggers = []
    recipe.log.tensorboard = None
    return recipe


def run_finetuning():
    run.cli.main(llm.finetune, default_factory=customizer_recipe)


if __name__ == "__main__":
    run_finetuning()
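
For local debugging, the same recipe can also be built and executed without the CLI layer. A minimal sketch, assuming run.run(..., direct=True) in your installed nemo_run version executes the task in the current process (worth verifying against your version):

# Hypothetical debugging entrypoint, not part of the original script.
import nemo_run as run

if __name__ == "__main__":
    recipe = customizer_recipe()
    # direct=True runs the task in-process instead of submitting it
    # through an executor.
    run.run(recipe, direct=True)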

How do we migrate to the new repository with the same torchrun insertion point?

WORLD_SIZE / RANK are being set by the Kubernetes launcher.
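
A quick way to confirm what each worker actually sees (a standalone sketch; it assumes the launcher, or torchrun, also exports MASTER_ADDR / MASTER_PORT alongside RANK / WORLD_SIZE):

import torch.distributed as dist

# With init_method="env://", torch.distributed reads RANK / WORLD_SIZE
# (plus MASTER_ADDR / MASTER_PORT) straight from the environment, so the
# values set by the Kubernetes launcher or torchrun are picked up as-is.
dist.init_process_group(backend="nccl", init_method="env://")
print(f"rank {dist.get_rank()} / world size {dist.get_world_size()}")
dist.destroy_process_group()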
