We previously used run.Partial / run.Config in a Kubernetes environment with Volcano, where each launched pod runs the following (8 GPUs total across 2 nodes, i.e. 4 GPUs per pod):
run.sh
#!/bin/bash
set -ex
torchrun \
--standalone \
--nnodes=1 \
--nproc_per_node=4 \
recipe.py \
--factory 'customizer_recipe()' \
--yes \
--yaml config.yaml \
-v
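Note that --standalone together with --nnodes=1 gives each pod its own local rendezvous, so the two pods run as separate 4-GPU jobs. If the launcher-provided rank information is meant to join them into a single 8-GPU job, a rough sketch of the same entry point would look like the following (this assumes the Kubernetes launcher also exposes MASTER_ADDR / MASTER_PORT and that WORLD_SIZE counts pods, both of which are assumptions, not part of the original setup):
#!/bin/bash
# Hypothetical multi-node variant, not from the original setup: forward the
# launcher-provided rank info so both pods join one 2x4-GPU job.
set -ex
torchrun \
--nnodes="${WORLD_SIZE:-2}" \
--node_rank="${RANK:-0}" \
--nproc_per_node=4 \
--rdzv_backend=c10d \
--rdzv_endpoint="${MASTER_ADDR}:${MASTER_PORT:-29500}" \
recipe.py \
--factory 'customizer_recipe()' \
--yes \
--yaml config.yaml \
-v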
config.yaml
data:
  dataset_root: /mount/models/data/
  seq_length: 4096
  global_batch_size: 8
  micro_batch_size: 1
  dataset_kwargs:
    prompt_template: '{prompt} {completion}'
    label_key: completion
    truncation_field: prompt
trainer:
  accelerator: gpu
  max_epochs: 3
  max_steps: 30
  limit_val_batches: 1.0
  log_every_n_steps: 10
  val_check_interval: 10
  strategy:
    tensor_model_parallel_size: 4
    pipeline_model_parallel_size: 1
    context_parallel_size: 1
    ckpt_async_save: false
  plugins:
    precision: bf16-mixed
log:
  ckpt:
    save_last: link
    save_top_k: 1
    train_time_interval: null
optim:
  lr_scheduler:
    warmup_steps: 50
  config:
    lr: 0.0001
resume:
  restore_config:
    path: /mount/models/llama-3_3-70b-instruct_v0.0.1
peft:
  dim: 8
  alpha: 16
  dropout: 0.1
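For reference, the overrides passed via --yaml are merged onto the recipe returned by the factory; a rough, illustrative Python equivalent of a few of them (not how the nemo_run CLI actually performs the merge) looks like this:
# Illustrative only: the --yaml merge roughly corresponds to setting the same
# dotted attributes on the recipe returned by customizer_recipe().
recipe = customizer_recipe()
recipe.data.dataset_root = "/mount/models/data/"
recipe.data.global_batch_size = 8
recipe.data.micro_batch_size = 1
recipe.trainer.strategy.tensor_model_parallel_size = 4
recipe.trainer.strategy.ckpt_async_save = False
recipe.optim.config.lr = 1e-4
recipe.peft.dim = 8
recipe.peft.alpha = 16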
recipe.py
import nemo_run as run
from nemo.collections import llm
from megatron.core.dist_checkpointing.validation import StrictHandling


def customizer_recipe() -> run.Partial:
    # Base fine-tuning recipe with LoRA enabled (peft_scheme="lora").
    recipe = llm.hf_auto_model_for_causal_lm.finetune_recipe(
        dir="/app/output",
        name="cust-llama33-70b",
        num_nodes=1,
        num_gpus_per_node=4,
        peft_scheme="lora",
    )
    recipe.data = run.Config(
        llm.FineTuningDataModule,
        seq_length=4096,
        dataset_kwargs=run.Config(dict),
    )
    # Log (rather than error on) mismatched keys when restoring the distributed checkpoint.
    recipe.trainer.strategy.ckpt_load_strictness = StrictHandling.LOG_ALL
    recipe.log.extra_loggers = []
    recipe.log.tensorboard = None
    return recipe


def run_finetuning():
    # Expose the recipe through nemo_run's CLI; torchrun invokes this script on every rank.
    run.cli.main(llm.finetune, default_factory=customizer_recipe)


if __name__ == "__main__":
    run_finetuning()
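For completeness, torchrun works as the insertion point here because run.cli.main appears to build the recipe and execute it in the current process on each rank. A minimal sketch of an entry point that keeps that property without the CLI layer, assuming run.run(..., direct=True) still executes the Partial in-process in the new repository (the module name recipe is also an assumption):
# Hypothetical alternative entry point (assumption: run.run(..., direct=True)
# executes the Partial in the current process, i.e. inside each torchrun rank).
import nemo_run as run
from recipe import customizer_recipe

if __name__ == "__main__":
    run.run(customizer_recipe(), direct=True)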
How do we migrate to the new repository while keeping the same torchrun insertion point? WORLD_SIZE / RANK are set by the Kubernetes launcher.