Description

Fine-tuning with examples/llm_finetune/finetune.py under the Megatron-FSDP strategy fails in two ways. In one run, the backward pass crashes during gradient accumulation with a size mismatch between param.main_grad and param.grad (16000 vs. 32000 at dimension 0). In a second run, which logs "Using default TP plan for parallelization", setup fails earlier inside megatron_fsdp_fully_shard with an invalid DTensor chunk metadata assertion. Full tracebacks follow.
[rank1]: Traceback (most recent call last):
[rank1]: File "/opt/Automodel/examples/llm_finetune/finetune.py", line 33, in <module>
[rank1]: main()
[rank1]: File "/opt/Automodel/examples/llm_finetune/finetune.py", line 29, in main
[rank1]: recipe.run_train_validation_loop()
[rank1]: File "/opt/Automodel/nemo_automodel/recipes/llm/train_ft.py", line 869, in run_train_validation_loop
[rank1]: reporting_loss, grad_norm, tps, num_tokens_in_batch, num_label_tokens = self._run_train_optim_step(
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/Automodel/nemo_automodel/recipes/llm/train_ft.py", line 974, in _run_train_optim_step
[rank1]: self._forward_backward_step(
[rank1]: File "/opt/Automodel/nemo_automodel/recipes/llm/train_ft.py", line 949, in _forward_backward_step
[rank1]: (local_loss * self._get_dp_group_size()).backward()
[rank1]: File "/opt/venv/lib/python3.12/site-packages/torch/_tensor.py", line 647, in backward
[rank1]: torch.autograd.backward(
[rank1]: File "/opt/venv/lib/python3.12/site-packages/torch/autograd/__init__.py", line 354, in backward
[rank1]: _engine_run_backward(
[rank1]: File "/opt/venv/lib/python3.12/site-packages/torch/autograd/graph.py", line 829, in _engine_run_backward
[rank1]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/venv/lib/python3.12/site-packages/megatron_fsdp/megatron_fsdp.py", line 619, in _root_post_backward
[rank1]: _grad_acc(param)
[rank1]: File "/opt/venv/lib/python3.12/site-packages/megatron_fsdp/megatron_fsdp.py", line 485, in _grad_acc
[rank1]: param.main_grad.add_(to_local_if_dtensor(param.grad))
[rank1]: RuntimeError: The size of tensor a (16000) must match the size of tensor b (32000) at non-singleton dimension 0
[rank0]: Traceback (most recent call last):
[rank0]: File "/opt/Automodel/examples/llm_finetune/finetune.py", line 33, in <module>
[rank0]: main()
[rank0]: File "/opt/Automodel/examples/llm_finetune/finetune.py", line 29, in main
[rank0]: recipe.run_train_validation_loop()
[rank0]: File "/opt/Automodel/nemo_automodel/recipes/llm/train_ft.py", line 869, in run_train_validation_loop
[rank0]: reporting_loss, grad_norm, tps, num_tokens_in_batch, num_label_tokens = self._run_train_optim_step(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/Automodel/nemo_automodel/recipes/llm/train_ft.py", line 974, in _run_train_optim_step
[rank0]: self._forward_backward_step(
[rank0]: File "/opt/Automodel/nemo_automodel/recipes/llm/train_ft.py", line 949, in _forward_backward_step
[rank0]: (local_loss * self._get_dp_group_size()).backward()
[rank0]: File "/opt/venv/lib/python3.12/site-packages/torch/_tensor.py", line 647, in backward
[rank0]: torch.autograd.backward(
[rank0]: File "/opt/venv/lib/python3.12/site-packages/torch/autograd/__init__.py", line 354, in backward
[rank0]: _engine_run_backward(
[rank0]: File "/opt/venv/lib/python3.12/site-packages/torch/autograd/graph.py", line 829, in _engine_run_backward
[rank0]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/venv/lib/python3.12/site-packages/megatron_fsdp/megatron_fsdp.py", line 619, in _root_post_backward
[rank0]: _grad_acc(param)
[rank0]: File "/opt/venv/lib/python3.12/site-packages/megatron_fsdp/megatron_fsdp.py", line 485, in _grad_acc
[rank0]: param.main_grad.add_(to_local_if_dtensor(param.grad))
[rank0]: RuntimeError: The size of tensor a (16000) must match the size of tensor b (32000) at non-singleton dimension 0
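The failing call in megatron_fsdp.py accumulates param.grad into param.main_grad with an in-place add_, which requires both tensors to have identical shapes. The 16000 vs. 32000 sizes look like a half-vocab shard colliding with a full-vocab tensor (16000 = 32000 / 2, i.e. a TP=2 split of the vocabulary dimension), though that reading is only inferred from the numbers. A minimal standalone sketch of the same failure mode, with illustrative shapes:

```python
import torch

# Hypothetical shapes chosen to mirror the sizes in the error message:
# 16000 is exactly half of a 32000-row vocabulary dimension, as it would be
# for a tensor-parallel (TP=2) shard. The hidden size below is arbitrary.
vocab_size, tp_size, hidden = 32000, 2, 8

main_grad = torch.zeros(vocab_size // tp_size, hidden)  # local shard: 16000 x hidden
grad = torch.zeros(vocab_size, hidden)                   # full tensor: 32000 x hidden

# Raises: RuntimeError: The size of tensor a (16000) must match the size of
# tensor b (32000) at non-singleton dimension 0
main_grad.add_(grad)
```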
Using default TP plan for parallelization. It is compatible with huggingface llama3-style models.
[rank1]: Traceback (most recent call last):
[rank1]: File "/opt/Automodel/examples/llm_finetune/finetune.py", line 33, in <module>
[rank1]: main()
[rank1]: File "/opt/Automodel/examples/llm_finetune/finetune.py", line 28, in main
[rank1]: recipe.setup()
[rank1]: File "/opt/Automodel/nemo_automodel/recipes/llm/train_ft.py", line 762, in setup
[rank1]: model, model_state_dict_keys, self.optimizer, self.loss_fn = build_model_and_optimizer(
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/Automodel/nemo_automodel/recipes/llm/train_ft.py", line 233, in build_model_and_optimizer
[rank1]: model, optimizer = model_wrapper.parallelize(model, optimizer)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/Automodel/nemo_automodel/components/distributed/megatron_fsdp.py", line 265, in parallelize
[rank1]: model = megatron_fsdp_strategy_parallelize(
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/Automodel/nemo_automodel/components/distributed/parallelizer.py", line 891, in megatron_fsdp_strategy_parallelize
[rank1]: model, optimizer = megatron_fsdp_fully_shard(
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/venv/lib/python3.12/site-packages/megatron_fsdp/fully_shard.py", line 286, in fully_shard
[rank1]: model = MegatronFSDP(
[rank1]: ^^^^^^^^^^^^^
[rank1]: File "/opt/venv/lib/python3.12/site-packages/megatron_fsdp/megatron_fsdp.py", line 253, in __init__
[rank1]: self._init_fsdp_param_and_grad_buffer()
[rank1]: File "/opt/venv/lib/python3.12/site-packages/megatron_fsdp/megatron_fsdp.py", line 290, in _init_fsdp_param_and_grad_buffer
[rank1]: self.param_and_grad_buffer = ParamAndGradBuffer(
[rank1]: ^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/venv/lib/python3.12/site-packages/megatron_fsdp/param_and_grad_buffer.py", line 1636, in __init__
[rank1]: self._init_distributed_params()
[rank1]: File "/opt/venv/lib/python3.12/site-packages/megatron_fsdp/param_and_grad_buffer.py", line 2297, in _init_distributed_params
[rank1]: dist_param = make_fsdp_dtensor(
[rank1]: ^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/venv/lib/python3.12/site-packages/megatron_fsdp/param_and_grad_buffer.py", line 3717, in make_fsdp_dtensor
[rank1]: validate_uneven_dtensor(fsdp_tensor)
[rank1]: File "/opt/venv/lib/python3.12/site-packages/megatron_fsdp/uneven_dtensor.py", line 160, in validate_uneven_dtensor
[rank1]: assert all(
[rank1]: ^^^^
[rank1]: AssertionError: [Megatron-FSDP] DTensor chunk metadata is invalid. Offsets: (0, 512), Sizes: (256, 512), Global shape: torch.Size([512, 512]), Local shape: torch.Size([256, 512]), Device mesh: DeviceMesh('cuda', [[0, 1]], mesh_dim_names=('dp', 'tp')).
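The assertion comes from Megatron-FSDP's uneven-DTensor validation. Per the printed metadata, the chunk is inconsistent along dimension 1: offset 512 plus local size 512 exceeds the global extent of 512, so a (256, 512) local shard cannot sit at offset (0, 512) inside a (512, 512) tensor. A small sketch of that consistency condition, written only to make the failing check concrete (not the library's actual implementation):

```python
def chunk_fits(offsets, sizes, global_shape):
    """True if the local chunk described by per-dim (offset, size) pairs
    lies entirely inside the tensor's global shape."""
    return all(off + size <= dim
               for off, size, dim in zip(offsets, sizes, global_shape))

# Values taken from the assertion message above.
offsets, sizes, global_shape = (0, 512), (256, 512), (512, 512)

print(chunk_fits(offsets, sizes, global_shape))  # False: dim 1 -> 512 + 512 > 512
```

For a (256, 512) chunk of a (512, 512) tensor, the only valid offsets are (0, 0) and (256, 0); the (0, 512) offset reported on rank 1 is what trips the check.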