
DeepSpeed not supported? RuntimeError: Found dtype Float but expected Half #198

@Frei-2

Description


I noticed that train.py already contains code for training the model with DeepSpeed, but when I set the strategy to "deepspeed" I get an error: RuntimeError: Found dtype Float but expected Half. @zqevans
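For context, this is roughly how I am configuring the Trainer (a simplified sketch of my setup, not the exact train.py code; only the strategy and precision settings should matter here):

```python
import pytorch_lightning as pl

# Illustrative sketch of my Trainer setup (not the exact train.py code).
# Switching the strategy to "deepspeed" with fp16 precision is what
# triggers the error below.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,                # matches GLOBAL_RANK: 0, MEMBER: 1/2 in the log
    strategy="deepspeed",
    precision="16-mixed",     # DeepSpeed then casts model and inputs to float16
    max_epochs=1,
)
# trainer.fit(training_wrapper, train_dl, val_dl)
```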

Thanks in advance for any advice or experience on this issue.

The details are as follows.

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/pytorch_lightning/trainer/configuration_validator.py:74: You defined a `validation_step` but have no `val_dataloader`. Skipping val loop.
[rank: 0] Seed set to 42
initializing deepspeed distributed: GLOBAL_RANK: 0, MEMBER: 1/2
Enabling DeepSpeed FP16. Model parameters and inputs will be cast to `float16`.
You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [5,6]
  | Name      | Type                             | Params
---------------------------------------------------------------
0 | diffusion | ConditionedDiffusionModelWrapper | 1.2 B
1 | losses    | MultiLoss                        | 0
---------------------------------------------------------------
1.1 B     Trainable params
156 M     Non-trainable params
1.2 B     Total params
4,853.350 Total estimated model params size (MB)
Epoch 0:   0%|                                                                                                                                                         | 0/31 [00:00<?, ?it/s]
/home/yiminc/projects/MG/stable-audio-tools/stable_audio_tools/models/conditioners.py:362: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast(dtype=torch.float16) and torch.set_grad_enabled(self.enable_grad):
/home/yiminc/projects/MG/stable-audio-tools/stable_audio_tools/training/diffusion.py:368: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.

Epoch 0:   0%|                                                                                                                                                         | 0/31 [00:00<?, ?it/s]RuntimeError: Found dtype Float but expected Half
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/yiminc/projects/MG/stable-audio-tools/./train_ds.py", line 169, in <module>
[rank0]:     main()
[rank0]:   File "/home/yiminc/projects/MG/stable-audio-tools/./train_ds.py", line 166, in main
[rank0]:     trainer.fit(training_wrapper, train_dl, val_dl, ckpt_path=args.ckpt_path if args.ckpt_path else None)
[rank0]:   File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 545, in fit
[rank0]:     call._call_and_handle_interrupt(
[rank0]:   File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
[rank0]:     return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank0]:   File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 102, in launch
[rank0]:     return function(*args, **kwargs)
[rank0]:   File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 581, in _fit_impl
[rank0]:     self._run(model, ckpt_path=ckpt_path)
[rank0]:   File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 990, in _run
[rank0]:     results = self._run_stage()
[rank0]:   File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1036, in _run_stage
[rank0]:     self.fit_loop.run()
[rank0]:   File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 202, in run
[rank0]:     self.advance()
[rank0]:   File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 359, in advance
[rank0]:     self.epoch_loop.run(self._data_fetcher)
[rank0]:   File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 136, in run
[rank0]:     self.advance(data_fetcher)
[rank0]:   File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 240, in advance
[rank0]:     batch_output = self.automatic_optimization.run(trainer.optimizers[0], batch_idx, kwargs)
[rank0]:   File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 187, in run
[rank0]:     self._optimizer_step(batch_idx, closure)
[rank0]:   File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 265, in _optimizer_step
[rank0]:     call._call_lightning_module_hook(
[rank0]:   File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 157, in _call_lightning_module_hook
[rank0]:     output = fn(*args, **kwargs)
[rank0]:   File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/pytorch_lightning/core/module.py", line 1282, in optimizer_step
[rank0]:     optimizer.step(closure=optimizer_closure)
[rank0]:   File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/pytorch_lightning/core/optimizer.py", line 151, in step
[rank0]:     step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs)
[rank0]:   File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp.py", line 263, in optimizer_step
[rank0]:     optimizer_output = super().optimizer_step(optimizer, closure, model, **kwargs)
[rank0]:   File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 230, in optimizer_step
[rank0]:     return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, **kwargs)
[rank0]:   File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/deepspeed.py", line 123, in optimizer_step
[rank0]:     closure_result = closure()
[rank0]:   File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 140, in __call__
[rank0]:     self._result = self.closure(*args, **kwargs)
[rank0]:   File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 135, in closure
[rank0]:     self._backward_fn(step_output.closure_loss)
[rank0]:   File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 236, in backward_fn
[rank0]:     call._call_strategy_hook(self.trainer, "backward", loss, optimizer)
[rank0]:   File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 309, in _call_strategy_hook
[rank0]:     output = fn(*args, **kwargs)
[rank0]:   File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 204, in backward
[rank0]:     self.precision_plugin.backward(closure_loss, self.lightning_module, optimizer, *args, **kwargs)
[rank0]:   File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/deepspeed.py", line 112, in backward
[rank0]:     deepspeed_engine.backward(tensor, *args, **kwargs)
[rank0]:   File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 20, in wrapped_fn
[rank0]:     ret_val = func(*args, **kwargs)
[rank0]:   File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2216, in backward
[rank0]:     self._do_optimizer_backward(loss, retain_graph)
[rank0]:   File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2162, in _do_optimizer_backward
[rank0]:     self.optimizer.backward(loss, retain_graph=retain_graph)
[rank0]:   File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2082, in backward
[rank0]:     self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
[rank0]:   File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
[rank0]:     scaled_loss.backward(retain_graph=retain_graph)
[rank0]:   File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/torch/_tensor.py", line 581, in backward
[rank0]:     torch.autograd.backward(
[rank0]:   File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/torch/autograd/__init__.py", line 347, in backward
[rank0]:     _engine_run_backward(
[rank0]:   File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/torch/autograd/graph.py", line 825, in _engine_run_backward
[rank0]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank0]: RuntimeError: Found dtype Float but expected Half
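For what it's worth, this looks to me like the classic dtype-promotion mismatch in a loss backward: an fp16 prediction against an fp32 target promotes the loss to float32, and the backward pass then finds Float gradients where it expected Half. A minimal sketch of that error class (my assumption, not taken from the actual training code):

```python
import torch

# Hypothetical minimal repro of the same error class (my assumption, not the
# actual stable-audio-tools code): fp16 prediction vs. fp32 target.
pred = torch.randn(8, device="cuda", dtype=torch.float16, requires_grad=True)
target = torch.randn(8, device="cuda", dtype=torch.float32)

loss = torch.nn.functional.mse_loss(pred, target)  # forward promotes to float32
loss.backward()  # RuntimeError: Found dtype Float but expected Half
```

If that is indeed the cause, I assume the fix is to cast the targets (or the loss) to the same dtype DeepSpeed casts the model to, but I have not found where that should happen in the training wrapper.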
