Description
I noticed that train.py contains code for training the model with DeepSpeed, but when I set the strategy to "deepspeed" I get the following error: RuntimeError: Found dtype Float but expected Half. @zqevans
Thanks in advance for any advice or experience with this issue.
The details are as follows.
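For context, this is a minimal sketch of the Trainer setup I'm describing; apart from `strategy` and `precision`, the argument values here are illustrative, not copied from my train_ds.py:

```python
import pytorch_lightning as pl

# Illustrative sketch only: the "deepspeed" strategy with 16-bit precision
# is what triggers the "Enabling DeepSpeed FP16" path in the log below.
trainer = pl.Trainer(
    strategy="deepspeed",   # also tried as described in the Lightning docs
    precision="16-mixed",   # model parameters and inputs are cast to float16
    accelerator="gpu",
    devices=2,
)
```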
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/pytorch_lightning/trainer/configuration_validator.py:74: You defined a `validation_step` but have no `val_dataloader`. Skipping val loop.
[rank: 0] Seed set to 42
initializing deepspeed distributed: GLOBAL_RANK: 0, MEMBER: 1/2
Enabling DeepSpeed FP16. Model parameters and inputs will be cast to `float16`.
You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [5,6]
| Name | Type | Params
---------------------------------------------------------------
0 | diffusion | ConditionedDiffusionModelWrapper | 1.2 B
1 | losses | MultiLoss | 0
---------------------------------------------------------------
1.1 B Trainable params
156 M Non-trainable params
1.2 B Total params
4,853.350 Total estimated model params size (MB)
Epoch 0: 0%| | 0/31 [00:00<?, ?it/s]
/home/yiminc/projects/MG/stable-audio-tools/stable_audio_tools/models/conditioners.py:362: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
with torch.cuda.amp.autocast(dtype=torch.float16) and torch.set_grad_enabled(self.enable_grad):
/home/yiminc/projects/MG/stable-audio-tools/stable_audio_tools/training/diffusion.py:368: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
Epoch 0: 0%| | 0/31 [00:00<?, ?it/s]RuntimeError: Found dtype Float but expected Half
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/yiminc/projects/MG/stable-audio-tools/./train_ds.py", line 169, in <module>
[rank0]: main()
[rank0]: File "/home/yiminc/projects/MG/stable-audio-tools/./train_ds.py", line 166, in main
[rank0]: trainer.fit(training_wrapper, train_dl, val_dl, ckpt_path=args.ckpt_path if args.ckpt_path else None)
[rank0]: File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 545, in fit
[rank0]: call._call_and_handle_interrupt(
[rank0]: File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
[rank0]: return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank0]: File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 102, in launch
[rank0]: return function(*args, **kwargs)
[rank0]: File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 581, in _fit_impl
[rank0]: self._run(model, ckpt_path=ckpt_path)
[rank0]: File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 990, in _run
[rank0]: results = self._run_stage()
[rank0]: File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1036, in _run_stage
[rank0]: self.fit_loop.run()
[rank0]: File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 202, in run
[rank0]: self.advance()
[rank0]: File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 359, in advance
[rank0]: self.epoch_loop.run(self._data_fetcher)
[rank0]: File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 136, in run
[rank0]: self.advance(data_fetcher)
[rank0]: File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 240, in advance
[rank0]: batch_output = self.automatic_optimization.run(trainer.optimizers[0], batch_idx, kwargs)
[rank0]: File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 187, in run
[rank0]: self._optimizer_step(batch_idx, closure)
[rank0]: File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 265, in _optimizer_step
[rank0]: call._call_lightning_module_hook(
[rank0]: File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 157, in _call_lightning_module_hook
[rank0]: output = fn(*args, **kwargs)
[rank0]: File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/pytorch_lightning/core/module.py", line 1282, in optimizer_step
[rank0]: optimizer.step(closure=optimizer_closure)
[rank0]: File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/pytorch_lightning/core/optimizer.py", line 151, in step
[rank0]: step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs)
[rank0]: File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp.py", line 263, in optimizer_step
[rank0]: optimizer_output = super().optimizer_step(optimizer, closure, model, **kwargs)
[rank0]: File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 230, in optimizer_step
[rank0]: return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, **kwargs)
[rank0]: File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/deepspeed.py", line 123, in optimizer_step
[rank0]: closure_result = closure()
[rank0]: File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 140, in __call__
[rank0]: self._result = self.closure(*args, **kwargs)
[rank0]: File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 135, in closure
[rank0]: self._backward_fn(step_output.closure_loss)
[rank0]: File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 236, in backward_fn
[rank0]: call._call_strategy_hook(self.trainer, "backward", loss, optimizer)
[rank0]: File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 309, in _call_strategy_hook
[rank0]: output = fn(*args, **kwargs)
[rank0]: File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 204, in backward
[rank0]: self.precision_plugin.backward(closure_loss, self.lightning_module, optimizer, *args, **kwargs)
[rank0]: File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/deepspeed.py", line 112, in backward
[rank0]: deepspeed_engine.backward(tensor, *args, **kwargs)
[rank0]: File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 20, in wrapped_fn
[rank0]: ret_val = func(*args, **kwargs)
[rank0]: File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2216, in backward
[rank0]: self._do_optimizer_backward(loss, retain_graph)
[rank0]: File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2162, in _do_optimizer_backward
[rank0]: self.optimizer.backward(loss, retain_graph=retain_graph)
[rank0]: File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2082, in backward
[rank0]: self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
[rank0]: File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
[rank0]: scaled_loss.backward(retain_graph=retain_graph)
[rank0]: File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/torch/_tensor.py", line 581, in backward
[rank0]: torch.autograd.backward(
[rank0]: File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/torch/autograd/__init__.py", line 347, in backward
[rank0]: _engine_run_backward(
[rank0]: File "/home/yiminc/miniconda3/envs/sa/lib/python3.10/site-packages/torch/autograd/graph.py", line 825, in _engine_run_backward
[rank0]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank0]: RuntimeError: Found dtype Float but expected Half
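Incidentally, the deprecation warnings above name the replacement API for `torch.cuda.amp.autocast`. A minimal sketch of the updated call pattern (the helper name is mine, not from the repo; conditioners.py additionally wraps the call in `torch.set_grad_enabled`):

```python
import torch

def forward_fp16(module, x, enable_grad=False):
    """Run `module` under fp16 autocast using the non-deprecated API.

    Hypothetical helper mirroring the pattern flagged in conditioners.py:
    `torch.amp.autocast('cuda', ...)` replaces `torch.cuda.amp.autocast(...)`.
    """
    with torch.amp.autocast("cuda", dtype=torch.float16), \
         torch.set_grad_enabled(enable_grad):
        return module(x)
```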