Hi,
I'm working on applying the technique explained in this repo to distill Whisper for the Arabic language, using the Arabic split of the Common Voice dataset.
I completed Step 1 (creating the pseudo-labelled dataset) and Step 2 (initialization of the student model),
but in Step 3, during training, I ran into this error:
[rank1]: Traceback (most recent call last):
distil-whisper/training/distil-whisper-small-v1-ar/run_distillation.py", line 1811, in
[rank1]: main()
distil-whisper/training/distil-whisper-small-v1-ar/run_distillation.py", line 1644, in main
[rank1]: student_model.generation_config.save_pretrained(intermediate_dir)
envs/distilW_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1729, in __getattr__
[rank1]: raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
[rank1]: AttributeError: 'DistributedDataParallel' object has no attribute 'generation_config'
07/31/2024 11:28:33 - INFO - accelerate.checkpointing - Model weights saved in checkpoint-2000-epoch-117/model.safetensors
07/31/2024 11:28:33 - WARNING - accelerate.utils.other - Removed shared tensor {'proj_out.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading
torch/distributed/elastic/multiprocessing/api.py:858] Sending process 1706120 closing signal SIGTERM
torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 1 (pid: 1706121) of binary: envs/distilW_env/bin/python
Traceback (most recent call last):
envs/distilW_env/bin/accelerate", line 8, in
sys.exit(main())
distilW_env/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
distilW_env/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1097, in launch_command
multi_gpu_launcher(args)
distilW_env/lib/python3.10/site-packages/accelerate/commands/launch.py", line 734, in multi_gpu_launcher
distrib_run.run(args)
envs/distilW_env/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
envs/distilW_env/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
/distilW_env/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
run_distillation.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2024-07-31_11:28:34
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 1706121)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
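From the traceback, the failure happens on `student_model.generation_config.save_pretrained(intermediate_dir)` after the model has been wrapped by `DistributedDataParallel`, which only exposes `nn.Module` attributes, so `generation_config` of the underlying Whisper model can no longer be reached directly on the wrapper. Below is a minimal sketch of what I believe is happening, plus the unwrapping workaround I am considering; the checkpoint path and directory names are placeholders from my setup, and the workaround is only my guess, not code from this repo:

```python
# Minimal sketch of the failure mode (my assumption, not the repo's exact code).
from accelerate import Accelerator
from transformers import WhisperForConditionalGeneration

accelerator = Accelerator()

# Hypothetical local path, for illustration only.
student_model = WhisperForConditionalGeneration.from_pretrained("./distil-whisper-small-v1-ar")

# Under a multi-GPU accelerate launch, prepare() wraps the model in DistributedDataParallel.
student_model = accelerator.prepare(student_model)

intermediate_dir = "./checkpoint-2000-epoch-117"  # placeholder directory

# Fails when wrapped: the DDP wrapper has no `generation_config` attribute.
# student_model.generation_config.save_pretrained(intermediate_dir)

# Workaround I'm considering: unwrap first, then save the generation config.
unwrapped_model = accelerator.unwrap_model(student_model)
unwrapped_model.generation_config.save_pretrained(intermediate_dir)
```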
Could you help me figure out this error?
Thanks in advance!