CPUAdam does not find CUDA #1619
Unanswered
javier-alvarez asked this question in Q&A
Replies: 0 comments
2021-12-08T15:12:02Z INFO Switching optimizer to DeepSpeedCPUAdam
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
[stderr]Traceback (most recent call last):
[stderr] File "InnerEyePrivate/ML/runner.py", line 57, in <module>
[stderr] main()
[stderr] File "InnerEyePrivate/ML/runner.py", line 53, in main
[stderr] post_cross_validation_hook=runner.default_post_cross_validation_hook)
[stderr] File "/mnt/azureml/cr/j/cfa5340abb4d4a3abec1a3ec4d8e39a6/exe/wd/innereye-deeplearning/InnerEye/ML/runner.py", line 442, in run
[stderr] return runner.run()
[stderr] File "/mnt/azureml/cr/j/cfa5340abb4d4a3abec1a3ec4d8e39a6/exe/wd/innereye-deeplearning/InnerEye/ML/runner.py", line 219, in run
[stderr] self.run_in_situ(azure_run_info)
[stderr] File "/mnt/azureml/cr/j/cfa5340abb4d4a3abec1a3ec4d8e39a6/exe/wd/innereye-deeplearning/InnerEye/ML/runner.py", line 398, in run_in_situ
[stderr] self.ml_runner.run()
[stderr] File "/mnt/azureml/cr/j/cfa5340abb4d4a3abec1a3ec4d8e39a6/exe/wd/innereye-deeplearning/InnerEye/ML/run_ml.py", line 327, in run
[stderr] num_nodes=self.azure_config.num_nodes)
[stderr] File "/mnt/azureml/cr/j/cfa5340abb4d4a3abec1a3ec4d8e39a6/exe/wd/innereye-deeplearning/InnerEye/ML/model_training.py", line 263, in model_train
[stderr] trainer.fit(lightning_model, datamodule=data_module)
[stderr] File "/azureml-envs/azureml_5602df82e8a46f1160ede9218ecc0c87/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 460, in fit
[stderr] self._run(model)
[stderr] File "/azureml-envs/azureml_5602df82e8a46f1160ede9218ecc0c87/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 717, in _run
[stderr] self.accelerator.setup(self, model) # note: this sets up self.lightning_module
[stderr] File "/azureml-envs/azureml_5602df82e8a46f1160ede9218ecc0c87/lib/python3.7/site-packages/pytorch_lightning/accelerators/cpu.py", line 39, in setup
[stderr] return super().setup(trainer, model)
[stderr] File "/azureml-envs/azureml_5602df82e8a46f1160ede9218ecc0c87/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 92, in setup
[stderr] self.setup_optimizers(trainer)
[stderr] File "/azureml-envs/azureml_5602df82e8a46f1160ede9218ecc0c87/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 375, in setup_optimizers
[stderr] trainer=trainer, model=self.lightning_module
[stderr] File "/azureml-envs/azureml_5602df82e8a46f1160ede9218ecc0c87/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 190, in init_optimizers
[stderr] return trainer.init_optimizers(model)
[stderr] File "/azureml-envs/azureml_5602df82e8a46f1160ede9218ecc0c87/lib/python3.7/site-packages/pytorch_lightning/trainer/optimizers.py", line 34, in init_optimizers
[stderr] optim_conf = model.configure_optimizers()
[stderr] File "/mnt/azureml/cr/j/cfa5340abb4d4a3abec1a3ec4d8e39a6/exe/wd/innereye-deeplearning/InnerEye/ML/SSL/lightning_modules/simclr_module.py", line 68, in configure_optimizers
[stderr] deepspeed_optim = DeepSpeedCPUAdam(params, lr=self.learning_rate, weight_decay=self.weight_decay)
[stderr] File "/azureml-envs/azureml_5602df82e8a46f1160ede9218ecc0c87/lib/python3.7/site-packages/deepspeed/ops/adam/cpu_adam.py", line 83, in __init__
[stderr] self.ds_opt_adam = CPUAdamBuilder().load()
[stderr] File "/azureml-envs/azureml_5602df82e8a46f1160ede9218ecc0c87/lib/python3.7/site-packages/deepspeed/ops/op_builder/builder.py", line 370, in load
[stderr] return self.jit_load(verbose)
[stderr] File "/azureml-envs/azureml_5602df82e8a46f1160ede9218ecc0c87/lib/python3.7/site-packages/deepspeed/ops/op_builder/builder.py", line 385, in jit_load
[stderr] assert_no_cuda_mismatch()
[stderr] File "/azureml-envs/azureml_5602df82e8a46f1160ede9218ecc0c87/lib/python3.7/site-packages/deepspeed/ops/op_builder/builder.py", line 97, in assert_no_cuda_mismatch
[stderr] f"Installed CUDA version {sys_cuda_version} does not match the "
[stderr]Exception: Installed CUDA version 10.2 does not match the version torch was compiled with 11.1, unable to compile cuda/cpp extensions without a matching cuda version.
[stderr]
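The exception compares two versions: the CUDA toolkit found on the system (10.2) and the CUDA version torch was compiled with (11.1). The log line `using CUDA_HOME='/usr/local/cuda'` suggests the system toolkit under that path is what gets inspected. Below is a minimal sketch for reproducing that comparison outside of DeepSpeed. It is not DeepSpeed's actual code: the helper names are hypothetical, and it assumes `nvcc` lives at `<CUDA_HOME>/bin/nvcc` and prints a `release X.Y` string.

```python
import re
import subprocess
from typing import Optional


def parse_nvcc_release(nvcc_output: str) -> Optional[str]:
    """Extract 'major.minor' from the text printed by `nvcc --version`."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    return f"{match.group(1)}.{match.group(2)}" if match else None


def system_cuda_version(cuda_home: str = "/usr/local/cuda") -> Optional[str]:
    """Ask the nvcc under cuda_home for its version; None if it cannot be run."""
    try:
        result = subprocess.run(
            [f"{cuda_home}/bin/nvcc", "--version"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return parse_nvcc_release(result.stdout)


def versions_match(system_version: Optional[str], torch_version: Optional[str]) -> bool:
    """True only when both versions are known and agree on major.minor."""
    if system_version is None or torch_version is None:
        return False
    return system_version.split(".")[:2] == torch_version.split(".")[:2]


if __name__ == "__main__":
    try:
        import torch  # optional: only needed to read the version torch was built with
        torch_cuda = torch.version.cuda
    except ImportError:
        torch_cuda = None
    sys_cuda = system_cuda_version()
    print(f"system CUDA: {sys_cuda}, torch CUDA: {torch_cuda}, "
          f"match: {versions_match(sys_cuda, torch_cuda)}")
```

Running this inside the same environment as the failing job should show whether the toolkit under `/usr/local/cuda` differs from the version reported by `torch.version.cuda`.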
https://github.com/microsoft/InnerEye-DeepLearning/pull/611/files
Any ideas why DeepSpeed does not find CUDA 11? The environment installs PyTorch 1.8 and CUDA 11 with conda.
Thanks!