🐛 Bug description
When launching a Slurm step with multiple tasks whose GPUs are assigned via the `--ntasks-per-gpu` flag instead of `--ntasks-per-node` (as seems to have been intended), Ignite uses the `SLURM_LOCALID` environment variable as the local rank and then uses it as the device id, even though `--ntasks-per-gpu` already binds each task to a single GPU. This causes the call `torch.cuda.set_device(self._local_rank)` to fail.
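To make the failure mode concrete, here is a minimal sketch (the values in the comments are illustrative assumptions matching the reproduction command below): with `--ntasks-per-gpu=1`, each task sees exactly one GPU, while `SLURM_LOCALID` still enumerates all tasks on the node, so any local rank greater than 0 points at a device ordinal that does not exist for that task.

```python
import os

import torch

# Under srun --ntasks-per-gpu=1 --gpus-per-node=4, each task is bound
# to a single GPU, so only one device is visible to the process, while
# SLURM_LOCALID still ranges over all tasks on the node (0..3 here).
local_rank = int(os.environ["SLURM_LOCALID"])  # e.g. 2 for the third task on a node

print(torch.cuda.device_count())   # 1: only the bound GPU is visible
torch.cuda.set_device(local_rank)  # raises for any local_rank >= 1
```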
To reproduce:

```bash
srun --ntasks-per-gpu=1 --nodes=2 --gpus-per-node=4 python -c "import ignite.distributed as idist; idist.initialize(backend='nccl')"
```
Which produces the following output:

```
  idist.initialize(backend="nccl")
  File ".../python3.11/site-packages/ignite/distributed/utils.py", line 577, in initialize
    _set_model(comp_model_cls(backend, **kwargs))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../python3.11/site-packages/ignite/distributed/comp_models/native.py", line 92, in __init__
    self._create_from_backend(
  File ".../python3.11/site-packages/ignite/distributed/comp_models/native.py", line 127, in _create_from_backend
    torch.cuda.set_device(self._local_rank)
  File ".../python3.11/site-packages/torch/cuda/__init__.py", line 408, in set_device
    torch._C._cuda_setDevice(device)
```
Intended behaviour:
Either:
- detect the presence of the `--ntasks-per-gpu` flag, which does not seem to be possible,
- detect that only one GPU is available and use it (a sketch of this option follows the list),
- allow explicitly setting the local id or the device to use even though idist is initialized in a Slurm environment, or
- allow overriding the local rank with `idist.set_local_rank()`, which is currently never considered when `SLURM_JOB_ID` is detected.
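A minimal sketch of the "only one GPU is available" option, assuming a device count of 1 is a sufficient signal that the scheduler already bound the task to a GPU (illustrative logic, not Ignite's current implementation; `resolve_device_id` is a hypothetical helper):

```python
import torch

def resolve_device_id(local_rank: int) -> int:
    # Hypothetical helper: when --ntasks-per-gpu binds the task to a
    # single GPU, device 0 is the only valid ordinal, regardless of
    # what SLURM_LOCALID reports.
    if torch.cuda.device_count() == 1:
        return 0
    return local_rank

# Usage: replace torch.cuda.set_device(local_rank) with
# torch.cuda.set_device(resolve_device_id(local_rank))
```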
Environment
- PyTorch Version (e.g., 1.4): 2.2
- Ignite Version (e.g., 0.3.0): 0.5.0.post2
- OS (e.g., Linux): Linux
- How you installed Ignite (`conda`, `pip`, source): pip
- Python version: 3.11
- Any other relevant information: Slurm 23.02.7