idist.initialize fails in Slurm when using --ntasks-per-gpu #3259

@nowtryz

🐛 Bug description

When launching a Slurm step with multiple tasks and assigning GPUs with the --ntasks-per-gpu flag, rather than --ntasks-per-node as ignite seems to expect, ignite takes the SLURM_LOCALID environment variable as the local rank and also uses it as the device id. Because --ntasks-per-gpu already binds each process to a GPU, the call torch.cuda.set_device(self._local_rank) fails.
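To check whether a given task is affected, something like the following sketch could be run inside the step. It assumes that Slurm exports CUDA_VISIBLE_DEVICES for each task when GPUs are bound, which may depend on the cluster configuration; the helper name local_id_fits_visible_gpus is hypothetical and not part of ignite.

```python
import os
from typing import Mapping, Optional


def local_id_fits_visible_gpus(env: Mapping[str, str] = os.environ) -> Optional[bool]:
    """Return True if SLURM_LOCALID is a valid index into the visible GPUs.

    Returns None when either variable is missing, since nothing can be
    concluded in that case. Hypothetical helper, for diagnosis only.
    """
    local_id = env.get("SLURM_LOCALID")
    visible = env.get("CUDA_VISIBLE_DEVICES")  # assumption: set by Slurm per task
    if local_id is None or visible is None:
        return None
    # Count entries; they may be indices or UUIDs, only the count matters here.
    n_visible = len([d for d in visible.split(",") if d.strip()])
    return int(local_id) < n_visible


if __name__ == "__main__":
    print("SLURM_LOCALID usable as device id:", local_id_fits_visible_gpus())
```

A result of False for tasks with a local id above 0 would be consistent with the failure described here.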

To reproduce:

srun --ntasks-per-gpu=1 --nodes=2 --gpus-per-node=4 python -c "import ignite.distributed as idist; idist.initialize(backend='nccl')"

Which produces the following output:

    idist.initialize(backend="nccl")
  File ".../python3.11/site-packages/ignite/distributed/utils.py", line 577, in initialize
    _set_model(comp_model_cls(backend, **kwargs))
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../python3.11/site-packages/ignite/distributed/comp_models/native.py", line 92, in __init__
    self._create_from_backend(
  File ".../python3.11/site-packages/ignite/distributed/comp_models/native.py", line 127, in _create_from_backend
    torch.cuda.set_device(self._local_rank)
  File ".../python3.11/site-packages/torch/cuda/__init__.py", line 408, in set_device
    torch._C._cuda_setDevice(device)

Intended behaviour, one of:

  • Detect the presence of the --ntasks-per-gpu flag, which does not seem to be possible
  • Detect that only one GPU is visible to the process and use it
  • Allow explicitly setting the local id or the device to use, even when idist is initialized in a Slurm environment
  • Allow overriding the local rank with idist.set_local_rank(), which is currently ignored when SLURM_JOB_ID is detected
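As an illustration of the second option, here is a minimal sketch of how the device index could be chosen. The function resolve_device_index is hypothetical and not part of ignite; it only shows the selection logic, under the assumption that a process bound by --ntasks-per-gpu sees exactly one GPU.

```python
def resolve_device_index(local_rank: int, visible_gpus: int) -> int:
    """Pick the CUDA device index for this process.

    Hypothetical helper: when only one GPU is visible, the process is
    assumed to be already bound to it, so index 0 is the only valid choice.
    """
    if visible_gpus < 1:
        raise ValueError("no visible GPU")
    if visible_gpus == 1:
        # Bound to a single GPU (e.g. via --ntasks-per-gpu): ignore local rank.
        return 0
    if not 0 <= local_rank < visible_gpus:
        raise ValueError(
            f"local rank {local_rank} out of range for {visible_gpus} visible GPUs"
        )
    # Several GPUs visible (e.g. via --ntasks-per-node): keep current behaviour.
    return local_rank


# Possible use, assuming torch is installed and SLURM_LOCALID is set:
#   import os, torch
#   idx = resolve_device_index(int(os.environ["SLURM_LOCALID"]),
#                              torch.cuda.device_count())
#   torch.cuda.set_device(idx)
```

This keeps the existing behaviour when all of the node's GPUs are visible and only changes the single-GPU case.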

Environment

  • PyTorch Version (e.g., 1.4): 2.2
  • Ignite Version (e.g., 0.3.0): 0.5.0.post2
  • OS (e.g., Linux): Linux
  • How you installed Ignite (conda, pip, source): pip
  • Python version: 3.11
  • Any other relevant information: slurm 23.02.7
