
Fix RemoteMixtureOfExperts and RemoteSwitchMixtureOfExperts backward() on GPU #626


Merged · 1 commit · Mar 16, 2025

Conversation

Vectorrent
Contributor

If you try to use RemoteMixtureOfExperts or RemoteSwitchMixtureOfExperts during training on GPU, you will get errors like this:

  File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/torch/autograd/graph.py", line 768, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/torch/autograd/function.py", line 306, in apply
    return user_fn(self, *args)
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/torch/autograd/function.py", line 599, in wrapper
    outputs = fn(ctx, *args)
              ^^^^^^^^^^^^^^
  File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/hivemind/moe/client/moe.py", line 312, in backward
    inputs_per_expert = zip(*(tensor[alive_ii].split(1, dim=0) for tensor in flat_inputs_cpu))
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/hivemind/moe/client/moe.py", line 312, in <genexpr>
    inputs_per_expert = zip(*(tensor[alive_ii].split(1, dim=0) for tensor in flat_inputs_cpu))
                              ~~~~~~^^^^^^^^^^
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)

This does not happen when training on CPU; the error only occurs when training on GPU, and as far as I can tell there is no way to fix it without changing the underlying Hivemind code. I tried torch.cuda.set_device() and I tried moving the input tensors to the CPU myself, but neither works: the hard-coded CPU operations in moe.py conflict with tensors that remain on the GPU.
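For context, the failure is a plain PyTorch device-mismatch rule rather than anything MoE-specific: a CPU tensor cannot be indexed with a CUDA index tensor. A minimal reproduction (the names just mirror the traceback above; this is not hivemind code, and it requires a CUDA device):

```python
import torch

flat_input_cpu = torch.randn(4, 8)              # indexed tensor stays on the CPU
alive_ii = torch.tensor([0, 2], device="cuda")  # index tensor produced on the GPU

try:
    flat_input_cpu[alive_ii].split(1, dim=0)
except RuntimeError as e:
    # indices should be either on cpu or on the same device as the indexed tensor (cpu)
    print(e)
```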

This PR should fix the problem.
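A hedged sketch of the kind of change that resolves this, assuming (as the error message indicates) that the tensors in flat_inputs_cpu live on the CPU while alive_ii was produced on the GPU: move the index tensor onto the CPU before indexing. The helper below is hypothetical and only illustrates the approach; it is not the exact patch that was merged.

```python
import torch

def split_alive_inputs(flat_inputs_cpu, alive_ii):
    """Hypothetical helper mirroring the failing line in moe.py: index the CPU
    tensors with indices that may have been produced on the GPU."""
    alive_ii = alive_ii.cpu()  # bring indices onto the same device as the indexed tensors
    return zip(*(tensor[alive_ii].split(1, dim=0) for tensor in flat_inputs_cpu))

# Works regardless of which device alive_ii was created on.
flat_inputs_cpu = [torch.randn(4, 8), torch.randn(4, 3)]
alive_ii = torch.tensor([0, 2], device="cuda" if torch.cuda.is_available() else "cpu")
inputs_per_expert = list(split_alive_inputs(flat_inputs_cpu, alive_ii))
```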

@mryab mryab force-pushed the fix-backprop-devices branch from 5fc3f0e to cdf763c on March 16, 2025 at 17:09
@mryab mryab merged commit 9a76360 into learning-at-home:master Mar 16, 2025
22 of 36 checks passed
mryab pushed a commit that referenced this pull request Apr 20, 2025