[Bug]: SubprocVecEnv ignores specified CUDA device and uses GPU 0 #2116


Open
tutututudodo opened this issue Apr 8, 2025 · 1 comment
Labels
custom gym env (Issue related to Custom Gym Env), openai gym (related to OpenAI Gym interface)

Comments


tutututudodo commented Apr 8, 2025

🐛 Bug

Description

When using SubprocVecEnv with multiple environments, all subprocesses ignore the GPU device selected in the main process and default to GPU 0, regardless of which GPU was set with torch.cuda.set_device().

Reproduction

I created a minimal reproduction script that clearly shows the issue:

import os
import torch
import numpy as np
import gymnasium as gym
from stable_baselines3.common.vec_env import DummyVecEnv, SubprocVecEnv

class GPUTestEnv(gym.Env):
    def __init__(self, env_id=0):
        super().__init__()
        self.observation_space = gym.spaces.Box(low=-1, high=1, shape=(4,), dtype=np.float32)
        self.action_space = gym.spaces.Box(low=-1, high=1, shape=(2,), dtype=np.float32)
        self.env_id = env_id
        
        if torch.cuda.is_available():
            device = torch.cuda.current_device()
            device_name = torch.cuda.get_device_name(device)
            print(f"Env {self.env_id} created on GPU {device} ({device_name}) - PID: {os.getpid()}")
            
    def reset(self, seed=None, options=None):
        if torch.cuda.is_available():
            test_tensor = torch.ones(1, device="cuda")
            device_id = test_tensor.device.index
            print(f"Env {self.env_id} - reset() using GPU: {device_id}")
        return np.zeros(4, dtype=np.float32), {}
        
    def step(self, action):
        if torch.cuda.is_available():
            test_tensor = torch.ones(1, device="cuda")
            device_id = test_tensor.device.index
            print(f"Env {self.env_id} - step() using GPU: {device_id}")
        return np.zeros(4, dtype=np.float32), 0.0, False, False, {}

if __name__ == "__main__":  # guard required when using the "spawn" start method
    # Specify GPU 3
    torch.cuda.set_device(3)
    print(f"Main process current device: {torch.cuda.current_device()}")

    # Create environments
    env_fns = [lambda idx=i: GPUTestEnv(idx) for i in range(3)]

    # DummyVecEnv correctly uses GPU 3
    print("\n----- Testing DummyVecEnv -----")
    dummy_env = DummyVecEnv(env_fns)
    dummy_env.reset()
    dummy_env.step(np.zeros((3, 2)))
    dummy_env.close()

    # SubprocVecEnv incorrectly uses GPU 0
    print("\n----- Testing SubprocVecEnv -----")
    subproc_env = SubprocVecEnv(env_fns, start_method="spawn")
    subproc_env.reset()
    subproc_env.step(np.zeros((3, 2)))
    subproc_env.close()

Output

Main process current device: 3

----- Testing DummyVecEnv -----
Env 0 created on GPU 3 (NVIDIA A100 80GB PCIe) - PID: 1795905
Env 1 created on GPU 3 (NVIDIA A100 80GB PCIe) - PID: 1795905
Env 2 created on GPU 3 (NVIDIA A100 80GB PCIe) - PID: 1795905
Env 0 - reset() using GPU: 3
Env 1 - reset() using GPU: 3
Env 2 - reset() using GPU: 3
Env 0 - step() using GPU: 3
Env 1 - step() using GPU: 3
Env 2 - step() using GPU: 3

----- Testing SubprocVecEnv -----
Env 0 created on GPU 0 (NVIDIA A100 80GB PCIe) - PID: 1796000
Env 0 - reset() using GPU: 0
Env 0 - step() using GPU: 0
Env 2 created on GPU 0 (NVIDIA A100 80GB PCIe) - PID: 1796002
Env 2 - reset() using GPU: 0
Env 2 - step() using GPU: 0
Env 1 created on GPU 0 (NVIDIA A100 80GB PCIe) - PID: 1796001
Env 1 - reset() using GPU: 0
Env 1 - step() using GPU: 0

Environment

  • Stable Baselines 3 version: 2.5.0
  • PyTorch: 2.6.0+cu124
  • CUDA: 12.4
  • GPUs: 4x NVIDIA A100 80GB PCIe
  • Gymnasium: 1.0.0

Checklist

  • My issue does not relate to a custom gym environment. (Use the custom gym env template instead)
  • I have checked that there is no similar issue in the repo
  • I have read the documentation
  • I have provided a minimal and working example to reproduce the bug
  • I've used the markdown code blocks for both code and stack traces.
@tutututudodo tutututudodo added the bug Something isn't working label Apr 8, 2025
@araffin araffin added custom gym env Issue related to Custom Gym Env openai gym related to OpenAI Gym interface and removed bug Something isn't working labels Apr 9, 2025
araffin (Member) commented Apr 14, 2025

Hello,
this is the expected behavior: torch.cuda.set_device(3) was executed in the main process, whereas the envs run in separate processes, and the device selection does not carry over to them.
If you want the other processes to use the correct device, you either need to set the CUDA_VISIBLE_DEVICES environment variable or add a method to your env that sets the CUDA device.
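
For reference, a minimal sketch of both workarounds on top of the reproduction script above; the set_cuda_device method name is only an illustrative choice, not part of the SB3 API:

import gymnasium as gym
import numpy as np
import torch
from stable_baselines3.common.vec_env import SubprocVecEnv

# Workaround 1: restrict the visible devices before CUDA is initialised anywhere,
# e.g. os.environ["CUDA_VISIBLE_DEVICES"] = "3" (after `import os`). Spawned workers
# inherit the parent's environment variables, so every process would then see only
# GPU 3, exposed to it as cuda:0.

# Workaround 2: give the env a method that sets the device and call it in every
# worker through VecEnv.env_method() once the subprocesses exist.
class GPUTestEnv(gym.Env):
    def __init__(self, env_id=0):
        super().__init__()
        self.observation_space = gym.spaces.Box(low=-1, high=1, shape=(4,), dtype=np.float32)
        self.action_space = gym.spaces.Box(low=-1, high=1, shape=(2,), dtype=np.float32)
        self.env_id = env_id

    def set_cuda_device(self, device_id: int) -> None:
        # Runs inside the worker process, which is where set_device has to happen.
        torch.cuda.set_device(device_id)

    def reset(self, seed=None, options=None):
        return np.zeros(4, dtype=np.float32), {}

    def step(self, action):
        return np.zeros(4, dtype=np.float32), 0.0, False, False, {}

if __name__ == "__main__":
    env_fns = [lambda idx=i: GPUTestEnv(idx) for i in range(3)]
    vec_env = SubprocVecEnv(env_fns, start_method="spawn")
    # Ask every worker to switch to GPU 3 before any rollouts happen.
    vec_env.env_method("set_cuda_device", 3)
    vec_env.close()

With the first approach the workers only see a single device, so anything placed on "cuda" already lands on the intended GPU; with the second, the device index refers to all GPUs visible to the worker process.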
