
ARM (and YARR) conflicts with current RLBench (1.2.0) #8

@alexanderdurr

Description


Hi,
could you tell me which RLBench and YARR versions/tags are compatible with each other?
For most of the problems I believe PyTorch is the issue, but I can't find in any requirements.txt which version you use to make things work.

I observe this error:

Process train_env0:
Traceback (most recent call last):
  File "/home/user/anaconda3/envs/ARM/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/user/anaconda3/envs/ARM/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/user/anaconda3/envs/ARM/lib/python3.9/site-packages/yarr/runners/_env_runner.py", line 169, in _run_env
    raise e
  File "/home/user/anaconda3/envs/ARM/lib/python3.9/site-packages/yarr/runners/_env_runner.py", line 143, in _run_env
    for replay_transition in generator:
  File "/home/user/anaconda3/envs/ARM/lib/python3.9/site-packages/yarr/utils/rollout_generator.py", line 35, in generator
    transition = env.step(act_result)
  File "/home/user/ARM/arm/custom_rlbench_env.py", line 128, in step
    obs, reward, terminal = self._task.step(action)
  File "/home/user/anaconda3/envs/ARM/lib/python3.9/site-packages/rlbench/task_environment.py", line 99, in step
    self._action_mode.action(self._scene, action)
  File "/home/user/anaconda3/envs/ARM/lib/python3.9/site-packages/rlbench/action_modes/action_mode.py", line 32, in action
    arm_action = np.array(action[:arm_act_size])
  File "/home/user/anaconda3/envs/ARM/lib/python3.9/site-packages/torch/_tensor.py", line 732, in __array__
    return self.numpy()
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
[2022-05-27 10:10:31,983][root][ERROR] - Env train_env0 failed too many times (11 times > 10)
Exception in thread EnvRunnerThread:
Traceback (most recent call last):
  File "/home/user/anaconda3/envs/ARM/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/home/user/anaconda3/envs/ARM/lib/python3.9/threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "/home/user/anaconda3/envs/ARM/lib/python3.9/site-packages/yarr/runners/env_runner.py", line 134, in _run
    raise RuntimeError('Too many process failures.')
RuntimeError: Too many process failures.
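
For reference, the failing conversion is easy to reproduce on its own: np.array() on a CUDA tensor ends up calling Tensor.numpy(), which requires the data to be in host memory. This is just a minimal reproduction, assuming a CUDA device is available:

import numpy as np
import torch

action = torch.rand(8, device='cuda')   # stand-in for the action tensor my agent returns
# np.array(action)                      # -> TypeError: can't convert cuda:0 device type tensor to numpy
arm_action = np.array(action.cpu())     # copying to host memory first works
print(arm_action.dtype, arm_action.shape)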

I pulled the current versions of RLBench and YARR and re-installed all packages in a new conda environment.

I am wondering if you use a different torch version that handles the tensor-to-numpy conversion automatically. For now I have worked around it by adding .cpu() in a few files:

YARR

git diff main

diff --git a/yarr/envs/rlbench_env.py b/yarr/envs/rlbench_env.py
index 6aad118..6460fb1 100644
--- a/yarr/envs/rlbench_env.py
+++ b/yarr/envs/rlbench_env.py
@@ -6,7 +6,7 @@ try:
 except (ModuleNotFoundError, ImportError) as e:
     print("You need to install RLBench: 'https://github.com/stepjam/RLBench'")
     raise e
-from rlbench.action_modes import ActionMode
+from rlbench.action_modes.action_mode import ActionMode
 from rlbench.backend.observation import Observation
 from rlbench.backend.task import Task
 
diff --git a/yarr/utils/rollout_generator.py b/yarr/utils/rollout_generator.py
index d4d2973..a3f12ee 100644
--- a/yarr/utils/rollout_generator.py
+++ b/yarr/utils/rollout_generator.py
@@ -27,7 +27,7 @@ class RolloutGenerator(object):
                                    deterministic=eval)
 
             # Convert to np if not already
-            agent_obs_elems = {k: np.array(v) for k, v in
+            agent_obs_elems = {k: np.array(v.cpu()) for k, v in
                                act_result.observation_elements.items()}
             extra_replay_elements = {k: np.array(v) for k, v in
                                      act_result.replay_elements.items()}
@@ -66,7 +66,7 @@ class RolloutGenerator(object):
                     prepped_data = {k: torch.tensor([v], device=self._env_device) for k, v in obs_history.items()}
                     act_result = agent.act(step_signal.value, prepped_data,
                                            deterministic=eval)
-                    agent_obs_elems_tp1 = {k: np.array(v) for k, v in
+                    agent_obs_elems_tp1 = {k: np.array(v.cpu()) for k, v in
                                            act_result.observation_elements.items()}
                     obs_tp1.update(agent_obs_elems_tp1)
                 replay_transition.final_observation = obs_tp1

(Side note: because of the recent changes to the folder structure in RLBench, I also had to change the import for ActionMode, as shown above.)

RLBench

git diff master

diff --git a/rlbench/action_modes/action_mode.py b/rlbench/action_modes/action_mode.py
index 68171a37..a2c264ef 100644
--- a/rlbench/action_modes/action_mode.py
+++ b/rlbench/action_modes/action_mode.py
@@ -29,8 +29,8 @@ class MoveArmThenGripper(ActionMode):
 
     def action(self, scene: Scene, action: np.ndarray):
         arm_act_size = np.prod(self.arm_action_mode.action_shape(scene))
-        arm_action = np.array(action[:arm_act_size])
-        ee_action = np.array(action[arm_act_size:])
+        arm_action = np.array(action[:arm_act_size].cpu())
+        ee_action = np.array(action[arm_act_size:].cpu())
         self.arm_action_mode.action(scene, arm_action)
         self.gripper_action_mode.action(scene, ee_action)

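Rather than sprinkling .cpu() calls through RLBench and YARR, I have also been considering keeping the fix on my side and coercing whatever the agent returns into host-memory numpy arrays before it reaches the action mode. This is only a sketch; to_numpy is my own helper name, not something from either library:

import numpy as np
import torch

def to_numpy(x):
    # Detach and copy CUDA tensors to host memory before conversion;
    # anything else goes through np.asarray unchanged.
    if torch.is_tensor(x):
        return x.detach().cpu().numpy()
    return np.asarray(x)

# Hypothetical use in my custom_rlbench_env wrapper, before handing the action to RLBench:
# obs, reward, terminal = self._task.step(to_numpy(action))
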
I believe the error actually comes from a change somewhere else, though, or perhaps you use a torch version that can deal with this? Could you please help me? I don't know which PyTorch version you are using; it is missing from the requirements.txt. I installed PyTorch with conda:
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch

An error that I am unable to fix is this one:

Exception in thread EnvRunnerThread:
Traceback (most recent call last):
  File "/home/alexander/anaconda3/envs/ARM/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/home/alexander/anaconda3/envs/ARM/lib/python3.9/threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "/home/alexander/anaconda3/envs/ARM/lib/python3.9/site-packages/yarr/runners/env_runner.py", line 141, in _run
    new_transitions = self._update()
  File "/home/alexander/anaconda3/envs/ARM/lib/python3.9/site-packages/yarr/runners/env_runner.py", line 86, in _update
    self._agent_summaries = list(
  File "<string>", line 2, in __getitem__
  File "/home/alexander/anaconda3/envs/ARM/lib/python3.9/multiprocessing/managers.py", line 825, in _callmethod
    raise convert_to_error(kind, result)
multiprocessing.managers.RemoteError: 
---------------------------------------------------------------------------
Unserializable message: Traceback (most recent call last):
  File "/home/alexander/anaconda3/envs/ARM/lib/python3.9/multiprocessing/managers.py", line 300, in serve_client
    send(msg)
  File "/home/alexander/anaconda3/envs/ARM/lib/python3.9/multiprocessing/connection.py", line 211, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/home/alexander/anaconda3/envs/ARM/lib/python3.9/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/home/alexander/anaconda3/envs/ARM/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 249, in reduce_tensor
    event_sync_required) = storage._share_cuda_()
  File "/home/alexander/anaconda3/envs/ARM/lib/python3.9/site-packages/torch/storage.py", line 623, in _share_cuda_
    return self._storage._share_cuda_(*args, **kwargs)
RuntimeError: Attempted to send CUDA tensor received from another process; this is not currently supported. Consider cloning before sending.

---------------------------------------------------------------------------
[W CudaIPCTypes.cpp:92] Producer process tried to deallocate over 1000 memory blocks referred by consumer processes. Deallocation might be significantly slowed down. We assume it will never going to be the case, but if it is, please file but to https://github.com/pytorch/pytorch

Do you have any advice? It seems to me that PyTorch is the cause of most of the problems I mentioned.
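
My current guess for this second error is that the agent summaries still reference CUDA tensors when they are pushed through the multiprocessing manager, and PyTorch refuses to re-share a CUDA tensor that was itself received from another process. What I am experimenting with is sanitising the summaries before they cross the process boundary; this is only a sketch, summaries_to_cpu is my own name (not YARR API), and I am assuming each summary object keeps its payload in a value attribute:

import torch

def summaries_to_cpu(summaries):
    # Move any CUDA tensor payloads to host memory so the summary objects
    # can be pickled by multiprocessing without CUDA IPC.
    for s in summaries:
        value = getattr(s, 'value', None)
        if torch.is_tensor(value) and value.is_cuda:
            s.value = value.detach().cpu()
    return summaries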

using: Python 3.9.12
