Versatile and Generalizable Manipulation via Goal-Conditioned Reinforcement Learning with Grounded Object Detection
This repository implements the method presented in the paper:
"Versatile and Generalizable Manipulation via Goal-Conditioned Reinforcement Learning with Grounded Object Detection"
Huiyi Wang, Fahim Shahriar, Seyed Alireza Azimi, Gautham Vasan, A. Rupam Mahmood, Colin Bellinger
Accepted at the CoRL 2024 Workshop on Minimalist Robot Learning (MRM-D)
📄 Read the paper
This project investigates how goal-conditioned reinforcement learning (GCRL) can be enhanced using mask-based goal representations derived from natural language descriptions of target objects. The method enables a single manipulation policy to generalize across a wide variety of objects and goal configurations.
This repository includes:
- A simulation environment for the UR10e robot
- Integration with a physical UR10e robot
- A trained mask-conditioned PPO policy
- Instructions to train in simulation and deploy on hardware
Traditional GCRL approaches often struggle to generalize to new target objects. This work shows that binary goal masks—either ground-truth or generated by a pre-trained object grounding model—enable better generalization and faster learning than alternative goal conditioning strategies such as one-hot vectors or cropped target images.
In particular, we use a pre-trained object grounding model (GroundingDINO + SAM) to convert a textual goal description (e.g., “apple on the right”) into a binary mask that highlights the object’s location in the scene. This goal mask is updated at every timestep, allowing the agent to:
- Track progress toward the goal
- Receive implicit feedback
- Mitigate the sparse reward problem
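For concreteness, here is a minimal sketch of how a text prompt can be turned into a binary goal mask using the standard GroundingDINO and Segment Anything Python APIs. The checkpoint paths, thresholds, and file names are placeholders; the repository's own inference code may wire these components differently.

```python
# Minimal sketch (not the exact repo pipeline): text prompt -> GroundingDINO box -> SAM mask.
# Checkpoint paths below are placeholders.
import numpy as np
from groundingdino.util.inference import load_model, load_image, predict
from segment_anything import sam_model_registry, SamPredictor

dino = load_model("groundingdino/config/GroundingDINO_SwinT_OGC.py", "weights/groundingdino_swint_ogc.pth")
sam = sam_model_registry["vit_b"](checkpoint="weights/sam_vit_b.pth")
predictor = SamPredictor(sam)

image_source, image = load_image("frame.png")        # image_source: HxWx3 RGB uint8, image: transformed tensor
boxes, logits, phrases = predict(
    model=dino, image=image, caption="green apple",
    box_threshold=0.35, text_threshold=0.25,
)

# Convert the highest-scoring normalized (cx, cy, w, h) box to pixel (x1, y1, x2, y2) for SAM.
h, w, _ = image_source.shape
cx, cy, bw, bh = boxes[int(logits.argmax())].tolist()
box_xyxy = np.array([(cx - bw / 2) * w, (cy - bh / 2) * h, (cx + bw / 2) * w, (cy + bh / 2) * h])

predictor.set_image(image_source)
masks, _, _ = predictor.predict(box=box_xyxy, multimask_output=False)
goal_mask = masks[0].astype(np.uint8)                 # binary HxW goal mask fed to the policy
```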
The RL policy is conditioned on:
- RGB image
- Proprioceptive state
- Binary goal mask (updated at each timestep)
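The exact observation layout is defined by the environment code; the snippet below is only an illustrative way to stack the RGB frame and the goal mask into the 4-channel image implied by the `--channel_num 4` training flag. The helper name and dict keys are hypothetical.

```python
# Illustrative only: one way to assemble the per-timestep observation described above.
import numpy as np

def build_observation(rgb, goal_mask, joint_positions):
    """rgb: HxWx3 uint8, goal_mask: HxW in {0, 1}, joint_positions: proprioceptive state."""
    image = np.concatenate([rgb, goal_mask[..., None] * 255], axis=-1)  # HxWx4 (RGB + mask)
    return {
        "image": image.astype(np.uint8),
        "proprio": np.asarray(joint_positions, dtype=np.float32),
    }
```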
- Text Prompt → Grounded Object Detector → Binary Mask: leverages vision-language grounding to generate object-specific goal representations.
- Goal conditioning variants compared:
  - One-hot vector (baseline)
  - Goal object image crop
  - Binary goal mask (proposed)
- Learning algorithm: PPO (Proximal Policy Optimization) trained on visual, proprioceptive, and mask inputs.
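As a rough orientation, a run with these components might be set up programmatically as in the sketch below, assuming Stable-Baselines3 PPO over a dict observation space and that importing `mj_envs` registers the UR10e environments. The supported entry point is `training/Train_reach.py` (see the Training section).

```python
# Rough sketch of the learning setup; hyperparameters mirror the training command documented below.
import mj_envs  # assumed to register the UR10e environments with gym
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

env = make_vec_env("UR10eReach1C-v1", n_envs=4)
model = PPO("MultiInputPolicy", env, learning_rate=3e-4, clip_range=0.1, verbose=1)
model.learn(total_timesteps=1_000_000)
model.save("policy/my_reach_policy")
```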
| Goal Representation | Seen Objects (In-Distribution) | Unseen Objects (Out-of-Distribution) |
|---|---|---|
| One-hot Vector | 13% | 20% |
| Goal Object Image | 62% | 28% |
| GT Binary Mask | 89% | 90% |
- Binary masks enable strong zero-shot generalization to novel target objects.
- Training with GT masks transfers well to DINO-generated masks on seen objects (~90% success).
- Performance with real-time DINO-generated masks degrades in cluttered scenes due to detection noise.
This codebase is provided for research purposes. Users are fully responsible for validating and testing any part of the code—both in simulation and on real robotic systems.
The authors and contributors assume no liability for any damage, failure, or unexpected behavior that may result from deploying the provided code on physical hardware. Proceed with caution and validate thoroughly in controlled environments.
```bash
git clone https://github.com/cherylwang20/GCRL_UR10e.git
cd GCRL_UR10e
git submodule update --init --recursive
```
You will also need an external pre-trained object grounding model for inference. We use GroundingDINO (GDINO), which is cloned as a submodule above. Please follow the installation instructions in the GroundingDINO repository to make sure your CUDA, PyTorch, and GPU versions are compatible.
```bash
cd GroundingDINO
pip install -e .
```
Note on PyTorch 2.0 compatibility: if you encounter an error with `value.type()` in `ms_deform_attn_cuda.cu`, replace it with `value.scalar_type()` in `groundingdino/models/GroundingDINO/csrc/MsDeformAttn/ms_deform_attn_cuda.cu`.
Use Python 3.9 (later versions may cause issues with loading the baseline):
```bash
python3.9 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
cd mj_envs
pip install -e .
```
Download the pre-trained baseline policy:
```bash
mkdir -p policy
gdown 'https://drive.google.com/uc?id=1wKpIUVp2kXvf_Lq1VV7aKIoERLOS6QtW' -O policy/baseline.zip
```
To train a new policy, run:
```bash
python training/Train_reach.py --env_name 'UR10eReach1C-v1' --group 'Reach_4C_dt20' --num_envs 4 --learning_rate 0.0003 --clip_range 0.1 --seed=0 --channel_num 4 --fs 20
```
Training Script Arguments
- `--env_name 'UR10eReach1C-v1'`: specifies the UR10e environment for training.
- `--group 'Reach_4C_dt20'`: name of the experiment group for logging.
- `--num_envs 4`: number of parallel environments.
- `--learning_rate 0.0003`: learning rate for PPO.
- `--clip_range 0.1`: PPO clip range for stable policy updates.
- `--seed 0`: random seed, often set via SLURM for batch runs.
- `--channel_num 4`: number of input image channels.
- `--fs 20`: frame skip (simulation step interval).
To evaluate the pre-trained baseline policy, run:
```bash
python training/Eval_Baseline.py --env_name "UR10eReach1C-v1" --model_num "baseline"
```
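`Eval_Baseline.py` handles model loading and rollout for you. If you want to drive the policy manually, a minimal sketch (assuming the baseline is a Stable-Baselines3 PPO checkpoint and that importing `mj_envs` registers the environment) could look like this:

```python
# Minimal rollout sketch; training/Eval_Baseline.py is the supported entry point.
import mj_envs  # assumed to register the UR10e environments with gym
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

env = make_vec_env("UR10eReach1C-v1", n_envs=1)
model = PPO.load("policy/baseline.zip", env=env)

obs = env.reset()
for _ in range(200):
    action, _ = model.predict(obs, deterministic=True)
    obs, rewards, dones, infos = env.step(action)
```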
To achieve effective sim-to-real transfer, we fine-tune the policy trained above with observation image augmentation via continued training. To train with image augmentation, download the resized external images (originally from OpenX) into the `background` directory from https://mcgill-my.sharepoint.com/:u:/g/personal/huiyi_wang_mail_mcgill_ca/EZM8oZL_PPVIiOtrbl8Gy0sBLTBYWjd18TOdrS43WULVdA?e=ZBfhfY.
Use the following command:
```bash
python training/Train_reach.py --env_name "UR10eReach1C-v1" --group 'Reach_4C_dt20_cont' --num_envs 4 --learning_rate 0.0003 --clip_range 0.1 --seed=0 --channel_num 4 --fs 20 --merge True --cont "Your Previous Policy"
```
No changes to the hyperparameters or reward shaping are required. We trained for an additional 1 million steps until full convergence. Without this augmentation, the policy fails to transfer from simulation to the real robot.
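The merging logic lives in the environment and training code; schematically, background augmentation amounts to compositing a randomly chosen image from the `background` folder behind the rendered frame, as in the sketch below. The `scene_mask` foreground mask, file layout, and helper name are assumptions, not the repository's actual `--merge` implementation.

```python
# Schematic of background-image augmentation (the actual --merge implementation may differ).
import random
from pathlib import Path

import numpy as np
from PIL import Image

BACKGROUNDS = list(Path("background").glob("*.png"))

def merge_background(rgb, scene_mask):
    """rgb: HxWx3 rendered frame; scene_mask: HxW bool, True where robot/objects are visible."""
    bg = Image.open(random.choice(BACKGROUNDS)).convert("RGB").resize((rgb.shape[1], rgb.shape[0]))
    bg = np.asarray(bg)
    return np.where(scene_mask[..., None], rgb, bg)  # keep foreground pixels, swap the rest
```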
- The robot's initial joint configuration is `[4.7799, -2.0740, 2.6200, 3.0542, -1.5800, 1.4305e-05]` (in radians), with the gripper fully open.
- Place target objects 30–50 cm in front of the camera, making sure they are visible at the start.
- The camera is mounted on the Robotiq gripper using a custom 3D-printed bracket. It is essential that the gripper is visible in the camera view, with the camera angled roughly 17 degrees downward.
- Set the correct IP address for your UR10e robot in GdinoReachGraspEnv_servoJ.py#L86.
- Both `servoJ` and `moveJ` motion commands are supported; `servoJ` offers better performance for sim-to-real transfer (a minimal command sketch follows the demo details below).
- We capture at a camera resolution of 848 × 480 for best inference results and later rescale to 212 × 120 for the policy input.
- Because the reaching policy performs strongly, we hardcoded a pick-and-drop routine that triggers once the end-effector approaches the table: https://github.com/cherylwang20/Sim2Real_GCRL_UR10e/blob/3f6d3c6f44f698b062e058aac546f5c7d1629576/src/reachGrasp_env/GdinoReachGraspEnv_servoJ.py#L326. Comment out this block if you do not require this behavior.
- G.DINO Prompt: Green Apple
- Control: ServoJ
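For reference, here is a hedged sketch of moving the robot to the documented home configuration and streaming `servoJ` targets with the `ur_rtde` Python bindings; the repository's own robot interface in `GdinoReachGraspEnv_servoJ.py` may differ in parameters and structure.

```python
# Sketch only: home the UR10e, then stream joint targets with servoJ.
import time
import rtde_control

HOME_Q = [4.7799, -2.0740, 2.6200, 3.0542, -1.5800, 1.4305e-05]  # radians, gripper open

rtde_c = rtde_control.RTDEControlInterface("192.168.x.x")  # replace with your robot's IP
rtde_c.moveJ(HOME_Q, 0.5, 0.3)                             # coarse move to the start pose

# servoJ streams joint targets at a fixed rate (how policy actions are executed at deployment).
dt = 0.05
q = list(HOME_Q)
for _ in range(20):
    rtde_c.servoJ(q, 0.0, 0.0, dt, 0.1, 300)
    time.sleep(dt)
rtde_c.servoStop()
rtde_c.stopScript()
```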
```bibtex
@inproceedings{wang2024goalconditioned,
  title={Versatile and Generalizable Manipulation via Goal-Conditioned Reinforcement Learning with Grounded Object Detection},
  author={Huiyi Wang and Fahim Shahriar and Seyed Alireza Azimi and Gautham Vasan and A. Rupam Mahmood and Colin Bellinger},
  booktitle={CoRL 2024 Workshop on Minimalist Robot Learning (MRM-D)},
  year={2024},
  url={https://openreview.net/forum?id=TgXIkK8WPQ}
}
```