Versatile and Generalizable Manipulation via Goal-Conditioned Reinforcement Learning with Grounded Object Detection
This repository implements the method presented in the paper:
"Versatile and Generalizable Manipulation via Goal-Conditioned Reinforcement Learning with Grounded Object Detection"
Huiyi Wang, Fahim Shahriar, Seyed Alireza Azimi, Gautham Vasan, A. Rupam Mahmood, Colin Bellinger
Accepted at the CoRL 2024 Workshop on Minimalist Robot Learning (MRM-D)
📄 Read the paper
This project investigates how goal-conditioned reinforcement learning (GCRL) can be enhanced using mask-based goal representations derived from natural language descriptions of target objects. The method enables a single manipulation policy to generalize across a wide variety of objects and goal configurations.
This repository includes:
- A simulation environment for the UR10e robot
- Integration with a physical UR10e robot
- A trained mask-conditioned PPO policy
- Instructions to train in simulation and deploy on hardware
Traditional GCRL approaches often struggle to generalize to new target objects. This work shows that binary goal masks—either ground-truth or generated by a pre-trained object grounding model—enable better generalization and faster learning than alternative goal conditioning strategies such as one-hot vectors or cropped target images.
In particular, we use a pre-trained object grounding model (GroundingDINO + SAM) to convert a textual goal description (e.g., “apple on the right”) into a binary mask that highlights the object’s location in the scene. This goal mask is updated at every timestep, allowing the agent to:
- Track progress toward the goal
- Receive implicit feedback
- Mitigate the sparse reward problem
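For concreteness, here is a minimal sketch of how a text prompt can be turned into a binary goal mask using the standard GroundingDINO and Segment Anything Python APIs. The checkpoint paths, thresholds, and file names are placeholders; the repository's own inference code may wire these components differently.

```python
# Minimal sketch (not the exact repo pipeline): text prompt -> GroundingDINO box -> SAM mask.
# Checkpoint paths below are placeholders.
import numpy as np
from groundingdino.util.inference import load_model, load_image, predict
from segment_anything import sam_model_registry, SamPredictor

dino = load_model("groundingdino/config/GroundingDINO_SwinT_OGC.py", "weights/groundingdino_swint_ogc.pth")
sam = sam_model_registry["vit_b"](checkpoint="weights/sam_vit_b.pth")
predictor = SamPredictor(sam)

image_source, image = load_image("frame.png")        # image_source: HxWx3 RGB uint8, image: transformed tensor
boxes, logits, phrases = predict(
    model=dino, image=image, caption="green apple",
    box_threshold=0.35, text_threshold=0.25,
)

# Convert the highest-scoring normalized (cx, cy, w, h) box to pixel (x1, y1, x2, y2) for SAM.
h, w, _ = image_source.shape
cx, cy, bw, bh = boxes[int(logits.argmax())].tolist()
box_xyxy = np.array([(cx - bw / 2) * w, (cy - bh / 2) * h, (cx + bw / 2) * w, (cy + bh / 2) * h])

predictor.set_image(image_source)
masks, _, _ = predictor.predict(box=box_xyxy, multimask_output=False)
goal_mask = masks[0].astype(np.uint8)                 # binary HxW goal mask fed to the policy
```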
The RL policy is conditioned on:
- RGB image
- Proprioceptive state
- Binary goal mask (updated at each timestep)
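The exact observation layout is defined by the environment code; the snippet below is only an illustrative way to stack the RGB frame and the goal mask into the 4-channel image implied by the `--channel_num 4` training flag. The helper name and dict keys are hypothetical.

```python
# Illustrative only: one way to assemble the per-timestep observation described above.
import numpy as np

def build_observation(rgb, goal_mask, joint_positions):
    """rgb: HxWx3 uint8, goal_mask: HxW in {0, 1}, joint_positions: proprioceptive state."""
    image = np.concatenate([rgb, goal_mask[..., None] * 255], axis=-1)  # HxWx4 (RGB + mask)
    return {
        "image": image.astype(np.uint8),
        "proprio": np.asarray(joint_positions, dtype=np.float32),
    }
```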
- Text Prompt → Grounded Object Detector → Binary Mask: leverages vision-language grounding to generate object-specific goal representations.
- Goal conditioning variants compared:
  - One-hot vector (baseline)
  - Goal object image crop
  - Binary goal mask (proposed)
- Learning algorithm: PPO (Proximal Policy Optimization) trained on visual, proprioceptive, and mask inputs.
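As a rough orientation, a run with these components might be set up programmatically as in the sketch below, assuming Stable-Baselines3 PPO over a dict observation space and that importing `mj_envs` registers the UR10e environments. The supported entry point is `training/Train_reach.py` (see the Training section).

```python
# Rough sketch of the learning setup; hyperparameters mirror the training command documented below.
import mj_envs  # assumed to register the UR10e environments with gym
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

env = make_vec_env("UR10eReach1C-v1", n_envs=4)
model = PPO("MultiInputPolicy", env, learning_rate=3e-4, clip_range=0.1, verbose=1)
model.learn(total_timesteps=1_000_000)
model.save("policy/my_reach_policy")
```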
| Goal Representation | Seen Objects (In-Distribution) | Unseen Objects (Out-of-Distribution) |
|---|---|---|
| One-hot Vector | 13% | 20% |
| Goal Object Image | 62% | 28% |
| GT Binary Mask | 89% | 90% |
- Binary masks enable strong zero-shot generalization to novel target objects.
- Training with GT masks transfers well to DINO-generated masks on seen objects (~90% success).
- Performance with real-time DINO-generated masks degrades in cluttered scenes due to detection noise.
This codebase is provided for research purposes. Users are fully responsible for validating and testing any part of the code—both in simulation and on real robotic systems.
The authors and contributors assume no liability for any damage, failure, or unexpected behavior that may result from deploying the provided code on physical hardware. Proceed with caution and validate thoroughly in controlled environments.
```bash
git clone https://github.com/cherylwang20/GCRL_UR10e.git
cd GCRL_UR10e
git submodule update --init --recursive
```
You will also need an external pre-trained object grounding model for inference. We use GroundingDINO (GDINO), which is cloned as a submodule above. Please follow the installation instructions in the GroundingDINO repository to make sure your CUDA, PyTorch, and GPU versions are compatible.
```bash
cd GroundingDINO
pip install -e .
```
Note on PyTorch 2.0 compatibility: if you encounter an error with `value.type()` in `ms_deform_attn_cuda.cu`, replace it with `value.scalar_type()` in `groundingdino/models/GroundingDINO/csrc/MsDeformAttn/ms_deform_attn_cuda.cu`.
Use Python 3.9 (later versions may cause issues with loading the baseline):
```bash
python3.9 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
cd mj_envs
pip install -e .
```
Download the pre-trained baseline policy:
```bash
mkdir -p policy
gdown 'https://drive.google.com/uc?id=1wKpIUVp2kXvf_Lq1VV7aKIoERLOS6QtW' -O policy/baseline.zip
```
To train a new policy, run:
```bash
python training/Train_reach.py --env_name 'UR10eReach1C-v1' --group 'Reach_4C_dt20' --num_envs 4 --learning_rate 0.0003 --clip_range 0.1 --seed=0 --channel_num 4 --fs 20
```
Training Script Arguments
- `--env_name 'UR10eReach1C-v1'`: specifies the UR10e environment for training.
- `--group 'Reach_4C_dt20'`: name of the experiment group for logging.
- `--num_envs 4`: number of parallel environments.
- `--learning_rate 0.0003`: learning rate for PPO.
- `--clip_range 0.1`: PPO clip range for stable policy updates.
- `--seed 0`: random seed, often set via SLURM for batch runs.
- `--channel_num 4`: number of input image channels.
- `--fs 20`: frame skip (simulation step interval).
To evaluate the pre-trained baseline policy, run:
```bash
python training/Eval_Baseline.py --env_name "UR10eReach1C-v1" --model_num "baseline"
```
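`Eval_Baseline.py` handles model loading and rollout for you. If you want to drive the policy manually, a minimal sketch (assuming the baseline is a Stable-Baselines3 PPO checkpoint and that importing `mj_envs` registers the environment) could look like this:

```python
# Minimal rollout sketch; training/Eval_Baseline.py is the supported entry point.
import mj_envs  # assumed to register the UR10e environments with gym
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

env = make_vec_env("UR10eReach1C-v1", n_envs=1)
model = PPO.load("policy/baseline.zip", env=env)

obs = env.reset()
for _ in range(200):
    action, _ = model.predict(obs, deterministic=True)
    obs, rewards, dones, infos = env.step(action)
```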
To achieve effective sim-to-real transfer, we fine-tune the policy trained above with observation image augmentation via continued training. To train with image augmentation, download the resized external images (originally from OpenX) into the `background` directory from https://mcgill-my.sharepoint.com/:u:/g/personal/huiyi_wang_mail_mcgill_ca/EZM8oZL_PPVIiOtrbl8Gy0sBLTBYWjd18TOdrS43WULVdA?e=ZBfhfY.
Use the following command:
```bash
python training/Train_reach.py --env_name "UR10eReach1C-v1" --group 'Reach_4C_dt20_cont' --num_envs 4 --learning_rate 0.0003 --clip_range 0.1 --seed=0 --channel_num 4 --fs 20 --merge True --cont "Your Previous Policy"
```
No changes to the hyperparameters or reward shaping are required. We trained for an additional 1 million steps until full convergence. Without this augmentation, the policy fails to transfer from simulation to the real robot.
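The merging logic lives in the environment and training code; schematically, background augmentation amounts to compositing a randomly chosen image from the `background` folder behind the rendered frame, as in the sketch below. The `scene_mask` foreground mask, file layout, and helper name are assumptions, not the repository's actual `--merge` implementation.

```python
# Schematic of background-image augmentation (the actual --merge implementation may differ).
import random
from pathlib import Path

import numpy as np
from PIL import Image

BACKGROUNDS = list(Path("background").glob("*.png"))

def merge_background(rgb, scene_mask):
    """rgb: HxWx3 rendered frame; scene_mask: HxW bool, True where robot/objects are visible."""
    bg = Image.open(random.choice(BACKGROUNDS)).convert("RGB").resize((rgb.shape[1], rgb.shape[0]))
    bg = np.asarray(bg)
    return np.where(scene_mask[..., None], rgb, bg)  # keep foreground pixels, swap the rest
```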
- The robot's initial joint configuration is `[4.7799, -2.0740, 2.6200, 3.0542, -1.5800, 1.4305e-05]` (in radians), with the gripper fully open.
- Place target objects 30–50 cm in front of the camera, making sure they are visible at the start.
- The camera is mounted on the Robotiq gripper using a custom 3D-printed bracket. It is essential that the gripper is visible in the camera view, with the camera angled roughly 17 degrees downward.
- Set the correct IP address for your UR10e robot in GdinoReachGraspEnv_servoJ.py#L86.
- Both `servoJ` and `moveJ` motion commands are supported; `servoJ` offers better performance for sim-to-real transfer (a minimal command sketch follows the demo details below).
- We capture at a camera resolution of 848 × 480 for best inference results and later rescale to 212 × 120 for the policy input.
- Because the reaching policy performs strongly, we hardcoded a pick-and-drop routine that triggers once the end-effector approaches the table: https://github.com/cherylwang20/Sim2Real_GCRL_UR10e/blob/3f6d3c6f44f698b062e058aac546f5c7d1629576/src/reachGrasp_env/GdinoReachGraspEnv_servoJ.py#L326. Comment out this block if you do not require this behavior.
- G.DINO Prompt: Green Apple
- Control: ServoJ
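For reference, here is a hedged sketch of moving the robot to the documented home configuration and streaming `servoJ` targets with the `ur_rtde` Python bindings; the repository's own robot interface in `GdinoReachGraspEnv_servoJ.py` may differ in parameters and structure.

```python
# Sketch only: home the UR10e, then stream joint targets with servoJ.
import time
import rtde_control

HOME_Q = [4.7799, -2.0740, 2.6200, 3.0542, -1.5800, 1.4305e-05]  # radians, gripper open

rtde_c = rtde_control.RTDEControlInterface("192.168.x.x")  # replace with your robot's IP
rtde_c.moveJ(HOME_Q, 0.5, 0.3)                             # coarse move to the start pose

# servoJ streams joint targets at a fixed rate (how policy actions are executed at deployment).
dt = 0.05
q = list(HOME_Q)
for _ in range(20):
    rtde_c.servoJ(q, 0.0, 0.0, dt, 0.1, 300)
    time.sleep(dt)
rtde_c.servoStop()
rtde_c.stopScript()
```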
```bibtex
@inproceedings{wang2024goalconditioned,
  title={Versatile and Generalizable Manipulation via Goal-Conditioned Reinforcement Learning with Grounded Object Detection},
  author={Huiyi Wang and Fahim Shahriar and Seyed Alireza Azimi and Gautham Vasan and A. Rupam Mahmood and Colin Bellinger},
  booktitle={CoRL 2024 Workshop on Minimalist Robot Learning (MRM-D)},
  year={2024},
  url={https://openreview.net/forum?id=TgXIkK8WPQ}
}
```