
GraspMAS: Zero-Shot Language-driven Grasp Detection with Multi-Agent System

Quang Nguyen, Tri Le, Huy Nguyen, Thieu Vo, Tung Ta, Baoru Huang, Minh Vu, Anh Nguyen

Website | arXiv

Introduction

In this paper, we introduce GraspMAS, a new multi-agent system framework for language-driven grasp detection. GraspMAS is designed to reason through ambiguities and improve decision-making in real-world scenarios.


Our method consistently produces more plausible grasp poses than existing methods.

Installation

Follow these steps to install the GraspMAS framework:

  1. Clone recursively:

    git clone --recurse-submodules https://github.com/Fsoft-AIC/GraspMAS.git
    cd GraspMAS
  2. OpenAI key: To run the GraspMAS framework, you will need an OpenAI API key. Sign up for an OpenAI account, then create a key under account/api-keys. Create a file named api.key in the root of this project and store the key in it (a quick sanity-check sketch follows these installation steps).

    echo YOUR_OPENAI_API_KEY_HERE > api.key
    
  3. Prepare environment:

     conda create -n graspmas python=3.9 -y
     conda activate graspmas
     conda install -c "nvidia/label/cuda-11.8.0" cuda-toolkit
     conda install pytorch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 pytorch-cuda=11.8 -c pytorch -c nvidia
     pip install -r requirements.txt
     cd detectron2
     pip install -e .
     cd ..
  4. Download pretrained model:

    bash download.sh
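
A quick way to verify the setup is a short check script. This is only a sketch and not part of the repository; it assumes the api.key file from step 2 and the PyTorch/CUDA stack from step 3 (GraspMAS itself reads the key through the --api-file flag shown in the Quickstart below).

    # sanity_check.py -- minimal environment check (not part of the repository)
    import os
    import torch

    # Load the OpenAI key created in step 2 and expose it the way the
    # standard openai client expects it.
    with open("api.key") as f:
        os.environ["OPENAI_API_KEY"] = f.read().strip()
    print("API key loaded:", bool(os.environ["OPENAI_API_KEY"]))

    # Confirm the PyTorch 2.2 / CUDA 11.8 stack from step 3 is visible.
    print("PyTorch:", torch.__version__)
    print("CUDA available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("GPU:", torch.cuda.get_device_name(0))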

Quickstart

  • You can start with the notebook simple_demo.ipynb for a simple inference demo. It includes detailed instructions and example queries with visualization. You can run either the complete closed-loop pipeline or the open-loop mode with the Coder only.
  • If you want to run inference on a single image, use the following:
python main_simple.py \
    --api-file "api.key" \
    --max-round 5 \
    --query "Grasp the knife at its handle" \
    --image-path PATH-TO-INPUT-IMAGE \
    --save-folder PATH-TO-SAVE-FOLDER
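
To run the same command over a folder of images, a small wrapper around this CLI is enough. The sketch below is not part of the repository and only reuses the flags shown above; the input and output folder names are placeholders.

    # batch_infer.py -- hypothetical wrapper around main_simple.py (not in the repo)
    import subprocess
    from pathlib import Path

    IMAGE_DIR = Path("examples/images")   # placeholder input folder
    SAVE_DIR = Path("outputs")            # placeholder output folder
    QUERY = "Grasp the knife at its handle"

    SAVE_DIR.mkdir(parents=True, exist_ok=True)

    for image_path in sorted(IMAGE_DIR.iterdir()):
        if image_path.suffix.lower() not in {".jpg", ".jpeg", ".png"}:
            continue
        # Reuses only the flags documented in the Quickstart command above.
        subprocess.run(
            [
                "python", "main_simple.py",
                "--api-file", "api.key",
                "--max-round", "5",
                "--query", QUERY,
                "--image-path", str(image_path),
                "--save-folder", str(SAVE_DIR / image_path.stem),
            ],
            check=True,
        )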

Configuration

If you want to customize tools, model hyperparameters, or configurations, please refer to image_patch.py. We have developed only the tools needed for language-driven grasp detection. The GraspMAS framework depends heavily on the effectiveness of the pretrained models used as tools, so results may be biased. Feel free to add or remove pretrained models for image or video processing, including more recent models. Note that some models, such as BLIP or large VLMs, may require significant GPU memory.
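
As an illustration, a new pretrained model can be exposed as an extra method on the ImagePatch-style class in image_patch.py. The sketch below is hypothetical: the class layout follows the ViperGPT convention, and the estimate_depth method and the Intel/dpt-large checkpoint are placeholder choices, not part of the current codebase.

    # Hypothetical addition to image_patch.py -- names below are placeholders.
    import torch
    from transformers import pipeline


    class ImagePatch:  # the real class in image_patch.py has many more tools
        def __init__(self, image):
            self.image = image  # PIL.Image of the cropped patch

        def estimate_depth(self):
            """Wrap an extra pretrained model as a new tool.

            Uses a Hugging Face depth-estimation pipeline as an example;
            heavier models may need significant GPU memory, as noted above.
            """
            depth_model = pipeline(
                "depth-estimation",
                model="Intel/dpt-large",  # placeholder checkpoint
                device=0 if torch.cuda.is_available() else -1,
            )
            return depth_model(self.image)["depth"]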

ManiSkill Demo with GraspMAS


We provide the demo notebook Maniskill_demo.ipynb for simulating language-driven grasp detection in the ManiSkill simulator. The simulation runs in a tabletop environment with a Panda robot arm equipped with a wrist camera.

The notebook includes detailed instructions on how to:

  • Initialize the tabletop environment
  • Visualize observations
  • Use GraspMAS to generate grasp pose rectangles
  • Convert them to 6-DoF grasp poses (a generic conversion sketch follows this list)
  • Manipulate the robot arm accordingly
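
The rectangle-to-6-DoF conversion can be summarized as follows. This is a generic sketch under simplifying assumptions (a top-down grasp, a depth image aligned with the RGB frame, and known camera intrinsics K); it is not the exact code used in Maniskill_demo.ipynb.

    # Generic sketch: grasp rectangle (cx, cy, w, h, theta) -> 6-DoF grasp pose.
    # Assumes a top-down grasp, aligned RGB-D images, and known intrinsics K.
    import numpy as np


    def rect_to_6dof(rect, depth, K):
        """rect = (cx, cy, w, h, theta) in pixels/radians; depth in meters."""
        cx, cy, w, h, theta = rect

        # Back-project the rectangle center to a 3-D point in the camera frame.
        z = depth[int(round(cy)), int(round(cx))]
        x = (cx - K[0, 2]) * z / K[0, 0]
        y = (cy - K[1, 2]) * z / K[1, 1]
        position = np.array([x, y, z])

        # Top-down approach: the gripper closes along the rectangle's short
        # side, so the in-plane angle theta sets the yaw about the view axis.
        c, s = np.cos(theta), np.sin(theta)
        rotation = np.array([
            [c, -s, 0.0],
            [s,  c, 0.0],
            [0.0, 0.0, 1.0],
        ])

        # Rectangle width in pixels maps to the gripper opening; scale by
        # z / fx to convert to meters.
        gripper_width = w * z / K[0, 0]
        return position, rotation, gripper_width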

Note

This is a research project, so the code may not be optimized, regularly updated, or actively maintained.

Citation

If you find our work useful for your research, please cite:

@inproceedings{nguyen2025graspmas,
      title = {GraspMAS: Zero-Shot Language-driven Grasp Detection with Multi-Agent System},
      author = {Nguyen, Quang and Le, Tri and Nguyen, Huy and Vo, Thieu and Ta, Tung D and Huang, Baoru and Vu, Minh N and Nguyen, Anh},
      booktitle = {IROS},
      year      = {2025}
  }

Acknowledgement

We thank the authors of ViperGPT and ViperDuality for their valuable work, which inspired and enabled this research.
