
Can Pretrained Vision-Language Embeddings Alone Guide Robot Navigation?


Repository: text2nav

Accepted to Robotics: Science and Systems (RSS) 2025 Workshop on Robot Planning in the Era of Foundation Models (FM4RoboPlan)

๐Ÿ“ Overview

This repository contains the implementation for our research investigating whether frozen vision-language model embeddings can guide robot navigation without fine-tuning or specialized architectures. We present a minimalist framework that achieves 74% success rate in language-guided navigation using only pretrained SigLIP embeddings.

🎯 Key Findings

  • 🎯 74% success rate using frozen VLM embeddings alone (vs. 100% for the privileged expert)
  • 🔍 3.2× longer paths compared to the privileged expert, revealing efficiency limitations
  • 📊 SigLIP outperforms CLIP and ViLT for navigation tasks (74% vs. 62% vs. 40%)
  • ⚖️ Clear performance-complexity tradeoffs for resource-constrained applications
  • 🧠 Strong semantic grounding but limitations in spatial reasoning and planning

🚀 Method

Our approach consists of two phases:

  1. Expert Demonstration Phase: Train a privileged policy with full state access using PPO
  2. Behavioral Cloning Phase: Distill expert knowledge into a policy using only frozen VLM embeddings

The key insight is using frozen vision-language embeddings as drop-in representations without any fine-tuning, providing an empirical baseline for understanding foundation model capabilities in embodied tasks.
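
To make the second phase concrete, below is a minimal sketch of what distillation into a frozen-embedding policy can look like, assuming the image and text embeddings are simply concatenated and fed to a small MLP trained with an MSE behavioral-cloning loss on continuous actions. The dimensions, architecture, and action space are illustrative assumptions, not the exact configuration used in the paper.

import torch
import torch.nn as nn

# Illustrative sketch only: a small MLP policy head over frozen VLM embeddings.
# Layer sizes, action dimensionality, and the concatenation scheme are assumptions.
class EmbeddingBCPolicy(nn.Module):
    def __init__(self, img_dim=1152, txt_dim=1152, action_dim=2, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, img_emb, txt_emb):
        # The VLM encoder stays frozen; only this head is trained.
        return self.net(torch.cat([img_emb, txt_emb], dim=-1))

policy = EmbeddingBCPolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# One behavioral-cloning step on a placeholder batch of expert data.
img_emb = torch.randn(32, 1152)      # frozen SigLIP image embeddings
txt_emb = torch.randn(32, 1152)      # frozen SigLIP text embeddings
expert_action = torch.randn(32, 2)   # expert actions (e.g., linear/angular velocity)

loss = nn.functional.mse_loss(policy(img_emb, txt_emb), expert_action)
optimizer.zero_grad()
loss.backward()
optimizer.step()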

๐Ÿ› ๏ธ Installation

Prerequisites

  • Python 3.8+
  • NVIDIA Isaac Sim/Isaac Lab
  • PyTorch
  • CUDA-compatible GPU

Setup

git clone https://github.com/oadamharoon/text2nav.git
cd text2nav

# Install dependencies
pip install torch torchvision
pip install transformers
pip install numpy matplotlib
pip install gymnasium

# For Isaac Lab simulation (follow official installation guide)
# https://isaac-sim.github.io/IsaacLab/

๐Ÿ“ Repository Structure

text2nav/
├── CITATION.cff                 # Citation information
├── LICENSE                      # MIT License
├── README.md                    # This documentation
├── IsaacLab/                    # Isaac Lab simulation environment setup
├── embeddings/                  # Vision-language embedding generation
├── rl/                          # Reinforcement learning expert training
├── generate_embeddings.ipynb    # Generate VLM embeddings from demonstrations
├── revised_gen_embed.ipynb      # Revised embedding generation
├── train_offline.py             # Behavioral cloning training script
├── offlin_train.py              # Alternative offline training
├── bc_model.pt                  # Trained behavioral cloning model
├── td3_bc_model.pt              # TD3+BC baseline model
├── habitat_test.ipynb           # Testing and evaluation notebook
└── replay_buffer.py             # Data handling utilities

🎮 Usage

1. Expert Demonstration Collection

cd rl/
python train_expert.py --env isaac_sim --num_episodes 500

2. Generate VLM Embeddings

jupyter notebook generate_embeddings.ipynb
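
At a high level, the embedding step can be sketched with Hugging Face transformers as shown below. The checkpoint name, image file, and instruction string are assumptions (the so400m SigLIP variant is chosen here because it produces the 1152-D embeddings referenced elsewhere in this README), not necessarily what the notebook uses.

import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

# Assumed checkpoint: the so400m SigLIP variant yields 1152-D embeddings.
ckpt = "google/siglip-so400m-patch14-384"
model = AutoModel.from_pretrained(ckpt).eval()
processor = AutoProcessor.from_pretrained(ckpt)

image = Image.open("frame.png")             # hypothetical RGB observation from the robot camera
instruction = "navigate to the red sphere"  # illustrative language instruction

with torch.no_grad():
    img_inputs = processor(images=image, return_tensors="pt")
    txt_inputs = processor(text=[instruction], padding="max_length", return_tensors="pt")
    img_emb = model.get_image_features(**img_inputs)  # shape (1, 1152)
    txt_emb = model.get_text_features(**txt_inputs)   # shape (1, 1152)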

3. Train Navigation Policy

python train_offline.py --model siglip --embedding_dim 1152 --batch_size 32

📊 Results

Model          Success Rate (%)   Avg Steps   Relative Path Length (vs. expert)
Expert (π_β)   100.0              113.97      1.0×
SigLIP         74.0               369.4       3.2×
CLIP           62.0               417.6       3.7×
ViLT           40.0               472.0       4.1×

Relative path length is the ratio of average steps to the expert's (e.g., 369.4 / 113.97 ≈ 3.2× for SigLIP).

🔬 Experimental Setup

  • Environment: 3 m × 3 m arena in NVIDIA Isaac Sim
  • Robot: NVIDIA JetBot with RGB camera (256×256)
  • Task: Navigate to colored spheres based on language instructions
  • Targets: 5 colored spheres (red, green, blue, yellow, pink)
  • Success Criteria: Reach within 0.1 m of the correct target (see the sketch after this list)
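
The success criterion above reduces to a simple distance threshold; a minimal sketch, assuming planar (x, y) positions in meters:

import numpy as np

# Hedged illustration of the 0.1 m success check; assumes planar positions in meters.
def reached_target(robot_xy, target_xy, threshold=0.1):
    return float(np.linalg.norm(np.asarray(robot_xy) - np.asarray(target_xy))) < threshold

print(reached_target((0.0, 0.0), (0.05, 0.05)))  # True: within 0.1 m of the target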

💡 Key Insights

  1. Semantic Grounding: Pretrained VLMs excel at connecting language descriptions to visual observations
  2. Spatial Limitations: Frozen embeddings struggle with long-horizon planning and spatial reasoning
  3. Prompt Engineering: Including relative spatial cues significantly improves performance (an illustrative template follows this list)
  4. Embedding Dimensionality: Higher-dimensional embeddings (SigLIP: 1152D) outperform lower-dimensional ones
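
As an illustration of the prompt-engineering point above, a template with a relative spatial cue might look like the following; the wording is hypothetical and not the exact prompt format used in the experiments.

# Hypothetical prompt templates; the exact phrasing used in the paper may differ.
base_prompt = "navigate to the {color} sphere"
spatial_prompt = "navigate to the {color} sphere on your {direction}"

print(base_prompt.format(color="red"))
print(spatial_prompt.format(color="red", direction="left"))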

🔮 Future Work

  • Hybrid architectures combining frozen embeddings with lightweight spatial memory
  • Data-efficient adaptation techniques to bridge the efficiency gap
  • Testing in more complex environments with obstacles and natural language variation
  • Integration with world models for better spatial reasoning

📚 Citation

@misc{subedi2025pretrainedvisionlanguageembeddingsguide,
      title={Can Pretrained Vision-Language Embeddings Alone Guide Robot Navigation?}, 
      author={Nitesh Subedi and Adam Haroon and Shreyan Ganguly and Samuel T. K. Tetteh and Prajwal Koirala and Cody Fleming and Soumik Sarkar},
      year={2025},
      eprint={2506.14507},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2506.14507}, 
}

๐Ÿ™ Acknowledgments

This work is funded by NSF-USDA COALESCE grant #2021-67021-34418. Special thanks to the Iowa State University Mechanical Engineering Department for their support.

👥 Contributors

*Equal contribution

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🔗 Links


For questions or issues, please open a GitHub issue or contact the authors.
