Repository: text2nav
Accepted to Robotics: Science and Systems (RSS) 2025 Workshop on Robot Planning in the Era of Foundation Models (FM4RoboPlan)
This repository contains the implementation for our research investigating whether frozen vision-language model (VLM) embeddings can guide robot navigation without fine-tuning or specialized architectures. We present a minimalist framework that achieves a 74% success rate in language-guided navigation using only pretrained SigLIP embeddings.
- 74% success rate using frozen VLM embeddings alone (vs. 100% for the privileged expert)
- 3.2× longer paths than the privileged expert, revealing efficiency limitations
- SigLIP outperforms CLIP and ViLT for navigation tasks (74% vs. 62% vs. 40% success)
- Clear performance-complexity trade-offs for resource-constrained applications
- Strong semantic grounding, but limitations in spatial reasoning and planning
Our approach consists of two phases:
- Expert Demonstration Phase: Train a privileged policy with full state access using PPO
- Behavioral Cloning Phase: Distill expert knowledge into a policy using only frozen VLM embeddings
The key insight is to use frozen vision-language embeddings as drop-in representations, without any fine-tuning, providing an empirical baseline for understanding foundation model capabilities in embodied tasks.
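As a rough illustration of the behavioral cloning phase, the sketch below regresses a small MLP onto expert actions from concatenated frozen image and text embeddings. The class name, layer sizes, action dimensionality, and MSE loss are illustrative assumptions rather than the exact configuration in `train_offline.py`.

```python
import torch
import torch.nn as nn

# Illustrative BC policy: maps frozen (image, text) embeddings to an action.
# Dimensions and the continuous-action MSE loss are assumptions for this sketch.
class BCPolicy(nn.Module):
    def __init__(self, embedding_dim=1152, action_dim=2, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * embedding_dim, hidden_dim),  # concatenated image + text embeddings
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, image_emb, text_emb):
        return self.net(torch.cat([image_emb, text_emb], dim=-1))


def bc_update(policy, optimizer, image_emb, text_emb, expert_action):
    """One behavioral-cloning step: regress the policy output onto the expert action."""
    pred = policy(image_emb, text_emb)
    loss = nn.functional.mse_loss(pred, expert_action)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```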
- Python 3.8+
- NVIDIA Isaac Sim/Isaac Lab
- PyTorch
- CUDA-compatible GPU
```bash
git clone https://github.com/oadamharoon/text2nav.git
cd text2nav

# Install dependencies
pip install torch torchvision
pip install transformers
pip install numpy matplotlib
pip install gymnasium

# For Isaac Lab simulation, follow the official installation guide:
# https://isaac-sim.github.io/IsaacLab/
```
```
text2nav/
├── CITATION.cff              # Citation information
├── LICENSE                   # MIT License
├── README.md                 # This documentation
├── IsaacLab/                 # Isaac Lab simulation environment setup
├── embeddings/               # Vision-language embedding generation
├── rl/                       # Reinforcement learning expert training
├── generate_embeddings.ipynb # Generate VLM embeddings from demonstrations
├── revised_gen_embed.ipynb   # Revised embedding generation
├── train_offline.py          # Behavioral cloning training script
├── offlin_train.py           # Alternative offline training
├── bc_model.pt               # Trained behavioral cloning model
├── td3_bc_model.pt           # TD3+BC baseline model
├── habitat_test.ipynb        # Testing and evaluation notebook
└── replay_buffer.py          # Data handling utilities
```
```bash
cd rl/
python train_expert.py --env isaac_sim --num_episodes 500
```
```bash
jupyter notebook generate_embeddings.ipynb
```
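For reference, generating frozen SigLIP embeddings with Hugging Face `transformers` generally follows the pattern sketched below. The checkpoint name is an assumption, chosen because its 1152-D embeddings match the dimensionality reported in the paper; the notebooks define the actual preprocessing and storage format.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

# Assumed checkpoint: google/siglip-so400m-patch14-384 produces 1152-D embeddings,
# matching the --embedding_dim 1152 flag used for training below.
MODEL_NAME = "google/siglip-so400m-patch14-384"
processor = AutoProcessor.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME).eval()  # weights stay frozen throughout

@torch.no_grad()
def embed_observation(image: Image.Image, instruction: str):
    """Return frozen image and text embeddings for one camera frame and one instruction."""
    image_inputs = processor(images=image, return_tensors="pt")
    text_inputs = processor(text=[instruction], padding="max_length", return_tensors="pt")
    image_emb = model.get_image_features(**image_inputs)  # shape: (1, 1152)
    text_emb = model.get_text_features(**text_inputs)     # shape: (1, 1152)
    return image_emb, text_emb
```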
```bash
python train_offline.py --model siglip --embedding_dim 1152 --batch_size 32
```
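As a self-contained toy version of what such an offline training run does, the loop below fits a small MLP to stored (embedding, expert action) pairs; random tensors stand in for the real demonstration data, and the architecture and hyperparameters are placeholders.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for offline BC training: random tensors replace the stored
# (frozen embedding, expert action) pairs produced in the previous steps.
embedding_dim, action_dim, num_demos = 1152, 2, 1024
inputs = torch.randn(num_demos, 2 * embedding_dim)   # concatenated image + text embeddings
expert_actions = torch.randn(num_demos, action_dim)  # placeholder expert actions

loader = DataLoader(TensorDataset(inputs, expert_actions), batch_size=32, shuffle=True)

policy = nn.Sequential(
    nn.Linear(2 * embedding_dim, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, action_dim),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

for epoch in range(5):
    for batch_inputs, batch_actions in loader:
        loss = nn.functional.mse_loss(policy(batch_inputs), batch_actions)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.4f}")
```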
| Model | Success Rate (%) | Avg. Steps | Efficiency (× expert steps) |
|---|---|---|---|
| Expert (πβ) | 100.0 | 113.97 | 1.0× |
| SigLIP | 74.0 | 369.4 | 3.2× |
| CLIP | 62.0 | 417.6 | 3.7× |
| ViLT | 40.0 | 472.0 | 4.1× |
- Environment: 3m × 3m arena in NVIDIA Isaac Sim
- Robot: NVIDIA JetBot with RGB camera (256×256)
- Task: Navigate to colored spheres based on language instructions
- Targets: 5 colored spheres (red, green, blue, yellow, pink)
- Success Criteria: Reach within 0.1m of the correct target
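As a concrete reading of the success criterion, a minimal check might look like the snippet below; the planar Euclidean distance and coordinate convention are assumptions.

```python
import numpy as np

# Illustrative success check for the stated criterion: the episode succeeds if the
# robot ends within 0.1 m of the correct target sphere (assumed planar distance).
def is_success(robot_xy: np.ndarray, target_xy: np.ndarray, threshold: float = 0.1) -> bool:
    return float(np.linalg.norm(robot_xy - target_xy)) < threshold

print(is_success(np.array([1.20, 0.95]), np.array([1.25, 1.00])))  # True: ~0.07 m apart
```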
- Semantic Grounding: Pretrained VLMs excel at connecting language descriptions to visual observations
- Spatial Limitations: Frozen embeddings struggle with long-horizon planning and spatial reasoning
- Prompt Engineering: Including relative spatial cues significantly improves performance
- Embedding Dimensionality: Higher-dimensional embeddings (SigLIP: 1152D) outperform lower-dimensional ones
- Hybrid architectures combining frozen embeddings with lightweight spatial memory
- Data-efficient adaptation techniques to bridge the efficiency gap
- Testing in more complex environments with obstacles and natural language variation
- Integration with world models for better spatial reasoning
```bibtex
@misc{subedi2025pretrainedvisionlanguageembeddingsguide,
  title={Can Pretrained Vision-Language Embeddings Alone Guide Robot Navigation?},
  author={Nitesh Subedi and Adam Haroon and Shreyan Ganguly and Samuel T. K. Tetteh and Prajwal Koirala and Cody Fleming and Soumik Sarkar},
  year={2025},
  eprint={2506.14507},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2506.14507},
}
```
This work is funded by NSF-USDA COALESCE grant #2021-67021-34418. Special thanks to the Iowa State University Mechanical Engineering Department for their support.
- Nitesh Subedi* (Iowa State University)
- Adam Haroon* (Iowa State University)
- Shreyan Ganguly (Iowa State University)
- Samuel T.K. Tetteh (Iowa State University)
- Prajwal Koirala (Iowa State University)
- Cody Fleming (Iowa State University)
- Soumik Sarkar (Iowa State University)
*Equal contribution
This project is licensed under the MIT License - see the LICENSE file for details.
For questions or issues, please open a GitHub issue or contact the authors.