This repository contains:
- the implementation of navigation agents for our paper: Multimodal Text Style Transfer for Outdoor Vision-and-Language Navigation;
- a dataset for pre-training on the outdoor VLN task.
In this project, we use the Touchdown dataset and the StreetLearn dataset. More details regarding these two datasets can be found here.
Our pre-training dataset is built upon StreetLearn.
The guiding instructions for the outdoor VLN task are provided in touchdown/datasets/.
To download the panoramas, please refer to Touchdown Dataset and StreetLearn Dataset.
- Python 3.6
- PyTorch 1.7.0
- Texar
We conducted our experiments on Ubuntu 18.04 with Titan RTX GPUs.
Please run the following lines to download the code and install Texar:
git clone https://github.com/VegB/VLN-Transformer/
cd VLN-Transformer/
pip install [--user] -e . # install Texar
cd touchdown/
Training can be performed with the following command:
python main.py --dataset [DATASET] --img_feat_dir [IMG_DIR] --model [MODEL] --exp_name [EXP_NAME]
- DATASET is the dataset for outdoor navigation. This repo currently supports the following three datasets:
  - touchdown is a dataset for outdoor VLN; the instructions are written by human annotators;
  - manh50 is a subset of StreetLearn; the instructions are generated by the Google Maps API;
  - manh50_mask has the same trajectories as manh50, but the instructions are style-modified (which is what we do in this paper).
- IMG_DIR contains the encoded panoramas for DATASET. After you get access to the panoramas, please encode them accordingly. Each file in this directory should be a numpy file [PANO_ID].npy that represents the panorama with the corresponding pano_id. The encoding process is described in the Touchdown paper, Section D.1.
- MODEL is the navigation agent; it may be rconcat for RCONCAT or vlntrans for VLN Transformer.
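Since every panorama needs its own [PANO_ID].npy file in IMG_DIR, it can help to check the directory before starting a long training run. The following is a minimal sketch, not part of this repo: the helper name `missing_feature_files` and the example pano_ids are hypothetical, and it only checks that the files exist, not what they contain.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def missing_feature_files(img_dir: str, pano_ids: Iterable[str]) -> List[str]:
    """Return the pano_ids that have no [PANO_ID].npy file in img_dir."""
    root = Path(img_dir)
    return [pid for pid in pano_ids if not (root / f"{pid}.npy").is_file()]


if __name__ == "__main__":
    # Hypothetical pano_ids, used only to demonstrate the check.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "pano_a.npy").touch()
        print(missing_feature_files(tmp, ["pano_a", "pano_b"]))  # ['pano_b']
```

In practice the list of pano_ids would come from the trajectories of the chosen DATASET; how to extract them depends on the dataset files and is left out here.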
More parameters and usage are listed here.
Note that vlntrans uses BERT (bert-base-uncased) to encode the instruction, which takes a lot of GPU memory,
so you may need to adjust the batch size to fit the model on your GPU.
In our experiments, we use three Titan RTX GPUs and a batch size of 30.
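Because the right batch size depends on the GPUs available, one option is to assemble the training command in a small script where the batch size and visible GPUs are explicit parameters. This is a sketch under assumptions: `build_train_command` is a hypothetical helper, not part of this repo, and it only uses the `main.py` flags that appear in this README.

```python
import os
import shlex
from typing import Dict, List, Optional, Sequence, Tuple


def build_train_command(
    dataset: str,
    img_feat_dir: str,
    model: str,
    exp_name: str,
    batch_size: Optional[int] = None,
    gpus: Optional[Sequence[int]] = None,
) -> Tuple[List[str], Dict[str, str]]:
    """Assemble argv and environment for `python main.py` (to be run in touchdown/)."""
    argv = [
        "python", "main.py",
        "--dataset", dataset,
        "--img_feat_dir", img_feat_dir,
        "--model", model,
        "--exp_name", exp_name,
    ]
    if batch_size is not None:
        argv += ["--batch_size", str(batch_size)]
    env = dict(os.environ)
    if gpus is not None:
        # Restrict the process to the listed GPU indices.
        env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpus)
    return argv, env


if __name__ == "__main__":
    argv, env = build_train_command(
        "manh50_mask", "/data/manh50_features_mean/", "vlntrans",
        "pretrain_mask", batch_size=30, gpus=[0, 1, 2],
    )
    print(" ".join(shlex.quote(a) for a in argv))
    # To launch: subprocess.run(argv, env=env, check=True, cwd="touchdown")
```

If the model does not fit in memory, lowering `batch_size` in this call is the only change needed.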
This is the command we use to pretrain VLN Transformer on our instruction-style-modified dataset:
CUDA_VISIBLE_DEVICES="0,1,2" python main.py --dataset 'manh50_mask' --img_feat_dir '/data/manh50_features_mean/' --model 'vlntrans' --batch_size 30 --max_num_epochs 15 --exp_name 'pretrain_mask'
We can finetune the VLN agent on pre-trained models.
python main.py --dataset [DATASET] --img_feat_dir [IMG_DIR] --model [MODEL] --resume_from [PRETRAINED_MODEL] --resume [RESUME_OPTION]
- PRETRAINED_MODEL specifies the pre-trained model;
- RESUME_OPTION specifies the checkpoint:
  - latest: the most recent ckpt;
  - TC_best: the ckpt with the best TC score on the dev set;
  - SPD_best: the ckpt with the best SPD score on the dev set.
We can evaluate the agent's navigation performance on the test set and dev set with the following command:
python main.py --test True --dataset [DATASET] --img_feat_dir [IMG_DIR] --model [MODEL] --resume_from [PRETRAINED_MODEL] --resume [RESUME_OPTION] --CLS [True/False] --DTW [True/False]
The pre-trained models for VLN Transformer, RCONCAT and GA can be downloaded from here.
Please place them in checkpoints/.
To reproduce the results in our paper, please use the following commands:
CUDA_VISIBLE_DEVICES="0" python main.py --test True --dataset 'touchdown' --img_feat_dir [IMG_DIR] --model 'rconcat' --resume_from [PRETRAINED_MODEL] --resume 'TC_best' --CLS True --DTW True
CUDA_VISIBLE_DEVICES="1" python main.py --test True --dataset 'touchdown' --img_feat_dir [IMG_DIR] --model 'ga' --resume_from [PRETRAINED_MODEL] --resume 'TC_best' --CLS True --DTW True
CUDA_VISIBLE_DEVICES="2" python main.py --test True --dataset 'touchdown' --img_feat_dir [IMG_DIR] --model 'vlntrans' --batch_size 30 --resume_from [PRETRAINED_MODEL] --resume 'TC_best' --CLS True --DTW True
- PRETRAINED_MODEL specifies the pre-trained model:
  - vanilla: Navigation agent trained on the touchdown dataset without pre-training on auxiliary datasets.
  - finetuned_manh50: Pre-trained on the manh50 dataset and finetuned on the touchdown dataset.
  - finetuned_mask: Pre-trained on the manh50_mask dataset and finetuned on the touchdown dataset.
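To evaluate one agent against all three checkpoint variants, the commands above can be generated in a loop. The sketch below is illustrative only: `build_eval_command` is a hypothetical helper, and passing the variant names directly to `--resume_from` is an assumption about how the checkpoints in checkpoints/ are addressed. It also rejects any RESUME_OPTION other than the three documented values.

```python
import shlex
from typing import List

# Variant names from this README; using them verbatim as --resume_from
# values is an assumption.
CHECKPOINT_VARIANTS = ("vanilla", "finetuned_manh50", "finetuned_mask")
RESUME_OPTIONS = ("latest", "TC_best", "SPD_best")


def build_eval_command(
    model: str,
    img_feat_dir: str,
    pretrained_model: str,
    resume: str = "TC_best",
) -> List[str]:
    """Assemble the argv for evaluating one checkpoint on touchdown."""
    if resume not in RESUME_OPTIONS:
        raise ValueError(f"resume must be one of {RESUME_OPTIONS}, got {resume!r}")
    return [
        "python", "main.py", "--test", "True",
        "--dataset", "touchdown",
        "--img_feat_dir", img_feat_dir,
        "--model", model,
        "--resume_from", pretrained_model,
        "--resume", resume,
        "--CLS", "True", "--DTW", "True",
    ]


if __name__ == "__main__":
    # "/path/to/features" is a placeholder for IMG_DIR.
    for name in CHECKPOINT_VARIANTS:
        argv = build_eval_command("vlntrans", "/path/to/features", name)
        print(" ".join(shlex.quote(a) for a in argv))
```

For vlntrans, the command in this README additionally passes `--batch_size 30`; that flag would need to be appended to the generated argv.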
@misc{zhu2020multimodal,
title={Multimodal Text Style Transfer for Outdoor Vision-and-Language Navigation},
author={Wanrong Zhu and Xin Wang and Tsu-Jui Fu and An Yan and Pradyumna Narayana and Kazoo Sone and Sugato Basu and William Yang Wang},
year={2020},
eprint={2007.00229},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
The code and data can't be built without streetlearn, speaker_follower, touchdown, and Texar. We also thank @Jiannan Xiang for his contribution in reproducing the Touchdown task.