SAFE: Multitask Failure Detection for Vision-Language-Action Models

Preprint

Project Page | Paper | ArXiv

Qiao Gu, Yuanliang Ju, Shengxiang Sun, Igor Gilitschenski, Haruki Nishimura, Masha Itkina, Florian Shkurti

We introduce the multitask failure detection problem for VLA models and propose SAFE, a failure detector that detects failures on unseen tasks zero-shot and achieves state-of-the-art performance. This repo contains the implementation of SAFE.

Generate rollouts from VLA models

Please use the following repos, which contain adapted code for running VLA models in simulated environments and generating rollouts for failure detection. Detailed instructions are in the README file of each repo.

openvla for OpenVLA model on the LIBERO benchmark.
openpi for pi0 and pi0-FAST models on the LIBERO benchmark.
open-pi-zero for pi0* models on the SimplerEnv benchmark.

After generating the rollouts, copy setup_envs.bash.template to setup_envs.bash and edit the environment variables in it so that they point to the locations of the generated rollouts.

cp setup_envs.bash.template setup_envs.bash

# TODO: Please edit the setup_envs.bash file to set the environment variables
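Before launching training, it can help to confirm that the variables exported by setup_envs.bash actually point at the generated rollouts. The sketch below is a hypothetical helper, not part of this repo. The variable names `OPENVLA_ROLLOUT_ROOT` and `OPENPI_ROLLOUT_ROOT` are placeholders chosen for illustration, so substitute the names that appear in your own setup_envs.bash.

```python
import os
from pathlib import Path


def check_rollout_dirs(var_names, environ=None):
    """Return {variable name: problem description} for each misconfigured variable."""
    environ = os.environ if environ is None else environ
    problems = {}
    for name in var_names:
        value = environ.get(name)
        if not value:
            # Covers both an unset variable and an empty string.
            problems[name] = "not set"
        elif not Path(value).is_dir():
            problems[name] = f"not a directory: {value}"
    return problems


if __name__ == "__main__":
    # Placeholder names (assumptions); replace with those in your setup_envs.bash.
    names = ["OPENVLA_ROLLOUT_ROOT", "OPENPI_ROLLOUT_ROOT"]
    for name, problem in check_rollout_dirs(names).items():
        print(f"{name}: {problem}")
```

Run it in the same shell after `source setup_envs.bash`. If every variable points at an existing directory, it prints nothing.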

Train and evaluate SAFE and baseline failure detectors

Setup

git clone git@github.com:vla-safe/SAFE.git

# Create a new conda environment (or use another virtual environment manager)
conda create -n vla-safe python=3.10
conda activate vla-safe

# Install pytorch (the newest version should be fine)
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

# Install other required packages
pip install pandas scipy pyyaml tqdm "imageio[ffmpeg]" hydra-core omegaconf scikit-learn opencv_python einops wandb plotly matplotlib natsort flask

# Log in to your wandb account
wandb login

# Install this codebase as a package, from the root directory of this repo
cd SAFE
pip install -e .
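After installation, a quick import check can catch a package that failed to install. The following is a minimal sketch rather than a script shipped with this repo. It lists the top-level import names for the packages installed above and reports any that the current environment cannot find.

```python
import importlib.util

# Import names for the packages installed above. Several differ from their
# pip names: pyyaml -> yaml, hydra-core -> hydra, scikit-learn -> sklearn,
# opencv_python -> cv2.
REQUIRED_MODULES = [
    "torch", "torchvision", "torchaudio", "pandas", "scipy", "yaml",
    "tqdm", "imageio", "hydra", "omegaconf", "sklearn", "cv2", "einops",
    "wandb", "plotly", "matplotlib", "natsort", "flask",
]


def missing_modules(module_names):
    """Return the top-level modules that cannot be found in this environment."""
    return [m for m in module_names if importlib.util.find_spec(m) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

The check only locates each module without importing it, so it stays fast even for heavy packages such as torch. It does not verify versions or CUDA availability.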

Training and evaluation

Please see the following file for the training and evaluation scripts for the SAFE failure detector and all baselines.

Aggregate and plot metrics

The script scripts/get_wandb_metrics.py pulls the evaluation metrics from WandB, aggregates them, and saves them to CSV files, which should reproduce the results in Table 1 of the paper. You can run the script as follows:

python scripts/get_wandb_metrics.py
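To make the aggregation step concrete, here is a minimal sketch of the kind of grouping involved: averaging one metric over the runs of each method and rendering the result as CSV. It is an illustration only and does not reproduce the logic of scripts/get_wandb_metrics.py. The record fields (`method`, `seed`, `roc_auc`), the method names, and the numbers are all made up, and it uses in-memory records instead of the WandB API.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def aggregate_by_method(records, metric):
    """Average `metric` over all runs that share the same method name."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["method"]].append(record[metric])
    return {method: mean(values) for method, values in sorted(grouped.items())}


def to_csv(summary, metric):
    """Render the aggregated summary as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["method", f"mean_{metric}"])
    for method, value in summary.items():
        writer.writerow([method, f"{value:.3f}"])
    return buffer.getvalue()


if __name__ == "__main__":
    # Made-up numbers for illustration only; not results from the paper.
    runs = [
        {"method": "detector_a", "seed": 0, "roc_auc": 0.70},
        {"method": "detector_a", "seed": 1, "roc_auc": 0.80},
        {"method": "detector_b", "seed": 0, "roc_auc": 0.60},
    ]
    print(to_csv(aggregate_by_method(runs, "roc_auc"), "roc_auc"))
```

Grouping before averaging keeps one row per method, which is the shape a results table needs.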

Other useful scripts are as follows:

# To generate plots as shown in Figure 1 and Figure 7
python scripts/visualize_features.py

# To generate plots as shown in Figure 8
python scripts/eval_conformal_figure.py

Acknowledgements

The SAFE project and this codebase are inspired by and built on the following repos:

Reference

Please cite our work if you find it useful:

@article{gu2025safe,
  title={SAFE: Multitask Failure Detection for Vision-Language-Action Models},
  author={Gu, Qiao and Ju, Yuanliang and Sun, Shengxiang and Gilitschenski, Igor and Nishimura, Haruki and Itkina, Masha and Shkurti, Florian},
  journal={arXiv preprint arXiv:2506.09937},
  year={2025}
}
