Created by Jacob Phillips
ARES (Automatic Robot Evaluation System) is an open-source (Apache 2.0) platform for automatically ingesting, curating, and evaluating robot data with ML models to quickly and accurately understand policy performance, identify areas for improvement, and generate new robot datasets, all without setting up heavy infrastructure. Researchers tend to chase point solutions for specific tasks or paper implementations; ARES is instead designed as a generalized platform for long-term robot research that scales from a laptop to the cloud. ARES is built to be simple and scalable, with a special focus on ease of use. All computation and model inference can be run through local resources or cloud APIs via model providers like OpenAI, Anthropic, Gemini, Modal, Replicate, etc., requiring only a credit card for access - no complex cloud infrastructure or GPU setup needed.
At a high level, ARES is composed of three main components:
- Ingestion: automatically transform raw robot data into a structured format with VLMs
- Annotation: annotate the rollouts with pseudo-labels for downstream tasks
- Curation and Modeling: understand data distributions and select data for training or evaluation
ARES is a platform for understanding robot data, targeted at robot developers and researchers. Researchers and developers suffer from two main problems: building point solutions for specific tasks or papers, and transitioning from research scripts to production tools. ARES aims to solve both of these problems by providing a platform for robot data understanding, enabling rapid development of new robot systems.
You can use ARES to:
- Curate and annotate ground-truth teleoperation data
- Evaluate the performance of robot models
- Analyze batches of robot data to improve policies
- 🛠️ Stack
- 🧠 Installation and Setup
- 👤 Configurations
- 📊 Data
- 📥 Ingestion and Annotation
- 🔍 Curation and Analysis
- 🤖 Training and Export
- 📈 Evaluation
- 🧰 Case Studies
- 🎉 Extras
- 💰 Costs
- 🔮 Limitations and Next Steps
- 📝 Acknowledgements and Citation
ARES is built to be a low-friction, easy-to-use infrastructure platform for robot research. As such, we select tools that are simple to set up locally but also smooth to scale to cloud-level resources.
- Databases: MongoDB, SQLAlchemy, FAISS
- Model Inference: litellm, configured with your choice of model provider (OpenAI, Anthropic, Gemini, HuggingFace, etc.)
- Compute Orchestration: Modal
- Frontend: Streamlit, Plotly
- Development: Docker, Cursor, VSCode, Pydantic
First, clone the repository:
git clone https://github.com/jacobphillips99/ares.git
cd ares
If you want to use ARES as a package in your own project, follow the instructions below to install the requirements and the package (add -e to install in editable mode so you can make changes). Otherwise, continue to the Docker and Devcontainer Setup section to run ARES locally.
pip install -r requirements.txt
pip install -e .
ARES was built with a Docker container in mind for easy setup, containerization, and deployment -- this makes it quick, easy, and consistent to set up ARES. We recommend using a Docker container to run ARES, specifically via the VSCode/Cursor devcontainer extension. The Dockerfile contains the necessary system packages for ARES, including a reference to the requirements.txt file for Python dependencies. Follow the steps below to set up Docker Desktop and the VSCode/Cursor devcontainer extension.
Go to the Docker Desktop website and download the latest version. Depending on your operating system, set up the Docker Desktop application.
Download and install the IDE of your choice. We recommend VSCode or Cursor. Open the ARES directory in the IDE (using File > Open Folder...). Then use the Command Palette to re-open the folder in the devcontainer via the Dev Containers: Reopen in Container command. This should build the Docker container and start the ARES application, including opening an integrated terminal to the container.
The devcontainer.json
file contains the necessary configuration for the devcontainer, including the Dockerfile. It mounts some local directories into the container, such as the data
directory for storing robot data, the /tmp
directory for storing temporary files, and the .cache/huggingface
directory for storing model weights.
In order to use the AnnotationDatabase, you will need to set up a MongoDB instance. We use docker-compose
to run the MongoDB instance; see mongo-docker-compose.yml
for the configuration. You can start the MongoDB instance by running
docker-compose -f mongo-docker-compose.yml up -d
in a shell in the root directory. This will start the MongoDB instance in detached mode and expose it to the host machine on port 27017; the instance will automatically restart if the container restarts.
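To confirm the instance is reachable before running the AnnotationDatabase, a quick check from Python can help. This is a minimal sketch assuming the default settings from mongo-docker-compose.yml (no authentication, exposed on localhost:27017); adjust the URI if your configuration differs.

```python
# Minimal connectivity check for the local MongoDB instance backing the
# AnnotationDatabase; assumes no auth and the default port 27017.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017", serverSelectionTimeoutMS=2000)
client.admin.command("ping")  # raises ServerSelectionTimeoutError if Mongo is not up
print("MongoDB is reachable:", client.list_database_names())
```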
ARES uses environment variables to configure secrets like API keys. We mount these environment variables into the devcontainer using the devcontainer.json
file. We copy over variables like API keys and credentials. If needed, you can also add your own certificates to the /etc/ssl/certs
directory, which is also mounted into the container.
Once your IDE and environment are setup, you're ready to start using ARES!
All of ARES refers back to a base unit of data: the Rollout
object defined in ares/configs/base.py
. In reinforcement learning, a rollout is a collection of states, actions, and rewards in some environment. For our purposes, a Rollout
is a collection of information about a robot completing a task, also called an episode.
These rollouts may be examples of a policy evaluation, a human teleoperation session, simulated or synthetic data, or other data formats. The Rollout
object can contain multiple modalities of data, such as video, pointclouds, motor actions, and joint states. Rollouts can contain metadata about the episode, such as the dataset name, the robot configuration and embodiment, the data collection method, and more.
During ingestion, the user can provide hard-coded information about the rollout; other fields in the Rollout
class will be provided by a VLM during ingestion.
The Rollout
class contains recursive subconfigurations: Robot
, Environment
, Task
, and Trajectory
. These subconfigurations may contain Pydantic fields or other subconfigurations. We use Pydantic not only to validate types, but also to provide rich type information, such as descriptions, examples, regex patterns, and numerical constraints. These Field
objects are then used to automatically configure prompts for VLMs during ingestion, including the extra metadata from the Field. Fields with the suffix _estimate
are inferred by the VLM during ingestion. The configuration classes can be recursively flattened or re-constructed into the original object; this enables a flexible system that can be used to create SQLModel types for database ingestion. Users can define their own configurations by inheriting from the base configuration class or adding Field
objects to the configuration classes.
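As a concrete illustration of this pattern (the exact ARES classes live in ares/configs/base.py and may differ), a hedged sketch of a subconfiguration with rich Field metadata and an _estimate field might look like the following; the field names here are hypothetical.

```python
# Illustrative sketch of the configuration pattern described above; field names
# are hypothetical. Field metadata (descriptions, examples, constraints) is what
# gets folded into the VLM prompts during ingestion.
from pydantic import BaseModel, Field


class Environment(BaseModel):
    # Hard-coded at ingestion time by the user, if known.
    dataset_name: str = Field(description="Name of the source dataset")
    # Fields suffixed with `_estimate` are filled in by the VLM during ingestion.
    surface_estimate: str = Field(
        description="The surface the robot is operating on",
        examples=["wooden table", "kitchen counter", "carpeted floor"],
    )
    lighting_estimate: str = Field(
        description="Qualitative lighting conditions in the scene",
        examples=["bright", "dim", "low light"],
    )


class Rollout(BaseModel):
    # Subconfigurations nest recursively and can be flattened into SQL columns.
    environment: Environment
    success_estimate: float = Field(
        ge=0.0, le=1.0, description="VLM-estimated probability that the task succeeded"
    )
```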
We start with data from Tensorflow Datasets ports of the Open X-Embodiment project. While the openness and availability of these datasets is great for general robot model training research, the iterator-style dataset format makes fast understanding and analysis extremely difficult, which motivates much of this work. As explained below in Ingestion and Annotation, we can use Open X-Embodiment data or user-defined datasets. In order to download the Open X-Embodiment data, use the oxe_downloader
tool.
We make our data and databases available on the Hugging Face Hub, which contains roughly 5000 ingested and annotated rollouts from the Open X-Embodiment project. Users can download the data by running the pull_from_hub.sh
script, which should download, extract, and restore the databases. This can be done by running:
# install the appropriate mongo-db-tools from https://www.mongodb.com/try/download/database-tools
chmod +x scripts/release/pull_from_hub.sh
./scripts/release/pull_from_hub.sh
Users can also upload their own data to the HF Hub by running python -m scripts.release.push_to_hub
. We upload the StructuredDatabase, AnnotationDatabase, EmbeddingDatabase, and videos to the Hub. The data is stored in the data
directory in the root of the repository. The dataset contains the following:
- robot_data.db: the StructuredDatabase SQLite database containing the structured metadata, descriptions, environment details, performance metrics, and more.
- embedding_data: the EmbeddingDatabase IndexManager, containing the FAISS-backed indexes of trajectory states, actions, descriptions, and task instructions.
- annotation_mongodump: the AnnotationDatabase MongoDB dump, containing the labeled rollouts, detections, success criteria, and other annotations stored in a MongoDB instance.
- videos: a directory containing the videos and frames of the ingested rollouts.
We adopt three main steps during ingestion: structured ingestion, embedding ingestion, and grounding ingestion. First, structured ingestion transforms the raw data into a structured format, which is then dumped into a SQL database. Second, embedding ingestion transforms data (such as the text description, video, action and state trajectories) into dense embeddings, which are then stored in a series of FAISS indexes. Third, grounding ingestion uses a VLM to detect objects in the scene and then detector and segmenter models to annotate the rollout with ground-truth labels and store these in a MongoDB database.
We adopt the OpenXEmbodiment specification as the starting point for ingestion, allowing users to ingest datasets from the Open X-Embodiment project. During ingestion, the user can provide hard-coded information about the episode, such as the natural language or templated task instructions. We load the raw data into an OpenXEmbodimentEpisode
object, which includes Pydantic model_validator
functions to process the raw data, such as swapping in highres_image
or pulling end_effector_cartesian_pos
from state
. We also pull data from the Open X-Embodiment spreadsheet, which contains metadata about the datasets and is copied into the repository at ares/extras/oxe.csv
.
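A hedged sketch of the model_validator pattern follows; it is not the exact ARES schema, and the state layout assumed below is illustrative.

```python
# Sketch of the validator pattern described above. A Pydantic model_validator
# normalizes raw Open X-Embodiment observations, e.g. preferring a high-resolution
# image or slicing the end-effector pose out of the state vector.
from typing import Any, Optional
from pydantic import BaseModel, model_validator


class OpenXEmbodimentObservation(BaseModel):
    image: Any
    highres_image: Optional[Any] = None
    state: Optional[list[float]] = None
    end_effector_cartesian_pos: Optional[list[float]] = None

    @model_validator(mode="after")
    def swap_in_highres_and_pose(self) -> "OpenXEmbodimentObservation":
        # Prefer the high-resolution image when the dataset provides one.
        if self.highres_image is not None:
            self.image = self.highres_image
        # Assumed layout: the first 6 state dims hold the cartesian end-effector pose.
        if self.end_effector_cartesian_pos is None and self.state is not None:
            self.end_effector_cartesian_pos = self.state[:6]
        return self
```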
The general ingestion pipeline can be found in main.py
, which runs structured, embedding, and grounding ingestion. An example of ingesting a user-defined dataset can be found in ares/scripts/pi_demo_ingestion.py
. This script contains the details on ingesting a series of demonstrations from a Physical Intelligence blogpost, hereafter referred to as the "PI Demos".
The script for structured ingestion can be found in ares/scripts/run_structured_ingestion.py
. The script iterates through asynchronous batches of episodes, extracting structured, hard-coded information from each episode and "filling in the blanks" for estimated fields like description_estimate
, surface_estimate
, success_estimate
etc. The information from the VLM populates a Rollout
object, which is then flattened into the RolloutSQLModel
object and dumped into the SQL database. Our StructuredDatabase
contains columns that match the recursively flattened Rollout
object, enabling retrieval over the entire dataset. This yields long-term, structured metadata about the rollouts, which can be used for downstream analysis and curation.
VLM(video, task, Field Annotations) -> Rollout -> Flattened RolloutSQLModel -> SQL Database Row
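A hedged sketch of the flattening step in this diagram (the real implementation lives in the ARES config and database modules; the helper below is illustrative):

```python
# Recursively flatten a nested Pydantic Rollout into prefixed columns that map
# onto a flat SQL row, e.g. {"environment_surface_estimate": "wooden table", ...}.
from pydantic import BaseModel


def flatten_config(model: BaseModel, prefix: str = "") -> dict:
    flat: dict = {}
    for name, value in model:  # BaseModel iterates as (field_name, value) pairs
        key = f"{prefix}{name}"
        if isinstance(value, BaseModel):
            flat.update(flatten_config(value, prefix=f"{key}_"))
        else:
            flat[key] = value
    return flat


# The flat dict can then populate RolloutSQLModel(**flatten_config(rollout))
# and be inserted as a single row via SQLAlchemy.
```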
We are interested in finding similar rollouts across a dataset along many axes: text description, task instruction, actions, states, etc. To enable this, we embed the rollouts into a dense vector space, where the Euclidean distance between rollouts approximates their semantic similarity. See the script in ares/scripts/run_trajectory_embedding_ingestion.py
for more details. For the text axes, we use a Nomic Embedder to embed the text into a dense vector space. For the state and action axes, we first interpolate trajectories to a common time step, normalize sensor values per-dimension, and then flatten and embed the trajectories into a common vector space. This enables comparing and contrasting rollouts across different axes, such as finding rollouts in a similar task
space but extremely different action
spaces.
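A hedged sketch of this recipe for state and action trajectories (the timestep count and normalization statistics below are assumptions, not the exact ARES parameters):

```python
# Interpolate a (T, D) trajectory to a fixed number of timesteps, normalize each
# dimension, and flatten into a single vector suitable for a FAISS index.
import numpy as np


def embed_trajectory(traj: np.ndarray, n_steps: int = 64) -> np.ndarray:
    T, D = traj.shape
    src_t = np.linspace(0.0, 1.0, T)
    dst_t = np.linspace(0.0, 1.0, n_steps)
    # Resample each dimension onto a common time grid.
    resampled = np.stack([np.interp(dst_t, src_t, traj[:, d]) for d in range(D)], axis=1)
    # Normalize per dimension (per-trajectory stats here for illustration; ARES
    # tracks dataset-level statistics for this purpose).
    mean, std = resampled.mean(axis=0), resampled.std(axis=0) + 1e-8
    return ((resampled - mean) / std).flatten().astype(np.float32)
```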
We want to make it as easy as possible to annotate rollouts with models. These annotations can be further text descriptions (such as success criteria or grounding descriptions) or more traditional object detection and segmentation labels. During ingestion, we annotate at 5 FPS using grounding-dino-tiny
and sam-vit-base
models to detect objects and perform object segmentation. These annotations are stored in the AnnotationDatabase, which is backed by a MongoDB instance. The compute orchestration is handled by Modal, which allows us to scale to cloud-level resources and perform asynchronous, parallelized inference. The script for grounding ingestion can be found in ares/scripts/run_grounding.py
.
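A hedged sketch of the per-frame detection step (ARES runs these models through Modal; the Hugging Face pipeline usage below is an assumption about how grounding-dino-tiny can be called locally, and the segmentation step is omitted):

```python
# Subsample frames at ~5 FPS and run open-vocabulary detection on each frame.
import cv2
from PIL import Image
from transformers import pipeline

detector = pipeline("zero-shot-object-detection", model="IDEA-Research/grounding-dino-tiny")


def annotate_video(path: str, labels: list[str], target_fps: float = 5.0) -> list[dict]:
    cap = cv2.VideoCapture(path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or target_fps
    stride = max(1, round(native_fps / target_fps))
    annotations, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            detections = detector(image, candidate_labels=labels)
            annotations.append({"frame": idx, "detections": detections})
        idx += 1
    cap.release()
    return annotations
```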
Once we've ingested and annotated a dataset, we can use ARES to curate and analyze the data. We provide a series of tools for visualizing and understanding the data in a simple frontend powered by Streamlit. You can run the frontend locally by running:
streamlit run src/ares/app/webapp.py
This opens a local Streamlit web app at localhost:8501
that provides a high-level overview of the ingested data, covering structured metadata, videos, annotations, and more. The ability to visualize annotations and retrieve rollouts by their annotations is a powerful curation tool, as is the ability to filter rollouts by their attributes. Retrieval over trajectory embeddings provides a powerful way to find in- and out-of-distribution rollouts. Exploring 2D projections of embeddings (such as task or description) enables the user to find clusters in the data, resulting in a deeper understanding of the distributions.
The user can easily select a subset of rollouts based on their hard-coded or inferred attributes. For example, the user can select all rollouts with dynamic background
and low light
to understand the performance of the robot in low-light conditions with a dynamic background. Alternatively, the user can select rollouts based on a UMAP projection of the unstructured attributes, such as the task instructions or inferred natural language description.
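A hedged sketch of this kind of attribute filter against the exported SQLite database (the table and column names below are hypothetical stand-ins for the flattened Rollout fields):

```python
# Filter rollouts by inferred attributes; paths, table, and column names are illustrative.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///data/robot_data.db")  # assumed path to the database
df = pd.read_sql("SELECT * FROM rollouts", engine)       # assumed table name

subset = df[
    df["environment_background_estimate"].str.contains("dynamic", case=False, na=False)
    & df["environment_lighting_estimate"].str.contains("low", case=False, na=False)
]
print(f"{len(subset)} rollouts with a dynamic background in low light")
```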
The main focus of the ARES platform is examining individual rollouts and finding similar examples. We provide a hero display to enable the user to view all information about a rollout at once, including video, annotations, and other metadata. Additionally, the user can retrieve similar examples based on the rollout's embeddings, searching over task, description, or state and action trajectories.
We can also examine trajectories to find in- and out-of-distribution rollouts, such as examples of difficult-to-reach joint angles or unusual action sequences. Researchers may use this tool to find poorly performing regions of action or state spaces, or to find examples of successful trajectories that can be used to improve policies.
After curating and analyzing a dataset, we provide functionality for the user to export a curated slice. This export can be the graphical depiction of the dashboard saved to PDF or HTML; alternatively, you can export the data to CSV or Parquet files. These CSV and Parquet files can also be used as a pre-processed dataset for training downstream models. We provide a RolloutDataset
and RolloutDatasetConfig
class to help train models using the ARES platform. See examples in ares/training/preprocess.py
and ares/training/train.py
.
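A hedged sketch of consuming an exported Parquet slice for training (the real helpers are RolloutDataset and RolloutDatasetConfig; the column names below are illustrative):

```python
# Wrap an exported Parquet slice in a PyTorch Dataset for downstream training.
import pandas as pd
import torch
from torch.utils.data import DataLoader, Dataset


class ExportedRolloutDataset(Dataset):
    def __init__(self, parquet_path: str):
        self.df = pd.read_parquet(parquet_path)

    def __len__(self) -> int:
        return len(self.df)

    def __getitem__(self, idx: int) -> dict:
        row = self.df.iloc[idx]
        return {
            "task_instruction": row["task_language_instruction"],          # assumed column
            "success": torch.tensor(float(row["task_success_estimate"])),  # assumed column
        }


loader = DataLoader(ExportedRolloutDataset("export.parquet"), batch_size=32, shuffle=True)
```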
In order to decide which models should power the ARES platform, we developed a benchmark based on the demonstrations released by Physical Intelligence alongside the π₀ release. The benchmark contains 20 videos over 11 tasks, most with both a success and a failure rollout (as provided by the PI team). We use this benchmark to evaluate the performance of the ARES platform, as well as to select the best models for use in the platform. We conduct evaluations over binary success classification, covering a range of VLMs, methods, consensus votes, and frames-per-second. We demonstrate the flexibility of the ARES platform by using the VLM
object, creating a Pydantic EvalConfig
, and launching batch asynchronous evaluations. In order to maintain consistent, robust evaluations, we first annotate success_criteria
for each rollout using a VLM before predicting success. We run the evaluation script in scripts/eval.py
. Evaluation results are summarized below, as plotted by notebooks/eval_nb.ipynb
.
We evaluate over:
- Models: claude-3.5-sonnet, gemini-1.5-pro, gemini-1.5-flash, gemini-2-flash, gpt-4-turbo, gpt-4o, and gpt-4o-mini
- Methods: video (feeding frames directly) and frame_descriptions (describing each frame individually, then feeding all descriptions to the VLM)
- Frames-per-second: 0, 0.25, 0.5, 1, 2, where 0 means just the first and final frames
- Consensus: consensus of either the mean or median of the predictions
- N Votes: number of votes to take for the consensus prediction, ranging over 1, 2, 3, 4, 5
We show gpt-4o
to be the best performing model, with little impact from the number of votes or consensus strategy. Across models, performance seems to be mostly consistent at 0.5 FPS and above, while some older models actually perform better at very low FPS (e.g. gemini-1.5-pro
performs best at 0
FPS). We find the frame_description
method to actually slightly outperform the generalized video
method, calling into question progress on long-context video benchmarks. However, this method is significantly more expensive and more difficult to run with respect to requests-per-minute and tokens-per-minute rate limits. As such, we adopt the video
method at 1 FPS with gpt-4o
for all ARES data ingestion.
Zawalski et al.'s "Robotic Control via Embodied Chain-of-Thought Reasoning" (ECoT) demonstrated how composing annotations from multiple sources can create effective reasoning plans for robot control. ARES reimplements this approach with a more modular, scalable architecture that addresses several limitations of the original implementation:
- Model Flexibility: Easy switching between VLMs via LiteLLM integration
- Cloud Scaling: Modal-based orchestration for detection and segmentation models
- Parallel Processing: Asynchronous annotation for improved throughput
- Structured Storage: Database-backed annotations instead of flat files
Our implementation in scripts/annotating/run_pseudo_ecot.py
demonstrates how ARES generates ECoT-like training data efficiently through a process that:
- Loads previously generated annotations from the AnnotationDatabase:
  - Scene descriptions, object detections, and success criteria
- Composes these annotations with task information into a structured prompt using a Jinja2 template (pseudo_ecot.jinja2)
- Processes results in parallel through the orchestrate_annotating framework, which:
  - Batches rollouts efficiently to manage memory usage
  - Handles error tracking and retries for failed annotations
  - Manages database connections and transactions
  - Provides detailed statistics on annotation progress and performance
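A hedged sketch of the composition step, assuming hypothetical template contents and annotation keys (the shipped template is pseudo_ecot.jinja2 and batching is handled by orchestrate_annotating):

```python
# Compose stored annotations into an ECoT-style prompt with Jinja2; the template
# string and keys here are illustrative, not the shipped template.
from jinja2 import Template

PSEUDO_ECOT_TEMPLATE = Template(
    """Task: {{ task_instruction }}
Scene: {{ scene_description }}
Objects: {{ detections | join(', ') }}
Success criteria: {{ success_criteria }}
Produce a step-by-step plan the robot should follow, grounded in the objects above."""
)

prompt = PSEUDO_ECOT_TEMPLATE.render(
    task_instruction="put the carrot in the bowl",
    scene_description="a tabletop with a carrot, a bowl, and a robot gripper",
    detections=["carrot", "bowl", "gripper"],
    success_criteria="the carrot rests inside the bowl in the final frame",
)
```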
This approach generated 2,500 ECoT-style annotations in approximately 10 minutes at a cost of around $5, demonstrating ARES's efficiency in generating complex annotations at scale.
To demonstrate ARES's flexibility for custom dataset ingestion, we ingested demonstrations from Physical Intelligence's π₀ release. This dataset includes paired success and failure examples for various robot tasks, making it ideal for evaluation purposes.
The ingestion process adapts these videos to work within ARES's framework through several key steps:
- Data Preparation: We transform the PI Demo videos into ARES's standard format by extracting frames from the videos and organizing them with appropriate metadata, including success/failure labels.
- Custom Dataset Iterator: We created a simple iterator that processes each task with both its success and failure variants, allowing ARES to ingest the paired examples efficiently (a sketch follows this list).
- Standardized Metadata: The PI Demo task information is mapped to ARES's expected metadata format, ensuring compatibility with our analysis tools.
- Standard Pipeline Integration: Once prepared, the data flows through the standard ARES ingestion pipeline, benefiting from all the structured, embedding, and grounding capabilities.
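A hedged sketch of the iterator idea from step 2 (the actual implementation is in ares/scripts/pi_demo_ingestion.py; the directory layout and metadata keys below are illustrative):

```python
# Yield one record per (task, outcome) video pair for ingestion.
from pathlib import Path
from typing import Iterator


def iter_pi_demos(root: str) -> Iterator[dict]:
    for task_dir in sorted(Path(root).iterdir()):
        if not task_dir.is_dir():
            continue
        for outcome in ("success", "failure"):
            video = task_dir / f"{outcome}.mp4"
            if video.exists():
                yield {
                    "dataset_name": "Physical Intelligence Demos",
                    "task": task_dir.name.replace("_", " "),
                    "video_path": str(video),
                    "success": outcome == "success",
                }
```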
After ingestion, the user can choose the Physical Intelligence Demos
dataset during structured data filtering to view the dataset distribution and results.
This case study demonstrates how ARES can easily adapt to custom datasets with minimal configuration, requiring only the definition of how to extract frames and map metadata. The resulting ingested dataset served as the foundation for our VLM performance evaluation on success/failure classification, as detailed in the Evaluation section.
We provide some fun extra tools to help with robot research.
- The Rollout and RolloutSQLModel classes can be used to dynamically flatten and reconstruct rollouts, enabled by hierarchical prefix matching. This enables smooth transitions between Pydantic and SQLModel types.
- The notebooks directory contains a few notebooks for visualization; one for VLM results and the other for annotation results.
- The scripts directory provides the top-level entrypoint for interacting with data in ARES, including the main ingestion pipeline. Other scripts include:
  - scripts/db_updaters/: scripts for amending the AnnotationDatabase and StructuredDatabase
  - scripts/annotating: scripts for running annotations for in-context-learning, Embodied Chain of Thought, success criteria, and more, including the Modal-based grounding and detection scripts.
  - scripts/self_heal.py: a tool for automatically syncing the StructuredDatabase with the AnnotationDatabase and EmbeddingDatabase. This means that there should never be ingested rollouts without annotations or embeddings.
- In order to provide robust error tracking during annotation, we provide a ResultTracker object that can be used to track the results of an annotation. This is useful for providing feedback on the status of an annotation job, as well as for providing a record of the annotation. See ares/annotating/annotating_base.py for more details.
- Normalization in ARES is dependent on the entire dataset being collected ahead of time. However, in practice, we often want to normalize data on-the-fly. To enable this, we provide the NormalizationTracker object to perform batch or online normalization. See ares/databases/embedding_database.py for more details.
Ingesting, annotating, and curating data can be quite expensive; we aim to use ARES to make robot research as accessible as possible. The original development of ARES was conducted solely on a laptop, using a combination of local resources (CPU and disk) and cloud APIs. First, we mitigate costs by holding all data structures locally, including the StructuredDatabase
, AnnotationDatabase
, and EmbeddingDatabase
. Second, we downsample frames-per-second while ingesting, running gpt-4o
at 1 FPS via API, using grounding-dino-tiny
and sam-vit-base
at 5 FPS via Modal, and the Nomic Embedder via local CPU. Back-of-the-envelope costs come out to roughly 1 cent per rollout for gpt-4o
, as most rollouts are relatively short. Annotating 5000 rollouts at 5 FPS (totaling >100,000 frames) for detection and segmentation costs less than $10 of the $30 of monthly free Modal credits. We expect these costs to remain low as the pareto-frontier of efficient multimodal models continues to bring down the cost of inference.
Right now, ARES is a platform to accelerate robot researchers. However, we need to acknowledge that current-generation VLMs are not perfect, leading to discrepancies or inaccuracies in the data during ingestion. At the same time, many of the questions being posed to the VLMs to understand robot data are relatively simple, so we are hopeful that next-generation VLMs will continue to improve in accuracy on long-context video understanding tasks. ARES is designed to be simple and scalable, so a great next step would be to make it easier to scale ARES into the cloud via hosted instances like MongoDB Atlas, AWS, Snowflake, etc. On top of that, ingesting and open-sourcing more rollouts from the Open X-Embodiment project would be a great way to continue to expand the usefulness of the platform. Currently, we have ingested roughly 5000 rollouts of the roughly 1 million available, focusing on datasets with natural language task annotations that can also fit on a laptop. Building stronger ingestion pipelines for more complex data types would be a great next step. Further next steps would refactor the ARES frontend (currently running on Streamlit) into a more performant system.
This project was developed by Jacob Phillips as a part of the Andreessen Horowitz American Dynamism Engineering Fellows program. Special thanks to the American Dynamism team for their support and feedback on the project.
If using the ARES platform in your work, please cite it to acknowledge the authors. Suggested format:
@software{ARES,
title = {ARES: Automatic Robot Evaluation System},
author = {Jacob Phillips},
url = {https://github.com/jacobphillips99/ares},
version = {insert version number},
date = {insert date of usage},
year = {2025},
}