
EWMBench: Evaluating Scene, Motion, and Semantic Quality in Embodied World Models


Embodied World Model Benchmark (EWMBench) is a benchmark framework designed to evaluate Embodied World Models (EWMs), assessing the performance of text-driven video generation models on embodied tasks. EWMBench systematically evaluates the physical plausibility and task coherence of generated content across three key dimensions: visual scene consistency, motion correctness, and semantic alignment. Unlike traditional perceptual metrics, EWMBench focuses on the practical usability and plausibility of generated results in embodied contexts. It is accompanied by a multi-dimensional evaluation toolkit and a high-quality, diverse dataset, providing insights into the limitations of current methods and driving progress toward the next generation of embodied intelligence models.

Getting started

Setup

  • CUDA Version: 11.8
  • Python 3.10+ recommended
  • Use conda or virtualenv for environment isolation
conda create -n EWMBench python=3.10.16
conda activate EWMBench
git clone --recursive https://github.com/AgibotTech/EWMBench.git
cd EWMBench
pip install -r requirements.txt
pip install git+https://github.com/openai/CLIP.git
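
After installation, a quick sanity check can confirm that PyTorch sees the GPU and that CLIP imports correctly. This is a minimal sketch; it assumes requirements.txt installs torch.

# Minimal post-install check (assumes requirements.txt installs torch).
import torch
import clip  # installed from the OpenAI CLIP repo above

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("CLIP models:", clip.available_models())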


Pretrained Weights

  1. Download the official Qwen2.5 checkpoint from Qwen2.5. The Qwen2.5 tooling comes from Qwen2.5-tool and is modified to fit our algorithm.

  2. Download the clip-vit-base-patch16 and CLIP-ViT-B-32 weights.

  3. Download our finetuned DINOv2 and YOLO-World weights.

Add the path of each downloaded weight to config.yaml.
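
A small script can verify that every configured path exists before running anything. This is an illustrative sketch only: the key names below are assumptions, not the shipped schema, so match them to the keys in the provided config.yaml (it also assumes PyYAML is available via requirements.txt).

# Illustrative only: key names below are assumptions; match them to config.yaml.
import os
import yaml

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

for key in ("qwen2_5", "clip_vit_b16", "clip_vit_b32", "dinov2", "yolo_world"):
    path = cfg.get(key, "")
    print(f"{key}: {path} (exists: {os.path.exists(str(path))})")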


Data Downloading

  1. Download the dataset from Hugging Face
  2. Move the downloaded dataset to ./data
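
If you prefer fetching the dataset programmatically, a snapshot download from the Hugging Face Hub is one option. This is a sketch: the repo id below is a placeholder, not the real one; use the dataset id linked above.

# Sketch only: "ORG/EWMBench-dataset" is a hypothetical placeholder repo id.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="ORG/EWMBench-dataset",  # replace with the linked dataset id
    repo_type="dataset",
    local_dir="./data",
)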

Our ground truth data is stored in the gt_dataset folder, which contains the reference data used to verify model accuracy.

For your reference, we also provide a sample set of generated results.

It is important to follow the required directory structure outlined below to ensure compatibility.

  • Ground Truth Data
gt_dataset/
├── task_1/
│   ├── episode_1/
│   │   ├── prompt/
│   │   │   ├── init_frame.png
│   │   │   └── introduction.txt
│   │   └── video/
│   │       ├── frame_00000.jpg
│   │       ├── ...
│   │       └── frame_0000n.jpg
│   ├── episode_2/
│   └── ...
├── task_2/
└── ...
  • Generated Samples
{xxx}_dataset/
├── task_1/
│   ├── episode_1/
│   │   ├── 1/
│   │   │   └── video/
│   │   │       ├── frame_00000.jpg
│   │   │       ├── ...
│   │   │       └── frame_0000n.jpg
│   │   ├── 2/
│   │   └── 3/
│   ├── episode_2/
│   └── ...
├── task_2/
└── ...

Note: the dataset folder name should end with '_dataset'.
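
To catch layout mistakes early, a small check such as the following can help. It is a sketch: the folder name my_model_dataset is hypothetical; only the '_dataset' suffix is required.

# Sketch of a layout check for a generated-results folder.
# "my_model_dataset" is a hypothetical name; only the "_dataset" suffix matters.
from pathlib import Path

root = Path("./data/my_model_dataset")
assert root.name.endswith("_dataset"), "folder name must end with '_dataset'"

for task in sorted(root.glob("task_*")):
    for episode in sorted(task.glob("episode_*")):
        for sample in sorted(p for p in episode.iterdir() if p.is_dir()):
            frames = sorted((sample / "video").glob("frame_*.jpg"))
            print(task.name, episode.name, sample.name, "->", len(frames), "frames")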

Data Preprocessing

To preprocess the inference data, use the following command:

bash processing.sh ./config.yaml

Note: ground truth detection must be performed on the first run.

After processing, please check the data directory structure.

⚠️ If your data structure does not match the following format, it does not meet the benchmark requirements.

  • Ground Truth Data
gt_dataset/
├── task_1/
│   ├── episode_1/
│   │   ├── gripper_detection/
│   │   │   └── video.mp4
│   │   ├── prompt/
│   │   │   ├── init_frame.png
│   │   │   └── introduction.txt
│   │   ├── traj/
│   │   │   └── traj.npy
│   │   └── video/
│   │       ├── frame_00000.jpg
│   │       ├── ...
│   │       └── frame_0000n.jpg
│   ├── episode_2/
│   └── ...
├── task_2/
└── ...
  • Generated Samples
{xxx}_dataset/
├── task_1/
│   ├── episode_1/
│   │   ├── 1/
│   │   │   ├── gripper_detection/
│   │   │   │   └── video.mp4
│   │   │   ├── traj/
│   │   │   │   └── traj.npy
│   │   │   └── video/
│   │   │       ├── frame_00000.jpg
│   │   │       ├── ...
│   │   │       └── frame_0000n.jpg
│   │   ├── 2/
│   │   └── 3/
│   ├── episode_2/
│   └── ...
├── task_2/
└── ...
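
After preprocessing, the traj/traj.npy files can be inspected with NumPy. This is a sketch: the exact array layout is an assumption, so consult the preprocessing code for the true schema.

# Sketch: inspect a preprocessed trajectory. The assumed layout
# (num_frames x coordinate dims) should be checked against the code.
import numpy as np

traj = np.load("./data/gt_dataset/task_1/episode_1/traj/traj.npy")
print("trajectory shape:", traj.shape)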

Running Evaluation

  1. Modify the configuration file:
# modify config.yaml
  2. To run evaluation tasks:
python evaluate.py --dimension 'semantics' 'trajectory_consistency' --config ./config.yaml

Available dimensions include:

  • diversity
  • scene_consistency
  • trajectory_consistency
  • semantics
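
For example, all four dimensions can be scored in one run:

python evaluate.py --dimension 'diversity' 'scene_consistency' 'trajectory_consistency' 'semantics' --config ./config.yaml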

Results are saved as a .csv file.
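
To aggregate the scores afterwards, the CSV can be loaded with pandas. This is a sketch: the output file name and column layout are assumptions, so use the file actually written by evaluate.py.

# Sketch: load the evaluation results. "results.csv" is a hypothetical path.
import pandas as pd

df = pd.read_csv("results.csv")
print(df.head())
print(df.mean(numeric_only=True))  # per-column averages for numeric scores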

Citation

Please consider citing our paper if our code is useful:

@article{hu2025ewmbench,
  title={EWMBench: Evaluating Scene, Motion, and Semantic Quality in Embodied World Models},
  author={Hu, Yue and Huang, Siyuan and Liao, Yue and Chen, Shengcong and Zhou, Pengfei and Chen, Liliang and Yao, Maoqing and Ren, Guanghui},
  journal={arXiv preprint arXiv:2505.09694},
  year={2025}
}

License

All the data and code within this repo are under CC BY-NC-SA 4.0.
