This repository contains the dataset used in the paper
"Do Pre-trained Vision-Language Models Encode Object States?"
by Kaleb Newman, Shijie Wang, Yuan Zang, David Heffren, and Chen Sun
[arXiv:2409.10488]
The full image dataset (~25K images) is extracted from the ChangeIt video dataset. Each image corresponds to a key frame depicting an object in a particular physical state.
A golden subset (~1.7K images) of human-verified annotations, containing images cropped to the object bounding box. Used for fine-grained state recognition evaluation.
The same golden subset as above, but with full, uncropped images. Used to compare performance on full vs. object-centric inputs.
Frame-level annotations for each video, derived from the ChangeIt dataset.
Defines state labels for each object category:
- All possible states for a category.
- Initial and terminal states, used for structured state transition evaluation.
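The exact file format is not reproduced here, but conceptually the label definitions can be thought of as a per-category mapping, as in the minimal Python sketch below. The category names, state strings, and key names (`states`, `initial`, `terminal`) are illustrative assumptions, not the repository's actual schema:

```python
# Illustrative sketch only: the actual label files in this repo may use a different format.
# Hypothetical per-category state definitions, mirroring the two label types above
# (all possible states, plus the initial/terminal pair used for transition evaluation).
STATE_LABELS = {
    "apple": {
        "states": ["whole apple", "peeled apple", "cut apple"],
        "initial": "whole apple",
        "terminal": "cut apple",
    },
    "egg": {
        "states": ["raw egg", "whisked egg", "fried egg"],
        "initial": "raw egg",
        "terminal": "fried egg",
    },
}

def get_transition(category: str) -> tuple[str, str]:
    """Return the (initial, terminal) state pair for a category."""
    entry = STATE_LABELS[category]
    return entry["initial"], entry["terminal"]
```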
Parses frame-level annotation CSVs into a dictionary (`cat_dicts`) mapping each category to its annotated videos and corresponding frame-level state labels.
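As a rough illustration of this step, the sketch below builds such a dictionary from per-video CSV files. The file naming scheme (`<category>_<video_id>.csv`) and the two-column `(frame, state)` row layout are assumptions and may not match the repository's actual annotation files:

```python
import csv
from collections import defaultdict
from pathlib import Path

def build_cat_dicts(annotation_dir: str) -> dict:
    """Parse frame-level annotation CSVs into a nested dictionary:
    category -> video id -> {frame index: state label}."""
    cat_dicts = defaultdict(dict)
    for csv_path in sorted(Path(annotation_dir).glob("*.csv")):
        # Assumed naming convention: <category>_<video_id>.csv
        category, _, video_id = csv_path.stem.partition("_")
        frame_labels = {}
        with open(csv_path, newline="") as f:
            # Assumed row layout: one (frame, state) pair per line
            for frame, state in csv.reader(f):
                frame_labels[int(frame)] = state
        cat_dicts[category][video_id] = frame_labels
    return dict(cat_dicts)
```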
Processes and aligns annotations with the selected subset frames from `ChangeIT-Subset-Images/`, returning filtered ground-truth labels and category associations.
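A minimal sketch of this alignment step is shown below. It assumes the subset images are named `<category>_<video_id>_<frame>.jpg` inside `ChangeIT-Subset-Images/`, which is an illustrative convention rather than the repository's actual layout:

```python
from pathlib import Path

def align_with_subset(cat_dicts: dict, subset_dir: str = "ChangeIT-Subset-Images/") -> dict:
    """Keep only ground-truth labels whose key frames were exported to the image subset."""
    filtered = {}
    for image_path in sorted(Path(subset_dir).glob("*.jpg")):
        # Assumed naming convention: <category>_<video_id>_<frame>.jpg
        category, video_id, frame = image_path.stem.rsplit("_", 2)
        label = cat_dicts.get(category, {}).get(video_id, {}).get(int(frame))
        if label is not None:
            filtered[image_path.name] = {"category": category, "state": label}
    return filtered
```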
- The VLMs we tested perform well on object recognition but struggle with fine-grained physical state recognition.
- Even with human-labeled crops and object-centric modifications, models fail to distinguish between states like "peeled apple" vs. "cut apple".
- Patch-level models (e.g., FLAVA) outperform others in distractor scenarios, possibly due to better region-text binding.
Read the full paper for more insights: arXiv:2409.10488
If you use this dataset or code, please cite:
@article{objectstates,
  title={Do Pre-trained Vision-Language Models Encode Object States?},
  author={Newman, Kaleb and Wang, Shijie and Zang, Yuan and Heffren, David and Sun, Chen},
  journal={arXiv preprint arXiv:2409.10488},
  year={2024}
}