[Project Page] [Paper] [Dataset] [Evaluation]
- [2025.03.16] We have released our dataset and welcome researchers to use it for evaluation! (If you want to integrate it into a tool or repository for use, please feel free to let me know.)
- [2025.02.27] Our paper has been accepted by CVPR 2025! Thanks to all co-authors and dataset annotators.
- [2025.02.11] We are very proud to launch EgoTextVQA, a novel and rigorously constructed benchmark for egocentric QA assistance involving scene text! Our paper has been released on arXiv.
EgoTextVQA is a novel and rigorously constructed benchmark for egocentric QA assistance involving scene text. It contains 1.5K ego-view videos and 7K scene-text aware questions that reflect real-user needs in outdoor driving and indoor house-keeping activities. The questions are designed to elicit identification and reasoning on scene text in an egocentric and dynamic environment. The benchmark consists of two parts: 1) EgoTextVQA-Outdoor focuses on outdoor scenarios, with 694 videos and 4,848 QA pairs that may arise when driving; 2) EgoTextVQA-Indoor emphasizes indoor scenarios, with 813 videos and 2,216 QA pairs that users may encounter in house-keeping activities.
There are several unique features of EgoTextVQA.
- It stands out as the first VideoQA testbed for egocentric scene-text aware QA assistance in the wild, with 7K QAs that reflect diverse user intentions under 1.5K different egocentric visual situations.
- The QAs emphasize scene-text comprehension, but only about half of them involve the exact scene text in the answers.
- The situations cover both indoor and outdoor activities.
- Detailed timestamps and categories of the questions are provided to facilitate real-time QA and model analysis.
- In the real-time QA setting, the answer must be derived from the video content captured before the question is asked, rather than from the full video; the answer therefore changes with the timestamp of the question (see the sketch after this list).
- High- and low-resolution video settings can be used to evaluate the scene-text reading capabilities of MLLMs. In EgoTextVQA, we evaluate models in both high-resolution (1920×1080, 1280×720) and low-resolution (960×540, 640×360) video settings.
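To make the real-time setting concrete, below is a minimal sketch of how one might select the frames visible to a model at question time. The fps=6 frame layout and the example directory path are assumptions for illustration; refer to the released QA files and scripts for the actual schema.

```python
from pathlib import Path

FPS = 6  # EgoTextVQA videos are processed to 6 frames per second


def frames_before_question(frame_dir: Path, question_time_s: float) -> list[Path]:
    """Return only the frames captured before the question timestamp.

    In the real-time QA setting the model must answer from the video
    seen so far, so frames after `question_time_s` are excluded.
    """
    all_frames = sorted(frame_dir.glob("*.jpg"))
    last_visible = int(question_time_s * FPS)
    return all_frames[:last_visible]


# Hypothetical usage: a question asked 12.5 seconds into a video.
frames = frames_before_question(
    Path("data/outdoor/fps6_frame_high_res/video_0001"), 12.5
)
print(f"{len(frames)} frames are available to the model")
```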
- Release paper on arXiv.
- Release dataset.
- Release model QA and evaluation code.
Download Videos:

- Obtain raw_video. Please download the EgoTextVQA-Outdoor videos from the EgoTextVQA-Outdoor Google Drive, and the EgoTextVQA-Indoor videos either from the EgoSchema GitHub (following their instructions) or from the EgoTextVQA-Indoor Google Drive. We provide the video IDs for our dataset in `egotextvqa_outdoor_videoID.json` (694 videos) and `egotextvqa_indoor_videoID.json` (813 videos).
- You can also download the EgoTextVQA videos directly from Hugging Face (see the sketch below).
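For the Hugging Face route, a minimal sketch using `huggingface_hub` is shown below. The repo id used here is a placeholder, not confirmed by this README; substitute the dataset id linked above.

```python
from huggingface_hub import snapshot_download

# Download the whole dataset repo into ./data.
# NOTE: "zhousheng97/EgoTextVQA" is a placeholder repo id; use the
# dataset id linked from this README.
snapshot_download(
    repo_id="zhousheng97/EgoTextVQA",
    repo_type="dataset",
    local_dir="data",
)
```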
Video Process:

- Obtain fps6_video. After downloading the raw videos, use `video_process/change_video_fps.py` to uniformly convert them to fps=6 (a rough ffmpeg sketch follows this list).
- Obtain fps6_video_high_res and fps6_video_low_res. The EgoTextVQA-Outdoor videos come in two versions: the original videos are the high-resolution version (1920×1080, 1280×720), and the low-resolution version (960×540, 640×360) is produced with `video_process/change_video_res.py`. For EgoTextVQA-Indoor, we do not change the video resolution.
- Obtain fps6_frame. Then use `video_process/video2frame.py` to split the fps=6 videos into frames for model evaluation (see the frame-extraction sketch after the data structure below).
- Others. More video processing code used in the experiments can be found in `video_process` in this repo.
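As a rough illustration of the fps and resolution steps (the released scripts in `video_process` are authoritative), the conversions can be done with ffmpeg, assuming it is installed:

```python
import subprocess
from pathlib import Path


def change_fps(src: Path, dst: Path, fps: int = 6) -> None:
    """Re-encode a video at a fixed frame rate (fps=6 in EgoTextVQA)."""
    dst.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", str(src), "-filter:v", f"fps={fps}", "-y", str(dst)],
        check=True,
    )


def change_res(src: Path, dst: Path, height: int = 540) -> None:
    """Downscale to a target height; scale=-2 keeps the aspect ratio with an even width."""
    dst.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", str(src), "-vf", f"scale=-2:{height}", "-y", str(dst)],
        check=True,
    )


# Hypothetical usage for one EgoTextVQA-Outdoor video.
change_fps(Path("data/outdoor/raw_videos/video_0001.mp4"),
           Path("data/outdoor/fps6_video_high_res/video_0001.mp4"))
change_res(Path("data/outdoor/fps6_video_high_res/video_0001.mp4"),
           Path("data/outdoor/fps6_video_low_res/video_0001.mp4"))
```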
Data structure as below:

    root
    └── data
        ├── indoor
        │   ├── fps6_frames
        │   ├── fps6_videos
        │   └── raw_videos
        └── outdoor
            ├── fps6_frame_high_res
            ├── fps6_frame_low_res
            ├── fps6_video_high_res
            ├── fps6_video_low_res
            └── raw_videos
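To populate the `fps6_frame*` directories, frame extraction might look like the following sketch, assuming OpenCV; the released `video_process/video2frame.py` is authoritative.

```python
import cv2
from pathlib import Path


def video_to_frames(video_path: Path, frame_dir: Path) -> int:
    """Dump every frame of an fps=6 video as a numbered JPEG."""
    frame_dir.mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(str(video_path))
    count = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        cv2.imwrite(str(frame_dir / f"{count:06d}.jpg"), frame)
        count += 1
    cap.release()
    return count


n = video_to_frames(Path("data/outdoor/fps6_video_low_res/video_0001.mp4"),
                    Path("data/outdoor/fps6_frame_low_res/video_0001"))
print(f"extracted {n} frames")
```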
QA Files: Please clone our GitHub repo:

    git clone https://github.com/zhousheng97/EgoTextVQA.git
    cd data
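The video ID files mentioned above are plain JSON. Below is a minimal sketch for checking your downloaded videos against them; it assumes the files sit under `data/` and contain a flat list of IDs, which is an assumption, so inspect the files for the actual layout.

```python
import json
from pathlib import Path

# Assumption: the videoID files hold a flat list of video ID strings.
outdoor_ids = json.loads(Path("data/egotextvqa_outdoor_videoID.json").read_text())
downloaded = {p.stem for p in Path("data/outdoor/raw_videos").glob("*.mp4")}
missing = [vid for vid in outdoor_ids if vid not in downloaded]
print(f"{len(missing)} of {len(outdoor_ids)} outdoor videos still missing")
```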
- MLLM QA Prompt: Please see `mllm_prompt.json`.
- MLLM Evaluation: Please use `gpt_eval.py` (a sketch of the GPT-judging pattern follows).
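For reference, the GPT-assisted evaluation popularized by VideoChatGPT asks an LLM to judge each predicted answer against the ground truth. Below is a minimal sketch of that pattern; the model name, prompt wording, and output format are assumptions, and `gpt_eval.py` is authoritative.

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def judge(question: str, answer: str, prediction: str) -> dict:
    """Ask GPT whether a predicted answer matches the ground truth.

    Returns e.g. {"pred": "yes", "score": 4}; the prompt and output
    format follow the VideoChatGPT-style protocol only loosely.
    """
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumption: pick the model used by gpt_eval.py
        messages=[
            {"role": "system",
             "content": "You evaluate question-answering predictions. "
                        "Reply with JSON: {\"pred\": \"yes\"|\"no\", \"score\": 0-5}."},
            {"role": "user",
             "content": f"Question: {question}\nCorrect answer: {answer}\n"
                        f"Predicted answer: {prediction}"},
        ],
    )
    return json.loads(response.choices[0].message.content)


print(judge("What does the sign say?", "Stop", "The sign says stop."))
```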
- Evaluation results of MLLMs on EgoTextVQA-Outdoor with low resolution (960×540, 640×360).
- Evaluation results of MLLMs on EgoTextVQA-Indoor with resolution (640×360, 480×360).
- Evaluation results of MLLMs on the real-time QA subset of EgoTextVQA-Outdoor (~623 QA pairs).
- Evaluation results of MLLMs on EgoTextVQA-Outdoor with high resolution (1920×1080, 1280×720).
If you have any questions or suggestions about the dataset, please contact hzgn97@gmail.com. We are happy to communicate.
If this work is helpful to you, please consider giving this repository a star and citing our paper as follows:
@InProceedings{Zhou_2025_CVPR,
author = {Zhou, Sheng and Xiao, Junbin and Li, Qingyun and Li, Yicong and Yang, Xun and Guo, Dan and Wang, Meng and Chua, Tat-Seng and Yao, Angela},
title = {EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering},
booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
month = {June},
year = {2025},
pages = {3363-3373}
}
We would like to thank the following repos for their great work:
Our dataset is built upon: RoadTextVQA, EgoSchema, and Ego4D.
Our evaluation is built upon: VideoChatGPT.