
EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering



EgoTextVQA

[🌐 Project Page] [πŸ“– Paper] [πŸ“Š Dataset] [πŸ† Evaluation]


πŸ”₯ Update

  • 2025.03.16 We have released our dataset and welcome researchers to use it for evaluation! (If you want to integrate it into a tool or repository, please feel free to let me know.)
  • 2025.02.27 Our paper has been accepted by CVPR 2025! πŸŽ‰ Thanks to all co-authors and dataset annotators.
  • 2025.02.11 We are very proud to launch ✨EgoTextVQA✨, a novel and rigorously constructed benchmark for egocentric QA assistance involving scene text! Our paper has been released on arXiv.

πŸ” EgoTextVQA

EgoTextVQA is a novel and rigorously constructed benchmark for egocentric QA assistance involving scene text. EgoTextVQA contains 1.5K ego-view videos and 7K scene-text aware questions that reflect real user needs in outdoor driving and indoor house-keeping activities. The questions are designed to elicit identification of and reasoning on scene text in an egocentric, dynamic environment. It consists of two parts: 1) EgoTextVQA-Outdoor focuses on outdoor scenarios, with 694 videos and 4,848 QA pairs covering questions that may arise while driving; 2) EgoTextVQA-Indoor emphasizes indoor scenarios, with 813 videos and 2,216 QA pairs covering questions that users may encounter during house-keeping activities.

❗❗❗ There are several unique features of EgoTextVQA.

  • It stands out as the first VideoQA testbed for egocentric scene-text aware QA assistance in the wild, with 7K QAs that reflect diverse user intentions across 1.5K different egocentric visual situations.
  • The QAs emphasize scene text comprehension, but only about half invoke the exact scene text.
  • The situations cover both indoor and outdoor activities.
  • Detailed timestamps and categories of the questions are provided to facilitate real-time QA and model analysis.
  • In the real-time QA setting, the answer must be derived from the video content captured before the question is asked, rather than from the whole video; the answer therefore changes with the timestamp of the question (a minimal sketch of this setting follows this list).
  • High- and low-resolution video settings can be used to probe the scene-text reading capabilities of MLLMs. In EgoTextVQA, we evaluate models in both high-resolution (1920×1080, 1280×720) and low-resolution (960×540, 640×360) video settings.
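To make the real-time setting concrete, here is a minimal sketch of how the visible context could be restricted by the question timestamp. The function and variable names are purely illustrative, not part of this repository's code.

    def frames_before_question(frame_paths, fps, question_time_sec):
        """Keep only the frames captured before the question is asked."""
        # Frames are assumed to be extracted at a uniform rate (fps=6 in EgoTextVQA).
        max_index = int(question_time_sec * fps)
        return frame_paths[:max_index]

    # Example: at fps=6, a question asked at t=12.5 s can only see the first 75 frames.
    all_frames = [f"frame_{i:05d}.jpg" for i in range(600)]
    visible_frames = frames_before_question(all_frames, fps=6, question_time_sec=12.5)
    print(len(visible_frames))  # 75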

Dataset Comparison


QA examples of different question categories.


Dataset analysis.


βœ… TODO List

  • Release paper on arxiv.
  • Release dataset.
  • Release model QA and evaluation code.

πŸ“ Evaluation Pipeline

  1. Download Videos:

    • Obtain raw_video. Please download the EgoTextVQA-Outdoor videos from the EgoTextVQA-Outdoor Google Drive, and the EgoTextVQA-Indoor videos either from the EgoSchema GitHub (following their instructions) or from the EgoTextVQA-Indoor Google Drive. The video IDs of our dataset are provided in egotextvqa_outdoor_videoID.json (694 videos) and egotextvqa_indoor_videoID.json (813 videos).

    • πŸ“’ You can download EgoTextVQA videos directly on Hugging Face.

  2. Video Process:

    • Obtain fps6_video. After downloading the raw videos, use video_process/change_video_fps.py to uniformly convert them to fps=6.
    • Obtain fps6_video_high_res and fps6_video_low_res. Each EgoTextVQA-Outdoor video is kept in two versions: the original high-resolution version (1920×1080, 1280×720), and a low-resolution version (960×540, 640×360) produced with video_process/change_video_res.py. For EgoTextVQA-Indoor, no resolution conversion is applied.
    • Obtain fps6_frame. Then use video_process/video2frame.py to split the fps=6 videos into frames for model evaluation.
    • Others. More video-processing scripts used in our experiments can be found in the video_process folder of this repo. (A minimal ffmpeg sketch is provided after the directory layout below.)

    The resulting data structure is as follows:

    root
    ├── data
    │   ├── indoor
    │   │   ├── fps6_frames
    │   │   ├── fps6_videos
    │   │   └── raw_videos
    │   └── outdoor
    │       ├── fps6_frame_high_res
    │       ├── fps6_frame_low_res
    │       ├── fps6_video_high_res
    │       ├── fps6_video_low_res
    │       └── raw_videos
    
  3. QA files:

    Please clone our GitHub Repo.

    git clone https://github.com/zhousheng97/EgoTextVQA.git
    cd EgoTextVQA/data
    
  4. MLLM QA Prompt: Please see mllm_prompt.json.

  5. MLLM Evaluation: Please use gpt_eval.py.
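    For orientation, below is a minimal sketch of a GPT-based scoring call in the spirit of the VideoChatGPT evaluation that our protocol builds on. The prompt wording, model name, and output fields are assumptions; gpt_eval.py is the authoritative implementation.

    # Sketch of GPT-based answer scoring (assumed prompt and fields); see gpt_eval.py
    # for the actual evaluation protocol used in the paper.
    import json
    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY in the environment

    def score_answer(question, ground_truth, prediction, model="gpt-4o-mini"):
        """Ask GPT whether the prediction matches the ground truth and to rate it 0-5."""
        prompt = (
            "You are evaluating a video question answering result.\n"
            f"Question: {question}\n"
            f"Ground-truth answer: {ground_truth}\n"
            f"Predicted answer: {prediction}\n"
            'Reply only with JSON: {"pred": "yes" or "no", "score": an integer from 0 to 5}.'
        )
        resp = client.chat.completions.create(
            model=model,  # model name is an assumption
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        # Note: in practice the returned text may need cleanup before parsing.
        return json.loads(resp.choices[0].message.content)

    # Example (illustrative):
    # result = score_answer("What is the speed limit?", "40 mph", "the sign says 40 mph")
    # print(result["pred"], result["score"])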

πŸ“ˆ Experiment Results

  • Evaluation results of MLLMs on EgoTextVQA-Outdoor with low resolution (960Γ—540, 640Γ—360).


  • Evaluation results of MLLMs on EgoTextVQA-Indoor with resolution (640Γ—360, 480Γ—360).


  • Evaluation results of MLLMs on the real-time QA subset of EgoTextVQA-Outdoor (∼623 QA pairs).


  • Evaluation results of MLLMs on EgoTextVQA-Outdoor with high resolution (1920Γ—1080, 1280Γ—720).


🎨 Result Visualization

Examples on EgoTextVQA-Outdoor.


Examples on EgoTextVQA-Indoor.


πŸ“§ Contact

If you have any questions or suggestions about the dataset, please contact: hzgn97@gmail.com. We are happy to communicate 😊.

✨ Citation

If this work is helpful to you, please consider giving this repository a 🌟 and citing our paper as follows:

@InProceedings{Zhou_2025_CVPR,
    author    = {Zhou, Sheng and Xiao, Junbin and Li, Qingyun and Li, Yicong and Yang, Xun and Guo, Dan and Wang, Meng and Chua, Tat-Seng and Yao, Angela},
    title     = {EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {3363-3373}
}

πŸ’Œ Acknowledgement

We would like to thank the following repos for their great work:

Our dataset is built upon: RoadTextVQA, EgoSchema, and Ego4D.

Our evaluation is built upon: VideoChatGPT.
