
EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering



EgoTextVQA

[🌐 Project Page] [πŸ“– Paper] [πŸ“Š Dataset] [πŸ† Evaluation]


πŸ”₯ Update

  • 2025.03.16 We have released our dataset and welcome researchers to use it for evaluation! (If you want to integrate it into a tool or repository, please feel free to let me know.)
  • 2025.02.27 Our paper has been accepted by CVPR 2025! πŸŽ‰ Thanks to all co-authors and dataset annotators.
  • 2025.02.11 We are very proud to launch ✨EgoTextVQA✨, a novel and rigorously constructed benchmark for egocentric QA assistance involving scene text! Our paper has been released on arXiv.

πŸ” EgoTextVQA

EgoTextVQA is a novel and rigorously constructed benchmark for egocentric QA assistance involving scene text. EgoTextVQA contains 1.5K ego-view videos and 7K scene-text aware questions that reflect real user needs in outdoor driving and indoor house-keeping activities. The questions are designed to elicit identification of and reasoning on scene text in an egocentric, dynamic environment. It consists of two parts: 1) EgoTextVQA-Outdoor focuses on outdoor scenarios, with 694 videos and 4,848 QA pairs covering questions that may arise while driving; 2) EgoTextVQA-Indoor emphasizes indoor scenarios, with 813 videos and 2,216 QA pairs covering questions that users may encounter during house-keeping activities.

❗❗❗ There are several unique features of EgoTextVQA.

  • It stands out as the first VideoQA testbed for egocentric scene-text aware QA assistance in the wild, with 7K QAs that reflect diverse user intentions across 1.5K different egocentric visual situations.
  • The QAs emphasize scene text comprehension, but only about half invoke the exact scene text.
  • The situations cover both indoor and outdoor activities.
  • Detailed timestamps and categories of the questions are provided to facilitate real-time QA and model analysis.
  • In the real-time QA setting, the answer must be derived from the video content captured before the question is asked, rather than from the whole video; the answer therefore changes with the timestamp of the question (a minimal sketch of this setting follows this list).
  • High- and low-resolution video settings can be used to probe the scene-text reading capabilities of MLLMs. In EgoTextVQA, we evaluate models in both high-resolution (1920×1080, 1280×720) and low-resolution (960×540, 640×360) video settings.
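To make the real-time setting concrete, here is a minimal sketch of how the visible context could be restricted by the question timestamp. The function and variable names are purely illustrative, not part of this repository's code.

    def frames_before_question(frame_paths, fps, question_time_sec):
        """Keep only the frames captured before the question is asked."""
        # Frames are assumed to be extracted at a uniform rate (fps=6 in EgoTextVQA).
        max_index = int(question_time_sec * fps)
        return frame_paths[:max_index]

    # Example: at fps=6, a question asked at t=12.5 s can only see the first 75 frames.
    all_frames = [f"frame_{i:05d}.jpg" for i in range(600)]
    visible_frames = frames_before_question(all_frames, fps=6, question_time_sec=12.5)
    print(len(visible_frames))  # 75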

Dataset Comparison


QA examples of different question categories.


Dataset analysis.


βœ… TODO List

  • Release paper on arxiv.
  • Release dataset.
  • Release model QA and evaluation code.

πŸ“ Evaluation Pipeline

  1. Download Videos:

    • Obtain raw_video. Please download the EgoTextVQA-Outdoor videos from the EgoTextVQA-Outdoor Google Drive, and the EgoTextVQA-Indoor videos either from the EgoSchema GitHub (following their instructions) or from the EgoTextVQA-Indoor Google Drive. The video IDs of our dataset are provided in egotextvqa_outdoor_videoID.json (694 videos) and egotextvqa_indoor_videoID.json (813 videos).

    • πŸ“’ You can download EgoTextVQA videos directly on Hugging Face.

  2. Video Process:

    • Obtain fps6_video. After downloading the raw videos, use video_process/change_video_fps.py to uniformly convert them to fps=6.
    • Obtain fps6_video_high_res and fps6_video_low_res. Each EgoTextVQA-Outdoor video is kept in two versions: the original high-resolution version (1920×1080, 1280×720), and a low-resolution version (960×540, 640×360) produced with video_process/change_video_res.py. For EgoTextVQA-Indoor, no resolution conversion is applied.
    • Obtain fps6_frame. Then use video_process/video2frame.py to split the fps=6 videos into frames for model evaluation.
    • Others. More video-processing scripts used in our experiments can be found in the video_process folder of this repo. (A minimal ffmpeg sketch is provided after the directory layout below.)

    The resulting data structure is as follows:

    root
    ├── data
    │   ├── indoor
    │   │   ├── fps6_frames
    │   │   ├── fps6_videos
    │   │   └── raw_videos
    │   └── outdoor
    │       ├── fps6_frame_high_res
    │       ├── fps6_frame_low_res
    │       ├── fps6_video_high_res
    │       ├── fps6_video_low_res
    │       └── raw_videos
    
  3. QA files:

    Please clone our GitHub Repo.

    git clone https://github.com/zhousheng97/EgoTextVQA.git
    cd EgoTextVQA/data
    
  4. MLLM QA Prompt: Please see mllm_prompt.json.

  5. MLLM Evaluation: Please use gpt_eval.py.
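    For orientation, below is a minimal sketch of a GPT-based scoring call in the spirit of the VideoChatGPT evaluation that our protocol builds on. The prompt wording, model name, and output fields are assumptions; gpt_eval.py is the authoritative implementation.

    # Sketch of GPT-based answer scoring (assumed prompt and fields); see gpt_eval.py
    # for the actual evaluation protocol used in the paper.
    import json
    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY in the environment

    def score_answer(question, ground_truth, prediction, model="gpt-4o-mini"):
        """Ask GPT whether the prediction matches the ground truth and to rate it 0-5."""
        prompt = (
            "You are evaluating a video question answering result.\n"
            f"Question: {question}\n"
            f"Ground-truth answer: {ground_truth}\n"
            f"Predicted answer: {prediction}\n"
            'Reply only with JSON: {"pred": "yes" or "no", "score": an integer from 0 to 5}.'
        )
        resp = client.chat.completions.create(
            model=model,  # model name is an assumption
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        # Note: in practice the returned text may need cleanup before parsing.
        return json.loads(resp.choices[0].message.content)

    # Example (illustrative):
    # result = score_answer("What is the speed limit?", "40 mph", "the sign says 40 mph")
    # print(result["pred"], result["score"])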

πŸ“ˆ Experiment Results

  • Evaluation results of MLLMs on EgoTextVQA-Outdoor with low resolution (960Γ—540, 640Γ—360).


  • Evaluation results of MLLMs on EgoTextVQA-Indoor with resolution (640Γ—360, 480Γ—360).


  • Evaluation results of MLLMs on the real-time QA subset of EgoTextVQA-Outdoor (∼623 QA pairs).


  • Evaluation results of MLLMs on EgoTextVQA-Outdoor with high resolution (1920Γ—1080, 1280Γ—720).


🎨 Result Visualization

Examples on EgoTextVQA-Outdoor.


Examples on EgoTextVQA-Indoor.


πŸ“§ Contact

If you have any questions or suggestions about the dataset, please contact: hzgn97@gmail.com. We are happy to communicate 😊.

✨ Citation

If this work is helpful to you, please consider giving this repository a 🌟 and citing our paper as follows:

@InProceedings{Zhou_2025_CVPR,
    author    = {Zhou, Sheng and Xiao, Junbin and Li, Qingyun and Li, Yicong and Yang, Xun and Guo, Dan and Wang, Meng and Chua, Tat-Seng and Yao, Angela},
    title     = {EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {3363-3373}
}

πŸ’Œ Acknowledgement

We would like to thank the following repos for their great work:

Our dataset is built upon: RoadTextVQA, EgoSchema, and Ego4D.

Our evaluation is built upon: VideoChatGPT.
