[EMNLP 2025] The official implementation of "Zero-shot Multimodal Document Retrieval via Cross-Modal Question Generation"

PREMIR-Logo

🔎 PREMIR 🔍

⚡ A repository for GPU-free Multimodal Document Retrieval (no local GPU required) 🚀 ⚡

🔥 Updates

📆[2025-08-25] 🎈 Our paper, code and dataset are released! 🎈

⚙️ Installation

conda create --name premir python=3.12
conda activate premir
pip install -r requirements.txt
export OPENAI_API_KEY="YOUR-API-KEY"

🚀 Quick Start (with ViDoSeek)

1. Dataset Setup

Download and prepare the required datasets:

Download Links:

Setup Commands:

# (1) Create directory structure
mkdir -p data/vidoseek

# (2) Place PreQs
mv vidoseek_VQTQMQ_50.jsonl ./data/vidoseek

# (3) Place PreQ embeddings
unzip vidoseek.zip -d ./logs/

# (4) Place eval query corpus
mv test_vidoseek.json ./data/vidoseek

2. Retrieval

In the Quick Start, this step uses the downloaded PreQ embeddings to perform retrieval. The script performs three main operations:

  1. PreQ Embedding Loading: Loads the pre-generated PreQ embeddings from ./logs/
  2. Eval Query Embedding Generation: Creates embeddings for evaluation query corpus
  3. Retrieval Process: Performs retrieval using the loaded PreQ embeddings
cd src
python retrieval.py --dataset_type vidoseek --preq_type VQTQMQ_50

3. Q-clustering

This script performs question clustering on the retrieved PreQ results to group similar questions together.

  • file_path: Path to the retrieval result file from step 2 (e.g., ../result/vidoseek/vidoseek_VQTQMQ_50_False_None_retrieval_result.csv)
  • top_k: Hyperparameter that determines the number of retrieved PreQs to be clustered
python qcluster.py --file_path '/path/to/2/output' --top_k 170 --dataset vidoseek

🔎 PREMIR

1. Dataset Setting

All commands for the Dataset Setting should be run from the ./data folder.

cd ./data
  1. ViDoSeek

    huggingface-cli download autumncc/ViDoSeek --repo-type dataset --local-dir .
    mkdir -p vidoseek
    unzip -o vidoseek_pdf_document.zip -d vidoseek/
    mv vidoseek.json ./vidoseek
    python ./scripts/pdf2images.py
  2. REAL-MM-RAG-Bench

    python ./scripts/data_construction.py
    
  3. Structure

    └── data 
        ├── vidoseek 
        │   ├── pdf # not used
        │   │   ├── XXX.pdf
        │   │   └── ...
        │   ├── img
        │   │   ├── XXX.jpg
        │   │   └── ...
        │   └── vidoseek.jsonl
        └── realmmrag
            ├── img
            │   ├── XXX.jpg
            │   └── ...
            └── realmmrag.json
    

2. Offline Phase

All commands for the Offline Phase should be run from the ./src folder.

cd ./src

2-1. Document Parsing & OCR

(1) Document Parsing (faster with a GPU)

  • lang: Language option: 'en', 'korean', or 'ch'
    mineru -p ../data/vidoseek/img \
        -o ../data/vidoseek/parsed \
        --lang en

(2) OCR and Captioning

  • lang: Language option: 'en', 'korean', or 'ch'
    python parse.py --dataset_type vidoseek \
        --lang en

2-2. PreQ Generation

You can choose between two options: (1) the direct API request method or (2) the batch API request method.

(1) Direct Request Method (utilizing multiprocessing/multithreading)

  • modality_type: Question type to generate (e.g., TQ, VQ, MQ)
  • max_preq_num: Number of preliminary questions to be generated per document page
    python preq_gen.py --request_type "direct_request" \
        --dataset_type "vidoseek" \
        --modality_type "MQ" \
        --max_preq_num 5
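The direct request path simply fans one API call out per document page over a worker pool. A minimal sketch of that pattern, with a hypothetical `generate_preqs` stub standing in for the real OpenAI call made by `preq_gen.py`:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stub: the real script sends the page content to the OpenAI API.
def generate_preqs(page_id, modality, max_preq_num):
    return [f"{modality}-{page_id}-q{i}" for i in range(max_preq_num)]

def generate_all(page_ids, modality="MQ", max_preq_num=5, workers=8):
    """Fan one generation task out per document page over a thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(generate_preqs, p, modality, max_preq_num)
                   for p in page_ids]
        return [f.result() for f in futures]  # preserves page order

preqs = generate_all(["p001", "p002", "p003"])
```

Threads suit I/O-bound API calls; the actual script may combine multiprocessing and multithreading as noted above.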

(2) Batch Request Method (recommended: results within 24 hours at half the cost)

  • When you submit a request as shown below, a BATCH TIME STRING such as 25-08-23-14-21-44-knbh will be printed. Record it; it is needed for the check/pull/cancel operations.

  • modality_type: Question type to generate (e.g., TQ, VQ, MQ)

  • max_preq_num: Number of preliminary questions to be generated per document page

    # (1) Batch job request (recording the BATCH TIME STRING is essential)
    python preq_gen.py --request_type "batch_request" \
        --dataset_type "vidoseek" \
        --modality_type "MQ" \
        --max_preq_num 5
    
    # (2-1) Check batch job status (Expected completion within 24 hours)
    python preq_gen.py --request_type "batch_check" \
        --batch_request_time "25-08-23-14-21-44-knbh"
    
    # (2-2) Cancel batch job (if needed)
    python preq_gen.py --request_type "batch_cancel" \
        --batch_request_time "25-08-23-14-21-44-knbh"
    
    # (3) Collect batch results
    python preq_gen.py --request_type "batch_pull" \
        --batch_request_time "25-08-23-14-21-44-knbh"
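Internally, a batch request amounts to uploading a `.jsonl` file with one request object per page. A minimal sketch of building those lines, assuming the documented OpenAI Batch API input shape (`custom_id`/`method`/`url`/`body`); the model name and prompt are placeholders, not the script's actual values:

```python
import json

def build_batch_lines(page_prompts, model="gpt-4o-mini", max_preq_num=5):
    """Build one Batch API request line per document page."""
    lines = []
    for page_id, prompt in page_prompts.items():
        request = {
            "custom_id": page_id,  # used to match results back to pages
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{
                    "role": "user",
                    "content": f"{prompt} Generate {max_preq_num} questions.",
                }],
            },
        }
        lines.append(json.dumps(request))
    return lines

batch = build_batch_lines({"p001": "Here is a document page."})
```

The resulting lines are written to a file, uploaded, and referenced when creating the batch job; results arrive within the 24-hour completion window.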

3. Online Phase

All commands for the Online Phase should be run from the ./src folder.

cd ./src

3-1. PreQ Data Setting

This step prepares the PreQ data for retrieval by merging different question types and setting up the evaluation query corpus.

  1. Merge PreQ Files: Combines VQ/TQ/MQ jsonl files to create {benchname}_VQTQMQ_{preQ_num}.jsonl under ./data/{benchname}
    python preq_merge.py --dataset_type "vidoseek" \
        --max_preq_num 5
  2. Setup Evaluation Corpus: Place the evaluation query corpus test_{benchname}.json under ./data/{benchname}
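Conceptually, the merge in step 1 concatenates the per-modality `.jsonl` files into a single file. A minimal sketch (the file names and record fields here are illustrative, not `preq_merge.py`'s actual interface):

```python
import json
import tempfile
from pathlib import Path

def merge_preq_files(paths, out_path):
    """Concatenate per-modality PreQ .jsonl files into one merged .jsonl."""
    merged = []
    for path in paths:
        with open(path) as f:
            merged.extend(json.loads(line) for line in f if line.strip())
    with open(out_path, "w") as f:
        for record in merged:
            f.write(json.dumps(record) + "\n")
    return len(merged)

# Tiny demo with throwaway files: 2 VQ records + 1 TQ record -> 3 merged.
tmp = Path(tempfile.mkdtemp())
for name, count in [("VQ", 2), ("TQ", 1)]:
    (tmp / f"{name}.jsonl").write_text(
        "".join(json.dumps({"modality": name, "q": i}) + "\n"
                for i in range(count)))
n_merged = merge_preq_files([tmp / "VQ.jsonl", tmp / "TQ.jsonl"],
                            tmp / "merged.jsonl")
```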

3-2. Retrieval

This script performs three main operations:

  1. PreQ Embedding Generation: Creates embeddings for PreQ (VQ+TQ+MQ) data

  2. Eval Query Embedding Generation: Creates embeddings for evaluation query corpus

  3. Retrieval Process: Performs the actual retrieval using the generated embeddings

    python retrieval.py --dataset_type vidoseek --preq_type VQTQMQ_50
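At its core, the retrieval step scores each evaluation query embedding against every PreQ embedding and keeps the closest matches. A minimal cosine-similarity sketch (the embeddings below are toy values, not real model outputs):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query_emb, preq_embs, top_k=2):
    """Return the ids of the top_k PreQ embeddings closest to the query."""
    ranked = sorted(preq_embs.items(),
                    key=lambda kv: cosine(query_emb, kv[1]), reverse=True)
    return [preq_id for preq_id, _ in ranked[:top_k]]

preq_embs = {"q1": [1.0, 0.0], "q2": [0.9, 0.1], "q3": [0.0, 1.0]}
top = retrieve([1.0, 0.0], preq_embs, top_k=2)  # q1 and q2 point the same way
```

Each retrieved PreQ is tied to the document page it was generated from, so ranking PreQs ranks pages.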

3-3. Q-clustering

This script performs question clustering on the retrieved PreQ results.

  • file_path: Path to the retrieval result file from step 3-2 (e.g., ../result/{benchname}/vidoseek_VQTQMQ_50_False_None_retrieval_result.csv)

  • top_k: Hyperparameter that determines the number of retrieved PreQs to be clustered per query

    python qcluster.py --file_path '/path/to/2/output' --top_k 170 --dataset vidoseek
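The clustering itself lives in `qcluster.py`; as an illustration of the idea (grouping near-duplicate retrieved questions by embedding similarity), here is a greedy single-pass sketch with toy embeddings and a hypothetical similarity threshold:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def greedy_cluster(questions, embeddings, threshold=0.9):
    """Assign each question to the first cluster whose seed is similar enough."""
    clusters = []  # list of (seed_embedding, member_questions)
    for question, emb in zip(questions, embeddings):
        for seed, members in clusters:
            if cosine(emb, seed) >= threshold:
                members.append(question)
                break
        else:  # no cluster was close enough: start a new one
            clusters.append((emb, [question]))
    return [members for _, members in clusters]

groups = greedy_cluster(
    ["what year?", "which year?", "who wrote it?"],
    [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]])
```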

👏 Acknowledgements

We would like to thank MinerU, an open-source document parsing project that facilitated parts of this work.
