[EMNLP 2025] The official implementation of "Zero-shot Multimodal Document Retrieval via Cross-Modal Question Generation"

PREMIR-Logo

🔎 PREMIR 🔍

⚡ A repository for GPU-free Multimodal Document Retrieval (no local GPU required) 🚀 ⚡

🔥 Updates

📆[2025-08-25] 🎈 Our paper, code and dataset are released! 🎈

⚙️ Installation

conda create --name premir python=3.12
conda activate premir
pip install -r requirements.txt
export OPENAI_API_KEY="YOUR-API-KEY"

🚀 Quick Start (with ViDoSeek)

1. Dataset Setup

Download and prepare the required datasets:

Download Links:

Setup Commands:

# (1) Create directory structure
mkdir -p data/vidoseek

# (2) Place PreQs
mv vidoseek_VQTQMQ_50.jsonl ./data/vidoseek

# (3) Place PreQ embeddings
unzip vidoseek.zip -d ./logs/

# (4) Place eval query corpus
mv test_vidoseek.json ./data/vidoseek

2. Retrieval

In the Quick Start, this step uses the downloaded PreQ embeddings to perform retrieval. The script performs three main operations:

  1. PreQ Embedding Loading: Loads the pre-generated PreQ embeddings from ./logs/
  2. Eval Query Embedding Generation: Creates embeddings for evaluation query corpus
  3. Retrieval Process: Performs retrieval using the loaded PreQ embeddings
cd src
python retrieval.py --dataset_type vidoseek --preq_type VQTQMQ_50

3. Q-clustering

This script performs question clustering on the retrieved PreQ results to group similar questions together.

  • file_path: Path to the retrieval result file from step 2 (e.g., ../result/vidoseek/vidoseek_VQTQMQ_50_False_None_retrieval_result.csv)
  • top_k: Hyperparameter that determines the number of retrieved PreQs to be clustered
python qcluster.py --file_path '/path/to/2/output' --top_k 170 --dataset vidoseek

🔎 PREMIR

1. Dataset Setting

All commands for the Dataset Setting should be run from the ./data folder.

cd ./data
  1. ViDoSeek

    huggingface-cli download autumncc/ViDoSeek --repo-type dataset --local-dir .
    mkdir -p vidoseek
    unzip -o vidoseek_pdf_document.zip -d vidoseek/
    mv vidoseek.json ./vidoseek
    python ./scripts/pdf2images.py
  2. REAL-MM-RAG-Bench

    python ./scripts/data_construction.py
    
  3. Structure

    └── data 
        ├── vidoseek 
        │   ├── pdf # not used
        │   │   ├── XXX.pdf
        │   │   └── ...
        │   ├── img
        │   │   ├── XXX.jpg
        │   │   └── ...
        │   └── vidoseek.jsonl
        └── realmmrag
            ├── img
            │   ├── XXX.jpg
            │   └── ...
            └── realmmrag.json
    

2. Offline Phase

All commands for the Offline Phase should be run from the ./src folder.

cd ./src

2-1. Document Parsing & OCR

(1) Document Parsing (faster with a GPU)

  • lang: Language option: 'en', 'korean', or 'ch'
    mineru -p ../data/vidoseek/img \
        -o ../data/vidoseek/parsed \
        --lang en

(2) OCR and Captioning

  • lang: Language option: 'en', 'korean', or 'ch'
    python parse.py --dataset_type vidoseek \
        --lang en

2-2. PreQ Generation

You can choose between two options: (1) the direct API request method or (2) the batch API request method.

(1) Direct Request Method (utilizing multiprocessing/multithreading)

  • modality_type: Question type to generate (e.g., TQ, VQ, MQ)
  • max_preq_num: Number of preliminary questions to be generated per document page
    python preq_gen.py --request_type "direct_request" \
        --dataset_type "vidoseek" \
        --modality_type "MQ" \
        --max_preq_num 5
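The direct request path simply fans one API call out per document page over a worker pool. A minimal sketch of that pattern, with a hypothetical `generate_preqs` stub standing in for the real OpenAI call made by `preq_gen.py`:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stub: the real script sends the page content to the OpenAI API.
def generate_preqs(page_id, modality, max_preq_num):
    return [f"{modality}-{page_id}-q{i}" for i in range(max_preq_num)]

def generate_all(page_ids, modality="MQ", max_preq_num=5, workers=8):
    """Fan one generation task out per document page over a thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(generate_preqs, p, modality, max_preq_num)
                   for p in page_ids]
        return [f.result() for f in futures]  # preserves page order

preqs = generate_all(["p001", "p002", "p003"])
```

Threads suit I/O-bound API calls; the actual script may combine multiprocessing and multithreading as noted above.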

(2) Batch Request Method (recommended: results within 24 hours at half the cost)

  • When you submit a request as shown below, a BATCH TIME STRING such as 25-08-23-14-21-44-knbh will be printed. Record it; it is needed for the check/pull/cancel operations.

  • modality_type: Question type to generate (e.g., TQ, VQ, MQ)

  • max_preq_num: Number of preliminary questions to be generated per document page

    # (1) Batch job request (recording the BATCH TIME STRING is essential)
    python preq_gen.py --request_type "batch_request" \
        --dataset_type "vidoseek" \
        --modality_type "MQ" \
        --max_preq_num 5
    
    # (2-1) Check batch job status (Expected completion within 24 hours)
    python preq_gen.py --request_type "batch_check" \
        --batch_request_time "25-08-23-14-21-44-knbh"
    
    # (2-2) Cancel batch job (if needed)
    python preq_gen.py --request_type "batch_cancel" \
        --batch_request_time "25-08-23-14-21-44-knbh"
    
    # (3) Collect batch results
    python preq_gen.py --request_type "batch_pull" \
        --batch_request_time "25-08-23-14-21-44-knbh"
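Internally, a batch request amounts to uploading a `.jsonl` file with one request object per page. A minimal sketch of building those lines, assuming the documented OpenAI Batch API input shape (`custom_id`/`method`/`url`/`body`); the model name and prompt are placeholders, not the script's actual values:

```python
import json

def build_batch_lines(page_prompts, model="gpt-4o-mini", max_preq_num=5):
    """Build one Batch API request line per document page."""
    lines = []
    for page_id, prompt in page_prompts.items():
        request = {
            "custom_id": page_id,  # used to match results back to pages
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{
                    "role": "user",
                    "content": f"{prompt} Generate {max_preq_num} questions.",
                }],
            },
        }
        lines.append(json.dumps(request))
    return lines

batch = build_batch_lines({"p001": "Here is a document page."})
```

The resulting lines are written to a file, uploaded, and referenced when creating the batch job; results arrive within the 24-hour completion window.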

3. Online Phase

All commands for the Online Phase should be run from the ./src folder.

cd ./src

3-1. PreQ Data Setting

This step prepares the PreQ data for retrieval by merging different question types and setting up the evaluation query corpus.

  1. Merge PreQ Files: Combines VQ/TQ/MQ jsonl files to create {benchname}_VQTQMQ_{preQ_num}.jsonl under ./data/{benchname}
    python preq_merge.py --dataset_type "vidoseek" \
        --max_preq_num 5
  2. Setup Evaluation Corpus: Place the evaluation query corpus test_{benchname}.json under ./data/{benchname}
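Conceptually, the merge in step 1 concatenates the per-modality `.jsonl` files into a single file. A minimal sketch (the file names and record fields here are illustrative, not `preq_merge.py`'s actual interface):

```python
import json
import tempfile
from pathlib import Path

def merge_preq_files(paths, out_path):
    """Concatenate per-modality PreQ .jsonl files into one merged .jsonl."""
    merged = []
    for path in paths:
        with open(path) as f:
            merged.extend(json.loads(line) for line in f if line.strip())
    with open(out_path, "w") as f:
        for record in merged:
            f.write(json.dumps(record) + "\n")
    return len(merged)

# Tiny demo with throwaway files: 2 VQ records + 1 TQ record -> 3 merged.
tmp = Path(tempfile.mkdtemp())
for name, count in [("VQ", 2), ("TQ", 1)]:
    (tmp / f"{name}.jsonl").write_text(
        "".join(json.dumps({"modality": name, "q": i}) + "\n"
                for i in range(count)))
n_merged = merge_preq_files([tmp / "VQ.jsonl", tmp / "TQ.jsonl"],
                            tmp / "merged.jsonl")
```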

3-2. Retrieval

This script performs three main operations:

  1. PreQ Embedding Generation: Creates embeddings for PreQ (VQ+TQ+MQ) data

  2. Eval Query Embedding Generation: Creates embeddings for evaluation query corpus

  3. Retrieval Process: Performs the actual retrieval using the generated embeddings

    python retrieval.py --dataset_type vidoseek --preq_type VQTQMQ_50
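At its core, the retrieval step scores each evaluation query embedding against every PreQ embedding and keeps the closest matches. A minimal cosine-similarity sketch (the embeddings below are toy values, not real model outputs):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query_emb, preq_embs, top_k=2):
    """Return the ids of the top_k PreQ embeddings closest to the query."""
    ranked = sorted(preq_embs.items(),
                    key=lambda kv: cosine(query_emb, kv[1]), reverse=True)
    return [preq_id for preq_id, _ in ranked[:top_k]]

preq_embs = {"q1": [1.0, 0.0], "q2": [0.9, 0.1], "q3": [0.0, 1.0]}
top = retrieve([1.0, 0.0], preq_embs, top_k=2)  # q1 and q2 point the same way
```

Each retrieved PreQ is tied to the document page it was generated from, so ranking PreQs ranks pages.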

3-3. Q-clustering

This script performs question clustering on the retrieved PreQ results.

  • file_path: Path to the retrieval result file from step 3-2 (e.g., ../result/{benchname}/vidoseek_VQTQMQ_50_False_None_retrieval_result.csv)

  • top_k: Hyperparameter that determines the number of retrieved PreQs to be clustered per query

    python qcluster.py --file_path '/path/to/2/output' --top_k 170 --dataset vidoseek
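The clustering itself lives in `qcluster.py`; as an illustration of the idea (grouping near-duplicate retrieved questions by embedding similarity), here is a greedy single-pass sketch with toy embeddings and a hypothetical similarity threshold:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def greedy_cluster(questions, embeddings, threshold=0.9):
    """Assign each question to the first cluster whose seed is similar enough."""
    clusters = []  # list of (seed_embedding, member_questions)
    for question, emb in zip(questions, embeddings):
        for seed, members in clusters:
            if cosine(emb, seed) >= threshold:
                members.append(question)
                break
        else:  # no cluster was close enough: start a new one
            clusters.append((emb, [question]))
    return [members for _, members in clusters]

groups = greedy_cluster(
    ["what year?", "which year?", "who wrote it?"],
    [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]])
```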

👏 Acknowledgements

We would like to thank MinerU, an open-source document parsing project that facilitated parts of this work.
