⚡ A repository for GPU-free Multimodal Document Retrieval (no local GPU required) 🚀 ⚡
📆[2025-08-25] 🎈 Our paper, code and dataset are released! 🎈
```bash
conda create --name premir python=3.12
conda activate premir
pip install -r requirements.txt
export OPENAI_API_KEY="YOUR-API-KEY"
```
Download and prepare the required datasets:
- PreQs - Pre-generated questions
- PreQ Embeddings - Pre-generated question embeddings
- Eval Query Corpus - Evaluation query corpus
```bash
# (1) Create the directory structure
mkdir -p data/vidoseek
# (2) Place the PreQs
mv vidoseek_VQTQMQ_50.jsonl ./data/vidoseek
# (3) Place the PreQ embeddings
unzip vidoseek.zip -d ./logs/
# (4) Place the eval query corpus
mv test_vidoseek.json ./data/vidoseek
```
This Quick Start step uses the downloaded PreQ embeddings to perform retrieval. The script performs three main operations:
- PreQ Embedding Loading: loads the pre-generated PreQ embeddings from `./logs/`
- Eval Query Embedding Generation: creates embeddings for the evaluation query corpus
- Retrieval Process: performs retrieval using the loaded PreQ embeddings
```bash
cd src
python retrieval.py --dataset_type vidoseek --preq_type VQTQMQ_50
```
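At its core, this retrieval step is a nearest-neighbor search between eval query embeddings and PreQ embeddings, where each PreQ points back to the document page it was generated from. A minimal sketch of that idea (the cosine-similarity formulation and the array/variable names are illustrative assumptions, not the script's actual internals):

```python
import numpy as np

def retrieve(query_embs, preq_embs, preq_page_ids, top_k=3):
    """For each query, rank PreQ embeddings by cosine similarity and
    return the document pages of the top-k matching PreQs."""
    # Normalize rows so the dot product equals cosine similarity
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    p = preq_embs / np.linalg.norm(preq_embs, axis=1, keepdims=True)
    sims = q @ p.T                                # (num_queries, num_preqs)
    top = np.argsort(-sims, axis=1)[:, :top_k]    # best PreQ indices per query
    return [[preq_page_ids[j] for j in row] for row in top]

# Toy example: 4 PreQs spread over pages 0 and 1, 2 eval queries
preqs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
pages = [0, 0, 1, 1]
queries = np.array([[1.0, 0.05], [0.0, 1.0]])
print(retrieve(queries, preqs, pages, top_k=2))  # → [[0, 0], [1, 1]]
```

Because each query matches PreQs rather than pages directly, several top-k hits can point at the same page; the clustering step below deals with that redundancy.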
This script performs question clustering on the retrieved PreQ results to group similar questions together.
- `file_path`: path to the retrieval result file from step 2 (e.g., `../result/vidoseek/vidoseek_VQTQMQ_50_False_None_retrieval_result.csv`)
- `top_k`: hyperparameter that determines the number of retrieved PreQs to be clustered

```bash
python qcluster.py --file_path '/path/to/2/output' --top_k 170 --dataset vidoseek
```
All commands for the Dataset Setting should be run from the `./data` folder.

```bash
cd ./data
```
- ViDoSeek

```bash
huggingface-cli download autumncc/ViDoSeek --repo-type dataset --local-dir .
mkdir -p vidoseek
unzip -o vidoseek_pdf_document.zip -d vidoseek/
mv vidoseek.json ./vidoseek
python ./scripts/pdf2images.py
```
- REAL-MM-RAG-Bench

```bash
python ./scripts/data_construction.py
```
- Structure

```
└── data
    ├── vidoseek
    │   ├── pdf            # not used
    │   │   ├── XXX.pdf
    │   │   └── ...
    │   ├── img
    │   │   ├── XXX.jpg
    │   │   └── ...
    │   └── vidoseek.jsonl
    └── realmmrag
        ├── img
        │   ├── XXX.jpg
        │   └── ...
        └── realmmrag.json
```
All commands for the Offline Phase should be run from the `./src` folder.

```bash
cd ./src
```
(1) Document Parsing (faster with a GPU)

- `lang`: language option: `en`, `korean`, or `ch`

```bash
mineru -p ../data/vidoseek/img \
       -o ../data/vidoseek/parsed \
       --lang en
```
(2) OCR and Captioning

- `lang`: language option: `en`, `korean`, or `ch`

```bash
python parse.py --dataset_type vidoseek \
                --lang en
```
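Since the pipeline is GPU-free, captioning a page image means sending it to an API rather than running a local model. A hedged sketch of how such a request payload can be built for an OpenAI-style chat-completions endpoint, with the image as a base64 data URL (the model name and prompt are placeholders; `parse.py`'s actual prompts and model choice are not shown here):

```python
import base64

def build_caption_request(image_bytes, model="gpt-4o-mini"):
    """Build a chat-completions payload that asks the model to caption
    a page image supplied inline as a base64 data URL."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,  # placeholder model name, not necessarily the repo's
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this document page."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    }

payload = build_caption_request(b"\xff\xd8\xff")  # fake JPEG header bytes
print(payload["messages"][0]["content"][1]["image_url"]["url"][:22])
```

The payload would then be sent via the usual chat-completions client call; only the construction is shown here so it runs without an API key.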
You can choose between two options: (1) the direct API request method, or (2) the batch API request method.

(1) Direct Request Method (utilizing multiprocessing/multithreading)
- `modality_type`: question type to generate (e.g., TQ, VQ, MQ)
- `max_preq_num`: number of preliminary questions to be generated per document page

```bash
python preq_gen.py --request_type "direct_request" \
                   --dataset_type "vidoseek" \
                   --modality_type "MQ" \
                   --max_preq_num 5
```
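The direct-request path fans one question-generation call out per document page over a worker pool. A minimal sketch of that pattern with a thread pool (the `generate_preqs` stub stands in for the actual API call; its name and signature are assumptions for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

def generate_preqs(page_id, modality_type="MQ", max_preq_num=5):
    """Stub standing in for an API call that returns up to
    max_preq_num preliminary questions for one document page."""
    return [f"{modality_type}-{page_id}-{i}" for i in range(max_preq_num)]

def generate_all(page_ids, max_workers=8):
    # Threads overlap the network-bound API calls; map() preserves page order
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(generate_preqs, page_ids))

results = generate_all(range(3))
print(len(results), results[0][0])  # → 3 MQ-0-0
```

Threads suit this workload because each call is I/O-bound waiting on the API; a process pool would add serialization overhead for no benefit.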
(2) Batch Request Method (recommended: results returned within 24 hours, at half the cost)

- When you submit a request as shown below, a BATCH TIME STRING like `25-08-23-14-21-44-knbh` will be output. Record it and use it for the check/pull/cancel operations.
- `modality_type`: question type to generate (e.g., TQ, VQ, MQ)
- `max_preq_num`: number of preliminary questions to be generated per document page

```bash
# (1) Submit the batch job (recording the BATCH TIME STRING is essential)
python preq_gen.py --request_type "batch_request" \
                   --dataset_type "vidoseek" \
                   --modality_type "MQ" \
                   --max_preq_num 5

# (2-1) Check batch job status (expected completion within 24 hours)
python preq_gen.py --request_type "batch_check" \
                   --batch_request_time "25-08-23-14-21-44-knbh"

# (2-2) Cancel the batch job (if needed)
python preq_gen.py --request_type "batch_cancel" \
                   --batch_request_time "25-08-23-14-21-44-knbh"

# (3) Collect batch results
python preq_gen.py --request_type "batch_pull" \
                   --batch_request_time "25-08-23-14-21-44-knbh"
```
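The BATCH TIME STRING appears to be a timestamp plus a short random suffix, which makes each submission uniquely addressable for later check/pull/cancel calls. A hedged sketch of generating such an identifier (the exact format is inferred from the example `25-08-23-14-21-44-knbh`, not taken from the code):

```python
import random
import string
from datetime import datetime

def make_batch_time_string(now=None):
    """Build a 'YY-MM-DD-HH-MM-SS-xxxx' identifier like the one printed
    by the batch request (format inferred from the README's example)."""
    now = now or datetime.now()
    suffix = "".join(random.choices(string.ascii_lowercase, k=4))
    return now.strftime("%y-%m-%d-%H-%M-%S") + "-" + suffix

s = make_batch_time_string(datetime(2025, 8, 23, 14, 21, 44))
print(s[:17])  # → 25-08-23-14-21-44
```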
All commands for the Online Phase should be run from the `./src` folder.

```bash
cd ./src
```
This step prepares the PreQ data for retrieval by merging different question types and setting up the evaluation query corpus.
- Merge PreQ Files: combines the VQ/TQ/MQ jsonl files to create `{benchname}_VQTQMQ_{preQ_num}.jsonl` under `./data/{benchname}`

```bash
python preq_merge.py --dataset_type "vidoseek" \
                     --max_preq_num 5
```

- Setup Evaluation Corpus: place the evaluation query corpus `test_{benchname}.json` under `./data/{benchname}`
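The merge step amounts to concatenating the per-modality jsonl files into one combined file. A minimal sketch (the output name follows the `{benchname}_VQTQMQ_{preQ_num}.jsonl` pattern stated above; the per-modality input file names and record fields are assumptions):

```python
import json
import tempfile
from pathlib import Path

def merge_preq_files(bench_dir, benchname, preq_num, modalities=("VQ", "TQ", "MQ")):
    """Concatenate per-modality jsonl files into one
    {benchname}_VQTQMQ_{preq_num}.jsonl (input names are assumed)."""
    bench_dir = Path(bench_dir)
    out = bench_dir / f"{benchname}_VQTQMQ_{preq_num}.jsonl"
    with out.open("w") as fout:
        for m in modalities:
            for line in (bench_dir / f"{benchname}_{m}_{preq_num}.jsonl").open():
                fout.write(line if line.endswith("\n") else line + "\n")
    return out

# Demo in a temp dir: one record per modality
d = Path(tempfile.mkdtemp())
for m in ("VQ", "TQ", "MQ"):
    (d / f"vidoseek_{m}_5.jsonl").write_text(json.dumps({"type": m}) + "\n")
merged = merge_preq_files(d, "vidoseek", 5)
print(merged.name, len(merged.read_text().splitlines()))  # → vidoseek_VQTQMQ_5.jsonl 3
```

Concatenation (rather than deduplication) keeps every generated question available to the retriever; redundancy is handled later by clustering.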
This script performs three main operations:
- PreQ Embedding Generation: creates embeddings for the PreQ (VQ+TQ+MQ) data
- Eval Query Embedding Generation: creates embeddings for the evaluation query corpus
- Retrieval Process: performs the actual retrieval using the generated embeddings

```bash
python retrieval.py --dataset_type vidoseek --preq_type VQTQMQ_50
```
This script performs question clustering on the retrieved PreQ results.
- `file_path`: path to the retrieval result file from step 3-2 (e.g., `../result/{benchname}/vidoseek_VQTQMQ_50_False_None_retrieval_result.csv`)
- `top_k`: hyperparameter that determines the number of retrieved PreQs to be clustered per query

```bash
python qcluster.py --file_path '/path/to/2/output' --top_k 170 --dataset vidoseek
```
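Clustering groups near-duplicate retrieved questions so that one dominant page does not crowd out the others. The repo's actual clustering algorithm is not described here; as an illustration only, a greedy single-pass scheme over question embeddings (the threshold, centroid update, and function name are all assumptions):

```python
import numpy as np

def cluster_questions(embs, threshold=0.9):
    """Greedy single-pass clustering: assign each question to the first
    cluster whose centroid is similar enough, else start a new cluster."""
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    centroids, clusters = [], []
    for i, e in enumerate(embs):
        for c, centroid in enumerate(centroids):
            if float(e @ centroid) >= threshold:
                clusters[c].append(i)
                # Re-center the cluster on the renormalized mean of its members
                m = embs[clusters[c]].mean(axis=0)
                centroids[c] = m / np.linalg.norm(m)
                break
        else:
            centroids.append(e)
            clusters.append([i])
    return clusters

# Two near-duplicate questions and one distinct question
embs = np.array([[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]])
print(cluster_questions(embs))  # → [[0, 1], [2]]
```

With `top_k=170` as in the command above, such a pass would collapse the 170 retrieved PreQs per query into a much smaller set of question groups.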
We would like to thank MinerU, an open-source document parsing project that facilitated parts of this work.