This repository contains the dataset and evaluation script for CapRetrieval, introduced in the paper Dense Retrievers Can Fail on Simple Queries: Revealing The Granularity Dilemma of Embeddings.
The dataset is also available on Huggingface.
CapRetrieval evaluates fine-grained embedding matching, tailored towards a practical image search scenario in Chinese via dense passage retrieval:
- Candidate passages are image captions, and queries are short phrases of entities or events reflected in captions.
- Overall, the dataset comprises seemingly simple queries and captions; however, text encoders exhibit clear limitations in resolving these cases.
- Evaluation results call for attention to embedding training strategies across different granularities.
CapRetrievalEn is a direct translation of CapRetrieval to English. It is provided for reference only, as the original relevance labels may not hold exactly after translation due to differing language traits.
CapRetrieval follows the same retrieval task format as in MTEB, with relevance labels in {0, 1, 2} for each query-passage pair.
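As a hypothetical miniature of this layout (ids and texts are invented for illustration, not taken from the dataset):

```python
# Hypothetical miniature of the MTEB-style retrieval layout (illustrative only):
corpus = {"c1": "夕阳下的海滩，游客在拍照", "c2": "山顶的日出云海"}  # caption id -> caption text
queries = {"q1": "海边日落"}                                        # query id -> query text
qrels = {"q1": {"c1": 2, "c2": 0}}                                  # graded relevance per query-caption pair
```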
A small number of queries do not have any relevant captions; they are excluded from the computation of retrieval metrics (e.g. nDCG), but can be useful for other analyses, e.g. in a classification setting.
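For illustration, a minimal sketch of graded nDCG@k with this exclusion (run.py may rely on a library implementation or a different gain formulation):

```python
import math

def ndcg_at_k(ranked_ids, qrels, k=10):
    """nDCG@k for one query; qrels maps caption id -> graded relevance label."""
    # Discounted cumulative gain over the top-k retrieved captions (linear gain).
    dcg = sum(qrels.get(cid, 0) / math.log2(rank + 2)
              for rank, cid in enumerate(ranked_ids[:k]))
    # Ideal DCG: the same discount applied to labels sorted in decreasing order.
    ideal = sorted(qrels.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    # Queries with no relevant captions yield idcg == 0 and are skipped
    # upstream, matching the exclusion described above.
    return dcg / idcg if idcg > 0 else 0.0
```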
run.py is a general script to evaluate embedding retrieval for various encoders. Results and embeddings will be saved under a new evaluation directory.
Install PyTorch according to your local environment, then pip install -r requirements.txt
See available options with python run.py --help
By default, the script automatically selects the most appropriate device; you can also set --device_map explicitly. Embeddings are cached and reused across runs.
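The device fallback presumably follows the usual PyTorch preference order; a minimal sketch of such a default (run.py's actual logic may differ):

```python
import torch

def default_device() -> str:
    # Mirrors the "most appropriate device" default described above:
    # CUDA first, then Apple MPS, falling back to CPU.
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"
```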
Current Options
options:
-h, --help show this help message and exit
--dataset DATASET Dataset name
--lang {en,zh} Dataset language (for BM25)
--mode {dense,bm25} Search mode
--model MODEL HF model name or path
--device_map DEVICE_MAP
Set model device map explicitly
--max_len MAX_LEN Max seq length
--pooling {cls,mean,last,use_sentence_transformer}
Encoder pooling style
--disable_normalization
Disable embedding normalization
--query_template QUERY_TEMPLATE
Prompt template for query
--candidate_template CANDIDATE_TEMPLATE
Prompt template for candidate
--padding_side {left,right}
Tokenizer padding side
--threshold THRESHOLD
Use results under distance threshold for evaluation
--topk TOPK Use top k results for evaluation
--batch_size BATCH_SIZE
Eval batch size
--result_path RESULT_PATH
Compute metrics of existing results directly
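For the raw --pooling styles (cls, mean, last), the following minimal sketch shows what each typically computes over the encoder's token states (illustrative only; run.py's implementation details may differ):

```python
import torch

def pool(last_hidden: torch.Tensor, attention_mask: torch.Tensor, style: str) -> torch.Tensor:
    # last_hidden: (batch, seq_len, dim); attention_mask: (batch, seq_len)
    if style == "cls":
        return last_hidden[:, 0]  # embedding of the first ([CLS]) token
    if style == "mean":
        # Average over non-padding tokens only.
        mask = attention_mask.unsqueeze(-1).float()
        return (last_hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
    if style == "last":
        # Index of the last non-padding token (assumes right padding;
        # with --padding_side left, this is simply position -1).
        idx = attention_mask.sum(dim=1) - 1
        return last_hidden[torch.arange(last_hidden.size(0)), idx]
    raise ValueError(f"unknown pooling style: {style}")
```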
Output Example
Search: 100%|██████████| 404/404 [00:00<00:00, 5315.29it/s]
Metrics for dataset CapRetrieval:
Query evaluation: reciprocal_rank @top10 = 88.70
Query evaluation: average_precision @top10 = 82.91
Query evaluation: ndcg @top10 = 78.86
Query evaluation: hit @top10 = 92.08
Query evaluation: query_precision = 38.22
Query evaluation: query_recall = 68.71
Query evaluation: pair_precision = 38.22
Query evaluation: pair_recall = 32.97
Query evaluation: query_f1 = 49.12
Query evaluation: query_f2 = 59.25
Query evaluation: pair_f1 = 35.40
Query evaluation: pair_f2 = 33.90
Saved 404 query results to evaluation/results.CapRetrieval.bge-base-zh-v1.5.top10.json
Saved report to evaluation/report.CapRetrieval.bge-base-zh-v1.5.top10.json
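In dense mode, the search step above presumably amounts to a dot product between normalized query and caption embeddings followed by top-k selection; a minimal sketch (function names are illustrative, not run.py's API):

```python
import torch

def dense_search(query_emb: torch.Tensor, cand_emb: torch.Tensor, topk: int = 10):
    """Rank candidate captions for each query by cosine similarity.

    Assumes both embedding matrices are L2-normalized (the script's
    default unless --disable_normalization is set), so the dot product
    equals cosine similarity.
    """
    scores = query_emb @ cand_emb.T          # (num_queries, num_candidates)
    top = torch.topk(scores, k=topk, dim=1)  # best `topk` captions per query
    return top.indices, top.values
```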
Evaluate BM25:
python run.py --dataset CapRetrieval --topk 10 --mode bm25 --lang zh
Evaluate BGE encoders using CLS pooling (default pooling):
python run.py --dataset CapRetrieval --topk 10 --model BAAI/bge-base-zh-v1.5
Evaluate GTE multilingual model using CLS pooling:
python run.py --dataset CapRetrieval --topk 10 --model Alibaba-NLP/gte-multilingual-base
Evaluate Conan-v1 encoder using default SentenceTransformers setup:
python run.py --dataset CapRetrieval --topk 10 --model TencentBAC/Conan-embedding-v1 --pooling use_sentence_transformer
Evaluate E5 encoders using mean pooling, with suggested prompt templates:
python run.py --dataset CapRetrieval --topk 10 --model intfloat/multilingual-e5-base --pooling mean --max_len 512 --query_template "query: {text}" --candidate_template "passage: {text}"
Evaluate GTE-Qwen encoders using last token pooling, with the corresponding prompt templates:
python run.py --dataset CapRetrieval --topk 10 --model Alibaba-NLP/gte-Qwen2-7B-instruct --pooling last --query_template "Instruct: Given an image search query, retrieve relevant image captions\nQuery: {text}" --batch_size 8
Evaluate Qwen3 embedding models using last token pooling, with the corresponding prompt templates:
python run.py --dataset CapRetrieval --topk 10 --model Qwen/Qwen3-Embedding-8B --padding_side left --pooling last --query_template "Instruct: Given an image search query, retrieve relevant image captions\nQuery: {text}" --batch_size 8
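In the templates above, the {text} placeholder stands for the raw query or caption; assuming plain str.format-style substitution (the actual mechanism in run.py is an assumption), the E5 templates would expand as:

```python
# Illustration of how the E5-style templates wrap the text before encoding:
query_template = "query: {text}"
candidate_template = "passage: {text}"

print(query_template.format(text="海边日落"))          # query: 海边日落
print(candidate_template.format(text="夕阳下的海滩"))  # passage: 夕阳下的海滩
```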
| Type | Model | nDCG@10 |
|---|---|---|
| BM25 | Basic BM25 | 66.54 |
| 0.1B | bge-base-zh-v1.5 | 78.86 |
| | gte-multilingual-base | 79.67 |
| | multilingual-e5-base | 76.33 |
| 0.3B | bge-large-zh-v1.5 | 79.15 |
| | multilingual-e5-large | 81.01 |
| | Conan-embedding-v1 | 77.04 |
| 0.6B | Qwen3-Embedding-0.6B | 81.04 |
| >1B | gte-Qwen2-1.5B-instruct | 77.35 |
| | gte-Qwen2-7B-instruct | 86.55 |
| | e5-mistral-7b-instruct | 76.40 |
| | Qwen3-Embedding-8B | 84.61 |
| Trained | Out-of-Domain | 87.23 |
| | In-Domain | 91.83 |
The trained models (based on bge-base-zh-v1.5) are trained with queries produced by our data generation strategies described in the paper. The in-domain model can be downloaded from Google Drive.
The dataset and trained models are licensed under Apache 2.0.
@misc{xu2025denseretrieversfailsimple,
title={Dense Retrievers Can Fail on Simple Queries: Revealing The Granularity Dilemma of Embeddings},
author={Liyan Xu and Zhenlin Su and Mo Yu and Jiangnan Li and Fandong Meng and Jie Zhou},
year={2025},
eprint={2506.08592},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2506.08592},
}