CapRetrieval

This repository contains the dataset and evaluation script for CapRetrieval, introduced in the paper Dense Retrievers Can Fail on Simple Queries: Revealing The Granularity Dilemma of Embeddings.

The dataset is also available on Huggingface.

Dataset

CapRetrieval evaluates fine-grained embedding matching, tailored to a practical image search scenario in Chinese via dense passage retrieval:

  • Candidate passages are image captions, and queries are short phrases of entities or events reflected in captions.
  • Overall, the dataset comprises seemingly simple queries and captions; however, text encoders are shown to have limitations in resolving these cases.
  • Evaluation results call attention to embedding training strategies at different granularities.

CapRetrievalEn is a direct translation of CapRetrieval into English. It is provided for reference only, since the original labels may not hold after translation due to differing language traits.

Format

CapRetrieval follows the same retrieval task format as in MTEB, with relevance labels in $\{0, 1, 2\}$ for each pair. Note that unlike prior datasets, we annotate full labels for each query-passage pair (1.3 million pairs), minimizing false negatives for more accurate evaluation.

A small number of queries do not have any relevant captions; they are excluded from the computation of retrieval metrics (e.g., nDCG), but can be useful for other analyses, e.g., in a classification setting.
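
As a concrete illustration of this split, the sketch below partitions queries by whether they have any positive label; the qrels.json filename and the query_id -> {passage_id: label} layout are illustrative assumptions, not the repo's exact files.

import json

# Assumed layout: query_id -> {passage_id: graded label in {0, 1, 2}}
with open("qrels.json", encoding="utf-8") as f:
    qrels = json.load(f)

# Queries with at least one relevant caption (label > 0) enter retrieval
# metrics such as nDCG; those without can still serve classification analyses.
with_pos = [q for q, labels in qrels.items() if any(v > 0 for v in labels.values())]
without_pos = [q for q in qrels if q not in set(with_pos)]
print(f"{len(with_pos)} queries with positives, {len(without_pos)} without")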

Evaluation Script

run.py is a general script to evaluate embedding retrieval with various encoders.

Results and embeddings will be saved under a new evaluation directory.

Environment

Install PyTorch according to your local environment, then pip install -r requirements.txt

Usage

See all options with python run.py --help

By default, the script automatically uses the most appropriate device; you can also set device_map explicitly. Embeddings are cached and reused across runs.
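
For reference, here is a sketch of what "most appropriate device" likely amounts to; run.py's actual logic may differ, and --device_map overrides it either way.

import torch

def pick_device() -> str:
    if torch.cuda.is_available():
        return "cuda"  # NVIDIA GPU
    if torch.backends.mps.is_available():
        return "mps"   # Apple Silicon GPU
    return "cpu"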

Current Options
options:
  -h, --help            show this help message and exit
  --dataset DATASET     Dataset name
  --lang {en,zh}        Dataset language (for BM25)
  --mode {dense,bm25}   Search mode
  --model MODEL         HF model name or path
  --device_map DEVICE_MAP
                        Set model device map explicitly
  --max_len MAX_LEN     Max seq length
  --pooling {cls,mean,last,use_sentence_transformer}
                        Encoder pooling style
  --disable_normalization
                        Disable embedding normalization
  --query_template QUERY_TEMPLATE
                        Prompt template for query
  --candidate_template CANDIDATE_TEMPLATE
                        Prompt template for candidate
  --padding_side {left,right}
                        Tokenizer padding side
  --threshold THRESHOLD
                        Use results under distance threshold for evaluation
  --topk TOPK           Use top k results for evaluation
  --batch_size BATCH_SIZE
                        Eval batch size
  --result_path RESULT_PATH
                        Compute metrics of existing results directly
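
The --pooling flag selects how per-token hidden states are collapsed into a single embedding. Below is a sketch of the three basic styles (cls, mean, last) over a model's last hidden states; it is an assumption of what the script does rather than its exact code, while use_sentence_transformer instead defers to the model's own SentenceTransformers configuration.

import torch

def pool(hidden: torch.Tensor, mask: torch.Tensor, style: str) -> torch.Tensor:
    # hidden: (batch, seq, dim) last hidden states; mask: (batch, seq) attention mask
    if style == "cls":
        return hidden[:, 0]  # embedding of the first ([CLS]) token
    if style == "mean":
        m = mask.unsqueeze(-1).float()
        return (hidden * m).sum(dim=1) / m.sum(dim=1).clamp(min=1e-9)  # masked average
    if style == "last":
        last = mask.sum(dim=1).long() - 1  # final non-padding position (right padding)
        return hidden[torch.arange(hidden.size(0)), last]
    raise ValueError(f"unknown pooling style: {style}")

With --padding_side left (as in the Qwen3 example below), the final real token sits at the last position, so last pooling reduces to hidden[:, -1].
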
Output Example
Search: 100%|██████████| 404/404 [00:00<00:00, 5315.29it/s]
Metrics for dataset CapRetrieval:
Query evaluation: reciprocal_rank @top10 = 88.70
Query evaluation: average_precision @top10 = 82.91
Query evaluation: ndcg @top10 = 78.86
Query evaluation: hit @top10 = 92.08
Query evaluation: query_precision = 38.22
Query evaluation: query_recall = 68.71
Query evaluation: pair_precision = 38.22
Query evaluation: pair_recall = 32.97
Query evaluation: query_f1 = 49.12
Query evaluation: query_f2 = 59.25
Query evaluation: pair_f1 = 35.40
Query evaluation: pair_f2 = 33.90

Saved 404 query results to evaluation/results.CapRetrieval.bge-base-zh-v1.5.top10.json
Saved report to evaluation/report.CapRetrieval.bge-base-zh-v1.5.top10.json
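
For intuition on the nDCG@10 line above, a minimal sketch of nDCG over the graded labels in $\{0, 1, 2\}$ follows; run.py's metric implementation may differ in details such as the gain function.

import math

def ndcg_at_k(ranked_labels, all_labels, k=10):
    # ranked_labels: labels of retrieved passages in rank order
    # all_labels: every annotated label for the query (for the ideal ranking)
    dcg = sum((2**rel - 1) / math.log2(i + 2) for i, rel in enumerate(ranked_labels[:k]))
    ideal = sorted(all_labels, reverse=True)[:k]
    idcg = sum((2**rel - 1) / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

print(ndcg_at_k([2, 0, 1, 0, 2], [2, 2, 2, 1, 1, 0]))  # ≈ 0.65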

Usage Examples

Evaluate BM25:

  • python run.py --dataset CapRetrieval --topk 10 --mode bm25 --lang zh

Evaluate BGE encoders using CLS pooling (default pooling):

  • python run.py --dataset CapRetrieval --topk 10 --model BAAI/bge-base-zh-v1.5

Evaluate GTE multilingual model using CLS pooling:

  • python run.py --dataset CapRetrieval --topk 10 --model Alibaba-NLP/gte-multilingual-base

Evaluate Conan-v1 encoder using default SentenceTransformers setup:

  • python run.py --dataset CapRetrieval --topk 10 --model TencentBAC/Conan-embedding-v1 --pooling use_sentence_transformer

Evaluate E5 encoders using mean pooling, with suggested prompt templates:

  • python run.py --dataset CapRetrieval --topk 10 --model intfloat/multilingual-e5-base --pooling mean --max_len 512 --query_template "query: {text}" --candidate_template "passage: {text}"

Evaluate GTE-Qwen encoders using last token pooling, with the corresponding prompt templates:

  • python run.py --dataset CapRetrieval --topk 10 --model Alibaba-NLP/gte-Qwen2-7B-instruct --pooling last --query_template "Instruct: Given an image search query, retrieve relevant image captions\nQuery: {text}" --batch_size 8

Evaluate Qwen3 embedding models using last token pooling, with the corresponding prompt templates:

  • python run.py --dataset CapRetrieval --topk 10 --model Qwen/Qwen3-Embedding-8B --padding_side left --pooling last --query_template "Instruct: Given an image search query, retrieve relevant image captions\nQuery: {text}" --batch_size 8
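
All templates above use a {text} placeholder that the script fills before encoding. A minimal sketch of that step, with encode() as a hypothetical stand-in for the embedding call:

query_template = "query: {text}"        # E5-style query prompt
candidate_template = "passage: {text}"  # E5-style passage prompt

queries = [query_template.format(text=q) for q in ["sunset over the beach"]]
passages = [candidate_template.format(text=p) for p in ["A photo of a beach at dusk."]]
# query_embs, passage_embs = encode(queries), encode(passages)  # hypothetical; then cosine search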

Evaluation Scores

Type      Model                      nDCG@10
BM25      Basic BM25                 66.54
0.1B      bge-base-zh-v1.5           78.86
          gte-multilingual-base      79.67
          multilingual-e5-base       76.33
0.3B      bge-large-zh-v1.5          79.15
          multilingual-e5-large      81.01
          Conan-embedding-v1         77.04
0.6B      Qwen3-Embedding-0.6B       81.04
>1B       gte-Qwen2-1.5B-instruct    77.35
          gte-Qwen2-7B-instruct      86.55
          e5-mistral-7b-instruct     76.40
          Qwen3-Embedding-8B         84.61
Trained   Out-of-Domain              87.23
          In-Domain                  91.83

The trained models (based on bge-base-zh-v1.5) are trained on queries generated by the data generation strategies described in the paper. The in-domain model can be downloaded from Google Drive.

License Agreement

The dataset and trained models are licensed under Apache 2.0.

Citation

@misc{xu2025denseretrieversfailsimple,
      title={Dense Retrievers Can Fail on Simple Queries: Revealing The Granularity Dilemma of Embeddings}, 
      author={Liyan Xu and Zhenlin Su and Mo Yu and Jiangnan Li and Fandong Meng and Jie Zhou},
      year={2025},
      eprint={2506.08592},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.08592}, 
}
