
dense-retrieval

encoder-only

  1. Hard-negative mining methods (a contrastive training-loss sketch follows this section's list):

    1. Dense Passage Retrieval for Open-Domain Question Answering Vladimir Karpukhin et al., 2020.09 EMNLP2020
    2. Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval Lee Xiong et al., 2020.10 ICLR2021
    3. RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering Yingqi Qu et al., 2020.10 NAACL2021
    4. RocketQAv2: A Joint Training Method for Dense Passage Retrieval and Passage Re-ranking Ruiyang Ren et al., 2021.10 EMNLP2021
    5. Optimizing Dense Retrieval Model Training with Hard Negatives Jingtao Zhan et al., 2021.04 SIGIR2021
    6. Conan-embedding: General Text Embedding with More and Better Negative Samples Shiyu Li et al., 2024.08
  2. Loss function:

    1. SamToNe: Improving Contrastive Loss for Dual Encoder Retrieval Models with Same Tower Negatives Fedor Moiseev, et al., ACL Findings 2023
    2. PAIR: Leveraging Passage-Centric Similarity Relation for Improving Dense Passage Retrieval Ruiyang Ren, et al., ACL Findings 2021
  3. Interaction

    1. D-q (document representations enriched with query-side signals):
      1. DRPQ: Improving Document Representations by Generating Pseudo Query Embeddings for Dense Retrieval Hongyin Tang et al., 2021.03 ACL2021
      2. I3 Retriever: Incorporating Implicit Interaction in Pre-trained Language Models for Passage Retrieval Qian Dong et al., 2023.07 CIKM2023
      3. DCE: Learning Diverse Document Representations with Deep Query Interactions for Dense Retrieval Zehan Li et al., 2022.08
      4. CAPSTONE: Curriculum Sampling for Dense Retrieval with Document Expansion Xingwei He et al., EMNLP2023
    2. q-D (query representations enriched with document-side signals, e.g., pseudo relevance feedback):
      1. Improving Query Representations for Dense Retrieval with Pseudo Relevance Feedback HongChien Yu et al., 2021.08 CIKM2021
  4. Multi-vector

      1. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT Omar Khattab et al., 2020.07 SIGIR2020
  5. Pre-training methods:

    1. Auto-encoding:
      1. Less is More: Pre-train a Strong Text Encoder for Dense Retrieval Using a Weak Decoder Shuqi Lu et al., 2021.02 EMNLP2021
      2. RetroMAE: Pre-Training Retrieval-Oriented Language Models via Masked Auto-Encoder Shitao Xiao et al., 2022.03, EMNLP2022
      3. ConTextual Masked Auto-Encoder for Dense Passage Retrieval Xing Wu et al., 2022.08 AAAI2023
      4. SimLM: Pre-training with Representation Bottleneck for Dense Passage Retrieval Liang Wang et al., ACL2023
      5. MASTER: Multi-task Pre-trained Bottlenecked Masked Autoencoders are Better Dense Retrievers Kun Zhou et al., 2022.12 ECML-PKDD 2023
      6. Drop your Decoder: Pre-training with Bag-of-Word Prediction for Dense Passage Retrieval Guangyuan Ma, et al., SIGIR2024
    2. Transformers:
      1. Condenser: a Pre-training Architecture for Dense Retrieval Luyu Gao et al., 2021.04 EMNLP2021
    3. Representative Words Prediction
      1. PROP: Pre-training with Representative Words Prediction for Ad-hoc Retrieval Xinyu Ma et al., WSDM2021, 2020.10
      2. B-PROP: Bootstrapped Pre-training with Representative Words Prediction for Ad-hoc Retrieval Xinyu Ma et al., SIGIR2021, 2021.04
    4. Synthetic data generation
      1. Multi-stage Training with Improved Negative Contrast for Neural Passage Retrieval Jing Lu et al., EMNLP2021
    5. Others:
      1. How to Train Your DRAGON: Diverse Augmentation Towards Generalizable Dense Retrieval Sheng-Chieh Lin et al., 2023.02
      2. M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation Jianlv Chen et al., 2024.07
  6. Query Reformulation

    1. Generation-Augmented Retrieval for Open-Domain Question Answering, Yuning Mao, et al., ACL2021
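
Most of the hard-negative mining and loss-function papers above optimize the same family of objective: an InfoNCE-style contrastive loss over the positive passage, in-batch negatives, and mined hard negatives. A minimal PyTorch sketch of that objective (tensor names, shapes, and the temperature are illustrative, not taken from any single paper):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q_emb, pos_emb, hard_neg_emb, temperature=0.05):
    """InfoNCE with in-batch negatives plus mined hard negatives.

    q_emb:        (B, D) query embeddings
    pos_emb:      (B, D) embedding of the positive passage for each query
    hard_neg_emb: (B, K, D) embeddings of K mined hard negatives per query
    """
    q = F.normalize(q_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    n = F.normalize(hard_neg_emb, dim=-1)

    in_batch = q @ p.T                       # (B, B): other queries' positives act as negatives
    hard = torch.einsum("bd,bkd->bk", q, n)  # (B, K): each query's own mined negatives
    logits = torch.cat([in_batch, hard], dim=1) / temperature

    # The positive for query i sits at column i of the in-batch block.
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)
```

The listed methods differ mainly in where `hard_neg_emb` comes from (BM25 negatives for DPR, periodically refreshed ANN negatives for ANCE, cross-encoder-denoised negatives for RocketQA), not in the loss itself.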

LLMs coming: using LLMs to improve retrieval


  1. W-RAG: Weakly Supervised Dense Retrieval in RAG for Open-domain Question Answering (2024)
  2. REPLUG: Retrieval-Augmented Black-Box Language Models (NAACL 2024). Key point: generation helps retrieval, i.e., the frozen LM's likelihood supervises the retriever (see the sketch after this list).
  3. A Case Study of Enhancing Sparse Retrieval using LLMs (WWW'24)
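
The REPLUG note ("generation helps retrieval") describes training the retriever from the frozen LM's feedback: documents under which the LM assigns higher likelihood to the gold answer should receive higher retrieval scores. A hedged sketch of such a distillation-style objective (shapes, names, and temperatures are illustrative; see the paper for the exact formulation):

```python
import torch.nn.functional as F

def lm_supervised_retrieval_loss(retrieval_scores, lm_gold_logprobs, tau_r=0.1, tau_lm=0.1):
    """Align the retriever's distribution over retrieved docs with the LM's feedback.

    retrieval_scores: (B, K) retriever similarity scores for K retrieved docs per query
    lm_gold_logprobs: (B, K) log-likelihood the frozen LM assigns to the gold answer
                      when each retrieved doc is prepended to the prompt
    """
    log_p_retriever = F.log_softmax(retrieval_scores / tau_r, dim=-1)
    q_lm = F.softmax(lm_gold_logprobs / tau_lm, dim=-1)  # soft labels; no gradient flows to the LM
    # KL(Q_LM || P_retriever): only the retriever receives gradients.
    return F.kl_div(log_p_retriever, q_lm, reduction="batchmean")
```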

LLMs for IR post-processing (relevance for the retriever; utility or usefulness for the generator)

  1. From the perspective of cognition:
    1. Are Large Language Models Good at Utility Judgments? (Hengran Zhang, SIGIR 2024)
    2. Iterative Utility Judgment Framework via LLMs Inspired by Relevance in Philosophy (Hengran Zhang, 2024.07)
    3. Corrective Retrieval Augmented Generation, Shi-Qi Yan, et al., Arxiv 2024
    4. Similarity is Not All You Need: Endowing Retrieval-Augmented Generation with Multi-layered Thoughts, Chunjing Gan, et al., Arxiv 2024
    5. ARKS: Active Retrieval in Knowledge Soup for Code Generation, Hongjin Su, et al., 2024
    6. Context-Augmented Code Generation Using Programming Knowledge Graphs, Iman Saberi, et al., 2024.10.9
    7. Evaluating Retrieval Quality in Retrieval-Augmented Generation, Alireza Salemi, et al., SIGIR2024
    8. Bridging the Preference Gap between Retrievers and LLMs. Zixuan Ke et al. ACL2024
  2. Query reformulation
    1. Large Language Models are Strong Zero-Shot Retriever Tao Shen, et al., ACL2024 Findings
    2. Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE), Luyu Gao, et al., ACL2023 (see the sketch after this list)
  3. Zero-shot encoder
    1. PromptReps: Prompting Large Language Models to Generate Dense and Sparse Representations for Zero-Shot Document Retrieval Shengyao Zhuang, et al., EMNLP2024
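
The HyDE entry above follows a simple zero-shot recipe: an LLM writes one or more hypothetical answer passages, and their embeddings stand in for the raw query embedding at search time. A minimal sketch; `generate`, `encode`, and `index.search` are hypothetical placeholders for any LLM call, dense encoder, and ANN index:

```python
import numpy as np

def hyde_search(query, generate, encode, index, k=10, n_samples=4):
    """HyDE-style retrieval: search with embeddings of LLM-written pseudo-passages.

    generate(prompt) -> str       # any LLM text-generation call (hypothetical helper)
    encode(text)     -> ndarray   # the dense retriever's encoder (hypothetical helper)
    index.search(vec, k) -> hits  # nearest-neighbour lookup over the corpus
    """
    prompt = (
        "Write a short passage that answers the question.\n"
        f"Question: {query}\nPassage:"
    )
    # Sample several hypothetical passages and average their embeddings;
    # the paper also averages in the encoded query itself.
    vecs = [encode(generate(prompt)) for _ in range(n_samples)]
    vecs.append(encode(query))
    query_vec = np.mean(np.stack(vecs), axis=0)
    return index.search(query_vec, k)
```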

LLMs in dense retrieval (LLMs as encoder)

  1. Fine-Tuning LLaMA for Multi-Stage Text Retrieval (SIGIR 2024); see the pooling sketch after this list
    1. append an EOS token to each query/passage
    2. use the EOS token's final hidden state as the embedding of the whole sequence
  2. Making Large Language Models a Better Foundation For Dense Retrieval (2023) -> Llama2Vec: Unsupervised Adaptation of Large Language Models for Dense Retrieval (ACL 2024)
    1. first work on pre-training for dense retrieval using LLMs
    2. motivation: the LLMs' output embeddings mainly capture the local, near-future semantics of the context, whereas dense retrieval needs embeddings that represent the global semantics of the entire input
  3. Improving Text Embeddings with Large Language Models (2023), Liang Wang, et al., ACL2024
    1. E5-mistral-7B
    2. fine-tuning on both the generated synthetic data and a collection of 13 public datasets.
  4. NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models (2024), ICLR2025
    1. Two-stage fine-tuning:
      1. It first applies contrastive training with instructions on retrieval datasets, utilizing in-batch negatives and curated hard negative examples.
      2. At stage-2, it blends various non-retrieval datasets into instruction tuning, which not only enhances non-retrieval task accuracy but also improves retrieval performance.
  5. Repetition Improves Language Model Embeddings (2024), Jacob Mitchell Springer, et al., ICLR2025
    1. feed the query or passage to the model twice, so tokens in the second copy can attend to the full text
    2. this improves the embedding read from the last token
  6. LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders, Parishad BehnamGhader, et al., COLM2024
  7. NV-Retriever: Improving text embedding models with effective hard-negative mining, Gabriel de Souza P. Moreira, Feb 2025, Arxiv
  8. ScalingNote: Scaling up Retrievers with Large Language Models for Real-World Dense Retrieval, Suyuan Huang, et al.,
  9. Unleashing the Power of LLMs in Dense Retrieval with Query Likelihood Modeling, Hengran Zhang, et al., Arxiv2025
  10. Scaling Sparse and Dense Retrieval in Decoder-Only LLMs, Hansi Zeng, et al., SIGIR2025
  11. Gemini key technologies: Gecko: Versatile Text Embeddings Distilled from Large Language Models Jinhyuk Lee, et al., 9 Mar 2024, Arxiv
  12. NovaSearch Jasper and Stella: distillation of SOTA embedding models Dun Zhang, et al., 23 Jan 2025, Arxiv
  13. Linq-Embed-Mistral Report Junseong Kim, et al., May 2024
  14. PromptReps: Prompting Large Language Models to Generate Dense and Sparse Representations for Zero-Shot Document Retrieval Shengyao Zhuang, et al., EMNLP2024
  15. Scaling sentence embeddings with large language models Ting Jiang, et al., EMNLP2024 Findings
  16. Improving General Text Embedding Model: Tackling Task Conflict and Data Imbalance through Model Merging Mingxin Li, et al., Oct 2024
  17. Qwen3-text-embedding: Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models Yanzhao Zhang et al., Jun 2025
  18. Gemini Embedding: Generalizable Embeddings from Gemini, Jinhyuk Lee, et al., Mar 2025
  19. BGE-en-icl: Making Text Embedders Few-Shot Learners Chaofan Li, et al., ICLR2025
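
A recurring implementation detail in the list above (items 1 and 5 in particular) is how a decoder-only LLM pools a whole text into one vector: append an EOS token and read out its final hidden state. A minimal Hugging Face sketch under that assumption; the checkpoint name is a placeholder and this is not any paper's official code:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-hf"  # placeholder; any decoder-only checkpoint

# add_eos_token appends </s> to every sequence (the "append an EOS token" step above).
tokenizer = AutoTokenizer.from_pretrained(MODEL, add_eos_token=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"    # keep the EOS as the last non-padding token
model = AutoModel.from_pretrained(MODEL)

@torch.no_grad()
def embed(texts):
    """Use the hidden state of the appended EOS token as the text embedding."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state          # (B, T, D)
    last = batch["attention_mask"].sum(dim=1) - 1      # index of the EOS per sequence
    emb = hidden[torch.arange(hidden.size(0)), last]   # (B, D)
    return F.normalize(emb, dim=-1)
```

The "Repetition Improves Language Model Embeddings" entry keeps the same last-token read-out but feeds the text twice, so the pooled token has attended to a full copy of the input.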

Ranking

LLMs for ranking

Zero-shot

  1. A Setwise Approach for Effective and Highly Efficient Zero-shot Ranking with Large Language Models, SIGIR 2024
  2. PRP-Graph: Pairwise Ranking Prompting to LLMs with Graph Aggregation for Effective Text Re-ranking (ACL 2024)
  3. Improving Zero-shot LLM Re-Ranker with Risk Minimization (EMNLP'24)
  4. RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs (Arxiv'24)
  5. Re-Ranking Step by Step: Investigating Pre-Filtering for Re-Ranking with Large Language Models, Arxiv2024
  6. Lightweight Reranking for Language Model Generations, ACL'24
  7. JudgeRank: Leveraging Large Language Models for Reasoning-Intensive Reranking. Pointwise pipeline: core question -> select key sentences -> relevance judgment
  8. Open-source Large Language Models are Strong Zero-shot Query Likelihood Models for Document Ranking, Shengyao Zhuang, et al., EMNLP2023 Findings (see the query-likelihood sketch after this list)
  9. Zero-Shot Cross-Lingual Reranking with Large Language Models for Low-Resource Languages Mofetoluwa Adeyemi, et al., ACL'24 short
  10. Improving Passage Retrieval with Zero-Shot Question Generation Devendra Singh Sachan et al., EMNLP2022
  11. Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents (EMNLP'23)
  12. Are Large Language Models Good at Utility Judgments? Hengran Zhang, et al., (SIGIR'24)
  13. Distillation and Refinement of Reasoning in Small Language Models for Document Re-ranking, Chris Samarinas, Hamed Zamani, Arxiv2025
  14. Beyond Yes and No: Improving Zero-Shot LLM Rankers via Scoring Fine-Grained Relevance Labels, Honglei Zhuang, NAACL2024. Scores documents by the log-probability of fine-grained relevance labels
  15. PaRaDe: Passage Ranking using Demonstrations with Large Language Models (Andrew Drozdov et al., Findings of EMNLP'23, short paper). Uses demonstration query likelihood to estimate difficulty and select demonstrations
  16. APEER: Automatic Prompt Engineering Enhances Large Language Model Reranking, Can Jin, et al., WWW'25. Refines ranking prompts directly using LLMs
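
Several zero-shot rankers above (the query-likelihood entries, items 8 and 10, and the fine-grained relevance-label scorer, item 14) never sample text; they read token log-probabilities off the LLM. A minimal query-likelihood sketch in the spirit of item 10: score each document by the LLM's log-probability of generating the query given the document (prompt wording and checkpoint are illustrative placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM works for the sketch
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

@torch.no_grad()
def query_likelihood(query, doc):
    """log P(query | prompt(doc)) under the LLM, summed over the query tokens."""
    prompt = f"Passage: {doc}\nPlease write a question based on this passage.\nQuestion:"
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    query_ids = tokenizer(" " + query, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, query_ids], dim=1)

    logits = model(input_ids).logits                       # (1, T, vocab)
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)  # predicts tokens 1..T-1
    targets = input_ids[:, 1:]
    token_lp = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)  # (1, T-1)
    # Keep only the positions that correspond to the query tokens.
    return token_lp[:, prompt_ids.size(1) - 1:].sum().item()

def rerank(query, docs):
    return sorted(docs, key=lambda d: query_likelihood(query, d), reverse=True)
```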

Fine-tuning

  1. Leveraging Passage Embeddings for Efficient Listwise Reranking with Large Language Models, Qi Liu et al., WWW'25. Addresses the high latency of LLM inference in listwise reranking
  2. FIRST: Faster Improved Listwise Reranking with Single Token Decoding Revanth Gangi Reddy, et al., EMNLP2024
  3. RankVicuna: Zero-Shot Listwise Document Reranking with Open-Source Large Language Models (Arxiv'23) / RankZephyr: Effective and Robust Zero-Shot Listwise Reranking is a Breeze!, Ronak Pradeep, et al., Arxiv2023 (see the listwise prompt sketch after this list)
  4. Self-Calibrated Listwise Reranking with Large Language Models Ruiyang Ren, et al., WWW25
  5. Adapting Decoder-Based Language Models for Diverse Encoder Downstream Tasks, Paul Suganthan, et al., Arxiv, March 2025
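
The listwise rerankers above (RankVicuna/RankZephyr here, and the "Is ChatGPT Good at Search?" entry in the zero-shot list) are built around a permutation-generation prompt: the model sees numbered passages and outputs an ordering such as "[2] > [3] > [1]". A sketch of that prompt format and a tolerant output parser (the wording is a paraphrase, not any paper's exact template):

```python
import re

def listwise_rerank_prompt(query, passages):
    """Build a listwise prompt asking the LLM to return a permutation like "[2] > [3] > [1]"."""
    lines = [f"I will provide you with {len(passages)} passages, each indicated by a numerical identifier []."]
    lines += [f"[{i + 1}] {p}" for i, p in enumerate(passages)]
    lines.append(f"Search Query: {query}")
    lines.append("Rank the passages based on their relevance to the query. "
                 "Output the ranking as identifiers, e.g., [2] > [1], with no explanation.")
    return "\n".join(lines)

def parse_permutation(response, num_passages):
    """Extract the ranking from the model output; unmentioned passages keep their original order."""
    seen = []
    for m in re.findall(r"\[(\d+)\]", response):
        idx = int(m) - 1
        if 0 <= idx < num_passages and idx not in seen:
            seen.append(idx)
    return seen + [i for i in range(num_passages) if i not in seen]
```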
