-
Hard-negative mining methods (a minimal mining sketch follows this list):
- Dense Passage Retrieval for Open-Domain Question Answering Vladimir Karpukhin et al., 2020.09 EMNLP2020
- Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval Lee Xiong et al., 2020.10 ICLR2021
- RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering Yingqi Qu et al., 2020.10 NAACL2021
- RocketQAv2: A Joint Training Method for Dense Passage Retrieval and Passage Re-ranking Ruiyang Ren et al., 2021.10 EMNLP2021
- Optimizing Dense Retrieval Model Training with Hard Negatives Jingtao Zhan et al., 2021.04 SIGIR2021
- Conan-embedding: General Text Embedding with More and Better Negative Samples Shiyu Li et al., 2024.08
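Most of the papers above share one core loop: retrieve with the current model, drop known positives, and keep the top remaining passages as training negatives. Below is a minimal sketch of that loop, assuming precomputed embedding matrices; the top-k cutoff and the false-negative filtering hook are illustrative choices, not any single paper's recipe.

```python
# Minimal hard-negative mining sketch in the spirit of ANCE / RocketQA:
# search the corpus with the current retriever, drop known positives, and keep
# the highest-scoring remaining passages as hard negatives. Cross-encoder
# denoising (RocketQA) or score-threshold filtering would plug in at the marked line.
import numpy as np

def mine_hard_negatives(q_embs, doc_embs, positives, k=50, n_neg=5):
    """q_embs: (Q, d) query embeddings, doc_embs: (N, d) passage embeddings
    from the current retriever; positives[i] is the set of gold passage ids
    for query i."""
    scores = q_embs @ doc_embs.T                      # dense retrieval scores
    hard_negatives = []
    for qi in range(scores.shape[0]):
        ranked = np.argsort(-scores[qi])[:k]          # top-k retrieved passages
        negs = [int(d) for d in ranked if int(d) not in positives[qi]]
        # Optional: filter likely false negatives here (cross-encoder or threshold).
        hard_negatives.append(negs[:n_neg])
    return hard_negatives
```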
-
Loss function (an in-batch InfoNCE sketch follows this list):
- SamToNe: Improving Contrastive Loss for Dual Encoder Retrieval Models with Same Tower Negatives Fedor Moiseev, et al., ACL Findings 2023
- PAIR: Leveraging Passage-Centric Similarity Relation for Improving Dense Passage Retrieval Ruiyang Ren, et al., ACL Findings 2021
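For reference, the baseline these papers modify is the in-batch InfoNCE loss of a dual encoder; a minimal sketch follows, with the temperature value as an assumption. SamToNe additionally adds same-tower (query-query) similarities to the denominator, and PAIR adds a passage-centric similarity term on top of this query-centric loss.

```python
# Minimal in-batch InfoNCE sketch (PyTorch): each query is scored against its
# own passage (positive, on the diagonal) and every other passage in the batch
# (in-batch negatives).
import torch
import torch.nn.functional as F

def in_batch_infonce(q_emb, p_emb, temperature=0.05):
    """q_emb, p_emb: (B, dim) L2-normalized embeddings; p_emb[i] is the
    positive passage for q_emb[i]."""
    scores = q_emb @ p_emb.T / temperature              # (B, B) similarity matrix
    labels = torch.arange(q_emb.size(0), device=q_emb.device)
    return F.cross_entropy(scores, labels)              # positives on the diagonal
```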
-
Interaction
- D-q:
- DRPQ: Improving Document Representations by Generating Pseudo Query Embeddings for Dense Retrieval Hongyin Tang et al., 2021.03 ACL2021
- I3 Retriever: Incorporating Implicit Interaction in Pre-trained Language Models for Passage Retrieval Qian Dong et al., 2023.07 CIKM2023
- DCE: Learning Diverse Document Representations with Deep Query Interactions for Dense Retrieval Zehan Li et al., 2022.08
- CAPSTONE: Curriculum Sampling for Dense Retrieval with Document Expansion Xingwei He et al., EMNLP2023
- q-D:
- Improving Query Representations for Dense Retrieval with Pseudo Relevance Feedback HongChien Yu et al., 2021.08 CIKM2021 (see the vector-PRF sketch below)
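The simplest dense pseudo-relevance-feedback variant is a Rocchio-style update in embedding space. The sketch below illustrates that general idea rather than the encoder-based method of the paper above; the interpolation weights are assumptions.

```python
# Minimal Rocchio-style dense PRF: retrieve once, mix the centroid of the top-k
# feedback documents into the query vector, then retrieve again.
import numpy as np

def prf_search(q_vec, doc_embs, k_feedback=3, k_final=10, alpha=0.8, beta=0.2):
    """q_vec: (d,) normalized query embedding; doc_embs: (N, d) normalized docs."""
    first_pass = doc_embs @ q_vec
    feedback = np.argsort(-first_pass)[:k_feedback]       # pseudo-relevant docs
    centroid = doc_embs[feedback].mean(axis=0)
    q_new = alpha * q_vec + beta * centroid                # Rocchio-style update
    q_new /= np.linalg.norm(q_new)
    return np.argsort(-(doc_embs @ q_new))[:k_final]
```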
-
Multi-vector
- ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT Omar Khattab et al., 2020.07 SIGIR2020 (a MaxSim scoring sketch follows)
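ColBERT keeps one embedding per token and scores with MaxSim late interaction; a minimal scoring sketch (token embeddings assumed L2-normalized) is below. In practice this sits behind an ANN index over token embeddings rather than brute-force scoring.

```python
# Minimal ColBERT MaxSim late-interaction scoring: each query token embedding is
# matched to its most similar document token embedding, and the maxima are summed.
import torch

def maxsim_score(q_tokens: torch.Tensor, d_tokens: torch.Tensor) -> torch.Tensor:
    """q_tokens: (num_query_tokens, dim), d_tokens: (num_doc_tokens, dim)."""
    sim = q_tokens @ d_tokens.T            # (num_query_tokens, num_doc_tokens)
    return sim.max(dim=1).values.sum()     # max over doc tokens, sum over query tokens
```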
-
Pre-training methods:
- Auto-encoding:
- Less is More: Pre-train a Strong Text Encoder for Dense Retrieval Using a Weak Decoder Shuqi Lu et al., 2021.02 EMNLP2021
- RetroMAE: Pre-training Retrieval-oriented Transformers via Masked Autoencoder Shitao Xiao et al., 2022.03, EMNLP2022 (see the bottleneck sketch at the end of this section)
- ConTextual Masked Auto-Encoder for Dense Passage Retrieval Xing Wu et al., 2022.08 AAAI2023
- SimLM: Pre-training with Representation Bottleneck for Dense Passage Retrieval Liang Wang et al., ACL2023
- MASTER: Multi-task Pre-trained Bottlenecked Masked Autoencoders are Better Dense Retrievers Kun Zhou et al., 2022.12 ECML-PKDD 2023
- Drop your Decoder: Pre-training with Bag-of-Word Prediction for Dense Passage Retrieval Guangyuan Ma, et al., SIGIR2024
- Transformers:
- Condenser: a Pre-training Architecture for Dense Retrieval Luyu Gao et al., 2021.04 EMNLP2021
- Representative Words Prediction
- PROP: Pre-training with Representative Words Prediction for Ad-hoc Retrieval Xinyu Ma et al., WSDM2021, 2020.10
- B-PROP: Bootstrapped Pre-training with Representative Words Prediction for Ad-hoc Retrieval Xinyu Ma et al., SIGIR2021, 2021.04
- Synthetic data generation
- Multi-stage Training with Improved Negative Contrast for Neural Passage Retrieval Jing Lu et al., EMNLP2021
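The auto-encoding papers above share a "representation bottleneck" idea: a strong encoder compresses the passage into a single vector, and a deliberately weak decoder must reconstruct an aggressively masked copy from that vector. The sketch below is a toy illustration of that idea only; the sizes, masking rates, and the plain one-layer self-attention decoder are assumptions, not any specific paper's architecture.

```python
# Toy "representation bottleneck" pre-training step (PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model, max_len = 1000, 128, 32
pad_id, mask_id = 0, 1

tok_emb = nn.Embedding(vocab_size, d_model, padding_idx=pad_id)
pos_emb = nn.Embedding(max_len, d_model)
encoder = nn.TransformerEncoder(  # strong encoder (full depth)
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=6)
decoder = nn.TransformerEncoder(  # weak decoder (single layer)
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=1)
lm_head = nn.Linear(d_model, vocab_size)

def mask_tokens(ids, prob):
    """Replace a random `prob` fraction of tokens with the [MASK] id."""
    return torch.where(torch.rand(ids.shape) < prob,
                       torch.full_like(ids, mask_id), ids)

def bottleneck_step(input_ids):
    B, L = input_ids.shape
    pos = pos_emb(torch.arange(L)).unsqueeze(0)            # (1, L, d)

    # Encoder: lightly masked input -> sentence embedding at the first position.
    enc_hidden = encoder(tok_emb(mask_tokens(input_ids, 0.15)) + pos)
    sent_emb = enc_hidden[:, 0]                            # the bottleneck vector

    # Decoder: aggressively masked copy, with the sentence embedding injected
    # at position 0 so reconstruction has to rely on it.
    dec_in = tok_emb(mask_tokens(input_ids, 0.5)) + pos
    dec_in = torch.cat([sent_emb.unsqueeze(1), dec_in[:, 1:]], dim=1)
    logits = lm_head(decoder(dec_in))

    # Reconstruction loss over the original tokens.
    return F.cross_entropy(logits.reshape(-1, vocab_size),
                           input_ids.reshape(-1), ignore_index=pad_id)

loss = bottleneck_step(torch.randint(2, vocab_size, (4, max_len)))
```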
-
Query Reformulation
- Generation-Augmented Retrieval for Open-Domain Question Answering, Yuning Mao, et al., ACL2021
- W-RAG: Weakly Supervised Dense Retrieval in RAG for Open-domain Question Answering (2024) paper link
- REPLUG: Retrieval-Augmented Black-Box Language Models (NAACL 2024) paper link. Key point: generation helps retrieval.
- A Case Study of Enhancing Sparse Retrieval using LLMs (WWW'24) paper
- From the perspective of cognition:
- Are Large Language Models Good at Utility Judgments? (Hengran Zhang, SIGIR 2024)
- Iterative Utility Judgment Framework via LLMs Inspired by Relevance in Philosophy (Hengran Zhang, 2024.7)
- Corrective Retrieval Augmented Generation, Shi-Qi Yan, et al., Arxiv 2024
- Similarity is Not All You Need: Endowing Retrieval-Augmented Generation with Multi-layered Thoughts, Chunjing Gan, et al., Arxiv 2024
- ARKS: Active Retrieval in Knowledge Soup for Code Generation, Hongjin Su, et al., 2024
- Context-Augmented Code Generation Using Programming Knowledge Graphs, Iman Saberi, et al., 2024.10.9
- Evaluating Retrieval Quality in Retrieval-Augmented Generation. Alireza Salemi, et al., SIGIR2024
- Bridging the Preference Gap between Retrievers and LLMs. Zixuan Ke et al. ACL2024
- Query reformulation
- Large Language Models are Strong Zero-Shot Retriever Tao Shen, et al., ACL2024 Findings
- Precise Zero-Shot Dense Retrieval without Relevance Labels Luyu Gao, et al., ACL2023 (a HyDE-style sketch follows)
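HyDE side-steps missing relevance labels by embedding an LLM-written hypothetical answer instead of the raw query. A minimal sketch is below; `generate_passage` and `embed` are placeholders for whatever LLM and embedding model are used, and the prompt wording is an assumption.

```python
# Minimal HyDE-style retrieval: generate a hypothetical answer passage with an
# LLM, embed that passage, and retrieve with its embedding.
import numpy as np

def hyde_search(query, doc_embeddings, doc_ids, generate_passage, embed, k=10):
    prompt = f"Write a short passage that answers the question.\nQuestion: {query}\nPassage:"
    hypothetical_doc = generate_passage(prompt)       # LLM call (placeholder)
    q_vec = embed(hypothetical_doc)                   # embed the hypothetical passage
    q_vec = q_vec / np.linalg.norm(q_vec)
    scores = doc_embeddings @ q_vec                   # docs assumed pre-normalized
    top = np.argsort(-scores)[:k]
    return [(doc_ids[i], float(scores[i])) for i in top]
```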
- Zero-shot encoder
- PromptReps: Prompting Large Language Models to Generate Dense and Sparse Representations for Zero-Shot Document Retrieval Shengyao Zhuang, et al., EMNLP2024
- Fine-tuning LLaMA for Multi-stage Text Retrieval (SIGIR 2024)
- appends an EOS token to each query/passage
- uses the EOS hidden state as the embedding of the whole sequence (see the sketch below)
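A minimal sketch of that last-token (EOS) pooling with the Hugging Face transformers API; the checkpoint name is a placeholder and the retrieval fine-tuning itself is omitted.

```python
# Minimal EOS-pooling embedding: append the EOS token and take its final
# hidden state as the sequence embedding.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder decoder-only checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

def eos_embed(text: str) -> torch.Tensor:
    ids = tok(text, return_tensors="pt").input_ids
    eos = torch.tensor([[tok.eos_token_id]])
    ids = torch.cat([ids, eos], dim=1)                    # append EOS explicitly
    with torch.no_grad():
        hidden = model(input_ids=ids).last_hidden_state   # (1, seq_len, dim)
    return hidden[0, -1]                                   # EOS hidden state as the embedding
```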
- Making Large Language Models a Better Foundation for Dense Retrieval (2023) -> Llama2Vec: Unsupervised Adaptation of Large Language Models for Dense Retrieval (ACL 2024)
- first work on pre-training for dense retrieval using LLMs
- motivation: because of the next-token objective, the LLM's output embeddings mainly capture the local, near-future semantics of the context, whereas dense retrieval calls for embeddings that represent the global semantics of the entire context.
- Improving Text Embeddings with Large Language Models (2023), Liang Wang, et al., ACL2024
- E5-mistral-7B
- fine-tuning on both the generated synthetic data and a collection of 13 public datasets.
- NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models (2024), ICLR25
- Two-stage fine-tuning:
- It first applies contrastive training with instructions on retrieval datasets, utilizing in-batch negatives and curated hard negative examples.
- At stage-2, it blends various non-retrieval datasets into instruction tuning, which not only enhances non-retrieval task accuracy but also improves retrieval performance.
- Repetition Improves Language Model Embeddings (2023), Jacob Mitchell Springer, et al., ICLR25
- feeds the query or passage twice (echo), so later tokens can attend to the full input
- improves the embedding of the last token, which can now see the whole text (see the sketch below)
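A minimal echo-embedding sketch: the text is fed twice so that, under causal attention, tokens in the second copy see the whole input, and pooling is done over the second copy. The placeholder checkpoint, prompt wording, and the heuristic for locating the second copy are assumptions of this sketch (the paper also studies last-token pooling).

```python
# Minimal echo embedding: repeat the input and mean-pool over the second copy.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "gpt2"  # placeholder decoder-only checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

def echo_embed(text: str) -> torch.Tensor:
    prompt = f"Rewrite the passage: {text}\nPassage again: {text}"
    enc = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]        # (seq_len, dim)
    # Approximate start of the second copy (tokenization may shift it slightly).
    second_start = enc.input_ids.shape[1] - len(tok(text).input_ids)
    return hidden[second_start:].mean(dim=0)              # mean-pool over the echo
```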
- LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders, Parishad BehnamGhader, et al., COLM2024
- NV-Retriever: Improving text embedding models with effective hard-negative mining, Gabriel de Souza P. Moreira, Feb 2025, Arxiv
- ScalingNote: Scaling up Retrievers with Large Language Models for Real-World Dense Retrieval, Suyuan Huang, et al.,
- Unleashing the Power of LLMs in Dense Retrieval with Query Likelihood Modeling, Hengran Zhang, et al., Arxiv2025
- Scaling Sparse and Dense Retrieval in Decoder-Only LLMs, Hansi Zeng, et al., SIGIR25
- Gemini key technologies: Gecko: Versatile Text Embeddings Distilled from Large Language Models Jinhyuk Lee, et al., 9 Mar 2024, Arxiv
- NovaSearch Jasper and Stella: distillation of SOTA embedding models Dun Zhang, et al., 23 Jan 2025, Arxiv
- Linq-Embed-Mistral Report Junseong Kim, et al., May 2024
- Scaling sentence embeddings with large language models Ting Jiang, et al., EMNLP2024 Findings
- Improving General Text Embedding Model: Tackling Task Conflict and Data Imbalance through Model Merging Mingxin Li, et al., Oct 2024
- Qwen3-text-embedding: Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models Yanzhao Zhang et al., Jun 2025
- Gemini: Generalizable Embeddings from Gemini Jinhyuk Lee, et al., Mar 2025
- BGE-en-icl: Making Text Embedders Few-Shot Learners Chaofan Li, et al., ICLR2025
- A Setwise Approach for Effective and Highly Efficient Zero-shot Ranking with Large Language Models, SIGIR 2024
- PRP-Graph: Pairwise Ranking Prompting to LLMs with Graph Aggregation for Effective Text Re-ranking(ACL 2024)
- Improving Zero-shot LLM Re-Ranker with Risk Minimization (EMNLP'24)
- RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs (Arxiv'24)
- Re-Ranking Step by Step: Investigating Pre-Filtering for Re-Ranking with Large Language Models, Arxiv2024
- Lightweight Reranking for Language Model Generations, ACL'24
- JudgeRank: Leveraging Large Language Models for Reasoning-Intensive Reranking. Pointwise pipeline: core question -> select key sentences -> relevance judgment
- Open-source Large Language Models are Strong Zero-shot Query Likelihood Models for Document Ranking Shengyao Zhuang, et al., EMNLP2023 Findings
- Zero-Shot Cross-Lingual Reranking with Large Language Models for Low-Resource Languages Mofetoluwa Adeyemi, et al., ACL'24 short
- Improving Passage Retrieval with Zero-Shot Question Generation Devendra Singh Sachan et al., EMNLP2022 (see the query-likelihood sketch at the end of this list)
- Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents(EMNLP'23)
- Are Large Language Models Good at Utility Judgments? Hengran Zhang, et al., (SIGIR'24)
- Distillation and Refinement of Reasoning in Small Language Models for Document Re-ranking, Chris Samarinas, Hamed Zamani, Arxiv2025
- Beyond Yes and No: Improving Zero-Shot LLM Rankers via Scoring Fine-Grained Relevance Labels, Honglei Zhuang, NAACL2024. Scores documents via the log-probability of fine-grained relevance labels.
- PaRaDe: Passage Ranking using Demonstrations with Large Language Models (Andrew Drozdov et al., Findings of EMNLP'23 short paper). Uses the demonstration query likelihood to estimate query difficulty and then selects demonstrations.
- APEER: Automatic Prompt Engineering Enhances Large Language Model Reranking, Can Jin, et al., WWW'25. Refines the ranking prompt directly with LLMs.
- Leveraging Passage Embeddings for Efficient Listwise Reranking with Large Language Models, Qi Liu et al., WWW'25. Addresses the high latency of LLM inference in listwise reranking.
- FIRST: Faster Improved Listwise Reranking with Single Token Decoding Revanth Gangi Reddy, et al., EMNLP2024
- RankVicuna: Zero-Shot Listwise Document Reranking with Open-Source Large Language Models (Arxiv'23)/RankZephyr: Effective and Robust Zero-Shot Listwise Reranking is a Breeze! Ronak Pradeep, et al., Arxiv2023
- Self-Calibrated Listwise Reranking with Large Language Models Ruiyang Ren, et al., WWW25
- Adapting Decoder-Based Language Models for Diverse Encoder Downstream Tasks Paul Suganthan, et al., Arxiv, March 2025
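As a concrete example of the zero-shot query-likelihood rankers listed above (the Sachan et al. and Zhuang et al. papers), the sketch below scores each passage by the log-probability a causal LM assigns to the query given the passage. The prompt wording and the placeholder checkpoint are assumptions of this sketch.

```python
# Minimal zero-shot query-likelihood reranking: score = log P(query | passage prompt).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder decoder-only checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def query_log_likelihood(passage: str, query: str) -> float:
    prompt = f"Passage: {passage}\nPlease write a question based on this passage.\nQuestion:"
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    query_ids = tok(" " + query, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, query_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits                    # (1, seq_len, vocab)
    # Position i of the shifted log-probs predicts token i+1 of the input.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    query_positions = range(prompt_ids.shape[1] - 1, input_ids.shape[1] - 1)
    query_tokens = input_ids[0, prompt_ids.shape[1]:]
    return sum(log_probs[pos, tid].item()
               for pos, tid in zip(query_positions, query_tokens))

def rerank(query, passages):
    # Higher query likelihood -> more relevant passage.
    return sorted(passages, key=lambda p: query_log_likelihood(p, query), reverse=True)
```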