
dense-retrieval

encoder-only

  1. Hard-negative mining methods (a contrastive training-loss sketch follows this section's list):

    1. Dense Passage Retrieval for Open-Domain Question Answering Vladimir Karpukhin et al., 2020.09 EMNLP2020
    2. Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval Lee Xiong et al., 2020.10 ICLR2021
    3. RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering Yingqi Qu et al., 2020.10 NAACL2021
    4. RocketQAv2: A Joint Training Method for Dense Passage Retrieval and Passage Re-ranking Ruiyang Ren et al., 2021.10 EMNLP2021
    5. Optimizing Dense Retrieval Model Training with Hard Negatives Jingtao Zhan et al., 2021.04 SIGIR2021
    6. Conan-embedding: General Text Embedding with More and Better Negative Samples Shiyu Li et al., 2024.08
  2. Loss function:

    1. SamToNe: Improving Contrastive Loss for Dual Encoder Retrieval Models with Same Tower Negatives Fedor Moiseev, et al., ACL Findings 2023
    2. PAIR: Leveraging Passage-Centric Similarity Relation for Improving Dense Passage Retrieval Ruiyang Ren, et al., ACL Findings 2021
  3. Interaction

    1. D-q (document representations enriched with query-side signals):
      1. DRPQ: Improving Document Representations by Generating Pseudo Query Embeddings for Dense Retrieval Hongyin Tang et al., 2021.03 ACL2021
      2. I3 Retriever: Incorporating Implicit Interaction in Pre-trained Language Models for Passage Retrieval Qian Dong et al., 2023.07 CIKM2023
      3. DCE: Learning Diverse Document Representations with Deep Query Interactions for Dense Retrieval Zehan Li et al., 2022.08
      4. CAPSTONE: Curriculum Sampling for Dense Retrieval with Document Expansion Xingwei He et al., EMNLP2023
    2. q-D (query representations enriched with document-side signals, e.g., pseudo relevance feedback):
      1. Improving Query Representations for Dense Retrieval with Pseudo Relevance Feedback HongChien Yu et al., 2021.08 CIKM2021
  4. Multi-vector

      1. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT Omar Khattab et al., 2020.07 SIGIR2020
  5. Pre-training methods:

    1. Auto-encoding:
      1. Less is More: Pre-train a Strong Text Encoder for Dense Retrieval Using a Weak Decoder Shuqi Lu et al., 2021.02 EMNLP2021
      2. RetroMAE: Pre-Training Retrieval-Oriented Language Models via Masked Auto-Encoder Shitao Xiao et al., 2022.03, EMNLP2022
      3. ConTextual Masked Auto-Encoder for Dense Passage Retrieval Xing Wu et al., 2022.08 AAAI2023
      4. SimLM: Pre-training with Representation Bottleneck for Dense Passage Retrieval Liang Wang et al., ACL2023
      5. MASTER: Multi-task Pre-trained Bottlenecked Masked Autoencoders are Better Dense Retrievers Kun Zhou et al., 2022.12 ECML-PKDD 2023
      6. Drop your Decoder: Pre-training with Bag-of-Word Prediction for Dense Passage Retrieval Guangyuan Ma, et al., SIGIR2024
    2. Transformers:
      1. Condenser: a Pre-training Architecture for Dense Retrieval Luyu Gao et al., 2021.04 EMNLP2021
    3. Representative Words Prediction
      1. PROP: Pre-training with Representative Words Prediction for Ad-hoc Retrieval Xinyu Ma et al., WSDM2021, 2020.10
      2. B-PROP: Bootstrapped Pre-training with Representative Words Prediction for Ad-hoc Retrieval Xinyu Ma et al., SIGIR2021, 2021.04
    4. Synthetic data generation
      1. Multi-stage Training with Improved Negative Contrast for Neural Passage Retrieval Jing Lu et al., EMNLP2021
    5. Others:
      1. How to Train Your DRAGON: Diverse Augmentation Towards Generalizable Dense Retrieval Sheng-Chieh Lin et al., 2023.02
      2. M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation Jianlv Chen et al., 2024.07
  6. Query Reformulation

    1. Generation-Augmented Retrieval for Open-Domain Question Answering, Yuning Mao, et al., ACL2021
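
Most of the hard-negative mining and loss-function papers above optimize the same family of objective: an InfoNCE-style contrastive loss over the positive passage, in-batch negatives, and mined hard negatives. A minimal PyTorch sketch of that objective (tensor names, shapes, and the temperature are illustrative, not taken from any single paper):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q_emb, pos_emb, hard_neg_emb, temperature=0.05):
    """InfoNCE with in-batch negatives plus mined hard negatives.

    q_emb:        (B, D) query embeddings
    pos_emb:      (B, D) embedding of the positive passage for each query
    hard_neg_emb: (B, K, D) embeddings of K mined hard negatives per query
    """
    q = F.normalize(q_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    n = F.normalize(hard_neg_emb, dim=-1)

    in_batch = q @ p.T                       # (B, B): other queries' positives act as negatives
    hard = torch.einsum("bd,bkd->bk", q, n)  # (B, K): each query's own mined negatives
    logits = torch.cat([in_batch, hard], dim=1) / temperature

    # The positive for query i sits at column i of the in-batch block.
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)
```

The listed methods differ mainly in where `hard_neg_emb` comes from (BM25 negatives for DPR, periodically refreshed ANN negatives for ANCE, cross-encoder-denoised negatives for RocketQA), not in the loss itself.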

LLMs coming: using LLMs to improve retrieval


  1. W-RAG: Weakly Supervised Dense Retrieval in RAG for Open-domain Question Answering (2024)
  2. REPLUG: Retrieval-Augmented Black-Box Language Models (NAACL 2024). Key point: generation helps retrieval, i.e., the frozen LM's likelihood supervises the retriever (see the sketch after this list).
  3. A Case Study of Enhancing Sparse Retrieval using LLMs (WWW'24)
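
The REPLUG note ("generation helps retrieval") describes training the retriever from the frozen LM's feedback: documents under which the LM assigns higher likelihood to the gold answer should receive higher retrieval scores. A hedged sketch of such a distillation-style objective (shapes, names, and temperatures are illustrative; see the paper for the exact formulation):

```python
import torch.nn.functional as F

def lm_supervised_retrieval_loss(retrieval_scores, lm_gold_logprobs, tau_r=0.1, tau_lm=0.1):
    """Align the retriever's distribution over retrieved docs with the LM's feedback.

    retrieval_scores: (B, K) retriever similarity scores for K retrieved docs per query
    lm_gold_logprobs: (B, K) log-likelihood the frozen LM assigns to the gold answer
                      when each retrieved doc is prepended to the prompt
    """
    log_p_retriever = F.log_softmax(retrieval_scores / tau_r, dim=-1)
    q_lm = F.softmax(lm_gold_logprobs / tau_lm, dim=-1)  # soft labels; no gradient flows to the LM
    # KL(Q_LM || P_retriever): only the retriever receives gradients.
    return F.kl_div(log_p_retriever, q_lm, reduction="batchmean")
```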

LLMs for IR post-processing (relevance for the retriever; utility or usefulness for the generator)

  1. From the perspective of cognition:
    1. Are Large Language Models Good at Utility Judgments? (Hengran Zhang, SIGIR 2024)
    2. Iterative Utility Judgment Framework via LLMs Inspired by Relevance in Philosophy (Hengran Zhang, 2024.07)
    3. Corrective Retrieval Augmented Generation, Shi-Qi Yan, et al., Arxiv 2024
    4. Similarity is Not All You Need: Endowing Retrieval-Augmented Generation with Multi-layered Thoughts, Chunjing Gan, et al., Arxiv 2024
    5. ARKS: Active Retrieval in Knowledge Soup for Code Generation, Hongjin Su, et al., 2024
    6. Context-Augmented Code Generation Using Programming Knowledge Graphs, Iman Saberi, et al., 2024.10.9
    7. Evaluating Retrieval Quality in Retrieval-Augmented Generation, Alireza Salemi, et al., SIGIR2024
    8. Bridging the Preference Gap between Retrievers and LLMs. Zixuan Ke et al. ACL2024
  2. Query reformulation
    1. Large Language Models are Strong Zero-Shot Retriever Tao Shen, et al., ACL2024 Findings
    2. Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE), Luyu Gao, et al., ACL2023 (see the sketch after this list)
  3. Zero-shot encoder
    1. PromptReps: Prompting Large Language Models to Generate Dense and Sparse Representations for Zero-Shot Document Retrieval Shengyao Zhuang, et al., EMNLP2024
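
The HyDE entry above follows a simple zero-shot recipe: an LLM writes one or more hypothetical answer passages, and their embeddings stand in for the raw query embedding at search time. A minimal sketch; `generate`, `encode`, and `index.search` are hypothetical placeholders for any LLM call, dense encoder, and ANN index:

```python
import numpy as np

def hyde_search(query, generate, encode, index, k=10, n_samples=4):
    """HyDE-style retrieval: search with embeddings of LLM-written pseudo-passages.

    generate(prompt) -> str       # any LLM text-generation call (hypothetical helper)
    encode(text)     -> ndarray   # the dense retriever's encoder (hypothetical helper)
    index.search(vec, k) -> hits  # nearest-neighbour lookup over the corpus
    """
    prompt = (
        "Write a short passage that answers the question.\n"
        f"Question: {query}\nPassage:"
    )
    # Sample several hypothetical passages and average their embeddings;
    # the paper also averages in the encoded query itself.
    vecs = [encode(generate(prompt)) for _ in range(n_samples)]
    vecs.append(encode(query))
    query_vec = np.mean(np.stack(vecs), axis=0)
    return index.search(query_vec, k)
```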

LLMs in dense retrieval (LLMs as encoder)

  1. Fine-Tuning LLaMA for Multi-Stage Text Retrieval (SIGIR 2024); see the pooling sketch after this list
    1. append an EOS token to each query/passage
    2. use the EOS token's final hidden state as the embedding of the whole sequence
  2. Making Large Language Models a Better Foundation For Dense Retrieval (2023) -> Llama2Vec: Unsupervised Adaptation of Large Language Models for Dense Retrieval (ACL 2024)
    1. first work on pre-training for dense retrieval using LLMs
    2. motivation: the LLMs' output embeddings mainly capture the local, near-future semantics of the context, whereas dense retrieval needs embeddings that represent the global semantics of the entire input
  3. Improving Text Embeddings with Large Language Models (2023), Liang Wang, et al., ACL2024
    1. E5-mistral-7B
    2. fine-tuning on both the generated synthetic data and a collection of 13 public datasets.
  4. NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models (2024), ICLR2025
    1. Two-stage fine-tuning:
      1. It first applies contrastive training with instructions on retrieval datasets, utilizing in-batch negatives and curated hard negative examples.
      2. At stage-2, it blends various non-retrieval datasets into instruction tuning, which not only enhances non-retrieval task accuracy but also improves retrieval performance.
  5. Repetition Improves Language Model Embeddings (2024), Jacob Mitchell Springer, et al., ICLR2025
    1. feed the query or passage to the model twice, so tokens in the second copy can attend to the full text
    2. this improves the embedding read from the last token
  6. LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders, Parishad BehnamGhader, et al., COLM2024
  7. NV-Retriever: Improving text embedding models with effective hard-negative mining, Gabriel de Souza P. Moreira, Feb 2025, Arxiv
  8. ScalingNote: Scaling up Retrievers with Large Language Models for Real-World Dense Retrieval, Suyuan Huang, et al.,
  9. Unleashing the Power of LLMs in Dense Retrieval with Query Likelihood Modeling, Hengran Zhang, et al., Arxiv2025
  10. Scaling Sparse and Dense Retrieval in Decoder-Only LLMs, Hansi Zeng, et al., SIGIR2025
  11. Gemini key technologies: Gecko: Versatile Text Embeddings Distilled from Large Language Models Jinhyuk Lee, et al., 9 Mar 2024, Arxiv
  12. NovaSearch Jasper and Stella: distillation of SOTA embedding models Dun Zhang, et al., 23 Jan 2025, Arxiv
  13. Linq-Embed-Mistral Report Junseong Kim, et al., May 2024
  14. PromptReps: Prompting Large Language Models to Generate Dense and Sparse Representations for Zero-Shot Document Retrieval Shengyao Zhuang, et al., EMNLP2024
  15. Scaling sentence embeddings with large language models Ting Jiang, et al., EMNLP2024 Findings
  16. Improving General Text Embedding Model: Tackling Task Conflict and Data Imbalance through Model Merging Mingxin Li, et al., Oct 2024
  17. Qwen3-text-embedding: Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models Yanzhao Zhang et al., Jun 2025
  18. Gemini Embedding: Generalizable Embeddings from Gemini, Jinhyuk Lee, et al., Mar 2025
  19. BGE-en-icl: Making Text Embedders Few-Shot Learners Chaofan Li, et al., ICLR2025
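
A recurring implementation detail in the list above (items 1 and 5 in particular) is how a decoder-only LLM pools a whole text into one vector: append an EOS token and read out its final hidden state. A minimal Hugging Face sketch under that assumption; the checkpoint name is a placeholder and this is not any paper's official code:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-hf"  # placeholder; any decoder-only checkpoint

# add_eos_token appends </s> to every sequence (the "append an EOS token" step above).
tokenizer = AutoTokenizer.from_pretrained(MODEL, add_eos_token=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"    # keep the EOS as the last non-padding token
model = AutoModel.from_pretrained(MODEL)

@torch.no_grad()
def embed(texts):
    """Use the hidden state of the appended EOS token as the text embedding."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state          # (B, T, D)
    last = batch["attention_mask"].sum(dim=1) - 1      # index of the EOS per sequence
    emb = hidden[torch.arange(hidden.size(0)), last]   # (B, D)
    return F.normalize(emb, dim=-1)
```

The "Repetition Improves Language Model Embeddings" entry keeps the same last-token read-out but feeds the text twice, so the pooled token has attended to a full copy of the input.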

Ranking

LLMs for ranking

Zero-shot

  1. A Setwise Approach for Effective and Highly Efficient Zero-shot Ranking with Large Language Models, SIGIR 2024
  2. PRP-Graph: Pairwise Ranking Prompting to LLMs with Graph Aggregation for Effective Text Re-ranking (ACL 2024)
  3. Improving Zero-shot LLM Re-Ranker with Risk Minimization (EMNLP'24)
  4. RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs (Arxiv'24)
  5. Re-Ranking Step by Step: Investigating Pre-Filtering for Re-Ranking with Large Language Models, Arxiv2024
  6. Lightweight Reranking for Language Model Generations, ACL'24
  7. JudgeRank: Leveraging Large Language Models for Reasoning-Intensive Reranking. Pointwise pipeline: core question -> select key sentences -> relevance judgment
  8. Open-source Large Language Models are Strong Zero-shot Query Likelihood Models for Document Ranking, Shengyao Zhuang, et al., EMNLP2023 Findings (see the query-likelihood sketch after this list)
  9. Zero-Shot Cross-Lingual Reranking with Large Language Models for Low-Resource Languages Mofetoluwa Adeyemi, et al., ACL'24 short
  10. Improving Passage Retrieval with Zero-Shot Question Generation Devendra Singh Sachan et al., EMNLP2022
  11. Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents (EMNLP'23)
  12. Are Large Language Models Good at Utility Judgments? Hengran Zhang, et al., (SIGIR'24)
  13. Distillation and Refinement of Reasoning in Small Language Models for Document Re-ranking, Chris Samarinas, Hamed Zamani, Arxiv2025
  14. Beyond Yes and No: Improving Zero-Shot LLM Rankers via Scoring Fine-Grained Relevance Labels, Honglei Zhuang, NAACL2024. Scores documents by the log-probability of fine-grained relevance labels
  15. PaRaDe: Passage Ranking using Demonstrations with Large Language Models (Andrew Drozdov et al., Findings of EMNLP'23, short paper). Uses demonstration query likelihood to estimate difficulty and select demonstrations
  16. APEER: Automatic Prompt Engineering Enhances Large Language Model Reranking, Can Jin, et al., WWW'25. Refines ranking prompts directly using LLMs
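
Several zero-shot rankers above (the query-likelihood entries, items 8 and 10, and the fine-grained relevance-label scorer, item 14) never sample text; they read token log-probabilities off the LLM. A minimal query-likelihood sketch in the spirit of item 10: score each document by the LLM's log-probability of generating the query given the document (prompt wording and checkpoint are illustrative placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM works for the sketch
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

@torch.no_grad()
def query_likelihood(query, doc):
    """log P(query | prompt(doc)) under the LLM, summed over the query tokens."""
    prompt = f"Passage: {doc}\nPlease write a question based on this passage.\nQuestion:"
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    query_ids = tokenizer(" " + query, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, query_ids], dim=1)

    logits = model(input_ids).logits                       # (1, T, vocab)
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)  # predicts tokens 1..T-1
    targets = input_ids[:, 1:]
    token_lp = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)  # (1, T-1)
    # Keep only the positions that correspond to the query tokens.
    return token_lp[:, prompt_ids.size(1) - 1:].sum().item()

def rerank(query, docs):
    return sorted(docs, key=lambda d: query_likelihood(query, d), reverse=True)
```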

Fine-tuning

  1. Leveraging Passage Embeddings for Efficient Listwise Reranking with Large Language Models, Qi Liu et al., WWW'25. Addresses the high latency of LLM inference in listwise reranking
  2. FIRST: Faster Improved Listwise Reranking with Single Token Decoding Revanth Gangi Reddy, et al., EMNLP2024
  3. RankVicuna: Zero-Shot Listwise Document Reranking with Open-Source Large Language Models (Arxiv'23) / RankZephyr: Effective and Robust Zero-Shot Listwise Reranking is a Breeze!, Ronak Pradeep, et al., Arxiv2023 (see the listwise prompt sketch after this list)
  4. Self-Calibrated Listwise Reranking with Large Language Models Ruiyang Ren, et al., WWW25
  5. Adapting Decoder-Based Language Models for Diverse Encoder Downstream Tasks, Paul Suganthan, et al., Arxiv, March 2025
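
The listwise rerankers above (RankVicuna/RankZephyr here, and the "Is ChatGPT Good at Search?" entry in the zero-shot list) are built around a permutation-generation prompt: the model sees numbered passages and outputs an ordering such as "[2] > [3] > [1]". A sketch of that prompt format and a tolerant output parser (the wording is a paraphrase, not any paper's exact template):

```python
import re

def listwise_rerank_prompt(query, passages):
    """Build a listwise prompt asking the LLM to return a permutation like "[2] > [3] > [1]"."""
    lines = [f"I will provide you with {len(passages)} passages, each indicated by a numerical identifier []."]
    lines += [f"[{i + 1}] {p}" for i, p in enumerate(passages)]
    lines.append(f"Search Query: {query}")
    lines.append("Rank the passages based on their relevance to the query. "
                 "Output the ranking as identifiers, e.g., [2] > [1], with no explanation.")
    return "\n".join(lines)

def parse_permutation(response, num_passages):
    """Extract the ranking from the model output; unmentioned passages keep their original order."""
    seen = []
    for m in re.findall(r"\[(\d+)\]", response):
        idx = int(m) - 1
        if 0 <= idx < num_passages and idx not in seen:
            seen.append(idx)
    return seen + [i for i in range(num_passages) if i not in seen]
```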
