This repository contains the implementation of Scenario-Profiled Indexing with Knowledge Expansion (SPIKE), a dense retrieval framework that explicitly indexes potential implicit relevance within documents.
SPIKE reframes document representations into hypothetical retrieval scenarios, where each scenario encapsulates the reasoning process required to uncover implicit relevance between a hypothetical information need and the document content. This approach:
- Enhances retrieval performance by explicitly modeling how a document addresses hypothetical information needs, capturing implicit relevance between query and document.
- Effectively connects query-document pairs across different formats, such as natural-language queries and code snippets, enabling semantic alignment despite surface-form differences.
- Improves the retrieval experience for users by providing useful information while also serving as valuable context for LLMs in RAG settings.
SPIKE is implemented through the following process:
Before generating scenarios, we train our scenario generator model to effectively identify implicit relevance:
- Scenario-augmented training data: Using high-performing LLMs (e.g., GPT-4o) to create high-quality scenarios
- Scenario Distillation: Training a smaller model (Llama-3.2-3B-Instruct) to efficiently produce reasoning-driven scenarios
This process is implemented in `src/scenario_generator/`.
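For illustration, scenario-augmented training data can be collected with a call like the following. This is a minimal sketch, assuming GPT-4o via the OpenAI API; the prompt wording and sampling temperature are our own assumptions, not the exact ones used in this repository:

```python
# Hypothetical sketch of scenario-augmented data creation with the OpenAI API.
# The prompt text and temperature are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SCENARIO_PROMPT = (
    "Given the following document, write a hypothetical retrieval scenario: "
    "an information need this document could satisfy, together with the "
    "reasoning that connects that need to the document content.\n\n"
    "Document:\n{document}"
)

def generate_training_scenario(document: str) -> str:
    """Ask a high-performing LLM (e.g., GPT-4o) for one scenario for a document."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": SCENARIO_PROMPT.format(document=document)}],
        temperature=0.7,  # assumed sampling setting
    )
    return response.choices[0].message.content
```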
We generate hypothetical retrieval scenarios for each document using the trained scenario generator.
This process is implemented in `data/scenario_extract/generator_scenario/generate_scenario.py`.
For efficient scenario generation at scale, we utilize vLLM's OpenAI-compatible server, which allows us to process multiple documents in parallel using asyncio, handle large batches efficiently, and maintain consistent generation quality while improving throughput.
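A minimal sketch of this setup, assuming the generator is already being served locally (the served model name, prompt text, and concurrency limit below are illustrative assumptions):

```python
# Sketch of parallel scenario generation against a vLLM OpenAI-compatible server,
# e.g. one started with: vllm serve <path-to-scenario-generator> --port 8000
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def generate_scenario(document: str, sem: asyncio.Semaphore) -> str:
    async with sem:  # bound in-flight requests so large batches stay stable
        response = await client.chat.completions.create(
            model="scenario-generator",  # must match the model name served by vLLM
            messages=[{"role": "user", "content": f"Generate a retrieval scenario for:\n{document}"}],
        )
        return response.choices[0].message.content

async def generate_all(documents: list[str], concurrency: int = 32) -> list[str]:
    sem = asyncio.Semaphore(concurrency)
    return await asyncio.gather(*(generate_scenario(d, sem) for d in documents))

# scenarios = asyncio.run(generate_all(corpus_documents))
```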
We also provide our trained scenario generator model for public use. The model is available at: Scenario Generator HF Link
This model can be used to generate hypothetical retrieval scenarios for your own documents, enabling you to implement the SPIKE framework in your retrieval applications.
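For example, the released generator could be run offline with vLLM as sketched below; the HF repo id is a placeholder for the link above, and the prompt and sampling parameters are illustrative:

```python
# Minimal sketch of offline use of the released scenario generator with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(model="<scenario-generator-hf-id>")  # placeholder; use the actual HF repo id
params = SamplingParams(temperature=0.7, max_tokens=512)

messages = [{"role": "user", "content": "Generate a retrieval scenario for:\n<your document text>"}]
outputs = llm.chat(messages, sampling_params=params)
print(outputs[0].outputs[0].text)
```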
For efficiency and accuracy, we use different libraries to extract embeddings for different dense retrieval models. The models handled by each library are as follows:
- vLLM: E5-Mistral-7B, SFR, Qwen
- SentenceTransformers: SBERT (all-mpnet-base-v2), BGE-Large
- GritLM: GritLM
We also use FAISS for efficient similarity search.
The embedding and indexing process is implemented in `data/embedding/`, supporting various embedding models with different dimensions and context-length limitations.
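As a sketch of this step, using one of the SentenceTransformers models listed above (the choice of an exact inner-product FAISS index and the normalization setting are assumptions for the example):

```python
# Illustrative embedding-and-indexing sketch with sentence-transformers + FAISS.
import faiss
from sentence_transformers import SentenceTransformer

documents = ["..."]   # your corpus
scenarios = ["..."]   # generated scenarios, aligned with their source documents

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")  # SBERT from the list above

# Index scenarios alongside their source documents; normalized vectors make
# inner product equivalent to cosine similarity.
embeddings = model.encode(documents + scenarios, normalize_embeddings=True)

index = faiss.IndexFlatIP(embeddings.shape[1])  # exact inner-product search
index.add(embeddings)
```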
During retrieval, SPIKE combines document-level and scenario-level relevance:
- Score Computation: Computing relevance scores between the query and both documents and scenarios
- Score Aggregation: Combining document scores with the maximum scenario score using a weighted sum
- Result Ranking: Producing the final ranked list based on the combined relevance scores
This process is implemented in `src/main_result`.
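The aggregation itself reduces to a simple weighted sum; the sketch below is illustrative, and the interpolation weight `alpha` is an assumed hyperparameter rather than the repository's tuned value:

```python
# Sketch of SPIKE's score aggregation: combine each document's own score with
# the best score among its scenarios (see src/main_result for the actual code).
import numpy as np

def aggregate_scores(
    doc_scores: np.ndarray,       # (num_docs,) query-document similarities
    scenario_scores: np.ndarray,  # (num_docs, num_scenarios) query-scenario similarities
    alpha: float = 0.5,           # assumed interpolation weight
) -> np.ndarray:
    """Weighted sum of the document score and the maximum scenario score per document."""
    best_scenario = scenario_scores.max(axis=1)
    return (1 - alpha) * doc_scores + alpha * best_scenario

# Final ranking by combined relevance:
# ranked = np.argsort(-aggregate_scores(doc_scores, scenario_scores))
```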
datasets==2.21.0
faiss-gpu-cu12==1.10.0
gritlm==1.0.2
lightning==2.5.1
openai==1.69.0
pyarrow==19.0.1
sentence-transformers==4.0.1
sentencepiece==0.2.0
tokenizers==0.21.1
torch==2.6.0
transformers==4.50.2
vllm==0.8.2
wandb==0.19.8