We present CADET, a framework for fine-tuning embedding models for retrieval on specific corpora using diverse synthetic queries and cross-encoder listwise distillation.
We will continue to refine this codebase. For questions or support, please reach out to mtamber@uwaterloo.ca.
Model link: cadet-embed-base-v1 on Hugging Face
- encoding/: Contains scripts to encode corpora and evaluate models.
- query_generation/: Includes scripts for generating synthetic queries.
- reranker/: Code for reranking.
- training_scripts/: Scripts for fine-tuning models.
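
For readers new to the idea, cross-encoder listwise distillation can be illustrated with a small, dependency-free sketch. It assumes a common formulation: for one query, the cross-encoder (teacher) scores and the embedding-model (student) similarity scores over the same candidate passages are each turned into a distribution with a softmax, and the student is trained to minimize the KL divergence from the teacher. The function names, the temperature parameters, and the choice of KL divergence are assumptions made for illustration; they are not taken from the scripts in training_scripts/ and may differ from the loss used in the paper.

```python
import math


def log_softmax(scores, temperature=1.0):
    """Numerically stable log-softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_sum_exp for s in scaled]


def listwise_distillation_loss(student_scores, teacher_scores,
                               student_temp=1.0, teacher_temp=1.0):
    """KL(teacher || student) over one query's candidate list.

    student_scores: query-passage similarities from the embedding model.
    teacher_scores: relevance scores from the cross-encoder for the same
                    passages, in the same order.
    Both temperatures are hypothetical knobs, not values from the paper.
    """
    if len(student_scores) != len(teacher_scores):
        raise ValueError("student and teacher must score the same candidates")
    log_p = log_softmax(teacher_scores, teacher_temp)  # teacher distribution
    log_q = log_softmax(student_scores, student_temp)  # student distribution
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


if __name__ == "__main__":
    # Made-up scores for three candidate passages of one query.
    teacher = [5.0, 1.0, -2.0]
    student = [0.8, 0.3, 0.1]
    print(listwise_distillation_loss(student, teacher))
```

The loss is zero when the student's distribution over the list matches the teacher's and grows as the two rankings disagree. In an actual training loop the same quantity would be computed on batched tensors so that gradients flow into the embedding model.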
If you use CADET, please cite the following paper:
@article{tamber2025conventional,
title={Conventional Contrastive Learning Often Falls Short: Improving Dense Retrieval with Cross-Encoder Listwise Distillation and Synthetic Data},
author={Tamber, Manveer Singh and Kazi, Suleman and Sourabh, Vivek and Lin, Jimmy},
journal={arXiv:2505.19274},
year={2025}
}