Tiansheng Wen, Yifei Wang, Zequn Zeng, Zhong Peng, Yudi Su, Xinyang Liu, Bo Chen, Hongwei Liu, Stefanie Jegelka, Chenyu You
In this paper, we show that sparse coding offers a compelling alternative for achieving adaptive representation with minimal overhead and higher fidelity. We propose Contrastive Sparse Representation (CSR), a method that sparsifies pre-trained embeddings into a high-dimensional but selectively activated feature space. By leveraging lightweight autoencoding and task-aware contrastive objectives, CSR preserves semantic quality while allowing flexible, cost-effective inference at different sparsity levels. Extensive experiments on image, text, and multimodal benchmarks demonstrate that CSR consistently outperforms MRL in both accuracy and retrieval speed, often by large margins, while also cutting training time to a fraction of that required by MRL. Our results establish sparse coding as a powerful paradigm for adaptive representation learning in real-world applications where efficiency and fidelity are both paramount.
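For readers who want a concrete picture of the mechanism, the sketch below shows a top-k sparse autoencoder over frozen pre-trained embeddings. It is a minimal illustration of the idea, not the repository's exact implementation; all class and variable names are ours.

```python
import torch
import torch.nn as nn

class TopKSparseAutoencoder(nn.Module):
    """Minimal sketch: encode a frozen embedding into a high-dimensional
    latent code, keep only the top-k activations, then reconstruct."""
    def __init__(self, embed_dim=2048, hidden_size=8192, topk=8):
        super().__init__()
        self.encoder = nn.Linear(embed_dim, hidden_size)
        self.decoder = nn.Linear(hidden_size, embed_dim)
        self.topk = topk

    def forward(self, x):
        z = torch.relu(self.encoder(x))
        # Keep the k largest activations per sample; zero out the rest.
        vals, idx = torch.topk(z, self.topk, dim=-1)
        z_sparse = torch.zeros_like(z).scatter_(-1, idx, vals)
        return self.decoder(z_sparse), z_sparse

# Usage: reconstruct frozen backbone embeddings; the task-aware contrastive
# objective from the paper (not shown here) is applied on z_sparse.
model = TopKSparseAutoencoder()
x = torch.randn(4, 2048)            # stand-in for pre-trained embeddings
recon, code = model(x)
print(code.count_nonzero(dim=-1))   # each row has at most `topk` active units
```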
This project is built on top of prior research and infrastructure from the Y Research team.
Please refer to the Y Research repo for more details.
- 2025.07.01 Our models can be loaded with Sentence_Transformers now! 😁😁
- 2025.06.06 More model checkpoints are released!! 😁😁
- 2025.05.25 Major Update. We have thoroughly reorganized our repository with the following changes: 🎉🎉
- Minor code changes on Visual Experiments, especially dataset preparation.
- Training & Evaluation Pipeline for Text Experiments, including text classification, text clustering and text retrieval.
- Minor changes in codes for Multimodal Experiments.
- More detailed instructions for Data Preparation & Training & Evaluation.
- 2025.04.25 Training & Evaluation Pipeline for Multimodal Retrieval is Now Live! We further provide pre-computed ImageNet1k embeddings at Dataset Link for easy follow-up! 🙌🙌
- 2025.03.25 Evaluation framework for multimodal retrieval tasks is now online!! 🪼🪼
- 2025.03.07 Weights for visual embeds (k=8 & 32) and multimodal embeds (k=64) are now online!! 😁😁
- 2025.03.05 Code released! Let's embrace sparsity!! 🎉🎉
In this repo, we will release (updating):
- Environment Dependencies ✅
- Checkpoints ✅
  - Visual ckpt (on ImageNet) ✅
  - Text ckpt (on partly MTEB datasets) ✅
  - MultiModal ckpt (on MS COCO) ✅
- Reproducing Experiments ✅
  - Visual Exp ✅
    - Dataset preparations ✅
    - Training ✅
    - Evaluation ✅
  - Text Exp ✅
    - Dataset preparations ✅
    - Training ✅
    - Evaluation ✅
  - MultiModal Exp ✅
    - Dataset preparations ✅
    - Training ✅
    - Evaluation ✅
- Retrieval Time Evaluation ✅
  - Visual Exp ✅
- loading CSR with Sentence Transformers ✅
- fit CSR into Sentence Transformers SparseEncoder v5.0.0 release 📌
You can load our models with Sentence Transformers now. Here is an example of how to evaluate our models on MTEB with sentence_transformers:
import mteb
from sentence_transformers import SentenceTransformer

# Load the CSR checkpoint as a regular Sentence Transformers model
model = SentenceTransformer(
    "Y-Research-Group/CSR-NV_Embed_v2-Classification-Banking77",
    trust_remote_code=True
)
# Task-specific instruction prompt used at encoding time
model.prompts = {
    "Banking77Classification": "Instruct: Given a online banking query, find the corresponding intents\nQuery:"
}

# Run the MTEB Banking77 classification task
task = mteb.get_tasks(tasks=["Banking77Classification"])
evaluation = mteb.MTEB(tasks=task)
evaluation.run(
    model,
    eval_splits=["test"],
    output_folder="./results/Banking77Classification",
    batch_size=32,
    show_progress_bar=True,
)
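Outside of MTEB, the same checkpoint can also be used for plain embedding extraction. A minimal usage sketch (the query string is only an illustration):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "Y-Research-Group/CSR-NV_Embed_v2-Classification-Banking77",
    trust_remote_code=True
)
# Encode a few example queries into CSR embeddings
embeddings = model.encode(["How do I activate my new card?"], batch_size=1)
print(embeddings.shape)
```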
Currently we are investigating the v5.0.0 release of Sentence Transformers so that our CSR models can catch up with the latest SparseEncoder update.
You only need to prepare an empty conda environment with Python 3 (reference version: Python 3.8.20) and pip install the `requirements.txt` file in this directory:
conda create --name csr python=3.8.20
conda activate csr
pip install -r requirements.txt
First, please move to our `vision_representation` codebase directory.
cd ./vision_representation/
We provide embeddings extracted by FF2048 backbones (the same backbone weights as MRL) and embeddings from a SoTA backbone at Dataset Link.
To train CSR with different visual backbones, please follow the preparation steps below.
Step I: Download Imagenet1k dataset and bounding box annotations from Imagenet1k Official Website.
Step II: Convert the original dataset to PyTorch style.
# Prepare the annotations.txt file for both the training and validation sets
python ./dataset_preparation/annotations.py --xml_dir "/path/to/train/annotation/directory" --output_file "/path/to/annotation.txt/directory"
# Convert the original dataset to PyTorch style
python ./dataset_preparation/to_pytorch_style.py --split_path "/path/to/pytorch/style/dataset"
Step III: Follow the pipeline of FFCV for ResNet50 to generate the dataset with the following command (IMAGENET_DIR should point to a PyTorch-style ImageNet dataset).
cd dataset_preparation
# Required environmental variables for the script:
export IMAGENET_DIR=/path/to/pytorch/format/imagenet/directory/
export WRITE_DIR=/your/path/here/
# Serialize images with:
# - dataset type: train/val
# - 500px side length maximum
# - 50% JPEG encoded, 90% raw pixel values
# - quality=90 JPEGs
./write_imagenet.sh "train" 500 0.50 90
./write_imagenet.sh "val" 500 0.50 90
For training and evaluation simplicity, we precompute image embeddings using models from Timm. In our paper, we select resnet50d.ra4_e3600_r224_in1k as our pre-trained visual backbone. To extract embeddings, run the following command:
python pretrained_embeddings.py \
--train_data_ffcv /path/to/train.ffcv \
--eval_data_ffcv /path/to/val.ffcv \
--model_name "pre-trained visual backbone"
Then stack embeds together:
python stack_emb.py
Note: We did this only due to memory constraints on our machine; otherwise, you can directly infer the entire set of training embeddings without the stacking step.
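For reference, here is a minimal sketch of what the embedding-extraction step boils down to with timm; the repo's `pretrained_embeddings.py` additionally handles FFCV data loading, batching, and saving, and the random batch below is only a stand-in.

```python
import timm
import torch

# num_classes=0 makes timm return pooled features instead of logits
backbone = timm.create_model("resnet50d.ra4_e3600_r224_in1k",
                             pretrained=True, num_classes=0)
backbone.eval()

with torch.no_grad():
    images = torch.randn(4, 3, 224, 224)   # stand-in for an FFCV batch
    embeddings = backbone(images)          # (4, 2048) for ResNet-50d
print(embeddings.shape)
```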
After getting the embeddings, you can train CSR with `main_visual.py`. You must customize `pretrained_emb` (path to the embeddings) and `model_name` (the timm model's name). For other parameters, you can either keep the default settings or customize them. The trained models will be saved to `./ckpt/CSR_topk_{args.topk}/`.
python main_visual.py \
--pretrained_emb /path/to/pretrained_emb \
--model_name "pre-trained visual backbone" \
--use_ddp False \ # set True if you want to use multi-GPU
--gpu 1 \ # GPU ID, set None if you use multi-GPU
--batch-size 4096 \ # 1024 * 4
--lr 4e-4 \
--use_CL True \ # whether to use contrastive learning
--topk 8 \ # topk for CSR
--auxk 512 \ # auxiliary sparse code size
--hidden-size 8192 # By default, 4 * visual backbone embedding size
You can get CSR embeddings with `csr_inference.py`. You must customize `train_emb_path`, `eval_emb_path`, and `csr_ckpt`. The embeddings will be saved in `./retrieval/` by default. Note: because the CSR embeddings are too large, we split them into chunks stored in the same directory with `chunk_original_npz_file.py`. A minor code change to the method `generate_retrieval_data` in `utils.py` is needed if you prefer inference with a single `.npz` file.
# Chunk original embeddings
python chunk_npz_file.py \
--input_path "Path/to/original/embeddings" \
--output_path "Path/to/chunk/directory" \
--chunk_size "Number of samples per chunk"
# Inference
python csr_inference.py \
--train_emb_path /path/to/train_emb \
--eval_emb_path /path/to/val_emb \
--model_name "pre-trained visual backbone" \
--topk 8 \
--csr_ckpt "CSR ckpt path" \
--hidden-size 8192 # By default, 4 * visual backbone embedding size
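If you want to see what the chunking step amounts to, here is a minimal sketch under the assumption that the embeddings sit in a single `.npz` file under one array key (the key name and paths are hypothetical; the repo's script may differ):

```python
import os
import numpy as np

def chunk_npz(input_path, output_dir, chunk_size, key="embeddings"):
    """Split one large .npz array into fixed-size chunks on disk."""
    os.makedirs(output_dir, exist_ok=True)
    data = np.load(input_path)[key]
    for i, start in enumerate(range(0, len(data), chunk_size)):
        np.savez_compressed(
            os.path.join(output_dir, f"chunk_{i:04d}.npz"),
            **{key: data[start:start + chunk_size]},
        )

# Example: chunk_npz("train_csr_emb.npz", "./retrieval/chunks", 100000)
```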
We use FAISS for KNN evaluation and calculate Top-1 accuracy under different sparsity conditions. Note that this evaluation runs on a 128-CPU server, requires approximately 150 GB of memory, and takes about 20 minutes to complete. Further optimization is needed, and we welcome collaboration. You only need to set `--topk` if the generated files have not been moved.
# Get FAISS index
python ./retrieval/faiss_nn.py --topk 8
# Evaluate Top1 accuracy
python ./retrieval/compute_metrics.py --topk 8
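As a rough reference for what these two scripts compute, here is a minimal sketch of Top-1 KNN accuracy with FAISS; the array and label names are ours, and the real scripts operate on the chunked embeddings described above.

```python
import faiss
import numpy as np

def knn_top1_accuracy(train_emb, train_labels, val_emb, val_labels):
    """Nearest-neighbor classification: each val sample takes the label of
    its closest training embedding (inner product ~ cosine if normalized)."""
    index = faiss.IndexFlatIP(train_emb.shape[1])
    index.add(np.ascontiguousarray(train_emb, dtype=np.float32))
    _, nn = index.search(np.ascontiguousarray(val_emb, dtype=np.float32), 1)
    pred = train_labels[nn[:, 0]]
    return float((pred == val_labels).mean())
```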
We focus on three kinds of tasks for comparison: Text Classification, Text Clustering, and Text Retrieval. First, please move to our `text_representation` codebase directory.
cd ./Text
We focus on three datasets for classification: MTOPIntent, Banking77 and tweetSentiment.
We download these datasets from Hugging Face to the `./datasets` directory. You can use the `load_dataset` API if you like (minor changes to the code may be needed). If you follow our way of data preparation, no extra changes to the original data are necessary. For instance, the `banking77` directory should be organized in the following format:
./banking77
├── README.md
├── data
│ ├── test-00000-of-00001.parquet
│ └── train-00000-of-00001.parquet
├── prepare_data.py
├── test.jsonl
└── train.jsonl
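If you download the parquet shards yourself, the conversion to train.jsonl / test.jsonl is straightforward. A minimal sketch of what a `prepare_data.py`-style conversion could do (the columns are whatever the Hugging Face parquet files contain, typically `text` and `label`):

```python
import pandas as pd

# Convert the Hugging Face parquet shards into JSONL splits
for split in ["train", "test"]:
    df = pd.read_parquet(f"./banking77/data/{split}-00000-of-00001.parquet")
    df.to_json(f"./banking77/{split}.jsonl", orient="records", lines=True)
```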
You can get our pre-computed NV-Embed-v2 embeddings from Dataset Link, or you can use `./get_embeddings/get_classification_embeddings.py` to generate your own embeddings.
cd get_embeddings/
python get_classification_embeddings.py \
--dataset "Dataset_name" \ # Dataset to process
--language None \ # Required for MTOP_Intent dataset
--split None # default: all splits
After getting the embeddings, you can train CSR with `train_CSR_models.py`. You need to customize `--task_name`, `--pretrained_emb`, and `--gpu`.
python train_CSR_models.py \
--task_name $TASK_NAME \ # the dataset you want to train CSR on.
--pretrained_emb /path/to/your/embeddings \ # path to embeddings for training
--gpu $DEVICE \ # the device to train CSR
--batch-size 128 \
--lr 4e-5 \
--topk 32 \
--auxk 1024 \
--hidden-size 16384 \
--embed_dim 4096
You can get CSR embeddings with `text_classification_csr_inference.py`. You must customize `--train_emb_path` (path to training dataset embeddings), `--val_emb_path` (path to validation dataset embeddings), and `--csr_ckpt` (path to the CSR model).
python ./text_classification_csr_inference.py \
--train_emb_path /path/to/training/dataset/embeddings \
--val_emb_path /path/to/validation/dataset/embeddings \
--topk 32 \
--hidden-size 16384 \
--csr_ckpt /path/to/CSR/model
We use Top-1 Acc (%) as the text classification evaluation metric. The training set embeddings are used to train a logistic regression classifier, which is then scored on the test set. You only need to customize `--training_embedding_path` and `--test_embedding_path`.
python ./evaluation_on_textclassification.py \
--training_embedding_path "Path/to/Training/Set/embeddings" \
--test_embedding_path "Path/to/Test/Set/Embeddings" \
--n_jobs -1 \
--verbose 0 \
--max_iter 2000
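For clarity, a minimal sketch of the evaluation protocol described above; the file layout and array keys are illustrative, not the repo's exact format.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical layout: .npz files with "embeddings" and "labels" arrays
train = np.load("train_csr_embeddings.npz")
test = np.load("test_csr_embeddings.npz")

clf = LogisticRegression(max_iter=2000, n_jobs=-1, verbose=0)
clf.fit(train["embeddings"], train["labels"])
top1 = clf.score(test["embeddings"], test["labels"]) * 100
print(f"Top-1 Acc: {top1:.2f}%")
```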
We focus on three datasets for clustering: BiorxivP2P, BiorxivS2S and TwentyNews.
Please follow the data preparation pipeline in Text Classification. For instance, the `biorxiv-clustering-p2p` directory should be organized in the following format:
./biorxiv-clustering-p2p
├── README.md
└── test.jsonl
You can get our pre-computed NV-Embed-v2 embeddings from Dataset Link, or you can use `./get_embeddings/get_clustering_embeddings.py` to generate your own embeddings.
cd get_embeddings/
python get_clustering_embeddings.py \
--dataset "Dataset_name" \ # Dataset to process
--split None # default: all splits
After getting the embeddings, you can train CSR with `train_CSR_models.py`. You need to customize `--task_name`, `--pretrained_emb`, and `--gpu`.
python train_CSR_models.py \
--task_name $TASK_NAME \ # the dataset you want to train CSR on.
--pretrained_emb /path/to/your/embeddings \ # path to embeddings for training
--gpu $DEVICE \ # the device to train CSR
--batch-size 128 \
--lr 4e-5 \
--topk 32 \
--auxk 1024 \
--hidden-size 16384 \
--embed_dim 4096
You can get CSR embeddings with `text_clustering_csr_inference.py`. You must customize `--test_emb_path` (path to test dataset embeddings) and `--csr_ckpt` (path to the CSR model).
python ./text_clustering_csr_inference.py \
--test_emb_path /path/to/test/dataset/embeddings \
--topk 32 \
--hidden-size 16384 \
--csr_ckpt /path/to/CSR/model
We use Top-1 Acc (%) as the text clustering evaluation metric. A mini-batch k-means model with batch size 32 and `n_clusters` equal to the number of distinct labels is trained on the embeddings. You need to customize `--n_clusters` and `--embedding_path`.
python ./evaluation_on_textclustering.py \
--embedding_path "Path/to/embeddings" \
--n_clusters $cluster_num \ # The number of classes in each dataset
--batch_size 32 \
--n_init "auto"
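A minimal sketch of the clustering setup described above, assuming the embeddings and ground-truth labels are available as NumPy arrays (names are ours); V-measure is shown purely as an illustrative score, and the repo's script may report a different one.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import v_measure_score

embeddings = np.load("test_csr_embeddings.npy")   # hypothetical paths
labels = np.load("test_labels.npy")

km = MiniBatchKMeans(n_clusters=len(np.unique(labels)),
                     batch_size=32, n_init="auto", random_state=0)
pred = km.fit_predict(embeddings)
print(f"V-measure: {v_measure_score(labels, pred):.4f}")
```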
We focus on three datasets for retrieval: FiQA2018, NFCorpus, and SciFACT.
Please follow the data preparation pipeline in Text Classification. For instance, the `scifact` directory should be organized in the following format:
./scifact
├── README.md
├── concat_two_embeddings.py
├── corpus.jsonl
├── qrels
│ ├── test.jsonl
│ ├── test.tsv
│ ├── train.jsonl
│ └── train.tsv
└── queries.jsonl
You can get our pre-computed NV-Embed-v2 embeddings from Dataset Link, or you can use `./get_embeddings/get_retrieval_embeddings.py` to generate your own embeddings.
cd get_embeddings/
python get_retrieval_embeddings.py \
--dataset "Dataset_name" \ # Dataset to process
--language None \ # Required for MTOP_Intent dataset
--split None # default: all splits
After getting the embeddings, you can train CSR with `train_CSR_models.py`. You need to customize `--task_name`, `--pretrained_emb`, and `--gpu`. Note: you can also use query-corpus pre-trained embedding pairs to compute the contrastive loss for better results.
python train_CSR_models.py \
--task_name $TASK_NAME \ # the dataset you want to train CSR on.
--pretrained_emb /path/to/your/embeddings \ # path to embeddings for training
--gpu $DEVICE \ # the device to train CSR
--batch-size 128 \
--lr 4e-5 \
--topk 32 \
--auxk 1024 \
--hidden-size 16384 \
--embed_dim 4096
You can get CSR embeddings with `text_retrieval_csr_inference.py`. You must customize `--queries_emb_path` (path to query embeddings), `--corpus_emb_path` (path to corpus embeddings), and `--csr_ckpt` (path to the CSR model).
python ./text_retrieval_csr_inference.py \
--corpus_emb_path /path/to/corpus/embeddings \
--queries_emb_path /path/to/queries/embeddings \
--topk 32 \
--hidden-size 16384 \
--csr_ckpt /path/to/CSR/model
We use NDCG@10 (%) as the text retrieval evaluation metric. The cosine similarity of the embeddings of each corpus-query pair is calculated; for each query, the top ten corpus documents are selected for the NDCG@10 calculation. You need to customize `--corpus_embed_path` (path to corpus embeddings), `--queries_embed_path` (path to query embeddings), `--corpus_jsonl` (in the original dataset, to get the id of each corpus document), `--queries_jsonl` (in the original dataset, to get the id of each query), and `--qrels_path` (in the original dataset, to get the relevance score of each query-corpus pair).
python evaluation_on_textretrieval.py \
--corpus_embed_path "Path/to/corpus/embeddings" \ # Path to corpus embeddings
--queries_embed_path "Path/to/queries/embeddings" \ # Path to queries embeddings
--corpus_jsonl "Path/to/corpus/JSONL" \ # Path to corpus JSONL file
--queries_jsonl "Path/to/queries/JSONL" \ # Path to queries JSONL file
--qrels_path "Path/to/qrels/TSV" \ # Path to qrels TSV file
--k 10 # Evaluate NDCG@k
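For reference, a minimal sketch of NDCG@10 over cosine similarity following the protocol described above; the dense relevance matrix here is an illustrative simplification of the qrels format.

```python
import numpy as np

def ndcg_at_10(query_emb, corpus_emb, relevance, k=10):
    """query_emb: (Q, D), corpus_emb: (C, D), relevance: (Q, C) graded labels."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    c = corpus_emb / np.linalg.norm(corpus_emb, axis=1, keepdims=True)
    sims = q @ c.T
    discounts = 1.0 / np.log2(np.arange(2, k + 2))
    scores = []
    for i in range(len(q)):
        top = np.argsort(-sims[i])[:k]           # ten most similar documents
        dcg = ((2 ** relevance[i, top] - 1) * discounts).sum()
        ideal = np.sort(relevance[i])[::-1][:k]  # best possible ordering
        idcg = ((2 ** ideal - 1) * discounts).sum()
        scores.append(dcg / idcg if idcg > 0 else 0.0)
    return 100.0 * float(np.mean(scores))
```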
We evaluate on multimodal retrieval tasks for multimodal representation comparison.
cd ./Multi-model/
You can use the following script to train CSR on CC3M. You only need to customize the `$DEVICE` you would like to train on and `--train-data` (path to your training data).
CUDA_VISIBLE_DEVICES=$DEVICE torchrun --nproc_per_node=1 --rdzv-endpoint localhost:29400 ./multimodel_training.py\
--train-data "./cc3m-wds/cc3m-train-{0000..575}.tar" \
--train-num-samples 2905954 \
--dataset-type webdataset \
--precision amp \
--workers 16 \
--model "ViT-B-16"\
--pretrained "dfn2b" \
--epochs 5 \
--report-to tensorboard \
--csr-topk 64 \
--csr-auxk 1024 \
--csr-cl_coef 1 \
--csr-hidden-size 2048 \
--csr-input-dim 512 \
--batch-size 1024 \
--lr 4e-4 \
--wd 1e-4 \
--grad-clip-norm 1.0 \
--save-frequency 1
You can evaluate the trained CSR with the following script. You need to decide on `--dataset`, `--output` (path to store the evaluation results), and `--csr_ckpt` (path to the CSR model).
python main_multimodal.py eval \
--dataset "flickr30k_or_MSCOCO" \
--task "zeroshot_retrieval" \
--pretrained "dfn2b" \
--model "ViT-B-16" \
--output "Path/to/JSON/file" \
--batch_size 64 \
--dataset_root "./flickr30k_dataset" \
--recall_k 5 \
--csr_ckpt "Path/to/CSR" \
--topk 256 \
--hidden-size 2048 \
--rep_dim 1024
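As a rough reference for the zero-shot retrieval metric, here is a minimal sketch of image-to-text Recall@K with cosine similarity. It assumes one matching caption per image, whereas Flickr30k and MS COCO pair each image with several captions, so the repo's evaluation handles that case more carefully.

```python
import numpy as np

def recall_at_k(image_emb, text_emb, k=5):
    """Image-to-text retrieval: a hit if the paired caption (same index)
    appears among the top-k most similar texts for that image."""
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sims = img @ txt.T                        # image-to-text similarity matrix
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = (topk == np.arange(len(img))[:, None]).any(axis=1)
    return 100.0 * float(hits.mean())
```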
If you find this work useful, please cite the accompanying paper:
@inproceedings{wen2025matryoshkarevisitingsparsecoding,
title={Beyond Matryoshka: Revisiting Sparse Coding for Adaptive Representation},
author={Tiansheng Wen and Yifei Wang and Zequn Zeng and Zhong Peng and Yudi Su and Xinyang Liu and Bo Chen and Hongwei Liu and Stefanie Jegelka and Chenyu You},
booktitle={International Conference on Machine Learning},
year={2025}
}
This repository was built on top of Sparse_AutoEncoder, Torchvision, and Open-clip.