This repository provides an unofficial PyTorch implementation of the TIGER model. The original approach is introduced in the NeurIPS ’23 paper “Recommender Systems with Generative Retrieval” by Shashank Rajput, Nikhil Mehta, Anima Singh, Raghunandan H. Keshavan, Trung Vu, Lukasz Heldt, Lichan Hong, Yi Tay, Vinh Q. Tran, Jonah Samost, Maciej Kula, Ed H. Chi, and Maheswaran Sathiamoorthy.
Link to the paper: http://arxiv.org/abs/2305.05065
If you use this code, please cite:
```bibtex
@inproceedings{rajput2023recommender,
  title     = {Recommender Systems with Generative Retrieval},
  author    = {Rajput, Shashank and Mehta, Nikhil and Singh, Anima and Keshavan, Raghunandan H and Vu, Trung and Heldt, Lukasz and Hong, Lichan and Tay, Yi and Tran, Vinh Q and Samost, Jonah and others},
  booktitle = {Advances in Neural Information Processing Systems},
  pages     = {33415--33437},
  year      = {2023}
}
```
This repository includes processed data for the Beauty dataset to run experiments immediately. For other datasets (Sports and Toys), please download the preprocessed data from: https://zenodo.org/records/17351848
The dataset structure includes the following files for each domain:
```
data/
├── Beauty/
│   ├── Beauty_5.json
│   ├── content_embeddings.pkl
│   ├── index_rqkmeans.json
│   ├── index_rqvae.json
│   ├── inter_new.json
│   └── inter.json
├── Sport/
│   ├── content_embeddings.pkl
│   ├── index_rqkmeans.json
│   ├── index_rqvae.json
│   ├── inter_new.json
│   ├── inter.json
│   └── Sports_and_Outdoors_5.json
└── Toys/
    ├── content_embeddings.pkl
    ├── index_rqkmeans.json
    ├── index_rqvae.json
    ├── inter.json
    └── Toys_and_Games_5.json
```
Metadata JSON files are not included; precomputed content embeddings are provided for all datasets instead.
If you want to work with raw data from the Amazon Review dataset, you can download it from the official source: https://jmcauley.ucsd.edu/data/amazon/. All data processing and preparation scripts are available in the notebooks folder.
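To sanity-check the data, the provided files can be loaded directly. The snippet below is a minimal sketch; the exact contents of each file (key and value layout) are assumptions inferred from the file names, not a documented schema.

```python
# Minimal inspection sketch; the structure of each file is assumed, not documented.
import json
import pickle

with open("data/Beauty/content_embeddings.pkl", "rb") as f:
    content_embeddings = pickle.load(f)   # assumed: per-item content embedding vectors

with open("data/Beauty/index_rqvae.json") as f:
    semantic_ids = json.load(f)           # assumed: item -> RQ-VAE code sequence (semantic ID)

with open("data/Beauty/inter.json") as f:
    interactions = json.load(f)           # assumed: user -> chronologically ordered item list

print(type(content_embeddings), len(semantic_ids), len(interactions))
```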
To run our implementation, install the following packages (tested versions in parentheses):
```bash
pip install murmurhash    # (==1.0.13)
pip install numpy         # (==2.1.2)
pip install tensorboard   # (==2.20.0)
pip install torch         # (==2.6.0+cu126)
pip install transformers  # (==4.50.3)
```

Important: Execute all commands from the `tiger/tiger` directory (not from the project root).
Training SASRec:
```bash
python ./train_sasrec.py --params ./configs/sasrec_train_config.json
```

Training TIGER:

```bash
python ./train_tiger.py --params ./configs/tiger_train_config.json
```

Because there is no official TIGER reference implementation, we compare our results on Beauty against those reported in "Learnable Item Tokenization for Generative Recommendation". Results for Yelp and Instruments will be added later. It is also noticeable that SASRec underperforms on this dataset; we are currently investigating why.
Following the original setup, we report NDCG (N) and Hit Rate (H) at the @5, @10, and additionally @20 cutoffs (a minimal sketch of both metrics is given right after the table).
| Model | Dataset | N@5 | N@10 | N@20 | H@5 | H@10 | H@20 |
|---|---|---|---|---|---|---|---|
| SASRec | Beauty | 0.02087 | 0.02718 | 0.03447 | 0.03197 | 0.051647 | 0.08071 |
| TIGER | Beauty | 0.02524 | 0.03191 | 0.03940 | 0.03756 | 0.05822 | 0.08800 |
| SASRec | Sports | 0.01217 | 0.01550 | 0.01952 | 0.01806 | 0.02846 | 0.04444 |
| TIGER | Sports | 0.01484 | 0.01921 | 0.02358 | 0.02315 | 0.03680 | 0.05868 |
| SASRec | Toys | 0.02274 | 0.02831 | 0.03436 | 0.03462 | 0.05202 | 0.07624 |
| TIGER | Toys | 0.02359 | 0.02917 | 0.03531 | 0.03488 | 0.05224 | 0.07696 |
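The sketch below shows how these two metrics are conventionally computed when each user has a single held-out ground-truth item; the function names are illustrative, and the actual evaluation code lives in the training scripts.

```python
# Hit Rate@K and NDCG@K for a single ground-truth item per user (illustrative).
import math

def hit_rate_at_k(ranked_items, target, k):
    """1.0 if the target item appears among the top-K predictions, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0

def ndcg_at_k(ranked_items, target, k):
    """With one relevant item, NDCG@K reduces to 1/log2(rank + 1) for a 1-based rank."""
    if target in ranked_items[:k]:
        position = ranked_items[:k].index(target)   # 0-based position
        return 1.0 / math.log2(position + 2)
    return 0.0

# Example: the target is ranked 3rd, so H@5 = 1.0 and N@5 = 1/log2(4) = 0.5.
print(hit_rate_at_k([7, 42, 13, 5], 13, 5), ndcg_at_k([7, 42, 13, 5], 13, 5))
```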
The original paper states “Overall, the model has around 13 million parameters.” Our implementation currently has approximately 2.5× fewer parameters (~5M vs. ~13M), which is the primary architectural discrepancy. It remains to be determined which components are under‑parameterized in our implementation.
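For anyone reproducing this comparison, trainable parameters of a PyTorch module can be counted as below; `model` stands for whichever TIGER module train_tiger.py instantiates, and the helper name is ours.

```python
# Count trainable parameters of an arbitrary nn.Module (helper name is illustrative).
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Example usage: print(f"{count_parameters(model) / 1e6:.2f}M trainable parameters")
```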
Our implementation also differs from the original in a few other ways:
- Constant learning rate: a fixed LR is used for the entire training run; no scheduler is applied. With an initial LR of 0.01 and inverse square-root decay, we observed instability during the first 10k steps and worse convergence than with a fixed LR (a sketch of the two policies is given after this list).
- Early stopping: training stops if the validation metric (NDCG@20) does not improve for 40 epochs; this consistently yielded the strongest checkpoints in our runs.
- LLaMA-7B inference for content embeddings: we use a LLaMA-7B checkpoint to compute content-based embeddings instead of the T5 checkpoint used in the original work. We plan to evaluate T5 later as well.
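For context, the sketch below contrasts the two learning-rate policies mentioned above; the warmup length, base LR, and helper name are illustrative and not taken from our configs.

```python
# Constant LR vs. inverse square-root decay with warmup (illustrative values).
import torch
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(8, 8)  # stand-in for the actual TIGER model
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# Option A (what we use): a constant LR, i.e. no scheduler at all.

# Option B (what we compared against): inverse square-root decay with linear warmup.
warmup_steps = 1000

def inv_sqrt(step: int) -> float:
    step = max(step, 1)
    if step < warmup_steps:
        return step / warmup_steps          # linear warmup
    return (warmup_steps / step) ** 0.5     # inverse square-root decay

scheduler = LambdaLR(optimizer, lr_lambda=inv_sqrt)
# In the training loop: optimizer.step(); scheduler.step()
```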
We plan to investigate these implementation differences more thoroughly and address the following:
- Add a coverage metric to the evaluation protocol
- Add LETTER and PLUM implementations
- Add more datasets (e.g., Yelp, Instruments, Books)
- Fine-tune hyperparameters to achieve the best performance metrics
- Add a decoder-only variant of the TIGER model
- Conduct ablation studies on the impact of different tokenization methods (RQ-VAE vs RQ-Kmeans)
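As background for that ablation, the sketch below illustrates RQ-KMeans in its simplest form: each level runs k-means on the residuals left by the previous level, so every item receives one code per level (its semantic ID). This is an illustration only (it requires scikit-learn, which is not in the dependency list above) and is not the exact procedure used to produce index_rqkmeans.json.

```python
# Simplified residual quantization with k-means (RQ-KMeans), for illustration only.
import numpy as np
from sklearn.cluster import KMeans

def rq_kmeans_codes(embeddings: np.ndarray, n_levels: int = 3, codebook_size: int = 256) -> np.ndarray:
    residual = embeddings.copy()
    codes = np.zeros((len(embeddings), n_levels), dtype=np.int64)
    for level in range(n_levels):
        km = KMeans(n_clusters=codebook_size, n_init=10, random_state=0).fit(residual)
        codes[:, level] = km.labels_
        residual = residual - km.cluster_centers_[km.labels_]   # quantize, keep the residual
    return codes  # shape (num_items, n_levels): one code per level per item

# Example with random vectors standing in for the real content embeddings:
codes = rq_kmeans_codes(np.random.randn(1000, 64), codebook_size=32)
print(codes[:3])
```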