This repository contains the source code and datasets for our paper "In-context Clustering-based Entity Resolution with Large Language Models: A Design Space Exploration" (SIGMOD 2026).
- Python >= 3.8.0
- scikit-learn 1.3.2
- matplotlib, networkx, tqdm, hydra, numpy, pandas
We use the following datasets in our experiments.
We use the pre-trained models provided by SBERT.
- requirements.txt: the dependencies required to run the code.
- LLMCER.ipynb: the notebook that performs the ER task.
```shell
pip install -r requirements.txt
jupyter notebook
```
As an example, we demonstrate how to use the LLMCER.ipynb file.
To connect to GPT, you need to configure your proxy and set your API key. Open the notebook and locate the following lines:
```python
import os
from openai import OpenAI

# Optional: route requests through a proxy
os.environ["http_proxy"] = "proxy address"
os.environ["https_proxy"] = "proxy address"

client = OpenAI(
    api_key="your api key"
)
```
Replace "proxy address" with your proxy address if needed, and replace "your api key" with your actual OpenAI API key.
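Hardcoding the key in the notebook works but is easy to leak; a safer pattern (our suggestion, not what the notebook ships with — the variable name `OPENAI_API_KEY` is the common convention) reads it from the environment:

```python
import os

# Read the key from the environment instead of hardcoding it in the notebook.
# The "your api key" fallback mirrors the placeholder used in LLMCER.ipynb.
api_key = os.environ.get("OPENAI_API_KEY", "your api key")
if api_key == "your api key":
    print("Warning: OPENAI_API_KEY is not set; edit the notebook placeholder instead.")
```

You can then pass `api_key=api_key` when constructing the client.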
Locate the variables for the data path (`file_path`) and the ground truth path (`gt_path`). Update them to point to your dataset and ground truth files. For example:
```python
file_path = './dataset/cora/'
data_file_path = file_path + 'cora.csv'
gt_path = file_path + 'gt.csv'
```
Replace `file_path` with the directory containing your dataset, and ensure that `data_file_path` and `gt_path` point to the correct files.
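To sanity-check the paths before running the notebook, here is a minimal, self-contained sketch of the expected layout: a records CSV plus a ground-truth CSV of matching id pairs. The column names (`id`, `title`, `lid`, `rid`) and the toy records are illustrative assumptions, not the paper's actual schema.

```python
import csv
import os
import tempfile

# Build a hypothetical miniature dataset in a temp directory.
file_path = tempfile.mkdtemp()
data_file_path = os.path.join(file_path, 'cora.csv')
gt_path = os.path.join(file_path, 'gt.csv')

with open(data_file_path, 'w', newline='') as f:
    w = csv.writer(f)
    w.writerow(['id', 'title'])                      # assumed columns
    w.writerows([[1, 'a survey of entity resolution'],
                 [2, 'A Survey of Entity Resolution'],
                 [3, 'deep learning']])

with open(gt_path, 'w', newline='') as f:
    w = csv.writer(f)
    w.writerow(['lid', 'rid'])                       # assumed pair format
    w.writerow([1, 2])

# Read both files back the way a notebook would consume them.
with open(data_file_path) as f:
    records = list(csv.DictReader(f))
with open(gt_path) as f:
    gt_pairs = {(row['lid'], row['rid']) for row in csv.DictReader(f)}

print(len(records), gt_pairs)  # -> 3 {('1', '2')}
```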
Once the necessary modifications are made:
- Open the Jupyter Notebook file.
- Execute the notebook cells sequentially.
- Wait for the results to be generated.
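The notebook scores its clustering output against the ground truth; as a rough illustration of how ER clusterings are typically evaluated, here is a hedged sketch of pairwise precision/recall/F1. The function name and signatures are ours, not the notebook's.

```python
from itertools import combinations

def pairwise_f1(clusters, gt_pairs):
    """Score predicted clusters against ground-truth matching pairs.

    clusters: list of sets of record ids (predicted entities)
    gt_pairs: set of frozensets, each a true matching pair
    """
    # Every pair of records placed in the same cluster is a predicted match.
    pred_pairs = {frozenset(p) for c in clusters
                  for p in combinations(sorted(c), 2)}
    tp = len(pred_pairs & gt_pairs)
    precision = tp / len(pred_pairs) if pred_pairs else 0.0
    recall = tp / len(gt_pairs) if gt_pairs else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example: records 1, 2, 3 are the same entity; 4 stands alone.
clusters = [{1, 2, 3}, {4}]
gt = {frozenset({1, 2}), frozenset({2, 3}), frozenset({1, 3})}
print(pairwise_f1(clusters, gt))  # -> (1.0, 1.0, 1.0)
```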
If you are interested in our work, please cite our paper. For any code-related questions, contact Haitong Tang (tht@zju.edu.cn). Thanks!
```bibtex
@article{LLMCER-SIGMOD2026,
  author  = {Jiajie Fu and Haitong Tang and Arijit Khan and Sharad Mehrotra and Xiangyu Ke and Yunjun Gao},
  title   = {In-context Clustering-based Entity Resolution with Large Language Models: A Design Space Exploration},
  journal = {Proceedings of the ACM on Management of Data (SIGMOD)},
  year    = {2026}
}
```