This repository contains the source code and datasets for our paper "In-context Clustering-based Entity Resolution with Large Language Models: A Design Space Exploration" (SIGMOD 2026).
- Python >= 3.8.0
- scikit-learn 1.3.2
- matplotlib, networkx, tqdm, hydra, numpy, pandas
We use the following datasets in our experiments.
We use the pre-trained models provided by SBERT.
- requirements.txt: the dependencies required to run the code.
- LLMCER.ipynb: the notebook that performs the ER task.
```shell
pip install -r requirements.txt
jupyter notebook
```
As an example, we demonstrate how to use the LLMCER.ipynb file.
To connect to GPT, you need to configure your proxy and set your API key. Open the notebook and locate the following lines:
```python
import os
from openai import OpenAI

# Optional: route requests through a proxy
os.environ["http_proxy"] = "proxy address"
os.environ["https_proxy"] = "proxy address"

client = OpenAI(
    api_key="your api key"
)
```
Replace "proxy address" with your proxy address if needed, and replace "your api key" with your actual OpenAI API key.
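Hardcoding the key in the notebook works but is easy to leak; a safer pattern (our suggestion, not what the notebook ships with — the variable name `OPENAI_API_KEY` is the common convention) reads it from the environment:

```python
import os

# Read the key from the environment instead of hardcoding it in the notebook.
# The "your api key" fallback mirrors the placeholder used in LLMCER.ipynb.
api_key = os.environ.get("OPENAI_API_KEY", "your api key")
if api_key == "your api key":
    print("Warning: OPENAI_API_KEY is not set; edit the notebook placeholder instead.")
```

You can then pass `api_key=api_key` when constructing the client.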
Locate the variables for the data path (`file_path`) and the ground truth path (`gt_path`). Update them to point to your dataset and ground truth files. For example:
```python
file_path = './dataset/cora/'
data_file_path = file_path + 'cora.csv'
gt_path = file_path + 'gt.csv'
```
Replace `file_path` with the directory containing your dataset, and ensure that `data_file_path` and `gt_path` point to the correct files.
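To sanity-check the paths before running the notebook, here is a minimal, self-contained sketch of the expected layout: a records CSV plus a ground-truth CSV of matching id pairs. The column names (`id`, `title`, `lid`, `rid`) and the toy records are illustrative assumptions, not the paper's actual schema.

```python
import csv
import os
import tempfile

# Build a hypothetical miniature dataset in a temp directory.
file_path = tempfile.mkdtemp()
data_file_path = os.path.join(file_path, 'cora.csv')
gt_path = os.path.join(file_path, 'gt.csv')

with open(data_file_path, 'w', newline='') as f:
    w = csv.writer(f)
    w.writerow(['id', 'title'])                      # assumed columns
    w.writerows([[1, 'a survey of entity resolution'],
                 [2, 'A Survey of Entity Resolution'],
                 [3, 'deep learning']])

with open(gt_path, 'w', newline='') as f:
    w = csv.writer(f)
    w.writerow(['lid', 'rid'])                       # assumed pair format
    w.writerow([1, 2])

# Read both files back the way a notebook would consume them.
with open(data_file_path) as f:
    records = list(csv.DictReader(f))
with open(gt_path) as f:
    gt_pairs = {(row['lid'], row['rid']) for row in csv.DictReader(f)}

print(len(records), gt_pairs)  # -> 3 {('1', '2')}
```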
Once the necessary modifications are made:
- Open the Jupyter Notebook file.
- Execute the notebook cells sequentially.
- Wait for the results to be generated.
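The notebook scores its clustering output against the ground truth; as a rough illustration of how ER clusterings are typically evaluated, here is a hedged sketch of pairwise precision/recall/F1. The function name and signatures are ours, not the notebook's.

```python
from itertools import combinations

def pairwise_f1(clusters, gt_pairs):
    """Score predicted clusters against ground-truth matching pairs.

    clusters: list of sets of record ids (predicted entities)
    gt_pairs: set of frozensets, each a true matching pair
    """
    # Every pair of records placed in the same cluster is a predicted match.
    pred_pairs = {frozenset(p) for c in clusters
                  for p in combinations(sorted(c), 2)}
    tp = len(pred_pairs & gt_pairs)
    precision = tp / len(pred_pairs) if pred_pairs else 0.0
    recall = tp / len(gt_pairs) if gt_pairs else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example: records 1, 2, 3 are the same entity; 4 stands alone.
clusters = [{1, 2, 3}, {4}]
gt = {frozenset({1, 2}), frozenset({2, 3}), frozenset({1, 3})}
print(pairwise_f1(clusters, gt))  # -> (1.0, 1.0, 1.0)
```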
If you are interested in our work, please cite our paper. For any code-related questions, contact Haitong Tang (tht@zju.edu.cn). Thanks!
```bibtex
@article{LLMCER-SIGMOD2026,
  author  = {Jiajie Fu and Haitong Tang and Arijit Khan and Sharad Mehrotra and Xiangyu Ke and Yunjun Gao},
  title   = {In-context Clustering-based Entity Resolution with Large Language Models: A Design Space Exploration},
  journal = {Proceedings of the ACM on Management of Data (SIGMOD)},
  year    = {2026}
}
```