1st Place Solution for the Conversational Multi-Doc QA Workshop & International Challenge @ WSDM'24 - Xiaohongshu Inc.
This repo contains the source code of our winning solution for WSDM Cup 2024: Conversational Multi-Doc QA.
Please refer to our paper (accepted to Findings of ACL 2025) for details:
- SOLAR-10.7B-Instruct backbone
- Hybrid Training
- Noisy Document Filter
- Model Ensemble
- Follow the Installation guide of modelscope/swift to install swift.
- Install vllm
- Install deepspeed
- Install sklearn
- Install SentenceTransformers
Or you can run the following (tested on a V100 32G with CUDA 11.8, Ubuntu 20.04.1):

```bash
conda create -n swift python=3.10
conda activate swift
pip install ms-swift[all] -U
pip install vllm==0.3.1
pip install deepspeed
pip install scikit-learn
pip install sentence_transformers
```
Main package versions:

```
python==3.10.13
ms-swift==1.6.1
scikit-learn==1.4.1.post1
sentence-transformers==2.3.1
torch==2.1.2
transformers==4.37.2
vllm==0.3.1
```
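As a quick sanity check that your environment matches these versions, a small helper like the one below (a convenience snippet, not part of the repo) prints the installed versions of the key packages:

```python
# Print the installed versions of the key packages so they can be
# compared against the version list above.
import importlib.metadata as md

for pkg in ("ms-swift", "vllm", "deepspeed", "scikit-learn",
            "sentence-transformers", "torch", "transformers"):
    try:
        print(f"{pkg}=={md.version(pkg)}")
    except md.PackageNotFoundError:
        print(f"{pkg} is not installed")
```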
- `preprocess/data_format.py`: format the data required for training and evaluation
- `preprocess/data_format_Pseudo.py`: build the data for hybrid training
- `preprocess/score_train_eval(test).py`: calculate scores for the noisy document filter (a minimal sketch of the idea follows this list)
- `preprocess/score_order.py`: interactive code to delete noisy documents
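The scoring scripts use the text embedding model to judge which retrieved documents are likely noise. The snippet below is only a minimal sketch of that idea with an assumed input format and an illustrative threshold; it is not the exact logic of `score_train_eval(test).py`.

```python
# Minimal sketch of a noisy-document filter: embed the query and each candidate
# document, then drop documents whose cosine similarity to the query is low.
# The threshold and input format are illustrative assumptions only.
from sentence_transformers import SentenceTransformer, util

# nomic-embed-text-v1 ships custom modeling code, hence trust_remote_code=True.
model = SentenceTransformer("pretrained/nomic-ai/nomic-embed-text-v1", trust_remote_code=True)

def filter_noisy_documents(query: str, documents: list[str], threshold: float = 0.5) -> list[str]:
    # nomic recommends task prefixes such as "search_query: " / "search_document: ".
    query_emb = model.encode("search_query: " + query, convert_to_tensor=True)
    doc_embs = model.encode(["search_document: " + d for d in documents], convert_to_tensor=True)
    scores = util.cos_sim(query_emb, doc_embs)[0]  # one similarity score per document
    return [doc for doc, s in zip(documents, scores) if s.item() >= threshold]

kept = filter_noisy_documents("How does hybrid training work?", ["relevant doc ...", "off-topic doc ..."])
```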
We use the ms-swift LLM framework by ModelScope:
- `runsh/solar_instruct_sft_template.sh`: template for supervised fine-tuning
- `runsh/solar_instruct_infer_template.sh`: template for inference
- `merge/calculate_score.py`: calculate scores for ensemble learning
- `merge/merge_score.py`: ensemble the results
- `keyword`: attempt to directly generate keywords or answers with GPT
- `multi_stage`: multi-stage LLM attempt (did not work)
You can find all intermediate files in the `result` folder.
- Download the pretrained models from Hugging Face:
  - upstage/SOLAR-10.7B-Instruct-v1.0 (10.7 B)
  - nomic-ai/nomic-embed-text-v1 (0.14 B)
- Download our 8 fine-tuned LoRA adapters from our Hugging Face repository (0.03 B each).

So our total model size is 10.7 B + 0.14 B + 0.03 B × 8 = 11.08 B parameters, much smaller than 14 billion (14B) parameters.
- Put them in the right folder. The folder should look as follows:
```
├── checkpoints
│   ├── v08-20240205-114459/
│   ├── v10-20240205-114325/
│   ├── v13-20240202-072530/
│   ├── v13-20240206-111010/
│   ├── v16-20240206-224659/
│   ├── v27-20240209-133614/
│   ├── v33-20240210-002918/
│   └── v35-20240210-120550/
└── pretrained
    ├── nomic-ai/nomic-embed-text-v1/
    │   ├── 1_Pooling/
    │   ├── config.json
    │   ├── config_sentence_transformers.json
    │   ├── configuration_hf_nomic_bert.py
    │   ├── .gitattributes
    │   ├── .locks/
    │   ├── modeling_hf_nomic_bert.py
    │   ├── model.safetensors
    │   ├── modules.json
    │   ├── onnx/
    │   ├── pytorch_model.bin
    │   ├── README.md
    │   ├── sentence_bert_config.json
    │   ├── special_tokens_map.json
    │   ├── tokenizer_config.json
    │   ├── tokenizer.json
    │   └── vocab.txt
    └── upstage/SOLAR-10.7B-Instruct-v1.0/
        ├── config.json
        ├── generation_config.json
        ├── .gitattributes
        ├── .locks/
        ├── model-00001-of-00005.safetensors
        ├── model-00002-of-00005.safetensors
        ├── model-00003-of-00005.safetensors
        ├── model-00004-of-00005.safetensors
        ├── model-00005-of-00005.safetensors
        ├── model.safetensors.index.json
        ├── README.md
        ├── solar_logo.png
        ├── tokenizer_config.json
        ├── tokenizer.json
        └── tokenizer.model
```
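For reference, one of the downloaded LoRA adapters can be attached to the SOLAR backbone with plain `transformers` + `peft`, as sketched below. This is only an illustration of the checkpoint layout; the repo itself runs training and inference through the ms-swift shell scripts, and the adapter path shown is just an example (point it at the folder containing `adapter_config.json`).

```python
# Illustration only: load the SOLAR backbone and attach one LoRA adapter with peft
# (peft is pulled in by ms-swift). The repo's own pipeline uses the runsh/ scripts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_path = "pretrained/upstage/SOLAR-10.7B-Instruct-v1.0"
tokenizer = AutoTokenizer.from_pretrained(base_path)
base = AutoModelForCausalLM.from_pretrained(base_path, torch_dtype=torch.float16, device_map="auto")

# Example adapter path; use any of the eight folders under checkpoints/.
model = PeftModel.from_pretrained(base, "checkpoints/v08-20240205-114459")
model.eval()
```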
Run `python data_format.py` to preprocess the original test data. Then run the shell scripts in the `runsh` folder:
```bash
bash runsh/v08-20240205-114459.sh
bash runsh/v10-20240205-114325.sh
bash runsh/v13-20240202-072530.sh
bash runsh/v13-20240206-111010.sh
bash runsh/v16-20240206-224659.sh
bash runsh/v27-20240209-133614.sh
bash runsh/v33-20240210-002918.sh
bash runsh/v35-20240210-120550.sh
```
- You can modify the CUDA device at the beginning of each shell script via `CUDA_VISIBLE_DEVICES=`.
- The result files are saved in the `merge` folder, which should look as follows:
```
└── merge
    ├── v08-20240205-114459.jsonl
    ├── v10-20240205-114325.jsonl
    ├── v13-20240202-072530.jsonl
    ├── v13-20240206-111010.jsonl
    ├── v16-20240206-224659.jsonl
    ├── v27-20240209-133614.jsonl
    ├── v33-20240210-002918.jsonl
    └── v35-20240210-120550.jsonl
```
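If you want to inspect one of these prediction files before ensembling, a small helper like the one below prints the first record; the field names inside each JSON line are not documented here, so it simply shows whatever keys are present.

```python
# Peek at the first record of a prediction file to see its structure.
import json

with open("merge/v08-20240205-114459.jsonl", encoding="utf-8") as f:
    first = json.loads(f.readline())

print(sorted(first.keys()))
print(first)
```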
The evaluation scores of these individual results are as follows:
File | Word-level ROUGE-L | Character-level ROUGE-L | Keywords Recall |
---|---|---|---|
v08-20240205-114459 | 0.45532953438881013 | 0.6143454883849857 | 0.6824189095928223 |
v10-20240205-114325 | 0.456275615214309 | 0.6149276913541135 | 0.6817805383022769 |
v13-20240202-072530 | 0.4554468517276402 | 0.6141346993379754 | 0.6827095609704305 |
v13-20240206-111010 | 0.456388581088847 | 0.6149210447203279 | 0.6840088655306036 |
v16-20240206-224659 | 0.45375515045837794 | 0.613359666771279 | 0.6879538939321544 |
v27-20240209-133614 | 0.45574561117381773 | 0.6145520850027292 | 0.6826942984551678 |
v33-20240210-002918 | 0.4559195951083145 | 0.6141543510329665 | 0.6865596963423041 |
v35-20240210-120550 | 0.45573339341665703 | 0.614208192382808 | 0.6813332802463232 |
Even without ensembling, each of these results is still well ahead of the second-place entry.
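If you want to reproduce comparable numbers locally, the sketch below shows one plausible way to approximate the three metrics with the `rouge-score` package (not among the repo's dependencies); it is not the official challenge scorer, and the keyword recall here is a simple substring-match approximation.

```python
# Rough local approximations of the challenge metrics; not the official scorer.
# Requires: pip install rouge-score
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)

def word_level_rouge_l(reference: str, prediction: str) -> float:
    return scorer.score(reference, prediction)["rougeL"].fmeasure

def char_level_rouge_l(reference: str, prediction: str) -> float:
    # Treat every character as a token by inserting spaces before scoring.
    return scorer.score(" ".join(reference), " ".join(prediction))["rougeL"].fmeasure

def keyword_recall(keywords: list[str], prediction: str) -> float:
    # Fraction of gold keywords that literally appear in the prediction.
    hits = sum(kw.lower() in prediction.lower() for kw in keywords)
    return hits / len(keywords) if keywords else 0.0
```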
First, calculate the embedding scores by running `python calculate_score.py`. Note that this program is accelerated with `torch.multiprocessing`; you can modify the number of processes near `num_group = 16`. (It works well on a V100 32G.)
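For intuition, a stripped-down version of that multiprocessing pattern is sketched below: the items are split into `num_group` shards and each worker process embeds its own shard. The data layout and worker logic are simplified assumptions, not the actual contents of `calculate_score.py`.

```python
# Simplified sketch of parallel embedding with torch.multiprocessing.
# The real calculate_score.py differs; this only shows the sharding pattern.
import torch.multiprocessing as mp
from sentence_transformers import SentenceTransformer

def worker(rank, shards, return_dict):
    # Each process loads its own copy of the embedding model and encodes one shard.
    model = SentenceTransformer("pretrained/nomic-ai/nomic-embed-text-v1", trust_remote_code=True)
    return_dict[rank] = model.encode(shards[rank])

if __name__ == "__main__":
    texts = ["candidate answer 1", "candidate answer 2", "candidate answer 3"]  # placeholder data
    num_group = 2  # the repo's script uses num_group = 16

    chunk = (len(texts) + num_group - 1) // num_group
    shards = [texts[i * chunk:(i + 1) * chunk] for i in range(num_group)]

    manager = mp.Manager()
    return_dict = manager.dict()
    mp.spawn(worker, args=(shards, return_dict), nprocs=num_group, join=True)

    # Reassemble embeddings in the original order (shards are contiguous chunks).
    embeddings = [emb for rank in range(num_group) for emb in return_dict[rank]]
```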
Then generate the final result with `python merge_score.py`. It will produce `emb_a_s_8_0_1_2_3_4_5_6_7.zip` in the root folder, which is our final submission.
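Conceptually, the embedding-based ensemble selects, for each question, the candidate answer that agrees most with the answers from the other runs. The snippet below is a simplified stand-in for that selection rule, not the actual `merge_score.py`; the helper name and example data are made up for illustration.

```python
# Simplified illustration of embedding-based answer selection for the ensemble:
# pick the candidate with the highest total cosine similarity to the others.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("pretrained/nomic-ai/nomic-embed-text-v1", trust_remote_code=True)

def pick_consensus_answer(candidates: list[str]) -> str:
    embs = model.encode(candidates, convert_to_tensor=True)
    sim = util.cos_sim(embs, embs)   # pairwise similarity matrix, shape (n, n)
    totals = sim.sum(dim=1)          # how much each candidate agrees with all answers
    return candidates[int(totals.argmax())]

# Example: candidate answers for one question, one from each fine-tuned run.
answers = ["answer from v08 ...", "answer from v10 ...", "answer from v13 ..."]
print(pick_consensus_answer(answers))
```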
Word-level ROUGE-L | Character-level ROUGE-L | Keywords Recall |
---|---|---|
0.465360141853671 | 0.6208371209722543 | 0.6953475871954128 |
If you find our work helpful, please consider citing the following paper:
```bibtex
@inproceedings{li-zhang-2025-leveraging,
    title = "Leveraging Large Language Models for Conversational Multi-Doc Question Answering: The First Place of {WSDM} Cup 2024",
    author = "Li, Yiming and Zhang, Zhao",
    editor = "Che, Wanxiang and Nabende, Joyce and Shutova, Ekaterina and Pilehvar, Mohammad Taher",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-acl.19/",
    pages = "349--355",
    ISBN = "979-8-89176-256-5",
    abstract = "Conversational multi-doc question answering aims to answer specific questions based on the retrieved documents as well as the contextual conversations. In this paper, we introduce our winning approach for the ``Conversational Multi-Doc QA'' challenge in WSDM Cup 2024, which exploits the superior natural language understanding and generation capability of Large Language Models (LLMs). We first adapt LLMs to the task, then devise a hybrid training strategy to make the most of in-domain unlabeled data. Moreover, an advanced text embedding model is adopted to filter out potentially irrelevant documents, and several approaches are designed and compared for the model ensemble. Equipped with all these techniques, our solution finally ranked 1st place in WSDM Cup 2024, surpassing its rivals to a large extent. The source codes have been released at https://github.com/zhangzhao219/WSDM-Cup-2024."
}
```
Zhao Zhang: zhaozhao809@163.com
Yiming Li: eamon.y.li@gmail.com