This repo is for the Kaggle WSDM Cup - Multilingual Chatbot Arena competition.
```bash
pip install --upgrade -r requirements.txt
```
```yaml
openai:
  api_key: "YOUR_OPENAI_API_KEY"
  organization: "YOUR_OPENAI_ORGANIZATION"
huggingface:
  token: "YOUR_HF_TOKEN"
```
```bash
export KAGGLE_USERNAME="YOUR_KAGGLE_USERNAME"
export KAGGLE_KEY="YOUR_KAGGLE_API_KEY"
```
Original Competition Dataset

```bash
sudo apt install unzip
kaggle competitions download -c wsdm-cup-multilingual-chatbot-arena
unzip wsdm-cup-multilingual-chatbot-arena.zip
```
Extra Datasets (these also include the original competition dataset)

```bash
sudo apt install unzip
kaggle datasets download -d lizhecheng/kaggle-multilingual-chatbot-arena-datasets
unzip kaggle-multilingual-chatbot-arena-datasets.zip
```
- Customize the parameters in `gemma2-9b-main.sh`.

```bash
cd gemma_cls
chmod +x ./gemma2-9b-main.sh
./gemma2-9b-main.sh
```
- Use the advanced training code based on LMSYS 1.0 and customize the parameters in `train_gemma_cls.yaml`.

```bash
cd gemma_cls_advanced
chmod +x ./gemma_cls.sh
./gemma_cls.sh
```
- Customize the parameters in `train.sh`.

```bash
cd mDeBERTa
chmod +x ./train.sh
./train.sh
```

(Note: The performance of `microsoft/mdeberta-v3-base` is suboptimal, with a CV score of approximately 0.645.)
- Customize the parameters in `data.sh`.

```bash
cd openai_finetune
chmod +x ./data.sh
python data.py
python calculate.py
python finetune.py
```
(Note: Remember your file and job IDs for later use.)
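For reference, here is a minimal sketch of the upload-and-launch step that a script like `finetune.py` performs, assuming the OpenAI Python client (v1+); the training file name and base model are placeholders, not the repo's exact settings:

```python
# Hypothetical sketch; "train.jsonl" and the base model name are
# assumptions, not the repository's exact configuration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the prepared training file and keep its ID.
train_file = client.files.create(
    file=open("train.jsonl", "rb"),
    purpose="fine-tune",
)

# Launch the fine-tuning job and keep its ID as well.
job = client.fine_tuning.jobs.create(
    training_file=train_file.id,
    model="gpt-3.5-turbo",
)

# These IDs are the ones needed later by test.py and compare.py.
print("file id:", train_file.id)
print("job id:", job.id)
```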
- Set the `fine_tune_job_id` and change the prompt in `test.py`.

```bash
python test.py
```
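A hypothetical sketch of the logic `test.py` needs (not the repo's actual code): resolve the fine-tuned model name from the job ID, then send one judging request. The prompt text is an illustrative placeholder.

```python
# Hypothetical sketch; the judging prompt is a placeholder.
from openai import OpenAI

client = OpenAI()
fine_tune_job_id = "ftjob-..."  # the job ID printed by finetune.py

job = client.fine_tuning.jobs.retrieve(fine_tune_job_id)
model_name = job.fine_tuned_model  # stays None until the job has succeeded

response = client.chat.completions.create(
    model=model_name,
    messages=[
        {"role": "system", "content": "Judge which response is better: A or B."},
        {"role": "user", "content": "Prompt: ...\nResponse A: ...\nResponse B: ..."},
    ],
)
print(response.choices[0].message.content)
```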
- Set your validation file path and two different model names in `compare.py`.

```bash
python compare.py
```
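And a hedged sketch of how `compare.py` might score two model names on the same validation file; the JSONL field names (`prompt`, `response_a`, `response_b`, `winner`) are assumptions:

```python
# Hypothetical sketch; the validation path, model names, and JSONL
# schema are assumptions, not the repository's exact code.
import json
from openai import OpenAI

client = OpenAI()
validation_path = "valid.jsonl"  # assumed path
model_names = ["ft:gpt-3.5-turbo:...:a", "ft:gpt-3.5-turbo:...:b"]  # placeholders

def judge(model: str, row: dict) -> str:
    out = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Answer with exactly 'A' or 'B'."},
            {"role": "user", "content": (
                f"Prompt: {row['prompt']}\n"
                f"Response A: {row['response_a']}\n"
                f"Response B: {row['response_b']}"
            )},
        ],
    )
    return out.choices[0].message.content.strip()

with open(validation_path) as f:
    rows = [json.loads(line) for line in f]

for model in model_names:
    correct = sum(judge(model, r) == r["winner"] for r in rows)
    print(f"{model}: accuracy = {correct / len(rows):.4f}")
```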
First, I sincerely want to thank Kaggle and Chatbot Arena for organizing this meaningful competition. I also want to extend my deepest gratitude to my teammates, @pingfan, @xuanmingzhang777, @xiaoqinglong1996, and @tonyarobertson, with whom I've competed on Kaggle for almost a year. Unfortunately, we faced unexpected situations in previous competitions, leading to leaderboard shake-downs or narrowly missed gold medals. Today, however, five Kaggle Competition Experts have become Competition Masters together. This is truly one of the most incredible moments of our journey, and I'm thrilled to share this success with my teammates!
- **Pseudo Labeling:**
  - Leveraged data generated by the 3rd-place team of LMSYS.
  - Sampled prompts from a 1M dataset and used APIs to generate responses.
  - Incorporated open-source DPO data (e.g., RLHFlow), mixing them to generate pseudo labels.
- **Distillation:**
  - Distilled Llama3.3-70B and Qwen2.5-72B models into Gemma2-9B and Qwen2.5-14B.
  - Trained at a maximum sequence length of 2500 tokens, using 4-bit quantization.
- **Multilingual Strategy:**
  - Multilingual performance was not a primary focus, as Gemma and Qwen are already among the most powerful multilingual models.
  - Prioritized the top five main languages, especially English, as we found English accuracy to be suboptimal.
Pseudo labeling played a crucial role in our approach. By effectively utilizing pseudo-labeled data, we achieved an LB score above 0.693, even without training directly on the competition dataset.
We aggregated multiple data sources, filtering out short responses to obtain approximately 560K samples:
- Data generated by the 3rd-place LMSYS team (special thanks to @conjuring92).
- Prompts sampled from a 1M dataset, with API-generated responses from various models.
- Around 10 open-source DPO datasets (e.g., RLHFlow).
We processed these datasets to generate high-quality pseudo labels.
To ensure label accuracy and minimize data leakage, we experimented with two approaches for our judging model:
- Baseline Method: Fine-tuned Gemma2-9B on competition data.
- Enhanced Method: Fine-tuned Llama3.3-70B and Qwen2.5-72B on competition data.
While the enhanced method showed minor improvements, it required significantly longer inference times. Ultimately, retraining Gemma2-9B with pseudo-labeled data yielded comparable results across KL Divergence Loss and Cross-Entropy Loss.
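As a rough illustration of the pseudo-labeling step, here is a minimal sketch of scoring an unlabeled prompt/response pair with the retrained judge to obtain soft labels; the checkpoint path, input template, and class layout are assumptions, not the team's exact pipeline:

```python
# Illustrative sketch; model path, prompt template, and label layout
# are assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_path = "output/gemma2-9b-judge"  # hypothetical checkpoint path
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

@torch.no_grad()
def pseudo_label(prompt: str, response_a: str, response_b: str) -> torch.Tensor:
    text = f"Prompt: {prompt}\nResponse A: {response_a}\nResponse B: {response_b}"
    inputs = tokenizer(text, truncation=True, max_length=2500,
                       return_tensors="pt").to(model.device)
    logits = model(**inputs).logits.squeeze(0)
    # Soft pseudo label: the judge's probability over the winner classes.
    return torch.softmax(logits.float(), dim=-1).cpu()
```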
- Teacher Models: Qwen2.5-72B, Llama3.3-70B
- Training Data: WSDM + LMSYS
- Loss Functions: KL Divergence Loss, Cross-Entropy Loss, and an equally weighted average of the two (a sketch of this objective follows the training setup below)
We conducted extensive post-training for distillation. While the distillation process had no significant effect on the Qwen2.5-14B model, it demonstrated measurable improvements on Gemma2-9B.
- Direct LoRA training on 4-bit quantized models for both Qwen2.5-14B and Gemma2-9B.
- Max sequence length: 2500 — extending this to 3072 did not yield noticeable benefits, and training time constraints discouraged further increases.
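A minimal sketch of the equally weighted KL + CE objective described above; the 0.5/0.5 weighting follows the write-up, while the temperature knob is an assumption the write-up does not specify:

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits: torch.Tensor,
                 teacher_logits: torch.Tensor,
                 hard_labels: torch.Tensor,
                 temperature: float = 1.0) -> torch.Tensor:
    """Equally weighted KL + CE; temperature is an assumed knob, not from the write-up."""
    # KL divergence between the teacher's and student's soft distributions.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # Standard cross-entropy against the hard labels.
    ce = F.cross_entropy(student_logits, hard_labels)
    return 0.5 * kl + 0.5 * ce
```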
- Primary model: Qwen2.5-14B
- Supporting model: Gemma2-9B
- Inference mechanism:
  - Used logits from Qwen2.5-14B for primary classification.
  - Deployed Gemma2-9B selectively, prioritizing cases where Qwen2.5-14B struggled with classification (see the routing sketch after this list).
  - Managed inference time efficiently to fully utilize the 12-hour limit.
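To make the routing concrete, here is a hypothetical sketch of the confidence-based selection; the threshold and the blending rule are illustrative, not the team's tuned settings:

```python
# Illustrative sketch; the 0.65 threshold is a hypothetical value.
import torch

def select_hard_cases(qwen_logits: torch.Tensor, threshold: float = 0.65):
    """Return Qwen's probabilities and the row indices to re-score with Gemma2-9B."""
    probs = torch.softmax(qwen_logits, dim=-1)
    hard = (probs.max(dim=-1).values < threshold).nonzero(as_tuple=True)[0]
    return probs, hard

# Usage idea: run Gemma2-9B only on the hard rows, then overwrite (or
# average) those rows' probabilities before the final argmax, sizing the
# second pass so total runtime stays inside the 12-hour limit.
```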
- TTA (Test-Time Augmentation) showed no measurable impact on sequence classification.
- LoRA Merge caused significant performance degradation, an issue that remained unresolved despite debugging.
- Multi-LoRA ensemble approaches failed to improve performance.
- Dynamic token allocation based on response length led to an abnormal length distribution.
- Model selection: Gemma2-9B and Qwen2.5-14B outperformed other models.
- Original labels from DPO datasets were ineffective for post-training but useful for pseudo-labeling.
- Chain-of-Thought prompting strategies did not yield meaningful improvements.
- Dynamic truncation during inference initially provided a +0.003 LB boost, but its effectiveness diminished after updating training code and truncation methods.
- I fine-tuned an mDeBERTa model for multiple-choice tasks using `AutoModelForMultipleChoice`. However, as expected, the model's parameter limitations resulted in poor performance.
- I attempted to use GPT-3.5 as a judge model for pseudo-labeling by fine-tuning it on 10K samples. However, OpenAI's lack of transparency in the training process and sensitivity to parameter adjustments led to unsatisfactory results.
- Contrary to research suggesting that few-shot prompting improves accuracy, this did not hold in our competition setting. I experimented with 2-shot to 32-shot GPT-3.5 configurations, but there was no significant accuracy improvement — possibly due to excessive human intervention.
- Suboptimal cross-validation and leaderboard performance of teacher models — large-parameter model training remains a major challenge; for example, we could not achieve satisfactory results when training Gemma2-27B.
- LoRA merging and post-training quantization issues caused unexpected performance degradation, which remains a critical bottleneck.
Again, thanks to my four amazing teammates: @pingfan, @xuanmingzhang777, @xiaoqinglong1996, @tonyarobertson.