This repo is for the Kaggle WSDM Cup - Multilingual Chatbot Arena competition.
```bash
pip install --upgrade -r requirements.txt
```
```yaml
openai:
  api_key: "YOUR_OPENAI_API_KEY"
  organization: "YOUR_OPENAI_ORGANIZATION"
huggingface:
  token: "YOUR_HF_TOKEN"
```
```bash
export KAGGLE_USERNAME="YOUR_KAGGLE_USERNAME"
export KAGGLE_KEY="YOUR_KAGGLE_API_KEY"
```
Original Competition Dataset

```bash
sudo apt install unzip
kaggle competitions download -c wsdm-cup-multilingual-chatbot-arena
unzip wsdm-cup-multilingual-chatbot-arena.zip
```
Extra Datasets (these also include the original competition dataset)

```bash
sudo apt install unzip
kaggle datasets download -d lizhecheng/kaggle-multilingual-chatbot-arena-datasets
unzip kaggle-multilingual-chatbot-arena-datasets.zip
```
- Customize the parameters in `gemma2-9b-main.sh`.

```bash
cd gemma_cls
chmod +x ./gemma2-9b-main.sh
./gemma2-9b-main.sh
```
- Use the advanced training code based on LMSYS 1.0 and customize the parameters in `train_gemma_cls.yaml`.

```bash
cd gemma_cls_advanced
chmod +x ./gemma_cls.sh
./gemma_cls.sh
```
- Customize the parameters in `train.sh`.

```bash
cd mDeBERTa
chmod +x ./train.sh
./train.sh
```

(Note: The performance of `microsoft/mdeberta-v3-base` is suboptimal, with a CV score of approximately 0.645.)
- Customize the parameters in `data.sh`.

```bash
cd openai_finetune
chmod +x ./data.sh
python data.py
python calculate.py
python finetune.py
```
(Note: Remember your file and job IDs for later use.)
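For reference, here is a minimal sketch of the upload-and-launch step that a script like `finetune.py` performs, assuming the OpenAI Python client (v1+); the training file name and base model are placeholders, not the repo's exact settings:

```python
# Hypothetical sketch; "train.jsonl" and the base model name are
# assumptions, not the repository's exact configuration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the prepared training file and keep its ID.
train_file = client.files.create(
    file=open("train.jsonl", "rb"),
    purpose="fine-tune",
)

# Launch the fine-tuning job and keep its ID as well.
job = client.fine_tuning.jobs.create(
    training_file=train_file.id,
    model="gpt-3.5-turbo",
)

# These IDs are the ones needed later by test.py and compare.py.
print("file id:", train_file.id)
print("job id:", job.id)
```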
- Set the `fine_tune_job_id` and change the prompt in `test.py`.

```bash
python test.py
```
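A hypothetical sketch of the logic `test.py` needs (not the repo's actual code): resolve the fine-tuned model name from the job ID, then send one judging request. The prompt text is an illustrative placeholder.

```python
# Hypothetical sketch; the judging prompt is a placeholder.
from openai import OpenAI

client = OpenAI()
fine_tune_job_id = "ftjob-..."  # the job ID printed by finetune.py

job = client.fine_tuning.jobs.retrieve(fine_tune_job_id)
model_name = job.fine_tuned_model  # stays None until the job has succeeded

response = client.chat.completions.create(
    model=model_name,
    messages=[
        {"role": "system", "content": "Judge which response is better: A or B."},
        {"role": "user", "content": "Prompt: ...\nResponse A: ...\nResponse B: ..."},
    ],
)
print(response.choices[0].message.content)
```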
- Set your validation file path and two different model names in `compare.py`.

```bash
python compare.py
```
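And a hedged sketch of how `compare.py` might score two model names on the same validation file; the JSONL field names (`prompt`, `response_a`, `response_b`, `winner`) are assumptions:

```python
# Hypothetical sketch; the validation path, model names, and JSONL
# schema are assumptions, not the repository's exact code.
import json
from openai import OpenAI

client = OpenAI()
validation_path = "valid.jsonl"  # assumed path
model_names = ["ft:gpt-3.5-turbo:...:a", "ft:gpt-3.5-turbo:...:b"]  # placeholders

def judge(model: str, row: dict) -> str:
    out = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Answer with exactly 'A' or 'B'."},
            {"role": "user", "content": (
                f"Prompt: {row['prompt']}\n"
                f"Response A: {row['response_a']}\n"
                f"Response B: {row['response_b']}"
            )},
        ],
    )
    return out.choices[0].message.content.strip()

with open(validation_path) as f:
    rows = [json.loads(line) for line in f]

for model in model_names:
    correct = sum(judge(model, r) == r["winner"] for r in rows)
    print(f"{model}: accuracy = {correct / len(rows):.4f}")
```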
First, I sincerely want to thank Kaggle and Chatbot Arena for organizing this meaningful competition. I also want to extend my deepest gratitude to my teammates, @pingfan, @xuanmingzhang777, @xiaoqinglong1996, and @tonyarobertson, with whom I've competed on Kaggle for almost a year. Unfortunately, we faced unexpected situations in previous competitions, leading to leaderboard shake-downs or narrowly missed gold medals. Today, however, five Kaggle Competition Experts have become Competition Masters together. This is truly one of the most incredible moments of our journey, and I'm thrilled to share this success with my teammates!
- **Pseudo Labeling:**
  - Leveraged data generated by the 3rd-place team of LMSYS.
  - Sampled prompts from a 1M dataset and used APIs to generate responses.
  - Incorporated open-source DPO data (e.g., RLHFlow), mixing them to generate pseudo labels.
- **Distillation:**
  - Distilled Llama3.3-70B and Qwen2.5-72B models into Gemma2-9B and Qwen2.5-14B.
  - Trained at a maximum sequence length of 2500 tokens, using 4-bit quantization.
- **Multilingual Strategy:**
  - Multilingual performance was not a primary focus, as Gemma and Qwen are already among the most powerful multilingual models.
  - Prioritized the top five main languages, especially English, as we found English accuracy to be suboptimal.
Pseudo labeling played a crucial role in our approach. By effectively utilizing pseudo-labeled data, we achieved an LB score above 0.693, even without training directly on the competition dataset.
We aggregated multiple data sources, filtering out short responses to obtain approximately 560K samples:
- Data generated by the 3rd-place LMSYS team (special thanks to @conjuring92).
- Prompts sampled from a 1M dataset, with API-generated responses from various models.
- Around 10 open-source DPO datasets (e.g., RLHFlow).
We processed these datasets to generate high-quality pseudo labels.
To ensure label accuracy and minimize data leakage, we experimented with two approaches for our judging model:
- Baseline Method: Fine-tuned Gemma2-9B on competition data.
- Enhanced Method: Fine-tuned Llama3.3-70B and Qwen2.5-72B on competition data.
While the enhanced method showed minor improvements, it required significantly longer inference times. Ultimately, retraining Gemma2-9B with pseudo-labeled data yielded comparable results across KL Divergence Loss and Cross-Entropy Loss.
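As a rough illustration of the pseudo-labeling step, here is a minimal sketch of scoring an unlabeled prompt/response pair with the retrained judge to obtain soft labels; the checkpoint path, input template, and class layout are assumptions, not the team's exact pipeline:

```python
# Illustrative sketch; model path, prompt template, and label layout
# are assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_path = "output/gemma2-9b-judge"  # hypothetical checkpoint path
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

@torch.no_grad()
def pseudo_label(prompt: str, response_a: str, response_b: str) -> torch.Tensor:
    text = f"Prompt: {prompt}\nResponse A: {response_a}\nResponse B: {response_b}"
    inputs = tokenizer(text, truncation=True, max_length=2500,
                       return_tensors="pt").to(model.device)
    logits = model(**inputs).logits.squeeze(0)
    # Soft pseudo label: the judge's probability over the winner classes.
    return torch.softmax(logits.float(), dim=-1).cpu()
```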
- Teacher Models: Qwen2.5-72B, Llama3.3-70B
- Training Data: WSDM + LMSYS
- Loss Functions: KL Divergence Loss, Cross-Entropy Loss, and an equally weighted average of the two (a sketch of this objective follows the training setup below)
We conducted extensive post-training for distillation. While the distillation process had no significant effect on the Qwen2.5-14B model, it demonstrated measurable improvements on Gemma2-9B.
- Direct LoRA training on 4-bit quantized models for both Qwen2.5-14B and Gemma2-9B.
- Max sequence length: 2500 — extending this to 3072 did not yield noticeable benefits, and training time constraints discouraged further increases.
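A minimal sketch of the equally weighted KL + CE objective described above; the 0.5/0.5 weighting follows the write-up, while the temperature knob is an assumption the write-up does not specify:

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits: torch.Tensor,
                 teacher_logits: torch.Tensor,
                 hard_labels: torch.Tensor,
                 temperature: float = 1.0) -> torch.Tensor:
    """Equally weighted KL + CE; temperature is an assumed knob, not from the write-up."""
    # KL divergence between the teacher's and student's soft distributions.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # Standard cross-entropy against the hard labels.
    ce = F.cross_entropy(student_logits, hard_labels)
    return 0.5 * kl + 0.5 * ce
```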
- Primary model: Qwen2.5-14B
- Supporting model: Gemma2-9B
- Inference mechanism:
  - Used logits from Qwen2.5-14B for primary classification.
  - Deployed Gemma2-9B selectively, prioritizing cases where Qwen2.5-14B struggled with classification (see the routing sketch after this list).
  - Managed inference time efficiently to fully utilize the 12-hour limit.
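To make the routing concrete, here is a hypothetical sketch of the confidence-based selection; the threshold and the blending rule are illustrative, not the team's tuned settings:

```python
# Illustrative sketch; the 0.65 threshold is a hypothetical value.
import torch

def select_hard_cases(qwen_logits: torch.Tensor, threshold: float = 0.65):
    """Return Qwen's probabilities and the row indices to re-score with Gemma2-9B."""
    probs = torch.softmax(qwen_logits, dim=-1)
    hard = (probs.max(dim=-1).values < threshold).nonzero(as_tuple=True)[0]
    return probs, hard

# Usage idea: run Gemma2-9B only on the hard rows, then overwrite (or
# average) those rows' probabilities before the final argmax, sizing the
# second pass so total runtime stays inside the 12-hour limit.
```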
- TTA (Test-Time Augmentation) showed no measurable impact on sequence classification.
- LoRA Merge caused significant performance degradation, an issue that remained unresolved despite debugging.
- Multi-LoRA ensemble approaches failed to improve performance.
- Dynamic token allocation based on response length led to an abnormal length distribution.
- Model selection: Gemma2-9B and Qwen2.5-14B outperformed other models.
- Original labels from DPO datasets were ineffective for post-training but useful for pseudo-labeling.
- Chain-of-Thought prompting strategies did not yield meaningful improvements.
- Dynamic truncation during inference initially provided a +0.003 LB boost, but its effectiveness diminished after updating training code and truncation methods.
- I fine-tuned an mDeBERTa model for multiple-choice tasks using `AutoModelForMultipleChoice`. However, as expected, the model's parameter limitations resulted in poor performance.
- I attempted to use GPT-3.5 as a judge model for pseudo-labeling by fine-tuning it on 10K samples. However, OpenAI's lack of transparency in the training process and sensitivity to parameter adjustments led to unsatisfactory results.
- Contrary to research suggesting that few-shot prompting improves accuracy, this did not hold in our competition setting. I experimented with 2-shot to 32-shot GPT-3.5 configurations, but there was no significant accuracy improvement — possibly due to excessive human intervention.
- Suboptimal cross-validation and leaderboard performance of teacher models — large-parameter model training remains a major challenge; for example, we could not achieve satisfactory results when training Gemma2-27B.
- LoRA merging and post-training quantization issues caused unexpected performance degradation, which remains a critical bottleneck.
Again, thanks to my four amazing teammates: @pingfan, @xuanmingzhang777, @xiaoqinglong1996, @tonyarobertson.