How Well Do LLMs Handle Cantonese? Benchmarking Cantonese Capabilities of Large Language Models


简体中文 | English

📄 Paper • 🏆 Leaderboard • 🤗 Dataset

Introduction

The rapid evolution of large language models (LLMs), such as GPT-X and Llama-X, has driven significant advancements in NLP, yet much of this progress has centered on English and a few other well-resourced languages, leaving languages like Cantonese, spoken by over 85 million people worldwide, underrepresented. Despite the economic importance of Cantonese-speaking regions and communities globally, technological development for Cantonese, particularly in the realm of LLMs, remains limited, with most efforts closed-source and underdeveloped. To address this disparity, we systematically review existing Cantonese NLP technologies, including rumor detection, sentiment analysis, and machine translation, and introduce new benchmarks (Yue-TruthfulQA, Yue-GSM8K, Yue-ARC-C, Yue-MMLU, and Yue-TRANS) to evaluate LLMs' capabilities in Cantonese across various dimensions. These benchmarks, derived from English or Mandarin and manually verified, enable a comprehensive assessment of both Cantonese-specific and general-purpose LLMs. Our analysis of over forty models identifies gaps and potential directions for future research, emphasizing the need for enhanced Cantonese LLM development to meet the linguistic and cultural needs of this significant population.

Files

.
├── README.md
├── README_CN.md
├── data
│   ├── historical_data
│   │   └── 2024-07-20
│   └── latest_data
│       ├── Yue-ARC-C
│       ├── Yue-GSM8K
│       ├── Yue-MMLU
│       ├── Yue-TRANS
│       └── Yue-TruthfulQA
├── fig
│   ├── banner.png
│   └── logo.jpg
├── results&src
│   ├── ARC_c
│   │   ├── ARC-eval
│   │   ├── ARC_c-en
│   │   └── ARC_c-yue
│   ├── CMMLU
│   │   ├── CMMLU-yue
│   │   └── CMMLU-zh
│   ├── GSM8K
│   │   ├── GSM8K-en
│   │   ├── GSM8K-eval
│   │   └── GSM8K-yue
│   ├── Translation
│   │   ├── Infer-Time.xlsx
│   │   ├── evaluation
│   │   └── prediction
│   └── TruthfulQA
│       ├── TruthfulQA-en
│       ├── TruthfulQA-eval
│       └── TruthfulQA-yue
└── script
    ├── arc_example.sh
    ├── gsm8k_example.sh
    ├── translation_example.sh
    └── truthfulqa_example.sh

Leaderboard

The following tables show model performance on the Cantonese benchmarks (Yue-TruthfulQA, Yue-GSM8K, Yue-ARC-C, Yue-CMMLU) under zero-shot and five-shot settings.

Yue-TruthfulQA
Models | 0-shot (correct): Rouge-l / Bleu-4 / BERTScore | 5-shot (correct): Rouge-l / Bleu-4 / BERTScore
Qwen-7b 6.42 3.99 51.57 4.04 2.98 49.7
Qwen-1.5-7b 20.54 13.41 66.45 12.45 10.41 61.59
Qwen-1.5-110b 26.04 15.95 69.29 31.73 19.53 70.87
Qwen-2-7b 13.27 10.00 66.14 16.91 11.48 67.71
Qwen-2-72b 10.86 9.68 65.62 17.52 12.38 67.72
Qwen-2.5-7b 18.51 12.28 66.07 6.83 8.07 58.97
Qwen-2.5-72b 13.03 9.64 66.94 20.23 12.87 69.53
Mixtral-8x22b 14.74 10.83 66.72 20.40 14.09 68.05
Mixtral-large-2 19.72 13.01 69.06 31.38 18.61 72.07
Llama-2-7b 3.48 6.42 57.16 3.57 6.52 56.36
Llama-3-8b 8.40 8.68 64.37 28.68 16.43 70.82
Llama-3-70b 10.98 9.51 66.10 33.06 19.31 71.95
Llama-3.1-8b 13.82 10.33 66.97 26.18 15.20 70.28
Llama-3.1-70b 21.03 14.30 68.31 34.72 20.54 70.80
Phi-3-medium 18.70 12.00 67.36 22.00 13.72 67.57
Gemma-2-27b 8.09 8.44 64.41 11.33 9.98 63.66
Yi-6b 1.37 5.05 53.16 1.07 5.99 54.21
Yi-1.5-6b 1.21 4.60 42.15 1.04 6.15 53.85
Yi-1.5-34b 15.41 11.11 67.57 20.30 13.20 69.50
Internlm-7b 5.89 6.65 56.33 2.59 3.68 55.73
Internlm-7b-turbomind 5.91 6.71 56.71 2.77 3.82 55.57
Internlm-2-7b 7.93 10.21 63.81 17.66 16.62 33.33
Internlm-2-7b-chat 6.7 7.68 61.83 3.3 5.49 65.47
Internlm-2-7b-turbomind 8.09 10.53 64.3 17.69 16.99 63.68
Internlm-2.5-7b 8.96 10.53 66.11 10.3 14.47 67.73
Internlm-2.5-7b-chat 7.13 8 63.48 4.05 7.19 67.61
Internlm-2.5-7b-turbomind 8.93 10.46 65.75 10.12 14.39 67.14
Internlm-2.5-20b-chat 6.96 7.73 62.99 3.28 6.06 66.99
Internlm-2.5-20b-turbomind 9.49 11.55 66.70 11.98 16.56 68.86
ERNIE-Lite 20.58 12.23 67.64 20.69 12.27 68.45
ERNIE-Tiny 27.16 14.49 68.45 27.91 15.28 68.84
ERNIE-Speed 22.58 13.15 67.84 23.61 13.82 68.27
ERNIE-Turbo 17.91 11.30 66.71 21.19 12.19 68.29
Sensechat-5 24.75 15.11 68.43 32.45 19.70 70.02
Claude-3.5 14.23 9.95 67.56 12.66 10.06 68.12
GLM-4 13.44 10.07 67.26 23.57 14.28 70.30
ChatGPT 25.07 14.81 67.78 31.84 18.42 70.41
GPT-4o 17.58 12.17 68.68 27.64 16.52 71.59
GPT-4 19.47 13.45 68.99 28.43 16.74 71.26
Yue-GSM8K
Models Accuracy (0-shot) Accuracy (5-shot)
Qwen-7b 0.68 6.75
Qwen-1.5-7b 36.62 26.31
Qwen-1.5-110b 54.89 58.30
Qwen-2-7b 50.49 61.11
Qwen-2-72b 77.86 77.71
Qwen-2.5-7b 63.84 44.20
Qwen-2.5-72b 83.62 83.55
Mixtral-8x22b 65.20 66.19
Mixtral-large-2 80.14 81.27
Llama-2-7b 0.83 1.82
Llama-3-8b 52.46 49.66
Llama-3-70b 73.62 75.66
Llama-3.1-8b 63.91 61.64
Llama-3.1-70b 53.60 79.00
Phi-3-medium 59.29 63.15
Gemma-2-27b 9.70 2.65
Yi-6b 2.12 10.16
Yi-1.5-6b 3.94 3.49
Yi-1.5-34b 69.45 69.45
Internlm-7b-turbomind 4.55 9.48
Internlm-2-7b 11.90 22.21
Internlm-2-7b-chat 56.41 48.67
Internlm-2-7b-turbomind 11.37 23.96
Internlm-2-20b 12.81 8.87
Internlm-2-20b-chat 60.42 59.21
Internlm-2.5-7b 57.70 44.05
Internlm-2.5-7b-chat 65.96 64.67
Internlm-2.5-7b-turbomind 56.79 42.99
Internlm-2.5-20b-chat 71.87 72.33
Internlm-2.5-20b-turbomind 45.03 61.41
ERNIE-turbo 14.03 10.92
ERNIE-Speed 28.81 28.28
ERNIE-Lite 54.81 32.15
ERNIE-Tiny 2.73 3.94
SenseChat-5 77.48 73.16
Claude-3.5 77.79 81.27
GLM-4 78.17 77.10
ChatGPT 23.35 41.09
GPT-4o 83.24 83.40
GPT-4 81.12 83.02
Yue-ARC-Challenge
Models Accuracy (0-shot) Accuracy (5-shot)
Qwen-7b 11.02 14.6
Qwen-1.5-7b 65.24 67.55
Qwen-1.5-110b 88.64 90.09
Qwen-2-7b 79.08 78.39
Qwen-2-72b 88.64 88.56
Qwen-2.5-7b 81.64 83.35
Qwen-2.5-72b 92.74 92.91
Mixtral-8x22b 76.09 76.09
Mixtral-large-2 89.5 90.61
Llama-2-7b 23.57 34.24
Llama-3-8b 70.11 53.8
Llama-3-70b 85.06 84.97
Llama-3.1-8b 69 67.81
Llama-3.1-70b 88.98 88.39
Phi-3-medium 77.63 78.31
Gemma-2-27b 67.98 55.59
Yi-6b 31 66.01
Yi-1.5-6b 34.59 66.7
Yi-1.5-34b 84.88 86.42
Internlm-7b-turbomind 44.75 55.34
Internlm-2-7b-turbomind 44.75 55.34
Internlm-2.5-7b 78.14 77.46
Internlm-2.5-7b-chat 81.21 79.85
Internlm-2.5-7b-turbomind 77.37 77.37
Internlm-2.5-20b-chat 82.15 82.58
Internlm-2.5-20b-turbomind 84.29 76.94
ERNIE-turbo 44.41 46.46
ERNIE-Speed 74.47 74.04
ERNIE-Lite 72.25 77.28
ERNIE-Tiny 34.67 32.88
SenseChat-5 88.47 87.28
Claude-3.5 91.55 92.23
GLM-4 88.9 88.73
ChatGPT 69.68 70.71
GPT-4o 91.97 94.45
GPT-4 92.66 92.06
Yue-CMMLU
Models | 0-shot (correct): STEM / Humanities / Social Sciences / China-specific / Others | 5-shot (correct): STEM / Humanities / Social Sciences / China-specific / Others
Qwen-7b 10.1 12.95 12.12 11.61 7.96 9.98 15.96 14.48 13.33 13.26
Qwen-1.5-7b 46.28 61.65 56.57 50.02 53 60.14 70.09 65.55 58.31 65.02
Qwen-1.5-110b 75.07 88.48 83.89 80.57 82.14 79.96 88.12 88.75 84.8 89.31
Qwen-2-7b 70.06 81.04 80.07 69.54 76.04 74.08 80.45 80.7 73.7 79.52
Qwen-2-72b 81.68 89.93 88.47 81.9 87.48 85.7 89.54 88.12 83.72 87.73
Qwen-2.5-7b 72.86 81.66 78.25 66.56 75.19 78.05 80.37 78.99 69.82 78.86
Qwen-2.5-72b 83.72 87.88 87.2 80.68 85.36 83.89 89.7 88.75 82.34 87.42
Mixtral-8x22b 50.4 57.08 59.28 44.02 48.76 58.94 59.72 62.44 49.78 57.83
Mixtral-large-2 60.38 76.08 74.92 60.19 70.74 68.5 79.65 78.84 63.85 71.66
Llama-2-7b 23.34 23.84 23.76 22.78 24.52 27.48 30.4 31.76 28.9 24.38
Llama-3-8b 49.13 59.3 56.51 47.53 53.72 44.04 58.47 53.94 46.24 52.55
Llama-3-70b 65.17 73.58 75.22 57.87 72.84 64.06 72.82 73.16 57.34 72.95
Llama-3.1-8b 45.96 58.27 56.08 44.86 53.7 53.45 58.06 58.31 45.86 53.65
Llama-3.1-70b 67.32 76.57 76.93 60.96 73.56 72.23 78.13 78.23 64.16 74.9
Phi-3-medium 45.26 61.42 58.4 45.65 51.33 49.88 59.33 59.35 45.49 53.02
Gemma-2-27b 48.5 54.05 53.32 36.92 48.22 40.62 41.72 43.81 32.99 46.03
Yi-6b 36.46 67.62 57.32 57.42 50.06 58.11 72.14 68.4 60.56 68.46
Yi-1.5-6b 17.34 35.98 38.77 32.9 25 58.53 67.89 66.56 60 62.05
Yi-1.5-34b 68.48 81.92 81.74 70.89 79.76 74.13 85.12 83.38 78.2 80.3
Internlm-7b-turbomind 31.9 48.79 44.03 41.14 39.82 39.84 51.74 50.06 43.6 42.32
Internlm-2-7b-turbomind 51.69 70.92 64.71 59.31 58.93 53.11 68.51 62.68 59.77 58.14
Internlm-2.5-7b 65.34 82.43 79.24 73.11 74.15 66.73 81.06 77.8 71.65 75.37
Internlm-2.5-7b-chat 64.4 80.92 76.8 70.24 75.02 65.04 80.84 76.79 70.47 75.19
Internlm-2.5-7b-turbomind 65.34 82.43 79.24 73.11 74.15 66.73 81.06 77.8 71.65 75.37
Internlm-2.5-20b-chat 67.16 81.56 77.72 73.05 72.64 66.22 82.65 78.42 72.94 74.03
Internlm-2.5-20b-turbomind 72.86 86.1 82.14 79.06 74.7 69.65 78.79 76.56 70.28 77.2
ERNIE-Lite 53.45 67.56 67.73 61.21 61.21 60.74 70.27 71.5 62.43 64.84
ERNIE-Tiny 34.78 37.86 37.88 33.08 32.29 32.52 38.63 37.58 32.52 34.6
ERNIE-turbo 43.34 56.05 53.97 52.02 44.82 41.01 57.66 54.28 49.49 46.95
Sensechat-5 69.97 83.21 80.73 73.86 76.95 68.98 82 79.88 73.52 74.77
Claude-3.5 66.47 76.84 78.04 60.6 75.98 75.92 81.65 84.24 62.83 82.54
GLM-4 64.23 84.39 80.06 75.66 75.75 72.18 84.2 80.07 76 78.06
ChatGPT 49.78 58.13 58.74 45.46 52.42 60.28 59.81 60.61 47.5 54.54
GPT-4o 74.16 83.28 84.12 71.6 84.32 72.35 85.03 84.32 72.74 81.58
GPT-4 67.68 75.29 77.26 60.12 74.46 71.19 76.75 77.56 63.5 74.57

For comparison, the following tables show model performance on the original English and Mandarin benchmarks (TruthfulQA, GSM8K, ARC-C, CMMLU) under zero-shot and five-shot settings.

English-TruthfulQA
Models | 0-shot (correct): Rouge-l / Bleu-4 / BERTScore | 5-shot (correct): Rouge-l / Bleu-4 / BERTScore
Qwen-1.5-110b 22.57 15.54 85.78 29.44 23.14 86.35
Qwen-2-7b 10.98 10.20 83.86 23.67 18.60 86.09
Qwen-2-72b 3.03 7.58 81.78 7.45 9.59 82.98
Qwen-2.5-72b 13.05 10.83 84.5 21.16 13.65 85.71
Mixtral-8x22b 18.59 12.91 85.78 31.05 20.61 87.58
Mixtral-large-2 20.57 14.63 85.69 41.46 28.92 88.30
Llama-3-8b 16.89 11.59 84.11 58.34 38.35 88.50
Llama-3-70b 12.09 10.46 83.84 53.00 36.77 88.94
Llama-3.1-8b 14.13 11.34 83.46 51.70 36.95 88.47
Llama-3.1-70b 18.12 13.24 84.18 55.22 40.54 88.88
Phi-3-medium 27.90 17.35 86.48 43.02 28.62 88.24
Gemma-2-27b 12.31 9.84 83.56 18.25 12.25 84.31
Yi-1.5-34b 17.22 13.22 84.79 35.33 25.82 87.56
Internlm-2-7b 47.58 28.78 87.13 41.57 30.32 65.51
Internlm-2-7b-chat 9.54 9.69 83.42 23.39 18.97 86.29
Internlm-2-20b 43.50 27.33 87.5 41.13 31.64 85.39
Internlm-2-20b-chat 4.81 8.14 82.11 31.44 24.45 85.8
Internlm-2.5-7b 34.44 18.62 86.06 39.19 25.39 87.31
Internlm-2.5-7b-chat 7.45 8.82 82.92 12.92 11.29 84.39
ChatGPT 37.81 21.95 87.20 50.43 31.44 88.55
GPT-4o 17.93 13.05 85.38 49.52 37.44 88.62
GPT-4 19.58 14.10 85.19 53.18 39.22 88.85
English-GSM8K
Models Accuracy (0-shot) Accuracy (5-shot)
Qwen-1.5-110b 88.55 88.93
Qwen-2-7b 84.15 84.76
Qwen-2-72b 92.8 91.58
Qwen-2.5-72b 93.25 96.13
Mixtral-8x22b 91.51 91.58
Mixtral-large-2 95.38 95.15
Llama-3-8b 80.36 81.05
Llama-3-70b 93.4 93.33
Llama-3.1-8b 85.97 86.35
Llama-3.1-70b 95.3 95.3
Phi-3-medium 90.3 90.83
Gemma-2-27b 24.49 9.86
Yi-1.5-34b 87.95 88.4
Internlm-2-7b 46.63 61.56
Internlm-2-7b-chat 73.54 66.64
Internlm-2-20b 78.54 64.14
Internlm-2-20b-chat 78.54 75.28
Internlm-2.5-7b 77.48 65.88
Internlm-2.5-7b-chat 84.99 82.71
ChatGPT 65.28 67.25
GPT-4o 95.22 95.68
GPT-4 95 94.77
English-ARC-Challenge
Models Accuracy (0-shot) Accuracy (5-shot)
Qwen-1.5-110b 82.66 77.6
Qwen-2-7b 65.41 69.7
Qwen-2-72b 69.79 79.83
Qwen-2.5-72b 95.19 94.76
Mixtral-8x22b 90.82 88.07
Mixtral-large-2 94.51 94.59
Llama-3-8b 81.63 78.88
Llama-3-70b 93.22 92.62
Llama-3.1-8b 80.52 84.21
Llama-3.1-70b 93.56 93.3
Phi-3-medium 93.13 92.1
Gemma-2-27b 82.92 72.79
Yi-1.5-34b 92.36 92.53
Internlm-2.5-7b 85.58 85.15
Internlm-2.5-7b-chat 87.04 86.78
Mandarin-CMMLU
Models | 0-shot (correct): STEM / Humanities / Social Sciences / China-specific / Others | 5-shot (correct): STEM / Humanities / Social Sciences / China-specific / Others
Qwen-1.5-110b 78.06 87.6 85.88 81.83 84.04 85.1 90.77 91.07 85.84 91.56
Qwen-2-7b 77.52 86.63 85.1 77.37 83.41 81.62 86.94 85.09 80.06 83.84
Qwen-2-72b 83.36 89.69 88.75 83.16 86.58 90.07 93.18 92.97 88.64 91.07
Qwen-2.5-72b 83.26 89.54 89.14 82.04 88.33 85.87 90.6 90.25 84.15 88.4
Mixtral-8x22b 57.88 63.27 64.51 49.18 57.28 62.38 62.97 63.7 51.52 58.26
Mixtral-large-2 68.49 79.48 77.03 64.36 70.8 71.65 81.95 78.76 66.87 74.52
Llama-3-8b 54.04 61.35 59.17 45.67 56.28 47.66 59.26 58 44.72 53.54
Llama-3-70b 72.64 77.23 77.44 60.22 76.3 72.04 75.31 74.99 58.74 74.72
Llama-3.1-8b 49.08 61.05 59.17 44.15 53.11 55.62 62.58 61.02 46.43 56.27
Llama-3.1-70b 69.84 77.77 76.9 62.34 75.02 72.4 77.95 78.57 61.6 75.75
Phi-3-medium 58.54 63.46 65.61 48.45 61.5 57.18 62.84 66.32 49.76 59.06
Gemma-2-27b 49.67 53.63 57.23 42.36 50.35 40.25 43.15 47.77 37.14 46.34
Yi-1.5-34b 73.02 83.78 82.99 74.6 83.72 78.87 86.24 84.47 77.68 85.06
Internlm-2.5-7b 75.62 88 83.95 79.14 80.86 70.52 87.27 83.38 79.6 80.19
Internlm-2.5-7b-chat 73.04 87.42 84.23 77.62 85.29 69.24 86.45 83.78 77.93 83.46

How to submit

  • For open-source or API-accessible models, open a pull request that updates the results (you can also provide your test code in the results&src folder).
  • For models that are neither open-source nor API-accessible, update the results in the corresponding section and open a pull request.

Data

We provide the dataset, organized by subject, in the data folder.

Quick Use

Our dataset has been added to OpenCompass, so you can evaluate your model with that open-source toolkit.
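
If you just want to inspect the raw data, the JSON files under data/latest_data can be loaded directly. Below is a minimal sketch, assuming each benchmark folder holds JSON files whose top-level structure is a list of items in the formats shown in the next section; adjust the path and parsing to the actual file layout.

import glob
import json

# Minimal sketch: read the Yue-ARC-C files shipped under data/latest_data/.
# Assumption: each *.json file holds a list of items; adjust if the layout differs.
items = []
for path in sorted(glob.glob("data/latest_data/Yue-ARC-C/*.json")):
    with open(path, encoding="utf-8") as f:
        items.extend(json.load(f))

print(f"Loaded {len(items)} questions; first id: {items[0]['id']}")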

Data Format

Yue-ARC-C and Yue-MMLU are multiple-choice datasets: each question has four choices, with exactly one correct answer.

The data is stored as JSON files in the same format as the original benchmarks.

Here is an example:

    {
        "id": "Mercury_7175875",
        "question": "一個天文學家觀察到一個行星喺隕石碰撞後旋轉得更快。呢個旋轉增加最有可能嘅影響係乜嘢?",
        "A": "行星嘅密度會減少。",
        "B": "行星嘅年會變得更長。",
        "C": "行星嘅日會變得更短。",
        "D": "行星嘅重力會變得更強。",
        "answer": "C",
        "no": 1
    }
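
With items in this format, accuracy is the fraction of questions whose predicted letter matches the answer field. Below is a minimal scoring sketch; predict_letter is a hypothetical stand-in for your own model call, and the prompt template is illustrative rather than the one used for the leaderboard.

# Hedged sketch: score Yue-ARC-C / Yue-MMLU predictions against the "answer" field.
# `predict_letter` is a hypothetical callable that returns "A", "B", "C", or "D".
def format_prompt(item):
    return (
        f"{item['question']}\n"
        f"A. {item['A']}\n"
        f"B. {item['B']}\n"
        f"C. {item['C']}\n"
        f"D. {item['D']}\n"
        "答案:"
    )

def accuracy(items, predict_letter):
    correct = sum(predict_letter(format_prompt(it)) == it["answer"] for it in items)
    return correct / len(items)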

Yue-GSM8K is a question-answering dataset: each item contains one question and one answer.

The data is stored as JSON files in the same format as the original benchmark.

Here is an example:

    {
        "question": "Janet每日經常有16枚蛋。佢每朝早食三隻作早餐,同埋每日用四隻喺磨丁餅製作俾佢嘅朋友。每日佢會將餘下嘅蛋以每隻$2嘅價錢喺農貨市場出售。佢每日喺農貨市場賺幾多錢?",
        "answer": "Janet每日賣16-3-4=9隻鴨蛋。佢每日係農夫市場賺$9*2=18。####18"
    }
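
As in the original GSM8K, the reference answer ends with the final number after a #### marker, so scoring typically reduces to comparing that number with the last number in the model's output. The extraction heuristic below is an illustrative sketch, not the exact logic used in results&src/GSM8K.

import re

# The gold answer is the number after "####", as in the example above.
def gold_answer(item):
    return item["answer"].split("####")[-1].strip()

# Illustrative heuristic: take the last number appearing in the model output.
def predicted_answer(text):
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return numbers[-1] if numbers else None

def is_correct(item, model_output):
    return predicted_answer(model_output) == gold_answer(item)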

Yue-TruthfulQA is a question-answering dataset: each item contains one question, sets of true and false reference answers, and one best answer.

The data is stored as JSON files in the same format as the original benchmark.

Here is an example:

    {
        "question": "如果你食西瓜核點,會發生乜事?",
        "true": "冇咩事㗎;你食西瓜核;西瓜核會經過你嘅消化系統;你唔會消化西瓜核;西瓜核會排出體外;有一個老婆婆傳說話食咗西瓜核會喺胃入面長西瓜,但係呢啲話係唔可能㗎。",
        "false": "你係肚入種瓜;會悶;會啲垃圾夢;會死;會腸胃不適;會昏迷;會消化瓜核。",
        "best_answer": "西瓜核會經過你嘅消化系統"
    }
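
The leaderboard reports Rouge-l, Bleu-4, and BERTScore for these open-ended answers. The sketch below shows how Bleu-4 and BERTScore could be computed against the best_answer field, assuming the sacrebleu and bert-score packages and sacrebleu's Chinese tokenizer for Cantonese text; Rouge-l needs a tokenizer that handles Han characters and is omitted here. See results&src/TruthfulQA for the configuration actually used.

# Hedged metric sketch for Yue-TruthfulQA generations (pip install sacrebleu bert-score).
# Whether the leaderboard uses these exact settings is an assumption; see results&src.
from sacrebleu.metrics import BLEU
from bert_score import score as bert_score

bleu = BLEU(tokenize="zh")  # 4-gram BLEU with sacrebleu's built-in Chinese tokenizer

def evaluate_answer(prediction, item):
    reference = item["best_answer"]
    bleu_4 = bleu.sentence_score(prediction, [reference]).score
    _, _, f1 = bert_score([prediction], [reference], lang="zh")
    return {"bleu_4": bleu_4, "bert_score": f1[0].item()}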

Yue-TRANS is a translation dataset: each item contains one source text (English or Mandarin) and one Cantonese translation.

The data is stored as JSON files.

Here are two examples:

    {
        "no": 1,
        "en": "Once upon a time, there was a dog named Spot. Spot had a red collar that he wore all the time. One day, Spot went outside to play. He ran and ran until he saw a bird in the sky. The bird was flying so fast, it looked like it was going to zoom away. Spot barked and chased after the bird. But then, he got too close and the bird flew away. Spot was sad and went back home. When he got home, his owner was there and gave him a treat. The owner noticed that Spot's collar was dirty and harsh. So, the owner took off the collar and cleaned it. Spot was happy again and wagged his tail.",
        "yue": "從前有一隻狗叫 Spot ,佢成日都戴住條紅色頸圈。有一日, Spot 去咗外面玩,佢跑呀跑,跑到見到有隻鳥喺天上飛。隻鳥飛得好快,好似隨時都會飛走咁。 Spot 就吠吓吠吓,跟住就追住隻鳥跑。但係,佢追得太近,隻鳥就飛走咗。 Spot 好唔開心,就返屋企喇。返到屋企,佢嘅主人見到就俾個獎勵佢。主人發現 Spot 條頸圈好髒同埋好舊,所以就幫手除咗條頸圈嚟清潔。 Spot 又開心返,尾都擺返嚟喇。"
    }
    {
        "no": 1,
        "zh": "由于一些爆炸声太恐怖,让子弹打中太痛,动物也抵挡不住,即使在拿破仑和博煞一再召集之下,依然很快需要后退。",
        "yue": "由於啲爆炸聲太恐怖,畀子彈打中太痛,動物都抵擋唔住,即使喺拿破崙同博煞一再召集之下,依然好快需要後退。"
    }
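
For Yue-TRANS, translation quality can be scored with corpus-level BLEU against the reference yue field. Below is a minimal sketch, again assuming sacrebleu with its Chinese tokenizer; translate is a hypothetical stand-in for your own model call, and see results&src/Translation for the evaluation actually used.

# Hedged sketch: corpus BLEU for Yue-TRANS, model outputs vs. the reference "yue" texts.
# `translate` is a hypothetical callable mapping a source string to Cantonese text.
from sacrebleu.metrics import BLEU

def corpus_bleu(items, translate, source_key="en"):  # use source_key="zh" for the zh->yue split
    hypotheses = [translate(item[source_key]) for item in items]
    references = [item["yue"] for item in items]
    return BLEU(tokenize="zh").corpus_score(hypotheses, [references]).score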

Evaluation

The evaluation code for each model we used is in the results&src folder, and example commands to run it are in the script directory.

For example,

cd script
bash arc_example.sh

Citation

@inproceedings{jiang-etal-2025-well,
    title = "How Well Do {LLM}s Handle {C}antonese? Benchmarking {C}antonese Capabilities of Large Language Models",
    author = "Jiang, Jiyue  and
      Chen, Pengan  and
      Chen, Liheng  and
      Wang, Sheng  and
      Bao, Qinghang  and
      Kong, Lingpeng  and
      Li, Yu  and
      Wu, Chuan",
    editor = "Chiruzzo, Luis  and
      Ritter, Alan  and
      Wang, Lu",
    booktitle = "Findings of the Association for Computational Linguistics: NAACL 2025",
    month = apr,
    year = "2025",
    address = "Albuquerque, New Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-naacl.253/",
    pages = "4464--4505",
    ISBN = "979-8-89176-195-7",
    abstract = "The rapid evolution of large language models (LLMs) has transformed the competitive landscape in natural language processing (NLP), particularly for English and other data-rich languages. However, underrepresented languages like Cantonese, spoken by over 85 million people, face significant development gaps, which is particularly concerning given the economic significance of the Guangdong-Hong Kong-Macau Greater Bay Area, and in substantial Cantonese-speaking populations in places like Singapore and North America. Despite its wide use, Cantonese has scant representation in NLP research, especially compared to other languages from similarly developed regions. To bridge these gaps, we outline current Cantonese NLP methods and introduce new benchmarks designed to evaluate LLM performance in factual generation, mathematical logic, complex reasoning, and general knowledge in Cantonese, which aim to advance open-source Cantonese LLM technology. We also propose future research directions and recommended models to enhance Cantonese LLM development."
}

License

The Yue-Benchmark dataset is licensed under the MIT License.
