LM-Harmony

[Figure: the train-before-test evaluation paradigm]

LM-Harmony is an automatic evaluation tool for large language models. Unlike the popular direct evaluation of LLMs, LM-Harmony uses a train-before-test evaluation paradigm: each model is fine-tuned on the corresponding training set before it is evaluated. Our results demonstrate that LM-Harmony provides significantly more consistent model rankings across tasks, revealing the true general capabilities of each model.
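In pseudocode, the protocol looks roughly like the sketch below; fine_tune and evaluate are hypothetical placeholders rather than functions from this repository (the actual pipeline lives in main.py):

def train_before_test(models, tasks, fine_tune, evaluate):
    """Fine-tune each model on each task's training set, then score it on the test set."""
    scores = {}
    for model_name in models:
        for task_name in tasks:
            # Direct evaluation would skip this step and score the base checkpoint as-is.
            tuned = fine_tune(model_name, task_name)            # adapt to the task's training split
            scores[(model_name, task_name)] = evaluate(tuned, task_name)
    return scores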

The leaderboard includes 24 benchmarks, covering language understanding, commonsense reasoning, question answering, math, physics, chemistry, biology, and medicine. We report the score on each benchmark and aggregate the scores into a single PC1 value: the projection onto the first principal component of the full score matrix, which accounts for 85% of its variance. More details can be found in our paper.
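As a rough illustration, the PC1 aggregate can be computed along the following lines. This is a minimal NumPy sketch on placeholder data, assuming plain mean-centering; the exact preprocessing behind the leaderboard may differ.

import numpy as np

# scores: one row per model, one column per benchmark (e.g. the 61 x 24 leaderboard matrix).
scores = np.random.rand(61, 24) * 100        # placeholder data for illustration

centered = scores - scores.mean(axis=0)      # center each benchmark column
_, s, vt = np.linalg.svd(centered, full_matrices=False)
pc1 = centered @ vt[0]                       # projection onto the first principal component
explained = s[0] ** 2 / np.sum(s ** 2)       # fraction of variance captured by PC1

# The sign of a singular vector is arbitrary; flip it so higher benchmark scores
# map to higher PC1 values before ranking.
if np.corrcoef(pc1, scores.mean(axis=1))[0, 1] < 0:
    pc1 = -pc1
ranking = np.argsort(-pc1)                   # model indices from best to worst PC1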

Leaderboard

Model Rank PC1 MNLI QQP MedMCQA QNLI NQ-Open SST-2 Winogrande HellaSwag Social-IQA MathQA ANLI PIQA SciQ CommonsenseQA BoolQ CoLA GSM8K WiC OpenBookQA MRPC HeadQA RTE ARC-Easy ARC-Challenge
Qwen/Qwen2.5-14B 1 591.21 90.34 89.09 65.67 93.17 30 96.44 86.35 85.81 58.6 68.34 76.7 83.9 98.8 88.04 89.54 68.37 81.88 74.29 60.2 87.5 55.51 87.73 90.11 66.04
Qwen/Qwen2.5-14B-Instruct 2 587.23 90.61 88.59 65.19 92.97 29.75 96.33 85.95 85.86 57.98 68.21 77.3 84.49 98.8 87.22 90.03 69.32 83.17 74.92 58.6 86.52 55.14 90.61 89.31 66.81
google/gemma-2-9b-it 3 569.26 90.9 89.2 59.36 94.93 28.2 96.56 86.03 84.52 61.26 51.66 79.4 84.28 98.6 83.87 91.47 69.21 77.71 73.82 59.8 88.73 53.87 91.34 89.06 67.06
google/gemma-2-9b 4 562.02 91.03 89.16 62.51 93.45 33.6 96.33 85.48 85.79 59.93 54.77 71.8 85.2 98.5 84.36 89.91 67.07 69.9 73.82 60.2 86.52 57 88.81 89.86 67.75
google/gemma-7b 5 466.81 90.26 88.71 54.86 92.7 28.78 95.99 86.82 85.42 59.93 48.07 72 84.06 98.2 82.31 89.57 67.54 59.67 73.98 59 86.03 50.77 87 87.75 62.29
meta-llama/Llama-3.1-8B-Instruct 6 433.06 89.91 88.14 60.48 92.93 31.55 95.99 85 82.26 58.03 46.63 77.1 83.62 98.3 82.39 87.06 65.27 76.19 74.14 51.8 81.86 50.51 88.09 85.86 59.04
meta-llama/Meta-Llama-3-8B-Instruct 7 416.2 89.63 88.69 60.98 91.16 29.56 96.22 84.93 82.18 58.85 46.03 76 83.3 98.2 79.69 87.55 67.29 75.21 72.88 52.6 80.88 49.31 89.17 84.85 57.68
Qwen/Qwen1.5-14B 8 403.39 90.55 88.88 54.41 91.32 25.6 96.44 82.64 82.38 56.04 57.99 72.7 81.94 97.8 87.14 88.38 68.66 68.01 73.82 53.8 84.07 46.57 89.89 83.42 56.31
meta-llama/Meta-Llama-3-8B 9 401.15 88.79 88.51 58.74 92.13 31.47 96.22 84.45 84.31 59.47 46.97 68 84.49 98.1 80.51 87.46 64.67 55.8 71 57.2 79.9 51.09 83.75 86.45 60.15
meta-llama/Llama-3.1-8B 10 399.69 89.65 88.49 58.4 92.51 31.3 95.41 84.45 83.72 58.5 46.93 66.4 84.28 98.2 81.74 88.5 64.32 55.72 71.94 56.2 80.39 51.42 84.12 86.74 61.01
01-ai/Yi-1.5-9B 11 397 90.26 88.83 54.67 92.31 24.1 95.87 81.93 82.43 58.85 50.32 71.4 82.86 98.2 84.52 88.07 61.69 65.96 74.76 54.6 82.11 47.48 89.17 85.86 60.41
Qwen/Qwen1.5-14B-Chat 12 393.47 90.17 88.66 54.82 91.2 24.9 95.99 81.61 82.18 56.55 57.05 76.4 82.48 97.8 86.81 87.95 67.24 66.34 73.35 52.6 84.31 45.55 90.97 83.12 55.63
Qwen/Qwen2-7B 13 386.51 89.17 87.99 57.35 90.21 27.76 96.33 82.64 82.01 57.68 52.23 67.6 82.7 97.8 85.59 86.57 64.47 74.98 75.08 53.8 81.62 47.01 86.64 84.89 58.7
Qwen/Qwen2.5-7B 14 386.34 89.76 87.85 61.63 89.44 24.29 95.53 80.03 81.82 56.65 56.85 70.6 81.66 97.7 86.49 87.37 59.66 75.82 71.94 56.4 82.6 48.47 85.2 86.32 60.58
Qwen/Qwen2.5-7B-Instruct 15 374.28 89.88 88.04 61.08 90.3 21.44 95.76 82.16 81.19 56.04 56.25 74 81.56 97.5 85.18 87.28 57.16 73.84 71.79 53.4 82.35 47.96 88.45 85.52 58.96
01-ai/Yi-1.5-9B-Chat 16 373.74 90.39 88.55 51.85 94.07 18.98 95.99 80.51 81.06 62.08 52.86 75.4 82.37 97.5 84.68 89.57 63.4 72.63 73.82 49.4 82.84 43.91 85.56 83.54 58.96
Qwen/Qwen2-7B-Instruct 17 370.57 89.2 88.13 57.14 90.48 25.76 96.56 81.53 81.52 58.96 52.46 68 82.21 97.6 84.6 86.24 63.11 75.06 72.1 52.8 82.6 46.35 88.09 83.38 57.17
01-ai/Yi-9B 18 362.12 89.97 88.56 54.22 92.15 22.33 95.76 81.37 80.85 57.93 45.49 69.3 81.77 98.4 84.03 88.72 62.48 53.75 75.71 54.8 82.6 46.39 90.25 84.26 60.58
google/gemma-7b-it 19 312.1 89.48 88.74 49.06 93.21 20.53 96.33 80.82 79.95 57.68 48.64 73.9 81.72 97.4 78.71 89.2 61.97 55.19 71.16 54.2 84.31 43.65 85.56 80.22 53.16
01-ai/Yi-1.5-6B-Chat 20 285.53 89.43 87.99 49.32 93.1 19 95.87 76.8 78.82 58.75 47.27 72.5 81.45 97.9 81.57 86.42 58.46 64.52 72.1 51.6 83.33 41.87 85.92 81.78 54.1
01-ai/Yi-1.5-6B 21 285.49 89.06 87.82 50.9 90.55 24.13 95.99 78.3 80.42 56.81 46.7 67.3 81.99 97.3 79.93 86.06 59.39 57.01 72.26 50.8 83.09 44.02 85.2 83.75 57.59
Qwen/Qwen1.5-7B 22 275.62 88.9 88.21 50.42 89.24 23.91 96.22 79.08 79.52 54.15 45.73 69.6 80.41 97.3 84.36 86.48 66.81 57.24 71.79 52.2 82.11 42.78 89.53 81.9 54.78
01-ai/Yi-6B-Chat 23 257.08 88.07 87.74 50.8 90.33 25.37 96.33 79.08 78.89 56.24 38.86 69 81.45 97.7 81.16 87.19 62.84 40.56 73.51 50.6 80.88 43.69 89.53 81.78 52.47
google/gemma-2-2b-it 24 255.49 87.65 88.08 47.79 92.95 20.61 96.56 77.43 76.65 57.42 38.93 71.6 80.85 97.8 76.25 87.92 61.58 55.72 71.79 51.2 81.62 45 84.84 82.45 53.75
meta-llama/Llama-3.2-3B-Instruct 25 250.82 88.05 87.38 55.2 90.48 28.2 96.22 77.11 75.92 56.4 42.11 71.4 79.38 97.1 76.9 85.9 58.66 64.67 72.73 48.6 80.39 44.09 84.84 80.81 52.22
Qwen/Qwen1.5-7B-Chat 26 240.56 88.96 88.1 50.66 88.63 21.91 95.76 78.06 78.91 54.5 43.28 70.7 80.3 97.1 84.19 86.27 64.38 52.31 69.91 50.4 83.82 41.79 90.25 80.43 50.43
01-ai/Yi-6B 27 237.67 88.03 88.13 50.35 90.23 25.65 96.22 78.77 78.97 56.45 37.69 65.7 81.07 97.9 81.41 85.5 63.01 42.38 73.35 49.2 80.39 43.84 85.56 81.82 53.92
Qwen/Qwen2.5-3B 28 223.02 88.58 87.36 54.63 89.24 18.12 95.64 75.77 76.18 54.15 49.08 63.4 79.87 97.9 80.75 85.96 56.34 68.76 70.53 48.6 79.17 45.22 87.73 82.53 54.01
Qwen/Qwen2.5-3B-Instruct 29 217.06 88.62 87.56 54.67 89.68 15.07 95.76 75.69 76.16 54.4 48.61 68.4 79.22 97.2 80.67 86.36 56.96 66.64 70.85 48.2 80.39 44.38 86.64 81.57 52.22
meta-llama/Llama-3.2-3B 30 206.73 88.15 87.42 51.33 91.56 25.01 96.44 78.61 78.8 56.19 42.24 59.3 80.85 97.9 74.53 85.96 63.71 35.25 70.53 50.4 78.43 46.24 80.87 81.78 50.68
google/gemma-2-2b 31 175.76 87.58 87.64 47.84 90.15 21.02 96.33 78.61 78.21 57.83 39.5 57.6 80.74 98.3 72.97 84.53 61.98 34.57 69.44 51 77.45 46.02 79.42 83.42 52.47
Qwen/Qwen1.5-4B 32 144.4 87.77 87.63 47.33 90.72 17.81 95.76 74.82 74.43 55.78 46.16 58.8 79.22 97.6 81.33 83.55 58.98 52.84 70.22 45.4 80.88 38.66 85.92 78.87 48.21
Qwen/Qwen1.5-4B-Chat 33 130.19 87.89 87.81 46.59 90.81 16.59 95.41 75.3 73.45 55.78 40.9 67.1 79.22 97 81.49 83.15 59.24 46.47 69.59 45.6 82.11 38.04 86.28 78.16 45.22
Qwen/Qwen2.5-1.5B 34 38.53 86.48 86.33 50.47 88.34 13.66 95.18 71.19 70.16 51.74 42.41 54.8 76.88 96.8 76.74 81.62 53.27 63.15 65.99 45.4 79.17 40.48 77.62 79.88 49.74
Qwen/Qwen2.5-1.5B-Instruct 35 27.14 87.31 86.49 49.82 88.38 11.47 95.53 71.19 70.09 51.33 42.38 57.4 77.31 96.7 77.15 82.05 44.86 50.49 68.34 44.4 78.68 41.17 81.59 77.82 47.01
Qwen/Qwen2-1.5B 36 -27.68 85.85 85.99 45.61 88.82 14.07 95.76 70.56 68.94 53.22 39.16 57.1 76.77 96.6 72.73 80.67 54.91 52.24 65.99 41.8 76.72 36.73 79.42 74.28 43.09
Qwen/Qwen2-1.5B-Instruct 37 -30.16 85.83 85.95 44.44 88.69 12.19 95.53 70.72 68.47 52.92 38.56 60.2 76.93 96.1 73.96 80.12 56.13 46.85 68.65 41 77.21 36.58 81.95 73.57 41.98
google/gemma-2b 38 -49.37 84.93 86.52 38.47 87.61 15.24 95.99 72.38 74.64 54.45 34.74 49.5 79.76 96.9 60.52 81.77 50.57 20.32 63.48 45.4 77.21 39.86 77.26 77.78 44.97
EleutherAI/pythia-12b 39 -113.24 84.74 87.14 39.45 88.56 17.4 94.61 71.51 71.88 54.76 29.92 47.8 78.67 97.4 27.85 81.22 53.57 14.33 62.7 44.2 76.47 41.98 71.48 75.08 43.26
google/gemma-2b-it 40 -119.89 83.26 86.29 36.77 91.12 9.97 94.95 66.46 66.22 53.22 33.27 49 76.44 96.8 66.09 82.94 41.76 19.94 65.67 44.4 79.41 36.65 78.7 71.59 42.41
Qwen/Qwen1.5-1.8B 41 -149.58 83.88 86.25 41.5 86.55 12.13 94.61 65.75 63.58 52.41 33.03 46.5 74.54 96.6 72.81 77.58 43.25 37.07 66.77 41.4 79.9 35.05 76.17 71.68 38.99
EleutherAI/pythia-6.9b-deduped 42 -177.14 84.34 86.46 35.52 88.21 10.86 95.3 72.14 69.62 54.45 28.54 46.7 78.45 96.4 20.64 78.53 51.52 8.87 65.52 44.8 74.75 40.15 71.12 72.94 41.13
meta-llama/Llama-3.2-1B-Instruct 43 -177.25 83.25 85.43 44.54 87.99 15.68 95.07 66.69 65.14 52.56 33.9 50.2 75.73 96 66.99 77.03 33.81 36.54 60.19 41.8 73.04 37.53 72.2 72.14 41.04
Qwen/Qwen1.5-1.8B-Chat 44 -177.62 84.09 85.48 41.14 87.53 10.5 95.53 65.51 62.6 51.48 34.84 45.9 75.3 96.2 72.73 77.83 33.24 33.43 65.05 39.6 78.68 34.54 79.42 69.11 38.48
meta-llama/Llama-3.2-1B 45 -254.44 83.66 85.18 39.76 81.93 13.96 95.87 68.43 68.82 52.61 31.52 43.2 77.97 96 63.88 76.42 32.94 11.37 56.27 43 70.83 38.77 63.54 73.02 39.85
EleutherAI/pythia-2.8b-deduped 46 -266.76 83.61 86.36 34.09 88.56 10.66 93.46 67.01 64.05 52.35 27.24 42.8 75.95 97 20.97 78.07 52.38 6.67 63.17 41.6 74.75 38.99 70.4 70.12 36.77
Qwen/Qwen2.5-0.5B 47 -356.88 81.52 84.61 40.55 86.02 7.67 92.43 60.38 54.19 48.11 32.83 40.8 71.16 96 61.75 72.69 31.63 33.97 61.13 35.6 75.98 33.63 71.84 69.28 34.3
Qwen/Qwen2.5-0.5B-Instruct 48 -357.34 81.44 84.96 40.38 85.3 6.65 92.66 60.46 53.75 47.65 32.6 41.9 70.46 95.7 61.34 72.57 29.28 29.8 62.38 39.4 76.23 33.11 71.84 68.81 34.81
Qwen/Qwen2-0.5B-Instruct 49 -401.47 79.25 84.27 38.82 85.36 6.26 92.43 60.77 50.91 48.72 30.79 40.6 70.13 94.8 56.35 72.11 39.62 29.19 61.91 37.6 74.51 31.77 75.45 62.37 31.14
EleutherAI/pythia-1.4b-deduped 50 -431.18 80.84 83.98 34.14 86.49 5.76 94.27 62.67 57.63 50.46 25.59 40.4 73.34 95.7 21.79 74.31 34.95 3.49 60.19 37.4 72.55 35.27 65.7 65.32 32.25
Qwen/Qwen2-0.5B 51 -441.38 79.21 84.46 37.92 84.99 7.09 91.63 59.98 50.87 48.21 29.72 40.1 70.24 94.6 57.66 71.83 8.36 33.97 62.07 34.8 75.98 31.62 73.29 64.35 30.46
Qwen/Qwen1.5-0.5B 52 -443.59 79.6 83.51 36.1 86.16 5.73 92.32 60.14 51.35 48.31 27.07 42.2 71.06 94.4 56.59 70.7 37.58 22.82 57.99 34.2 76.47 31.36 72.92 61.95 30.46
Qwen/Qwen1.5-0.5B-Chat 53 -539.97 78.15 83.82 35.41 85.58 4.88 92.32 57.14 47.93 47.49 29.41 43.4 69.15 94.4 54.3 67.68 10.03 16.15 54.23 35.2 72.3 29.72 72.56 59.3 31.23
EleutherAI/pythia-1b-deduped 54 -565.75 75.74 82.45 33.54 84.92 5.71 92.55 60.46 51.36 49.54 24.36 39.6 71.49 94.4 19.33 66.97 19.14 2.43 58.93 35 71.32 34.68 62.82 62.08 28.67
openai-community/gpt2-xl 55 -579.6 78.04 82.75 33.4 88.61 8.23 94.15 64.25 53.24 50.61 23.99 35.4 71.22 94.7 19.57 69.17 0.29 2.05 54.86 34.2 68.87 30.34 57.04 62.16 31.48
openai-community/gpt2-large 56 -708.22 73.96 78.46 31.7 82.74 5.9 93.81 57.77 47.39 48.67 23.48 36.2 70.62 93.7 20.39 67.95 0 1.97 53.61 33.2 68.14 29.14 52.71 58.29 25.94
EleutherAI/pythia-410m-deduped 57 -733.36 73.47 83.13 32.37 84.2 2.44 91.4 55.41 43.18 45.65 22.98 38.2 68.06 92.7 19.08 65.87 4.64 1.82 52.98 30 70.59 31.25 57.4 54.25 26.11
openai-community/gpt2-medium 58 -836.43 61.92 76.66 31.58 81.42 4.4 92.2 55.01 41.13 46.52 22.95 36.9 66.38 91.6 19.49 63.79 0 1.9 50 30.2 69.36 29.25 52.71 52.15 26.96
EleutherAI/pythia-160m-deduped 59 -1012.88 58.35 77.52 31.46 70.35 0.42 86.01 52.25 31.47 39.61 23.22 33.2 62.35 84.6 19.41 62.32 3.48 2.12 55.49 26.8 69.61 27.61 59.57 41.33 26.11
openai-community/gpt2 60 -1142.2 43.74 65.41 31.89 57.53 2.47 87.73 51.62 31.5 39.76 21.31 35.4 63 87.7 21.54 60.95 1.26 1.9 49.53 27.2 68.14 26.77 53.79 43.9 22.7
EleutherAI/pythia-70m-deduped 61 -1343.86 40.41 68.35 32.03 58.05 0.03 76.72 50.04 27.58 35.98 22.55 32.9 58.38 67.7 19.66 62.02 0 1.36 52.04 24.6 68.38 26.62 50.54 36.11 21.42

Dependencies

conda env create -f environment.yml
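Then activate the environment before running any experiments; the name is whatever environment.yml defines, so replace the placeholder below accordingly:

conda activate <env-name-from-environment.yml>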

Example Usage

python main.py \
    --exp_name arc_easy/EleutherAI/pythia-70m-deduped \
    --base_model EleutherAI/pythia-70m-deduped \
    --task_name arc_easy \
    --train_param.num_train_epochs 5 \
    --eval_before_train --eval_every_epoch --eval_after_train \
    --train_param.greater_is_better \
    --train_param.metric_for_best_model eval_acc_norm,none \
    --dataset_param.max_num_train 50000 \
    --dataset_param.max_num_valid 1000 \
    --dataset_param.max_num_test 10000 \
    --no-use_git

You can specify the model via --base_model. We use the following models in our paper:

  • openai-community/gpt2
  • openai-community/gpt2-medium
  • openai-community/gpt2-large
  • openai-community/gpt2-xl
  • meta-llama/Meta-Llama-3-8B
  • meta-llama/Llama-3.1-8B
  • meta-llama/Llama-3.2-1B
  • meta-llama/Llama-3.2-3B
  • meta-llama/Meta-Llama-3-8B-Instruct
  • meta-llama/Llama-3.1-8B-Instruct
  • meta-llama/Llama-3.2-1B-Instruct
  • meta-llama/Llama-3.2-3B-Instruct
  • Qwen/Qwen1.5-0.5B
  • Qwen/Qwen1.5-1.8B
  • Qwen/Qwen1.5-4B
  • Qwen/Qwen1.5-7B
  • Qwen/Qwen1.5-14B
  • Qwen/Qwen1.5-0.5B-Chat
  • Qwen/Qwen1.5-1.8B-Chat
  • Qwen/Qwen1.5-4B-Chat
  • Qwen/Qwen1.5-7B-Chat
  • Qwen/Qwen1.5-14B-Chat
  • Qwen/Qwen2-0.5B
  • Qwen/Qwen2-1.5B
  • Qwen/Qwen2-7B
  • Qwen/Qwen2-0.5B-Instruct
  • Qwen/Qwen2-1.5B-Instruct
  • Qwen/Qwen2-7B-Instruct
  • Qwen/Qwen2.5-0.5B
  • Qwen/Qwen2.5-1.5B
  • Qwen/Qwen2.5-3B
  • Qwen/Qwen2.5-7B
  • Qwen/Qwen2.5-14B
  • Qwen/Qwen2.5-0.5B-Instruct
  • Qwen/Qwen2.5-1.5B-Instruct
  • Qwen/Qwen2.5-3B-Instruct
  • Qwen/Qwen2.5-7B-Instruct
  • Qwen/Qwen2.5-14B-Instruct
  • EleutherAI/pythia-70m-deduped
  • EleutherAI/pythia-160m-deduped
  • EleutherAI/pythia-410m-deduped
  • EleutherAI/pythia-1b-deduped
  • EleutherAI/pythia-1.4b-deduped
  • EleutherAI/pythia-2.8b-deduped
  • EleutherAI/pythia-6.9b-deduped
  • EleutherAI/pythia-12b
  • google/gemma-2b
  • google/gemma-7b
  • google/gemma-2-2b
  • google/gemma-2-9b
  • google/gemma-2b-it
  • google/gemma-7b-it
  • google/gemma-2-2b-it
  • google/gemma-2-9b-it
  • 01-ai/Yi-6B
  • 01-ai/Yi-6B-Chat
  • 01-ai/Yi-9B
  • 01-ai/Yi-1.5-6B
  • 01-ai/Yi-1.5-6B-Chat
  • 01-ai/Yi-1.5-9B
  • 01-ai/Yi-1.5-9B-Chat

Other LLMs available on the Hugging Face Hub can also be evaluated with this code. Add --dataset_param.apply_chat_template to enable evaluation with chat templates.
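For example, the ARC-Easy run from above can be repeated with an instruct model and chat templates enabled. Only the model and the added flag differ from the earlier command; these settings are illustrative and not necessarily the exact hyperparameters behind the leaderboard:

python main.py \
    --exp_name arc_easy/meta-llama/Llama-3.1-8B-Instruct \
    --base_model meta-llama/Llama-3.1-8B-Instruct \
    --task_name arc_easy \
    --train_param.num_train_epochs 5 \
    --eval_before_train --eval_every_epoch --eval_after_train \
    --train_param.greater_is_better \
    --train_param.metric_for_best_model eval_acc_norm,none \
    --dataset_param.max_num_train 50000 \
    --dataset_param.max_num_valid 1000 \
    --dataset_param.max_num_test 10000 \
    --no-use_git \
    --dataset_param.apply_chat_template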

You can specify the benchmark via --task_name. We use the following tasks in our paper:

  • cola
  • mnli
  • qnli
  • rte
  • mrpc
  • qqp
  • sst2
  • boolq
  • wic
  • anli_r1
  • commonsense_qa
  • social_iqa
  • winogrande
  • openbookqa
  • hellaswag
  • arc_easy
  • arc_challenge
  • headqa_en
  • mathqa
  • medmcqa
  • piqa
  • sciq
  • nq_open
  • gsm8k

For more supported benchmarks, see lm-eval. To determine whether a benchmark has a training set, check its *.yml configuration file for the keyword training_split.
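For example, a quick way to list which task configs define a training split is the small sketch below; it assumes a local checkout of lm-eval, and the path is a placeholder to adjust:

from pathlib import Path

# Placeholder path: point this at wherever the lm-eval task configs live locally.
task_dir = Path("lm-evaluation-harness/lm_eval/tasks")

for cfg in sorted(task_dir.rglob("*.y*ml")):              # match both .yml and .yaml
    if "training_split" in cfg.read_text(errors="ignore"):
        print(cfg.relative_to(task_dir))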
