LM-Harmony is an automatic evaluation tool for large language models. Unlike popular direct evaluations of LLMs, LM-Harmony uses a train-before-test evaluation paradigm: each model is fine-tuned on a benchmark's training set before being evaluated on that benchmark. Our results demonstrate that LM-Harmony provides significantly more consistent model rankings across tasks, revealing the true general capabilities of each model.
The leaderboard includes 24 benchmarks, covering language understanding, commonsense reasoning, question answering, math, physics, chemistry, biology, and medicine. We report the score on each benchmark and aggregate them into a single PC1 score using the first principal component of the full model-by-benchmark score matrix, which accounts for 85% of the variance (a minimal sketch of this aggregation follows the leaderboard). More details can be found in our paper.
Model | Rank | PC1 | MNLI | QQP | MedMCQA | QNLI | NQ-Open | SST-2 | Winogrande | HellaSwag | Social-IQA | MathQA | ANLI | PIQA | SciQ | CommonsenseQA | BoolQ | CoLA | GSM8K | WiC | OpenBookQA | MRPC | HeadQA | RTE | ARC-Easy | ARC-Challenge
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Qwen/Qwen2.5-14B | 1 | 591.21 | 90.34 | 89.09 | 65.67 | 93.17 | 30 | 96.44 | 86.35 | 85.81 | 58.6 | 68.34 | 76.7 | 83.9 | 98.8 | 88.04 | 89.54 | 68.37 | 81.88 | 74.29 | 60.2 | 87.5 | 55.51 | 87.73 | 90.11 | 66.04 |
Qwen/Qwen2.5-14B-Instruct | 2 | 587.23 | 90.61 | 88.59 | 65.19 | 92.97 | 29.75 | 96.33 | 85.95 | 85.86 | 57.98 | 68.21 | 77.3 | 84.49 | 98.8 | 87.22 | 90.03 | 69.32 | 83.17 | 74.92 | 58.6 | 86.52 | 55.14 | 90.61 | 89.31 | 66.81 |
google/gemma-2-9b-it | 3 | 569.26 | 90.9 | 89.2 | 59.36 | 94.93 | 28.2 | 96.56 | 86.03 | 84.52 | 61.26 | 51.66 | 79.4 | 84.28 | 98.6 | 83.87 | 91.47 | 69.21 | 77.71 | 73.82 | 59.8 | 88.73 | 53.87 | 91.34 | 89.06 | 67.06 |
google/gemma-2-9b | 4 | 562.02 | 91.03 | 89.16 | 62.51 | 93.45 | 33.6 | 96.33 | 85.48 | 85.79 | 59.93 | 54.77 | 71.8 | 85.2 | 98.5 | 84.36 | 89.91 | 67.07 | 69.9 | 73.82 | 60.2 | 86.52 | 57 | 88.81 | 89.86 | 67.75 |
google/gemma-7b | 5 | 466.81 | 90.26 | 88.71 | 54.86 | 92.7 | 28.78 | 95.99 | 86.82 | 85.42 | 59.93 | 48.07 | 72 | 84.06 | 98.2 | 82.31 | 89.57 | 67.54 | 59.67 | 73.98 | 59 | 86.03 | 50.77 | 87 | 87.75 | 62.29 |
meta-llama/Llama-3.1-8B-Instruct | 6 | 433.06 | 89.91 | 88.14 | 60.48 | 92.93 | 31.55 | 95.99 | 85 | 82.26 | 58.03 | 46.63 | 77.1 | 83.62 | 98.3 | 82.39 | 87.06 | 65.27 | 76.19 | 74.14 | 51.8 | 81.86 | 50.51 | 88.09 | 85.86 | 59.04 |
meta-llama/Meta-Llama-3-8B-Instruct | 7 | 416.2 | 89.63 | 88.69 | 60.98 | 91.16 | 29.56 | 96.22 | 84.93 | 82.18 | 58.85 | 46.03 | 76 | 83.3 | 98.2 | 79.69 | 87.55 | 67.29 | 75.21 | 72.88 | 52.6 | 80.88 | 49.31 | 89.17 | 84.85 | 57.68 |
Qwen/Qwen1.5-14B | 8 | 403.39 | 90.55 | 88.88 | 54.41 | 91.32 | 25.6 | 96.44 | 82.64 | 82.38 | 56.04 | 57.99 | 72.7 | 81.94 | 97.8 | 87.14 | 88.38 | 68.66 | 68.01 | 73.82 | 53.8 | 84.07 | 46.57 | 89.89 | 83.42 | 56.31 |
meta-llama/Meta-Llama-3-8B | 9 | 401.15 | 88.79 | 88.51 | 58.74 | 92.13 | 31.47 | 96.22 | 84.45 | 84.31 | 59.47 | 46.97 | 68 | 84.49 | 98.1 | 80.51 | 87.46 | 64.67 | 55.8 | 71 | 57.2 | 79.9 | 51.09 | 83.75 | 86.45 | 60.15 |
meta-llama/Llama-3.1-8B | 10 | 399.69 | 89.65 | 88.49 | 58.4 | 92.51 | 31.3 | 95.41 | 84.45 | 83.72 | 58.5 | 46.93 | 66.4 | 84.28 | 98.2 | 81.74 | 88.5 | 64.32 | 55.72 | 71.94 | 56.2 | 80.39 | 51.42 | 84.12 | 86.74 | 61.01 |
01-ai/Yi-1.5-9B | 11 | 397 | 90.26 | 88.83 | 54.67 | 92.31 | 24.1 | 95.87 | 81.93 | 82.43 | 58.85 | 50.32 | 71.4 | 82.86 | 98.2 | 84.52 | 88.07 | 61.69 | 65.96 | 74.76 | 54.6 | 82.11 | 47.48 | 89.17 | 85.86 | 60.41 |
Qwen/Qwen1.5-14B-Chat | 12 | 393.47 | 90.17 | 88.66 | 54.82 | 91.2 | 24.9 | 95.99 | 81.61 | 82.18 | 56.55 | 57.05 | 76.4 | 82.48 | 97.8 | 86.81 | 87.95 | 67.24 | 66.34 | 73.35 | 52.6 | 84.31 | 45.55 | 90.97 | 83.12 | 55.63 |
Qwen/Qwen2-7B | 13 | 386.51 | 89.17 | 87.99 | 57.35 | 90.21 | 27.76 | 96.33 | 82.64 | 82.01 | 57.68 | 52.23 | 67.6 | 82.7 | 97.8 | 85.59 | 86.57 | 64.47 | 74.98 | 75.08 | 53.8 | 81.62 | 47.01 | 86.64 | 84.89 | 58.7 |
Qwen/Qwen2.5-7B | 14 | 386.34 | 89.76 | 87.85 | 61.63 | 89.44 | 24.29 | 95.53 | 80.03 | 81.82 | 56.65 | 56.85 | 70.6 | 81.66 | 97.7 | 86.49 | 87.37 | 59.66 | 75.82 | 71.94 | 56.4 | 82.6 | 48.47 | 85.2 | 86.32 | 60.58 |
Qwen/Qwen2.5-7B-Instruct | 15 | 374.28 | 89.88 | 88.04 | 61.08 | 90.3 | 21.44 | 95.76 | 82.16 | 81.19 | 56.04 | 56.25 | 74 | 81.56 | 97.5 | 85.18 | 87.28 | 57.16 | 73.84 | 71.79 | 53.4 | 82.35 | 47.96 | 88.45 | 85.52 | 58.96 |
01-ai/Yi-1.5-9B-Chat | 16 | 373.74 | 90.39 | 88.55 | 51.85 | 94.07 | 18.98 | 95.99 | 80.51 | 81.06 | 62.08 | 52.86 | 75.4 | 82.37 | 97.5 | 84.68 | 89.57 | 63.4 | 72.63 | 73.82 | 49.4 | 82.84 | 43.91 | 85.56 | 83.54 | 58.96 |
Qwen/Qwen2-7B-Instruct | 17 | 370.57 | 89.2 | 88.13 | 57.14 | 90.48 | 25.76 | 96.56 | 81.53 | 81.52 | 58.96 | 52.46 | 68 | 82.21 | 97.6 | 84.6 | 86.24 | 63.11 | 75.06 | 72.1 | 52.8 | 82.6 | 46.35 | 88.09 | 83.38 | 57.17 |
01-ai/Yi-9B | 18 | 362.12 | 89.97 | 88.56 | 54.22 | 92.15 | 22.33 | 95.76 | 81.37 | 80.85 | 57.93 | 45.49 | 69.3 | 81.77 | 98.4 | 84.03 | 88.72 | 62.48 | 53.75 | 75.71 | 54.8 | 82.6 | 46.39 | 90.25 | 84.26 | 60.58 |
google/gemma-7b-it | 19 | 312.1 | 89.48 | 88.74 | 49.06 | 93.21 | 20.53 | 96.33 | 80.82 | 79.95 | 57.68 | 48.64 | 73.9 | 81.72 | 97.4 | 78.71 | 89.2 | 61.97 | 55.19 | 71.16 | 54.2 | 84.31 | 43.65 | 85.56 | 80.22 | 53.16 |
01-ai/Yi-1.5-6B-Chat | 20 | 285.53 | 89.43 | 87.99 | 49.32 | 93.1 | 19 | 95.87 | 76.8 | 78.82 | 58.75 | 47.27 | 72.5 | 81.45 | 97.9 | 81.57 | 86.42 | 58.46 | 64.52 | 72.1 | 51.6 | 83.33 | 41.87 | 85.92 | 81.78 | 54.1 |
01-ai/Yi-1.5-6B | 21 | 285.49 | 89.06 | 87.82 | 50.9 | 90.55 | 24.13 | 95.99 | 78.3 | 80.42 | 56.81 | 46.7 | 67.3 | 81.99 | 97.3 | 79.93 | 86.06 | 59.39 | 57.01 | 72.26 | 50.8 | 83.09 | 44.02 | 85.2 | 83.75 | 57.59 |
Qwen/Qwen1.5-7B | 22 | 275.62 | 88.9 | 88.21 | 50.42 | 89.24 | 23.91 | 96.22 | 79.08 | 79.52 | 54.15 | 45.73 | 69.6 | 80.41 | 97.3 | 84.36 | 86.48 | 66.81 | 57.24 | 71.79 | 52.2 | 82.11 | 42.78 | 89.53 | 81.9 | 54.78 |
01-ai/Yi-6B-Chat | 23 | 257.08 | 88.07 | 87.74 | 50.8 | 90.33 | 25.37 | 96.33 | 79.08 | 78.89 | 56.24 | 38.86 | 69 | 81.45 | 97.7 | 81.16 | 87.19 | 62.84 | 40.56 | 73.51 | 50.6 | 80.88 | 43.69 | 89.53 | 81.78 | 52.47 |
google/gemma-2-2b-it | 24 | 255.49 | 87.65 | 88.08 | 47.79 | 92.95 | 20.61 | 96.56 | 77.43 | 76.65 | 57.42 | 38.93 | 71.6 | 80.85 | 97.8 | 76.25 | 87.92 | 61.58 | 55.72 | 71.79 | 51.2 | 81.62 | 45 | 84.84 | 82.45 | 53.75 |
meta-llama/Llama-3.2-3B-Instruct | 25 | 250.82 | 88.05 | 87.38 | 55.2 | 90.48 | 28.2 | 96.22 | 77.11 | 75.92 | 56.4 | 42.11 | 71.4 | 79.38 | 97.1 | 76.9 | 85.9 | 58.66 | 64.67 | 72.73 | 48.6 | 80.39 | 44.09 | 84.84 | 80.81 | 52.22 |
Qwen/Qwen1.5-7B-Chat | 26 | 240.56 | 88.96 | 88.1 | 50.66 | 88.63 | 21.91 | 95.76 | 78.06 | 78.91 | 54.5 | 43.28 | 70.7 | 80.3 | 97.1 | 84.19 | 86.27 | 64.38 | 52.31 | 69.91 | 50.4 | 83.82 | 41.79 | 90.25 | 80.43 | 50.43 |
01-ai/Yi-6B | 27 | 237.67 | 88.03 | 88.13 | 50.35 | 90.23 | 25.65 | 96.22 | 78.77 | 78.97 | 56.45 | 37.69 | 65.7 | 81.07 | 97.9 | 81.41 | 85.5 | 63.01 | 42.38 | 73.35 | 49.2 | 80.39 | 43.84 | 85.56 | 81.82 | 53.92 |
Qwen/Qwen2.5-3B | 28 | 223.02 | 88.58 | 87.36 | 54.63 | 89.24 | 18.12 | 95.64 | 75.77 | 76.18 | 54.15 | 49.08 | 63.4 | 79.87 | 97.9 | 80.75 | 85.96 | 56.34 | 68.76 | 70.53 | 48.6 | 79.17 | 45.22 | 87.73 | 82.53 | 54.01 |
Qwen/Qwen2.5-3B-Instruct | 29 | 217.06 | 88.62 | 87.56 | 54.67 | 89.68 | 15.07 | 95.76 | 75.69 | 76.16 | 54.4 | 48.61 | 68.4 | 79.22 | 97.2 | 80.67 | 86.36 | 56.96 | 66.64 | 70.85 | 48.2 | 80.39 | 44.38 | 86.64 | 81.57 | 52.22 |
meta-llama/Llama-3.2-3B | 30 | 206.73 | 88.15 | 87.42 | 51.33 | 91.56 | 25.01 | 96.44 | 78.61 | 78.8 | 56.19 | 42.24 | 59.3 | 80.85 | 97.9 | 74.53 | 85.96 | 63.71 | 35.25 | 70.53 | 50.4 | 78.43 | 46.24 | 80.87 | 81.78 | 50.68 |
google/gemma-2-2b | 31 | 175.76 | 87.58 | 87.64 | 47.84 | 90.15 | 21.02 | 96.33 | 78.61 | 78.21 | 57.83 | 39.5 | 57.6 | 80.74 | 98.3 | 72.97 | 84.53 | 61.98 | 34.57 | 69.44 | 51 | 77.45 | 46.02 | 79.42 | 83.42 | 52.47 |
Qwen/Qwen1.5-4B | 32 | 144.4 | 87.77 | 87.63 | 47.33 | 90.72 | 17.81 | 95.76 | 74.82 | 74.43 | 55.78 | 46.16 | 58.8 | 79.22 | 97.6 | 81.33 | 83.55 | 58.98 | 52.84 | 70.22 | 45.4 | 80.88 | 38.66 | 85.92 | 78.87 | 48.21 |
Qwen/Qwen1.5-4B-Chat | 33 | 130.19 | 87.89 | 87.81 | 46.59 | 90.81 | 16.59 | 95.41 | 75.3 | 73.45 | 55.78 | 40.9 | 67.1 | 79.22 | 97 | 81.49 | 83.15 | 59.24 | 46.47 | 69.59 | 45.6 | 82.11 | 38.04 | 86.28 | 78.16 | 45.22 |
Qwen/Qwen2.5-1.5B | 34 | 38.53 | 86.48 | 86.33 | 50.47 | 88.34 | 13.66 | 95.18 | 71.19 | 70.16 | 51.74 | 42.41 | 54.8 | 76.88 | 96.8 | 76.74 | 81.62 | 53.27 | 63.15 | 65.99 | 45.4 | 79.17 | 40.48 | 77.62 | 79.88 | 49.74 |
Qwen/Qwen2.5-1.5B-Instruct | 35 | 27.14 | 87.31 | 86.49 | 49.82 | 88.38 | 11.47 | 95.53 | 71.19 | 70.09 | 51.33 | 42.38 | 57.4 | 77.31 | 96.7 | 77.15 | 82.05 | 44.86 | 50.49 | 68.34 | 44.4 | 78.68 | 41.17 | 81.59 | 77.82 | 47.01 |
Qwen/Qwen2-1.5B | 36 | -27.68 | 85.85 | 85.99 | 45.61 | 88.82 | 14.07 | 95.76 | 70.56 | 68.94 | 53.22 | 39.16 | 57.1 | 76.77 | 96.6 | 72.73 | 80.67 | 54.91 | 52.24 | 65.99 | 41.8 | 76.72 | 36.73 | 79.42 | 74.28 | 43.09 |
Qwen/Qwen2-1.5B-Instruct | 37 | -30.16 | 85.83 | 85.95 | 44.44 | 88.69 | 12.19 | 95.53 | 70.72 | 68.47 | 52.92 | 38.56 | 60.2 | 76.93 | 96.1 | 73.96 | 80.12 | 56.13 | 46.85 | 68.65 | 41 | 77.21 | 36.58 | 81.95 | 73.57 | 41.98 |
google/gemma-2b | 38 | -49.37 | 84.93 | 86.52 | 38.47 | 87.61 | 15.24 | 95.99 | 72.38 | 74.64 | 54.45 | 34.74 | 49.5 | 79.76 | 96.9 | 60.52 | 81.77 | 50.57 | 20.32 | 63.48 | 45.4 | 77.21 | 39.86 | 77.26 | 77.78 | 44.97 |
EleutherAI/pythia-12b | 39 | -113.24 | 84.74 | 87.14 | 39.45 | 88.56 | 17.4 | 94.61 | 71.51 | 71.88 | 54.76 | 29.92 | 47.8 | 78.67 | 97.4 | 27.85 | 81.22 | 53.57 | 14.33 | 62.7 | 44.2 | 76.47 | 41.98 | 71.48 | 75.08 | 43.26 |
google/gemma-2b-it | 40 | -119.89 | 83.26 | 86.29 | 36.77 | 91.12 | 9.97 | 94.95 | 66.46 | 66.22 | 53.22 | 33.27 | 49 | 76.44 | 96.8 | 66.09 | 82.94 | 41.76 | 19.94 | 65.67 | 44.4 | 79.41 | 36.65 | 78.7 | 71.59 | 42.41 |
Qwen/Qwen1.5-1.8B | 41 | -149.58 | 83.88 | 86.25 | 41.5 | 86.55 | 12.13 | 94.61 | 65.75 | 63.58 | 52.41 | 33.03 | 46.5 | 74.54 | 96.6 | 72.81 | 77.58 | 43.25 | 37.07 | 66.77 | 41.4 | 79.9 | 35.05 | 76.17 | 71.68 | 38.99 |
EleutherAI/pythia-6.9b-deduped | 42 | -177.14 | 84.34 | 86.46 | 35.52 | 88.21 | 10.86 | 95.3 | 72.14 | 69.62 | 54.45 | 28.54 | 46.7 | 78.45 | 96.4 | 20.64 | 78.53 | 51.52 | 8.87 | 65.52 | 44.8 | 74.75 | 40.15 | 71.12 | 72.94 | 41.13 |
meta-llama/Llama-3.2-1B-Instruct | 43 | -177.25 | 83.25 | 85.43 | 44.54 | 87.99 | 15.68 | 95.07 | 66.69 | 65.14 | 52.56 | 33.9 | 50.2 | 75.73 | 96 | 66.99 | 77.03 | 33.81 | 36.54 | 60.19 | 41.8 | 73.04 | 37.53 | 72.2 | 72.14 | 41.04 |
Qwen/Qwen1.5-1.8B-Chat | 44 | -177.62 | 84.09 | 85.48 | 41.14 | 87.53 | 10.5 | 95.53 | 65.51 | 62.6 | 51.48 | 34.84 | 45.9 | 75.3 | 96.2 | 72.73 | 77.83 | 33.24 | 33.43 | 65.05 | 39.6 | 78.68 | 34.54 | 79.42 | 69.11 | 38.48 |
meta-llama/Llama-3.2-1B | 45 | -254.44 | 83.66 | 85.18 | 39.76 | 81.93 | 13.96 | 95.87 | 68.43 | 68.82 | 52.61 | 31.52 | 43.2 | 77.97 | 96 | 63.88 | 76.42 | 32.94 | 11.37 | 56.27 | 43 | 70.83 | 38.77 | 63.54 | 73.02 | 39.85 |
EleutherAI/pythia-2.8b-deduped | 46 | -266.76 | 83.61 | 86.36 | 34.09 | 88.56 | 10.66 | 93.46 | 67.01 | 64.05 | 52.35 | 27.24 | 42.8 | 75.95 | 97 | 20.97 | 78.07 | 52.38 | 6.67 | 63.17 | 41.6 | 74.75 | 38.99 | 70.4 | 70.12 | 36.77 |
Qwen/Qwen2.5-0.5B | 47 | -356.88 | 81.52 | 84.61 | 40.55 | 86.02 | 7.67 | 92.43 | 60.38 | 54.19 | 48.11 | 32.83 | 40.8 | 71.16 | 96 | 61.75 | 72.69 | 31.63 | 33.97 | 61.13 | 35.6 | 75.98 | 33.63 | 71.84 | 69.28 | 34.3 |
Qwen/Qwen2.5-0.5B-Instruct | 48 | -357.34 | 81.44 | 84.96 | 40.38 | 85.3 | 6.65 | 92.66 | 60.46 | 53.75 | 47.65 | 32.6 | 41.9 | 70.46 | 95.7 | 61.34 | 72.57 | 29.28 | 29.8 | 62.38 | 39.4 | 76.23 | 33.11 | 71.84 | 68.81 | 34.81 |
Qwen/Qwen2-0.5B-Instruct | 49 | -401.47 | 79.25 | 84.27 | 38.82 | 85.36 | 6.26 | 92.43 | 60.77 | 50.91 | 48.72 | 30.79 | 40.6 | 70.13 | 94.8 | 56.35 | 72.11 | 39.62 | 29.19 | 61.91 | 37.6 | 74.51 | 31.77 | 75.45 | 62.37 | 31.14 |
EleutherAI/pythia-1.4b-deduped | 50 | -431.18 | 80.84 | 83.98 | 34.14 | 86.49 | 5.76 | 94.27 | 62.67 | 57.63 | 50.46 | 25.59 | 40.4 | 73.34 | 95.7 | 21.79 | 74.31 | 34.95 | 3.49 | 60.19 | 37.4 | 72.55 | 35.27 | 65.7 | 65.32 | 32.25 |
Qwen/Qwen2-0.5B | 51 | -441.38 | 79.21 | 84.46 | 37.92 | 84.99 | 7.09 | 91.63 | 59.98 | 50.87 | 48.21 | 29.72 | 40.1 | 70.24 | 94.6 | 57.66 | 71.83 | 8.36 | 33.97 | 62.07 | 34.8 | 75.98 | 31.62 | 73.29 | 64.35 | 30.46 |
Qwen/Qwen1.5-0.5B | 52 | -443.59 | 79.6 | 83.51 | 36.1 | 86.16 | 5.73 | 92.32 | 60.14 | 51.35 | 48.31 | 27.07 | 42.2 | 71.06 | 94.4 | 56.59 | 70.7 | 37.58 | 22.82 | 57.99 | 34.2 | 76.47 | 31.36 | 72.92 | 61.95 | 30.46 |
Qwen/Qwen1.5-0.5B-Chat | 53 | -539.97 | 78.15 | 83.82 | 35.41 | 85.58 | 4.88 | 92.32 | 57.14 | 47.93 | 47.49 | 29.41 | 43.4 | 69.15 | 94.4 | 54.3 | 67.68 | 10.03 | 16.15 | 54.23 | 35.2 | 72.3 | 29.72 | 72.56 | 59.3 | 31.23 |
EleutherAI/pythia-1b-deduped | 54 | -565.75 | 75.74 | 82.45 | 33.54 | 84.92 | 5.71 | 92.55 | 60.46 | 51.36 | 49.54 | 24.36 | 39.6 | 71.49 | 94.4 | 19.33 | 66.97 | 19.14 | 2.43 | 58.93 | 35 | 71.32 | 34.68 | 62.82 | 62.08 | 28.67 |
openai-community/gpt2-xl | 55 | -579.6 | 78.04 | 82.75 | 33.4 | 88.61 | 8.23 | 94.15 | 64.25 | 53.24 | 50.61 | 23.99 | 35.4 | 71.22 | 94.7 | 19.57 | 69.17 | 0.29 | 2.05 | 54.86 | 34.2 | 68.87 | 30.34 | 57.04 | 62.16 | 31.48 |
openai-community/gpt2-large | 56 | -708.22 | 73.96 | 78.46 | 31.7 | 82.74 | 5.9 | 93.81 | 57.77 | 47.39 | 48.67 | 23.48 | 36.2 | 70.62 | 93.7 | 20.39 | 67.95 | 0 | 1.97 | 53.61 | 33.2 | 68.14 | 29.14 | 52.71 | 58.29 | 25.94 |
EleutherAI/pythia-410m-deduped | 57 | -733.36 | 73.47 | 83.13 | 32.37 | 84.2 | 2.44 | 91.4 | 55.41 | 43.18 | 45.65 | 22.98 | 38.2 | 68.06 | 92.7 | 19.08 | 65.87 | 4.64 | 1.82 | 52.98 | 30 | 70.59 | 31.25 | 57.4 | 54.25 | 26.11 |
openai-community/gpt2-medium | 58 | -836.43 | 61.92 | 76.66 | 31.58 | 81.42 | 4.4 | 92.2 | 55.01 | 41.13 | 46.52 | 22.95 | 36.9 | 66.38 | 91.6 | 19.49 | 63.79 | 0 | 1.9 | 50 | 30.2 | 69.36 | 29.25 | 52.71 | 52.15 | 26.96 |
EleutherAI/pythia-160m-deduped | 59 | -1012.88 | 58.35 | 77.52 | 31.46 | 70.35 | 0.42 | 86.01 | 52.25 | 31.47 | 39.61 | 23.22 | 33.2 | 62.35 | 84.6 | 19.41 | 62.32 | 3.48 | 2.12 | 55.49 | 26.8 | 69.61 | 27.61 | 59.57 | 41.33 | 26.11 |
openai-community/gpt2 | 60 | -1142.2 | 43.74 | 65.41 | 31.89 | 57.53 | 2.47 | 87.73 | 51.62 | 31.5 | 39.76 | 21.31 | 35.4 | 63 | 87.7 | 21.54 | 60.95 | 1.26 | 1.9 | 49.53 | 27.2 | 68.14 | 26.77 | 53.79 | 43.9 | 22.7 |
EleutherAI/pythia-70m-deduped | 61 | -1343.86 | 40.41 | 68.35 | 32.03 | 58.05 | 0.03 | 76.72 | 50.04 | 27.58 | 35.98 | 22.55 | 32.9 | 58.38 | 67.7 | 19.66 | 62.02 | 0 | 1.36 | 52.04 | 24.6 | 68.38 | 26.62 | 50.54 | 36.11 | 21.42 |
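
For illustration, here is a minimal sketch of the PC1 aggregation, assuming a score matrix with one row per model and one column per benchmark. The column centering and sign convention below are our assumptions; the exact procedure is described in the paper.

```python
import numpy as np

def pc1_aggregate(scores: np.ndarray):
    """Aggregate a (num_models x num_benchmarks) score matrix into one PC1
    score per model, plus the fraction of variance explained by PC1."""
    # Center each benchmark column; whether the columns are also rescaled
    # is not specified here, so we only center (an assumption).
    X = scores - scores.mean(axis=0, keepdims=True)
    # The first right-singular vector of the centered matrix is the first
    # principal component direction.
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    pc1 = X @ Vt[0]
    # Flip the sign so that a larger PC1 score means a stronger model,
    # assuming most benchmarks are "higher is better".
    if np.corrcoef(pc1, scores.mean(axis=1))[0, 1] < 0:
        pc1 = -pc1
    explained = S[0] ** 2 / np.sum(S ** 2)  # reported as ~85% in the paper
    return pc1, explained
```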
```bash
# Create the conda environment
conda env create -f environment.yml

# Fine-tune and evaluate one model on one benchmark
# (example: EleutherAI/pythia-70m-deduped on ARC-Easy)
python main.py \
    --exp_name arc_easy/EleutherAI/pythia-70m-deduped \
    --base_model EleutherAI/pythia-70m-deduped \
    --task_name arc_easy \
    --train_param.num_train_epochs 5 \
    --eval_before_train \
    --eval_every_epoch \
    --eval_after_train \
    --train_param.greater_is_better \
    --train_param.metric_for_best_model eval_acc_norm,none \
    --dataset_param.max_num_train 50000 \
    --dataset_param.max_num_valid 1000 \
    --dataset_param.max_num_test 10000 \
    --no-use_git
```
You can specify the model with --base_model. We use the following models in our paper:
- openai-community/gpt2
- openai-community/gpt2-medium
- openai-community/gpt2-large
- openai-community/gpt2-xl
- meta-llama/Meta-Llama-3-8B
- meta-llama/Llama-3.1-8B
- meta-llama/Llama-3.2-1B
- meta-llama/Llama-3.2-3B
- meta-llama/Meta-Llama-3-8B-Instruct
- meta-llama/Llama-3.1-8B-Instruct
- meta-llama/Llama-3.2-1B-Instruct
- meta-llama/Llama-3.2-3B-Instruct
- Qwen/Qwen1.5-0.5B
- Qwen/Qwen1.5-1.8B
- Qwen/Qwen1.5-4B
- Qwen/Qwen1.5-7B
- Qwen/Qwen1.5-14B
- Qwen/Qwen1.5-0.5B-Chat
- Qwen/Qwen1.5-1.8B-Chat
- Qwen/Qwen1.5-4B-Chat
- Qwen/Qwen1.5-7B-Chat
- Qwen/Qwen1.5-14B-Chat
- Qwen/Qwen2-0.5B
- Qwen/Qwen2-1.5B
- Qwen/Qwen2-7B
- Qwen/Qwen2-0.5B-Instruct
- Qwen/Qwen2-1.5B-Instruct
- Qwen/Qwen2-7B-Instruct
- Qwen/Qwen2.5-0.5B
- Qwen/Qwen2.5-1.5B
- Qwen/Qwen2.5-3B
- Qwen/Qwen2.5-7B
- Qwen/Qwen2.5-14B
- Qwen/Qwen2.5-0.5B-Instruct
- Qwen/Qwen2.5-1.5B-Instruct
- Qwen/Qwen2.5-3B-Instruct
- Qwen/Qwen2.5-7B-Instruct
- Qwen/Qwen2.5-14B-Instruct
- EleutherAI/pythia-70m-deduped
- EleutherAI/pythia-160m-deduped
- EleutherAI/pythia-410m-deduped
- EleutherAI/pythia-1b-deduped
- EleutherAI/pythia-1.4b-deduped
- EleutherAI/pythia-2.8b-deduped
- EleutherAI/pythia-6.9b-deduped
- EleutherAI/pythia-12b
- google/gemma-2b
- google/gemma-7b
- google/gemma-2-2b
- google/gemma-2-9b
- google/gemma-2b-it
- google/gemma-7b-it
- google/gemma-2-2b-it
- google/gemma-2-9b-it
- 01-ai/Yi-6B
- 01-ai/Yi-6B-Chat
- 01-ai/Yi-9B
- 01-ai/Yi-1.5-6B
- 01-ai/Yi-1.5-6B-Chat
- 01-ai/Yi-1.5-9B
- 01-ai/Yi-1.5-9B-Chat
Add --dataset_param.apply_chat_template to enable evaluation with chat templates. Other LLMs available on the HuggingFace Hub can also be run using this code.
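For example, an instruction-tuned model can be fine-tuned and evaluated with its chat template as follows (the model and task are purely illustrative; the remaining flags mirror the ARC-Easy example above):

```bash
python main.py \
    --exp_name arc_easy/meta-llama/Llama-3.1-8B-Instruct \
    --base_model meta-llama/Llama-3.1-8B-Instruct \
    --task_name arc_easy \
    --dataset_param.apply_chat_template \
    --train_param.num_train_epochs 5 \
    --eval_before_train --eval_every_epoch --eval_after_train \
    --train_param.greater_is_better \
    --train_param.metric_for_best_model eval_acc_norm,none \
    --dataset_param.max_num_train 50000 \
    --dataset_param.max_num_valid 1000 \
    --dataset_param.max_num_test 10000 \
    --no-use_git
```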
You can specify the benchmark with --task_name (a sketch for sweeping several tasks follows the list). We use the following tasks in our paper:
- cola
- mnli
- qnli
- rte
- mrpc
- qqp
- sst2
- boolq
- wic
- anli_r1
- commonsense_qa
- social_iqa
- winogrande
- openbookqa
- hellaswag
- arc_easy
- arc_challenge
- headqa_en
- mathqa
- medmcqa
- piqa
- sciq
- nq_open
- gsm8k
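
One model can be swept over several of these tasks with a shell loop like the sketch below. Note that --train_param.metric_for_best_model is task-specific (eval_acc_norm,none in the ARC-Easy example above) and is omitted here, so you may need to supply the appropriate metric per task:

```bash
# Sketch: sweep one model across several benchmarks (task names from the list above).
MODEL=EleutherAI/pythia-70m-deduped
for TASK in cola mnli arc_easy gsm8k; do
    python main.py \
        --exp_name "${TASK}/${MODEL}" \
        --base_model "${MODEL}" \
        --task_name "${TASK}" \
        --train_param.num_train_epochs 5 \
        --eval_before_train --eval_every_epoch --eval_after_train \
        --dataset_param.max_num_train 50000 \
        --dataset_param.max_num_valid 1000 \
        --dataset_param.max_num_test 10000 \
        --no-use_git
done
```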
For more supported benchmarks, please check lm-eval. To determine if a benchmark has a training set, check the corresponding *.yml file for the presence of the keyword training_split.
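For instance, assuming lm-eval's task configs are available locally under lm_eval/tasks/ (the path depends on your installation), the configs that define a training split can be listed with:

```bash
# List task config files that define a training split (adjust the path to your setup)
grep -rl "training_split" lm_eval/tasks/
```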