cmmlu_en
This project evaluates its models on the CMMLU benchmark, which contains 11K multiple-choice questions covering 67 subjects. The following describes how to run predictions on the CMMLU dataset.
Download the evaluation dataset from the official CMMLU path and unzip it into the data folder:

```bash
wget https://huggingface.co/datasets/haonan-li/cmmlu/resolve/main/cmmlu_v1_0_1.zip
unzip cmmlu_v1_0_1.zip -d data
```

Place the data folder under the scripts/cmmlu directory of this project.
Execute the following script:

```bash
model_path=path/to/llama-3-chinese
output_path=path/to/your_output_dir

cd scripts/cmmlu
python eval.py \
    --model_path ${model_path} \
    --few_shot False \
    --with_prompt True \
    --output_dir ${output_path} \
    --input_dir data
```
- `model_path`: directory of the model to evaluate (a complete Llama-3-Chinese or Llama-3-Chinese-Instruct model, not a LoRA)
- `few_shot`: whether to use few-shot evaluation
- `ntrain`: when `few_shot=True`, specifies the number of few-shot instances (5-shot: `ntrain=5`); not applicable when `few_shot=False` (see the 5-shot example after this list)
- `with_prompt`: whether the model input includes an instruction template; intended specifically for the Llama-3-Instruct model
- `n_times`: number of evaluation repetitions; a corresponding number of folders will be created under `output_dir`
- `load_in_4bit`: load the model with 4-bit quantization
- `use_flash_attention_2`: use Flash Attention 2 for accelerated inference; otherwise SDPA is used for acceleration
- `output_dir`: output path for the evaluation results
- `input_dir`: path to the evaluation data
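For reference, a 5-shot run might look like the following (a sketch based on the parameters above; setting `--with_prompt False` for few-shot evaluation of the base model is an assumption, check `eval.py` for the exact behavior):

```bash
model_path=path/to/llama-3-chinese
output_path=path/to/your_output_dir

cd scripts/cmmlu
# 5-shot evaluation: enable few_shot and set ntrain=5
python eval.py \
    --model_path ${model_path} \
    --few_shot True \
    --ntrain 5 \
    --with_prompt False \
    --output_dir ${output_path} \
    --input_dir data
```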
- After the model prediction is complete, directories `outputs/take*` are generated, where `*` is a number from 0 to `n_times-1`, each storing the results of one of the `n_times` decoding runs.
- Each `outputs/take*` directory contains two JSON files, `submission.json` and `summary.json` (a sketch for inspecting both follows this list).
- `submission.json` stores the model's evaluation answers, formatted as:
```json
{
    "arts": {
        "0": "A",
        "1": "B",
        ...
    },
    "nutrition": {
        "0": "B",
        "1": "A",
        ...
    },
    ...
}
```
- `summary.json` contains the model's evaluation results on the 67 subjects, the 5 major categories, and the overall average. For example, the `All` field in the JSON file shows the overall average performance:
"All": {
"score": 0.39984458642721465,
"num": 11582,
"correct": 4631.0
}
where `score` is the accuracy (`correct / num`), `num` is the total number of test samples, and `correct` is the number of correctly answered questions.
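For a quick sanity check of both output files, the results can be inspected from the command line (a minimal sketch; the `take0` directory and the `${output_path}` variable from the run above are assumptions):

```bash
python -c "
import json, sys
out_dir = sys.argv[1]

# per-subject answer counts from submission.json
submission = json.load(open(out_dir + '/submission.json'))
for subject, answers in submission.items():
    print(subject, len(answers))

# recompute the overall accuracy from summary.json; it should match 'score'
overall = json.load(open(out_dir + '/summary.json'))['All']
print('accuracy:', overall['correct'] / overall['num'])
" ${output_path}/take0
```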