cmmlu_en
This project evaluates its models on the CMMLU benchmark, which contains 11K multiple-choice questions covering 67 subjects. The following describes how to run predictions on the CMMLU dataset.
Download the evaluation dataset from the official CMMLU path and unzip it into the data folder:

```bash
wget https://huggingface.co/datasets/haonan-li/cmmlu/resolve/main/cmmlu_v1_0_1.zip
unzip cmmlu_v1_0_1.zip -d data
```

Place the data folder under the scripts/cmmlu directory of this project.
Execute the following script:

```bash
model_path=path/to/llama-3-chinese
output_path=path/to/your_output_dir

cd scripts/cmmlu
python eval.py \
    --model_path ${model_path} \
    --few_shot False \
    --with_prompt True \
    --output_dir ${output_path} \
    --input_dir data
```
- `model_path`: directory of the model to evaluate (a complete Llama-3-Chinese or Llama-3-Chinese-Instruct model, not a LoRA)
- `few_shot`: whether to use few-shot evaluation
- `ntrain`: when `few_shot=True`, specifies the number of few-shot instances (5-shot: `ntrain=5`); not applicable when `few_shot=False` (see the 5-shot example after this list)
- `with_prompt`: whether the model input includes an instruction template; intended specifically for the Llama-3-Instruct model
- `n_times`: number of evaluation repetitions; a corresponding number of folders will be created under `output_dir`
- `load_in_4bit`: load the model with 4-bit quantization
- `use_flash_attention_2`: use Flash Attention 2 for accelerated inference; otherwise SDPA is used for acceleration
- `output_dir`: output path for the evaluation results
- `input_dir`: path to the evaluation data
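For reference, a 5-shot run might look like the following (a sketch based on the parameters above; setting `--with_prompt False` for few-shot evaluation of the base model is an assumption, check `eval.py` for the exact behavior):

```bash
model_path=path/to/llama-3-chinese
output_path=path/to/your_output_dir

cd scripts/cmmlu
# 5-shot evaluation: enable few_shot and set ntrain=5
python eval.py \
    --model_path ${model_path} \
    --few_shot True \
    --ntrain 5 \
    --with_prompt False \
    --output_dir ${output_path} \
    --input_dir data
```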
- After the model prediction is complete, directories `outputs/take*` are generated, where `*` is a number from 0 to `n_times-1`, each storing the results of one of the `n_times` decoding runs.
- Each `outputs/take*` directory contains two JSON files, `submission.json` and `summary.json` (a sketch for inspecting both follows this list).
- `submission.json` stores the model's evaluation answers, formatted as:
```json
{
    "arts": {
        "0": "A",
        "1": "B",
        ...
    },
    "nutrition": {
        "0": "B",
        "1": "A",
        ...
    },
    ...
}
```
- `summary.json` contains the model's evaluation results on the 67 subjects, the 5 major categories, and the overall average. For example, the `All` field in the JSON file shows the overall average performance:
"All": {
"score": 0.39984458642721465,
"num": 11582,
"correct": 4631.0
}
where `score` is the accuracy (`correct / num`), `num` is the total number of test samples, and `correct` is the number of correctly answered questions.
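For a quick sanity check of both output files, the results can be inspected from the command line (a minimal sketch; the `take0` directory and the `${output_path}` variable from the run above are assumptions):

```bash
python -c "
import json, sys
out_dir = sys.argv[1]

# per-subject answer counts from submission.json
submission = json.load(open(out_dir + '/submission.json'))
for subject, answers in submission.items():
    print(subject, len(answers))

# recompute the overall accuracy from summary.json; it should match 'score'
overall = json.load(open(out_dir + '/summary.json'))['All']
print('accuracy:', overall['correct'] / overall['num'])
" ${output_path}/take0
```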