After running n evaluations, the outputs are placed in the "dataset_ntimes" folder, organized by model name, with the following structure:
dataset_ntimes/
├── model_1/
│   └── model_1_nt.json
├── model_2/
│   └── model_2_nt.json
└── ...
Set dir_path=path/to/dataset_ntimes in 1_Split_filename.py and run it to split each model's output into separate task files based on the "filename" field.
Set file_dir=path/to/dataset_ntimes in 2_Extract.py and run it to extract the answers from the model responses and generate JSONL files, which are placed in the result_ntimes folder with the following structure:
result_ntimes/
├── model_1/
│   ├── 0shot/
│   │   ├── 1t/
│   │   │   └── task_dir_1/
│   │   │       ├── task_1.json
│   │   │       ├── task_1.jsonl
│   │   │       ├── task_2.json
│   │   │       ├── task_2.jsonl
│   │   │       └── ...
│   │   └── 2t/
│   │       └── task_dir_1/
│   │           ├── task_1.json
│   │           ├── task_1.jsonl
│   │           └── ...
│   └── 3shot/
│       └── ...
└── ...
Set root_path=path/to/result_ntimes in 3_Evaluate.py and run it to obtain the evaluation results.
Modify the folder_path and output_file in 1_prompt_chem.py, then run the file to submit for LLM evaluation.
Modify the file_path in 2_L1_task_eval.py and run it to obtain metrics for multiple-choice, true/false, fill-in-the-blank, short answer, and calculation tasks. The results are saved to an Excel file.
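Tabulating per-task scores into an Excel sheet could look like this sketch; the `summarize` helper and the input dict are hypothetical, and `DataFrame.to_excel` requires the openpyxl package to be installed:

```python
import pandas as pd

def summarize(scores: dict, excel_path: str = "") -> pd.DataFrame:
    """Build one row per task from {task_name: accuracy} (hypothetical input
    shape) and optionally save the table to Excel."""
    df = pd.DataFrame(sorted(scores.items()), columns=["task", "accuracy"])
    df = df.round(4)  # round() only touches the numeric column
    if excel_path:
        df.to_excel(excel_path, index=False)  # needs openpyxl
    return df
```

Returning the DataFrame as well as writing the file makes the table easy to inspect or aggregate further in a notebook.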
Modify the folder_path and excel_path in 3_other_task_eval.py, then run the file to obtain metrics for abstract writing, outlining, reaction intermediates, single-step synthesis, multi-step synthesis, and physicochemical property tasks. The results are saved to an Excel file.
Run 1_prompt_chem to obtain the evaluation data from the LLM.
Run 2_LLM_evaluate to obtain the evaluation results from the LLM.
Run 3_code_evaluate to obtain the evaluation results using code.
https://huggingface.co/datasets/Ooo1/ChemEval
The ChemEval dataset is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) License.
Please cite our paper if you use our dataset.
@article{huang2024chemeval,
  title={ChemEval: A Comprehensive Multi-Level Chemical Evaluation for Large Language Models},
  author={Huang, Yuqing and Zhang, Rongyang and He, Xuesong and Zhi, Xuyang and Wang, Hao and Li, Xin and Xu, Feiyang and Liu, Deguang and Liang, Huadong and Li, Yi and others},
  journal={arXiv preprint arXiv:2409.13989},
  year={2024}
}