In this repository, we provide the data and replication package for the submitted paper.
Our experiments are implemented in Python. Before running them, please install the required libraries. We provide a requirements.txt file that lists all the libraries in our Python environment; you can install them by running the following command:
python -m pip install -r requirements.txt
|-- data                                    # the raw data of the benchmarks
|-- runtime                                 # the runtime files: code and tests generated by LLMs, the intermediate execution processes, and the results of our approach
|   |-- "benchmark"/solutions.jsonl         # the solutions generated by LLMs
|   |-- "benchmark"/test_cases.pkl          # the test cases generated by LLMs
|   |-- "benchmark"/fix_process_gpt         # the process of fixing tests using GPT
|   |-- "benchmark"/total_results           # the results obtained by our approach
|-- src
|   |-- generate.py                         # generates multiple diverse code candidates
|   |-- api.py                              # invokes the OpenAI API to generate code candidates
|   |-- evaluation.py, _evaluation.py       # evaluate the results
|   |-- execution.py, _execution.py         # execute the generated code
|   |-- contested_gt.py, contested_o1.py    # the entry points of our approaches
"benchmark" represents human_eval, human_eval_plus, or mbpp; since the size of HumanEvalPlus is big and you can download in from HuggingFace by your self and put it into data
.
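If you need HumanEvalPlus, the snippet below is a minimal sketch of fetching it with the HuggingFace datasets library; the dataset id evalplus/humanevalplus, the split, and the target path under data/ are assumptions and may need to be adapted to the layout this repository expects.

```python
# Hypothetical sketch: download HumanEvalPlus from HuggingFace and store it under data/.
# The dataset id, split, and output path are assumptions, not part of this repository.
from datasets import load_dataset

ds = load_dataset("evalplus/humanevalplus", split="test")   # assumed dataset id and split
ds.to_json("data/human_eval_plus/human_eval_plus.jsonl")    # assumed target location
```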
- Generate multiple code candidates
python generate.py data_type gen_model
where data_type represents the type of benchmark (human_eval, mbpp), and gen_model represents the model used to generate code (gpt-3.5, gpt-4o, etc.). We also provide the generated code, which you can use directly.
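For example, to generate code candidates for HumanEval with GPT-4o (both names taken from the options listed above):
python generate.py human_eval gpt-4o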
- Execute ConTested
python contested_gt.py data_type gen_model
This program is the entry point of our approach. It saves the results to "benchmark"/total_results and the fix process to "benchmark"/fix_process_gpt/.
python contested_o1.py data_type gen_model
This program is the entry point of our approach when using o1 to simulate user feedback.
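For example, assuming candidates have already been generated for HumanEval with GPT-4o:
python contested_gt.py human_eval gpt-4o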
- Evaluation
python evaluation.py data_type gen_model type isplus
where data_type represents the type of benchmark, gen_model represents the model you use, type represents the variant of our approach (GT or o1), and isplus indicates, when using HumanEval, whether to evaluate against HumanEvalPlus. We provide the results obtained in our experiments in "benchmark"/total_results, so you can evaluate them directly.
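For example, to evaluate the GT variant on HumanEval with GPT-4o (the exact value expected for isplus depends on how evaluation.py parses its arguments, so treat the last argument as an assumption):
python evaluation.py human_eval gpt-4o GT False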
In the file assign_problem.csv, we list the assignment of problems to users. The "User ID" represents the unique identifier for each user, and there are 12 users in total. The "Problem Index" indicates the position of each problem assigned to the user, with each user being assigned 20 problems. The "Problem ID" refers to the identifier of the specific problem, and there are 40 problems in total. The "Setting" refers to the task the user needs to solve for the problem.
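The snippet below is a minimal sketch of loading the assignment file and checking the counts described above with pandas; the column spellings are taken from this description and are assumptions about the actual CSV header.

```python
# Minimal sketch: load assign_problem.csv and sanity-check the counts described above.
# Column names follow this README's description and are assumptions about the header.
import pandas as pd

df = pd.read_csv("assign_problem.csv")
assert df["User ID"].nunique() == 12                 # 12 users in total
assert df["Problem ID"].nunique() == 40              # 40 distinct problems
assert (df.groupby("User ID").size() == 20).all()    # 20 problems assigned per user
print(df.head())
```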