In this repository, we provide the data and replication package for the submitted paper.
Our experiments are implemented in Python. Before running them, please install the required libraries. We provide a requirements.txt file that lists all the libraries in our Python environment; you can install them by running the following command:
python -m pip install -r requirements.txt
|-- data                                    # the raw data of the benchmarks
|-- runtime                                 # the runtime files: code and tests generated by LLMs, the intermediate execution processes, and the results of our approach
|   |-- "benchmark"/solutions.jsonl         # the solutions generated by LLMs
|   |-- "benchmark"/test_cases.pkl          # the test cases generated by LLMs
|   |-- "benchmark"/fix_process_gpt         # the process of fixing tests using GPT
|   |-- "benchmark"/total_results           # the results obtained by our approach
|-- src
|   |-- generate.py                         # generates multiple diverse code candidates
|   |-- api.py                              # invokes the OpenAI API to generate code candidates
|   |-- evaluation.py, _evaluation.py       # evaluate the results
|   |-- execution.py, _execution.py         # execute the generated code
|   |-- contested_gt.py, contested_o1.py    # the entry points of our approaches
"benchmark" represents human_eval, human_eval_plus, or mbpp; since the size of HumanEvalPlus is big and you can download in from HuggingFace by your self and put it into data
.
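If you need HumanEvalPlus, the snippet below is a minimal sketch of fetching it with the HuggingFace datasets library; the dataset id evalplus/humanevalplus, the split, and the target path under data/ are assumptions and may need to be adapted to the layout this repository expects.

```python
# Hypothetical sketch: download HumanEvalPlus from HuggingFace and store it under data/.
# The dataset id, split, and output path are assumptions, not part of this repository.
from datasets import load_dataset

ds = load_dataset("evalplus/humanevalplus", split="test")   # assumed dataset id and split
ds.to_json("data/human_eval_plus/human_eval_plus.jsonl")    # assumed target location
```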
- Generate multiple code candidates
python generate.py data_type gen_model
where data_type represents the type of benchmark (human_eval, mbpp), and gen_model represents the model used to generate code (gpt-3.5, gpt-4o, etc.). We also provide the generated code, which you can use directly.
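For example, to generate code candidates for HumanEval with GPT-4o (both names taken from the options listed above):
python generate.py human_eval gpt-4o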
- Execute ConTested
python contested_gt.py data_type gen_model
This program is the entry point of our approach. It saves the results to "benchmark"/total_results and the fix process to "benchmark"/fix_process_gpt/.
python contested_o1.py data_type gen_model
This program is the entry point of our approach when using o1 to simulate user feedback.
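For example, assuming candidates have already been generated for HumanEval with GPT-4o:
python contested_gt.py human_eval gpt-4o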
- Evaluation
python evaluation.py data_type gen_model type isplus
where data_type represents the type of benchmark, gen_model represents the model you use, type represents the variant of our approach (GT or o1), and isplus indicates, when using HumanEval, whether to evaluate against HumanEvalPlus. We provide the results obtained in our experiments in "benchmark"/total_results, so you can evaluate them directly.
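For example, to evaluate the GT variant on HumanEval with GPT-4o (the exact value expected for isplus depends on how evaluation.py parses its arguments, so treat the last argument as an assumption):
python evaluation.py human_eval gpt-4o GT False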
In the file assign_problem.csv, we list the assignment of problems to users. The "User ID" represents the unique identifier for each user, and there are 12 users in total. The "Problem Index" indicates the position of each problem assigned to the user, with each user being assigned 20 problems. The "Problem ID" refers to the identifier of the specific problem, and there are 40 problems in total. The "Setting" refers to the task the user needs to solve for the problem.
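The snippet below is a minimal sketch of loading the assignment file and checking the counts described above with pandas; the column spellings are taken from this description and are assumptions about the actual CSV header.

```python
# Minimal sketch: load assign_problem.csv and sanity-check the counts described above.
# Column names follow this README's description and are assumptions about the header.
import pandas as pd

df = pd.read_csv("assign_problem.csv")
assert df["User ID"].nunique() == 12                 # 12 users in total
assert df["Problem ID"].nunique() == 40              # 40 distinct problems
assert (df.groupby("User ID").size() == 20).all()    # 20 problems assigned per user
print(df.head())
```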