DJjjjhao/replication_package

This repository provides the data and the replication package for the submitted paper.

Preparation

Our experiments are implemented in Python. Before running them, please install the required libraries. The provided requirements.txt file lists all the libraries in our Python environment; you can install them with the following command:

python -m pip install -r requirements.txt

Directory Structure

|-- data # the raw data of the benchmarks
|-- runtime # the runtime files, including code and tests generated by LLMs, the intermediate execution processes, and the results of our approach
    |-- "benchmark"/solutions.jsonl # the solutions generated by LLMs
    |-- "benchmark"/test_cases.pkl # the test cases generated by LLMs
    |-- "benchmark"/fix_process_gpt # the process of fixing tests using GPT
    |-- "benchmark"/total_results # the results obtained by our approach
|-- src
    |-- generate.py # the program that generates multiple diverse code candidates
    |-- api.py # the program that invokes the OpenAI API to generate code candidates
    |-- evaluation.py, _evaluation.py # the programs that evaluate the results
    |-- execution.py, _execution.py # the programs that execute the code
    |-- contested_gt.py, contested_o1.py # the entry points of our approaches

"benchmark" represents human_eval, human_eval_plus, or mbpp; since the size of HumanEvalPlus is big and you can download in from HuggingFace by your self and put it into data.

Reproduction

  • Generate multiple code candidates

    python generate.py data_type gen_model

    where data_type represents the benchmark (human_eval, mbpp) and gen_model represents the model used to generate code (gpt-3.5, gpt-4o, etc.). We also provide the generated code, which you can use directly.

  • Execute ConTested

    python contested_gt.py data_type gen_model

    This program is the entry point of our approach. It saves the results to "benchmark"/total_results and the fix process to "benchmark"/fix_process_gpt/.

    python contested_o1.py data_type gen_model

    This program is the entry point of our approach when using o1 to simulate user feedback.

  • Evaluation

    python evaluation.py data_type gen_model type isplus

    where data_type represents the benchmark, gen_model represents the model you use, type represents the variant of our approach (GT or o1), and isplus indicates, when the benchmark is HumanEval, whether to evaluate against HumanEvalPlus. We provide the results obtained from our experiments in "benchmark"/total_results, so you can evaluate them directly. A minimal end-to-end driver sketch is shown after this list.
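
The sketch below simply chains the three commands above for one benchmark/model pair. The concrete argument values (human_eval, gpt-4o) are only examples, it assumes it is run from the src directory, and the exact string expected for isplus ("False" below) is an assumption; check evaluation.py for the accepted format.

    # End-to-end driver sketch (run from src/); argument values are examples.
    import subprocess
    import sys

    DATA_TYPE = "human_eval"   # or "mbpp"
    GEN_MODEL = "gpt-4o"       # or "gpt-3.5", etc.

    steps = [
        [sys.executable, "generate.py", DATA_TYPE, GEN_MODEL],                   # generate code candidates
        [sys.executable, "contested_gt.py", DATA_TYPE, GEN_MODEL],               # run ConTested (GT setting)
        [sys.executable, "evaluation.py", DATA_TYPE, GEN_MODEL, "GT", "False"],  # evaluate ("False" format is assumed)
    ]

    for cmd in steps:
        print("Running:", " ".join(cmd))
        subprocess.run(cmd, check=True)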

User Study

In the file assign_problem.csv, we list the assignment of problems to users. "User ID" is the unique identifier of each user; there are 12 users in total. "Problem Index" indicates the position of each problem assigned to the user; each user is assigned 20 problems. "Problem ID" is the identifier of the specific problem; there are 40 problems in total. "Setting" is the task the user needs to solve for that problem.
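
A small inspection sketch for this file, assuming the column names described above; pandas is used here only for convenience and is not required by the replication scripts.

    import pandas as pd

    df = pd.read_csv("assign_problem.csv")

    # Each of the 12 users should be assigned 20 problems.
    print(df.groupby("User ID")["Problem ID"].count())

    # There should be 40 distinct problems overall.
    print(df["Problem ID"].nunique())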
