As Large Language Models (LLMs) are increasingly applied to code generation tasks, concerns about their fairness have become more prominent. Existing research focuses primarily on bias in text generation and dialogue, while fairness issues in code generation remain relatively under-explored. Current studies of code generation fairness often rely on custom task settings and borrow evaluation frameworks designed for discriminative models, identifying biases only in a simplistic manner, and the corresponding mitigation strategies are mostly limited to basic techniques such as prompt engineering. Given the enormous number of parameters in LLMs and the high cost of training and fine-tuning, directly modifying model parameters or retraining is often impractical. In contrast, post-processing methods, which require no changes to the model's internal structure and incur low computational cost, offer a viable way to address fairness in code generation. This work therefore proposes CodeFair, a systematic framework for fairness evaluation and optimization that aims to uncover potential biases in code generation and enhance fairness through lightweight post-processing strategies.
codefair/
|-- figures : figure files used in this work
|-- human-eval : HumanEval dataset
|-- livecodebench : LiveCodeBench dataset and related code
|-- log : directory for running logs
|-- results : directory for experimental results
|-- roles : social roles for each sensitive attribute
|-- codegen.py : code generation and evaluation for HumanEval(+) and MBPP(+)
|-- metrics.py : fairness evaluation metrics
|-- README.md : user guide for this repository
|-- utils.py : utility functions
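For orientation, the sketch below mimics the kind of role-conditioned fairness check this layout supports: generate solutions for the same task under prompts for different social roles (see `roles/`) and compare the resulting pass rates. It is a minimal, hypothetical illustration; `fairness_gap` and the toy results are placeholders, not the actual API of `codegen.py` or `metrics.py`.

```python
# Illustrative sketch only: fairness_gap() and the toy results below are
# placeholders, NOT the actual interface of metrics.py or codegen.py.

from typing import Dict, List


def pass_rate(passed: List[bool]) -> float:
    """Fraction of generated solutions that pass all test cases."""
    return sum(passed) / len(passed) if passed else 0.0


def fairness_gap(per_role_passed: Dict[str, List[bool]]) -> float:
    """Largest pass-rate difference between any two social roles."""
    rates = [pass_rate(v) for v in per_role_passed.values()]
    return max(rates) - min(rates)


if __name__ == "__main__":
    # Toy outcome of 5 generations per role-conditioned prompt.
    results = {
        "young developer": [True, True, True, False, True],
        "elderly developer": [True, False, True, False, False],
    }
    print(f"fairness gap: {fairness_gap(results):.2f}")  # -> 0.40
```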
To comprehensively evaluate the effectiveness of the CodeFair framework, this study selects five widely used code generation datasets. The following table summarizes the basic information of each dataset. Note that "# Test Cases" refers to the average number of test cases per task. These datasets cover a wide range of scenarios, from fundamental algorithm implementation to complex programming tasks, enabling a thorough assessment of CodeFair’s capability in fairness evaluation and optimization across different task types and levels of complexity.
NOTE: All datasets used in our study can be obtained from their respective homepages (HumanEval, HumanEval+, MBPP, MBPP+, LiveCodeBench).
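If you want to browse the HumanEval+/MBPP+ tasks directly, the snippet below is a small sketch assuming the `evalplus` package (`pip install evalplus`) is available; it only inspects the data and is independent of the commands shown later.

```python
# Minimal sketch: browse HumanEval+ / MBPP+ tasks with the evalplus package.
# Assumes `pip install evalplus`; used here only to inspect the data.

from evalplus.data import get_human_eval_plus, get_mbpp_plus

humaneval_plus = get_human_eval_plus()  # dict: task_id -> problem dict
mbpp_plus = get_mbpp_plus()

print(f"HumanEval+ tasks: {len(humaneval_plus)}")
print(f"MBPP+ tasks: {len(mbpp_plus)}")

sample = humaneval_plus["HumanEval/0"]
print(sample["entry_point"])   # name of the function to implement
print(sample["prompt"][:200])  # docstring/signature prompt given to the model
```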
- Python: 3.10.16
- CUDA Version: 11.4
- GPU: NVIDIA GeForce RTX 3090
- OS: Linux (Ubuntu 20.04.6 LTS)
- CPU: Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz
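As a quick sanity check that the GPU setup above is visible from Python, you can run the small sketch below (it assumes PyTorch is installed in the conda environment used for the experiments):

```python
# Environment sanity check; assumes PyTorch is installed.
import sys

import torch

print("Python:", sys.version.split()[0])
print("CUDA available:", torch.cuda.is_available())
print("CUDA (torch build):", torch.version.cuda)
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```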
To help other researchers reproduce CodeFair, we provide a Docker image for our experiments. It can be obtained with the following commands:
docker pull junjie1003/codefair:latest
docker run -it --name codefair --gpus all --net host -v /your/local/path/:/data --shm-size="200g" junjie1003/codefair:latest /bin/bash
cd /root/codefair
You will then be inside the container. Remember to replace /your/local/path/ with your actual local path. 😊
conda activate codefair
# GPT-4o
nohup python codegen.py --dataset humaneval --model gpt-4o >log/humaneval_gpt4o.log 2>&1 &
nohup python codegen.py --dataset mbpp --model gpt-4o >log/mbpp_gpt4o.log 2>&1 &
# GPT-4o-mini
nohup python codegen.py --dataset humaneval --model gpt-4o-mini >log/humaneval_gpt4omini.log 2>&1 &
nohup python codegen.py --dataset mbpp --model gpt-4o-mini >log/mbpp_gpt4omini.log 2>&1 &
# Claude-3.5-haiku
nohup python codegen.py --dataset humaneval --model claude-3-5-haiku-20241022 >log/humaneval_claude3.5haiku.log 2>&1 &
nohup python codegen.py --dataset mbpp --model claude-3-5-haiku-20241022 >log/mbpp_claude3.5haiku.log 2>&1 &
# DeepSeek-V3
nohup python codegen.py --dataset humaneval --model deepseek-chat >log/humaneval_deepseek.log 2>&1 &
nohup python codegen.py --dataset mbpp --model deepseek-chat >log/mbpp_deepseek.log 2>&1 &
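After the jobs above finish, the generations and evaluation outcomes are saved under results/ (with logs under log/). The sketch below shows one possible way to aggregate them into per-role pass rates; the file naming pattern and JSON fields are assumptions for illustration only, so adapt them to the files codegen.py actually produces.

```python
# Hypothetical aggregation of per-role results -- the file names and JSON
# fields below are assumptions, not the actual output format of codegen.py.

import json
from collections import defaultdict
from pathlib import Path

RESULTS_DIR = Path("results")         # see repository layout above
counts = defaultdict(lambda: [0, 0])  # role -> [passed, total]

for path in RESULTS_DIR.glob("humaneval_gpt-4o_*.json"):     # hypothetical naming
    for record in json.loads(path.read_text()):
        counts[record["role"]][0] += int(record["passed"])   # assumed fields
        counts[record["role"]][1] += 1

for role, (ok, total) in sorted(counts.items()):
    print(f"{role:30s} pass@1 = {ok / total:.3f}")
```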
conda activate livecodebench
cd livecodebench
# GPT-4o
nohup python -m lcb_runner.runner.main --model gpt-4o --evaluate --multiprocess 48 >../log/livecodebench_gpt4o.log 2>&1 &
python -m lcb_runner.evaluation.compute_scores --model gpt-4o
# GPT-4o-mini
nohup python -m lcb_runner.runner.main --model gpt-4o-mini --evaluate --multiprocess 48 >../log/livecodebench_gpt4omini.log 2>&1 &
python -m lcb_runner.evaluation.compute_scores --model gpt-4o-mini
# Claude-3.5-haiku
nohup python -m lcb_runner.runner.main --model claude-3-5-haiku-20241022 --evaluate --multiprocess 48 >../log/livecodebench_claude3.5haiku.log 2>&1 &
python -m lcb_runner.evaluation.compute_scores --model claude-3-5-haiku-20241022
# DeepSeek-V3
nohup python -m lcb_runner.runner.main --model deepseek-chat --evaluate --multiprocess 48 >../log/livecodebench_deepseek.log 2>&1 &
python -m lcb_runner.evaluation.compute_scores --model deepseek-chat