This repository contains the CoDocBench dataset, a dataset for code-documentation alignment in software maintenance. The dataset is composed of 4,573 code-documentation pairs extracted from 200 open-source Python projects.
To use the CoDocBench dataset mentioned in the paper, you can find the dataset in the dataset
folder. The folder contains the following files:
codocbench.jsonl
: The main dataset file containing 4573 code-documentation pairs.test.jsonl
: The test dataset file containing 2273 code-documentation pairs from a random selection of 50% of the projects.train.jsonl
: The training dataset file containing 2300 code-documentation pairs from the remaining 50% of the projects.
The dataset is in JSONL format, and each line contains a JSON file with the following fields:
{
"file": "string", // File name or path.
"function": "string", // Fully qualified function/method name.
"version_data": [ // List of version-specific data.
{
"version1": "string", // Version identifier.
"docstring_lines": { // Docstring line range.
"start_line": "integer",
"end_line": "integer"
},
"code_lines": { // Code line range.
"start_line": "integer",
"end_line": "integer"
},
"commit_date_time": "string",// Timestamp of the commit.
"commit_sha": "string", // Commit hash.
"commit_message": "string", // Commit message.
"docstring": "string", // Function docstring.
"code": "string" // Function code.
},
{
"version2": "string", // Version identifier.
"docstring_lines": { // Docstring line range.
"start_line": "integer",
"end_line": "integer"
},
"code_lines": { // Code line range.
"start_line": "integer",
"end_line": "integer"
},
"commit_date_time": "string",// Timestamp of the commit.
"commit_sha": "string", // Commit hash.
"commit_message": "string", // Commit message.
"docstring": "string", // Function docstring.
"code": "string" // Function code.
}
],
"diff_code": "string", // Unified diff for the function code.
"diff_docstring": "string", // Unified diff for the docstring.
"whitespace_only_code": "boolean", // Indicates if code diff is whitespace-only.
"whitespace_only_docstring": "boolean", // Indicates if docstring diff is whitespace-only.
"file_path": "string", // Full file path.
"filename": "string", // File name.
"project": "string", // Project name.
"owner": "string" // Owner of the repository.
}
To extract your own dataset, follow these steps:
-
Clone the repository:
git clone https://github.com/kunpai/codocbench.git
-
Install the required dependencies:
./setup.sh
NOTE: This script sets up a virtual environment and installs the required dependencies. It defaults to Python version 3.13.
If you have a different Python version:
./setup.sh <PYTHON_VERSION>
where
<PYTHON_VERSION>
is the version of Python you want to use.If you prefer to use your own environment, you can install the dependencies manually by running:
pip install -r requirements
(Be sure to give the appropriate permissions to the script by running
chmod +x setup.sh
) -
Run the virtual environment:
source codocbench-env/bin/activate
-
To extract your own dataset, you can use the
parse.py
script. The script has a few variants that you can use to customize the extraction process.-
Variant 1: Extracting from a single project
To extract code-documentation pairs from a single project, you can use the following command:
python parse.py owner repo
where
owner
is the owner of the repository andrepo
is the name of the repository. -
Variant 2: Extracting from multiple projects
To extract code-documentation pairs from multiple projects, you can use the following command:
python parse.py
This command will extract code-documentation pairs from all the projects listed in the
projects.csv
file. Ensure that theprojects.csv
file contains the owner and repository name of the projects you want to extract, separated by a comma.The
projects.csv
file in this repository contains the owner and repository name of the projects used in the CoDocBench dataset. -
Variant 3: Extracting from a specific file
To extract code-documentation pairs from a specific file, you can use the following command:
python parse.py owner repo path
where
owner
is the owner of the repository,repo
is the name of the repository, andpath
is the path to the file.NOTE: The path should be relative to the root of the repository, and it should exist in the latest commit of the repository.
-
-
The extracted code-documentation pairs will be saved in the
differ_files/
folder in JSONL format. The file name will be in the formatcodocbench.jsonl
.
The parse.py
script also records solitary docstring changes and solitary code changes in the differ_files/
folder. The file name will be in the format combined_diff_mapping_docstring_.jsonl
and combined_diff_mapping_code_.jsonl
, respectively. However, these are not post-processed and may contain false positives.
Example scripts of using the dataset are provided in the examples
folder. The scripts demonstrate how to load the dataset and use it for various tasks.
For most of the examples, you can run the script using the following command:
python examples/<FILENAME>.py <PATH_TO_DATASET>
where <FILENAME>
is the name of the script and <PATH_TO_DATASET>
is the path to the dataset file.
In case of the 3-shot learning examples, you can run the script using the following command:
python examples/<FILENAME>.py <PATH_TO_DATASET> <PATH_TO_TRAIN_DATASET>
where <FILENAME>
is the name of the script, <PATH_TO_DATASET>
is the path to the dataset file, and <PATH_TO_TRAIN_DATASET>
is the path to the training dataset file.
All these files load meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo
as the default model. You can change the model by running the script with the --model
flag:
python examples/<FILENAME>.py <PATH_TO_DATASET> --model=<MODEL_NAME>
where <MODEL_NAME>
is the name of the model you want to use.