CoDocBench: A Dataset for Code-Documentation Alignment in Software Maintenance

This repository contains CoDocBench, a dataset for code-documentation alignment in software maintenance. It comprises 4,573 code-documentation pairs extracted from 200 open-source Python projects.

Dataset Description

The CoDocBench dataset described in the paper is in the dataset folder, which contains the following files:

codocbench.jsonl: The main dataset file containing 4,573 code-documentation pairs.
test.jsonl: The test dataset file containing 2,273 code-documentation pairs from a random selection of 50% of the projects.
train.jsonl: The training dataset file containing 2,300 code-documentation pairs from the remaining 50% of the projects.
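
Because the split is made at the project level, no project should appear in both train.jsonl and test.jsonl. The following is a minimal sketch for checking this, assuming each record carries the "owner" and "project" fields listed in the schema in this section. The helper names project_ids and shared_projects are hypothetical and are not part of this repository.

```python
import json
from pathlib import Path


def project_ids(jsonl_path):
    """Return the set of (owner, project) pairs found in a JSONL file."""
    ids = set()
    with Path(jsonl_path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:  # tolerate blank lines
                continue
            record = json.loads(line)
            ids.add((record["owner"], record["project"]))
    return ids


def shared_projects(train_path, test_path):
    """Return the projects present in both files (hypothetical helper)."""
    return project_ids(train_path) & project_ids(test_path)
```

Calling shared_projects("dataset/train.jsonl", "dataset/test.jsonl") would be expected to return an empty set if the split is project-disjoint as described.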

The dataset is in JSONL format: each line contains a single JSON object with the following fields:

{
  "file": "string",                // File name or path.
  "function": "string",            // Fully qualified function/method name.
  "version_data": [                // List of version-specific data.
    {
      "version1": "string",         // Version identifier.
      "docstring_lines": {         // Docstring line range.
        "start_line": "integer",
        "end_line": "integer"
      },
      "code_lines": {              // Code line range.
        "start_line": "integer",
        "end_line": "integer"
      },
      "commit_date_time": "string",// Timestamp of the commit.
      "commit_sha": "string",      // Commit hash.
      "commit_message": "string",  // Commit message.
      "docstring": "string",       // Function docstring.
      "code": "string"             // Function code.
    },
    {
      "version2": "string",         // Version identifier.
      "docstring_lines": {         // Docstring line range.
        "start_line": "integer",
        "end_line": "integer"
      },
      "code_lines": {              // Code line range.
        "start_line": "integer",
        "end_line": "integer"
      },
      "commit_date_time": "string",// Timestamp of the commit.
      "commit_sha": "string",      // Commit hash.
      "commit_message": "string",  // Commit message.
      "docstring": "string",       // Function docstring.
      "code": "string"             // Function code.
    }
  ],
  "diff_code": "string",           // Unified diff for the function code.
  "diff_docstring": "string",      // Unified diff for the docstring.
  "whitespace_only_code": "boolean",  // Indicates if code diff is whitespace-only.
  "whitespace_only_docstring": "boolean", // Indicates if docstring diff is whitespace-only.
  "file_path": "string",           // Full file path.
  "filename": "string",            // File name.
  "project": "string",             // Project name.
  "owner": "string"                // Owner of the repository.
}
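
As an illustration of the schema above, here is a minimal sketch of reading the dataset in Python. It assumes that "version_data" always holds exactly two entries, with the older version first. The function names load_records, version_pair, and is_substantive are hypothetical helpers, not part of this repository.

```python
import json


def load_records(jsonl_path):
    """Yield one dict per non-empty line of a JSONL file."""
    with open(jsonl_path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def version_pair(record):
    """Return the two entries of "version_data" as (old, new).

    Assumption: the list has exactly two entries, older version first.
    """
    old, new = record["version_data"]
    return old, new


def is_substantive(record):
    """True when neither diff is flagged as whitespace-only."""
    return not (record["whitespace_only_code"]
                or record["whitespace_only_docstring"])
```

A typical loop would iterate over load_records("dataset/codocbench.jsonl"), keep the records for which is_substantive returns True, and read old["docstring"], new["docstring"], old["code"], and new["code"] from the pair returned by version_pair.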

Extracting Your Own Dataset

To extract your own dataset, follow these steps:

Clone the repository:

git clone https://github.com/kunpai/codocbench.git

Install the required dependencies:
```
./setup.sh
```
NOTE: This script sets up a virtual environment and installs the required dependencies. It defaults to Python version 3.13.

If you have a different Python version:
```
./setup.sh <PYTHON_VERSION>
```
where <PYTHON_VERSION> is the version of Python you want to use.

If you prefer to use your own environment, you can install the dependencies manually by running:
```
pip install -r requirements
```
(If needed, make setup.sh executable before running it: chmod +x setup.sh)
Activate the virtual environment:
```
source codocbench-env/bin/activate
```
To extract your own dataset, use the parse.py script. It can be invoked in three ways, depending on what you want to extract.
1. Variant 1: Extracting from a single project
  
  To extract code-documentation pairs from a single project, you can use the following command:
```
python parse.py owner repo
```
  where owner is the owner of the repository and repo is the name of the repository.
2. Variant 2: Extracting from multiple projects
  
  To extract code-documentation pairs from multiple projects, you can use the following command:
```
python parse.py
```
  This command extracts code-documentation pairs from all the projects listed in the projects.csv file. Each entry in projects.csv should give the owner and repository name of a project you want to extract, separated by a comma.
  
  The projects.csv file in this repository contains the owner and repository name of the projects used in the CoDocBench dataset.
3. Variant 3: Extracting from a specific file
  
  To extract code-documentation pairs from a specific file, you can use the following command:
```
python parse.py owner repo path
```
  where owner is the owner of the repository, repo is the name of the repository, and path is the path to the file.
  
  NOTE: The path should be relative to the root of the repository, and it should exist in the latest commit of the repository.
The extracted code-documentation pairs are saved in the differ_files/ folder in JSONL format, in a file named codocbench.jsonl.

The parse.py script also records docstring-only changes and code-only changes in the differ_files/ folder. The file names are in the format combined_diff_mapping_docstring_.jsonl and combined_diff_mapping_code_.jsonl, respectively. These files are not post-processed and may contain false positives.
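
To make the projects.csv format and the three invocation variants concrete, here is a minimal sketch in Python. It assumes projects.csv has two comma-separated columns (owner, repository name) and no header row, which is not confirmed here. The helpers read_projects and parse_command are hypothetical and only build the command lines shown above; they do not reproduce what parse.py does internally.

```python
import csv


def read_projects(csv_path):
    """Return a list of (owner, repo) tuples from a two-column CSV file.

    Assumption: no header row; blank or incomplete rows are skipped.
    """
    projects = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle):
            cells = [cell.strip() for cell in row]
            if len(cells) < 2 or not cells[0] or not cells[1]:
                continue
            projects.append((cells[0], cells[1]))
    return projects


def parse_command(owner, repo, path=None):
    """Build the argument list for the single-project or single-file variant."""
    command = ["python", "parse.py", owner, repo]
    if path is not None:
        # The path is relative to the repository root (see the note above).
        command.append(path)
    return command
```

The list returned by parse_command can be passed to subprocess.run from the repository root, for instance to process only a subset of the projects returned by read_projects.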

Examples

Example scripts that use the dataset are provided in the examples folder. They demonstrate how to load the dataset and use it for various tasks.

For most of the examples, you can run the script using the following command:

python examples/<FILENAME>.py <PATH_TO_DATASET>

where <FILENAME> is the name of the script and <PATH_TO_DATASET> is the path to the dataset file.

For the 3-shot learning examples, run the script using the following command:

python examples/<FILENAME>.py <PATH_TO_DATASET> <PATH_TO_TRAIN_DATASET>

where <FILENAME> is the name of the script, <PATH_TO_DATASET> is the path to the dataset file, and <PATH_TO_TRAIN_DATASET> is the path to the training dataset file.

All of these scripts use meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo as the default model. To use a different model, run the script with the --model flag:

python examples/<FILENAME>.py <PATH_TO_DATASET> --model=<MODEL_NAME>

where <MODEL_NAME> is the name of the model you want to use.
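
The command lines above map onto a small argument parser. The sketch below is a hypothetical illustration of that interface using argparse; the actual scripts in the examples folder may parse their arguments differently, and the argument names dataset and train_dataset are assumptions.

```python
import argparse

DEFAULT_MODEL = "meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo"


def build_parser():
    """Hypothetical parser mirroring the command lines shown above."""
    parser = argparse.ArgumentParser(description="Run a CoDocBench example.")
    parser.add_argument("dataset", help="path to the dataset file")
    # Only the 3-shot examples take a second positional argument.
    parser.add_argument("train_dataset", nargs="?", default=None,
                        help="path to the training dataset file (3-shot only)")
    parser.add_argument("--model", default=DEFAULT_MODEL,
                        help="name of the model to use")
    return parser
```

With this parser, omitting --model falls back to the default model, and both --model=<MODEL_NAME> and --model <MODEL_NAME> are accepted.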
