'GraphCodeBERT' is a transformer-based pre-trained model jointly trained on code, code comments, and the data-flow graph of the code. This repository contains the code for automatically generating code documentation in Python using the 'GraphCodeBERT' model. The Jupyter notebook 'Demo.ipynb' contains all the code for loading the 'CodeSearchNet' dataset for the task, pre-processing the dataset, training the 'GraphCodeBERT' model on the dataset, and testing it. The following sections describe the steps executed in the 'Demo.ipynb' notebook.
- The notebook has been tested on Colab. Steps to clone the repo, download the dataset, and install other dependencies are given in the notebook. Because of the large size of the models and dataset, it is recommended to run the notebook on Colab with Google Drive mounted.
- First, we download the 'CodeSearchNet' dataset for the Python language. The 'curr_lang' variable can be updated to download the dataset for a different language.
- The 'sample' variable specifies the percentage of the dataset to be downloaded. For our use case, we used 10% of the 'CodeSearchNet' dataset.
- Then we remove all code-comment pairs that contain non-ASCII characters.
- We remove the comments from the code to prevent the model from cheating, since it could otherwise simply copy the comment from the method body.
- After removing the comments from the code, we remove all code-comment pairs where the code is shorter than the comment, pairs with empty comments, and duplicate code-comment pairs.
- We also remove all code-comment pairs whose method length falls beyond the 95th percentile of method lengths, and convert everything to lowercase.
- Finally, we write the 'training', 'testing', and 'validation' dataframes to 'python/train.jsonl', 'python/test.jsonl', and 'python/valid.jsonl', which is the file format expected by the models (a pandas sketch of the whole pipeline follows this list).
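The sketch below illustrates these preprocessing steps in plain pandas. The column names ('code', 'docstring') and the 'strip_comments' heuristic are illustrative assumptions; the actual notebook may name and implement things differently.

```python
import json
import re

import pandas as pd

def strip_comments(code: str) -> str:
    """Rough heuristic: drop '#' comments and docstring-like literals."""
    code = re.sub(r'#.*', '', code)  # note: also hits '#' inside strings
    code = re.sub(r"('''|\"\"\").*?\1", '', code, flags=re.DOTALL)
    return code

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Keep only ASCII code-comment pairs.
    is_ascii = lambda s: s.isascii()
    df = df[df['code'].map(is_ascii) & df['docstring'].map(is_ascii)]
    # Remove comments from the code so the model cannot copy them.
    df = df.assign(code=df['code'].map(strip_comments))
    # Drop pairs where the code is shorter than the comment,
    # pairs with empty comments, and duplicates.
    df = df[df['code'].str.len() >= df['docstring'].str.len()]
    df = df[df['docstring'].str.strip() != '']
    df = df.drop_duplicates(subset=['code', 'docstring'])
    # Drop methods beyond the 95th percentile of method lengths.
    cutoff = df['code'].str.len().quantile(0.95)
    df = df[df['code'].str.len() <= cutoff]
    # Lowercase everything.
    return df.assign(code=df['code'].str.lower(),
                     docstring=df['docstring'].str.lower())

def to_jsonl(df: pd.DataFrame, path: str) -> None:
    """Write one {'code', 'docstring'} JSON object per line."""
    with open(path, 'w') as f:
        for _, row in df.iterrows():
            f.write(json.dumps({'code': row['code'],
                                'docstring': row['docstring']}) + '\n')
```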
The task is to generate natural-language comments for Python code; performance is evaluated with the smoothed BLEU-4 score.
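For reference, a smoothed BLEU-4 score for a single prediction can be computed with NLTK as shown below. The exact smoothing method used by the official evaluation script may differ, so this is an illustration rather than the official scorer.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = 'returns the sum of two numbers'.split()
hypothesis = 'return the sum of two integers'.split()

# 4-gram BLEU with smoothing so that short sentences with missing
# higher-order n-gram matches do not collapse to a zero score.
score = sentence_bleu([reference], hypothesis,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method4)
print(f'smoothed BLEU-4: {score:.4f}')
```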
The following commands can be used to train and test the different combinations of model architectures, provided the dataset has been prepared as described in the 'Demo.ipynb' notebook.
```python
lang = 'python'                 # programming language
lr = 5e-5                       # learning rate
batch_size = 16                 # reduce if the Colab GPU runs out of memory
beam_size = 10                  # beam width used during decoding
source_length = 256             # maximum number of source (code) tokens
target_length = max_cmt_len     # maximum comment length, computed earlier in the notebook
data_dir = '.'
output_dir = f'model/{lang}'
train_file = f'{data_dir}/{lang}/train.jsonl'
dev_file = f'{data_dir}/{lang}/valid.jsonl'
epochs = 5
pretrained_model = 'microsoft/graphcodebert-base'
# Path to a previously saved checkpoint; omit --load_model_path below
# when training from scratch.
load_model_path = f'model/{lang}/checkpoint-best-bleu/pytorch_model.bin'

! python run.py \
  --do_train \
  --do_eval \
  --do_lower_case \
  --model_type roberta \
  --model_name_or_path {pretrained_model} \
  --train_filename {train_file} \
  --dev_filename {dev_file} \
  --output_dir {output_dir} \
  --max_source_length {source_length} \
  --max_target_length {target_length} \
  --beam_size {beam_size} \
  --train_batch_size {batch_size} \
  --eval_batch_size {batch_size} \
  --learning_rate {lr} \
  --num_train_epochs {epochs} \
  --load_model_path {load_model_path}
```
To evaluate a trained model on the test split, run:

```python
batch_size = 64                 # evaluation can use a larger batch than training
lang = curr_lang                # language of the test data (set earlier in the notebook)
model_lang = 'python'           # language the model was fine-tuned on
data_dir = '.'
beam_size = 10
source_length = 256
target_length = max_cmt_len
output_dir = f'model/{model_lang}'
dev_file = f'{data_dir}/{lang}/valid.jsonl'
test_file = f'{data_dir}/{lang}/test.jsonl'
test_model = f'{output_dir}/checkpoint-best-bleu/pytorch_model.bin'

! python run.py \
  --do_test \
  --model_type roberta \
  --model_name_or_path microsoft/graphcodebert-base \
  --load_model_path {test_model} \
  --dev_filename {dev_file} \
  --test_filename {test_file} \
  --output_dir {output_dir} \
  --max_source_length {source_length} \
  --max_target_length {target_length} \
  --beam_size {beam_size} \
  --eval_batch_size {batch_size}
```
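Once testing finishes, the generated comments can be compared with the references. The sketch below assumes the CodeXGLUE-style convention where run.py writes 'test_0.output' and 'test_0.gold' to the output directory, with one '&lt;index&gt;\t&lt;text&gt;' line per example; adjust the file names if your run writes something different.

```python
# Inspect the first few predictions against the gold comments.
# File names and the "<index>\t<text>" line format are assumptions
# based on the CodeXGLUE code-to-text convention.
output_dir = 'model/python'

with open(f'{output_dir}/test_0.output') as f_hyp, \
     open(f'{output_dir}/test_0.gold') as f_ref:
    for hyp, ref in list(zip(f_hyp, f_ref))[:5]:
        print('pred:', hyp.split('\t', 1)[-1].strip())
        print('gold:', ref.split('\t', 1)[-1].strip())
        print('-' * 40)
```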
We used the 'CodeSearchNet' dataset for intermediate fine-tuning of our model, but the data required for the final fine-tuning was not readily available anywhere. We therefore decided to scrape it from the official documentation of Pandas, Scikit-learn, PyTorch, NumPy, and TensorFlow.
We needed the description of each function that can appear in Python code; with this extra information, the model can generate better comments.
For scraping we used the Beautiful Soup library, which saved a lot of time compared to copying all the functions manually. We retrieved 5633 functions in total from these five libraries:
| Library | No. of functions |
|---|---|
| Pandas | 1216 |
| Scikit-learn | 519 |
| PyTorch | 702 |
| NumPy | 1869 |
| TensorFlow-Keras | 1327 |
To scrape the data we used 'retrieval_task.ipynb'. The notebook has two parts: the first handles Pandas, Scikit-learn, PyTorch, and NumPy, and the second handles TensorFlow. You only need to list the links of all the web pages that contain functions; the notebook iterates over this list, creates one Excel file per web page, and saves it to the working directory. At the end, a final loop goes through all the (numbered) file names and calls the 'excel_write_all' function to merge everything into a single Excel file (see the sketch below).
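A minimal version of this scraping loop might look like the sketch below. The example URL, the Sphinx 'dl.function' selector, and the file naming are assumptions based on how Sphinx-generated API reference pages are typically structured; 'retrieval_task.ipynb' may use different selectors and helpers.

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Hypothetical list of documentation pages to scrape.
links = [
    'https://pandas.pydata.org/docs/reference/general_functions.html',
]

for i, url in enumerate(links):
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    rows = []
    # Sphinx-generated API pages typically wrap each function in
    # <dl class="py function">, with the signature in <dt> and the
    # description in <dd>.
    for dl in soup.find_all('dl', class_='function'):
        signature = dl.find('dt').get_text(' ', strip=True)
        description = dl.find('dd').get_text(' ', strip=True)
        rows.append({'function': signature, 'description': description})
    # One Excel file per web page, mirroring the notebook's workflow.
    pd.DataFrame(rows).to_excel(f'functions_{i}.xlsx', index=False)

# Merge the numbered per-page files into a single spreadsheet.
merged = pd.concat(pd.read_excel(f'functions_{i}.xlsx')
                   for i in range(len(links)))
merged.to_excel('functions_all.xlsx', index=False)
```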