CTX-Coder: A Long-Context Enhanced LLM for Vulnerability Detection
Note
CTX-Coder is a modified version of Llama-3.2-11B-Vision-Instruct: we remove the vision encoder and use Llama's last hidden layers.
# Install
```shell
pip3 install -r requirements.txt
```
If you want to collect your own call-graph dataset, follow these steps:
- Download the GitHub projects into a directory `root`.
- Generate the call graph using the following commands:

  ```shell
  cd doxygen
  bash doxygen.sh
  ```

  Note: please replace the root directory in `doxygen.sh` with your own.
- Extract the functions and format them into JSON strings using the Python script:

  ```shell
  python extract_doxygen.py
  ```

  It will output JSON files.
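The three collection steps above can also be driven from one script. The sketch below is an assumption about how you might wire them together (the repository list and `collect_call_graphs` helper are hypothetical; `doxygen.sh` and `extract_doxygen.py` are assumed to behave as described above):

```python
import subprocess
from pathlib import Path

ROOT = Path("root")  # directory that will hold the downloaded projects


def clone_cmd(url: str) -> list[str]:
    # Shallow clone: full history is not needed for call-graph extraction.
    return ["git", "clone", "--depth", "1", url]


def collect_call_graphs(repo_urls: list[str]) -> None:
    # Step 1: download the GitHub projects into ROOT.
    ROOT.mkdir(exist_ok=True)
    for url in repo_urls:
        subprocess.run(clone_cmd(url), cwd=ROOT, check=True)
    # Step 2: generate the call graphs (doxygen.sh must be edited to point at ROOT).
    subprocess.run(["bash", "doxygen.sh"], cwd="doxygen", check=True)
    # Step 3: extract the functions and emit the JSON files.
    subprocess.run(["python", "extract_doxygen.py"], check=True)
```

Call `collect_call_graphs([...])` with your own repository URLs after editing `doxygen.sh`.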
The CTX-Vul dataset contains the contextual functions of each vulnerable function. Each sample is formatted as the following JSON string:

```json
{
  "index_to_funcname": {"0": "<func1_name>", "1": "<func2_name>"},
  "adj": ["# an n*n matrix of the call relationships; A_{ij} = 1 means function i is called by function j"],
  "index_to_code": {"0": "<func1_code>", "1": "<func2_code>"},
  "vul_type": "Vulnerable/Not Vulnerable"
}
```
Note
Function 0 is the target function. The dataset and checkpoint are coming soon!
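Given the format above, one sample can be parsed into the target function plus an explicit call-graph edge list. This loader is a minimal sketch, not part of the repository; it assumes `adj` arrives as an n×n nested list and follows the convention that `A_{ij} = 1` means function i is called by function j:

```python
import json


def load_record(json_str: str):
    """Parse one CTX-Vul record into (target_code, call_edges)."""
    rec = json.loads(json_str)
    n = len(rec["index_to_funcname"])
    # A_{ij} = 1 means function i is called by function j,
    # i.e. the call graph contains an edge j -> i.
    edges = [(j, i)
             for i in range(n)
             for j in range(n)
             if rec["adj"][i][j] == 1]
    target_code = rec["index_to_code"]["0"]  # function 0 is the target function
    return target_code, edges
```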
We provide the training script in `ctx_coder/train_ctxcoder.py`; before using it, please fill in `MODEL_PATH`, `LLAMA_3_PATH`, and `OUTPUT_PATH`.
You can train the model using the following command:
```shell
deepspeed ctx_coder/train_ctxcoder.py
```
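`deepspeed` can also take a JSON config via its `--deepspeed` flag. The repository may ship its own settings, so treat the values below as a hypothetical starting point (ZeRO stage 2 with bf16 on a single GPU), not the project's actual configuration:

```python
import json

# Hypothetical DeepSpeed settings; adjust to your hardware and the repo's defaults.
DS_CONFIG = {
    "train_batch_size": 16,  # = micro_batch * grad_accum * n_gpus (here 1 GPU)
    "train_micro_batch_size_per_gpu": 2,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 2},
}


def write_ds_config(path: str = "ds_config.json") -> None:
    # Write the config where the deepspeed launcher can pick it up.
    with open(path, "w") as f:
        json.dump(DS_CONFIG, f, indent=2)
```

If the training script forwards a `--deepspeed` argument (an assumption), the launch becomes `deepspeed ctx_coder/train_ctxcoder.py --deepspeed ds_config.json`.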
We provide an inference pipeline; simply point it at your trained checkpoint and dataset, then run:

```shell
python ctx_coder/pipeline.py
```
- To evaluate CTX-Coder, first generate the results using `pipeline.py`, then evaluate them with `evaluation/test.py`.
- For code document generation, we use the default dataset of CodeBert and the official code of Big-Code.
- CrossCodeEval: project url https://github.com/amazon-science/cceval.
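For the vulnerability-detection task, evaluation ultimately reduces to comparing predicted labels against the `vul_type` field. The helper below is a self-contained sketch of the usual binary metrics, not the actual logic of `evaluation/test.py`:

```python
def binary_metrics(golds: list[str], preds: list[str], positive: str = "Vulnerable"):
    """Accuracy, precision, recall, and F1 over Vulnerable / Not Vulnerable labels."""
    tp = sum(g == positive and p == positive for g, p in zip(golds, preds))
    fp = sum(g != positive and p == positive for g, p in zip(golds, preds))
    fn = sum(g == positive and p != positive for g, p in zip(golds, preds))
    acc = sum(g == p for g, p in zip(golds, preds)) / len(golds)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return {"accuracy": acc, "precision": prec, "recall": rec, "f1": f1}
```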