

CTX-Coder: Cross-Attention Architectures Empower LLMs for Long-Context Vulnerability Detection

A long-context-enhanced LLM for vulnerability detection.

Note

CTX-Coder is a modified version of Llama-3.2-11B-Vision-Instruct: we remove the vision encoder and use Llama's last hidden layers instead.

Quick Install & Setup

# Install
pip3 install -r requirements.txt

Call Graph Data Collection

If you want to collect your own call-graph dataset, follow these steps:

  1. Download the GitHub projects into a root directory.
  2. Generate the call graphs with the following commands:
cd doxygen
bash doxygen.sh
Note: replace the root directory in doxygen.sh with your own path.
  3. Extract the functions and format them as JSON strings with the Python script: python extract_doxygen.py. It outputs JSON files (an illustrative sketch of this step follows the list).
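
For orientation, here is a minimal sketch of the extraction step. It is not the repository's extract_doxygen.py; it assumes doxygen was run with XML output enabled (GENERATE_XML = YES) and omits source-code extraction (index_to_code) for brevity.

# Minimal sketch, not the repo's extract_doxygen.py: build the call-graph
# JSON from doxygen's XML output (requires GENERATE_XML = YES).
import glob
import json
import xml.etree.ElementTree as ET

funcs = {}    # doxygen refid -> function name
calls = []    # (caller refid, callee refid) pairs

for path in glob.glob("xml/*.xml"):
    for mdef in ET.parse(path).getroot().iter("memberdef"):
        if mdef.get("kind") != "function":
            continue
        refid = mdef.get("id")
        funcs[refid] = mdef.findtext("name")
        # <references> entries list the members this function calls
        for ref in mdef.iter("references"):
            calls.append((refid, ref.get("refid")))

index = {refid: i for i, refid in enumerate(funcs)}
n = len(index)
adj = [[0] * n for _ in range(n)]   # A_ij = 1: function i is called by j
for caller, callee in calls:
    if callee in index:             # skip calls to functions outside the corpus
        adj[index[callee]][index[caller]] = 1

print(json.dumps({
    "index_to_funcname": {str(i): funcs[r] for r, i in index.items()},
    "adj": adj,
}, indent=2))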

CTX-VUL

The CTX-Vul dataset contains the contextual functions of each vulnerable function. Each sample is formatted as the following JSON string:

{
    "index_to_funcname": {"0": "<func1_name>", "1": "<func2_name>"},
    "adj": ["<an n*n matrix of the call relationships; A_{ij} = 1 means function i is called by function j>"],
    "index_to_code": {"0": "<func1_code>", "1": "<func2_code>"},
    "vul_type": "Vulnerable/Not Vulnerable"
}

Note

Function 0 is the target function. The dataset and checkpoints are coming soon!
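
As a quick illustration, the snippet below (an assumed helper, not shipped with the repository) loads one CTX-Vul sample and lists the direct callers of the target function; the file name is hypothetical.

# Assumed helper, not part of the repo: inspect one CTX-Vul sample.
import json

with open("ctx_vul_sample.json") as f:   # hypothetical file name
    sample = json.load(f)

names = sample["index_to_funcname"]
adj = sample["adj"]   # n*n matrix; A_ij = 1 means function i is called by j

# Row 0 marks the target's callers: columns j with adj[0][j] == 1.
callers = [names[str(j)] for j, flag in enumerate(adj[0]) if flag]
print(f"target: {names['0']}  label: {sample['vul_type']}")
print("called by:", callers)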

CTX-Coder

Training

We provide the training script in ctx_coder/train_ctxcoder.py. To use it, fill in MODEL_PATH, LLAMA_3_PATH, and OUTPUT_PATH (see the sketch below the command). Then train the model with the following command:

deepspeed ctx_coder/train_ctxcoder.py
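
The script expects three paths to be filled in. The variable names come from this README; the example values and the role noted for each are assumptions, so check the script itself.

# Placeholders to fill in near the top of ctx_coder/train_ctxcoder.py.
# Variable names are from the README; values and roles are assumptions.
MODEL_PATH = "/path/to/ctx_coder_base"                    # model to fine-tune
LLAMA_3_PATH = "/path/to/Llama-3.2-11B-Vision-Instruct"   # backbone weights
OUTPUT_PATH = "/path/to/output_checkpoints"               # checkpoint output dir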

Inference

We provide an inference pipeline; simply swap in your trained checkpoint and dataset, then run the following command:

python ctx_coder/pipeline.py

Evaluation

  • To evaluate CTX-Coder, first generate results with pipeline.py, then score them with evaluation/test.py.

  • For code documentation generation, we use CodeBERT's default dataset and the official Big-Code evaluation code.

  • CrossCodeEval: https://github.com/amazon-science/cceval.
