Official Code for the paper: "Advanced Large Language Models Prompting Strategies for Reentrancy Classification and Explanation in Smart Contracts".
This repository contains the official implementation for our research on using advanced prompting strategies for Large Language Models (LLMs) in smart contract security. We introduce a novel approach combining structurally-aware Retrieval-Augmented Generation (RAG) with reasoning-optimized LLMs to reliably detect vulnerabilities and generate human-understandable explanations.
Our key finding is that grounding LLMs in structural evidence (like Control Flow Graphs) is more effective than prescribing a rigid thought process. This method not only achieves state-of-the-art accuracy but also produces trustworthy, actionable explanations, bridging the gap between automated analysis and human expertise.
- Scope: Detecting reentrancy vulnerabilities in Solidity smart contracts.
- Dataset: Our manually verified dataset is available here.
- Models Evaluated:
- Traditional ML: BERT, LSTM, FFNN, GNB, GB, XGB, KNN, LR, RF, SVM (code for reproducibility available here)
- Large Language Models: GPT-4o, GPT-4.1, GPT-4.1-mini, GPT-4.1-nano, o3-mini, o4-mini
- Prompts: All prompts used in our experiments are available in `src/prompts.py`.
To run the code, you will need:

- Python 3.8+
- A virtual environment manager such as `conda` or `venv` (recommended)
- Clone the repository:

  ```bash
  git clone https://github.com/matteo-rizzo/advanced-llm-prompting-for-reentrancy.git
  cd advanced-llm-prompting-for-reentrancy
  ```
- Create and activate a virtual environment:

  ```bash
  python3 -m venv .venv
  source .venv/bin/activate
  ```
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Set up your API keys:
  - Create a `.env` file in the `src/` directory by copying the example:

    ```bash
    cp src/.env.example src/.env
    ```

  - Add your OpenAI API key (and any other required keys) to the `src/.env` file.
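For reference, a minimal `src/.env` might look like the sketch below. The variable name is an assumption based on the standard OpenAI client convention; `src/.env.example` lists the exact keys the project expects.

```bash
# Sketch of src/.env (assumed key name; see src/.env.example for the authoritative list)
OPENAI_API_KEY=sk-...
```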
We provide convenient shell scripts to reproduce the main experiments of the paper in a three-step process.
The first script, `xrag.sh`, executes the core structurally-aware RAG pipeline for vulnerability classification and explanation generation. It is highly configurable.
- To run the pipeline with default settings (mode=`cfg`, k=3, all splits, default models):

  ```bash
  ./src/scripts/xrag.sh
  ```
- To run with a different RAG mode and k-value for a specific model:

  ```bash
  ./src/scripts/xrag.sh --mode ast --k 5 o4-mini
  ```
- To run on a single data split (e.g., split #2):

  ```bash
  ./src/scripts/xrag.sh --split 2
  ```
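The options above can also be combined in a single invocation. The following is a hypothetical run that assumes the flags compose as the individual examples suggest:

```bash
# Hypothetical combined run: AST-based retrieval, k=5, data split #1, model gpt-4.1
./src/scripts/xrag.sh --mode ast --k 5 --split 1 gpt-4.1
```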
The second script, `baseline.sh`, runs the baseline (non-RAG) models for comparison.
- To run the default baseline models (`o3-mini`, `gpt-4o`):

  ```bash
  ./src/scripts/baseline.sh
  ```
- To run a specific baseline model (e.g., `gpt-4.1-mini`):

  ```bash
  ./src/scripts/baseline.sh gpt-4.1-mini
  ```
The third script, `eval_explanations.sh`, uses a powerful "evaluator" model to score the quality of the explanations generated by the other models.
- To evaluate the default models using the default evaluator (`o4-mini`):

  ```bash
  ./src/scripts/eval_explanations.sh
  ```
- To specify a different evaluator model (e.g., `gpt-4o`):

  ```bash
  ./src/scripts/eval_explanations.sh --evaluator gpt-4o gpt-4.1
  ```
All results are logged and saved. You can use the main Jupyter notebook to visualize and analyze the outputs from all your runs:

```bash
jupyter notebook notebooks/rag-results.ipynb
```
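If you prefer to run the notebook non-interactively (for example on a remote machine), `jupyter nbconvert` can execute it headlessly; this is an optional convenience rather than part of the original workflow:

```bash
# Execute the analysis notebook headlessly and save the executed copy in place
jupyter nbconvert --to notebook --execute --inplace notebooks/rag-results.ipynb
```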
This project is licensed under the MIT License. See the LICENSE file for more details.