This repository contains the codebase, data, models, and report for our project on reducing prompt sensitivity in Large Language Models (LLMs). We explore both inference-time prompting strategies and training-time fine-tuning methods to reduce variability in outputs across semantically equivalent prompt variants. This is Project 3 of the Natural Language Processing course, academic year 2024/2025.
Gonçalo Cardoso, Gopika Krishnan, Ali Muhammad
Prompt sensitivity refers to the variability in model responses when given different prompts with the same underlying meaning. This project investigates the issue through:
- Vanilla prompting
- Chain-of-Thought (CoT) prompting
- Self-Refinement
- Self-Consistency (a minimal sketch follows this list)
- Iterative Refinement
- Fine-tuning LLMs using Low-Rank Adaptation (LoRA) with Parameter-Efficient Fine-Tuning (PEFT) on prompt variant groups that share the same intended output.
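As a concrete illustration of the self-consistency strategy listed above, here is a minimal sketch: sample several completions with temperature > 0 and keep the majority-vote answer. This is illustrative only; the prompts, decoding settings, and answer extraction used in code/POSIX/Posix_script.py may differ.

```python
# Minimal self-consistency sketch (illustrative; Posix_script.py may differ).
from collections import Counter
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-rw-1b"  # small model, convenient for local testing
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

def self_consistent_answer(prompt: str, n_samples: int = 5) -> str:
    """Sample several completions and return the most frequent (majority-vote) answer."""
    inputs = tokenizer(prompt, return_tensors="pt")
    answers = []
    for _ in range(n_samples):
        out = model.generate(
            **inputs, do_sample=True, temperature=0.7, top_p=0.9,
            max_new_tokens=64, pad_token_id=tokenizer.eos_token_id,
        )
        completion = tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
        answers.append(completion.strip().split("\n")[0])  # crude answer extraction
    return Counter(answers).most_common(1)[0][0]
```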
We experimented with multiple open-weight LLMs, mostly at the 7B scale, for both inference and fine-tuning:
- Mistral-7B
- Falcon-7B Instruct
- Falcon-7B
- LLaMA-2 7B
- Falcon-RW-1B
- Prompt Sensitivity Index (POSIX): Quantifies output variability across prompt perturbations (a simplified computation sketch follows this list).
- AlpacaEval: Uses an LLM as a judge to assess output consistency and quality.
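For intuition, here is a simplified sketch of a POSIX-style score, assuming the standard formulation: for intent-preserving prompt variants x_1..x_N with responses y_1..y_N, average the length-normalised absolute log-likelihood deviation |log p(y_j | x_i) - log p(y_j | x_j)| over all ordered pairs i ≠ j. The exact computation lives in code/POSIX/Posix_script.py and may differ in details.

```python
# Simplified POSIX-style sensitivity score (illustrative; see Posix_script.py
# for the project's actual implementation).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-rw-1b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

@torch.no_grad()
def response_logprob(prompt: str, response: str) -> tuple[float, int]:
    """Sum of log-probabilities of the response tokens given the prompt (approximate:
    tokenisation at the prompt/response boundary may shift by a token)."""
    prompt_len = tokenizer(prompt, return_tensors="pt")["input_ids"].shape[1]
    full_ids = tokenizer(prompt + response, return_tensors="pt")["input_ids"]
    logits = model(full_ids).logits                       # (1, T, vocab)
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)  # prediction for token t+1 from its prefix
    token_lp = logprobs.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    n_resp = max(full_ids.shape[1] - prompt_len, 1)       # number of response tokens
    return token_lp[0, -n_resp:].sum().item(), n_resp

def posix_score(prompts: list[str], responses: list[str]) -> float:
    """Average |log p(y_j | x_i) - log p(y_j | x_j)| / len(y_j) over ordered pairs i != j."""
    n = len(prompts)
    total = 0.0
    for j in range(n):
        lp_jj, length = response_logprob(prompts[j], responses[j])
        for i in range(n):
            if i == j:
                continue
            lp_ij, _ = response_logprob(prompts[i], responses[j])
            total += abs(lp_ij - lp_jj) / length
    return total / (n * (n - 1))
```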
The key components of this repository are organized as follows:
├── data/
│   ├── alpaca_prompts.json          # Original POSIX and Alpaca-style prompt variant data
│   └── Posix_results                # POSIX scores and all outputs generated by our models
│
├── models/                          # Fine-tuned LoRA weights and saved checkpoints
│
├── code/
│   ├── lora-llm-finetune/           # Training-time methods (LoRA, PEFT)
│   │   ├── requirements.txt         # Pinned dependencies for environment setup
│   │   ├── finetune.py              # Fine-tuning on grouped prompt variants
│   │   └── run_finetune.slurm       # SLURM script for running LoRA fine-tuning on HPC
│   │
│   ├── POSIX/                       # Inference-time prompting + POSIX scoring
│   │   ├── requirements.txt         # Pinned dependencies for environment setup
│   │   ├── Posix_script.py          # Runs inference with CoT, Self-Consistency, etc.
│   │   ├── posix_general.slurm      # SLURM job script for POSIX evaluation
│   │   ├── sft_train.jsonl          # Training set for fine-tuning
│   │   ├── sft_test.jsonl           # Test set for evaluation
│   │   └── Posix_analysis.ipynb     # Jupyter notebook analysing results
│   │
│   └── llm-as-judge/                # LLM-as-a-judge evaluation framework
│       ├── requirements.txt         # Pinned dependencies for environment setup
│       ├── eval_judge.py            # Uses an LLM to judge consistency/quality of outputs
│       └── submit_alpaca_eval.slurm # SLURM script for LLM-as-judge evaluation
│
├── report.pdf                       # Final report of the project
│
└── README.md                        # Project overview and reproducibility guide
This project was designed and tested on a High-Performance Computing (HPC) environment using the ARNES SLURM-based cluster. While local execution is possible for smaller models or testing, we recommend running on an HPC system with SLURM for full-scale fine-tuning and evaluation.
To reproduce our results from scratch:
We recommend Python 3.10+. First clone the repository; each subdirectory under code/ ships its own requirements.txt, which you install in the corresponding steps below:
git clone https://github.com/UL-FRI-NLP-Course/ul-fri-nlp-course-project-2024-2025-nlpizza.git
cd ul-fri-nlp-course-project-2024-2025-nlpizza
The code/POSIX/ directory already contains the annotated, LLM-extended prompt variant groups with named perturbation types and target outputs (sft_train.jsonl and sft_test.jsonl).
To fine-tune a model like Falcon-RW-1B using LoRA, run:
cd code/lora-llm-finetune
conda create -n finetune_env python=3.10 -y
pip install -r requirements.txt
sbatch run_finetune.slurm tiiuae/falcon-rw-1b query_key_value,dense,dense_h_to_4h,dense_4h_to_h
This command loads tiiuae/falcon-rw-1b from Hugging Face and applies LoRA to its attention and MLP modules (query_key_value, dense, dense_h_to_4h, dense_4h_to_h) to fine-tune on the prompt-variant groups. Note that the target-module names must match the model architecture: Falcon models expose the modules above, while LLaMA/Mistral-style models use q_proj,k_proj,v_proj,o_proj.
For fine-tuning, the expected dataset is a .jsonl file with the following fields per line:
- instruction: the input prompt
- output: the corresponding target response
- group_id: identifier for a group of semantically similar prompts
{
  "instruction": "Rephrase this sentence: The cat sat on the mat.",
  "output": "The feline rested on the rug.",
  "group_id": 42
}
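To give a feel for how this grouping is used, records sharing a group_id can be collected together. A minimal sketch assuming the layout above (the actual loading logic lives in finetune.py and may differ):

```python
# Group SFT records by prompt-variant group (illustrative).
import json
from collections import defaultdict

groups = defaultdict(list)
with open("code/POSIX/sft_train.jsonl") as f:
    for line in f:
        record = json.loads(line)
        groups[record["group_id"]].append(record)

# All prompt variants that share the intended output of group 42 (the example above):
for record in groups[42]:
    print(record["instruction"], "->", record["output"])
```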
The run_finetune.slurm invocation above uses the dataset that has already been prepared (code/POSIX/sft_train.jsonl).
finetune.py is configurable through command-line arguments for the model name, dataset path, LoRA settings, number of groups to sample, and more; edit run_finetune.slurm to change the values it passes, or call finetune.py directly. Here is an example of training the Falcon-7B-Instruct model:
python finetune.py \
--model_name "tiiuae/falcon-7b-instruct" \
--dataset_path data/sft_train.jsonl \
--output_dir falcon7b_lora_output \
--quant4bit \
--lora_r 16 \
--lora_alpha 32 \
--target_modules "query_key_value,dense,dense_h_to_4h,dense_4h_to_h"
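Roughly, these arguments map onto a peft/bitsandbytes configuration like the following sketch. This is illustrative only; finetune.py is the source of truth for the exact settings (dropout, bias, quantisation options, and so on).

```python
# Rough sketch of how the CLI arguments above could map onto peft/bitsandbytes
# configuration (illustrative; finetune.py defines the real setup).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(             # --quant4bit
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b-instruct", quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,                                    # --lora_r
    lora_alpha=32,                           # --lora_alpha
    target_modules=["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()           # only the LoRA adapters are trainable
```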
After training, the model and LoRA adapters will be saved to the specified output directory:
falcon7b_lora_output/
├── adapter_config.json
├── adapter_model.bin
├── tokenizer_config.json
└── tokenizer.json
These can be loaded at inference time using:
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b-instruct")
model = PeftModel.from_pretrained(model, "falcon7b_lora_output")
tokenizer = AutoTokenizer.from_pretrained("falcon7b_lora_output")
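Continuing from the snippet above, a quick sanity check of the adapted model (the prompt and decoding settings here are arbitrary):

```python
# Generate with the LoRA-adapted model (illustrative prompt and settings).
prompt = "Rephrase this sentence: The cat sat on the mat."
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False,
                            pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```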
All of our fine-tuned models are available on the ARNES cluster at /d/hpc/projects/onj_fri/gk40784/pt2/output/
Evaluate output stability across prompt variants using the Prompt Sensitivity Index. Supports both base and fine-tuned models, and allows switching between prompting techniques.
From the base folder of the repository, run the following:
cd code/POSIX
pip install -r requirements.txt
export TASK_ID=0
export BATCH_SIZE=5
export TECHNIQUE=None
export MODEL_ID="tiiuae/falcon-rw-1b"
python Posix_script.py
To use a prompting strategy like Chain of Thought:
export TECHNIQUE=chain_of_thought
To use a fine-tuned model:
export FINETUNE_FLAG=1
export FINETUNE_PATH=models/falcon7b_lora_output/
The script is SLURM-array-ready for parallel POSIX computation:
sbatch posix_general.slurm
Each job generates a .csv file with the generated responses, formatted prompts, and POSIX scores (overall and per perturbation type). The script also prints the average POSIX score for that batch.
We use a judge model (e.g. Mistral-7B-Instruct) to evaluate how semantically similar the candidate responses (produced by the target models) are to gold references. This helps measure output consistency and quality; a minimal judge-prompt sketch follows the file overview below.
- Input: Reference responses generated by the judge model and target responses from the models you want to evaluate.
- Output: CSV files with similarity scores by technique and perturbation, plus summary plots and tables.
- Gold files:
posix_{technique}_{judge_model_short}_merged.csv
e.g. posix_chain_of_thought_mistralai_Mistral-7B-Instruct-v0.1_merged.csv
- Target files:
posix_{technique}_{target_model_short}_merged.csv
e.g.
posix_chain_of_thought_tiiuae_falcon-7b-instruct_merged.csv
- Output files:
  judged_scores_{technique}_{target_model_short}.csv
  judged_scores_all.csv
  summary_scores_table.csv
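For intuition, the judging step boils down to prompting the judge model to compare a reference answer with a candidate answer and emit a score. A minimal, hypothetical prompt-construction sketch (the actual prompt wording and scoring scale are defined in the evaluation script and may differ):

```python
# Hypothetical judge-prompt template (illustrative; the real prompt and scale
# live in the LLM-as-judge evaluation script).
JUDGE_TEMPLATE = (
    "You are an impartial judge. Rate how semantically similar the candidate answer "
    "is to the reference answer on a scale from 1 (unrelated) to 5 (equivalent). "
    "Reply with the number only.\n\n"
    "Prompt: {prompt}\nReference answer: {reference}\nCandidate answer: {candidate}\nScore:"
)

def build_judge_prompt(prompt: str, reference: str, candidate: str) -> str:
    return JUDGE_TEMPLATE.format(prompt=prompt, reference=reference, candidate=candidate)
```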
Navigate to the folder containing the merged POSIX result files (on our cluster):
cd /d/hpc/projects/FRI/ma76193
Run the script with:
python judge_model_eval.py \
--base_dir /d/hpc/projects/FRI/ma76193 \
--judge_model mistralai/Mistral-7B-Instruct-v0.1 \
--target_model tiiuae/falcon-7b-instruct
--judge_model can be any HF-compatible judge model (e.g. mistralai/Mistral-7B-Instruct-v0.1).
--target_model can be any model you want to evaluate.
How to Run on SLURM HPC
Use the provided SLURM batch file to run the evaluation on a GPU node.
Example:
sbatch run_alpaca_only.sh mistralai/Mistral-7B-Instruct-v0.1 tiiuae/falcon-7b-instruct
- arg1: Judge model name or path.
- arg2: Target model name or path.