This repository contains the codebase, data, models, and report for our project on reducing prompt sensitivity in Large Language Models (LLMs). We explore both inference-time prompting strategies and training-time fine-tuning methods to reduce variability in outputs across semantically equivalent prompt variants. This is Project 3 of the Natural Language Processing course, academic year 2024/2025.
Gonçalo Cardoso, Gopika Krishnan, Ali Muhammad
Prompt sensitivity refers to the variability in model responses when given different prompts with the same underlying meaning. This project investigates the issue through:
- Vanilla prompting
- Chain-of-Thought (CoT) prompting
- Self-Refinement
- Self-Consistency (a minimal sketch follows this list)
- Iterative Refinement
- Fine-tuning LLMs using Low-Rank Adaptation (LoRA) with Parameter-Efficient Fine-Tuning (PEFT) on prompt variant groups that share the same intended output.
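As a concrete illustration of the self-consistency strategy listed above, here is a minimal sketch: sample several completions with temperature > 0 and keep the majority-vote answer. This is illustrative only; the prompts, decoding settings, and answer extraction used in code/POSIX/Posix_script.py may differ.

```python
# Minimal self-consistency sketch (illustrative; Posix_script.py may differ).
from collections import Counter
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-rw-1b"  # small model, convenient for local testing
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

def self_consistent_answer(prompt: str, n_samples: int = 5) -> str:
    """Sample several completions and return the most frequent (majority-vote) answer."""
    inputs = tokenizer(prompt, return_tensors="pt")
    answers = []
    for _ in range(n_samples):
        out = model.generate(
            **inputs, do_sample=True, temperature=0.7, top_p=0.9,
            max_new_tokens=64, pad_token_id=tokenizer.eos_token_id,
        )
        completion = tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
        answers.append(completion.strip().split("\n")[0])  # crude answer extraction
    return Counter(answers).most_common(1)[0][0]
```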
We experimented with multiple open-weight LLMs, mostly at the 7B scale, for both inference and fine-tuning:
- Mistral-7B
- Falcon-7B Instruct
- Falcon-7B
- LLaMA-2 7B
- Falcon-RW-1B
- Prompt Sensitivity Index (POSIX): Quantifies output variability across prompt perturbations (a simplified computation sketch follows this list).
- AlpacaEval: Uses an LLM as a judge to assess output consistency and quality.
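For intuition, here is a simplified sketch of a POSIX-style score, assuming the standard formulation: for intent-preserving prompt variants x_1..x_N with responses y_1..y_N, average the length-normalised absolute log-likelihood deviation |log p(y_j | x_i) - log p(y_j | x_j)| over all ordered pairs i ≠ j. The exact computation lives in code/POSIX/Posix_script.py and may differ in details.

```python
# Simplified POSIX-style sensitivity score (illustrative; see Posix_script.py
# for the project's actual implementation).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-rw-1b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

@torch.no_grad()
def response_logprob(prompt: str, response: str) -> tuple[float, int]:
    """Sum of log-probabilities of the response tokens given the prompt (approximate:
    tokenisation at the prompt/response boundary may shift by a token)."""
    prompt_len = tokenizer(prompt, return_tensors="pt")["input_ids"].shape[1]
    full_ids = tokenizer(prompt + response, return_tensors="pt")["input_ids"]
    logits = model(full_ids).logits                       # (1, T, vocab)
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)  # prediction for token t+1 from its prefix
    token_lp = logprobs.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    n_resp = max(full_ids.shape[1] - prompt_len, 1)       # number of response tokens
    return token_lp[0, -n_resp:].sum().item(), n_resp

def posix_score(prompts: list[str], responses: list[str]) -> float:
    """Average |log p(y_j | x_i) - log p(y_j | x_j)| / len(y_j) over ordered pairs i != j."""
    n = len(prompts)
    total = 0.0
    for j in range(n):
        lp_jj, length = response_logprob(prompts[j], responses[j])
        for i in range(n):
            if i == j:
                continue
            lp_ij, _ = response_logprob(prompts[i], responses[j])
            total += abs(lp_ij - lp_jj) / length
    return total / (n * (n - 1))
```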
The key components of this repository are organized as follows:
├── data/
│   ├── alpaca_prompts.json          # Original POSIX and Alpaca-style prompt variant data
│   └── Posix_results                # POSIX scores and all outputs generated by our models
│
├── models/                          # Fine-tuned LoRA weights and saved checkpoints
│
├── code/
│   ├── lora-llm-finetune/           # Training-time methods (LoRA, PEFT)
│   │   ├── requirements.txt         # Pinned dependencies for environment setup
│   │   ├── finetune.py              # Fine-tuning on grouped prompt variants
│   │   └── run_finetune.slurm       # SLURM script for running LoRA fine-tuning on HPC
│   │
│   ├── POSIX/                       # Inference-time prompting + POSIX scoring
│   │   ├── requirements.txt         # Pinned dependencies for environment setup
│   │   ├── Posix_script.py          # Runs inference with CoT, Self-Consistency, etc.
│   │   ├── posix_general.slurm      # SLURM job script for POSIX evaluation
│   │   ├── sft_train.jsonl          # Training set for fine-tuning
│   │   ├── sft_test.jsonl           # Test set for evaluation
│   │   └── Posix_analysis.ipynb     # Jupyter notebook analysing results
│   │
│   └── llm-as-judge/                # LLM-as-a-judge evaluation framework
│       ├── requirements.txt         # Pinned dependencies for environment setup
│       ├── eval_judge.py            # Uses an LLM to judge consistency/quality of outputs
│       └── submit_alpaca_eval.slurm # SLURM script for LLM-as-judge evaluation
│
├── report.pdf                       # Final report of the project
│
└── README.md                        # Project overview and reproducibility guide
This project was designed and tested on a High-Performance Computing (HPC) environment using the ARNES SLURM-based cluster. While local execution is possible for smaller models or testing, we recommend running on an HPC system with SLURM for full-scale fine-tuning and evaluation.
To reproduce our results from scratch:
We recommend Python 3.10+. First clone the repository; each subdirectory under code/ ships its own requirements.txt, which you install in the corresponding steps below:
git clone https://github.com/UL-FRI-NLP-Course/ul-fri-nlp-course-project-2024-2025-nlpizza.git
cd ul-fri-nlp-course-project-2024-2025-nlpizza
The code/POSIX/ directory already contains the annotated, LLM-extended prompt variant groups with named perturbation types and target outputs (sft_train.jsonl and sft_test.jsonl).
To fine-tune a model like Falcon-RW-1B using LoRA, run:
cd code/lora-llm-finetune
conda create -n finetune_env python=3.10 -y
pip install -r requirements.txt
sbatch run_finetune.slurm tiiuae/falcon-rw-1b query_key_value,dense,dense_h_to_4h,dense_4h_to_h
This command loads tiiuae/falcon-rw-1b from Hugging Face and applies LoRA to its attention and MLP modules (query_key_value, dense, dense_h_to_4h, dense_4h_to_h) to fine-tune on the prompt-variant groups. Note that the target-module names must match the model architecture: Falcon models expose the modules above, while LLaMA/Mistral-style models use q_proj,k_proj,v_proj,o_proj.
For fine-tuning, the expected dataset is a .jsonl file with the following fields per line:
- instruction: the input prompt
- output: the corresponding target response
- group_id: identifier for a group of semantically similar prompts
{
  "instruction": "Rephrase this sentence: The cat sat on the mat.",
  "output": "The feline rested on the rug.",
  "group_id": 42
}
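To give a feel for how this grouping is used, records sharing a group_id can be collected together. A minimal sketch assuming the layout above (the actual loading logic lives in finetune.py and may differ):

```python
# Group SFT records by prompt-variant group (illustrative).
import json
from collections import defaultdict

groups = defaultdict(list)
with open("code/POSIX/sft_train.jsonl") as f:
    for line in f:
        record = json.loads(line)
        groups[record["group_id"]].append(record)

# All prompt variants that share the intended output of group 42 (the example above):
for record in groups[42]:
    print(record["instruction"], "->", record["output"])
```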
The run_finetune.slurm invocation above uses the dataset that has already been prepared (code/POSIX/sft_train.jsonl).
finetune.py is configurable through command-line arguments for the model name, dataset path, LoRA settings, number of groups to sample, and more; edit run_finetune.slurm to change the values it passes, or call finetune.py directly. Here is an example of training the Falcon-7B-Instruct model:
python finetune.py \
--model_name "tiiuae/falcon-7b-instruct" \
--dataset_path data/sft_train.jsonl \
--output_dir falcon7b_lora_output \
--quant4bit \
--lora_r 16 \
--lora_alpha 32 \
--target_modules "query_key_value,dense,dense_h_to_4h,dense_4h_to_h"
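Roughly, these arguments map onto a peft/bitsandbytes configuration like the following sketch. This is illustrative only; finetune.py is the source of truth for the exact settings (dropout, bias, quantisation options, and so on).

```python
# Rough sketch of how the CLI arguments above could map onto peft/bitsandbytes
# configuration (illustrative; finetune.py defines the real setup).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(             # --quant4bit
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b-instruct", quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,                                    # --lora_r
    lora_alpha=32,                           # --lora_alpha
    target_modules=["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()           # only the LoRA adapters are trainable
```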
After training, the model and LoRA adapters will be saved to the specified output directory:
falcon7b_lora_output/
├── adapter_config.json
├── adapter_model.bin
├── tokenizer_config.json
└── tokenizer.json
These can be loaded at inference time using:
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b-instruct")
model = PeftModel.from_pretrained(model, "falcon7b_lora_output")
tokenizer = AutoTokenizer.from_pretrained("falcon7b_lora_output")
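Continuing from the snippet above, a quick sanity check of the adapted model (the prompt and decoding settings here are arbitrary):

```python
# Generate with the LoRA-adapted model (illustrative prompt and settings).
prompt = "Rephrase this sentence: The cat sat on the mat."
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False,
                            pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```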
All of our fine-tuned models are available on the ARNES cluster at /d/hpc/projects/onj_fri/gk40784/pt2/output/
Evaluate output stability across prompt variants using the Prompt Sensitivity Index. Supports both base and fine-tuned models, and allows switching between prompting techniques.
From the base folder of the repository, run the following:
cd code/POSIX
pip install -r requirements.txt
export TASK_ID=0
export BATCH_SIZE=5
export TECHNIQUE=None
export MODEL_ID="tiiuae/falcon-rw-1b"
python Posix_script.py
To use a prompting strategy like Chain of Thought:
export TECHNIQUE=chain_of_thought
To use a fine-tuned model:
export FINETUNE_FLAG=1
export FINETUNE_PATH=models/falcon7b_lora_output/
The script is SLURM-array-ready for parallel POSIX computation:
sbatch posix_general.slurm
Each job generates a .csv file with the generated responses, formatted prompts, and POSIX scores (overall and per perturbation type). The script also prints the average POSIX score for that batch.
We use a judge model (e.g. Mistral-7B-Instruct) to evaluate how semantically similar the candidate responses (produced by the target models) are to gold references. This helps measure output consistency and quality; a minimal judge-prompt sketch follows the file overview below.
- Input: Reference responses generated by the judge model and target responses from the models you want to evaluate.
- Output: CSV files with similarity scores by technique and perturbation, plus summary plots and tables.
- Gold files:
posix_{technique}_{judge_model_short}_merged.csv
e.g. posix_chain_of_thought_mistralai_Mistral-7B-Instruct-v0.1_merged.csv
- Target files:
posix_{technique}_{target_model_short}_merged.csv
e.g.
posix_chain_of_thought_tiiuae_falcon-7b-instruct_merged.csv
- Output files:
  judged_scores_{technique}_{target_model_short}.csv
  judged_scores_all.csv
  summary_scores_table.csv
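For intuition, the judging step boils down to prompting the judge model to compare a reference answer with a candidate answer and emit a score. A minimal, hypothetical prompt-construction sketch (the actual prompt wording and scoring scale are defined in the evaluation script and may differ):

```python
# Hypothetical judge-prompt template (illustrative; the real prompt and scale
# live in the LLM-as-judge evaluation script).
JUDGE_TEMPLATE = (
    "You are an impartial judge. Rate how semantically similar the candidate answer "
    "is to the reference answer on a scale from 1 (unrelated) to 5 (equivalent). "
    "Reply with the number only.\n\n"
    "Prompt: {prompt}\nReference answer: {reference}\nCandidate answer: {candidate}\nScore:"
)

def build_judge_prompt(prompt: str, reference: str, candidate: str) -> str:
    return JUDGE_TEMPLATE.format(prompt=prompt, reference=reference, candidate=candidate)
```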
Navigate to the folder containing the merged POSIX result files (on our cluster):
cd /d/hpc/projects/FRI/ma76193
Run the script with:
python judge_model_eval.py \
--base_dir /d/hpc/projects/FRI/ma76193 \
--judge_model mistralai/Mistral-7B-Instruct-v0.1 \
--target_model tiiuae/falcon-7b-instruct
--judge_model can be any HF-compatible judge model (e.g. mistralai/Mistral-7B-Instruct-v0.1).
--target_model can be any model you want to evaluate.
How to Run on SLURM HPC
Use the provided SLURM batch file to run the evaluation on a GPU node.
Example:
sbatch run_alpaca_only.sh mistralai/Mistral-7B-Instruct-v0.1 tiiuae/falcon-7b-instruct
- arg1: Judge model name or path.
- arg2: Target model name or path.