Natural language processing course: Improving Prompt Sensitivity in Large Language Models

This repository contains the codebase, data, models, and report for our project on improving prompt sensitivity in Large Language Models (LLMs). We explore both inference-time prompting strategies and training-time fine-tuning methods to reduce variability in outputs across semantically equivalent prompt variants. This is Project 3 of the Natural Language Processing course for the academic year 2024/2025.

Table of Contents

  • Team
  • Project Overview
  • Models Used
  • Evaluation Methods
  • Repository Structure
  • Reproducibility

Team

Gonçalo Cardoso, Gopika Krishnan, Ali Muhammad

Project Overview

Prompt sensitivity refers to the variability in model responses when given different prompts with the same underlying meaning. This project investigates the issue through:

Inference-Time Prompting Techniques

  • Vanilla prompting
  • Chain of Thought
  • Self-Refinement
  • Self-Consistency
  • Iterative Refinement
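
To make these strategies concrete, below is a minimal self-consistency sketch: sample several answers at non-zero temperature and keep the majority vote. It is illustrative only (not our Posix_script.py implementation), and the model name and sampling settings are placeholders:

from collections import Counter
from transformers import pipeline

def self_consistent_answer(prompt, model_id="tiiuae/falcon-rw-1b", n_samples=5):
    # Sample several independent generations and return the most frequent final answer.
    generator = pipeline("text-generation", model=model_id)
    answers = []
    for _ in range(n_samples):
        text = generator(prompt, do_sample=True, temperature=0.7,
                         max_new_tokens=128, return_full_text=False)[0]["generated_text"]
        # Crude answer extraction: take the last non-empty line of the generation.
        lines = [line for line in text.strip().splitlines() if line.strip()]
        answers.append(lines[-1] if lines else "")
    return Counter(answers).most_common(1)[0][0]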

Training-Time Fine-Tuning

  • Fine-tuning LLMs using Low-Rank Adaptation (LoRA) with Parameter-Efficient Fine-Tuning (PEFT) on prompt variant groups that share the same intended output.
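
As a minimal sketch of this training-time setup, the snippet below attaches LoRA adapters to a base model via PEFT; the module names and hyperparameters here are illustrative defaults (the exact settings we used are shown in step 3 of the reproducibility guide):

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-rw-1b")
lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["query_key_value"],  # fused attention projection in Falcon-style models
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the low-rank adapter weights are trainable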

Models Used

We experimented with multiple open-weight Large Language Models (LLMs) at the 7B scale for both inference and fine-tuning:

Inference-Time Evaluation

  • Mistral-7B
  • Falcon-7B Instruct
  • LLaMA-2 7B

Fine-Tuning (PEFT with LoRA)

  • Falcon-7B
  • Falcon-RW-1B
  • LLaMA-2 7B
  • Mistral-7B

Evaluation Methods

  • Prompt Sensitivity Index (POSIX): Quantifies output variability across prompt perturbations.
  • AlpacaEval: Uses a Large Language Model (LLM) as a judge to assess output consistency and quality.
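
As a rough intuition for POSIX (a simplified sketch, not our Posix_script.py implementation): for a group of intent-preserving prompt variants, it averages how much the length-normalised log-likelihood of each generated response changes when that response is re-scored under the other prompts in the group. Assuming those log-likelihoods are already computed:

import itertools

def posix_like_score(log_probs):
    # log_probs[i][j]: length-normalised log-likelihood of response y_j
    # (generated for prompt j) when conditioned on prompt i.
    n = len(log_probs)
    diffs = [abs(log_probs[i][j] - log_probs[j][j])
             for i, j in itertools.permutations(range(n), 2)]
    return sum(diffs) / len(diffs)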

Repository Structure

The key components of this repository are organized as follows:

├── data/
│   ├── alpaca_prompts.json       # Original POSIX and Alpaca-style prompt variant data
│   └── Posix_results             # POSIX scores and all outputs generated by our models
│
├── models/                       # Fine-tuned LoRA weights and saved checkpoints
│
├── code/
│   ├── lora-llm-finetune/        # Training-time methods (LoRA, PEFT)
│   │   ├── requirements.txt      # Pinned dependencies for environment setup
│   │   ├── finetune.py           # Fine-tuning on grouped prompt variants
│   │   └── run_finetune.slurm    # SLURM script for running LoRA fine-tuning on HPC
│   │
│   ├── POSIX/                    # Inference-time prompting + POSIX scoring
│   │   ├── requirements.txt      # Pinned dependencies for environment setup
│   │   ├── Posix_script.py       # Runs inference with CoT, Self-Consistency, etc.
│   │   ├── posix_general.slurm   # SLURM job script for POSIX evaluation
│   │   ├── sft_train.jsonl       # Training set for fine-tuning
│   │   ├── sft_test.jsonl        # Test set for evaluation
│   │   └── Posix_analysis.ipynb  # Jupyter notebook analysing results
│   │
│   └── llm-as-judge/             # LLM-as-a-judge evaluation framework
│       ├── requirements.txt      # Pinned dependencies for environment setup
│       ├── eval_judge.py         # Uses an LLM to judge consistency/quality of outputs
│       └── submit_alpaca_eval.slurm  # SLURM script for LLM-as-judge evaluation
│
├── report.pdf                    # Final report of the project
│
└── README.md                     # Project overview, reproducibility guide

Reproducibility

This project was designed and tested on a High-Performance Computing (HPC) environment using the ARNES SLURM-based cluster. While local execution is possible for smaller models or testing, we recommend running on an HPC system with SLURM for full-scale fine-tuning and evaluation.

To reproduce our results from scratch:

1. Set up the environment

We recommend Python 3.10+ and a fresh virtual (or conda) environment. First clone the repository; each component has its own requirements.txt, which is installed in the corresponding step below:

git clone https://github.com/UL-FRI-NLP-Course/ul-fri-nlp-course-project-2024-2025-nlpizza.git
cd ul-fri-nlp-course-project-2024-2025-nlpizza

2. Prepare the data

The code/POSIX/ directory already contains annotated and LLM-extended prompt variant groups with named perturbation types and target outputs: sft_train.jsonl and sft_test.jsonl.

3. Fine-tune the model with LoRA

To fine-tune a model such as Falcon-7B using LoRA, run:

cd code/lora-llm-finetune
conda create -n finetune_env python=3.10 -y
conda activate finetune_env
pip install -r requirements.txt
sbatch run_finetune.slurm tiiuae/falcon-7b q_proj,k_proj,v_proj,o_proj

This command loads tiiuae/falcon-7b from Hugging Face and applies LoRA to the listed attention modules (q_proj,k_proj,v_proj,o_proj) to fine-tune on the prompt variant groups. To use a smaller model, pass e.g. tiiuae/falcon-rw-1b instead.

For the fine-tuning, the expected dataset is a .jsonl file with the following fields per line:

  • instruction: the input prompt
  • output: the corresponding target response
  • group_id: identifier for a group of semantically similar prompts

For example, one line of the file looks like this:
{
  "instruction": "Rephrase this sentence: The cat sat on the mat.",
  "output": "The feline rested on the rug.",
  "group_id": 42
}

The SLURM script above uses the dataset that has already been prepared in this format (code/POSIX/sft_train.jsonl).

The script is configurable with arguments for model name, dataset path, LoRA settings, number of groups to sample, and more; you can edit run_finetune.slurm to change the arguments it passes, or call finetune.py directly. Here's an example of training the Falcon-7B-Instruct model:

python finetune.py \
  --model_name "tiiuae/Falcon-7B-Instruct" \
  --dataset_path data/sft_train.jsonl \
  --output_dir falcon7b_lora_output \
  --quant4bit \
  --lora_r 16 \
  --lora_alpha 32 \
  --target_modules "query_key_value,dense,dense_h_to_4h,dense_4h_to_h"

After training, the model and LoRA adapters will be saved to the specified output directory:

falcon7b_lora_output/
├── adapter_config.json
├── adapter_model.bin
├── tokenizer_config.json
├── tokenizer.json

These can be loaded at inference time using:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

model = AutoModelForCausalLM.from_pretrained("tiiuae/Falcon-7B-Instruct")  # base model
model = PeftModel.from_pretrained(model, "falcon7b_lora_output")           # attach the LoRA adapters
tokenizer = AutoTokenizer.from_pretrained("falcon7b_lora_output")          # tokenizer saved with the adapters
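
Continuing the snippet above, the adapted model can then be used for generation like any other causal LM (the prompt below is just an example):

inputs = tokenizer("Rephrase this sentence: The cat sat on the mat.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)  # greedy decoding by default
print(tokenizer.decode(outputs[0], skip_special_tokens=True))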

All our fine-tuned models are available on the HPC cluster at /d/hpc/projects/onj_fri/gk40784/pt2/output/

4. Evaluation with POSIX

Evaluate output stability across prompt variants using the Prompt Sensitivity Index (POSIX). The script supports both base and fine-tuned models and allows switching between prompting techniques.

Option A: Local Run

From the base folder of the repository, run the following:

cd code/POSIX
pip install -r requirements.txt
export TASK_ID=0
export BATCH_SIZE=5
export TECHNIQUE=None
export MODEL_ID="tiiuae/falcon-rw-1b"
python Posix_script.py

To use a prompting strategy like Chain of Thought:

export TECHNIQUE=chain_of_thought

To use a fine-tuned model:

export FINETUNE_FLAG=1
export FINETUNE_PATH=models/falcon7b_lora_output/

Option B: Run on SLURM HPC (recommended for full evaluation)

The script is SLURM-array-ready for parallel POSIX computation:

sbatch posix_general.slurm

Each job generates a .csv file with generated responses, formatted prompts, and POSIX scores (overall, per type). The script will also print the average POSIX score for that batch.
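
To aggregate the array results, the per-batch CSVs can simply be concatenated and averaged; a rough pandas sketch is shown below (the file pattern and the posix_score column name here are assumptions; see Posix_analysis.ipynb for the actual analysis):

import glob
import pandas as pd

frames = []
for path in glob.glob("posix_*.csv"):   # hypothetical output file pattern
    frame = pd.read_csv(path)
    frame["source_file"] = path         # keep track of technique/model via the file name
    frames.append(frame)
df = pd.concat(frames, ignore_index=True)
print(df.groupby("source_file")["posix_score"].mean())  # "posix_score" column name is an assumption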

5. Evaluation with LLM-as-a-Judge

We use a judge model (e.g. Mistral-7B-Instruct) to evaluate how semantically similar the candidate responses (produced by target models) are to gold references. This helps measure output consistency and quality.

How It Works

  • Input: Reference responses generated by the judge model and target responses from the models you want to evaluate.
  • Output: CSV files with similarity scores by technique and perturbation, plus summary plots and tables.
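
For intuition, the judging step boils down to prompting the judge model to rate how close a candidate response is to its reference; a minimal sketch follows (illustrative only; the actual prompts and parsing live in the project's judge script):

import re
from transformers import pipeline

judge = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.1")

def judge_similarity(reference, candidate):
    # Ask the judge for a 1-10 similarity score and parse the first number it returns.
    prompt = ("Rate on a scale of 1-10 how semantically similar the candidate response "
              f"is to the reference.\nReference: {reference}\nCandidate: {candidate}\nScore:")
    out = judge(prompt, max_new_tokens=5, do_sample=False,
                return_full_text=False)[0]["generated_text"]
    match = re.search(r"\d+", out)
    return int(match.group()) if match else None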

File Naming Conventions

  • Gold files:
    posix_{technique}_{judge_model_short}_merged.csv

e.g. posix_chain_of_thought_mistralai_Mistral-7B-Instruct-v0.1_merged.csv

  • Target files:
    posix_{technique}_{target_model_short}_merged.csv

e.g.
posix_chain_of_thought_tiiuae_falcon-7b-instruct_merged.csv

  • Output files:
    judged_scores_{technique}_{target_model_short}.csv judged_scores_all.csv summary_scores_table.csv

How to Run (Command Line)

Navigate to the folder containing the merged CSV files (in our case, our project folder on the cluster):

cd /d/hpc/projects/FRI/ma76193

Run the script with:

python judge_model_eval.py \
  --base_dir /d/hpc/projects/FRI/ma76193 \
  --judge_model mistralai/Mistral-7B-Instruct-v0.1 \
  --target_model tiiuae/falcon-7b-instruct

  • --judge_model: any HF-compatible judge model (e.g. mistralai/Mistral-7B-Instruct-v0.1).
  • --target_model: any model you want to evaluate.

How to Run on SLURM HPC

Use the provided SLURM batch file to run the evaluation on a GPU node.

Example:

sbatch run_alpaca_only.sh mistralai/Mistral-7B-Instruct-v0.1 tiiuae/falcon-7b-instruct

  • arg1: judge model name or path.
  • arg2: target model name or path.
