LLM Context vs. Training Experiment

This project tests how different Large Language Models (LLMs) handle various types of context materials when answering questions. The experiment systematically evaluates whether models rely on their training knowledge or properly use provided context information when answering domain-specific questions.

Overview

The experiment presents LLMs with questions about a specific domain (fire safety) and provides different types of context materials:

  • Complete and correct materials

  • Modified materials with information removed

  • Materials containing intentional errors

  • No relevant context

By analyzing how models respond to these different context conditions, we can determine:

  • How well models use provided context information

  • Whether models can detect and correct errors in context

  • If models fall back to training knowledge when context is insufficient

  • How models handle contradictory information

Installation and Requirements

Prerequisites

  • Python 3.8 or higher

  • pip (Python package manager)

  • An OpenRouter API key (for accessing various LLM models)

Dependencies

The project requires the following Python libraries:

  • rdflib (≥6.0.0): For working with RDF data

  • openai (≥1.0.0): For OpenAI API integration

  • requests (≥2.31.0): For HTTP requests

  • python-dotenv (≥1.0.0): For loading environment variables

  • matplotlib (≥3.5.0): For visualizing results

  • pandas (≥1.3.0): For data analysis

  • sparqlwrapper (≥2.0.0): For SPARQL queries
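
These pins are what the repository's requirements.txt (installed in step 3 below) should contain; an equivalent file would look roughly like this:

```
rdflib>=6.0.0
openai>=1.0.0
requests>=2.31.0
python-dotenv>=1.0.0
matplotlib>=3.5.0
pandas>=1.3.0
sparqlwrapper>=2.0.0
```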

Installation Steps

  1. Clone the repository:
    ```
    git clone https://github.com/WSE-research/LLM-context-vs-training-experiment.git
    cd LLM-context-vs-training-experiment
    ```

  2. Create and activate a virtual environment (recommended):
    ```
    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
    ```

  3. Install the required dependencies:
    ```
    pip install -r requirements.txt
    ```

  4. Set up your API key:

    The project requires an OpenRouter API key to access various LLM models. You can configure this by creating a `.env` file in the project root directory containing your API key:
    ```
    OPENROUTER_API_KEY=your_openrouter_api_key
    ```
    A template `.env.example` file is provided in the repository that you can use as a starting point.
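
With the key in place, python-dotenv can load it at runtime. The following is only an illustrative sketch (not code from this repository) of how the key could be used with the openai client against OpenRouter's OpenAI-compatible endpoint; the base URL is an assumption:

```
import os

from dotenv import load_dotenv
from openai import OpenAI

# Reads OPENROUTER_API_KEY from the .env file in the project root
load_dotenv()

# Illustrative only: OpenRouter exposes an OpenAI-compatible API.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",  # assumed endpoint, not defined by this repository
    api_key=os.environ["OPENROUTER_API_KEY"],
)

response = client.chat.completions.create(
    model="google/gemini-2.0-flash-lite-001",  # the experiment's default model
    messages=[{"role": "user", "content": "Answer using only the provided material."}],
)
print(response.choices[0].message.content)
```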

Data Structure

The experiment uses RDF (Resource Description Framework) data stored in Turtle (.ttl) format, following semantic web principles with a clear separation between schema and instance data.

Schema

The schema defines the ontological structure of the experiment, including classes, properties, and their relationships:

  • schema/consolidated_schema.ttl: Complete schema definition

  • schema/shacl_constraints.ttl: SHACL constraints for data validation
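
The SHACL constraints can be checked against the data with a standard validator such as pySHACL. A minimal sketch, assuming the pyshacl package is installed separately (it is not among the listed dependencies):

```
from rdflib import Graph
from pyshacl import validate  # pip install pyshacl

data = Graph()
data.parse("schema/consolidated_schema.ttl", format="turtle")
data.parse("data/answers.ttl", format="turtle")

shapes = Graph()
shapes.parse("schema/shacl_constraints.ttl", format="turtle")

# validate() returns (conforms, results_graph, results_text)
conforms, _, report = validate(data, shacl_graph=shapes, inference="rdfs")
print("Conforms:", conforms)
if not conforms:
    print(report)
```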

Key classes in the schema include:

  • Question: Represents a question with text, category, and expected answers

  • Answer: Represents a model’s response to a question

  • Material: Represents learning materials with different sections

  • ValidationResult: Represents the validation of an answer

  • LeakageAnalysis: Analyzes whether an answer is based on training knowledge
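
A quick way to see how many instances of each class a given set of files contains is to count rdf:type triples with rdflib; the sqare: namespace below is taken from the SPARQL example further down and is otherwise an assumption:

```
from rdflib import Graph, Namespace, RDF

SQARE = Namespace("http://purl.org/sqare#")  # prefix as used in the SPARQL example below

g = Graph()
g.parse("schema/consolidated_schema.ttl", format="turtle")
g.parse("data/questions.ttl", format="turtle")
g.parse("data/materials.ttl", format="turtle")
g.parse("data/answers.ttl", format="turtle")

# Count the instances of each key class
for cls in ("Question", "Answer", "Material", "ValidationResult", "LeakageAnalysis"):
    count = sum(1 for _ in g.subjects(RDF.type, SQARE[cls]))
    print(f"{cls}: {count}")
```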

Data Files

The experiment uses the following data files that conform to the schema:

  • data/questions.ttl - Questions in German

  • data/questions_en.ttl - Questions in English

  • data/materials.ttl - Learning materials in German

  • data/materials_en.ttl - Learning materials in English

  • data/answers.ttl - Answers collected during experiments

Material Types

The experiment tests how models handle different types of context materials:

  • complete: Complete and correct learning materials

  • modified: Modified materials with some information removed

  • errors: Materials with intentional errors

  • no_context: No relevant context provided

Question Types

The project focuses on the following question type:

  • multiple_choice: Questions with predefined answer options

Running the Experiment

Basic Usage

To run the full experiment with all questions, all material types, and the default Google Gemini model:

```
python run_experiment_rdf.py
```

This will process the data according to the schema and generate new answer instances.

Options

The experiment script supports several command-line options:

  • --test: Run in test mode with a smaller subset of questions

  • --questions-per-category N: Use N questions per category in test mode

  • --material-types TYPE1 TYPE2 ...: Specify which material types to test (complete, modified, errors, contradictions, no_context)

  • --question-types TYPE1 TYPE2 ...: Specify which question types to test (multiple_choice, free_text, list, yes_no)

  • --language LANG: Language to use (default: de)

  • --model MODEL: Model to use (default: google/gemini-2.0-flash-lite-001)

Example

Run a test experiment with 2 questions per category, using only complete and error materials, with the Gemini Pro model:

```
python run_experiment_rdf.py --test --questions-per-category 2 --material-types complete errors --model google/gemini-2.0-pro-001
```

Results and Analysis

Output Formats

The experiment results are saved in two formats:

  1. JSON logfile: results_{model}_{timestamp}.json - Contains detailed results for analysis

  2. RDF data: data/answers.ttl - Contains all answer instances in RDF format
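
Since pandas is already a dependency, the JSON log can be loaded into a DataFrame for ad-hoc analysis. The layout of the records is not documented here, so the filename and the commented-out field names below are hypothetical:

```
import json

import pandas as pd

# Hypothetical filename; substitute the actual results_{model}_{timestamp}.json file
with open("results/results_model_timestamp.json") as f:
    records = json.load(f)

# Assumes the file holds a list of per-answer records; adjust to the real structure
df = pd.json_normalize(records)
print(df.head())

# Example of a per-material-type breakdown, with hypothetical column names:
# print(df.groupby("material_type")["is_correct"].mean())
```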

Analysis Features

The experiment includes several analysis features:

  • Validation of answers against expected responses defined in the schema

  • Detection of training knowledge leakage

  • Error detection analysis

  • Statistical analysis of model performance across different material types

Visualization

Results can be visualized using the included scripts:

```
python scripts/visualize_results.py results/results_{model}_{timestamp}.json
```

Schema Visualization

The RDF schema can be visualized using the schema visualizer:

```
python schema_visualizer/visualize_schema.py
```

This generates a graphical representation of the schema to help understand the data model.

SPARQL Querying

The RDF knowledge base can be queried using SPARQL:

```
from rdflib import Graph

g = Graph()
g.parse("data/answers.ttl", format="turtle")
g.parse("schema/consolidated_schema.ttl", format="turtle")

# Example SPARQL query to find all answers that replicated errors
results = g.query("""
    PREFIX sqare: <http://purl.org/sqare#>
    SELECT ?answer ?question ?model
    WHERE {
        ?validation a sqare:ValidationResult ;
                    sqare:category sqare:ErrorReplicated ;
                    ^sqare:validationResult ?answer .
        ?answer sqare:givenFor ?question ;
                sqare:usedModel ?model .
    }
""")
```

Contributing

Contributions to both the schema and data are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the Apache License, Version 2.0. See the LICENSE file in the repository for the full license text.
