This project tests how different Large Language Models (LLMs) handle various types of context materials when answering questions. The experiment systematically evaluates whether models rely on their training knowledge or properly use provided context information when answering domain-specific questions.
The experiment presents LLMs with questions about a specific domain (fire safety) and provides different types of context materials:
- Complete and correct materials
- Modified materials with information removed
- Materials containing intentional errors
- No relevant context
By analyzing how models respond to these different context conditions, we can determine:
- How well models use provided context information
- Whether models can detect and correct errors in context
- Whether models fall back to training knowledge when context is insufficient
- How models handle contradictory information
To run the experiment, you will need:

- Python 3.8 or higher
- pip (Python package manager)
- An OpenRouter API key (for accessing various LLM models)
The project requires the following Python libraries:
- `rdflib` (≥6.0.0): For working with RDF data
- `openai` (≥1.0.0): For OpenAI API integration
- `requests` (≥2.31.0): For HTTP requests
- `python-dotenv` (≥1.0.0): For loading environment variables
- `matplotlib` (≥3.5.0): For visualizing results
- `pandas` (≥1.3.0): For data analysis
- `sparqlwrapper` (≥2.0.0): For SPARQL queries
1. Clone the repository:

   ```
   git clone https://github.com/WSE-research/LLM-context-vs-training-experiment.git
   cd LLM-context-vs-training-experiment
   ```

2. Create and activate a virtual environment (recommended):

   ```
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install the required dependencies:

   ```
   pip install -r requirements.txt
   ```

4. Set up your API key:

   The project requires an OpenRouter API key to access various LLM models. Configure it by creating a `.env` file in the project root directory containing your API key:

   ```
   OPENROUTER_API_KEY=your_openrouter_api_key
   ```
A template `.env.example` file is provided in the repository that you can use as a starting point.
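The experiment scripts presumably read the key via `python-dotenv`, which is listed among the dependencies. Below is a minimal sketch for checking that your `.env` file is picked up before starting a run; the check itself is not part of the project:

```python
import os

from dotenv import load_dotenv

# Load variables from the .env file in the project root into the environment
load_dotenv()

api_key = os.getenv("OPENROUTER_API_KEY")
if not api_key:
    raise SystemExit("OPENROUTER_API_KEY is not set - check your .env file")
print(f"API key loaded ({len(api_key)} characters)")
```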
The experiment uses RDF (Resource Description Framework) data stored in Turtle (.ttl) format, following semantic web principles with a clear separation between schema and instance data.
The schema defines the ontological structure of the experiment, including classes, properties, and their relationships:
- `schema/consolidated_schema.ttl`: Complete schema definition
- `schema/shacl_constraints.ttl`: SHACL constraints for data validation
Key classes in the schema include:
- `Question`: Represents a question with text, category, and expected answers
- `Answer`: Represents a model’s response to a question
- `Material`: Represents learning materials with different sections
- `ValidationResult`: Represents the validation of an answer
- `LeakageAnalysis`: Analyzes whether an answer is based on training knowledge
The experiment uses the following data files that conform to the schema:
- `data/questions.ttl`: Questions in German
- `data/questions_en.ttl`: Questions in English
- `data/materials.ttl`: Learning materials in German
- `data/materials_en.ttl`: Learning materials in English
- `data/answers.ttl`: Answers collected during experiments
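To get a quick overview of the instance data, you can load the data files and count instances per class. A small `rdflib` sketch using only the file names listed above:

```python
from collections import Counter

from rdflib import Graph
from rdflib.namespace import RDF

g = Graph()
for path in ["data/questions_en.ttl", "data/materials_en.ttl", "data/answers.ttl"]:
    g.parse(path, format="turtle")

# Count how many instances of each class the data files contain
counts = Counter(str(cls) for cls in g.objects(None, RDF.type))
for cls, n in counts.most_common():
    print(f"{n:5d}  {cls}")
```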
The experiment tests how models handle different types of context materials:
- `complete`: Complete and correct learning materials
- `modified`: Modified materials with some information removed
- `errors`: Materials with intentional errors
- `no_context`: No relevant context provided
To run the full experiment with all questions, all material types, and the default Google Gemini model:
```
python run_experiment_rdf.py
```
This will process the data according to the schema and generate new answer instances.
The experiment script supports several command-line options:
- `--test`: Run in test mode with a smaller subset of questions
- `--questions-per-category N`: Use N questions per category in test mode
- `--material-types TYPE1 TYPE2…`: Specify which material types to test (complete, modified, errors, contradictions, no_context)
- `--question-types TYPE1 TYPE2…`: Specify which question types to test (multiple_choice, free_text, list, yes_no)
- `--language LANG`: Language to use (default: de)
- `--model MODEL`: Model to use (default: google/gemini-2.0-flash-lite-001)
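For example, a test run restricted to two material types and English questions could be invoked like this (the specific values are illustrative):

```
python run_experiment_rdf.py --test --questions-per-category 2 \
    --material-types complete errors \
    --question-types multiple_choice free_text \
    --language en --model google/gemini-2.0-flash-lite-001
```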
The experiment results are saved in two formats:
- JSON logfile `results_{model}_{timestamp}.json`: Contains detailed results for analysis
- RDF data `data/answers.ttl`: Contains all answer instances in RDF format
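For ad-hoc analysis, the JSON logfile can be loaded directly. Since its internal structure is not documented here, the sketch below only locates the latest file and inspects its top level:

```python
import glob
import json

# File names follow results_{model}_{timestamp}.json; pick the most recent one
latest = sorted(glob.glob("results_*.json"))[-1]
with open(latest, encoding="utf-8") as f:
    results = json.load(f)

# Only inspect the top level, since the exact layout is not documented here
print(latest, type(results).__name__)
print(list(results)[:5] if isinstance(results, dict) else f"{len(results)} entries")
```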
The experiment includes several analysis features:
- Validation of answers against expected responses defined in the schema
- Detection of training knowledge leakage
- Error detection analysis
- Statistical analysis of model performance across different material types
The RDF schema can be visualized using the schema visualizer:
```
python schema_visualizer/visualize_schema.py
```
This generates a graphical representation of the schema to help understand the data model.
The RDF knowledge base can be queried using SPARQL:
```python
from rdflib import Graph

g = Graph()
g.parse("data/answers.ttl", format="turtle")
g.parse("schema/consolidated_schema.ttl", format="turtle")

# Example SPARQL query to find all answers that replicated errors
results = g.query("""
    PREFIX sqare: <http://purl.org/sqare#>
    SELECT ?answer ?question ?model
    WHERE {
        ?validation a sqare:ValidationResult ;
            sqare:category sqare:ErrorReplicated ;
            ^sqare:validationResult ?answer .
        ?answer sqare:givenFor ?question ;
            sqare:usedModel ?model .
    }
""")
```
Contributions to both the schema and data are welcome! Please feel free to submit a Pull Request.