This project tests how different Large Language Models (LLMs) handle various types of context materials when answering questions. The experiment systematically evaluates whether models rely on their training knowledge or properly use provided context information when answering domain-specific questions.
The experiment presents LLMs with questions about a specific domain (fire safety) and provides different types of context materials:
- Complete and correct materials
- Modified materials with information removed
- Materials containing intentional errors
- No relevant context
By analyzing how models respond to these different context conditions, we can determine:
- How well models use provided context information
- Whether models can detect and correct errors in context
- Whether models fall back to training knowledge when context is insufficient
- How models handle contradictory information
To run the experiment, you will need:

- Python 3.8 or higher
- pip (Python package manager)
- An OpenRouter API key (for accessing various LLM models)
The project requires the following Python libraries:
- `rdflib` (≥6.0.0): For working with RDF data
- `openai` (≥1.0.0): For OpenAI API integration
- `requests` (≥2.31.0): For HTTP requests
- `python-dotenv` (≥1.0.0): For loading environment variables
- `matplotlib` (≥3.5.0): For visualizing results
- `pandas` (≥1.3.0): For data analysis
- `sparqlwrapper` (≥2.0.0): For SPARQL queries
1. Clone the repository:

   ```
   git clone https://github.com/WSE-research/LLM-context-vs-training-experiment.git
   cd LLM-context-vs-training-experiment
   ```

2. Create and activate a virtual environment (recommended):

   ```
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install the required dependencies:

   ```
   pip install -r requirements.txt
   ```

4. Set up your API key:

   The project requires an OpenRouter API key to access various LLM models. Configure it by creating a `.env` file in the project root directory containing your API key:

   ```
   OPENROUTER_API_KEY=your_openrouter_api_key
   ```
A template `.env.example` file is provided in the repository that you can use as a starting point.
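The experiment scripts presumably read the key via `python-dotenv`, which is listed among the dependencies. Below is a minimal sketch for checking that your `.env` file is picked up before starting a run; the check itself is not part of the project:

```python
import os

from dotenv import load_dotenv

# Load variables from the .env file in the project root into the environment
load_dotenv()

api_key = os.getenv("OPENROUTER_API_KEY")
if not api_key:
    raise SystemExit("OPENROUTER_API_KEY is not set - check your .env file")
print(f"API key loaded ({len(api_key)} characters)")
```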
The experiment uses RDF (Resource Description Framework) data stored in Turtle (.ttl) format, following semantic web principles with a clear separation between schema and instance data.
The schema defines the ontological structure of the experiment, including classes, properties, and their relationships:
- `schema/consolidated_schema.ttl`: Complete schema definition
- `schema/shacl_constraints.ttl`: SHACL constraints for data validation
Key classes in the schema include:
- `Question`: Represents a question with text, category, and expected answers
- `Answer`: Represents a model’s response to a question
- `Material`: Represents learning materials with different sections
- `ValidationResult`: Represents the validation of an answer
- `LeakageAnalysis`: Analyzes whether an answer is based on training knowledge
The experiment uses the following data files that conform to the schema:
- `data/questions.ttl`: Questions in German
- `data/questions_en.ttl`: Questions in English
- `data/materials.ttl`: Learning materials in German
- `data/materials_en.ttl`: Learning materials in English
- `data/answers.ttl`: Answers collected during experiments
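To get a quick overview of the instance data, you can load the data files and count instances per class. A small `rdflib` sketch using only the file names listed above:

```python
from collections import Counter

from rdflib import Graph
from rdflib.namespace import RDF

g = Graph()
for path in ["data/questions_en.ttl", "data/materials_en.ttl", "data/answers.ttl"]:
    g.parse(path, format="turtle")

# Count how many instances of each class the data files contain
counts = Counter(str(cls) for cls in g.objects(None, RDF.type))
for cls, n in counts.most_common():
    print(f"{n:5d}  {cls}")
```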
The experiment tests how models handle different types of context materials:
- `complete`: Complete and correct learning materials
- `modified`: Modified materials with some information removed
- `errors`: Materials with intentional errors
- `no_context`: No relevant context provided
To run the full experiment with all questions, all material types, and the default Google Gemini model:
```
python run_experiment_rdf.py
```
This will process the data according to the schema and generate new answer instances.
The experiment script supports several command-line options:
- `--test`: Run in test mode with a smaller subset of questions
- `--questions-per-category N`: Use N questions per category in test mode
- `--material-types TYPE1 TYPE2…`: Specify which material types to test (complete, modified, errors, contradictions, no_context)
- `--question-types TYPE1 TYPE2…`: Specify which question types to test (multiple_choice, free_text, list, yes_no)
- `--language LANG`: Language to use (default: de)
- `--model MODEL`: Model to use (default: google/gemini-2.0-flash-lite-001)
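For example, a test run restricted to two material types and English questions could be invoked like this (the specific values are illustrative):

```
python run_experiment_rdf.py --test --questions-per-category 2 \
    --material-types complete errors \
    --question-types multiple_choice free_text \
    --language en --model google/gemini-2.0-flash-lite-001
```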
The experiment results are saved in two formats:
- JSON logfile `results_{model}_{timestamp}.json`: Contains detailed results for analysis
- RDF data `data/answers.ttl`: Contains all answer instances in RDF format
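For ad-hoc analysis, the JSON logfile can be loaded directly. Since its internal structure is not documented here, the sketch below only locates the latest file and inspects its top level:

```python
import glob
import json

# File names follow results_{model}_{timestamp}.json; pick the most recent one
latest = sorted(glob.glob("results_*.json"))[-1]
with open(latest, encoding="utf-8") as f:
    results = json.load(f)

# Only inspect the top level, since the exact layout is not documented here
print(latest, type(results).__name__)
print(list(results)[:5] if isinstance(results, dict) else f"{len(results)} entries")
```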
The experiment includes several analysis features:
- Validation of answers against expected responses defined in the schema
- Detection of training knowledge leakage
- Error detection analysis
- Statistical analysis of model performance across different material types
The RDF schema can be visualized using the schema visualizer:
```
python schema_visualizer/visualize_schema.py
```
This generates a graphical representation of the schema to help understand the data model.
The RDF knowledge base can be queried using SPARQL:
```python
from rdflib import Graph

g = Graph()
g.parse("data/answers.ttl", format="turtle")
g.parse("schema/consolidated_schema.ttl", format="turtle")

# Example SPARQL query to find all answers that replicated errors
results = g.query("""
    PREFIX sqare: <http://purl.org/sqare#>
    SELECT ?answer ?question ?model
    WHERE {
        ?validation a sqare:ValidationResult ;
            sqare:category sqare:ErrorReplicated ;
            ^sqare:validationResult ?answer .
        ?answer sqare:givenFor ?question ;
            sqare:usedModel ?model .
    }
""")
```
Contributions to both the schema and data are welcome! Please feel free to submit a Pull Request.