An automatic conversion system from natural language questions in French to SPARQL queries for querying historical knowledge graphs.
This project, developed as part of a Bachelor's Thesis at the University of Geneva, makes it possible to:
- Convert French questions to SPARQL queries
- Validate automatically generated queries
- Correct errors via a feedback system
- Evaluate performance with precise metrics
- Interact via a conversational web interface
Based on SIB work: This system builds on and extends the sparql-llm framework developed by the Swiss Institute of Bioinformatics, adapted for French historical data and a conversational interface.
Question: "Liste tous les registres et leur numéro." ("List all registers and their numbers.")
Automatic SPARQL query generation:
PREFIX rc: <http://purl.org/rcnum/onto/rc#>
SELECT ?registre ?numero WHERE {
SERVICE <http://localhost:7200/repositories/Calvin> {
?registre a rc:Registre ;
rc:no_registre ?numero .
}
}
- Python 3.9+
- GraphDB (>= 9.0)
- Qdrant Vector Database
- OpenAI API Key
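Before installing, a quick pre-flight check can confirm the interpreter meets the Python 3.9+ requirement. This is a minimal sketch, not part of the project itself:

```python
import sys

def check_python_version(minimum=(3, 9)):
    """Return True if the running interpreter meets the minimum version."""
    return sys.version_info[:2] >= minimum

if __name__ == "__main__":
    # The project requires Python 3.9 or newer.
    assert check_python_version(), "Python 3.9+ is required"
    print(f"Python {sys.version_info.major}.{sys.version_info.minor} OK")
```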
# Clone the repository
git clone https://gitlab.unige.ch/Filipe.Ramos/nl2sparql_calvin.git
cd nl2sparql_calvin
# Create a virtual environment
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
- GraphDB: start GraphDB on port 7200
- Qdrant:
# Docker
docker run -p 6333:6333 qdrant/qdrant
# Or local installation
# See: https://qdrant.tech/documentation/quick-start/
Copy the .env.example file to .env and configure:
# LLM Model
MODEL_NAME=xxxxx
OPENAI_API_KEY=your_openai_api_key
# Database
GRAPHDB_ENDPOINT=http://localhost:7200/repositories/Calvin
VOID_FILE=/path/to/void.ttl
# Qdrant
QDRANT_HOST=http://localhost:6333
QDRANT_COLLECTION_NAME=calvin-sparql-docs
# LangSmith (optional)
LANGSMITH_API_KEY=your_langsmith_key
LANGSMITH_PROJECT=nl2sparql_calvin
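For reference, the settings above are plain KEY=VALUE pairs. The sketch below is a minimal, stdlib-only stand-in for a .env loader (the project itself may rely on python-dotenv or another mechanism); `load_env_file` is a hypothetical helper, not part of the codebase:

```python
import os

def load_env_file(path=".env"):
    """Parse a simple KEY=VALUE .env file, ignoring comments and blank lines.

    Minimal stand-in for python-dotenv; the real project may load its
    configuration differently.
    """
    settings = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            settings[key.strip()] = value.strip()
    return settings

# Values can then be read with sensible fallbacks, e.g.:
# endpoint = settings.get("GRAPHDB_ENDPOINT", "http://localhost:7200/repositories/Calvin")
```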
# Compile SPARQL examples
bash compile_examples.sh
# Initialize vector database
python -c "from llm.index import check_collection; check_collection()"
# Start the Chainlit interface
bash ./run_ui.sh
# Access: http://localhost:8000
# Interactive command-line mode
python -m llm.app
from llm.chain_pipeline import build_chain
from sparql_llm.utils import get_prefixes_and_schema_for_endpoints
from llm.index import endpoints
# Initialization
prefixes_map, endpoints_void_dict = get_prefixes_and_schema_for_endpoints(endpoints)
chain = build_chain(skip_retriever=False)
# Usage
ctx = {
"prefixes_map": prefixes_map,
"endpoints_void_dict": endpoints_void_dict,
}
result = chain.invoke({
"question": "Your question here",
**ctx
})
nl2sparql_calvin/
├── llm/                             # Main LLM module
│   ├── chain_pipeline.py            # Processing pipeline
│   ├── calvin_sparql_validation.py  # SPARQL validation
│   ├── retriever.py                 # Context retrieval
│   └── index.py                     # Data management
├── UI/                              # User interface
│   ├── app.py                       # Chainlit application
│   └── chainlit.yaml                # UI configuration
├── benchmark/                       # Evaluation system
│   ├── eval.py                      # Evaluation pipeline
│   └── sparql/                      # Reference queries
├── folds/                           # Cross-validation data
│   └── fold_1/ ... fold_5/
└── Graph/                           # Data and ontologies
- User question → Interface
- Context retrieval → Qdrant vector database
- SPARQL generation → LLM model (GPT-4o)
- Validation → Syntax + Semantics (ShEx)
- Correction → Feedback if errors (max 3 attempts)
- Execution → GraphDB
- Results → User interface
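The generate → validate → correct loop above can be sketched as follows. `generate`, `validate`, and `correct` are hypothetical stand-ins for the real pipeline components (LLM call, syntax/ShEx validation, and feedback-driven correction), not the project's actual function names:

```python
def answer_question(question, generate, validate, correct, max_attempts=3):
    """Generate a SPARQL query, retrying with validator feedback up to max_attempts."""
    query = generate(question)
    for _ in range(max_attempts):
        errors = validate(query)
        if not errors:
            return query  # valid query, ready for execution against GraphDB
        # Feed the validation errors back so the LLM can correct the query.
        query = correct(query, errors)
    raise RuntimeError(f"No valid query after {max_attempts} attempts")
```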
- Fork the project
- Create a feature branch (`git checkout -b feature/new-feature`)
- Commit changes (`git commit -am 'Add new feature'`)
- Push to the branch (`git push origin feature/new-feature`)
- Create a Pull Request
- LLM Module: Main processing pipeline
- UI Module: User interface
- Benchmark Module: Evaluation system
This project is licensed under the MIT License. See the LICENSE file for details.
- Filipe Ramos - Main Developer - University of Geneva
- Marco Sorbi - Supervisor and Mentor - PhD research
- Laurent Moccozet - Bachelor's Thesis Director, Senior Lecturer (MER) - University of Geneva
- Laurent Moccozet - Senior Lecturer (MER), University of Geneva
  - Bachelor's Thesis supervision
  - Initiation and facilitation of collaboration with the history department
  - Pedagogical and scientific expertise throughout the project
- Marco Sorbi - University of Geneva
  - Technical supervision and daily mentoring
  - Expertise in SPARQL technologies
  - Continuous support for implementation and evaluation
- University of Geneva - University Computing Center and History Department
- Swiss Institute of Bioinformatics (SIB) - For their SPARQL tools and frameworks
- UNIGE Computer Science research team
- LangChain community
- OpenAI for access to GPT models
This project heavily relies on work and tools developed by the Swiss Institute of Bioinformatics (SIB):
- sparql-llm: Main framework for SPARQL-LLM integration
  - Used for: Base pipeline, endpoint management, query validation
  - Adapted modules: `SparqlEndpointLinks`, `SparqlExamplesLoader`, validation utilities
- sparql-examples-utils: SPARQL examples compilation and validation tools
  - Used for: Training examples generation, dataset compilation
- void-generator: VOID metadata generator
  - Used for: Automatic VOID schema generation for GraphDB
SIB tools have been adapted and extended for this project:
- Calvin Validation : Extension of SPARQL validation system with ShEx
- Multilingual Pipeline : Adaptation for French and historical data
- Conversational Interface : Integration with Chainlit for UI
- Evaluation Metrics : Specialized evaluation system with LangSmith
- Error Handling : Automatic feedback and correction system
This work uses and extends SPARQL-LLM tools developed by the Swiss Institute
of Bioinformatics (SIB), available at: https://github.com/sib-swiss/sparql-llm
Project developed as part of a Bachelor's Thesis
University of Geneva - 2025