NL2SPARQL Calvin

An automatic conversion system from natural language questions in French to SPARQL queries for querying historical knowledge graphs.

Python 3.9+ · License: MIT · LangChain

📋 Table of Contents

  • Overview
  • Usage Example
  • Installation
  • Configuration
  • Usage
  • Architecture
  • Contributing
  • License
  • Authors
  • Acknowledgments

🎯 Overview

This project, developed as part of a Bachelor's Thesis at the University of Geneva, makes it possible to:

  • Convert French questions to SPARQL queries
  • Validate automatically generated queries
  • Correct errors via a feedback system
  • Evaluate performance with precise metrics
  • Interact via a conversational web interface

🔬 Based on SIB work: This system builds on and extends the sparql-llm framework developed by the Swiss Institute of Bioinformatics, adapted for French historical data and a conversational interface.

Usage Example

โ“ Question: "Liste tous les registres et leur numรฉro."

🔄 Automatic SPARQL query generation:

PREFIX rc: <http://purl.org/rcnum/onto/rc#>
SELECT ?registre ?numero WHERE {
  SERVICE <http://localhost:7200/repositories/Calvin> {
    ?registre a rc:Registre ;
              rc:no_registre ?numero .
  }
}
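
Note how the generated query wraps the actual triple pattern in a SERVICE clause targeting the GraphDB repository. A minimal sketch of how such a federated wrapper can be assembled from a bare pattern (the helper name is illustrative, not part of the project's API; the endpoint URL matches the example above):

```python
# Sketch: wrap a bare graph pattern in a SERVICE clause targeting the
# GraphDB repository, as in the generated query above. `wrap_in_service`
# is a hypothetical helper, not part of the nl2sparql_calvin API.

GRAPHDB_ENDPOINT = "http://localhost:7200/repositories/Calvin"

def wrap_in_service(prefixes: str, select_vars: str, pattern: str,
                    endpoint: str = GRAPHDB_ENDPOINT) -> str:
    """Build a federated SPARQL query around a bare graph pattern."""
    return (
        f"{prefixes}\n"
        f"SELECT {select_vars} WHERE {{\n"
        f"  SERVICE <{endpoint}> {{\n"
        f"    {pattern}\n"
        f"  }}\n"
        f"}}"
    )

query = wrap_in_service(
    "PREFIX rc: <http://purl.org/rcnum/onto/rc#>",
    "?registre ?numero",
    "?registre a rc:Registre ;\n              rc:no_registre ?numero .",
)
```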

🚀 Installation

Prerequisites

  • Python 3.9+
  • GraphDB (>= 9.0)
  • Qdrant Vector Database
  • OpenAI API Key

Quick Installation

# Clone the repository
git clone https://gitlab.unige.ch/Filipe.Ramos/nl2sparql_calvin.git
cd nl2sparql_calvin

# Create a virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Required Services

  1. GraphDB:

Start GraphDB on port 7200.

  2. Qdrant:
# Docker
docker run -p 6333:6333 qdrant/qdrant

# Or local installation
# See: https://qdrant.tech/documentation/quick-start/

โš™๏ธ Configuration

Environment Variables

Copy the .env.example file to .env and configure:

# LLM Model
MODEL_NAME=xxxxx
OPENAI_API_KEY=your_openai_api_key

# Database
GRAPHDB_ENDPOINT=http://localhost:7200/repositories/Calvin
VOID_FILE=/path/to/void.ttl

# Qdrant
QDRANT_HOST=http://localhost:6333
QDRANT_COLLECTION_NAME=calvin-sparql-docs

# LangSmith (optional)
LANGSMITH_API_KEY=your_langsmith_key
LANGSMITH_PROJECT=nl2sparql_calvin
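
The pipeline reads these variables from the environment at startup. A minimal sketch of loading them with local defaults (only the variable names come from .env.example; the loader itself is illustrative, not the project's actual configuration code, and the model-name default is an assumption based on the GPT-4o mention below):

```python
import os

# Sketch: read the configuration above from the environment with local
# defaults. Variable names come from .env.example; this helper and the
# "gpt-4o" default are illustrative, not the project's config loader.

def load_config() -> dict:
    return {
        "model_name": os.environ.get("MODEL_NAME", "gpt-4o"),
        "openai_api_key": os.environ.get("OPENAI_API_KEY", ""),
        "graphdb_endpoint": os.environ.get(
            "GRAPHDB_ENDPOINT", "http://localhost:7200/repositories/Calvin"),
        "qdrant_host": os.environ.get("QDRANT_HOST", "http://localhost:6333"),
        "qdrant_collection": os.environ.get(
            "QDRANT_COLLECTION_NAME", "calvin-sparql-docs"),
    }

config = load_config()
```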

Data Initialization

# Compile SPARQL examples
bash compile_examples.sh

# Initialize vector database
python -c "from llm.index import check_collection; check_collection()"

🎮 Usage

Web Interface (Recommended)

# Start the Chainlit interface
bash ./run_ui.sh
# Access: http://localhost:8000

CLI Interface

# Interactive command-line mode
python -m llm.app

Programmatic API

from llm.chain_pipeline import build_chain
from sparql_llm.utils import get_prefixes_and_schema_for_endpoints
from llm.index import endpoints

# Initialization
prefixes_map, endpoints_void_dict = get_prefixes_and_schema_for_endpoints(endpoints)
chain = build_chain(skip_retriever=False)

# Usage
ctx = {
    "prefixes_map": prefixes_map,
    "endpoints_void_dict": endpoints_void_dict,
}

result = chain.invoke({
    "question": "Your question here",
    **ctx
})

๐Ÿ—๏ธ Architecture

nl2sparql_calvin/
├── llm/                    # Main LLM module
│   ├── chain_pipeline.py   # Processing pipeline
│   ├── calvin_sparql_validation.py  # SPARQL validation
│   ├── retriever.py        # Context retrieval
│   └── index.py            # Data management
├── UI/                     # User interface
│   ├── app.py              # Chainlit application
│   └── chainlit.yaml       # UI configuration
├── benchmark/              # Evaluation system
│   ├── eval.py             # Evaluation pipeline
│   └── sparql/             # Reference queries
├── folds/                  # Cross-validation data
│   └── fold_1/ ... fold_5/
└── Graph/                  # Data and ontologies

Processing Flow

  1. User question → Interface
  2. Context retrieval → Qdrant vector database
  3. SPARQL generation → LLM model (GPT-4o)
  4. Validation → Syntax + Semantics (ShEx)
  5. Correction → Feedback if errors (max 3 attempts)
  6. Execution → GraphDB
  7. Results → User interface
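
The correction loop in steps 3-5 can be sketched as a small retry routine. Here `generate` and `validate` are stand-ins for the real LLM call and the syntax/ShEx validator; only the max-3-attempts policy comes from the flow above:

```python
# Sketch of the generate -> validate -> correct loop (steps 3-5 above).
# `generate` and `validate` are stand-ins for the real LLM call and the
# syntax/ShEx validator; only the 3-attempt limit comes from the text.

MAX_ATTEMPTS = 3

def generate_with_feedback(question, generate, validate):
    feedback = None
    for attempt in range(1, MAX_ATTEMPTS + 1):
        query = generate(question, feedback)   # LLM call (stubbed here)
        errors = validate(query)               # [] when the query is valid
        if not errors:
            return query, attempt
        feedback = "\n".join(errors)           # fed into the next prompt
    raise RuntimeError(f"No valid query after {MAX_ATTEMPTS} attempts")

# Toy demonstration: the second attempt succeeds once feedback is provided.
def fake_generate(question, feedback):
    return "SELECT ?s WHERE { ?s ?p ?o }" if feedback else "SELEC ?s"

def fake_validate(query):
    return [] if query.startswith("SELECT") else ["syntax error near 'SELEC'"]

query, attempts = generate_with_feedback("demo", fake_generate, fake_validate)
```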

Contributing

  1. Fork the project
  2. Create a feature branch (git checkout -b feature/new-feature)
  3. Commit changes (git commit -am 'Add new feature')
  4. Push to branch (git push origin feature/new-feature)
  5. Create a Pull Request


📄 License

This project is licensed under the MIT License. See the LICENSE file for details.

👥 Authors

  • Filipe Ramos - Main Developer - University of Geneva
  • Marco Sorbi - Supervisor and Mentor - PhD Researcher
  • Laurent Moccozet - Bachelor's Thesis Director, Senior Lecturer (MER) - University of Geneva

๐Ÿ™ Acknowledgments

Supervision and Mentoring

  • Laurent Moccozet - Senior Lecturer (MER), University of Geneva

    • Bachelor's Thesis supervision
    • Initiation and facilitation of collaboration with the history department
    • Pedagogical and scientific expertise throughout the project
  • Marco Sorbi - University of Geneva

    • Technical supervision and daily mentoring
    • Expertise in SPARQL technologies
    • Continuous support for implementation and evaluation

Institutions and Teams

  • University of Geneva - University Computing Center and History Department
  • Swiss Institute of Bioinformatics (SIB) - For their SPARQL tools and frameworks
  • UNIGE Computer Science research team
  • LangChain community
  • OpenAI for access to GPT models

SIB Work and Tools Used

This project heavily relies on work and tools developed by the Swiss Institute of Bioinformatics (SIB):

  • sparql-llm: Main framework for SPARQL-LLM integration

    • Used for: Base pipeline, endpoint management, query validation
    • Adapted modules: SparqlEndpointLinks, SparqlExamplesLoader, validation utilities
  • sparql-examples-utils: SPARQL examples compilation and validation tools

    • Used for: Training examples generation, dataset compilation
  • void-generator: VOID metadata generator

    • Used for: Automatic VOID schema generation for GraphDB

Adaptations and Extensions

SIB tools have been adapted and extended for this project:

  • Calvin Validation : Extension of SPARQL validation system with ShEx
  • Multilingual Pipeline : Adaptation for French and historical data
  • Conversational Interface : Integration with Chainlit for UI
  • Evaluation Metrics : Specialized evaluation system with LangSmith
  • Error Handling : Automatic feedback and correction system

This work uses and extends SPARQL-LLM tools developed by the Swiss Institute of Bioinformatics (SIB), available at: https://github.com/sib-swiss/sparql-llm

Project developed as part of a Bachelor's Thesis
University of Geneva - 2025
