Semantic Table Search

A FastAPI-based service that enables semantic search capabilities for tabular datasets. The service supports multiple LLM and embedding providers, including Ollama, Groq, and SentenceTransformers, with ChromaDB for vector storage.

Features

Add, update, and delete datasets with descriptions and metadata
Semantic search across datasets using natural language queries
Vector-based similarity search using embeddings
Support for both basic and expanded search results
Real-time streaming search responses for better user experience
Multi-collection storage (descriptions, use cases, domains)
LLM-powered dataset description extraction and classification
Authorization scope filtering for dataset access control
Flexible LLM and embedding provider configuration

Prerequisites

Python 3.12.8 or higher
At least one of the following:
- Ollama installed and running (for local LLM and embeddings)
- Groq API access (for cloud-based LLM)
- SentenceTransformers (automatic fallback for embeddings)
ChromaDB (installed automatically via pip)

Environment Variables

Create a .env file in the project root with the following variables:

# ChromaDB Configuration
CHROMA_DIR=./chroma_storage

# LLM Provider Configuration
LLM_OPTION=ollama  # Options: "ollama", "groq"

# Ollama Configuration (if using LLM_OPTION=ollama)
OLLAMA_URL=http://localhost:11434
OLLAMA_MODEL=llama2

# Groq Configuration (if using LLM_OPTION=groq)
GROQ_URL=https://api.groq.com/openai/v1
GROQ_MODEL=llama3-8b-8192

# Embedding Provider Configuration
EMBEDDING_OPTION=ollama  # Options: "ollama", "sentencetransformers"

# Ollama Embedding Configuration (if using EMBEDDING_OPTION=ollama)
OLLAMA_EMBEDDING_MODEL=nomic-embed-text

# Note: If EMBEDDING_OPTION is not "ollama", the service will automatically use SentenceTransformers with "all-MiniLM-L6-v2"

Installation

Local Installation

Clone the repository:

git clone <repository-url>
cd semantic-table-search

Create and activate a virtual environment:

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install the required packages:

pip install -r requirements.txt

Running Locally

Ensure Ollama is running:

ollama serve

Start the FastAPI server:

uvicorn server:app --host 0.0.0.0 --port 8000

The API will be available at http://localhost:8000

Docker Installation

Build the Docker image:

docker build -t semantic-dataset-search .

Run the container:

docker run -p 8000:8000 \
  -e OLLAMA_URL=http://host.docker.internal:11434 \
  -e OLLAMA_EMBEDDING_MODEL=nomic-embed-text \
  -e OLLAMA_MODEL=llama2 \
  -e CHROMA_DIR=/app/chroma_storage \
  semantic-dataset-search

API Endpoints

Add Dataset

POST /add_dataset
{
    "dataset_id": "unique_id",
    "dataset_official_description": "Official description of the dataset",
    "dataset_profile_description": "Profile description from the tool",
    "dataset_metadata": {
        "key": "value"
    }
}

Update Dataset

PUT /update_dataset
{
    "dataset_id": "unique_id",
    "dataset_official_description": "Updated official description",
    "dataset_profile_description": "Updated profile description",
    "dataset_domain": "domain",
    "dataset_metadata": {
        "key": "value"
    }
}

Update Dataset Metadata

PUT /update_dataset_metadata
{
    "dataset_id": "unique_id",
    "dataset_metadata": {
        "key": "value"
    }
}

Delete Dataset

DELETE /delete_dataset
{
    "dataset_id": "unique_id"
}

Search Datasets

POST /search_datasets
{
    "query": "Your search query",
    "n_results": 5,
    "auth_scope": ["scope1", "scope2"]  # Optional: filter by authorization scopes
}

Search Datasets (Streaming)

POST /search_datasets_streaming
{
    "query": "Your search query",
    "n_results": 5,
    "auth_scope": ["scope1", "scope2"]  # Optional: filter by authorization scopes
}

Returns a streaming response with real-time progress updates:

text/event-stream format
Progress messages during query analysis and search
Final results with dataset IDs and relevance scores

Search Datasets (Expanded)

POST /search_datasets_expanded
{
    "query": "Your search query",
    "n_results": 5,
    "auth_scope": ["scope1", "scope2"]  # Optional: filter by authorization scopes
}

Returns complete dataset information including descriptions, use cases, and domains. Also available as a streaming endpoint with real-time progress updates.

API Documentation

Once the server is running, you can access the interactive API documentation at:

Swagger UI: http://localhost:8000/docs
ReDoc: http://localhost:8000/redoc

Development

The project structure:

├── server.py           # Main FastAPI application
├── models.py           # Pydantic models
├── prompts.py          # LLM prompt templates
├── requirements.txt    # Python dependencies
├── Dockerfile         # Docker configuration
├── .env              # Environment variables (create this)
├── .gitignore        # Git ignore rules
├── LICENSE           # Project license
└── notebooks/        # Test and development notebooks

Architecture

The service uses three ChromaDB collections for different aspects of dataset information:

dataset_descriptions: General descriptions of datasets
dataset_use_cases: Purpose and use case information
dataset_domains: Domain classification

Each dataset is processed through LLM chains to extract structured information from raw descriptions.

Streaming Responses

The service provides streaming endpoints for real-time search feedback:

Progress Updates: Real-time status messages during processing
Component Analysis: Shows how the query is broken down into description, purpose, and domain
Search Progress: Updates on candidate retrieval and scoring
Final Results: Complete search results with relevance scores

Notes

The service supports multiple LLM and embedding providers for flexibility
When using Ollama, ensure it's running and accessible
Groq provides faster cloud-based LLM processing
SentenceTransformers provides reliable local embedding generation
ChromaDB data is persisted in the directory specified by CHROMA_DIR
When running with Docker, ensure external services (Ollama/Groq) are accessible
The default port is 8000, but can be changed by modifying the uvicorn command
The service automatically creates ChromaDB collections on startup
Streaming endpoints provide better user experience for long-running searches
Authorization scopes enable fine-grained access control over datasets

Configuration Options

LLM Providers

The service supports multiple LLM providers:

Ollama (local): Set LLM_OPTION=ollama
- Requires Ollama server running locally
- Configure OLLAMA_URL and OLLAMA_MODEL
Groq (cloud): Set LLM_OPTION=groq
- Requires Groq API access
- Configure GROQ_API_KEY and GROQ_MODEL

Embedding Providers

The service supports multiple embedding providers:

Ollama (local): Set EMBEDDING_OPTION=ollama
- Requires Ollama server with embedding model
- Configure OLLAMA_EMBEDDING_MODEL
SentenceTransformers (default): Any other value or unset
- Uses local SentenceTransformers model
- Automatically downloads "all-MiniLM-L6-v2" model

Authorization Scopes

The service supports filtering datasets by authorization scopes:

Add auth_scope metadata when creating/updating datasets
Use auth_scope parameter in search requests to filter results
Only datasets matching the provided scopes will be returned

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Semantic Table Search

Features

Prerequisites

Environment Variables

Installation

Local Installation

Running Locally

Docker Installation

API Endpoints

Add Dataset

Update Dataset

Update Dataset Metadata

Delete Dataset

Search Datasets

Search Datasets (Streaming)

Search Datasets (Expanded)

API Documentation

Development

Architecture

Streaming Responses

Notes

Configuration Options

LLM Providers

Embedding Providers

Authorization Scopes

About

Uh oh!

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
.env.example		.env.example
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
models.py		models.py
prompts.py		prompts.py
requirements.txt		requirements.txt
server.py		server.py

License

stelar-eu/semantic-dataset-search

Folders and files

Latest commit

History

Repository files navigation

Semantic Table Search

Features

Prerequisites

Environment Variables

Installation

Local Installation

Running Locally

Docker Installation

API Endpoints

Add Dataset

Update Dataset

Update Dataset Metadata

Delete Dataset

Search Datasets

Search Datasets (Streaming)

Search Datasets (Expanded)

API Documentation

Development

Architecture

Streaming Responses

Notes

Configuration Options

LLM Providers

Embedding Providers

Authorization Scopes

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages