
Commit 3f8aafb

feat: add MCP server (#67)
1 parent 35fcded commit 3f8aafb

File tree: 7 files changed (+295, −43 lines)

README.md

Lines changed: 50 additions & 16 deletions
@@ -23,12 +23,14 @@ RAGLite is a Python toolkit for Retrieval-Augmented Generation (RAG) with Postgr
 - 🧬 Multi-vector chunk embedding with [late chunking](https://weaviate.io/blog/late-chunking) and [contextual chunk headings](https://d-star.ai/solving-the-out-of-context-chunk-problem-for-rag)
 - ✂️ Optimal [level 4 semantic chunking](https://medium.com/@anuragmishra_27746/five-levels-of-chunking-strategies-in-rag-notes-from-gregs-video-7b735895694d) by solving a [binary integer programming problem](https://en.wikipedia.org/wiki/Integer_programming)
 - 🔍 [Hybrid search](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) with the database's native keyword & vector search ([tsvector](https://www.postgresql.org/docs/current/datatype-textsearch.html)+[pgvector](https://github.com/pgvector/pgvector), [FTS5](https://www.sqlite.org/fts5.html)+[sqlite-vec](https://github.com/asg017/sqlite-vec)[^1])
+- 💭 [Adaptive retrieval](https://arxiv.org/abs/2403.14403) where the LLM decides whether and what to retrieve based on the query
 - 💰 Improved cost and latency with a [prompt caching-aware message array structure](https://platform.openai.com/docs/guides/prompt-caching)
 - 🍰 Improved output quality with [Anthropic's long-context prompt format](https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/long-context-tips)
 - 🌀 Optimal [closed-form linear query adapter](src/raglite/_query_adapter.py) by solving an [orthogonal Procrustes problem](https://en.wikipedia.org/wiki/Orthogonal_Procrustes_problem)

 ##### Extensible

+- 🔌 A built-in [Model Context Protocol](https://modelcontextprotocol.io) (MCP) server that any MCP client like [Claude desktop](https://claude.ai/download) can connect with
 - 💬 Optional customizable ChatGPT-like frontend for [web](https://docs.chainlit.io/deploy/copilot), [Slack](https://docs.chainlit.io/deploy/slack), and [Teams](https://docs.chainlit.io/deploy/teams) with [Chainlit](https://github.com/Chainlit/chainlit)
 - ✍️ Optional conversion of any input document to Markdown with [Pandoc](https://github.com/jgm/pandoc)
 - ✅ Optional evaluation of retrieval and generation performance with [Ragas](https://github.com/explodinggradients/ragas)
@@ -87,10 +89,11 @@ pip install raglite[ragas]

 1. [Configuring RAGLite](#1-configuring-raglite)
 2. [Inserting documents](#2-inserting-documents)
-3. [Searching and Retrieval-Augmented Generation (RAG)](#3-searching-and-retrieval-augmented-generation-rag)
+3. [Retrieval-Augmented Generation (RAG)](#3-retrieval-augmented-generation-rag)
 4. [Computing and using an optimal query adapter](#4-computing-and-using-an-optimal-query-adapter)
 5. [Evaluation of retrieval and generation](#5-evaluation-of-retrieval-and-generation)
-6. [Serving a customizable ChatGPT-like frontend](#6-serving-a-customizable-chatgpt-like-frontend)
+6. [Running a Model Context Protocol (MCP) server](#6-running-a-model-context-protocol-mcp-server)
+7. [Serving a customizable ChatGPT-like frontend](#7-serving-a-customizable-chatgpt-like-frontend)

 ### 1. Configuring RAGLite

@@ -114,7 +117,7 @@ my_config = RAGLiteConfig(

 # Example 'local' config with a SQLite database and a llama.cpp LLM:
 my_config = RAGLiteConfig(
-    db_url="sqlite:///raglite.sqlite",
+    db_url="sqlite:///raglite.db",
     llm="llama-cpp-python/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/*Q4_K_M.gguf@8192",
     embedder="llama-cpp-python/lm-kit/bge-m3-gguf/*F16.gguf@1024",  # A context size of 1024 tokens is the sweet spot for bge-m3.
 )
@@ -133,7 +136,7 @@ my_config = RAGLiteConfig(

 # Example local cross-encoder reranker per language (this is the default):
 my_config = RAGLiteConfig(
-    db_url="sqlite:///raglite.sqlite",
+    db_url="sqlite:///raglite.db",
     reranker=(
         ("en", Reranker("ms-marco-MiniLM-L-12-v2", model_type="flashrank")),  # English
         ("other", Reranker("ms-marco-MultiBERT-L-12", model_type="flashrank")),  # Other languages
@@ -157,11 +160,11 @@ insert_document(Path("On the Measure of Intelligence.pdf"), config=my_config)
 insert_document(Path("Special Relativity.pdf"), config=my_config)
 ```

-### 3. Searching and Retrieval-Augmented Generation (RAG)
+### 3. Retrieval-Augmented Generation (RAG)

-#### 3.1 Dynamically routed RAG
+#### 3.1 Adaptive RAG

-Now you can run a dynamically routed RAG pipeline that consists of adding the user prompt to the message history and streaming the LLM response. Depending on the user prompt, the LLM may choose to retrieve context using RAGLite by invoking a retrieval tool. If retrieval is necessary, the LLM determines the search query and RAGLite applies hybrid search with reranking to retrieve the most relevant chunk spans (each of which is a list of consecutive chunks). The retrieval results are sent to the `on_retrieval` callback and are also appended to the message history as a tool output. Finally, the LLM response given the RAG context is streamed and the message history is updated with the assistant response:
+Now you can run an adaptive RAG pipeline that consists of adding the user prompt to the message history and streaming the LLM response:

 ```python
 from raglite import rag
@@ -173,9 +176,7 @@ messages.append({
     "content": "How is intelligence measured?"
 })

-# Let the LLM decide whether to search the database by providing a retrieval tool to the LLM.
-# If requested, RAGLite then uses hybrid search and reranking to append RAG context to the message history.
-# Finally, assistant response is streamed and appended to the message history.
+# Adaptively decide whether to retrieve and stream the response:
 chunk_spans = []
 stream = rag(messages, on_retrieval=lambda x: chunk_spans.extend(x), config=my_config)
 for update in stream:
@@ -185,6 +186,8 @@ for update in stream:
 documents = [chunk_span.document for chunk_span in chunk_spans]
 ```

+The LLM will adaptively decide whether to retrieve information based on the complexity of the user prompt. If retrieval is necessary, the LLM generates the search query and RAGLite applies hybrid search and reranking to retrieve the most relevant chunk spans (each of which is a list of consecutive chunks). The retrieval results are sent to the `on_retrieval` callback and are appended to the message history as a tool output. Finally, the assistant response is streamed and appended to the message history.
+
 #### 3.2 Programmable RAG

 If you need manual control over the RAG pipeline, you can run a basic but powerful pipeline that consists of retrieving the most relevant chunk spans with hybrid search and reranking, converting the user prompt to a RAG instruction and appending it to the message history, and finally generating the RAG response:
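The code block for this basic pipeline falls outside the hunks shown in this diff. As a rough sketch only, it could look like the following; the helper names `retrieve_rag_context` and `create_rag_instruction` are assumptions for illustration, not API confirmed by this commit:

```python
from raglite import rag

# Hypothetical helpers (names assumed, not confirmed by this diff):
from raglite import create_rag_instruction, retrieve_rag_context

# Retrieve the most relevant chunk spans with hybrid search and reranking:
user_prompt = "How is intelligence measured?"
chunk_spans = retrieve_rag_context(query=user_prompt, config=my_config)

# Convert the user prompt to a RAG instruction and append it to the message history:
messages = [create_rag_instruction(user_prompt=user_prompt, context=chunk_spans)]

# Stream the RAG response and access the cited documents:
stream = rag(messages, config=my_config)
for update in stream:
    print(update, end="")
documents = [chunk_span.document for chunk_span in chunk_spans]
```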
@@ -222,6 +225,8 @@ RAGLite also offers more advanced control over the individual steps of a full RA
 6. Streaming an LLM response to the message history
 7. Accessing the cited documents from the chunk spans

+A full RAG pipeline is straightforward to implement with RAGLite:
+
 ```python
 # Search for chunks:
 from raglite import hybrid_search, keyword_search, vector_search
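# The commit's code block continues beyond this hunk. The lines below are an
# illustrative sketch only, not part of this commit: the exact signatures of the
# search functions and the retrieve_chunks/rerank_chunks helpers are assumptions.
prompt = "How is intelligence measured?"
chunk_ids_vector, _ = vector_search(prompt, num_results=20, config=my_config)
chunk_ids_keyword, _ = keyword_search(prompt, num_results=20, config=my_config)
chunk_ids_hybrid, _ = hybrid_search(prompt, num_results=20, config=my_config)

# Retrieve and rerank the chunks behind those ids (helper names assumed):
from raglite import rerank_chunks, retrieve_chunks

chunks = retrieve_chunks(chunk_ids_hybrid, config=my_config)
chunks_reranked = rerank_chunks(prompt, chunks, config=my_config)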
@@ -289,7 +294,35 @@ answered_evals_df = answer_evals(num_evals=10, config=my_config)
 evaluation_df = evaluate(answered_evals_df, config=my_config)
 ```

-### 6. Serving a customizable ChatGPT-like frontend
+### 6. Running a Model Context Protocol (MCP) server
+
+RAGLite comes with an [MCP server](https://modelcontextprotocol.io) implemented with [FastMCP](https://github.com/jlowin/fastmcp) that exposes a `search_knowledge_base` [tool](https://github.com/jlowin/fastmcp?tab=readme-ov-file#tools). To use the server:
+
+1. Install [Claude desktop](https://claude.ai/download)
+2. Install [uv](https://docs.astral.sh/uv/getting-started/installation/) so that Claude desktop can start the server
+3. Configure Claude desktop to use `uv` to start the MCP server with:
+
+```
+raglite \
+    --db-url sqlite:///raglite.db \
+    --llm llama-cpp-python/bartowski/Llama-3.2-3B-Instruct-GGUF/*Q4_K_M.gguf@4096 \
+    --embedder llama-cpp-python/lm-kit/bge-m3-gguf/*F16.gguf@1024 \
+    mcp install
+```
+
+To use an API-based LLM, make sure to include your credentials in a `.env` file or supply them inline:
+
+```sh
+OPENAI_API_KEY=sk-... raglite --llm gpt-4o-mini --embedder text-embedding-3-large mcp install
+```
+
+Now, when you start Claude desktop, you should see a 🔨 icon at the bottom right of your prompt, indicating that Claude has successfully connected with the MCP server.
+
+When relevant, Claude will suggest using the `search_knowledge_base` tool that the MCP server provides. You can also explicitly ask Claude to search the knowledge base if you want to be certain that it does.
+
+<div align="center"><video src="https://github.com/user-attachments/assets/3a597a17-874e-475f-a6dd-cd3ccf360fb9" /></div>
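The sketch below illustrates the general FastMCP pattern for exposing such a `search_knowledge_base` tool. It is not RAGLite's actual server code: the RAGLite calls inside the tool body (the `hybrid_search` signature and the `retrieve_chunks` helper) and the config values are assumptions for illustration.

```python
from fastmcp import FastMCP

from raglite import RAGLiteConfig, hybrid_search, retrieve_chunks  # retrieve_chunks is an assumed helper

# Illustrative config; in practice it should match the config used for ingestion.
config = RAGLiteConfig(db_url="sqlite:///raglite.db")
mcp = FastMCP("RAGLite")

@mcp.tool()
def search_knowledge_base(query: str) -> str:
    """Search the knowledge base and return the most relevant chunks as text."""
    # Signatures below are assumptions, not confirmed by this diff.
    chunk_ids, _scores = hybrid_search(query, num_results=5, config=config)
    chunks = retrieve_chunks(chunk_ids, config=config)
    return "\n\n".join(str(chunk) for chunk in chunks)

if __name__ == "__main__":
    mcp.run()
```

In practice, the `raglite ... mcp install` command shown above registers the packaged server with Claude desktop; this sketch is only meant to show roughly what the exposed tool amounts to.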
+### 7. Serving a customizable ChatGPT-like frontend

 If you installed the `chainlit` extra, you can serve a customizable ChatGPT-like frontend with:

@@ -302,19 +335,20 @@ The application is also deployable to [web](https://docs.chainlit.io/deploy/copi
 You can specify the database URL, LLM, and embedder directly in the Chainlit frontend, or with the CLI as follows:

 ```sh
-raglite chainlit \
-    --db_url sqlite:///raglite.sqlite \
+raglite \
+    --db-url sqlite:///raglite.db \
     --llm llama-cpp-python/bartowski/Llama-3.2-3B-Instruct-GGUF/*Q4_K_M.gguf@4096 \
-    --embedder llama-cpp-python/lm-kit/bge-m3-gguf/*F16.gguf@1024
+    --embedder llama-cpp-python/lm-kit/bge-m3-gguf/*F16.gguf@1024 \
+    chainlit
 ```

 To use an API-based LLM, make sure to include your credentials in a `.env` file or supply them inline:

 ```sh
-OPENAI_API_KEY=sk-... raglite chainlit --llm gpt-4o-mini --embedder text-embedding-3-large
+OPENAI_API_KEY=sk-... raglite --llm gpt-4o-mini --embedder text-embedding-3-large chainlit
 ```

-<div align="center"><video src="https://github.com/user-attachments/assets/01cf98d3-6ddd-45bb-8617-cf290c09f187" /></div>
+<div align="center"><video src="https://github.com/user-attachments/assets/a303ed4a-54cd-45ea-a2b5-86e086053aed" /></div>

 ## Contributing

poetry.lock

Lines changed: 86 additions & 14 deletions
Some generated files are not rendered by default.

pyproject.toml

Lines changed: 4 additions & 0 deletions
@@ -19,6 +19,8 @@ version_provider = "poetry"
 [tool.poetry.dependencies] # https://python-poetry.org/docs/dependency-specification/
 # Python:
 python = ">=3.10,<4.0"
+# Configuration:
+platformdirs = ">=4.0.0"
 # Markdown conversion:
 pdftext = ">=0.3.13"
 pypandoc-binary = { version = ">=1.13", optional = true }
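The new `platformdirs` dependency, grouped under `# Configuration:`, provides cross-platform per-user directories; how RAGLite actually uses it is not shown in this diff. A generic usage sketch:

```python
from platformdirs import user_config_dir, user_data_dir

# Cross-platform per-user directories; the application name "raglite" is illustrative.
print(user_config_dir("raglite"))  # e.g. ~/.config/raglite on Linux
print(user_data_dir("raglite"))    # e.g. ~/.local/share/raglite on Linux
```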
@@ -52,6 +54,8 @@ ragas = { version = ">=0.1.12", optional = true }
 typer = ">=0.12.5"
 # Frontend:
 chainlit = { version = ">=1.2.0", optional = true }
+# Model Context Protocol:
+fastmcp = ">=0.4.1"
 # Utilities:
 packaging = ">=23.0"

0 commit comments
