Commit 0a08bf5

feat: support prompt caching and apply Anthropic's long-context prompt format (#52)
1 parent 003967b commit 0a08bf5

File tree

12 files changed: +396 −322 lines changed

README.md

Lines changed: 63 additions & 14 deletions
````diff
@@ -23,6 +23,8 @@ RAGLite is a Python toolkit for Retrieval-Augmented Generation (RAG) with Postgr
 - 🧬 Multi-vector chunk embedding with [late chunking](https://weaviate.io/blog/late-chunking) and [contextual chunk headings](https://d-star.ai/solving-the-out-of-context-chunk-problem-for-rag)
 - ✂️ Optimal [level 4 semantic chunking](https://medium.com/@anuragmishra_27746/five-levels-of-chunking-strategies-in-rag-notes-from-gregs-video-7b735895694d) by solving a [binary integer programming problem](https://en.wikipedia.org/wiki/Integer_programming)
 - 🔍 [Hybrid search](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) with the database's native keyword & vector search ([tsvector](https://www.postgresql.org/docs/current/datatype-textsearch.html)+[pgvector](https://github.com/pgvector/pgvector), [FTS5](https://www.sqlite.org/fts5.html)+[sqlite-vec](https://github.com/asg017/sqlite-vec)[^1])
+- 💰 Improved cost and latency with a [prompt caching-aware message array structure](https://platform.openai.com/docs/guides/prompt-caching)
+- 🍰 Improved output quality with [Anthropic's long-context prompt format](https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/long-context-tips)
 - 🌀 Optimal [closed-form linear query adapter](src/raglite/_query_adapter.py) by solving an [orthogonal Procrustes problem](https://en.wikipedia.org/wiki/Orthogonal_Procrustes_problem)
 
 ##### Extensible
@@ -157,38 +159,85 @@ insert_document(Path("Special Relativity.pdf"), config=my_config)
 
 ### 3. Searching and Retrieval-Augmented Generation (RAG)
 
-Now, you can search for chunks with vector search, keyword search, or a hybrid of the two. You can also rerank the search results with the configured reranker. And you can use any search method of your choice (`hybrid_search` is the default) together with reranking to answer questions with RAG:
+#### 3.1 Simple RAG pipeline
+
+Now you can run a simple but powerful RAG pipeline that consists of retrieving the most relevant chunk spans (each of which is a list of consecutive chunks) with hybrid search and reranking, converting the user prompt to a RAG instruction and appending it to the message history, and finally generating the RAG response:
+
+```python
+from raglite import create_rag_instruction, rag, retrieve_rag_context
+
+# Retrieve relevant chunk spans with hybrid search and reranking:
+user_prompt = "How is intelligence measured?"
+chunk_spans = retrieve_rag_context(query=user_prompt, num_chunks=5, config=my_config)
+
+# Append a RAG instruction based on the user prompt and context to the message history:
+messages = []  # Or start with an existing message history.
+messages.append(create_rag_instruction(user_prompt=user_prompt, context=chunk_spans))
+
+# Stream the RAG response:
+stream = rag(messages, config=my_config)
+for update in stream:
+    print(update, end="")
+
+# Access the documents cited in the RAG response:
+documents = [chunk_span.document for chunk_span in chunk_spans]
+```
+
+#### 3.2 Advanced RAG pipeline
+
+> [!TIP]
+> 🥇 Reranking can significantly improve the output quality of a RAG application. To add reranking to your application: first search for a larger set of 20 relevant chunks, then rerank them with a [rerankers](https://github.com/AnswerDotAI/rerankers) reranker, and finally keep the top 5 chunks.
+
+In addition to the simple RAG pipeline, RAGLite also offers more advanced control over the individual steps of the pipeline. A full pipeline consists of several steps:
+
+1. Searching for relevant chunks with keyword, vector, or hybrid search
+2. Retrieving the chunks from the database
+3. Reranking the chunks and selecting the top 5 results
+4. Extending the chunks with their neighbors and grouping them into chunk spans
+5. Converting the user prompt to a RAG instruction and appending it to the message history
+6. Streaming an LLM response to the message history
+7. Accessing the cited documents from the chunk spans
 
 ```python
 # Search for chunks:
 from raglite import hybrid_search, keyword_search, vector_search
 
-prompt = "How is intelligence measured?"
-chunk_ids_vector, _ = vector_search(prompt, num_results=20, config=my_config)
-chunk_ids_keyword, _ = keyword_search(prompt, num_results=20, config=my_config)
-chunk_ids_hybrid, _ = hybrid_search(prompt, num_results=20, config=my_config)
+user_prompt = "How is intelligence measured?"
+chunk_ids_vector, _ = vector_search(user_prompt, num_results=20, config=my_config)
+chunk_ids_keyword, _ = keyword_search(user_prompt, num_results=20, config=my_config)
+chunk_ids_hybrid, _ = hybrid_search(user_prompt, num_results=20, config=my_config)
 
 # Retrieve chunks:
 from raglite import retrieve_chunks
 
 chunks_hybrid = retrieve_chunks(chunk_ids_hybrid, config=my_config)
 
-# Rerank chunks:
+# Rerank chunks and keep the top 5 (optional, but recommended):
 from raglite import rerank_chunks
 
-chunks_reranked = rerank_chunks(prompt, chunks_hybrid, config=my_config)
+chunks_reranked = rerank_chunks(user_prompt, chunks_hybrid, config=my_config)
+chunks_reranked = chunks_reranked[:5]
+
+# Extend chunks with their neighbors and group them into chunk spans:
+from raglite import retrieve_chunk_spans
+
+chunk_spans = retrieve_chunk_spans(chunks_reranked, config=my_config)
+
+# Append a RAG instruction based on the user prompt and context to the message history:
+from raglite import create_rag_instruction
+
+messages = []  # Or start with an existing message history.
+messages.append(create_rag_instruction(user_prompt=user_prompt, context=chunk_spans))
 
-# Answer questions with RAG:
+# Stream the RAG response:
 from raglite import rag
 
-prompt = "What does it mean for two events to be simultaneous?"
-stream = rag(prompt, config=my_config)
+stream = rag(messages, config=my_config)
 for update in stream:
     print(update, end="")
 
-# You can also pass a search method or search results directly:
-stream = rag(prompt, search=hybrid_search, config=my_config)
-stream = rag(prompt, search=chunks_reranked, config=my_config)
+# Access the documents cited in the RAG response:
+documents = [chunk_span.document for chunk_span in chunk_spans]
 ```
 
 ### 4. Computing and using an optimal query adapter
@@ -200,7 +249,7 @@ RAGLite can compute and apply an [optimal closed-form query adapter](src/raglite
 from raglite import insert_evals, update_query_adapter
 
 insert_evals(num_evals=100, config=my_config)
-update_query_adapter(config=my_config)  # From here, simply call vector_search to use the query adapter.
+update_query_adapter(config=my_config)  # From here, every vector search will use the query adapter.
 ```
 
 ### 5. Evaluation of retrieval and generation
````
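The two bullets added to the README are the heart of this commit. As a rough illustration of the two ideas, here is a minimal sketch of a long-context, cache-friendly RAG message; `build_rag_message` and its `<document>` tagging scheme are hypothetical stand-ins, not the `create_rag_instruction` this commit adds in `src/raglite/_rag.py`:

```python
# Hypothetical helper, NOT RAGLite's create_rag_instruction.
def build_rag_message(user_prompt: str, context: list[str]) -> dict[str, str]:
    """Build a RAG instruction in the spirit of Anthropic's long-context tips."""
    # Long-context tip: put the lengthy documents first and the short question
    # last, wrapping each document in a tag the LLM can refer back to.
    documents = "\n".join(
        f'<document index="{i}">\n{chunk.strip()}\n</document>'
        for i, chunk in enumerate(context)
    )
    return {
        "role": "user",
        "content": f"<documents>\n{documents}\n</documents>\n\n{user_prompt}",
    }


# Prompt caching-aware structure: each RAG instruction is appended to the message
# history instead of rewriting it, so on follow-up questions the earlier messages
# form a byte-identical prefix that prefix-based prompt caches can reuse.
messages: list[dict[str, str]] = []
messages.append(build_rag_message("How is intelligence measured?", ["IQ tests ..."]))
```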

src/raglite/__init__.py

Lines changed: 5 additions & 3 deletions
```diff
@@ -5,13 +5,13 @@
 from raglite._eval import answer_evals, evaluate, insert_evals
 from raglite._insert import insert_document
 from raglite._query_adapter import update_query_adapter
-from raglite._rag import async_rag, rag
+from raglite._rag import async_rag, create_rag_instruction, rag, retrieve_rag_context
 from raglite._search import (
     hybrid_search,
     keyword_search,
     rerank_chunks,
+    retrieve_chunk_spans,
     retrieve_chunks,
-    retrieve_segments,
     vector_search,
 )
 
@@ -25,9 +25,11 @@
     "keyword_search",
     "vector_search",
     "retrieve_chunks",
-    "retrieve_segments",
+    "retrieve_chunk_spans",
     "rerank_chunks",
     # RAG
+    "retrieve_rag_context",
+    "create_rag_instruction",
     "async_rag",
     "rag",
     # Query adapter
```
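Renaming `retrieve_segments` to `retrieve_chunk_spans` and exporting `retrieve_rag_context` and `create_rag_instruction` change how a RAG call reads. A rough before/after sketch, assuming a default `RAGLiteConfig` named `my_config` (adapted from the README diff above):

```python
from raglite import RAGLiteConfig, create_rag_instruction, rag, retrieve_rag_context

my_config = RAGLiteConfig()  # Assumed default config, for illustration only.

# Before: rag() took a prompt plus a search method or search results, e.g.
#   stream = rag(prompt, search=hybrid_search, config=my_config)
# After: retrieval and instruction-building are explicit steps, and rag()
# consumes an OpenAI-style message array:
user_prompt = "How is intelligence measured?"
chunk_spans = retrieve_rag_context(query=user_prompt, num_chunks=5, config=my_config)
messages = [create_rag_instruction(user_prompt=user_prompt, context=chunk_spans)]
for update in rag(messages, config=my_config):
    print(update, end="")
```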

src/raglite/_chainlit.py

Lines changed: 21 additions & 15 deletions
```diff
@@ -9,16 +9,19 @@
 from raglite import (
     RAGLiteConfig,
     async_rag,
+    create_rag_instruction,
     hybrid_search,
     insert_document,
     rerank_chunks,
+    retrieve_chunk_spans,
     retrieve_chunks,
 )
 from raglite._markdown import document_to_markdown
 
 async_insert_document = cl.make_async(insert_document)
 async_hybrid_search = cl.make_async(hybrid_search)
 async_retrieve_chunks = cl.make_async(retrieve_chunks)
+async_retrieve_chunk_spans = cl.make_async(retrieve_chunk_spans)
 async_rerank_chunks = cl.make_async(rerank_chunks)
 
 
@@ -84,34 +87,37 @@ async def handle_message(user_message: cl.Message) -> None:
             step.input = Path(file.path).name
             await async_insert_document(Path(file.path), config=config)
     # Append any inline attachments to the user prompt.
-    user_prompt = f"{user_message.content}\n\n" + "\n\n".join(
-        f'<attachment index="{i}">\n{attachment.strip()}\n</attachment>'
-        for i, attachment in enumerate(inline_attachments)
+    user_prompt = (
+        "\n\n".join(
+            f'<attachment index="{i}">\n{attachment.strip()}\n</attachment>'
+            for i, attachment in enumerate(inline_attachments)
+        )
+        + f"\n\n{user_message.content}"
     )
     # Search for relevant contexts for RAG.
     async with cl.Step(name="search", type="retrieval") as step:
         step.input = user_message.content
         chunk_ids, _ = await async_hybrid_search(query=user_prompt, num_results=10, config=config)
         chunks = await async_retrieve_chunks(chunk_ids=chunk_ids, config=config)
         step.output = chunks
-        step.elements = [  # Show the top 3 chunks inline.
-            cl.Text(content=str(chunk), display="inline") for chunk in chunks[:3]
+        step.elements = [  # Show the top chunks inline.
+            cl.Text(content=str(chunk), display="inline") for chunk in chunks[:5]
         ]
-    # Rerank the chunks.
+        await step.update()  # TODO: Workaround for https://github.com/Chainlit/chainlit/issues/602.
+    # Rerank the chunks and group them into chunk spans.
     async with cl.Step(name="rerank", type="rerank") as step:
         step.input = chunks
         chunks = await async_rerank_chunks(query=user_prompt, chunk_ids=chunks, config=config)
-        step.output = chunks
-        step.elements = [  # Show the top 3 chunks inline.
-            cl.Text(content=str(chunk), display="inline") for chunk in chunks[:3]
+        chunk_spans = await async_retrieve_chunk_spans(chunks[:5], config=config)
+        step.output = chunk_spans
+        step.elements = [  # Show the top chunk spans inline.
+            cl.Text(content=str(chunk_span), display="inline") for chunk_span in chunk_spans
         ]
+        await step.update()  # TODO: Workaround for https://github.com/Chainlit/chainlit/issues/602.
     # Stream the LLM response.
     assistant_message = cl.Message(content="")
-    async for token in async_rag(
-        prompt=user_prompt,
-        search=chunks,
-        messages=cl.chat_context.to_openai()[-5:],  # type: ignore[no-untyped-call]
-        config=config,
-    ):
+    messages: list[dict[str, str]] = cl.chat_context.to_openai()[:-1]  # type: ignore[no-untyped-call]
+    messages.append(create_rag_instruction(user_prompt=user_prompt, context=chunk_spans))
+    async for token in async_rag(messages, config=config):
         await assistant_message.stream_token(token)
     await assistant_message.update()  # type: ignore[no-untyped-call]
```
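One subtlety in the new Chainlit handler is the `[:-1]` slice on `cl.chat_context.to_openai()`. A small sketch with a stubbed chat history (the literal messages are invented for illustration) shows why the current user message is dropped:

```python
# Stubbed stand-in for cl.chat_context.to_openai(), which returns an OpenAI-style
# message list ending with the user's current message.
chat_context = [
    {"role": "user", "content": "Hi!"},
    {"role": "assistant", "content": "Hello! What would you like to know?"},
    {"role": "user", "content": "How is intelligence measured?"},  # Current prompt.
]

# Drop the trailing user message: create_rag_instruction() re-adds the prompt
# wrapped together with the retrieved chunk spans, so keeping it would send the
# question to the LLM twice.
messages = chat_context[:-1]
# messages.append(create_rag_instruction(user_prompt=..., context=chunk_spans))
```

Note also that the handler switches from the old sliding window (`[-5:]`) to the full prior history, which presumably keeps the cacheable message prefix stable across turns.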
