An intelligent, adaptive framework support system that leverages Self-RAG (Self-Reflective Retrieval-Augmented Generation) to provide accurate, contextual responses for technical documentation and APIs. The system features intelligent query routing, multi-source document retrieval, and comprehensive self-evaluation mechanisms built on LangGraph for robust multi-agent workflows.
- π Features
- ποΈ Architecture
- π οΈ Installation
- βοΈ Configuration
- ποΈ Vector Database Preparation
- π€ Multi-Agent System
- π§ Implementation Details
- π Usage Examples
- π Self-Reflection Process
- π Performance Evaluation
- π¨ Troubleshooting
- π License
- π§ Intelligent Query Routing: Automatically determines whether to use retrieval-based processing (for framework-specific questions) or conversational processing (for general queries)
- π Self-Reflective Evaluation: Multi-layered quality assessment including hallucination detection, document relevance checking, and answer quality verification
- π Advanced Document Processing: LLM-based summarization that processes entire documents instead of arbitrary chunking, maintaining context and semantic coherence
- π Multi-Source Retrieval: Combines vector database search with real-time web search for comprehensive information gathering
- π Adaptive Query Refinement: Automatically rewrites and optimizes queries when initial retrieval doesn't yield relevant results
- π€ Multi-Agent Workflow: LangGraph-based stateful workflow with specialized nodes for different processing stages
- π Built-in Performance Evaluation: Comprehensive evaluation framework comparing different RAG approaches
- π¬ Gradio Web Interface: User-friendly chat interface for interactive queries
- Python 3.12+
- OpenAI API key
- Tavily API key (for web search)
# Clone the repository
git clone https://github.com/erentorlak/Self-Reflective_Framework_Assistant_RAG.git
cd Self-Reflective_Framework_Assistant_RAG
# Install Poetry if you haven't already
curl -sSL https://install.python-poetry.org | python3 -
# Create virtual environment and install dependencies
poetry shell
poetry install
# Run the application
poetry run python -m app.main
# Clone the repository
git clone https://github.com/erentorlak/Self-Reflective_Framework_Assistant_RAG.git
cd Self-Reflective_Framework_Assistant_RAG
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Run the application
python -m app.main
poetry export -f requirements.txt --without-hashes > requirements.txt
Create a .env
file in the root directory with the following variables:
OPENAI_API_KEY=your_openai_api_key_here
TAVILY_API_KEY=your_tavily_api_key_here
LANGCHAIN_TRACING_V2=true
LANGCHAIN_API_KEY=your_langchain_api_key_here # Optional for LangSmith tracing
The system uses the following default configurations (defined in app/config.py
):
CHROMA_COLLECTION_NAME = "vdb_summary_query"
CHROMA_PERSIST_DIRECTORY = "./vdb_summary_query"
OPENAI_EMBEDDINGS_MODEL = "text-embedding-3-large"
CHAT_OPENAI_MODEL = "gpt-4o-mini"
CHAT_OPENAI_TEMPERATURE = 0
The system uses a sophisticated approach to document processing and vector database creation. Instead of traditional chunking, it employs LLM-based summarization for better context preservation.
- Document Loading: Uses
RecursiveUrlLoader
to scrape documentation from specified URLs - LLM-Enhanced Processing: Each document is processed by GPT-4 to generate:
- Concise summaries (5-7 sentences)
- Potential user queries (5-7 examples)
- Enhanced Document Creation: Creates enriched documents combining:
- Original content
- Generated summaries
- Possible queries
- Metadata preservation
# Prepare the vector database
python vdb_prepare.py
from vdb_prepare import CreateSummary, ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
# Initialize LLM for document processing
llm = ChatOpenAI(model="gpt-4o", temperature=0)
structured_llm = llm.with_structured_output(CreateSummary)
# Define processing prompt
system = """You are an expert at technical documentation.
Analyze the given documentation and provide:
1. A concise summary (5-7 sentences)
2. List of 5-7 possible queries that users might ask about this topic.
"""
summary_prompt = ChatPromptTemplate.from_messages([
("system", system),
("human", "Document: \n\n {document} \n\n ")
])
# Process document
processor = summary_prompt | structured_llm
result = processor.invoke({"document": document_content})
new_metadata = {
"source": doc.metadata.get("source"),
"title": doc.metadata.get("title"),
"description": doc.metadata.get("description"),
"summary": result.summary,
"possible_queries": queries,
"content": doc.page_content, # Original content preserved
}
The system is built on a LangGraph-based multi-agent workflow that orchestrates various specialized components for intelligent document retrieval and response generation.
flowchart TD
Start([User Query]) --> Router[π§ Query Router]
Router -->|Framework Question| Init[π§ Initialize State]
Router -->|General Question| Conv[π¬ Conversation Chain]
Init --> Retrieve[π Multi-Source Retrieval]
Retrieve --> VDB[ποΈ Vector DB Search]
Retrieve --> Web[π Web Search]
VDB --> Combine[π Combine Sources]
Web --> Combine
Combine --> Grade[π Grade Documents]
Grade -->|Relevant| Generate[βοΈ Generate Answer]
Grade -->|Not Relevant| Transform[π Transform Query]
Grade -->|No Results| End1[β End]
Transform --> Retrieve
Generate --> SelfEval[π Self-Evaluation]
SelfEval -->|Hallucination Check| Hallucination{π¨ Grounded?}
SelfEval -->|Quality Check| Quality{β
Addresses Question?}
Hallucination -->|No| Generate
Hallucination -->|Yes| Quality
Quality -->|No| Transform
Quality -->|Yes| Output[β¨ Final Response]
Conv --> Output
style Router fill:#e1f5fe
style Generate fill:#f3e5f5
style SelfEval fill:#fff3e0
style Output fill:#e8f5e8
Component | Purpose | Implementation |
---|---|---|
Query Router | Determines processing path | Uses structured LLM output to classify queries |
Multi-Source Retriever | Gathers relevant information | Combines vector DB and web search results |
Document Grader | Filters relevant documents | LLM-based relevance assessment |
Answer Generator | Creates responses | RAG chain with context injection |
Self-Evaluator | Quality assurance | Multi-stage evaluation (hallucination + relevance) |
Query Transformer | Improves search queries | LLM-based query rewriting |
The system implements a sophisticated multi-agent architecture using LangGraph for state management and workflow orchestration.
Purpose: Determines the optimal processing path for incoming queries.
Logic:
def route_question(self, state: GraphState) -> str:
question = state["question"]
source = self.chains.question_router.invoke({"question": question})
if source.datasource == "retrieve":
return "retrieve" # Framework-specific questions
elif source.datasource == "conversation":
return "conversation" # General conversations
Decision Criteria:
- Retrieve: LangGraph/LangChain technical questions, implementation details, API documentation
- Conversation: General chat, non-technical queries
Purpose: Gathers information from multiple sources.
Process:
def retrieve(self, state: GraphState) -> GraphState:
question = state["question"]
# Vector database retrieval
vdb_documents = self.retriever.invoke(question)
vdb_contents = [doc.metadata.get("content") for doc in vdb_documents]
# Web search retrieval
web_search_documents = self.web_search_tool.invoke({"query": question})
web_search_contents = [d["content"] for d in web_search_documents]
# Combine sources
documents = vdb_contents + web_search_contents
return {"documents": documents, "question": question, ...}
Purpose: Evaluates the relevance of retrieved documents.
Evaluation Process:
- Uses LLM to assess document relevance
- Filters out irrelevant or low-quality retrievals
- Decides whether to proceed with generation or transform the query
Purpose: Creates contextual responses using retrieved documents.
Features:
- Context-aware response generation
- Fact-grounded answers
- Integration of multiple information sources
Purpose: Performs quality assurance on generated responses.
Evaluation Stages:
- Hallucination Check: Ensures answer is grounded in retrieved facts
- Relevance Check: Verifies the answer addresses the original question
- Quality Assessment: Overall response quality evaluation
Purpose: Improves queries that don't yield good results.
Process:
def transform_query(self, state: GraphState) -> GraphState:
question = state["question"]
better_question = self.chains.question_rewriter.invoke({"question": question})
return {"question": better_question, ...}
The system uses a GraphState
model to maintain context throughout the workflow:
class GraphState(TypedDict):
question: str # Current user query (original or rewritten)
generation: str # LLM-generated answer
documents: List[str] # Retrieved documents from various sources
query_rewritten_num: int # Counter for query rewrites (prevents infinite loops)
final_output: Annotated[List[str], add] # Final processed output
-
Initial Routing:
START β route_question β { "retrieve": "init_state", "conversation": "conversation" }
-
Document Grading Flow:
grade_documents β { "transform_query": "transform_query", # Poor quality docs "generate": "generate", # Good quality docs "end": END # No docs found }
-
Self-Evaluation Flow:
generate β grade_generation β { "not supported": "generate", # Hallucination detected "useful": END, # High quality response "not useful": "transform_query" # Low quality response }
class Workflow:
def setup_workflow(self):
# Add all agent nodes
self.workflow.add_node("conversation", self.nodes.conversation)
self.workflow.add_node("init_state", self.nodes.init_state)
self.workflow.add_node("retrieve", self.nodes.retrieve)
self.workflow.add_node("grade_documents", self.nodes.grade_documents)
self.workflow.add_node("generate", self.nodes.generate)
self.workflow.add_node("transform_query", self.nodes.transform_query)
# Define conditional edges
self.workflow.add_conditional_edges(START, self.nodes.route_question, {...})
# ... additional edge configurations
def compile_workflow(self):
return self.workflow.compile()
Manages all LLM chains used throughout the system.
class Chains:
def __init__(self, llm: ChatOpenAI, graders: Graders):
# Router Chain - Determines query routing
self.question_router = route_prompt | llm.with_structured_output(RouteQuery)
# RAG Chain - Main response generation
self.rag_chain = rag_prompt | llm | StrOutputParser()
# Conversation Chain - Handles general conversations
self.conversation_chain = conv_prompt | llm | StrOutputParser()
# Question Rewriter - Improves poor queries
self.question_rewriter = re_write_prompt | llm | StrOutputParser()
Implements the self-reflection evaluation system.
class Graders:
def grade_documents(self, question: str, document: str) -> GradeDocuments:
"""Evaluates document relevance to the user question"""
return self.retrieval_grader.invoke({"question": question, "document": document})
def grade_hallucinations(self, documents: str, generation: str) -> GradeHallucinations:
"""Checks if the generation is grounded in the provided documents"""
return self.hallucination_grader.invoke({"documents": documents, "generation": generation})
def grade_answer(self, question: str, generation: str) -> GradeAnswer:
"""Evaluates if the answer addresses the original question"""
return self.answer_grader.invoke({"question": question, "generation": generation})
Contains all agent node implementations for the workflow.
Key Methods:
route_question()
: Initial query classificationretrieve()
: Multi-source document retrievalgrade_documents()
: Document relevance filteringgenerate()
: Response generation with contexttransform_query()
: Query improvement and rewriting
Orchestrates the entire workflow execution with streaming support.
class InferenceEngine:
def inference(self, inputs: str, history, *args, **kwargs) -> Iterator[str]:
"""Streaming inference generator"""
config = {"configurable": {"thread_id": "1"}}
inputs_dict = {"question": inputs}
for output in self.compiled_workflow.stream(inputs_dict, config):
for key, value in output.items():
if isinstance(value, dict) and "generation" in value:
yield value["generation"]
The system uses carefully crafted prompts for different functions:
ROUTE_SYSTEM_PROMPT = """You are an expert at routing a user question to a retrieve or conversation.
The retrieve contains documentation of Langchain and LangGraph.
Use retrieve if the user asks questions about:
- Details about LangGraph
- Usage examples and API documentation for LangGraph/Langchain
- Implementation details and source code questions
- Any agent or multi-agent workflow questions
If a user uses daily conversation, then use conversation."""
GRADE_DOCUMENTS_SYSTEM_PROMPT = """You are a grader assessing relevance of a retrieved document to a user question.
It does not need to be a stringent test. The goal is to filter out erroneous retrievals.
If the document contains keyword(s) or semantic meaning related to the user question, grade it as relevant.
Give a binary score 'yes' or 'no' score to indicate whether the document is relevant to the question."""
# Start the Gradio interface
python -m app.main
This launches a web interface at http://localhost:7860
where you can interact with the assistant.
from app.main import main
from app.inference import InferenceEngine
from app.chains import Chains
from app.graders import Graders
from langchain_openai import ChatOpenAI
# Initialize components
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
graders = Graders(llm)
chains = Chains(llm, graders)
inference_engine = InferenceEngine(chains, retriever, web_search_tool)
# Ask a question
question = "How do I create a conditional edge in LangGraph?"
for response_chunk in inference_engine.inference(question, []):
print(response_chunk, end="")
Input: "How do I implement state management in LangGraph?"
Process:
- Router β
retrieve
(framework-specific question) - Retrieval β Vector DB + Web search
- Document grading β Filters relevant docs
- Generation β Creates detailed response with code examples
- Self-evaluation β Verifies accuracy and relevance
Expected Output: Detailed explanation with code examples about LangGraph state management.
Input: "What's the weather like today?"
Process:
- Router β
conversation
(general question) - Conversation chain β Direct LLM response
- Output β General conversational response
Expected Output: Conversational response explaining the assistant can't access weather data.
Input: "agents"
Process:
- Router β
retrieve
(potentially framework-related) - Retrieval β Gathers documents
- Document grading β Poor relevance scores
- Query transformation β "What are agents in LangGraph and how do they work?"
- Re-retrieval β Better document matches
- Generation β Comprehensive response
# Modify vdb_prepare.py to add custom URLs
urls = [
"https://langchain-ai.github.io/langgraph/how-tos/",
"https://your-custom-documentation-site.com/",
# Add more documentation sources
]
# In app/config.py
CHAT_OPENAI_MODEL = "gpt-4o" # Use more powerful model
CHAT_OPENAI_TEMPERATURE = 0.1 # Add slight creativity
The notebook version provides additional features:
# Run the notebook version
python nb.py
Additional Features:
- Performance evaluation comparing different RAG approaches
- Latency measurements
- Response quality scoring
- Comparative analysis tools
The self-reflection mechanism is a key differentiator of this system, implementing multiple evaluation layers:
def grade_documents(self, state: GraphState) -> GraphState:
"""Grade retrieved documents for relevance"""
question = state["question"]
documents = state["documents"]
filtered_docs = []
for d in documents:
score = self.graders.grade_documents(question, d)
if score.binary_score == "yes":
filtered_docs.append(d)
return {"documents": filtered_docs, ...}
Evaluation Criteria:
- Keyword relevance
- Semantic similarity
- Contextual appropriateness
def grade_generation_v_documents_and_question(self, state: GraphState) -> str:
"""Comprehensive generation evaluation"""
question = state["question"]
documents = state["documents"]
generation = state["generation"]
# Check for hallucinations
score = self.graders.grade_hallucinations(documents, generation)
if score.binary_score == "no":
return "not supported" # Hallucination detected
# Check answer quality
score = self.graders.grade_answer(question, generation)
if score.binary_score == "yes":
return "useful" # High-quality answer
else:
return "not useful" # Needs improvement
Hallucination Checks:
- Fact verification against source documents
- Consistency checking
- Source attribution validation
Quality Metrics:
- Relevance: Does the answer address the question?
- Completeness: Are all aspects of the question covered?
- Accuracy: Is the information factually correct?
- Clarity: Is the explanation clear and understandable?
When initial results are poor, the system automatically improves queries:
def transform_query(self, state: GraphState) -> GraphState:
"""Improve query for better retrieval"""
question = state["question"]
query_rewritten_num = state["query_rewritten_num"]
# Prevent infinite loops
if query_rewritten_num >= 3:
return state
better_question = self.chains.question_rewriter.invoke({"question": question})
return {
"question": better_question,
"query_rewritten_num": query_rewritten_num + 1,
...
}
Query Improvement Strategies:
- Adding technical context
- Clarifying ambiguous terms
- Expanding abbreviated concepts
- Adding relevant keywords
The system includes a comprehensive evaluation framework comparing three different approaches to question answering.
def evaluate_standard_llm(query):
"""Baseline: Direct LLM response without RAG"""
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
response = llm.invoke(query)
return response.content
def evaluate_normal_rag(query):
"""Traditional RAG without self-reflection"""
docs = retriever.invoke(query)
context = "\n".join([doc.page_content for doc in docs])
prompt = f"Context: {context}\n\nQuestion: {query}"
response = llm.invoke(prompt)
return response.content
def evaluate_self_rag(query):
"""Full self-reflective RAG system"""
return inference_engine.inference(query, [])
class ScoreModel(BaseModel):
"""Relevance and performance score for a response"""
relevance: int = Field(description="Relevance score from 1-5")
justification: str = Field(description="Reasoning for the score")
def score_relevance(query, response):
scoring_prompt = ChatPromptTemplate.from_messages([
("system", "Rate the relevance of this response to the query on a scale of 1-5..."),
("human", f"Query: {query}\nResponse: {response}")
])
scoring_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
scoring_llm = scoring_llm.with_structured_output(ScoreModel)
score = scoring_llm.invoke(scoring_prompt)
return score
- Relevance Score: 1-5 scale rating of answer quality
- Latency: Response time measurement
- Factual Grounding: Verification against source documents
- Completeness: Coverage of query requirements
Method | Avg. Relevance | Avg. Latency (s) | Factual Accuracy |
---|---|---|---|
Standard LLM | 3.2/5 | 1.2 | 65% |
Traditional RAG | 4.1/5 | 3.5 | 85% |
Self-Reflective RAG | 4.7/5 | 5.8 | 95% |
# Define test queries
test_queries = [
"How do I create a conditional edge in LangGraph?",
"What is the difference between StateGraph and Graph?",
"How do I implement memory in a LangGraph application?",
]
# Run comprehensive evaluation
results, summary = evaluate_multiple_queries()
# View results
for query, data in results.items():
print(f"Query: {query}")
for method, metrics in data.items():
print(f" {method}: Relevance={metrics['relevance']}, Latency={metrics['latency']:.2f}s")
Problem: AuthenticationError
or missing API keys
Solution:
# Check environment variables
echo $OPENAI_API_KEY
echo $TAVILY_API_KEY
# Set in .env file
OPENAI_API_KEY=your_key_here
TAVILY_API_KEY=your_key_here
Problem: ChromaDB collection not found
Solution:
# Regenerate vector database
python vdb_prepare.py
# Check if directory exists
ls -la vdb_summary_query/
Problem: Out of memory errors during document processing
Solution:
# In vdb_prepare.py, process documents in batches
batch_size = 10
for i in range(0, len(docs), batch_size):
batch = docs[i:i+batch_size]
# Process batch...
Problem: Irrelevant documents retrieved
Solutions:
- Improve query preprocessing
- Adjust similarity thresholds
- Update document sources
- Retrain embeddings with domain-specific data
Problem: High latency responses
Solutions:
# Use faster model for routing
CHAT_OPENAI_MODEL = "gpt-3.5-turbo"
# Reduce web search results
web_search_tool = TavilySearchResults(max_results=1)
# Optimize vector search
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
import logging
logging.basicConfig(level=logging.DEBUG)
# Add debugging in nodes
def retrieve(self, state: GraphState) -> GraphState:
print(f"DEBUG: Query={state['question']}")
print(f"DEBUG: Retrieved {len(documents)} documents")
return state
# Test router independently
router_result = chains.question_router.invoke({"question": "test query"})
print(f"Router decision: {router_result.datasource}")
# Test document grading
grade = graders.grade_documents("test question", "test document")
print(f"Document relevance: {grade.binary_score}")
This project is licensed under the MIT License - see the LICENSE file for details.
- Fork the repository
- Create a feature branch (
git checkout -b feature/amazing-feature
) - Commit your changes (
git commit -m 'Add amazing feature'
) - Push to the branch (
git push origin feature/amazing-feature
) - Open a Pull Request
For questions and support:
- Open an issue on GitHub
- Check the troubleshooting section above
- Review the LangGraph documentation: https://langchain-ai.github.io/langgraph/
- Built with LangChain and LangGraph
- Uses OpenAI for language models
- Vector storage powered by ChromaDB
- Web interface built with Gradio
- Web search integration via Tavily
Latest running version: nb.py
(notebook version) or python -m app.main
(modular version)
This README provides comprehensive documentation for understanding, installing, and using the Self-Reflective Framework Assistant RAG system. The system demonstrates an advanced approach to RAG with self-reflection capabilities, intelligent routing, and comprehensive evaluation mechanisms.