An end-to-end AI engineering project that builds an intelligent product recommendation and analysis system on the Amazon Electronics dataset. This capstone project demonstrates modern AI engineering practices, including data processing, visualization, vector databases, and retrieval-augmented generation (RAG).
Course: End-to-End AI Engineering Bootcamp (Maven)
- Data Processing Pipeline: Automated processing of large-scale Amazon product and review data
- Interactive Visualizations: Comprehensive analysis dashboards with temporal trends, category insights, and rating patterns
- Complete RAG System: Vector database with ChromaDB, intelligent query processing, and context-aware retrieval
- Advanced Streamlit UI: Professional tab-based interface with smart query suggestions, real-time monitoring, and enhanced response visualization
- Multi-Provider Support: Compatible with OpenAI, Groq, and Google Gemini models
- Vector Database: ChromaDB-powered semantic search with GTE-large embeddings, metadata filtering and hybrid queries
- Query Intelligence: Automatic query type detection for product reviews, comparisons, complaints, and recommendations
- RAG Evaluation Framework: Industry-standard RAGAS evaluation with enhanced Weave integration for complete metric visibility
- Enhanced Weave-RAGAS Integration: All evaluation metrics (faithfulness, relevancy, precision, recall) visible in Weave UI with drill-down capabilities
- Synthetic Test Data: Advanced synthetic data generation with template-based queries, variation techniques, and quality analysis
- Production Testing: Automated test case generation with configurable difficulty distributions and Weave traceability
- Optimized Weave Tracing: Production-ready AI pipeline monitoring with efficient session-based initialization, zero-redundancy design, and comprehensive analytics
- LiteLLM Integration: Unified access to 100+ LLM providers including Ollama for local models
- Vector Database Management: Scripts for reinitializing and managing ChromaDB with custom JSONL data
- LangGraph Agent: ReAct pattern conversational agent with reasoning traces, tool use, and persistent state
- Session Management: PostgreSQL-based conversation persistence for multi-turn interactions
- Agent Mode Toggle: Seamless switching between direct RAG and agent-mediated queries
- Contractual pricing
- Account-specific catalogs
- Procurement compliance
- Multi-user workflows (approvers, requisitioners, etc.)
- Bulk ordering, BOM-style inputs, or quote-based negotiation are not captured
- ERP integration, punchout catalogs (OCI, cXML)
- Product taxonomies (e.g., ETIM, UNSPSC)
Source: Amazon Reviews 2023 - Electronics Category
- Products: 1,000 carefully selected electronics products
- Reviews: 20,000 customer reviews (10-20 reviews per product)
- Date Range: 2003-2023 (20 years of review data)
- Categories: Comprehensive electronics categories with hierarchical structure
- Average reviews per product: 20
- Review rating distribution: 4.2/5.0 average
- Most active day: Tuesday (3,068 reviews)
- Most active month: January (2,283 reviews)
- Recent activity: 37.8% of reviews from 2020 onwards
- Embedding Model: GTE-large (1024 dimensions) for superior semantic search
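For reference, here is a minimal sketch of how GTE-large embeddings can be generated and stored in ChromaDB (the Hugging Face checkpoint `thenlper/gte-large` and the collection name are assumptions for illustration, not taken from this repository):

```python
# Minimal sketch: embed product text with GTE-large and store it in ChromaDB.
# Model checkpoint, persist directory, and collection name are assumptions.
import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("thenlper/gte-large")  # 1024-dimensional embeddings
client = chromadb.PersistentClient(path="data/chroma_db")
collection = client.get_or_create_collection("electronics_products")

docs = ["USB-C fast charger, 30W, compact design"]
collection.add(
    ids=["prod-001"],
    documents=docs,
    embeddings=model.encode(docs).tolist(),
    metadatas=[{"type": "product"}],
)
```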
- Python 3.12+
- uv package manager
- Docker (optional, for containerized deployment)
- Ollama (optional, for local LLM models)
1. Clone the repository

   ```bash
   git clone <repository-url>
   cd AI-Powered-Amazon-Product-Assistant
   ```

2. Install dependencies

   ```bash
   uv sync
   ```

3. Configure environment variables

   ```bash
   # Create .env file with your API keys
   cp .env.example .env  # if available, or create manually

   # Required for chatbot functionality
   echo "OPENAI_API_KEY=your_openai_key" >> .env
   echo "GROQ_API_KEY=your_groq_key" >> .env
   echo "GOOGLE_API_KEY=your_google_key" >> .env

   # Optional for Weave tracing
   echo "WANDB_API_KEY=your_wandb_key" >> .env

   # Optional for Ollama (local LLMs)
   echo "OLLAMA_BASE_URL=http://localhost:11434" >> .env
   ```

4. Set up Jupyter kernel

   ```bash
   uv run python -m ipykernel install --user --name ai-product-assistant
   ```

5. Run data processing (if needed)

   ```bash
   uv run jupyter notebook notebooks/data_preprocessing.ipynb
   ```

6. Launch applications

   ```bash
   # Visualization dashboard
   uv run jupyter notebook notebooks/data_visualization.ipynb

   # Enhanced Streamlit chatbot interface with tab-based UI and RAG
   uv run streamlit run src/chatbot-ui/streamlit_app.py
   # OR use Make
   make run-streamlit

   # Run FastAPI server with agent endpoint (required for agent mode)
   make run-api

   # Optional: Run with PostgreSQL for conversation persistence
   docker-compose -f docker-compose.postgres.yml up -d
   make run-api  # Will automatically detect and use PostgreSQL

   # Run Weave-native evaluation (RECOMMENDED - follows official best practices)
   uv run python scripts/eval/run_weave_native_evaluation.py --dataset-path "data/evaluation/rag_evaluation_dataset.json" --openai-api-key YOUR_KEY

   # Run model comparison with native evaluation
   uv run python scripts/eval/run_weave_native_evaluation.py --mode comparison --dataset-path "data/evaluation/rag_evaluation_dataset.json"

   # Run enhanced Weave-RAGAS evaluation (ensures all metrics visible in Weave UI)
   uv run python scripts/eval/run_enhanced_evaluation.py --single-query "What are iPhone charger features?" --wandb-api-key YOUR_KEY

   # Run full evaluation with complete metric tracking
   uv run python scripts/eval/run_enhanced_evaluation.py --dataset-path "data/evaluation/rag_evaluation_dataset.json" --wandb-api-key YOUR_KEY

   # Alternative: Standard RAGAS evaluation
   uv run python scripts/eval/run_ragas_evaluation.py --single-query "What are iPhone charger features?" --ground-truth "iPhone chargers typically feature Lightning connector, fast charging support, USB-C power adapter compatibility, and MFi certification"

   # Generate ragas test dataset (Note: If you get entity extraction errors, see CLAUDE.md)
   uv run python scripts/eval/generate_ragas_dataset.py --test-size 50

   # Alternative: Generate simple synthetic dataset
   uv run python scripts/eval/generate_simple_ragas_dataset.py --synthetic-only
   ```
Note: ChromaDB is an API service and doesn't have a web interface. To interact with your data, use the Streamlit app at http://localhost:8501.
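Alternatively, the database can be inspected programmatically with the standard `chromadb` HTTP client; a minimal sketch (the collection name is an assumption for illustration):

```python
# Minimal sketch: inspect the Dockerized ChromaDB service from Python.
import chromadb

client = chromadb.HttpClient(host="localhost", port=8000)
print(client.heartbeat())         # returns a timestamp if the service is up
print(client.list_collections())  # available collections

# Query a collection by name (name is an assumption for illustration)
collection = client.get_collection("electronics_products")
results = collection.query(query_texts=["iphone charger"], n_results=3)
print(results["documents"])
```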
```bash
# Build the containers
make build-docker-streamlit

# Run both Streamlit app and ChromaDB service
make run-docker-streamlit

# View logs
make logs-docker-streamlit

# Stop services
make stop-docker-streamlit

# Restart services
make restart-docker-streamlit
```
Docker Services:
- Streamlit App: http://localhost:8501 (Enhanced tab-based interface)
- ChromaDB API: http://localhost:8000 (API service - no web UI)
  - Health check: `curl http://localhost:8000/api/v2/heartbeat`
  - Collections: `curl http://localhost:8000/api/v2/collections`
- Persistent Storage: Vector data persisted in Docker volume
The application features a professional tab-based interface designed for optimal user experience:
🔧 Configuration Tab:
- System Status: Real-time monitoring of Weave tracing and RAG system initialization
- Model Selection: Choose from OpenAI (GPT-4o, GPT-4o-mini), Groq (Llama-3.3-70b), or Google (Gemini-2.0-flash)
- Parameter Controls: Fine-tune temperature, max tokens, top-p, and top-k with provider-specific support
- RAG Configuration: Enable/disable RAG with customizable product and review limits
💬 Query Tab:
- Smart Examples: 12+ categorized example queries across 6 use cases (Product Info, Reviews, Comparisons, Complaints, Recommendations, Use Cases)
- Query History: Access and reuse your last 10 queries with one click
- Auto-Suggestions: Get intelligent query completions based on partial input (3+ characters)
- Quick Filters: Filter by query type, product category, and price range
- Enhanced Input: Dynamic placeholders and integrated filter display
📊 Monitoring Tab:
- Session Statistics: Track message counts, query history, and usage patterns
- Real-Time Performance: View RAG vs LLM processing times with percentage breakdown
- RAG Analytics: Monitor retrieved products/reviews and query type detection
- System Health: Check API configurations and system component status
- Weave Integration: Direct links to W&B dashboard for detailed trace analysis
The application includes comprehensive Weave tracing for end-to-end AI pipeline monitoring and performance analysis.
1. Get W&B API Key
   - Sign up at wandb.ai
   - Get your API key from User Settings

2. Configure Tracing

   ```bash
   # Add to your .env file
   echo "WANDB_API_KEY=your_wandb_api_key" >> .env
   ```

3. Enhanced Features Tracked
   - Optimized Initialization: Single-session setup with session state management
   - RAG Pipeline Tracing: Query analysis, context building, and retrieval metrics
   - LLM Provider Tracking: Detailed request/response metadata for OpenAI, Groq, and Google
   - Performance Analytics: Sub-operation timing, character counts, and success rates
   - Error Classification: Structured error handling with types and fallback strategies
   - Real-Time UI Feedback: Processing times and operation status in sidebar
   - Context Quality Metrics: Query type detection, extracted terms, and retrieval effectiveness
   - Trace Optimization: Eliminated redundant calls and duplicate initialization

4. Optimized Operation Monitoring
   - Session-Based Initialization: Single setup per session via `@st.cache_resource` (see the sketch after this list)
   - Consolidated Tracing: Primary trace points at key pipeline stages
   - RAG Enhancement Metrics: Query processing timing and context quality
   - LLM Provider Analytics: Request/response data with performance breakdown
   - End-to-End Pipeline: Complete timing analysis from query to response
   - Zero-Redundancy Design: Eliminated multiple trace calls for same operations

5. Production-Ready Monitoring
   - Optimized Trace Volume: Meaningful traces without duplication
   - Session State Management: Prevents repeated initialization calls
   - Clean Dashboard Data: Visit your W&B dashboard for organized traces
   - Performance Insights: Navigate to the "Bootcamp" project for analytics
   - Error Tracking: Structured error handling with fallback strategies
   - Real-Time Feedback: Processing times displayed in Streamlit sidebar
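A minimal sketch of that session-based initialization pattern, assuming `weave` is installed (the project name comes from the notes above; the RAG factory import mirrors the usage examples later in this README):

```python
# Minimal sketch: initialize Weave tracing and the RAG processor once per
# Streamlit session so reruns don't create duplicate traces.
import streamlit as st
import weave


@st.cache_resource  # executed once per process; reruns reuse the cached result
def init_tracing_and_rag():
    weave.init("Bootcamp")  # project name taken from the monitoring notes above
    from src.rag.query_processor import create_rag_processor
    return create_rag_processor()


processor = init_tracing_and_rag()  # subsequent reruns reuse the cached objects
```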
The project includes scripts for managing and reinitializing the ChromaDB vector database:
```bash
# Check current database status
uv run python scripts/check_vector_db.py

# Reinitialize with your own JSONL data (simple)
uv run python scripts/reinit_vector_db_simple.py your_data.jsonl --clear

# Reinitialize with advanced options
uv run python scripts/reinit_vector_db.py \
    --jsonl-path your_data.jsonl \
    --batch-size 50 \
    --persist-dir custom_db \
    --collection-name my_collection

# Append new data without clearing
uv run python scripts/reinit_vector_db.py \
    --jsonl-path additional_data.jsonl \
    --no-clear-existing
```
Supported JSONL formats:
- Standard RAG format: `{"id": "...", "text": "...", "type": "product|review", "metadata": {...}}`
- Amazon format: `{"asin": "...", "title": "...", "description": "...", "reviewText": "..."}`
- Generic format: `{"content": "...", "category": "...", "source": "..."}`

For detailed documentation, see `scripts/README_vector_db.md`.
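As a quick illustration, a small script that emits documents in the standard RAG format might look like this (all field values below are illustrative):

```python
# Minimal sketch: write documents in the standard RAG JSONL format accepted
# by the reinitialization scripts. Field values here are illustrative only.
import json

docs = [
    {
        "id": "prod-001",
        "text": "USB-C fast charger, 30W, compact design",
        "type": "product",
        "metadata": {"category": "chargers", "price": 19.99},
    },
    {
        "id": "rev-001",
        "text": "Charges my phone in under an hour. Great value.",
        "type": "review",
        "metadata": {"rating": 5, "parent_id": "prod-001"},
    },
]

with open("your_data.jsonl", "w") as f:
    for doc in docs:
        f.write(json.dumps(doc) + "\n")
```

The resulting file can then be loaded with `uv run python scripts/reinit_vector_db_simple.py your_data.jsonl --clear`.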
The application supports local LLMs through Ollama via LiteLLM:
```bash
# Install Ollama (visit https://ollama.com for instructions)

# Pull and run a model
ollama pull llama3.2
ollama run llama3.2

# The app will automatically detect Ollama at http://localhost:11434
# Select "Ollama" as the provider in the Streamlit configuration tab
```
Note for Docker users: When running the Streamlit app in Docker, Ollama running on your host machine is accessible via `host.docker.internal:11434`. This is automatically configured in the `docker-compose.yml` file.
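Outside the Streamlit app, the same local model can be reached through LiteLLM's unified API; a minimal sketch (the model name and prompt are illustrative):

```python
# Minimal sketch: call a local Ollama model through LiteLLM's unified API.
from litellm import completion

response = completion(
    model="ollama/llama3.2",            # provider/model naming per LiteLLM
    messages=[{"role": "user", "content": "Summarize USB-C charger reviews."}],
    api_base="http://localhost:11434",  # default Ollama endpoint
)
print(response.choices[0].message.content)
```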
```
AI-Powered-Amazon-Product-Assistant/
├── 📁 data/
│ ├── Electronics.jsonl # Raw review data (25GB)
│ ├── meta_Electronics.jsonl # Raw product metadata (4.9GB)
│ ├── 📁 processed/
│ │ ├── electronics_top1000_products.jsonl # 1,000 product records
│ │ ├── electronics_top1000_products_reviews.jsonl # 20,000 review records
│ │ ├── electronics_rag_documents.jsonl # 2,000 RAG-optimized documents
│ │ ├── dataset_summary.json # Processing metadata
│ │ └── README.md # Data documentation
│ └── 📁 chroma_db/ # Vector database storage (local)
├── 📁 notebooks/
│ ├── data_preprocessing.ipynb # High-performance data processing with Polars
│ ├── data_visualization.ipynb # Efficient data visualization with Polars
│ ├── verify_api_keys.ipynb # API configuration testing
│ └── README.md # Notebook documentation
├── 📁 src/
│ ├── 📁 chatbot-ui/
│ │ ├── 📁 core/
│ │ │ └── config.py # Multi-provider configuration
│ │ ├── streamlit_app.py # Main chatbot interface with RAG
│ │ └── session_manager.py # Session management for agent conversations
│ ├── 📁 core/ # Core modules
│ │ ├── __init__.py # Core module initialization
│ │ ├── base_classes.py # Base abstract classes
│ │ ├── config_improved.py # Enhanced configuration (Pydantic V2)
│ │ ├── decorators.py # Utility decorators (retry, cache, timing)
│ │ ├── exceptions.py # Custom exception hierarchy
│ │ ├── implementations.py # Concrete implementations
│ │ ├── llm_providers.py # LLM provider management
│ │ ├── llm_service.py # LLM service interface
│ │ ├── logging_config.py # Logging configuration
│ │ ├── performance.py # Performance optimization utilities
│ │ └── structured_outputs.py # Pydantic models for structured LLM responses
│ ├── 📁 agents/ # LangGraph agent implementation (Sprint 3)
│ │ ├── __init__.py # Agent module initialization
│ │ ├── state.py # Agent state TypedDict definitions
│ │ ├── nodes.py # ReAct pattern nodes (reasoning, action, observation)
│ │ ├── graph.py # LangGraph workflow and routing
│ │ ├── react_agent.py # Main ReactAgent implementation
│ │ ├── 📁 tools/ # Agent tools
│ │ │ ├── __init__.py # Tools initialization
│ │ │ └── vector_search_tool.py # Vector search tool wrapping RAG
│ │ └── 📁 persistence/ # State persistence
│ │ ├── __init__.py # Persistence initialization
│ │ ├── models.py # SQLAlchemy models for state storage
│ │ └── postgres_checkpointer.py # PostgreSQL checkpointer for conversations
│ ├── 📁 api/ # FastAPI implementation (Sprint 2)
│ │ ├── __init__.py # API module initialization
│ │ ├── app.py # Main FastAPI application
│ │ ├── dependencies.py # Dependency injection
│ │ ├── models.py # Request/response models
│ │ ├── 📁 middleware/ # API middleware
│ │ │ ├── __init__.py # Middleware initialization
│ │ │ ├── rate_limiting.py # Rate limiting middleware
│ │ │ ├── cors.py # CORS configuration
│ │ │ ├── authentication.py # API key authentication
│ │ │ └── error_handling.py # Global error handling
│ │ └── 📁 routers/ # API route handlers
│ │ ├── __init__.py # Routers initialization
│ │ ├── health.py # Health check endpoints
│ │ └── rag.py # RAG and agent endpoints
│ ├── 📁 monitoring/ # Monitoring and observability
│ │ └── integration.py # Monitoring system integration
│ ├── 📁 prompts/ # Prompt management (Sprint 2)
│ │ ├── __init__.py # Prompts module initialization
│ │ ├── registry.py # Prompt template registry
│ │ ├── filters.py # Custom Jinja2 filters
│ │ └── templates/ # Jinja2 templates for all query types
│ ├── 📁 rag/
│ │ ├── vector_db.py # ChromaDB vector database (local, GTE-large)
│ │ ├── vector_db_docker.py # ChromaDB vector database (Docker, optimized)
│ │ ├── query_processor.py # RAG query processing (auto-selects implementation)
│ │ ├── hybrid_retrieval.py # BM25 and hybrid search implementation (Sprint 2)
│ │ └── 📁 experimental/ # Experimental implementations for reference
│ │ ├── vector_db_improved.py # Best practices reference implementation
│ │ ├── vector_db_migrated.py # Factory pattern implementation
│ │ └── vector_db_optimized.py # Performance optimization reference
│ ├── 📁 evaluation/
│ │ ├── __init__.py # Evaluation module interface
│ │ ├── rag_adapter.py # RAG system adapter for ragas framework
│ │ ├── ragas_evaluator.py # Main RAG evaluator using ragas
│ │ ├── ragas_reporter.py # HTML report generation for ragas results
│ │ ├── weave_ragas_evaluator.py # Basic Weave-RAGAS integration
│ │ ├── enhanced_weave_ragas.py # Enhanced Weave-RAGAS with full metric visibility
│ │ ├── weave_native_evaluation.py # Weave-native evaluation (best practices)
│ │ ├── dataset.py # Evaluation dataset creation and management
│ │ └── synthetic_data_generator.py # Advanced synthetic test data generation
│ └── 📁 tracing/
│ ├── business_intelligence.py # Business intelligence tracking
│ └── trace_utils.py # Tracing utilities and helpers
├── 📁 tests/ # Test suite
│ ├── __init__.py # Test module initialization
│ ├── conftest.py # Pytest configuration and fixtures
│ ├── test_basic.py # Basic test suite functionality
│ ├── test_infrastructure.py # Infrastructure test verification
│ ├── 📁 unit/ # Unit tests
│ │ ├── test_config.py # Configuration tests (Pydantic V2)
│ │ ├── test_decorators.py # Decorator functionality tests
│ │ ├── test_error_handling.py # Error handling tests
│ │ ├── test_litellm_service.py # LiteLLM service tests
│ │ ├── test_query_processor.py # Query processor tests
│ │ ├── test_vector_db.py # Vector database tests
│ │ └── test_vector_db_migrated.py # Migrated vector DB tests
│ ├── 📁 integration/ # Integration tests
│ │ ├── test_chatbot_e2e.py # End-to-end chatbot tests
│ │ ├── test_monitoring_integration.py # Monitoring integration tests
│ │ └── test_rag_pipeline.py # RAG pipeline tests
│ └── 📁 fixtures/ # Test fixtures and mock data
├── 📁 scripts/ # UV scripts and utilities
│ ├── 📁 eval/ # Evaluation runner scripts
│ │ ├── run_enhanced_evaluation.py # Enhanced Weave-RAGAS evaluation (RECOMMENDED)
│ │ ├── run_ragas_evaluation.py # Standard RAGAS evaluation
│ │ ├── run_weave_native_evaluation.py # Weave-native evaluation
│ │ ├── run_weave_ragas_evaluation.py # Basic Weave-RAGAS (DEPRECATED)
│ │ ├── generate_ragas_dataset.py # Generate ragas test datasets
│ │ └── generate_simple_ragas_dataset.py # Simplified dataset generation
│ ├── validate_config.py # Configuration validation script
│ ├── run_streamlit.py # UV script to run Streamlit app
│ ├── run_api_server.py # UV script to run FastAPI server
│ ├── lint.py # UV script for code linting
│ ├── format.py # UV script for code formatting
│ ├── clean_notebooks.py # UV script to clean notebook outputs
│ ├── list_scripts.py # List all available UV scripts
│ ├── check_vector_db.py # Check vector database status
│ ├── reinit_vector_db.py # Reinitialize vector database
│ ├── reinit_vector_db_simple.py # Simple vector database reinitialization
│ ├── init-postgres.sql # PostgreSQL schema initialization (Sprint 3)
│ └── test_agent_simple.py # Simple agent testing script
├── 📁 examples/
│ └── synthetic_data_examples.py # Synthetic data usage demonstrations
├── 📁 docs/ # Technical documentation
│ ├── 📁 architecture/ # System design documents
│ │ ├── CHROMA.md # ChromaDB integration guide
│ │ ├── LOCAL_VS_DOCKER.md # Local vs Docker implementation comparison
│ │ └── DASHBOARD_METRICS.md # Dashboard metrics interpretation
│ ├── 📁 guides/ # How-to guides
│ │ ├── WEAVE_TRACING_GUIDE.md # LLM tracing & monitoring guide
│ │ ├── EVALUATIONS.md # RAG evaluation framework documentation
│ │ ├── SYNTHETIC_DATA.md # Synthetic test data generation guide
│ │ ├── GEMINI_MESSAGE_HANDLING.md # Google Gemini integration guide
│ │ ├── DOCKER_TTY_FIXES.md # Container deployment fixes
│ │ ├── MONITORING_GUIDE.md # System monitoring setup
│ │ ├── PERFORMANCE_OPTIMIZATIONS.md # Performance optimization guide
│ │ └── ENHANCED_WEAVE_RAGAS_GUIDE.md # Enhanced Weave-RAGAS integration guide
│ ├── 📁 sprints/ # Sprint documentation
│ │ ├── SPRINT_0.md # Sprint 0 foundation summary
│ │ ├── SPRINT_1.md # Sprint 1 RAG implementation summary
│ │ ├── SPRINT_2.md # Sprint 2 production readiness summary
│ │ ├── SPRINT_3.md # Sprint 3 LangGraph agent summary
│ │ └── SPRINT_3_IMPLEMENTATION.md # Sprint 3 detailed implementation guide
│ ├── 📁 testing/ # Testing documentation
│ ├── 📁 development/ # Development process docs
│ └── 📁 planning/ # Vision and planning docs
├── 📄 pyproject.toml # uv dependencies & config
├── 📄 docker-compose.yml # Multi-service container setup
├── 📄 docker-compose.postgres.yml # Extended Docker config with PostgreSQL (Sprint 3)
├── 📄 Dockerfile # Container deployment
├── 📄 docker-entrypoint.sh # Container initialization script
├── 📄 Makefile # Build automation (Docker & shell commands)
├── 📄 PROJECT_CANVAS.md # Project roadmap & tasks
├── 📄 CLAUDE.md # AI assistant development log
└── 📄 README.md # Project documentation
```
The project includes a comprehensive data processing pipeline:
- Raw Data Ingestion: Processes large JSONL files from Amazon Reviews 2023
- Product Selection: Intelligently selects top 1000 products based on review volume and quality
- Review Sampling: Extracts representative reviews for each product
- Data Cleaning: Handles missing values, validates data integrity
- RAG Optimization: Formats data for retrieval-augmented generation systems
- Vector Database Creation: Automatic ingestion into ChromaDB with embeddings and metadata
- Query Processing: Intelligent context retrieval based on query type and intent
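To make the query type and intent step concrete, a simplified keyword-based classifier in the spirit of the query processor might look like the sketch below (the keyword lists are assumptions; the actual `query_processor.py` logic may differ):

```python
# Minimal sketch: keyword-based query type detection, in the spirit of the
# RAG query processor. Keyword lists are assumptions for illustration.
QUERY_PATTERNS = {
    "comparison": ["compare", "vs", "versus", "better than"],
    "complaint": ["complaint", "problem", "issue", "broken"],
    "recommendation": ["recommend", "best", "suggest", "should i buy"],
    "review": ["review", "what do people say", "opinions"],
}


def detect_query_type(query: str) -> str:
    q = query.lower()
    for query_type, keywords in QUERY_PATTERNS.items():
        if any(kw in q for kw in keywords):
            return query_type
    return "product_info"  # default when no pattern matches


print(detect_query_type("Compare iPhone and Android chargers"))  # comparison
```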
The visualization notebook provides comprehensive insights:
- Review Distribution Analysis: Product popularity and rating patterns
- Price Analysis: Price ranges and correlation with ratings
- Category Analysis: Hierarchical category exploration
- Store & Brand Analysis: Top performers and market distribution
- Temporal Analysis: Review trends over time (2003-2023)
- Text Analysis: Review length and content characteristics
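As a self-contained example of the temporal analysis, yearly review volume can be derived directly from the processed reviews file (a sketch; the millisecond-epoch `timestamp` field follows the Amazon Reviews 2023 convention and is assumed here):

```python
# Minimal sketch: plot review volume per year from the processed reviews file.
# Assumes each line carries a millisecond-epoch "timestamp" field, as in the
# Amazon Reviews 2023 dataset.
import json

import matplotlib.pyplot as plt
import pandas as pd

with open("data/processed/electronics_top1000_products_reviews.jsonl") as f:
    reviews = [json.loads(line) for line in f]

df = pd.DataFrame(reviews)
df["year"] = pd.to_datetime(df["timestamp"], unit="ms").dt.year

df["year"].value_counts().sort_index().plot(kind="bar", title="Reviews per year")
plt.show()
```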
- Data Processing: pandas, numpy, json, Polars (high-performance alternative)
- Visualization: matplotlib, seaborn, plotly
- Vector Database: Dual-architecture ChromaDB system (local: GTE-large, Docker: optimized)
- Embedding Models: GTE-large (development) and ChromaDB default (production) with automatic selection
- RAG Implementation: Custom query processing with intelligent context retrieval and environment detection
- Agent Framework: LangGraph for ReAct pattern agent with tool use (Sprint 3)
- State Persistence: PostgreSQL with SQLAlchemy for conversation management (Sprint 3)
- API Framework: FastAPI with middleware, routers, and dependency injection (Sprint 2)
- Structured Outputs: Instructor library with Pydantic models (Sprint 2)
- Prompt Management: Jinja2 templating system with registry (Sprint 2)
- Notebook Environment: Jupyter, IPython, Marimo (reactive notebooks)
- Package Management: uv (modern Python package manager)
- Web Interface: Professional Streamlit UI with tab-based architecture, smart query suggestions, and real-time monitoring
- LLM Providers: OpenAI GPT-4o, Groq Llama, Google Gemini 2.0, Ollama (100+ via LiteLLM)
- Monitoring: Optimized Weave tracing via Weights & Biases with session state management
- Configuration: Pydantic V2 settings with environment variables
- Testing: Pytest with 108+ tests, 91% coverage
- Containerization: Docker with non-root security, Docker Compose for multi-service deployment
1. Start Required Services:

   ```bash
   # Start API server (required for agent mode)
   make run-api

   # In another terminal, start Streamlit
   make run-streamlit
   ```

2. Enable Agent Mode:
   - Go to the Configuration tab
   - Enable "Enable RAG (Product Search)"
   - Toggle "🤖 Enable Agent Mode (ReAct)"

3. Ask Questions:
   - The agent will process queries with reasoning steps
   - View the reasoning trace in the expandable "🤔 Agent Reasoning Steps" section
   - Session info is displayed in the sidebar

4. Example Queries:
   - "What are the main complaints about laptop backpacks?"
   - "Compare iPhone and Android chargers"
   - "Find budget tablets under $200 with good reviews"
```python
# Load processed data
import pandas as pd
import json

# Load products
products = []
with open('data/processed/electronics_top1000_products.jsonl', 'r') as f:
    for line in f:
        products.append(json.loads(line.strip()))

df_products = pd.DataFrame(products)
print(f"Loaded {len(df_products)} products")
```
```python
# Test RAG system
from src.rag.query_processor import create_rag_processor

# Initialize processor
processor = create_rag_processor()

# Process a query
result = processor.process_query("What do people say about iPhone charger cables?")
print(f"Found {result['metadata']['num_products']} products and {result['metadata']['num_reviews']} reviews")
```
```python
# Run enhanced evaluation with full Weave visibility
from src.evaluation.enhanced_weave_ragas import create_enhanced_evaluator
import asyncio

# Create enhanced evaluator
model, evaluator = create_enhanced_evaluator(
    project_name="my-rag-evaluation",
    openai_api_key="your_key"
)

# Run single evaluation
async def evaluate():
    result = await evaluator.evaluate_example(
        model=model,
        question="What are iPhone charger features?",
        ground_truth="iPhone cables feature Lightning connectors..."
    )
    print(f"Overall Score: {result['overall_score']:.3f}")
    print(f"Metrics: {result['metrics']}")

asyncio.run(evaluate())
```
```python
# Use Weave's native evaluation framework
from src.evaluation.weave_native_evaluation import create_rag_model, create_native_evaluator
import asyncio

# Create model and evaluator
model = create_rag_model(model_name="rag-v1", temperature=0.7)
evaluator = create_native_evaluator(project_name="rag-eval")

# Run evaluation
async def native_evaluate():
    # Create dataset
    dataset = evaluator.create_dataset(
        [{"query": "What are iPhone features?", "expected_answer": "..."}],
        name="test_dataset"
    )
    # Run evaluation
    results = await evaluator.evaluate_model(
        model=model,
        dataset=dataset,
        evaluation_name="Baseline Test"
    )

asyncio.run(native_evaluate())
```
```python
# Generate synthetic evaluation data
from src.evaluation.synthetic_data_generator import create_synthetic_dataset, SyntheticDataConfig

# Custom configuration
config = SyntheticDataConfig(
    num_examples_per_category=5,
    difficulty_distribution={"easy": 0.3, "medium": 0.5, "hard": 0.2},
    variation_techniques=["rephrase", "specificity", "context"]
)

# Generate synthetic examples
synthetic_examples = create_synthetic_dataset(config, num_examples=30)
print(f"Generated {len(synthetic_examples)} synthetic test cases")

# Create mixed dataset (original + synthetic)
from src.evaluation.synthetic_data_generator import create_mixed_dataset
from src.evaluation.dataset import create_evaluation_dataset  # assumed location of this helper

original_examples = create_evaluation_dataset()
mixed_dataset = create_mixed_dataset(original_examples, synthetic_ratio=0.5)
```
```python
# Generate temporal analysis
from notebooks.data_visualization import temporal_analysis

temporal_analysis(df_reviews)
```
For detailed solutions to common issues, see docs/TROUBLESHOOTING.md.
- Ragas Entity Extraction Error: Use the simple generator: `uv run python scripts/eval/generate_simple_ragas_dataset.py --synthetic-only`
- Docker Ollama Connection: Already configured with `host.docker.internal` in `docker-compose.yml`
- Import Errors: Run `uv sync` to ensure all dependencies are installed
- Vector DB Hanging: Skip initialization during development: `SKIP_VECTOR_DB_INGESTION=true uv run streamlit run src/chatbot-ui/streamlit_app.py`
- Multiple Weave Traces: Fixed with session state management
- New Feature: Professional tab-based interface architecture
- Smart Query Features: Auto-suggestions, query history, and intelligent filters
- Real-Time Monitoring: Performance metrics, RAG analytics, and system health dashboard
- Enhanced Response Display: Context cards, structured information, and query analysis
- Improved UX: Organized configuration, categorized examples, and responsive design
- Issue Resolved: Eliminated multiple/redundant Weave trace calls
- Root Cause: Improper interaction between Streamlit caching and Weave decorators
- Solution: Session state initialization + consolidated trace entry points
- Result: Clean, meaningful traces with zero redundancy
- TOP PRIORITY Achievement: All RAGAS evaluation metrics now fully visible in Weave UI
- Enhanced Implementation: Created `enhanced_weave_ragas.py` with comprehensive metric tracking
- Three Evaluation Modes: Single query, full dataset, and comparison evaluations
- Complete Metric Visibility: All 8 RAGAS metrics (faithfulness, relevancy, precision, recall, etc.) tracked individually
- Performance Monitoring: Latency tracking for retrieval and generation phases
- Drill-Down Capabilities: Click any example in Weave UI to see full details and scores
- Comparison Views: Easy A/B testing with automatic leaderboard creation
- Documentation: Complete guide in `docs/guides/ENHANCED_WEAVE_RAGAS_GUIDE.md`
- Ollama Import Error: Removed direct ollama import, using LiteLLM's built-in support instead
- RAG Processor Initialization: Fixed query_patterns initialization when using existing vector database
- LLM Service Interface: Updated the Streamlit app to use the `chat()` method instead of `generate()` for proper message handling
- Weave Tracing: Removed the redundant `@weave.op()` decorator from the generate method to prevent argument mismatch errors
- Result: Seamless LiteLLM integration with support for 100+ providers including Ollama
- ReAct Pattern Agent: Fully functional reasoning-action-observation loop with LangGraph
- Tool Integration: Vector search wrapped as agent tool maintaining RAG capabilities
- Conversation Persistence: PostgreSQL-backed state management for multi-turn conversations
- Session Management: UUID-based session and thread tracking with UI integration
- Reasoning Transparency: Expandable reasoning traces in Streamlit interface
- API Enhancement: New `/api/v1/agent/query` endpoint with full agent capabilities
- Backward Compatibility: Existing RAG endpoints preserved; agent mode is an optional toggle
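For orientation, here is a minimal sketch of the ReAct wiring described above, using LangGraph's prebuilt agent (the model string, tool body, and Postgres URL are illustrative assumptions; the project's `react_agent.py` builds its graph explicitly):

```python
# Minimal sketch: a ReAct agent with a vector-search tool and Postgres-backed
# conversation persistence, in the spirit of the Sprint 3 design. Model string,
# tool body, and connection URL are illustrative assumptions.
from langchain_core.tools import tool
from langgraph.checkpoint.postgres import PostgresSaver
from langgraph.prebuilt import create_react_agent


@tool
def vector_search(query: str) -> str:
    """Search the product and review vector database."""
    from src.rag.query_processor import create_rag_processor
    return str(create_rag_processor().process_query(query))


# Connection URL is an assumption; setup() creates the checkpoint tables once.
with PostgresSaver.from_conn_string("postgresql://user:pass@localhost:5432/agent") as saver:
    saver.setup()
    agent = create_react_agent("openai:gpt-4o-mini", [vector_search], checkpointer=saver)
    result = agent.invoke(
        {"messages": [("user", "Find budget tablets under $200 with good reviews")]},
        config={"configurable": {"thread_id": "demo-thread"}},  # keyed multi-turn session
    )
    print(result["messages"][-1].content)
```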
This project includes comprehensive documentation to help you understand and work with the system:
Project roadmap and task tracking
- Complete project overview and goals
- Sprint 0 and Sprint 1 deliverables with detailed task breakdowns
- EDA findings and dataset analysis summary
- Configuration features and tracing implementation status
- Success criteria and architecture decisions
Sprint 0 foundation summary
- Foundational components completed (June 28, 2025)
- Data processing pipeline, LLM configuration, monitoring setup
- Project setup, environment configuration, and architecture planning
- Technical achievements and development infrastructure
- Foundation established for RAG implementation
Sprint 1 RAG prototype implementation
- Complete RAG system implementation following course requirements
- Vector database setup, basic RAG pipeline, instrumentation, and evaluation
- All 4 instructor-specified tasks completed (Lessons 3-6)
- Advanced features beyond scope: query intelligence, dual-architecture, synthetic data
- W&B integration with comprehensive evaluation framework
Sprint 2 production readiness
- Complete production implementation with FastAPI REST API
- Hybrid retrieval with BM25 and Reciprocal Rank Fusion
- Structured outputs using Instructor library with Pydantic models
- Jinja2 prompt management system with template registry
- 108+ tests with 91% coverage and 60-96% performance improvements
Sprint 3 LangGraph agent
- Transformed RAG system into intelligent conversational agent using LangGraph
- Implemented ReAct pattern with reasoning, action, and observation nodes
- Created vector search tool wrapping existing RAG functionality
- Added PostgreSQL persistence for multi-turn conversation support
- Session management with UUID-based tracking
- Agent mode toggle in Streamlit UI with reasoning trace visibility
- New `/api/v1/agent/query` endpoint for agent interactions
Complete ChromaDB integration guide
- GTE-large embedding model implementation details
- Data loading process and timeline details
- Search capabilities and metadata schema
- Performance monitoring and logging
- Troubleshooting guide and best practices
- API reference and usage examples
Local development vs Docker production comparison
- Dual-architecture approach explanation (vector_db.py vs vector_db_docker.py)
- Embedding strategy differences (GTE-large vs ChromaDB default)
- Connection architecture and storage configuration details
- Performance comparison and resource usage analysis
- Use case guidelines and migration considerations
- Troubleshooting and best practices for both environments
Comprehensive LLM tracing and monitoring guide
- Complete Weave integration implementation details
- Configuration parameter tracking (temperature, max_tokens, top_p, top_k)
- W&B dashboard setup and trace analysis
- Provider-specific handling and error resilience
- Performance monitoring and debugging techniques
- Troubleshooting guide for common tracing issues
RAG evaluation framework documentation
- Industry-standard RAGAS evaluation framework with enhanced Weave integration
- Core RAGAS metrics: Faithfulness, Answer Relevancy, Context Precision, Context Recall
- Additional metrics: Context Utilization, Answer Correctness, Similarity, Completeness
- Complete visibility of all metrics in Weave UI with drill-down capabilities
- Command-line interface for single query, dataset, and comparison evaluations
- Performance tracking with retrieval and generation latency monitoring
Synthetic test data generation guide
- Advanced synthetic data generation with template-based queries and variation techniques
- Configurable generation parameters: difficulty distribution, query types, and variation methods
- Quality validation tools: uniqueness analysis, length distribution, and topic coverage
- Weave integration for full traceability and performance monitoring
- Mixed dataset creation combining original and synthetic data for robust testing
- Best practices implementation and troubleshooting guide
Enhanced Weave-RAGAS integration guide
- Complete guide for using enhanced evaluation with full metric visibility in Weave UI
- Three evaluation modes: single query, dataset, and comparison evaluations
- All 8 RAGAS metrics tracked individually with proper attribution
- Performance monitoring for retrieval and generation phases
- Step-by-step instructions and command examples
- Troubleshooting and best practices for production use
Dashboard metrics interpretation & implementation guide
- Comprehensive documentation of all monitoring dashboard metrics
- Session statistics including conversation balance logic and message ratio handling
- Performance monitoring with provider-specific tracking and comparison
- RAG system metrics including vector performance and context quality
- Business intelligence integration with user journey analytics
- Configuration status monitoring and system health indicators
- Implementation details and troubleshooting guidelines
Google Gemini integration guide
- Complete Google GenAI client message formatting requirements
- Role conversion and content validation for Gemini compatibility
- Error resolution for INVALID_ARGUMENT and empty message parts
- Performance monitoring and provider-specific baselines
- Integration with Enhanced Tracing v2.0 system
- Troubleshooting guide and best practices
Containerized deployment compatibility guide
- Docker TTY issues and solutions for production deployment
- Non-root user configuration and security best practices
- Streamlit headless configuration for container environments
- Weave tracing compatibility in containerized setups
- Complete verification steps and troubleshooting
Weave-native evaluation using official best practices
- Proper Model and Dataset versioning with Weave
- Built-in metric aggregation and comparison views
- Custom scorer implementation patterns
- Native UI integration for evaluation results
- Migration guide from custom implementations
AI assistant development log
- Detailed record of changes and improvements made by the AI assistant
- Implementation decisions and technical explanations
- Feature development timeline and reasoning
- Code modifications and their rationale
These documents provide in-depth technical guidance beyond the quick start instructions in this README, covering advanced topics like monitoring, containerization, and project management.
This project uses data from the Amazon Reviews 2023 dataset:
```bibtex
@article{hou2024bridging,
  title={Bridging Language and Items for Retrieval and Recommendation},
  author={Hou, Yupeng and Li, Jiacheng and He, Zhankui and Yan, An and Chen, Xiusi and McAuley, Julian},
  journal={arXiv preprint arXiv:2403.03952},
  year={2024}
}
```
This is a capstone project for educational purposes. Feel free to explore, learn, and adapt the code for your own projects.
This project is licensed under the terms specified in the LICENSE file.