This repository contains a complete experimental setup for evaluating different AI agent architectures for answering questions about Invopop, GOBL, and VeriFactu documentation. The experiments compare various approaches including basic LLMs, web search augmentation, RAG systems, multi-agent architectures, and MCP-based solutions.
Based on our comprehensive evaluation across different AI architectures, here are the complete results:
Model | Method | Accuracy | VeriFactu Acc. | Cost ($/answer) | P50 Time (s) | P90 Time (s) | P99 Time (s) |
---|---|---|---|---|---|---|---|
Basic LLM (No System Prompts) | |||||||
GPT-4.1 | No prompt | 0.505 | 0.554 | 0.00479 | 11.668 | 17.808 | 31.023 |
GPT-5 | Minimal no prompt | 0.494 | 0.574 | 0.00442 | 8.170 | 15.880 | 25.440 |
GPT-5 | Low no prompt | 0.543 | 0.608 | 0.01110 | 18.863 | 32.797 | 46.871 |
Basic LLM (With System Prompts) | |||||||
GPT-4.1-mini | Base prompt | 0.538 | 0.568 | 0.00045 | 4.050 | 6.160 | 8.333 |
GPT-4.1 | Base prompt | 0.594 | 0.606 | 0.00228 | 4.071 | 6.335 | 10.390 |
GPT-5-mini | Minimal | 0.540 | 0.599 | 0.00072 | 6.154 | 9.557 | 12.907 |
GPT-5-mini | Low | 0.506 | 0.557 | 0.00128 | 9.839 | 14.660 | 21.774 |
GPT-5-mini | Medium | 0.522 | 0.547 | 0.00274 | 18.505 | 25.308 | 36.934 |
GPT-5-mini | High | 0.565 | 0.612 | 0.00759 | 54.903 | 83.322 | 130.443 |
GPT-5 | Minimal | 0.613 | 0.651 | 0.00347 | 5.593 | 8.552 | 13.877 |
GPT-5 | Low | 0.617 | 0.653 | 0.01183 | 15.447 | 25.019 | 32.805 |
GPT-5 | Medium | 0.626 | 0.650 | 0.02473 | 32.310 | 49.589 | 75.491 |
GPT-5 | High | 0.654 | 0.653 | 0.02434 | 43.003 | 79.790 | 127.549 |
Web Search Augmented | |||||||
GPT-4.1 | Web search | 0.425 | 0.446 | 0.03982 | 8.617 | 12.069 | 15.473 |
GPT-5-mini | Low + web | 0.631 | 0.676 | 0.00887 | 19.138 | 27.833 | 41.627 |
GPT-5-mini | Medium + web | 0.648 | 0.690 | 0.03984 | 48.344 | 84.924 | 137.963 |
GPT-5-mini | High + web | 0.678 | 0.705 | 0.12381 | 150.450 | 231.338 | 292.249 |
GPT-5 | Low + web | 0.713 | 0.743 | 0.14872 | 37.386 | 59.812 | 82.178 |
GPT-5 | Medium + web | 0.716 | 0.746 | 0.29536 | 66.178 | 101.190 | 134.978 |
GPT-5 | High + web | 0.722 | 0.751 | 0.49285 | 114.294 | 158.767 | 199.264 |
GPT-5 | Low + web (1 tool) | 0.646 | 0.664 | 0.03940 | 24.727 | 35.143 | 48.019 |
GPT-5 | High + web (1 tool) | 0.639 | 0.660 | 0.07069 | 62.029 | 93.400 | 134.649 |
RAG Systems | |||||||
GPT-4.1 | RAG | 0.678 | 0.717 | 0.0376 | 11.79 | 17.94 | 28.82 |
GPT-5-mini | Low + RAG | 0.653 | 0.681 | 0.0064 | 16.58 | 23.41 | 38.31 |
GPT-5-mini | Medium + RAG | 0.680 | 0.722 | 0.0127 | 36.5 | 56.7 | 91.99 |
GPT-5-mini | High + RAG | 0.692 | 0.733 | 0.0320 | 88.40 | 152.76 | 213.64 |
GPT-5 | Low + RAG | 0.719 | 0.734 | 0.0526 | 25.33 | 38.88 | 53.46 |
GPT-5 | Medium + RAG | 0.730 | 0.762 | 0.1061 | 52.12 | 78.73 | 116.30 |
GPT-5 | High + RAG | 0.739 | 0.769 | 0.1696 | 96.11 | 147.19 | 219.07 |
GPT-5 | Low + RAG (1 tool) | 0.703 | 0.738 | 0.0421 | 23.81 | 36.10 | 51.44 |
GPT-5 | Medium + RAG (1 tool) | 0.701 | 0.740 | 0.0612 | 44.80 | 68.02 | 87.97 |
Multi-Agent RAG (Distributed Tools) | |||||||
GPT-5-mini | Low + agents | 0.654 | 0.686 | 0.0071 | 18.44 | 26.17 | 37.16 |
GPT-5-mini | High + agents | 0.681 | 0.721 | 0.0535 | 159.57 | 288.55 | 407.10 |
GPT-4.1 | Multi-agent | 0.673 | 0.717 | 0.0460 | 14.45 | 28.18 | 47.20 |
GPT-5 | Low + agents | 0.739 | 0.772 | 0.0602 | 40.12 | 71.62 | 110.23 |
GPT-5 | Medium + agents | 0.740 | 0.757 | 0.1506 | 101.26 | 182.65 | 248.26 |
GPT-5 | High + agents | 0.750 | 0.775 | 0.2673 | 191.05 | 347.49 | 478.04 |
MCP-Based Solutions | |||||||
GPT-4.1 | MCP | 0.685 | 0.709 | 0.032 | 14.45 | 28.18 | 47.20 |
GPT-5 | Low + MCP | 0.742 | 0.757 | 0.0523 | 54.05 | 101.12 | 138.23 |
GPT-5 | Medium + MCP | 0.744 | 0.761 | 0.1332 | 165.26 | 275.75 | 379.76 |
- Best Overall Performance: GPT-5 High + Multi-Agent (75.0% accuracy, 77.5% VeriFactu)
- Best Cost-Performance: GPT-5-mini Low + Multi-Agent (65.4% accuracy at $0.0071/answer)
- Fastest Response: GPT-4.1-mini (4.050s P50, $0.00045/answer)
- Best Balance: GPT-5 Low + MCP (74.2% accuracy, $0.0523/answer, 54s P50)
- Best VeriFactu Performance: GPT-5 High + Multi-Agent (77.5% VeriFactu accuracy)
Before running any experiments, ensure you have the following API keys:
- OpenAI API Key - For LLM inference and vector store
- Opik API Key and workspace - For experiment tracking and evaluation
- FireCrawl API Key - For web crawling documentation
Set these as environment variables:
export OPENAI_API_KEY="your_openai_api_key"
export OPIK_API_KEY="your_opik_api_key"
export OPIK_WORKSPACE="your_opik_workspace"
export FIRECRAWL_API_KEY="your_firecrawl_api_key"
popbot-experiments/
├── README.md                              # This file
├── EXPERIMENT_GUIDE.md                    # Detailed experiment instructions
├── QUICK_START.md                         # Quick start guide
├── data/
│   └── anonymized_benchmark_june_july.json  # Evaluation dataset
├── solutions/                             # Different AI architectures
│   ├── basic_llm/                         # Direct LLM calls
│   ├── web/                               # Web search augmented
│   ├── file_search/                       # RAG with single vector store
│   ├── agent_all_rag_langgraph/           # Multi-agent with distributed RAG
│   └── mcp/                               # MCP-based solutions
├── eval/                                  # Evaluation scripts and metrics
│   ├── run_basic_llm_eval.py              # Configurable basic LLM evaluation
│   ├── run_web_search_eval.py             # Configurable web search evaluation
│   ├── run_file_search_eval.py            # Configurable RAG evaluation
│   ├── run_agent_rag_eval.py              # Configurable multi-agent evaluation
│   ├── run_mcp_eval.py                    # Configurable MCP evaluation
│   └── correctness.py                     # Evaluation metrics
├── vector_store/                          # Vector database management
├── load/                                  # Data loading and preprocessing
├── files/                                 # Official documentation files
└── pyproject.toml                         # Python project configuration
- Clone this repository
- Install dependencies:
uv sync
- Set up your API keys (see Prerequisites section above)
The evaluation dataset `anonymized_benchmark_june_july.json` is already included in the `data/` folder. It contains customer questions from Slack channels, anonymized for privacy.
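If you want a quick look at the benchmark before running anything, a short Python snippet is enough. The dataset's exact schema is not documented here, so the sketch below only prints its size and the fields of the first record rather than assuming specific keys:

import json

# Load the anonymized benchmark (path relative to the repository root)
with open("data/anonymized_benchmark_june_july.json") as f:
    benchmark = json.load(f)

print("Top-level type:", type(benchmark).__name__)
if isinstance(benchmark, list) and benchmark and isinstance(benchmark[0], dict):
    print("Number of items:", len(benchmark))
    print("Fields in first record:", list(benchmark[0].keys()))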
You need to gather three types of documents for the RAG systems:
A. Crawled Documentation (Invopop & GOBL docs)
cd load/crawler
uv run python crawl_docs.py docs.invopop.com
uv run python crawl_docs.py docs.gobl.org
You can also use the docs already included in `load/crawler/docs` directly; these are the crawls used to produce the results reported above. If you want up-to-date documentation, we recommend re-crawling with Firecrawl, keeping in mind that the bundled docs may have drifted from what the MCP and web-search solutions see, since those always use the live documentation.
B. GitHub Repository Code
Clone the github.com/invopop/gobl and github.com/invopop/gobl.verifactu repositories at the same level as this repository.
C. Official VeriFactu Documents
The URLs for the official documents are listed in `files/verifactu/json/verifactu.json` and would normally need to be downloaded manually. For convenience, they have already been downloaded and are stored in `files/verifactu/`.
The vector store setup has been streamlined with improved scripts that handle configuration automatically.
Use the comprehensive setup script that handles everything:
cd vector_store/openai
# Complete setup with all document types
uv run python setup_vector_store.py
# Or with custom options
uv run python setup_vector_store.py --name "My Custom Vector Store" --skip-github
The setup script will:
- Check all prerequisites automatically
- Create the vector store with proper configuration
- Add Firecrawl documentation (GOBL + Invopop)
- Add GitHub code repositories (gobl + gobl.verifactu)
- Add official VeriFactu documents
- Provide progress reporting and final statistics
- Save configuration for future use
For more control, run individual scripts:
cd vector_store/openai
# 1. Create the vector store
uv run python create_vector_store.py
# 2. Add different document types (in any order)
uv run python add_firecrawl_docs.py # Add crawled documentation
uv run python add_github_code.py # Add GitHub code repositories
uv run python add_official_docs.py # Add official PDF documents
If you encounter issues:
# Check what's missing
uv run python setup_vector_store.py --dry-run
# Skip problematic document types
uv run python setup_vector_store.py --skip-github
# Force recreation of vector store
uv run python setup_vector_store.py --force
The vector store configuration is saved in `vector_store_config.json` for future reference.
After setting up your vector store, configure the `OPENAI_VECTOR_STORE_ID` environment variable with the ID from your configuration:
# Extract the vector store ID from the config file
export OPENAI_VECTOR_STORE_ID=$(python -c "import json; print(json.load(open('vector_store_config.json'))['vector_store_id'])")
# Or set it manually by copying the ID from vector_store_config.json
export OPENAI_VECTOR_STORE_ID="vs_your_vector_store_id_here"
# Add to your shell profile for persistence (optional)
echo "export OPENAI_VECTOR_STORE_ID=$OPENAI_VECTOR_STORE_ID" >> ~/.bashrc
This environment variable is required for all solutions to access the vector store. Alternatively, you can pass the `vector_store_id` directly as a constructor argument when initializing the services.
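For the constructor-argument route, the sketch below shows the general idea; `FileSearchService` and its parameter name are illustrative placeholders, so substitute the actual service class from the solution you are using (e.g. under `solutions/file_search/`):

import json

# Read the ID saved by the setup script (run from the directory that
# contains vector_store_config.json, e.g. vector_store/openai/)
with open("vector_store_config.json") as f:
    vector_store_id = json.load(f)["vector_store_id"]

# Hypothetical usage: pass the ID explicitly instead of relying on
# the OPENAI_VECTOR_STORE_ID environment variable.
# service = FileSearchService(vector_store_id=vector_store_id)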
Use the management utility to inspect and manage your vector store:
cd vector_store/openai
# Show vector store information and statistics
uv run python manage_vector_store.py info
uv run python manage_vector_store.py stats
# List all files in the vector store
uv run python manage_vector_store.py files
# List all vector stores in your account
uv run python manage_vector_store.py list
# Delete the vector store (with confirmation)
uv run python manage_vector_store.py delete
The MCP solution uses the Model Context Protocol to access documentation through specialized MCP servers.
Install the required MCP servers using mint-mcp:
# Install the Invopop and GOBL MCP servers
npx mint-mcp add invopop
npx mint-mcp add gobl
Note: You may see error messages like "Error installing MCP for invopop: Cannot read properties of undefined (reading 'name')" during installation. This is a known issue with mint-mcp; the servers are still installed correctly in `~/.mcp/`.
Check that the MCP servers are installed:
ls -la ~/.mcp/
# Should show: invopop/ and gobl/ directories
Install Node.js dependencies for both MCP servers:
cd ~/.mcp/invopop && npm install
cd ~/.mcp/gobl && npm install
The MCP servers require `tools.json` files that might be missing after installation. Create them:
# Create empty tools.json files for both servers
echo "[]" > ~/.mcp/invopop/src/tools.json
echo "[]" > ~/.mcp/gobl/src/tools.json
Test that the MCP setup works by running the main application:
# Set a temporary API key for testing (replace with your real key)
export OPENAI_API_KEY="your_openai_api_key_here"
# Run the MCP application
uv run python -m solutions.mcp.main --config solutions/mcp/config.yaml --verbose
Success indicators:
- β "MCP Server running on stdio" messages appear
- β Welcome message displays
- β π€ You: prompt appears (this confirms MCP setup is working)
The MCP configuration is handled via `solutions/mcp/config.yaml`:
mcp:
servers:
invopop:
command: "node"
args: ["~/.mcp/invopop/src/index.js"]
transport: "stdio"
gobl:
command: "node"
args: ["~/.mcp/gobl/src/index.js"]
transport: "stdio"
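As a rough illustration of how this file is consumed, the snippet below parses the configuration and prints the launch command for each server. It assumes PyYAML is installed; the real wiring lives in `solutions.mcp.main` and may differ:

import os
import yaml

with open("solutions/mcp/config.yaml") as f:
    config = yaml.safe_load(f)

for name, server in config["mcp"]["servers"].items():
    # Expand "~" so the paths point at the servers installed under ~/.mcp/
    args = [os.path.expanduser(arg) for arg in server["args"]]
    print(f"{name}: {server['command']} {' '.join(args)} (transport={server['transport']})")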
Upload the dataset to Opik for tracking:
cd eval
uv run python create_dataset.py
Test direct LLM performance without any augmentation using the new configurable evaluation runner:
cd eval
# Basic usage with defaults (gpt-5-mini, medium reasoning, with prompt)
uv run python run_basic_llm_eval.py
# Model variations
uv run python run_basic_llm_eval.py --model gpt-4.1
uv run python run_basic_llm_eval.py --model gpt-5
uv run python run_basic_llm_eval.py --model gpt-5-mini
# Reasoning effort configurations (gpt-5/gpt-5-mini only)
uv run python run_basic_llm_eval.py --model gpt-5 --reasoning-effort minimal
uv run python run_basic_llm_eval.py --model gpt-5 --reasoning-effort low
uv run python run_basic_llm_eval.py --model gpt-5 --reasoning-effort medium
uv run python run_basic_llm_eval.py --model gpt-5 --reasoning-effort high
# Compare with/without system prompt
uv run python run_basic_llm_eval.py --model gpt-5 --reasoning-effort low
uv run python run_basic_llm_eval.py --model gpt-5 --reasoning-effort low --no-prompt
# Custom experiment naming and advanced options
uv run python run_basic_llm_eval.py \
--model gpt-5-mini \
--reasoning-effort high \
--no-prompt \
--experiment-name "gpt5mini_high_no_prompt_test" \
--threads 8
Configuration Options:
- Models: `gpt-4.1`, `gpt-5`, `gpt-5-mini`
- Reasoning Effort (gpt-5/gpt-5-mini only): `minimal`, `low`, `medium`, `high`
- Prompt Control: Use `--no-prompt` to test raw model performance without a system prompt
- Custom Naming: Use `--experiment-name` for better experiment tracking
- Threading: Use `--threads N` to adjust parallel evaluation threads
- Dataset: Use `--dataset-name` to specify different evaluation datasets
View all options:
uv run python run_basic_llm_eval.py --help
Expected Performance (based on comprehensive evaluations):
- GPT-4.1-mini: ~53.8% accuracy, $0.00045/answer, ~4s response
- GPT-4.1: ~59.4% accuracy, $0.0023/answer, ~4s response
- GPT-5-mini minimal: ~54.0% accuracy, $0.00072/answer, ~6s response
- GPT-5 minimal: ~61.3% accuracy, $0.0035/answer, ~6s response
- GPT-5 low: ~61.7% accuracy, $0.0118/answer, ~15s response
- GPT-5 medium: ~62.6% accuracy, $0.0247/answer, ~32s response
- GPT-5 high: ~65.4% accuracy, $0.0243/answer, ~43s response
Detailed Guide: See `eval/BASIC_LLM_EVALUATION_GUIDE.md` for comprehensive usage examples and troubleshooting.
Legacy Script: The original `eval_basic_llm.py` is still available but requires manual configuration.
Test web search augmented models using the new configurable evaluation runner:
cd eval
# Basic usage with defaults (gpt-5-mini, medium reasoning, multiple tools)
uv run python run_web_search_eval.py
# Model variations
uv run python run_web_search_eval.py --model gpt-4.1
uv run python run_web_search_eval.py --model gpt-5
uv run python run_web_search_eval.py --model gpt-5-mini
# Reasoning effort configurations (gpt-5/gpt-5-mini only)
uv run python run_web_search_eval.py --model gpt-5 --reasoning-effort minimal
uv run python run_web_search_eval.py --model gpt-5 --reasoning-effort low
uv run python run_web_search_eval.py --model gpt-5 --reasoning-effort medium
uv run python run_web_search_eval.py --model gpt-5 --reasoning-effort high
# Tool usage comparison
uv run python run_web_search_eval.py --model gpt-5 --reasoning-effort low
uv run python run_web_search_eval.py --model gpt-5 --reasoning-effort low --limit-tools
# Custom experiment naming and advanced options
uv run python run_web_search_eval.py \
--model gpt-5-mini \
--reasoning-effort high \
--limit-tools \
--experiment-name "gpt5mini_high_single_tool_test" \
--threads 8
Configuration Options:
- Models: `gpt-4.1`, `gpt-5`, `gpt-5-mini`
- Reasoning Effort (gpt-5/gpt-5-mini only): `minimal`, `low`, `medium`, `high`
- Tool Limit: Use `--limit-tools` to restrict to one web search per query
- Custom Naming: Use `--experiment-name` for better experiment tracking
- Threading: Use `--threads N` to adjust parallel evaluation threads
- Dataset: Use `--dataset-name` to specify different evaluation datasets
View all options:
uv run python run_web_search_eval.py --help
Expected Performance (based on comprehensive evaluations):
- GPT-4.1: ~42.5% accuracy, $0.0398/answer, ~9s response (poor performance)
- GPT-5-mini low: ~63.1% accuracy, $0.0089/answer, ~19s response
- GPT-5-mini medium: ~64.8% accuracy, $0.0398/answer, ~48s response
- GPT-5-mini high: ~67.8% accuracy, $0.124/answer, ~150s response
- GPT-5 low: ~71.3% accuracy, $0.149/answer, ~37s response
- GPT-5 medium: ~71.6% accuracy, $0.295/answer, ~66s response
- GPT-5 high: ~72.2% accuracy, $0.493/answer, ~114s response
- GPT-5 low (1 tool): ~64.6% accuracy, $0.0394/answer, ~25s response
Legacy Script: The original `eval_web_search.py` is still available but requires manual configuration.
Test retrieval augmented generation using the new configurable evaluation runner:
cd eval
# Basic usage with defaults (gpt-4.1, medium reasoning, multiple file searches)
uv run python run_file_search_eval.py
# Model variations
uv run python run_file_search_eval.py --model gpt-4.1
uv run python run_file_search_eval.py --model gpt-5
uv run python run_file_search_eval.py --model gpt-5-mini
# Reasoning effort configurations (gpt-5/gpt-5-mini only)
uv run python run_file_search_eval.py --model gpt-5 --reasoning-effort minimal
uv run python run_file_search_eval.py --model gpt-5 --reasoning-effort low
uv run python run_file_search_eval.py --model gpt-5 --reasoning-effort medium
uv run python run_file_search_eval.py --model gpt-5 --reasoning-effort high
# File search limit comparison
uv run python run_file_search_eval.py --model gpt-5 --reasoning-effort low
uv run python run_file_search_eval.py --model gpt-5 --reasoning-effort low --limit-tools
# Custom experiment naming and advanced options
uv run python run_file_search_eval.py \
--model gpt-5-mini \
--reasoning-effort high \
--limit-tools \
--experiment-name "gpt5mini_high_single_search_test" \
--threads 8
Configuration Options:
- Models: `gpt-4.1`, `gpt-5`, `gpt-5-mini`
- Reasoning Effort (gpt-5/gpt-5-mini only): `minimal`, `low`, `medium`, `high`
- Tool Limit: Use `--limit-tools` to restrict to one file search per query
- Custom Naming: Use `--experiment-name` for better experiment tracking
- Threading: Use `--threads N` to adjust parallel evaluation threads
- Dataset: Use `--dataset-name` to specify different evaluation datasets
View all options:
uv run python run_file_search_eval.py --help
Expected Performance (based on comprehensive evaluations):
- GPT-4.1: ~67.8% accuracy, $0.0376/answer, ~12s response
- GPT-5-mini low: ~65.3% accuracy, $0.0064/answer, ~17s response
- GPT-5-mini medium: ~68.0% accuracy, $0.0127/answer, ~37s response
- GPT-5-mini high: ~69.2% accuracy, $0.0320/answer, ~88s response
- GPT-5 low: ~71.9% accuracy, $0.0526/answer, ~25s response
- GPT-5 medium: ~73.0% accuracy, $0.106/answer, ~52s response
- GPT-5 high: ~73.9% accuracy, $0.170/answer, ~96s response
- GPT-5 low (1 tool): ~70.3% accuracy, $0.0421/answer, ~24s response
Legacy Script: The original `eval_file_search.py` is still available but requires manual configuration.
Test distributed RAG with specialized tools using the new configurable evaluation runner:
cd eval
# Basic usage with defaults (gpt-5, high reasoning)
uv run python run_agent_rag_eval.py
# Model variations
uv run python run_agent_rag_eval.py --model gpt-4.1
uv run python run_agent_rag_eval.py --model gpt-5
uv run python run_agent_rag_eval.py --model gpt-5-mini
# Reasoning effort configurations (gpt-5/gpt-5-mini only)
uv run python run_agent_rag_eval.py --model gpt-5 --reasoning-effort low
uv run python run_agent_rag_eval.py --model gpt-5 --reasoning-effort medium
uv run python run_agent_rag_eval.py --model gpt-5 --reasoning-effort high
# Custom experiment naming and advanced options
uv run python run_agent_rag_eval.py \
--model gpt-5-mini \
--reasoning-effort medium \
--experiment-name "gpt5mini_medium_multiagent_test" \
--threads 16
Configuration Options:
- Models: `gpt-4.1`, `gpt-5`, `gpt-5-mini`
- Reasoning Effort (gpt-5/gpt-5-mini only): `minimal`, `low`, `medium`, `high`
- Custom Naming: Use `--experiment-name` for better experiment tracking
- Threading: Use `--threads N` to adjust parallel evaluation threads
- Dataset: Use `--dataset-name` to specify different evaluation datasets
View all options:
uv run python run_agent_rag_eval.py --help
Features:
- Specialized Tools: Separate tools for different document types (VeriFactu, Invopop, GOBL docs/code)
- Intelligent Routing: Agent automatically chooses appropriate tools based on question context (see the sketch below)
- LangGraph Orchestration: Advanced workflow management with memory and checkpointing
- Multi-Source RAG: Can access and combine information from multiple knowledge sources
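To make the distributed-tools idea concrete, here is a minimal sketch using LangGraph's prebuilt ReAct agent and LangChain tools. The tool bodies are placeholders, and the real implementation in `solutions/agent_all_rag_langgraph/` wires them to the vector store and adds its own prompts, memory, and checkpointing, so treat this as an illustration of the routing pattern rather than the actual code:

from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent

@tool
def search_verifactu_docs(query: str) -> str:
    """Search the official VeriFactu documents."""
    # Placeholder: query the VeriFactu slice of the vector store here.
    return "...retrieved VeriFactu passages..."

@tool
def search_gobl_code(query: str) -> str:
    """Search the gobl and gobl.verifactu source code."""
    # Placeholder: query the GitHub-code slice of the vector store here.
    return "...retrieved code snippets..."

# The agent inspects the question and decides which tool(s) to call.
agent = create_react_agent(ChatOpenAI(model="gpt-5"), tools=[search_verifactu_docs, search_gobl_code])
result = agent.invoke({"messages": [("user", "How do I register an invoice under VeriFactu?")]})
print(result["messages"][-1].content)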
Expected Performance (based on comprehensive evaluations):
- GPT-4.1: ~67.3% accuracy, $0.0460/answer, ~14s response
- GPT-5-mini low: ~65.4% accuracy, $0.0071/answer, ~18s response (best cost-performance)
- GPT-5-mini high: ~68.1% accuracy, $0.0535/answer, ~160s response
- GPT-5 low: ~73.9% accuracy, $0.0602/answer, ~40s response
- GPT-5 medium: ~74.0% accuracy, $0.151/answer, ~101s response
- GPT-5 high: ~75.0% accuracy, $0.267/answer, ~191s response (best overall accuracy)
Legacy Script: The original `eval_agent_rag_all.py` is still available but requires manual configuration.
Test Model Context Protocol implementations using the new configurable evaluation runner:
cd eval
# Basic usage with defaults (gpt-5, medium reasoning)
uv run python run_mcp_eval.py
# Model variations
uv run python run_mcp_eval.py --model gpt-4.1
uv run python run_mcp_eval.py --model gpt-5
uv run python run_mcp_eval.py --model gpt-5-mini
# Reasoning effort configurations (gpt-5/gpt-5-mini only)
uv run python run_mcp_eval.py --model gpt-5 --reasoning-effort minimal
uv run python run_mcp_eval.py --model gpt-5 --reasoning-effort low
uv run python run_mcp_eval.py --model gpt-5 --reasoning-effort medium
uv run python run_mcp_eval.py --model gpt-5 --reasoning-effort high
# Custom experiment naming
uv run python run_mcp_eval.py \
--model gpt-5-mini \
--reasoning-effort low \
--experiment-name "gpt5mini_low_mcp_test"
Configuration Options:
- Models: `gpt-4.1`, `gpt-5`, `gpt-5-mini`
- Reasoning Effort (gpt-5/gpt-5-mini only): `minimal`, `low`, `medium`, `high`
- Custom Naming: Use `--experiment-name` for better experiment tracking
- Threading: Use `--threads N` (default: 1, recommended for MCP stability)
- Dataset: Use `--dataset-name` to specify different evaluation datasets
View all options:
uv run python run_mcp_eval.py --help
Features:
- Integrated MCP Tools: Built-in access to documentation via MCP protocol
- Multi-Source Knowledge: Access to Invopop, GOBL docs, and code repositories
- Official Document Access: VeriFactu and other official documentation
- Configurable Models: Support for different GPT models and reasoning levels
Expected Performance (based on comprehensive evaluations):
- GPT-4.1: ~68.5% accuracy, $0.032/answer, ~14s response
- GPT-5 low: ~74.2% accuracy, $0.0523/answer, ~54s response (best balance)
- GPT-5 medium: ~74.4% accuracy, $0.133/answer, ~165s response
Legacy Script: The original `eval_mcp_new_prompts.py` is still available but requires manual configuration.
After running experiments, extract metrics:
cd eval
uv run python get_experiment_cost_duration.py --experiment_name your_experiment_name
This provides:
- Accuracy scores (overall and VeriFactu-specific)
- Cost analysis (per 1000 queries)
- Response time percentiles (P50, P90, P99)
- Success rates
Accuracy: Semantic correctness of answers compared to ground truth
VeriFactu Accuracy: Specific accuracy on Spanish tax compliance questions
Cost: OpenAI API cost per answer
Response Times: Latency percentiles (P50/P90/P99) in seconds
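If you log per-question latencies yourself, the P50/P90/P99 figures above can be reproduced with a simple index-based percentile like the sketch below (the evaluation scripts may interpolate slightly differently):

def percentile(values, p):
    """Return the p-th percentile (0-100) of a list of latencies in seconds."""
    ordered = sorted(values)
    # Index into the sorted values, without interpolation.
    idx = min(len(ordered) - 1, max(0, round(p / 100 * (len(ordered) - 1))))
    return ordered[idx]

latencies_s = [4.2, 5.1, 3.9, 12.4, 6.0, 7.3, 5.8, 4.9]  # example per-answer timings
for p in (50, 90, 99):
    print(f"P{p}: {percentile(latencies_s, p):.2f}s")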
Based on our comprehensive evaluation results, here are our recommendations for different use cases:
Best Overall Choice: GPT-5 Low + MCP
- Accuracy: 74.2% (excellent)
- Cost: $0.0523/answer (reasonable)
- Speed: 54s P50 (acceptable for complex queries)
- Why: Best balance of accuracy, cost, and speed
Budget-Conscious: GPT-5-mini Low + Multi-Agent
- Accuracy: 65.4% (good)
- Cost: $0.0071/answer (very low)
- Speed: 18s P50 (fast)
- Why: Excellent cost-performance ratio
Speed-Critical: GPT-4.1-mini
- Accuracy: 53.8% (basic)
- Cost: $0.00045/answer (extremely low)
- Speed: 4s P50 (very fast)
- Why: Fastest responses at minimal cost
Maximum Accuracy: GPT-5 High + Multi-Agent
- Accuracy: 75.0% (highest)
- VeriFactu Accuracy: 77.5% (highest)
- Cost: $0.267/answer (expensive)
- Speed: 191s P50 (slow)
- Why: Best possible accuracy for research purposes
Experimentation: GPT-5 Medium + RAG
- Accuracy: 73.0% (very good)
- Cost: $0.106/answer (moderate)
- Speed: 52s P50 (reasonable)
- Why: Good balance for testing and development
Use Case | Recommended Configuration | Accuracy | Cost ($/answer) | Speed (P50) |
---|---|---|---|---|
Customer Support | GPT-5 Low + MCP | 74.2% | $0.0523 | 54s |
Internal Tools | GPT-5-mini Medium + RAG | 68.0% | $0.0127 | 37s |
Documentation Search | GPT-5 Low + RAG | 71.9% | $0.0526 | 25s |
Research Analysis | GPT-5 High + Multi-Agent | 75.0% | $0.267 | 191s |
Quick Queries | GPT-4.1 | 59.4% | $0.0023 | 4s |
Batch Processing | GPT-5-mini Low + Multi-Agent | 65.4% | $0.0071 | 18s |
- System Prompts Matter: Adding a system prompt improves accuracy by roughly 7-12 percentage points across models
- GPT-4.1 + Web Search: Performs poorly (42.5% accuracy) - avoid this combination
- Tool Limiting: Restricting to 1 tool call cuts costs significantly but reduces accuracy (~7-8 points for web search, ~2-3 points for RAG)
- Mini Models: GPT-5-mini offers good performance at much lower cost than full GPT-5
- Architecture Impact: Multi-Agent ≈ MCP > RAG > Web Search > Basic LLM in terms of accuracy
- Create a new directory in `solutions/`
- Implement the required interface (see existing solutions as examples)
- Add a corresponding evaluation script in `eval/`
- Update this README with new results
Edit `eval/correctness.py` to:
- Add new evaluation criteria
- Adjust scoring algorithms
- Include domain-specific metrics
To add new document sources:
- Add loading scripts in `load/`
- Create corresponding vector store scripts in `vector_store/openai/`
- Update solution prompts to handle new document types
- Start with GPT-4.1 for baseline experiments
- Use GPT-5-mini for development and testing
- Monitor token usage with Opik tracking
- Use higher reasoning levels for complex queries
- Combine RAG with web search for comprehensive coverage
- Implement multi-agent approaches for specialized domains
- Use simpler models for time-critical applications
- Implement caching for repeated queries (see the sketch below)
- Optimize vector store chunk sizes
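For the caching suggestion in the list above, a minimal in-process cache keyed on the normalized question is often enough. The sketch below is illustrative and not part of the repository; `answer_fn` stands in for whichever solution entry point you are calling:

import hashlib

_answer_cache: dict[str, str] = {}

def cached_answer(question: str, answer_fn) -> str:
    """Return a cached answer for repeated (normalized) questions."""
    key = hashlib.sha256(question.strip().lower().encode()).hexdigest()
    if key not in _answer_cache:
        _answer_cache[key] = answer_fn(question)  # call into one of the solutions
    return _answer_cache[key]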
Vector Store Connection Errors
- Verify OpenAI API key is set correctly
- Check vector store ID in configuration files
MCP Server Issues
- Ensure Node.js is installed for MCP servers
- Verify MCP servers are running: `mint-mcp list`
Evaluation Dataset Issues
- Confirm Opik API key is configured
- Check dataset exists in Opik dashboard
Memory Issues
- Reduce batch sizes in evaluation scripts
- Use streaming for large document processing
- GOBL Documentation
- Invopop Documentation
- OpenAI Vector Store Guide
- Opik Evaluation Framework
- LangGraph Multi-Agent Guide
- Fork the repository
- Create a feature branch
- Run experiments and document results
- Submit a pull request with performance comparisons
Note: This experimental framework is designed for research and development purposes. Production deployments should consider additional factors like security, scalability, and compliance requirements.