PopBot Experiments: Comprehensive Evaluation Framework

This repository contains a complete experimental setup for evaluating different AI agent architectures for answering questions about Invopop, GOBL, and VeriFactu documentation. The experiments compare various approaches including basic LLMs, web search augmentation, RAG systems, multi-agent architectures, and MCP-based solutions.

📊 Experimental Results Summary

Based on our comprehensive evaluation across different AI architectures, here are the complete results:

Performance Overview (Accuracy vs Cost vs Speed)

| Model | Method | Accuracy | VeriFactu Acc. | Cost ($/answer) | P50 Time (s) | P90 Time (s) | P99 Time (s) |
|---|---|---|---|---|---|---|---|
| **Basic LLM (No System Prompts)** | | | | | | | |
| GPT-4.1 | No prompt | 0.505 | 0.554 | 0.00479 | 11.668 | 17.808 | 31.023 |
| GPT-5 | Minimal, no prompt | 0.494 | 0.574 | 0.00442 | 8.170 | 15.880 | 25.440 |
| GPT-5 | Low, no prompt | 0.543 | 0.608 | 0.01110 | 18.863 | 32.797 | 46.871 |
| **Basic LLM (With System Prompts)** | | | | | | | |
| GPT-4.1-mini | Base prompt | 0.538 | 0.568 | 0.00045 | 4.050 | 6.160 | 8.333 |
| GPT-4.1 | Base prompt | 0.594 | 0.606 | 0.00228 | 4.071 | 6.335 | 10.390 |
| GPT-5-mini | Minimal | 0.540 | 0.599 | 0.00072 | 6.154 | 9.557 | 12.907 |
| GPT-5-mini | Low | 0.506 | 0.557 | 0.00128 | 9.839 | 14.660 | 21.774 |
| GPT-5-mini | Medium | 0.522 | 0.547 | 0.00274 | 18.505 | 25.308 | 36.934 |
| GPT-5-mini | High | 0.565 | 0.612 | 0.00759 | 54.903 | 83.322 | 130.443 |
| GPT-5 | Minimal | 0.613 | 0.651 | 0.00347 | 5.593 | 8.552 | 13.877 |
| GPT-5 | Low | 0.617 | 0.653 | 0.01183 | 15.447 | 25.019 | 32.805 |
| GPT-5 | Medium | 0.626 | 0.650 | 0.02473 | 32.310 | 49.589 | 75.491 |
| GPT-5 | High | 0.654 | 0.653 | 0.02434 | 43.003 | 79.790 | 127.549 |
| **Web Search Augmented** | | | | | | | |
| GPT-4.1 | Web search | 0.425 | 0.446 | 0.03982 | 8.617 | 12.069 | 15.473 |
| GPT-5-mini | Low + web | 0.631 | 0.676 | 0.00887 | 19.138 | 27.833 | 41.627 |
| GPT-5-mini | Medium + web | 0.648 | 0.690 | 0.03984 | 48.344 | 84.924 | 137.963 |
| GPT-5-mini | High + web | 0.678 | 0.705 | 0.12381 | 150.450 | 231.338 | 292.249 |
| GPT-5 | Low + web | 0.713 | 0.743 | 0.14872 | 37.386 | 59.812 | 82.178 |
| GPT-5 | Medium + web | 0.716 | 0.746 | 0.29536 | 66.178 | 101.190 | 134.978 |
| GPT-5 | High + web | 0.722 | 0.751 | 0.49285 | 114.294 | 158.767 | 199.264 |
| GPT-5 | Low + web (1 tool) | 0.646 | 0.664 | 0.03940 | 24.727 | 35.143 | 48.019 |
| GPT-5 | High + web (1 tool) | 0.639 | 0.660 | 0.07069 | 62.029 | 93.400 | 134.649 |
| **RAG Systems** | | | | | | | |
| GPT-4.1 | RAG | 0.678 | 0.717 | 0.0376 | 11.79 | 17.94 | 28.82 |
| GPT-5-mini | Low + RAG | 0.653 | 0.681 | 0.0064 | 16.58 | 23.41 | 38.31 |
| GPT-5-mini | Medium + RAG | 0.680 | 0.722 | 0.0127 | 36.5 | 56.7 | 91.99 |
| GPT-5-mini | High + RAG | 0.692 | 0.733 | 0.0320 | 88.40 | 152.76 | 213.64 |
| GPT-5 | Low + RAG | 0.719 | 0.734 | 0.0526 | 25.33 | 38.88 | 53.46 |
| GPT-5 | Medium + RAG | 0.730 | 0.762 | 0.1061 | 52.12 | 78.73 | 116.30 |
| GPT-5 | High + RAG | 0.739 | 0.769 | 0.1696 | 96.11 | 147.19 | 219.07 |
| GPT-5 | Low + RAG (1 tool) | 0.703 | 0.738 | 0.0421 | 23.81 | 36.10 | 51.44 |
| GPT-5 | Medium + RAG (1 tool) | 0.701 | 0.740 | 0.0612 | 44.80 | 68.02 | 87.97 |
| **Multi-Agent RAG (Distributed Tools)** | | | | | | | |
| GPT-5-mini | Low + agents | 0.654 | 0.686 | 0.0071 | 18.44 | 26.17 | 37.16 |
| GPT-5-mini | High + agents | 0.681 | 0.721 | 0.0535 | 159.57 | 288.55 | 407.10 |
| GPT-4.1 | Multi-agent | 0.673 | 0.717 | 0.0460 | 14.45 | 28.18 | 47.20 |
| GPT-5 | Low + agents | 0.739 | 0.772 | 0.0602 | 40.12 | 71.62 | 110.23 |
| GPT-5 | Medium + agents | 0.740 | 0.757 | 0.1506 | 101.26 | 182.65 | 248.26 |
| GPT-5 | High + agents | 0.750 | 0.775 | 0.2673 | 191.05 | 347.49 | 478.04 |
| **MCP-Based Solutions** | | | | | | | |
| GPT-4.1 | MCP | 0.685 | 0.709 | 0.032 | 14.45 | 28.18 | 47.20 |
| GPT-5 | Low + MCP | 0.742 | 0.757 | 0.0523 | 54.05 | 101.12 | 138.23 |
| GPT-5 | Medium + MCP | 0.744 | 0.761 | 0.1332 | 165.26 | 275.75 | 379.76 |

Key Findings

  1. πŸ† Best Overall Performance: GPT-5 High + Multi-Agent (75.0% accuracy, 77.5% VeriFactu)
  2. πŸ’° Best Cost-Performance: GPT-5-mini Low + Multi-Agent (65.4% accuracy at $0.0071/answer)
  3. ⚑ Fastest Response: GPT-4.1-mini (4.050s P50, $0.00045/answer)
  4. 🎯 Best Balance: GPT-5 Low + MCP (74.2% accuracy, $0.0523/answer, 54s P50)
  5. πŸ“ˆ Best VeriFactu Performance: GPT-5 High + Multi-Agent (77.5% VeriFactu accuracy)

🎯 Prerequisites

Before running any experiments, ensure you have the following API keys:

  1. OpenAI API Key - For LLM inference and vector store
  2. Opik API Key and workspace - For experiment tracking and evaluation
  3. FireCrawl API Key - For web crawling documentation

Set these as environment variables:

export OPENAI_API_KEY="your_openai_api_key"
export OPIK_API_KEY="your_opik_api_key"
export OPIK_WORKSPACE="your_opik_workspace"
export FIRECRAWL_API_KEY="your_firecrawl_api_key"

πŸ“ Repository Structure

popbot-experiments/
├── README.md                          # This file
├── EXPERIMENT_GUIDE.md                # Detailed experiment instructions
├── QUICK_START.md                     # Quick start guide
├── data/
│   └── anonymized_benchmark_june_july.json  # Evaluation dataset
├── solutions/                         # Different AI architectures
│   ├── basic_llm/                     # Direct LLM calls
│   ├── web/                           # Web search augmented
│   ├── file_search/                   # RAG with single vector store
│   ├── agent_all_rag_langgraph/       # Multi-agent with distributed RAG
│   └── mcp/                           # MCP-based solutions
├── eval/                              # Evaluation scripts and metrics
│   ├── run_basic_llm_eval.py          # Configurable basic LLM evaluation
│   ├── run_web_search_eval.py         # Configurable web search evaluation
│   ├── run_file_search_eval.py        # Configurable RAG evaluation
│   ├── run_agent_rag_eval.py          # Configurable multi-agent evaluation
│   ├── run_mcp_eval.py                # Configurable MCP evaluation
│   └── correctness.py                 # Evaluation metrics
├── vector_store/                      # Vector database management
├── load/                              # Data loading and preprocessing
├── files/                             # Official documentation files
└── pyproject.toml                     # Python project configuration

🚀 Quick Start Guide

Step 1: Environment Setup

  1. Clone this repository

  2. Install dependencies:

uv sync
  3. Set up your API keys (see Prerequisites section above)

Step 2: Data Preparation

2.1 Evaluation Dataset

The evaluation dataset anonymized_benchmark_june_july.json is already included in the data/ folder. This contains customer questions from Slack channels, anonymized for privacy.
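If you want to inspect the benchmark before running anything, here is a minimal sketch (assuming the file is standard JSON; the field names are whatever the dataset defines, so the code prints them rather than assuming them):

import json

with open("data/anonymized_benchmark_june_july.json") as f:
    benchmark = json.load(f)

# Handle either a top-level list or a dict of records.
records = benchmark if isinstance(benchmark, list) else list(benchmark.values())
print(f"{len(records)} evaluation items")
print("Fields of the first item:", sorted(records[0].keys()))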

2.2 Document Collection

You need to gather three types of documents for the RAG systems:

A. Crawled Documentation (Invopop & GOBL docs)

cd load/crawler
uv run python crawl_docs.py docs.invopop.com
uv run python crawl_docs.py docs.gobl.org

You can also use the docs already included in load/crawler/docs directly; if you want up-to-date documentation, we recommend re-crawling with Firecrawl. The docs in load/crawler/docs are the ones used to obtain the reported results. Note that the MCP and web search solutions always work against live documentation, so results may diverge if you compare them against stale local docs.

B. GitHub Repository Code

Clone the repositories github.com/invopop/gobl and github.com/invopop/gobl.verifactu at the same level as this repository.

C. Official VeriFactu Documents

The URLs for the official documents are listed in files/verifactu/json/verifactu.json and would normally need to be downloaded manually. For convenience, they have already been downloaded and are included in files/verifactu.
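If you ever need to re-download them, here is a minimal sketch that fetches every URL found in that JSON file. The download folder and the recursive URL extraction are assumptions, since the exact JSON layout is not documented here:

import json
import os
import urllib.request

def iter_urls(node):
    # Recursively yield anything that looks like a URL, regardless of the JSON layout.
    if isinstance(node, str) and node.startswith("http"):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_urls(value)
    elif isinstance(node, list):
        for value in node:
            yield from iter_urls(value)

with open("files/verifactu/json/verifactu.json") as f:
    data = json.load(f)

out_dir = "files/verifactu/downloads"  # hypothetical target folder
os.makedirs(out_dir, exist_ok=True)
for url in iter_urls(data):
    filename = os.path.basename(url.split("?")[0]) or "document"
    urllib.request.urlretrieve(url, os.path.join(out_dir, filename))
    print("downloaded", url)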

Step 3: Vector Store Setup

The vector store setup has been streamlined with improved scripts that handle configuration automatically.

Option A: Automated Setup (Recommended)

Use the comprehensive setup script that handles everything:

cd vector_store/openai

# Complete setup with all document types
uv run python setup_vector_store.py

# Or with custom options
uv run python setup_vector_store.py --name "My Custom Vector Store" --skip-github

The setup script will:

  • ✅ Check all prerequisites automatically
  • 🚀 Create the vector store with proper configuration
  • 📚 Add Firecrawl documentation (GOBL + Invopop)
  • 💻 Add GitHub code repositories (gobl + gobl.verifactu)
  • 📄 Add official VeriFactu documents
  • 📊 Provide progress reporting and final statistics
  • 💾 Save configuration for future use
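For reference, these setup scripts presumably wrap the OpenAI vector store API along the following lines. This is a minimal sketch, assuming a recent openai Python SDK where vector stores live at client.vector_stores (older releases expose the same calls under client.beta.vector_stores); the store name and the file glob are illustrative, not the scripts' actual defaults:

from pathlib import Path
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1. Create an empty vector store.
store = client.vector_stores.create(name="popbot-docs")

# 2. Upload a batch of documents and wait for indexing to finish.
docs = sorted(Path("load/crawler/docs").rglob("*.md"))
client.vector_stores.file_batches.upload_and_poll(
    vector_store_id=store.id,
    files=[path.open("rb") for path in docs],
)

print("Set this as OPENAI_VECTOR_STORE_ID:", store.id)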

Option B: Manual Step-by-Step Setup

For more control, run individual scripts:

cd vector_store/openai

# 1. Create the vector store
uv run python create_vector_store.py

# 2. Add different document types (in any order)
uv run python add_firecrawl_docs.py    # Add crawled documentation
uv run python add_github_code.py       # Add GitHub code repositories  
uv run python add_official_docs.py     # Add official PDF documents

Troubleshooting

If you encounter issues:

# Check what's missing
uv run python setup_vector_store.py --dry-run

# Skip problematic document types
uv run python setup_vector_store.py --skip-github 

# Force recreation of vector store
uv run python setup_vector_store.py --force

The vector store configuration is saved in vector_store_config.json for future reference.

Environment Variable Setup

After setting up your vector store, you need to configure the OPENAI_VECTOR_STORE_ID environment variable with the ID from your configuration:

# Extract the vector store ID from the config file
export OPENAI_VECTOR_STORE_ID=$(python -c "import json; print(json.load(open('vector_store_config.json'))['vector_store_id'])")

# Or set it manually by copying the ID from vector_store_config.json
export OPENAI_VECTOR_STORE_ID="vs_your_vector_store_id_here"

# Add to your shell profile for persistence (optional)
echo "export OPENAI_VECTOR_STORE_ID=$OPENAI_VECTOR_STORE_ID" >> ~/.bashrc

This environment variable is required for all solutions to access the vector store. Alternatively, you can pass the vector_store_id directly as a constructor argument when initializing the services.

Vector Store Management

Use the management utility to inspect and manage your vector store:

cd vector_store/openai

# Show vector store information and statistics
uv run python manage_vector_store.py info
uv run python manage_vector_store.py stats

# List all files in the vector store
uv run python manage_vector_store.py files

# List all vector stores in your account
uv run python manage_vector_store.py list

# Delete the vector store (with confirmation)
uv run python manage_vector_store.py delete

Step 4: MCP Server Setup (for MCP experiments)

The MCP solution uses Model Context Protocol to access documentation through specialized MCP servers.

MCP Installation

Install the required MCP servers using mint-mcp:

# Install the Invopop and GOBL MCP servers
npx mint-mcp add invopop
npx mint-mcp add gobl

Note: You may see error messages like "Error installing MCP for invopop: Cannot read properties of undefined (reading 'name')" during installation. This is a known issue with mint-mcp, but the servers are actually installed correctly in ~/.mcp/.

Verify Installation

Check that the MCP servers are installed:

ls -la ~/.mcp/
# Should show: invopop/ and gobl/ directories

Install Dependencies

Install Node.js dependencies for both MCP servers:

cd ~/.mcp/invopop && npm install
cd ~/.mcp/gobl && npm install

Fix Missing Configuration Files

The MCP servers require tools.json files that might be missing after installation. Create them:

# Create empty tools.json files for both servers
echo "[]" > ~/.mcp/invopop/src/tools.json
echo "[]" > ~/.mcp/gobl/src/tools.json

Test MCP Setup

Test that the MCP setup works by running the main application:

# Set a temporary API key for testing (replace with your real key)
export OPENAI_API_KEY="your_openai_api_key_here"

# Run the MCP application
uv run python -m solutions.mcp.main --config solutions/mcp/config.yaml --verbose

Success indicators:

  • βœ… "MCP Server running on stdio" messages appear
  • βœ… Welcome message displays
  • βœ… πŸ‘€ You: prompt appears (this confirms MCP setup is working)

MCP Configuration

The MCP configuration is handled via solutions/mcp/config.yaml:

mcp:
  servers:
    invopop:
      command: "node"
      args: ["~/.mcp/invopop/src/index.js"]
      transport: "stdio"
    gobl:
      command: "node" 
      args: ["~/.mcp/gobl/src/index.js"]
      transport: "stdio"

Step 5: Upload Evaluation Dataset

Upload the dataset to Opik for tracking:

cd eval
uv run python create_dataset.py

🧪 Running Experiments

Basic LLM Experiments

Test direct LLM performance without any augmentation using the new configurable evaluation runner:

cd eval

# Basic usage with defaults (gpt-5-mini, medium reasoning, with prompt)
uv run python run_basic_llm_eval.py

# Model variations
uv run python run_basic_llm_eval.py --model gpt-4.1
uv run python run_basic_llm_eval.py --model gpt-5
uv run python run_basic_llm_eval.py --model gpt-5-mini

# Reasoning effort configurations (gpt-5/gpt-5-mini only)
uv run python run_basic_llm_eval.py --model gpt-5 --reasoning-effort minimal
uv run python run_basic_llm_eval.py --model gpt-5 --reasoning-effort low
uv run python run_basic_llm_eval.py --model gpt-5 --reasoning-effort medium
uv run python run_basic_llm_eval.py --model gpt-5 --reasoning-effort high

# Compare with/without system prompt
uv run python run_basic_llm_eval.py --model gpt-5 --reasoning-effort low
uv run python run_basic_llm_eval.py --model gpt-5 --reasoning-effort low --no-prompt

# Custom experiment naming and advanced options
uv run python run_basic_llm_eval.py \
    --model gpt-5-mini \
    --reasoning-effort high \
    --no-prompt \
    --experiment-name "gpt5mini_high_no_prompt_test" \
    --threads 8

Configuration Options:

  • Models: gpt-4.1, gpt-5, gpt-5-mini
  • Reasoning Effort (gpt-5/gpt-5-mini only): minimal, low, medium, high
  • Prompt Control: Use --no-prompt to test raw model performance without system prompt
  • Custom Naming: Use --experiment-name for better experiment tracking
  • Threading: Use --threads N to adjust parallel evaluation threads
  • Dataset: Use --dataset-name to specify different evaluation datasets

View all options:

uv run python run_basic_llm_eval.py --help
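Under the hood these flags map onto a single model call. A minimal sketch of what such a call looks like with the OpenAI Responses API (the prompt text and question are placeholders; the evaluation scripts may structure the call differently):

from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-5",
    reasoning={"effort": "low"},  # minimal / low / medium / high (gpt-5 family only)
    instructions="You are PopBot, an assistant for Invopop, GOBL and VeriFactu questions.",  # omit to mimic --no-prompt
    input="How do I sign a GOBL invoice?",
)
print(response.output_text)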

Expected Performance (based on comprehensive evaluations):

  • GPT-4.1-mini: ~53.8% accuracy, $0.00045/answer, ~4s response
  • GPT-4.1: ~59.4% accuracy, $0.0023/answer, ~4s response
  • GPT-5-mini minimal: ~54.0% accuracy, $0.00072/answer, ~6s response
  • GPT-5 minimal: ~61.3% accuracy, $0.0035/answer, ~6s response
  • GPT-5 low: ~61.7% accuracy, $0.0118/answer, ~15s response
  • GPT-5 medium: ~62.6% accuracy, $0.0247/answer, ~32s response
  • GPT-5 high: ~65.4% accuracy, $0.0243/answer, ~43s response

📖 Detailed Guide: See eval/BASIC_LLM_EVALUATION_GUIDE.md for comprehensive usage examples and troubleshooting.

Legacy Script: The original eval_basic_llm.py is still available but requires manual configuration.

Web Search Experiments

Test web search augmented models using the new configurable evaluation runner:

cd eval

# Basic usage with defaults (gpt-5-mini, medium reasoning, multiple tools)
uv run python run_web_search_eval.py

# Model variations
uv run python run_web_search_eval.py --model gpt-4.1
uv run python run_web_search_eval.py --model gpt-5
uv run python run_web_search_eval.py --model gpt-5-mini

# Reasoning effort configurations (gpt-5/gpt-5-mini only)
uv run python run_web_search_eval.py --model gpt-5 --reasoning-effort minimal
uv run python run_web_search_eval.py --model gpt-5 --reasoning-effort low
uv run python run_web_search_eval.py --model gpt-5 --reasoning-effort medium
uv run python run_web_search_eval.py --model gpt-5 --reasoning-effort high

# Tool usage comparison
uv run python run_web_search_eval.py --model gpt-5 --reasoning-effort low
uv run python run_web_search_eval.py --model gpt-5 --reasoning-effort low --limit-tools

# Custom experiment naming and advanced options
uv run python run_web_search_eval.py \
    --model gpt-5-mini \
    --reasoning-effort high \
    --limit-tools \
    --experiment-name "gpt5mini_high_single_tool_test" \
    --threads 8

Configuration Options:

  • Models: gpt-4.1, gpt-5, gpt-5-mini
  • Reasoning Effort (gpt-5/gpt-5-mini only): minimal, low, medium, high
  • Tool Limit: Use --limit-tools to restrict to one web search per query
  • Custom Naming: Use --experiment-name for better experiment tracking
  • Threading: Use --threads N to adjust parallel evaluation threads
  • Dataset: Use --dataset-name to specify different evaluation datasets

View all options:

uv run python run_web_search_eval.py --help
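As an illustration of what web search augmentation involves, here is a minimal sketch using the OpenAI Responses API with its hosted web search tool (the tool type is named web_search in recent API versions and web_search_preview in older ones; the question is a placeholder and the actual solution code in solutions/web/ may differ):

from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-5",
    reasoning={"effort": "low"},
    tools=[{"type": "web_search"}],  # lets the model issue live web searches
    input="Which invoice fields does VeriFactu require?",
)
print(response.output_text)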

Expected Performance (based on comprehensive evaluations):

  • GPT-4.1: ~42.5% accuracy, $0.0398/answer, ~9s response (⚠️ Poor performance)
  • GPT-5-mini low: ~63.1% accuracy, $0.0089/answer, ~19s response
  • GPT-5-mini medium: ~64.8% accuracy, $0.0398/answer, ~48s response
  • GPT-5-mini high: ~67.8% accuracy, $0.124/answer, ~150s response
  • GPT-5 low: ~71.3% accuracy, $0.149/answer, ~37s response
  • GPT-5 medium: ~71.6% accuracy, $0.295/answer, ~66s response
  • GPT-5 high: ~72.2% accuracy, $0.493/answer, ~114s response
  • GPT-5 low (1 tool): ~64.6% accuracy, $0.0394/answer, ~25s response

Legacy Script: The original eval_web_search.py is still available but requires manual configuration.

RAG (File Search) Experiments

Test retrieval augmented generation using the new configurable evaluation runner:

cd eval

# Basic usage with defaults (gpt-4.1, medium reasoning, multiple file searches)
uv run python run_file_search_eval.py

# Model variations
uv run python run_file_search_eval.py --model gpt-4.1
uv run python run_file_search_eval.py --model gpt-5
uv run python run_file_search_eval.py --model gpt-5-mini

# Reasoning effort configurations (gpt-5/gpt-5-mini only)
uv run python run_file_search_eval.py --model gpt-5 --reasoning-effort minimal
uv run python run_file_search_eval.py --model gpt-5 --reasoning-effort low
uv run python run_file_search_eval.py --model gpt-5 --reasoning-effort medium
uv run python run_file_search_eval.py --model gpt-5 --reasoning-effort high

# File search limit comparison
uv run python run_file_search_eval.py --model gpt-5 --reasoning-effort low
uv run python run_file_search_eval.py --model gpt-5 --reasoning-effort low --limit-tools

# Custom experiment naming and advanced options
uv run python run_file_search_eval.py \
    --model gpt-5-mini \
    --reasoning-effort high \
    --limit-tools \
    --experiment-name "gpt5mini_high_single_search_test" \
    --threads 8

Configuration Options:

  • Models: gpt-4.1, gpt-5, gpt-5-mini
  • Reasoning Effort (gpt-5/gpt-5-mini only): minimal, low, medium, high
  • Tool Limit: Use --limit-tools to restrict to one file search per query
  • Custom Naming: Use --experiment-name for better experiment tracking
  • Threading: Use --threads N to adjust parallel evaluation threads
  • Dataset: Use --dataset-name to specify different evaluation datasets

View all options:

uv run python run_file_search_eval.py --help
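For orientation, a minimal sketch of a file search (RAG) call against the vector store built earlier, using the OpenAI Responses API (the question is a placeholder, and the actual solution code in solutions/file_search/ may differ):

import os
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-5",
    reasoning={"effort": "low"},
    tools=[{
        "type": "file_search",
        "vector_store_ids": [os.environ["OPENAI_VECTOR_STORE_ID"]],  # from vector_store_config.json
    }],
    input="Which fields are mandatory on a Spanish GOBL invoice?",
)
print(response.output_text)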

Expected Performance (based on comprehensive evaluations):

  • GPT-4.1: ~67.8% accuracy, $0.0376/answer, ~12s response
  • GPT-5-mini low: ~65.3% accuracy, $0.0064/answer, ~17s response
  • GPT-5-mini medium: ~68.0% accuracy, $0.0127/answer, ~37s response
  • GPT-5-mini high: ~69.2% accuracy, $0.0320/answer, ~88s response
  • GPT-5 low: ~71.9% accuracy, $0.0526/answer, ~25s response
  • GPT-5 medium: ~73.0% accuracy, $0.106/answer, ~52s response
  • GPT-5 high: ~73.9% accuracy, $0.170/answer, ~96s response
  • GPT-5 low (1 tool): ~70.3% accuracy, $0.0421/answer, ~24s response

Legacy Script: The original eval_file_search.py is still available but requires manual configuration.

Multi-Agent RAG Experiments

Test distributed RAG with specialized tools using the new configurable evaluation runner:

cd eval

# Basic usage with defaults (gpt-5, high reasoning)
uv run python run_agent_rag_eval.py

# Model variations
uv run python run_agent_rag_eval.py --model gpt-4.1
uv run python run_agent_rag_eval.py --model gpt-5
uv run python run_agent_rag_eval.py --model gpt-5-mini

# Reasoning effort configurations (gpt-5/gpt-5-mini only)
uv run python run_agent_rag_eval.py --model gpt-5 --reasoning-effort low
uv run python run_agent_rag_eval.py --model gpt-5 --reasoning-effort medium
uv run python run_agent_rag_eval.py --model gpt-5 --reasoning-effort high

# Custom experiment naming and advanced options
uv run python run_agent_rag_eval.py \
    --model gpt-5-mini \
    --reasoning-effort medium \
    --experiment-name "gpt5mini_medium_multiagent_test" \
    --threads 16

Configuration Options:

  • Models: gpt-4.1, gpt-5, gpt-5-mini
  • Reasoning Effort (gpt-5/gpt-5-mini only): minimal, low, medium, high
  • Custom Naming: Use --experiment-name for better experiment tracking
  • Threading: Use --threads N to adjust parallel evaluation threads
  • Dataset: Use --dataset-name to specify different evaluation datasets

View all options:

uv run python run_agent_rag_eval.py --help

Features:

  • Specialized Tools: Separate tools for different document types (VeriFactu, Invopop, GOBL docs/code)
  • Intelligent Routing: Agent automatically chooses appropriate tools based on question context
  • LangGraph Orchestration: Advanced workflow management with memory and checkpointing
  • Multi-Source RAG: Can access and combine information from multiple knowledge sources
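To make the routing idea concrete, here is a hypothetical sketch of the pattern: a LangGraph ReAct agent choosing between per-source retrieval tools, with in-memory checkpointing. It is not the actual implementation in solutions/agent_all_rag_langgraph/, and the tool bodies are stubs:

from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langgraph.checkpoint.memory import MemorySaver
from langgraph.prebuilt import create_react_agent

@tool
def search_verifactu_docs(query: str) -> str:
    """Search the official VeriFactu documents."""
    return "...retrieved VeriFactu passages..."  # stub: a real tool would query its own vector store

@tool
def search_gobl_code(query: str) -> str:
    """Search the GOBL source code repositories."""
    return "...retrieved GOBL code snippets..."  # stub

agent = create_react_agent(
    ChatOpenAI(model="gpt-4.1"),
    tools=[search_verifactu_docs, search_gobl_code],
    checkpointer=MemorySaver(),  # memory / checkpointing across turns
)

result = agent.invoke(
    {"messages": [("user", "Does VeriFactu require a QR code on invoices?")]},
    config={"configurable": {"thread_id": "demo"}},
)
print(result["messages"][-1].content)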

Expected Performance (based on comprehensive evaluations):

  • GPT-4.1: ~67.3% accuracy, $0.0460/answer, ~14s response
  • GPT-5-mini low: ~65.4% accuracy, $0.0071/answer, ~18s response (🏆 Best cost-performance)
  • GPT-5-mini high: ~68.1% accuracy, $0.0535/answer, ~160s response
  • GPT-5 low: ~73.9% accuracy, $0.0602/answer, ~40s response
  • GPT-5 medium: ~74.0% accuracy, $0.151/answer, ~101s response
  • GPT-5 high: ~75.0% accuracy, $0.267/answer, ~191s response (🏆 Best overall accuracy)

Legacy Script: The original eval_agent_rag_all.py is still available but requires manual configuration.

MCP-Based Experiments

Test Model Context Protocol implementations using the new configurable evaluation runner:

cd eval

# Basic usage with defaults (gpt-5, medium reasoning)
uv run python run_mcp_eval.py

# Model variations
uv run python run_mcp_eval.py --model gpt-4.1
uv run python run_mcp_eval.py --model gpt-5
uv run python run_mcp_eval.py --model gpt-5-mini

# Reasoning effort configurations (gpt-5/gpt-5-mini only)
uv run python run_mcp_eval.py --model gpt-5 --reasoning-effort minimal
uv run python run_mcp_eval.py --model gpt-5 --reasoning-effort low
uv run python run_mcp_eval.py --model gpt-5 --reasoning-effort medium
uv run python run_mcp_eval.py --model gpt-5 --reasoning-effort high

# Custom experiment naming
uv run python run_mcp_eval.py \
    --model gpt-5-mini \
    --reasoning-effort low \
    --experiment-name "gpt5mini_low_mcp_test"

Configuration Options:

  • Models: gpt-4.1, gpt-5, gpt-5-mini
  • Reasoning Effort (gpt-5/gpt-5-mini only): minimal, low, medium, high
  • Custom Naming: Use --experiment-name for better experiment tracking
  • Threading: Use --threads N (default: 1, recommended for MCP stability)
  • Dataset: Use --dataset-name to specify different evaluation datasets

View all options:

uv run python run_mcp_eval.py --help

Features:

  • Integrated MCP Tools: Built-in access to documentation via MCP protocol
  • Multi-Source Knowledge: Access to Invopop, GOBL docs, and code repositories
  • Official Document Access: VeriFactu and other official documentation
  • Configurable Models: Support for different GPT models and reasoning levels

Expected Performance (based on comprehensive evaluations):

  • GPT-4.1: ~68.5% accuracy, $0.032/answer, ~14s response
  • GPT-5 low: ~74.2% accuracy, $0.0523/answer, ~54s response (🎯 Best balance)
  • GPT-5 medium: ~74.4% accuracy, $0.133/answer, ~165s response

Legacy Script: The original eval_mcp_new_prompts.py is still available but requires manual configuration.

📈 Analyzing Results

Getting Experiment Metrics

After running experiments, extract metrics:

cd eval
uv run python get_experiment_cost_duration.py --experiment_name your_experiment_name

This provides:

  • Accuracy scores (overall and VeriFactu-specific)
  • Cost analysis (per 1000 queries)
  • Response time percentiles (P50, P90, P99)
  • Success rates

Understanding the Metrics

  • Accuracy: Semantic correctness of answers compared to ground truth
  • VeriFactu Accuracy: Accuracy on the subset of Spanish tax compliance (VeriFactu) questions
  • Cost: OpenAI API costs (reported per answer in the summary table, per 1000 queries by the metrics script)
  • Response Times: Latency percentiles (P50, P90, P99)
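If you prefer to aggregate raw results yourself, here is a minimal sketch; the record fields below are hypothetical and not the actual output schema of the evaluation scripts:

# Hypothetical per-query records; adapt the field names to your export.
results = [
    {"correct": 1.0, "cost_usd": 0.0051, "duration_s": 5.2},
    {"correct": 0.0, "cost_usd": 0.0047, "duration_s": 7.9},
    {"correct": 1.0, "cost_usd": 0.0062, "duration_s": 11.3},
]

def percentile(values, p):
    # Nearest-rank percentile over the sorted sample.
    ordered = sorted(values)
    index = min(len(ordered) - 1, int(round(p / 100 * (len(ordered) - 1))))
    return ordered[index]

durations = [r["duration_s"] for r in results]
print("accuracy:", sum(r["correct"] for r in results) / len(results))
print("cost per answer ($):", sum(r["cost_usd"] for r in results) / len(results))
for p in (50, 90, 99):
    print(f"P{p} time (s):", percentile(durations, p))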

🎯 Model Recommendations

Based on our comprehensive evaluation results, here are our recommendations for different use cases:

For Production Applications

πŸ† Best Overall Choice: GPT-5 Low + MCP

  • Accuracy: 74.2% (excellent)
  • Cost: $0.0523/answer (reasonable)
  • Speed: 54s P50 (acceptable for complex queries)
  • Why: Best balance of accuracy, cost, and speed

💰 Budget-Conscious: GPT-5-mini Low + Multi-Agent

  • Accuracy: 65.4% (good)
  • Cost: $0.0071/answer (very low)
  • Speed: 18s P50 (fast)
  • Why: Excellent cost-performance ratio

⚡ Speed-Critical: GPT-4.1-mini

  • Accuracy: 53.8% (basic)
  • Cost: $0.00045/answer (extremely low)
  • Speed: 4s P50 (very fast)
  • Why: Fastest responses at minimal cost

For Research & Development

🔬 Maximum Accuracy: GPT-5 High + Multi-Agent

  • Accuracy: 75.0% (highest)
  • VeriFactu Accuracy: 77.5% (highest)
  • Cost: $0.267/answer (expensive)
  • Speed: 191s P50 (slow)
  • Why: Best possible accuracy for research purposes

🧪 Experimentation: GPT-5 Medium + RAG

  • Accuracy: 73.0% (very good)
  • Cost: $0.106/answer (moderate)
  • Speed: 52s P50 (reasonable)
  • Why: Good balance for testing and development

Architecture-Specific Recommendations

| Use Case | Recommended Configuration | Accuracy | Cost ($/answer) | P50 Speed |
|---|---|---|---|---|
| Customer Support | GPT-5 Low + MCP | 74.2% | $0.0523 | 54s |
| Internal Tools | GPT-5-mini Medium + RAG | 68.0% | $0.0127 | 37s |
| Documentation Search | GPT-5 Low + RAG | 71.9% | $0.0526 | 25s |
| Research Analysis | GPT-5 High + Multi-Agent | 75.0% | $0.267 | 191s |
| Quick Queries | GPT-4.1 | 59.4% | $0.0023 | 4s |
| Batch Processing | GPT-5-mini Low + Multi-Agent | 65.4% | $0.0071 | 18s |

Key Insights

  1. System Prompts Matter: Adding prompts improves accuracy by ~9-12% across all models
  2. GPT-4.1 + Web Search: Performs poorly (42.5% accuracy) - avoid this combination
  3. Tool Limiting: Restricting to 1 tool call reduces accuracy by ~7-8% but cuts costs significantly
  4. Mini Models: GPT-5-mini offers good performance at much lower cost than full GPT-5
  5. Architecture Impact: Multi-Agent ≈ MCP > RAG > Web Search > Basic LLM in terms of accuracy

🔧 Customization

Adding New Solutions

  1. Create a new directory in solutions/
  2. Implement the required interface (see existing solutions as examples)
  3. Add a corresponding evaluation script in eval/
  4. Update this README with new results

Modifying Evaluation Metrics

Edit eval/correctness.py to:

  • Add new evaluation criteria
  • Adjust scoring algorithms
  • Include domain-specific metrics
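As a starting point, here is a hypothetical custom metric in the style of Opik's documented metric interface (the base class and score signature follow Opik's public pattern; the actual classes in eval/correctness.py may look different):

from opik.evaluation.metrics import base_metric, score_result

class CitesDocumentation(base_metric.BaseMetric):
    # Toy criterion: reward answers that point the user at the official docs.
    def __init__(self, name: str = "cites_documentation"):
        self.name = name

    def score(self, output: str, **ignored_kwargs) -> score_result.ScoreResult:
        cited = "docs.invopop.com" in output or "docs.gobl.org" in output
        return score_result.ScoreResult(
            value=1.0 if cited else 0.0,
            name=self.name,
            reason="Answer references official documentation." if cited else "No documentation reference found.",
        )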

Extending Document Types

To add new document sources:

  1. Add loading scripts in load/
  2. Create corresponding vector store scripts in vector_store/openai/
  3. Update solution prompts to handle new document types

🎯 Best Practices

For Cost Optimization

  • Start with GPT-4.1 for baseline experiments
  • Use GPT-5-mini for development and testing
  • Monitor token usage with Opik tracking

For Accuracy Optimization

  • Use higher reasoning levels for complex queries
  • Combine RAG with web search for comprehensive coverage
  • Implement multi-agent approaches for specialized domains

For Speed Optimization

  • Use simpler models for time-critical applications
  • Implement caching for repeated queries
  • Optimize vector store chunk sizes
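Caching can be as simple as memoising the answer function for exact-repeat questions. A minimal sketch; answer_fn stands for whichever solution you call and is not a function defined in this repo:

from functools import lru_cache
from typing import Callable

def with_cache(answer_fn: Callable[[str], str], maxsize: int = 1024) -> Callable[[str], str]:
    # Only helps when the exact same question string is asked again;
    # semantically similar questions would need an embedding-based cache instead.
    return lru_cache(maxsize=maxsize)(answer_fn)

# usage:
# cached_answer = with_cache(my_solution.answer)
# cached_answer("What is VeriFactu?")   # first call hits the model
# cached_answer("What is VeriFactu?")   # second call is served from memory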

πŸ› Troubleshooting

Common Issues

Vector Store Connection Errors

  • Verify OpenAI API key is set correctly
  • Check vector store ID in configuration files

MCP Server Issues

  • Ensure Node.js is installed for MCP servers
  • Verify MCP servers are running: mint-mcp list

Evaluation Dataset Issues

  • Confirm Opik API key is configured
  • Check dataset exists in Opik dashboard

Memory Issues

  • Reduce batch sizes in evaluation scripts
  • Use streaming for large document processing

📚 Additional Resources

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Run experiments and document results
  4. Submit a pull request with performance comparisons

Note: This experimental framework is designed for research and development purposes. Production deployments should consider additional factors like security, scalability, and compliance requirements.
