This repository contains a complete experimental setup for evaluating different AI agent architectures for answering questions about Invopop, GOBL, and VeriFactu documentation. The experiments compare various approaches including basic LLMs, web search augmentation, RAG systems, multi-agent architectures, and MCP-based solutions.
Based on our comprehensive evaluation across different AI architectures, here are the complete results:
Model | Method | Accuracy | VeriFactu Acc. | Cost ($/answer) | P50 Time (s) | P90 Time (s) | P99 Time (s) |
---|---|---|---|---|---|---|---|
Basic LLM (No System Prompts) | |||||||
GPT-4.1 | No prompt | 0.505 | 0.554 | 0.00479 | 11.668 | 17.808 | 31.023 |
GPT-5 | Minimal no prompt | 0.494 | 0.574 | 0.00442 | 8.170 | 15.880 | 25.440 |
GPT-5 | Low no prompt | 0.543 | 0.608 | 0.01110 | 18.863 | 32.797 | 46.871 |
Basic LLM (With System Prompts) | |||||||
GPT-4.1-mini | Base prompt | 0.538 | 0.568 | 0.00045 | 4.050 | 6.160 | 8.333 |
GPT-4.1 | Base prompt | 0.594 | 0.606 | 0.00228 | 4.071 | 6.335 | 10.390 |
GPT-5-mini | Minimal | 0.540 | 0.599 | 0.00072 | 6.154 | 9.557 | 12.907 |
GPT-5-mini | Low | 0.506 | 0.557 | 0.00128 | 9.839 | 14.660 | 21.774 |
GPT-5-mini | Medium | 0.522 | 0.547 | 0.00274 | 18.505 | 25.308 | 36.934 |
GPT-5-mini | High | 0.565 | 0.612 | 0.00759 | 54.903 | 83.322 | 130.443 |
GPT-5 | Minimal | 0.613 | 0.651 | 0.00347 | 5.593 | 8.552 | 13.877 |
GPT-5 | Low | 0.617 | 0.653 | 0.01183 | 15.447 | 25.019 | 32.805 |
GPT-5 | Medium | 0.626 | 0.650 | 0.02473 | 32.310 | 49.589 | 75.491 |
GPT-5 | High | 0.654 | 0.653 | 0.02434 | 43.003 | 79.790 | 127.549 |
Web Search Augmented | |||||||
GPT-4.1 | Web search | 0.425 | 0.446 | 0.03982 | 8.617 | 12.069 | 15.473 |
GPT-5-mini | Low + web | 0.631 | 0.676 | 0.00887 | 19.138 | 27.833 | 41.627 |
GPT-5-mini | Medium + web | 0.648 | 0.690 | 0.03984 | 48.344 | 84.924 | 137.963 |
GPT-5-mini | High + web | 0.678 | 0.705 | 0.12381 | 150.450 | 231.338 | 292.249 |
GPT-5 | Low + web | 0.713 | 0.743 | 0.14872 | 37.386 | 59.812 | 82.178 |
GPT-5 | Medium + web | 0.716 | 0.746 | 0.29536 | 66.178 | 101.190 | 134.978 |
GPT-5 | High + web | 0.722 | 0.751 | 0.49285 | 114.294 | 158.767 | 199.264 |
GPT-5 | Low + web (1 tool) | 0.646 | 0.664 | 0.03940 | 24.727 | 35.143 | 48.019 |
GPT-5 | High + web (1 tool) | 0.639 | 0.660 | 0.07069 | 62.029 | 93.400 | 134.649 |
RAG Systems | |||||||
GPT-4.1 | RAG | 0.678 | 0.717 | 0.0376 | 11.79 | 17.94 | 28.82 |
GPT-5-mini | Low + RAG | 0.653 | 0.681 | 0.0064 | 16.58 | 23.41 | 38.31 |
GPT-5-mini | Medium + RAG | 0.680 | 0.722 | 0.0127 | 36.5 | 56.7 | 91.99 |
GPT-5-mini | High + RAG | 0.692 | 0.733 | 0.0320 | 88.40 | 152.76 | 213.64 |
GPT-5 | Low + RAG | 0.719 | 0.734 | 0.0526 | 25.33 | 38.88 | 53.46 |
GPT-5 | Medium + RAG | 0.730 | 0.762 | 0.1061 | 52.12 | 78.73 | 116.30 |
GPT-5 | High + RAG | 0.739 | 0.769 | 0.1696 | 96.11 | 147.19 | 219.07 |
GPT-5 | Low + RAG (1 tool) | 0.703 | 0.738 | 0.0421 | 23.81 | 36.10 | 51.44 |
GPT-5 | Medium + RAG (1 tool) | 0.701 | 0.740 | 0.0612 | 44.80 | 68.02 | 87.97 |
Multi-Agent RAG (Distributed Tools) | |||||||
GPT-5-mini | Low + agents | 0.654 | 0.686 | 0.0071 | 18.44 | 26.17 | 37.16 |
GPT-5-mini | High + agents | 0.681 | 0.721 | 0.0535 | 159.57 | 288.55 | 407.10 |
GPT-4.1 | Multi-agent | 0.673 | 0.717 | 0.0460 | 14.45 | 28.18 | 47.20 |
GPT-5 | Low + agents | 0.739 | 0.772 | 0.0602 | 40.12 | 71.62 | 110.23 |
GPT-5 | Medium + agents | 0.740 | 0.757 | 0.1506 | 101.26 | 182.65 | 248.26 |
GPT-5 | High + agents | 0.750 | 0.775 | 0.2673 | 191.05 | 347.49 | 478.04 |
MCP-Based Solutions | |||||||
GPT-4.1 | MCP | 0.685 | 0.709 | 0.032 | 14.45 | 28.18 | 47.20 |
GPT-5 | Low + MCP | 0.742 | 0.757 | 0.0523 | 54.05 | 101.12 | 138.23 |
GPT-5 | Medium + MCP | 0.744 | 0.761 | 0.1332 | 165.26 | 275.75 | 379.76 |
- Best Overall Performance: GPT-5 High + Multi-Agent (75.0% accuracy, 77.5% VeriFactu)
- Best Cost-Performance: GPT-5-mini Low + Multi-Agent (65.4% accuracy at $0.0071/answer)
- Fastest Response: GPT-4.1-mini (4.050s P50, $0.00045/answer)
- Best Balance: GPT-5 Low + MCP (74.2% accuracy, $0.0523/answer, 54s P50)
- Best VeriFactu Performance: GPT-5 High + Multi-Agent (77.5% VeriFactu accuracy)
Before running any experiments, ensure you have the following API keys:
- OpenAI API Key - For LLM inference and vector store
- Opik API Key and workspace - For experiment tracking and evaluation
- FireCrawl API Key - For web crawling documentation
Set these as environment variables:
export OPENAI_API_KEY="your_openai_api_key"
export OPIK_API_KEY="your_opik_api_key"
export OPIK_WORKSPACE="your_opik_workspace"
export FIRECRAWL_API_KEY="your_firecrawl_api_key"
popbot-experiments/
├── README.md                              # This file
├── EXPERIMENT_GUIDE.md                    # Detailed experiment instructions
├── QUICK_START.md                         # Quick start guide
├── data/
│   └── anonymized_benchmark_june_july.json  # Evaluation dataset
├── solutions/                             # Different AI architectures
│   ├── basic_llm/                         # Direct LLM calls
│   ├── web/                               # Web search augmented
│   ├── file_search/                       # RAG with single vector store
│   ├── agent_all_rag_langgraph/           # Multi-agent with distributed RAG
│   └── mcp/                               # MCP-based solutions
├── eval/                                  # Evaluation scripts and metrics
│   ├── run_basic_llm_eval.py              # Configurable basic LLM evaluation
│   ├── run_web_search_eval.py             # Configurable web search evaluation
│   ├── run_file_search_eval.py            # Configurable RAG evaluation
│   ├── run_agent_rag_eval.py              # Configurable multi-agent evaluation
│   ├── run_mcp_eval.py                    # Configurable MCP evaluation
│   └── correctness.py                     # Evaluation metrics
├── vector_store/                          # Vector database management
├── load/                                  # Data loading and preprocessing
├── files/                                 # Official documentation files
└── pyproject.toml                         # Python project configuration
- Clone this repository
- Install dependencies:
uv sync
- Set up your API keys (see Prerequisites section above)
The evaluation dataset `anonymized_benchmark_june_july.json` is already included in the `data/` folder. It contains customer questions from Slack channels, anonymized for privacy.
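If you want a quick look at the benchmark before running anything, a short Python snippet is enough. The dataset's exact schema is not documented here, so the sketch below only prints its size and the fields of the first record rather than assuming specific keys:

import json

# Load the anonymized benchmark (path relative to the repository root)
with open("data/anonymized_benchmark_june_july.json") as f:
    benchmark = json.load(f)

print("Top-level type:", type(benchmark).__name__)
if isinstance(benchmark, list) and benchmark and isinstance(benchmark[0], dict):
    print("Number of items:", len(benchmark))
    print("Fields in first record:", list(benchmark[0].keys()))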
You need to gather three types of documents for the RAG systems:
A. Crawled Documentation (Invopop & GOBL docs)
cd load/crawler
uv run python crawl_docs.py docs.invopop.com
uv run python crawl_docs.py docs.gobl.org
You can also use the docs already included in `load/crawler/docs` directly; these are the crawls used to produce the results reported above. If you want up-to-date documentation, we recommend re-crawling with Firecrawl, keeping in mind that the bundled docs may have drifted from what the MCP and web-search solutions see, since those always use the live documentation.
B. GitHub Repository Code
Clone the github.com/invopop/gobl and github.com/invopop/gobl.verifactu repositories at the same level as this repository.
C. Official VeriFactu Documents
The URLs for the official documents are listed in `files/verifactu/json/verifactu.json` and would normally need to be downloaded manually. For convenience, they have already been downloaded and are stored in `files/verifactu/`.
The vector store setup has been streamlined with improved scripts that handle configuration automatically.
Use the comprehensive setup script that handles everything:
cd vector_store/openai
# Complete setup with all document types
uv run python setup_vector_store.py
# Or with custom options
uv run python setup_vector_store.py --name "My Custom Vector Store" --skip-github
The setup script will:
- Check all prerequisites automatically
- Create the vector store with proper configuration
- Add Firecrawl documentation (GOBL + Invopop)
- Add GitHub code repositories (gobl + gobl.verifactu)
- Add official VeriFactu documents
- Provide progress reporting and final statistics
- Save configuration for future use
For more control, run individual scripts:
cd vector_store/openai
# 1. Create the vector store
uv run python create_vector_store.py
# 2. Add different document types (in any order)
uv run python add_firecrawl_docs.py # Add crawled documentation
uv run python add_github_code.py # Add GitHub code repositories
uv run python add_official_docs.py # Add official PDF documents
If you encounter issues:
# Check what's missing
uv run python setup_vector_store.py --dry-run
# Skip problematic document types
uv run python setup_vector_store.py --skip-github
# Force recreation of vector store
uv run python setup_vector_store.py --force
The vector store configuration is saved in `vector_store_config.json` for future reference.
After setting up your vector store, configure the `OPENAI_VECTOR_STORE_ID` environment variable with the ID from your configuration:
# Extract the vector store ID from the config file
export OPENAI_VECTOR_STORE_ID=$(python -c "import json; print(json.load(open('vector_store_config.json'))['vector_store_id'])")
# Or set it manually by copying the ID from vector_store_config.json
export OPENAI_VECTOR_STORE_ID="vs_your_vector_store_id_here"
# Add to your shell profile for persistence (optional)
echo "export OPENAI_VECTOR_STORE_ID=$OPENAI_VECTOR_STORE_ID" >> ~/.bashrc
This environment variable is required for all solutions to access the vector store. Alternatively, you can pass the `vector_store_id` directly as a constructor argument when initializing the services.
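For the constructor-argument route, the sketch below shows the general idea; `FileSearchService` and its parameter name are illustrative placeholders, so substitute the actual service class from the solution you are using (e.g. under `solutions/file_search/`):

import json

# Read the ID saved by the setup script (run from the directory that
# contains vector_store_config.json, e.g. vector_store/openai/)
with open("vector_store_config.json") as f:
    vector_store_id = json.load(f)["vector_store_id"]

# Hypothetical usage: pass the ID explicitly instead of relying on
# the OPENAI_VECTOR_STORE_ID environment variable.
# service = FileSearchService(vector_store_id=vector_store_id)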
Use the management utility to inspect and manage your vector store:
cd vector_store/openai
# Show vector store information and statistics
uv run python manage_vector_store.py info
uv run python manage_vector_store.py stats
# List all files in the vector store
uv run python manage_vector_store.py files
# List all vector stores in your account
uv run python manage_vector_store.py list
# Delete the vector store (with confirmation)
uv run python manage_vector_store.py delete
The MCP solution uses the Model Context Protocol to access documentation through specialized MCP servers.
Install the required MCP servers using mint-mcp:
# Install the Invopop and GOBL MCP servers
npx mint-mcp add invopop
npx mint-mcp add gobl
Note: You may see error messages like "Error installing MCP for invopop: Cannot read properties of undefined (reading 'name')" during installation. This is a known issue with mint-mcp; the servers are still installed correctly in `~/.mcp/`.
Check that the MCP servers are installed:
ls -la ~/.mcp/
# Should show: invopop/ and gobl/ directories
Install Node.js dependencies for both MCP servers:
cd ~/.mcp/invopop && npm install
cd ~/.mcp/gobl && npm install
The MCP servers require `tools.json` files that might be missing after installation. Create them:
# Create empty tools.json files for both servers
echo "[]" > ~/.mcp/invopop/src/tools.json
echo "[]" > ~/.mcp/gobl/src/tools.json
Test that the MCP setup works by running the main application:
# Set a temporary API key for testing (replace with your real key)
export OPENAI_API_KEY="your_openai_api_key_here"
# Run the MCP application
uv run python -m solutions.mcp.main --config solutions/mcp/config.yaml --verbose
Success indicators:
- β "MCP Server running on stdio" messages appear
- β Welcome message displays
- β π€ You: prompt appears (this confirms MCP setup is working)
The MCP configuration is handled via `solutions/mcp/config.yaml`:
mcp:
servers:
invopop:
command: "node"
args: ["~/.mcp/invopop/src/index.js"]
transport: "stdio"
gobl:
command: "node"
args: ["~/.mcp/gobl/src/index.js"]
transport: "stdio"
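As a rough illustration of how this file is consumed, the snippet below parses the configuration and prints the launch command for each server. It assumes PyYAML is installed; the real wiring lives in `solutions.mcp.main` and may differ:

import os
import yaml

with open("solutions/mcp/config.yaml") as f:
    config = yaml.safe_load(f)

for name, server in config["mcp"]["servers"].items():
    # Expand "~" so the paths point at the servers installed under ~/.mcp/
    args = [os.path.expanduser(arg) for arg in server["args"]]
    print(f"{name}: {server['command']} {' '.join(args)} (transport={server['transport']})")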
Upload the dataset to Opik for tracking:
cd eval
uv run python create_dataset.py
Test direct LLM performance without any augmentation using the new configurable evaluation runner:
cd eval
# Basic usage with defaults (gpt-5-mini, medium reasoning, with prompt)
uv run python run_basic_llm_eval.py
# Model variations
uv run python run_basic_llm_eval.py --model gpt-4.1
uv run python run_basic_llm_eval.py --model gpt-5
uv run python run_basic_llm_eval.py --model gpt-5-mini
# Reasoning effort configurations (gpt-5/gpt-5-mini only)
uv run python run_basic_llm_eval.py --model gpt-5 --reasoning-effort minimal
uv run python run_basic_llm_eval.py --model gpt-5 --reasoning-effort low
uv run python run_basic_llm_eval.py --model gpt-5 --reasoning-effort medium
uv run python run_basic_llm_eval.py --model gpt-5 --reasoning-effort high
# Compare with/without system prompt
uv run python run_basic_llm_eval.py --model gpt-5 --reasoning-effort low
uv run python run_basic_llm_eval.py --model gpt-5 --reasoning-effort low --no-prompt
# Custom experiment naming and advanced options
uv run python run_basic_llm_eval.py \
--model gpt-5-mini \
--reasoning-effort high \
--no-prompt \
--experiment-name "gpt5mini_high_no_prompt_test" \
--threads 8
Configuration Options:
- Models: `gpt-4.1`, `gpt-5`, `gpt-5-mini`
- Reasoning Effort (gpt-5/gpt-5-mini only): `minimal`, `low`, `medium`, `high`
- Prompt Control: Use `--no-prompt` to test raw model performance without a system prompt
- Custom Naming: Use `--experiment-name` for better experiment tracking
- Threading: Use `--threads N` to adjust parallel evaluation threads
- Dataset: Use `--dataset-name` to specify different evaluation datasets
View all options:
uv run python run_basic_llm_eval.py --help
Expected Performance (based on comprehensive evaluations):
- GPT-4.1-mini: ~53.8% accuracy, $0.00045/answer, ~4s response
- GPT-4.1: ~59.4% accuracy, $0.0023/answer, ~4s response
- GPT-5-mini minimal: ~54.0% accuracy, $0.00072/answer, ~6s response
- GPT-5 minimal: ~61.3% accuracy, $0.0035/answer, ~6s response
- GPT-5 low: ~61.7% accuracy, $0.0118/answer, ~15s response
- GPT-5 medium: ~62.6% accuracy, $0.0247/answer, ~32s response
- GPT-5 high: ~65.4% accuracy, $0.0243/answer, ~43s response
Detailed Guide: See `eval/BASIC_LLM_EVALUATION_GUIDE.md` for comprehensive usage examples and troubleshooting.
Legacy Script: The original `eval_basic_llm.py` is still available but requires manual configuration.
Test web search augmented models using the new configurable evaluation runner:
cd eval
# Basic usage with defaults (gpt-5-mini, medium reasoning, multiple tools)
uv run python run_web_search_eval.py
# Model variations
uv run python run_web_search_eval.py --model gpt-4.1
uv run python run_web_search_eval.py --model gpt-5
uv run python run_web_search_eval.py --model gpt-5-mini
# Reasoning effort configurations (gpt-5/gpt-5-mini only)
uv run python run_web_search_eval.py --model gpt-5 --reasoning-effort minimal
uv run python run_web_search_eval.py --model gpt-5 --reasoning-effort low
uv run python run_web_search_eval.py --model gpt-5 --reasoning-effort medium
uv run python run_web_search_eval.py --model gpt-5 --reasoning-effort high
# Tool usage comparison
uv run python run_web_search_eval.py --model gpt-5 --reasoning-effort low
uv run python run_web_search_eval.py --model gpt-5 --reasoning-effort low --limit-tools
# Custom experiment naming and advanced options
uv run python run_web_search_eval.py \
--model gpt-5-mini \
--reasoning-effort high \
--limit-tools \
--experiment-name "gpt5mini_high_single_tool_test" \
--threads 8
Configuration Options:
- Models: `gpt-4.1`, `gpt-5`, `gpt-5-mini`
- Reasoning Effort (gpt-5/gpt-5-mini only): `minimal`, `low`, `medium`, `high`
- Tool Limit: Use `--limit-tools` to restrict to one web search per query
- Custom Naming: Use `--experiment-name` for better experiment tracking
- Threading: Use `--threads N` to adjust parallel evaluation threads
- Dataset: Use `--dataset-name` to specify different evaluation datasets
View all options:
uv run python run_web_search_eval.py --help
Expected Performance (based on comprehensive evaluations):
- GPT-4.1: ~42.5% accuracy, $0.0398/answer, ~9s response (poor performance)
- GPT-5-mini low: ~63.1% accuracy, $0.0089/answer, ~19s response
- GPT-5-mini medium: ~64.8% accuracy, $0.0398/answer, ~48s response
- GPT-5-mini high: ~67.8% accuracy, $0.124/answer, ~150s response
- GPT-5 low: ~71.3% accuracy, $0.149/answer, ~37s response
- GPT-5 medium: ~71.6% accuracy, $0.295/answer, ~66s response
- GPT-5 high: ~72.2% accuracy, $0.493/answer, ~114s response
- GPT-5 low (1 tool): ~64.6% accuracy, $0.0394/answer, ~25s response
Legacy Script: The original `eval_web_search.py` is still available but requires manual configuration.
Test retrieval augmented generation using the new configurable evaluation runner:
cd eval
# Basic usage with defaults (gpt-4.1, medium reasoning, multiple file searches)
uv run python run_file_search_eval.py
# Model variations
uv run python run_file_search_eval.py --model gpt-4.1
uv run python run_file_search_eval.py --model gpt-5
uv run python run_file_search_eval.py --model gpt-5-mini
# Reasoning effort configurations (gpt-5/gpt-5-mini only)
uv run python run_file_search_eval.py --model gpt-5 --reasoning-effort minimal
uv run python run_file_search_eval.py --model gpt-5 --reasoning-effort low
uv run python run_file_search_eval.py --model gpt-5 --reasoning-effort medium
uv run python run_file_search_eval.py --model gpt-5 --reasoning-effort high
# File search limit comparison
uv run python run_file_search_eval.py --model gpt-5 --reasoning-effort low
uv run python run_file_search_eval.py --model gpt-5 --reasoning-effort low --limit-tools
# Custom experiment naming and advanced options
uv run python run_file_search_eval.py \
--model gpt-5-mini \
--reasoning-effort high \
--limit-tools \
--experiment-name "gpt5mini_high_single_search_test" \
--threads 8
Configuration Options:
- Models: `gpt-4.1`, `gpt-5`, `gpt-5-mini`
- Reasoning Effort (gpt-5/gpt-5-mini only): `minimal`, `low`, `medium`, `high`
- Tool Limit: Use `--limit-tools` to restrict to one file search per query
- Custom Naming: Use `--experiment-name` for better experiment tracking
- Threading: Use `--threads N` to adjust parallel evaluation threads
- Dataset: Use `--dataset-name` to specify different evaluation datasets
View all options:
uv run python run_file_search_eval.py --help
Expected Performance (based on comprehensive evaluations):
- GPT-4.1: ~67.8% accuracy, $0.0376/answer, ~12s response
- GPT-5-mini low: ~65.3% accuracy, $0.0064/answer, ~17s response
- GPT-5-mini medium: ~68.0% accuracy, $0.0127/answer, ~37s response
- GPT-5-mini high: ~69.2% accuracy, $0.0320/answer, ~88s response
- GPT-5 low: ~71.9% accuracy, $0.0526/answer, ~25s response
- GPT-5 medium: ~73.0% accuracy, $0.106/answer, ~52s response
- GPT-5 high: ~73.9% accuracy, $0.170/answer, ~96s response
- GPT-5 low (1 tool): ~70.3% accuracy, $0.0421/answer, ~24s response
Legacy Script: The original `eval_file_search.py` is still available but requires manual configuration.
Test distributed RAG with specialized tools using the new configurable evaluation runner:
cd eval
# Basic usage with defaults (gpt-5, high reasoning)
uv run python run_agent_rag_eval.py
# Model variations
uv run python run_agent_rag_eval.py --model gpt-4.1
uv run python run_agent_rag_eval.py --model gpt-5
uv run python run_agent_rag_eval.py --model gpt-5-mini
# Reasoning effort configurations (gpt-5/gpt-5-mini only)
uv run python run_agent_rag_eval.py --model gpt-5 --reasoning-effort low
uv run python run_agent_rag_eval.py --model gpt-5 --reasoning-effort medium
uv run python run_agent_rag_eval.py --model gpt-5 --reasoning-effort high
# Custom experiment naming and advanced options
uv run python run_agent_rag_eval.py \
--model gpt-5-mini \
--reasoning-effort medium \
--experiment-name "gpt5mini_medium_multiagent_test" \
--threads 16
Configuration Options:
- Models: `gpt-4.1`, `gpt-5`, `gpt-5-mini`
- Reasoning Effort (gpt-5/gpt-5-mini only): `minimal`, `low`, `medium`, `high`
- Custom Naming: Use `--experiment-name` for better experiment tracking
- Threading: Use `--threads N` to adjust parallel evaluation threads
- Dataset: Use `--dataset-name` to specify different evaluation datasets
View all options:
uv run python run_agent_rag_eval.py --help
Features:
- Specialized Tools: Separate tools for different document types (VeriFactu, Invopop, GOBL docs/code)
- Intelligent Routing: Agent automatically chooses appropriate tools based on question context (see the sketch below)
- LangGraph Orchestration: Advanced workflow management with memory and checkpointing
- Multi-Source RAG: Can access and combine information from multiple knowledge sources
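To make the distributed-tools idea concrete, here is a minimal sketch using LangGraph's prebuilt ReAct agent and LangChain tools. The tool bodies are placeholders, and the real implementation in `solutions/agent_all_rag_langgraph/` wires them to the vector store and adds its own prompts, memory, and checkpointing, so treat this as an illustration of the routing pattern rather than the actual code:

from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent

@tool
def search_verifactu_docs(query: str) -> str:
    """Search the official VeriFactu documents."""
    # Placeholder: query the VeriFactu slice of the vector store here.
    return "...retrieved VeriFactu passages..."

@tool
def search_gobl_code(query: str) -> str:
    """Search the gobl and gobl.verifactu source code."""
    # Placeholder: query the GitHub-code slice of the vector store here.
    return "...retrieved code snippets..."

# The agent inspects the question and decides which tool(s) to call.
agent = create_react_agent(ChatOpenAI(model="gpt-5"), tools=[search_verifactu_docs, search_gobl_code])
result = agent.invoke({"messages": [("user", "How do I register an invoice under VeriFactu?")]})
print(result["messages"][-1].content)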
Expected Performance (based on comprehensive evaluations):
- GPT-4.1: ~67.3% accuracy, $0.0460/answer, ~14s response
- GPT-5-mini low: ~65.4% accuracy, $0.0071/answer, ~18s response (best cost-performance)
- GPT-5-mini high: ~68.1% accuracy, $0.0535/answer, ~160s response
- GPT-5 low: ~73.9% accuracy, $0.0602/answer, ~40s response
- GPT-5 medium: ~74.0% accuracy, $0.151/answer, ~101s response
- GPT-5 high: ~75.0% accuracy, $0.267/answer, ~191s response (best overall accuracy)
Legacy Script: The original `eval_agent_rag_all.py` is still available but requires manual configuration.
Test Model Context Protocol implementations using the new configurable evaluation runner:
cd eval
# Basic usage with defaults (gpt-5, medium reasoning)
uv run python run_mcp_eval.py
# Model variations
uv run python run_mcp_eval.py --model gpt-4.1
uv run python run_mcp_eval.py --model gpt-5
uv run python run_mcp_eval.py --model gpt-5-mini
# Reasoning effort configurations (gpt-5/gpt-5-mini only)
uv run python run_mcp_eval.py --model gpt-5 --reasoning-effort minimal
uv run python run_mcp_eval.py --model gpt-5 --reasoning-effort low
uv run python run_mcp_eval.py --model gpt-5 --reasoning-effort medium
uv run python run_mcp_eval.py --model gpt-5 --reasoning-effort high
# Custom experiment naming
uv run python run_mcp_eval.py \
--model gpt-5-mini \
--reasoning-effort low \
--experiment-name "gpt5mini_low_mcp_test"
Configuration Options:
- Models: `gpt-4.1`, `gpt-5`, `gpt-5-mini`
- Reasoning Effort (gpt-5/gpt-5-mini only): `minimal`, `low`, `medium`, `high`
- Custom Naming: Use `--experiment-name` for better experiment tracking
- Threading: Use `--threads N` (default: 1, recommended for MCP stability)
- Dataset: Use `--dataset-name` to specify different evaluation datasets
View all options:
uv run python run_mcp_eval.py --help
Features:
- Integrated MCP Tools: Built-in access to documentation via MCP protocol
- Multi-Source Knowledge: Access to Invopop, GOBL docs, and code repositories
- Official Document Access: VeriFactu and other official documentation
- Configurable Models: Support for different GPT models and reasoning levels
Expected Performance (based on comprehensive evaluations):
- GPT-4.1: ~68.5% accuracy, $0.032/answer, ~14s response
- GPT-5 low: ~74.2% accuracy, $0.0523/answer, ~54s response (best balance)
- GPT-5 medium: ~74.4% accuracy, $0.133/answer, ~165s response
Legacy Script: The original `eval_mcp_new_prompts.py` is still available but requires manual configuration.
After running experiments, extract metrics:
cd eval
uv run python get_experiment_cost_duration.py --experiment_name your_experiment_name
This provides:
- Accuracy scores (overall and VeriFactu-specific)
- Cost analysis (per 1000 queries)
- Response time percentiles (P50, P90, P99)
- Success rates
Accuracy: Semantic correctness of answers compared to ground truth
VeriFactu Accuracy: Specific accuracy on Spanish tax compliance questions
Cost: OpenAI API cost per answer
Response Times: Latency percentiles (P50/P90/P99) in seconds
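If you log per-question latencies yourself, the P50/P90/P99 figures above can be reproduced with a simple index-based percentile like the sketch below (the evaluation scripts may interpolate slightly differently):

def percentile(values, p):
    """Return the p-th percentile (0-100) of a list of latencies in seconds."""
    ordered = sorted(values)
    # Index into the sorted values, without interpolation.
    idx = min(len(ordered) - 1, max(0, round(p / 100 * (len(ordered) - 1))))
    return ordered[idx]

latencies_s = [4.2, 5.1, 3.9, 12.4, 6.0, 7.3, 5.8, 4.9]  # example per-answer timings
for p in (50, 90, 99):
    print(f"P{p}: {percentile(latencies_s, p):.2f}s")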
Based on our comprehensive evaluation results, here are our recommendations for different use cases:
Best Overall Choice: GPT-5 Low + MCP
- Accuracy: 74.2% (excellent)
- Cost: $0.0523/answer (reasonable)
- Speed: 54s P50 (acceptable for complex queries)
- Why: Best balance of accuracy, cost, and speed
Budget-Conscious: GPT-5-mini Low + Multi-Agent
- Accuracy: 65.4% (good)
- Cost: $0.0071/answer (very low)
- Speed: 18s P50 (fast)
- Why: Excellent cost-performance ratio
Speed-Critical: GPT-4.1-mini
- Accuracy: 53.8% (basic)
- Cost: $0.00045/answer (extremely low)
- Speed: 4s P50 (very fast)
- Why: Fastest responses at minimal cost
Maximum Accuracy: GPT-5 High + Multi-Agent
- Accuracy: 75.0% (highest)
- VeriFactu Accuracy: 77.5% (highest)
- Cost: $0.267/answer (expensive)
- Speed: 191s P50 (slow)
- Why: Best possible accuracy for research purposes
Experimentation: GPT-5 Medium + RAG
- Accuracy: 73.0% (very good)
- Cost: $0.106/answer (moderate)
- Speed: 52s P50 (reasonable)
- Why: Good balance for testing and development
Use Case | Recommended Configuration | Accuracy | Cost ($/answer) | Speed (P50) |
---|---|---|---|---|
Customer Support | GPT-5 Low + MCP | 74.2% | $0.0523 | 54s |
Internal Tools | GPT-5-mini Medium + RAG | 68.0% | $0.0127 | 37s |
Documentation Search | GPT-5 Low + RAG | 71.9% | $0.0526 | 25s |
Research Analysis | GPT-5 High + Multi-Agent | 75.0% | $0.267 | 191s |
Quick Queries | GPT-4.1 | 59.4% | $0.0023 | 4s |
Batch Processing | GPT-5-mini Low + Multi-Agent | 65.4% | $0.0071 | 18s |
- System Prompts Matter: Adding a system prompt improves accuracy by roughly 7-12 percentage points across models
- GPT-4.1 + Web Search: Performs poorly (42.5% accuracy) - avoid this combination
- Tool Limiting: Restricting to 1 tool call cuts costs significantly but reduces accuracy (~7-8 points for web search, ~2-3 points for RAG)
- Mini Models: GPT-5-mini offers good performance at much lower cost than full GPT-5
- Architecture Impact: Multi-Agent ≈ MCP > RAG > Web Search > Basic LLM in terms of accuracy
- Create a new directory in `solutions/`
- Implement the required interface (see existing solutions as examples)
- Add a corresponding evaluation script in `eval/`
- Update this README with new results
Edit `eval/correctness.py` to:
- Add new evaluation criteria
- Adjust scoring algorithms
- Include domain-specific metrics
To add new document sources:
- Add loading scripts in `load/`
- Create corresponding vector store scripts in `vector_store/openai/`
- Update solution prompts to handle new document types
- Start with GPT-4.1 for baseline experiments
- Use GPT-5-mini for development and testing
- Monitor token usage with Opik tracking
- Use higher reasoning levels for complex queries
- Combine RAG with web search for comprehensive coverage
- Implement multi-agent approaches for specialized domains
- Use simpler models for time-critical applications
- Implement caching for repeated queries (see the sketch below)
- Optimize vector store chunk sizes
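For the caching suggestion in the list above, a minimal in-process cache keyed on the normalized question is often enough. The sketch below is illustrative and not part of the repository; `answer_fn` stands in for whichever solution entry point you are calling:

import hashlib

_answer_cache: dict[str, str] = {}

def cached_answer(question: str, answer_fn) -> str:
    """Return a cached answer for repeated (normalized) questions."""
    key = hashlib.sha256(question.strip().lower().encode()).hexdigest()
    if key not in _answer_cache:
        _answer_cache[key] = answer_fn(question)  # call into one of the solutions
    return _answer_cache[key]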
Vector Store Connection Errors
- Verify OpenAI API key is set correctly
- Check vector store ID in configuration files
MCP Server Issues
- Ensure Node.js is installed for MCP servers
- Verify MCP servers are running: `mint-mcp list`
Evaluation Dataset Issues
- Confirm Opik API key is configured
- Check dataset exists in Opik dashboard
Memory Issues
- Reduce batch sizes in evaluation scripts
- Use streaming for large document processing
- GOBL Documentation
- Invopop Documentation
- OpenAI Vector Store Guide
- Opik Evaluation Framework
- LangGraph Multi-Agent Guide
- Fork the repository
- Create a feature branch
- Run experiments and document results
- Submit a pull request with performance comparisons
Note: This experimental framework is designed for research and development purposes. Production deployments should consider additional factors like security, scalability, and compliance requirements.