A community-driven benchmark to understand why AI fails on real enterprise data problems and which techniques actually work.
This is a partnership between UC Berkeley (EPIC Data Lab) and PromptQL to create the first benchmark that:
- Documents real failures of AI tools in enterprise settings
- Tests multiple techniques (RAG, Agents, Text-to-SQL, etc.) against the same problems
- Measures what works with deterministic, reproducible evaluations
🎯 We need your help: Share the data problems where AI tools failed you, and help us test solutions that might actually work.
Every enterprise is trying to use AI for data analysis. They're using Cursor for SQL, Claude with MCP servers, ChatGPT with Code Interpreter, and countless other tools. And they're all hitting the same walls.
We've seen the same story repeatedly:
- The demo works perfectly on clean, single-table data
- It breaks immediately on real enterprise data with multiple sources, ambiguous schemas, and business logic
- No one knows which technique (RAG, Agents, Fine-tuning) would actually solve their problem
This benchmark aims to:
- Collect real enterprise data problems where current AI tools fail
- Test different AI techniques against these problems systematically
- Provide clear evidence of what works, what doesn't, and why
We want to hear about your failures.
Edit `EXAMPLES.md` to add:
- The problem you tried to solve (be specific about data sources, complexity)
- What tools/approaches you tried (Claude, GPT-4, Cursor, custom agents)
- Exactly how they failed (wrong results, timeouts, hallucinations)
- Your hypothesis for why they failed
- What a working solution would need to handle
Example contribution:
**Problem:** "Show me customers at risk of churning by combining Salesforce opportunity data with product usage logs"
**What failed:** Claude with database MCP server
**How it failed:** Couldn't handle the date misalignment between systems, generated SQL with wrong join keys
**Why:** No understanding that Salesforce uses fiscal quarters while usage logs use calendar months
**Solution needs:** Temporal alignment logic, business calendar awareness
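To illustrate the kind of temporal alignment logic such a solution would need, here is a minimal Python sketch. The fiscal-year start month and the function names are assumptions for illustration only; real organizations configure their fiscal calendar in Salesforce.

```python
from datetime import date

# Assumption for illustration only: the fiscal year starts in February,
# so fiscal Q1 covers February-April. Real orgs configure this in Salesforce.
FISCAL_YEAR_START_MONTH = 2

def fiscal_quarter_to_months(fiscal_year: int, quarter: int) -> list[date]:
    """Return the first day of each calendar month covered by a fiscal quarter."""
    start = (FISCAL_YEAR_START_MONTH - 1) + (quarter - 1) * 3
    months = []
    for offset in range(3):
        idx = start + offset
        months.append(date(fiscal_year + idx // 12, idx % 12 + 1, 1))
    return months

def usage_month_to_fiscal_quarter(d: date) -> tuple[int, int]:
    """Map a calendar date from the usage logs back to (fiscal_year, fiscal_quarter)."""
    idx = (d.year * 12 + d.month - 1) - (FISCAL_YEAR_START_MONTH - 1)
    return idx // 12, (idx % 12) // 3 + 1

# Align a usage-log timestamp with Salesforce's FY2024 Q1 opportunities:
# fiscal_quarter_to_months(2024, 1)              -> Feb, Mar, Apr 2024
# usage_month_to_fiscal_quarter(date(2024, 3, 15)) -> (2024, 1)
```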
Have an idea for an approach that might work better? Add it to the techniques list:
- Tool Calling with structured schemas
- Multi-agent systems with specialized roles
- Graph RAG over enterprise data
- Semantic layer + LLM
- Your novel approach
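As one illustration of "Semantic layer + LLM" from the list above: the idea is to keep business definitions in a vetted layer so the model only selects a metric instead of writing raw SQL. The metric names, tables, and SQL below are hypothetical, a sketch rather than a reference implementation.

```python
# Hypothetical semantic layer: business definitions live in vetted SQL,
# and the LLM only picks a metric name instead of writing queries from scratch.
SEMANTIC_LAYER = {
    "churn_risk_accounts": {
        "description": "Accounts with fewer than 5 product events in the last 30 days",
        "sql": "SELECT account_id FROM usage_30d WHERE event_count < 5",
    },
    "open_pipeline": {
        "description": "Total value of opportunities not yet closed",
        "sql": "SELECT SUM(amount) FROM opportunities WHERE stage NOT LIKE 'Closed%'",
    },
}

def compile_metric(metric_name: str) -> str:
    """Resolve an LLM-selected metric to its vetted SQL; unknown names fail loudly."""
    if metric_name not in SEMANTIC_LAYER:
        raise KeyError(f"Metric not defined in the semantic layer: {metric_name}")
    return SEMANTIC_LAYER[metric_name]["sql"]
```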
Help us test techniques against our problem set:
- Pick a technique from `techniques/`
- Run it against problems in `EXAMPLES.md`
- Report results with full traces (a minimal trace-recording sketch follows this list)
- Document what worked and what didn't
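Here is a rough sketch of what "results with full traces" could look like. The `run_technique` callable and the trace fields are assumptions for illustration, not the evaluation framework in `src/`; adapt them to whatever your technique implementation exposes.

```python
import json
import time
from pathlib import Path
from typing import Callable

def run_with_trace(
    technique: str,
    problem_id: str,
    question: str,
    run_technique: Callable[[str, list], str],  # hypothetical: (question, trace) -> answer
) -> dict:
    """Run one technique on one benchmark problem and persist a full trace.

    `run_technique` is expected to append a dict per step (prompt, tool call,
    generated SQL, intermediate rows) to the trace list it is handed.
    """
    trace: list[dict] = []
    started = time.time()
    answer = run_technique(question, trace)
    result = {
        "technique": technique,
        "problem_id": problem_id,
        "question": question,
        "answer": answer,
        "trace": trace,
        "runtime_seconds": round(time.time() - started, 2),
    }
    out_dir = Path("techniques") / technique / "results"
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / f"{problem_id}.json").write_text(json.dumps(result, indent=2))
    return result
```

One JSON file per problem keeps traces diffable in pull requests and lets the cross-technique comparison in `evaluations/` be regenerated from raw runs.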
We're systematically testing these approaches against real enterprise problems:
| Technique | Description | Status |
|-----------|-------------|--------|
| `techniques/tool-calling/` | Function calling with database tools | 🟡 In Progress |
| `techniques/text-to-sql/` | Direct natural language to SQL | 🟡 In Progress |
| `techniques/langgraph-agent/` | Multi-step planning with LangGraph | 🔴 Planned |
| `techniques/rag-on-data/` | RAG over database schemas and docs | 🔴 Planned |
| `techniques/graph-rag/` | Graph-based RAG for relationships | 🔴 Planned |
| `techniques/semantic-layer/` | Business logic layer + LLM | 🔴 Planned |
| `techniques/custom/` | Novel approaches from contributors | 🔵 Open for PRs |
Each technique folder contains:
- `ARCHITECTURE.md` - How the technique works
- `implementation/` - Runnable code
- `results/` - Evaluation results on benchmark problems
- `analysis.md` - What worked, what didn't, and why
.
├── README.md # This file
├── EXAMPLES.md # Real enterprise problems and failures
├── techniques/ # Approaches being tested
│ ├── tool-calling/
│ │ ├── ARCHITECTURE.md # How tool calling works
│ │ ├── implementation/ # Runnable code
│ │ └── results/ # Evaluation results
│ ├── text-to-sql/
│ ├── langgraph-agent/
│ └── ...
├── src/ # Evaluation framework (in development)
│ ├── common_scaffold/ # Core utilities
│ ├── query_*/ # Test datasets
│ └── requirements.txt
└── evaluations/ # Results and analysis
└── comparison.md # Cross-technique comparison
**Not another leaderboard.** We're not ranking models on clean datasets. We're:
- Documenting specific failure modes in production
- Testing complete techniques (not just models)
- Providing detailed failure analysis (not just scores)
- Building reproducible solutions (not just benchmarks)
**Real problems, not toy examples.** Every problem in this benchmark:
- Comes from an actual enterprise use case
- Has documented failure cases with current tools
- Requires production-grade handling (errors, scale, governance)
**Techniques, not just prompts.** We test:
- Complete architectures (agents, RAG, tools)
- Error handling and recovery strategies
- Performance at scale
- Deterministic, reproducible approaches
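One way to meet the last point above is to pin every source of nondeterminism in a recorded run configuration and hash outputs so reruns can be checked byte-for-byte. This is a hypothetical sketch, not the benchmark's actual config format.

```python
import hashlib
import json
import random

# Hypothetical run configuration: every knob that could introduce
# nondeterminism is pinned and stored next to the results it produced.
RUN_CONFIG = {
    "model": "some-pinned-model-snapshot",  # assumption: pin an exact model version
    "temperature": 0.0,
    "seed": 42,
    "max_retries": 0,  # silent retries can change outputs between runs
}

random.seed(RUN_CONFIG["seed"])  # seed any local randomness the technique uses

def output_fingerprint(answer: str, trace: list[dict]) -> str:
    """Hash the answer plus trace so two runs can be compared byte-for-byte."""
    blob = json.dumps({"answer": answer, "trace": trace}, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()
```

Temperature 0 and a fixed seed do not guarantee identical outputs from hosted models; recording the fingerprint is what makes any drift visible across runs.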
✅ Completed:
- Initial problem collection from enterprise partners
- Basic evaluation framework
- First 5 test datasets (GoogleLocal, BookReview, Yelp, StockIndex, StockMarket)
🟡 In Progress:
- Expanding problem documentation with failure analysis
- Implementing technique comparison framework
- Testing initial techniques (tool-calling, text-to-sql)
🔴 Need Help With:
- More real-world failures - What broke for you?
- Novel techniques - What might work better?
- Evaluation runs - Help test techniques at scale
- Failure analysis - Why exactly do these tools fail?
- Read `EXAMPLES.md` to see existing problems
- Add your own failures and hypotheses
- Submit a PR with a title like `examples: add Salesforce-MongoDB join failures`
- Check `techniques/` for existing approaches
- Propose new techniques with an `ARCHITECTURE.md`
- Implement and test against problems in `EXAMPLES.md`
- Submit results with full traces
- Set up the environment:

      git clone <repo>
      cd ucb-query-benchmark
      pip install -r src/requirements.txt

- Pick a technique and problem set
- Run evaluations and document results
- Share findings in `evaluations/`
For Problem Contributions:
- Based on real enterprise scenario you've encountered
- Includes specific tools/approaches that failed
- Documents the exact failure mode
- Provides hypothesis for root cause
- Describes what a solution would need
For Technique Contributions:
- Clear architecture documentation
- Runnable implementation
- Handles real enterprise complexity
- Deterministic and reproducible
- Includes failure recovery strategy
- Discord: Join our server for real-time discussion
- Issues: Report bugs or propose enhancements
- Discussions: Share experiences and hypotheses
MIT License for code and specifications. Individual datasets may have their own licenses.
This benchmark is a collaboration between:
- UC Berkeley EPIC Data Lab - Academic research on data systems
- PromptQL - Enterprise data intelligence platform
- You - The practitioners dealing with these problems daily
We especially thank the enterprises who shared their failures and helped us understand why AI tools break in production.