Q Evaluation Harness

An open-source framework by KX for evaluating Large Language Models on Q/kdb+ code generation tasks.

License: MIT · Python 3.10+

This project introduces the first standardized evaluation benchmark for Q/kdb+, addressing a critical gap in assessing Large Language Models (LLMs) for this specialized programming language. The lack of such a benchmark has limited meaningful measurement and progress in Q code generation.

Our evaluation harness provides a robust and rigorous framework, beginning with a Q-language adaptation of OpenAI’s HumanEval. Our roadmap includes integrating additional benchmarks, such as a Q-language port of MBPP, empowering the community to effectively evaluate and advance LLM capabilities for Q.

Model Leaderboard

Track the performance of Large Language Models on Q/kdb+ code generation tasks using our standardized evaluation framework.

Rank  Model             Pass@1   Pass@5   Pass@10
🥇    Grok 4            43.37%   68.45%   74.32%
🥈    Claude 4 Sonnet   37.70%   53.47%   59.13%
🥉    Gemini 2.5 Pro    27.75%   51.41%   59.68%

📈 View Complete Leaderboard →
See full results, historical data, and detailed analysis


Features

  • 🚀 Simple CLI: One-command evaluation with qeval run <dataset> <model>.
  • 📊 Q-HumanEval Dataset: 164 hand-crafted Q programming problems.
  • 🔧 Multi-Model Support: Supports both closed-source APIs and open-source Hugging Face models.
  • 📈 Standard Metrics: Pass@1, Pass@5, Pass@10 with isolated execution (see the estimator sketch after this list).
  • ⏱️ Timeout Protection: Code execution with configurable timeout limits.
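
For reference, Pass@k numbers like these are conventionally computed with the unbiased estimator from the HumanEval paper (Chen et al., 2021). The sketch below shows that formula; it is illustrative, not necessarily this harness's exact implementation.

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator (Chen et al., 2021):
    1 - C(n-c, k) / C(n, k), where n = samples drawn per problem
    and c = samples that pass all tests."""
    if n - c < k:
        return 1.0  # every size-k draw contains at least one passing sample
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# e.g. with 50 samples per problem, 12 of which pass:
# pass_at_k(50, 12, 10) estimates Pass@10 for that problem.

This is also why 50 samples per problem are recommended below: the estimator requires n ≥ k, and a larger n tightens the Pass@10 estimate.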

Installation

# Clone the repository
git clone https://github.com/KxSystems/q-evaluation-harness.git
cd q-evaluation-harness

# Install dependencies with Poetry
poetry install

# Activate the Poetry environment (Poetry 2.0+)
eval $(poetry env activate)

Requirements: Python 3.10+, Poetry, and a kdb+ license.

⚠️ Security Warning: This tool executes generated code in your local environment. While we provide timeout protection, the execution is not sandboxed. Only run evaluations with trusted models and in isolated environments. Sandboxed execution is planned for future releases.
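
The timeout protection is a hard wall-clock limit on each execution. The sketch below (a hypothetical run_q_solution, assuming a q binary on PATH) illustrates the mechanism with a subprocess timeout; the harness itself executes Q via PyKX, so this is a conceptual sketch only.

import subprocess

def run_q_solution(code: str, timeout_s: float = 10.0) -> bool:
    # Hypothetical sketch: pipe a Q snippet into a local `q` process and
    # kill it if it exceeds the wall-clock limit. The real harness runs
    # code via PyKX; this only illustrates the timeout mechanism.
    try:
        result = subprocess.run(
            ["q"], input=code.encode(), capture_output=True, timeout=timeout_s
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False  # a timed-out solution counts as a failure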


Setup & Configuration

Mandatory: KDB Setup

  1. Install kdb+ with a PyKX license: use the standard kdb+ installation, or follow the KDB-X instructions
  2. For multithreaded execution: set PYKX_THREADING=1 in your environment (see the snippet below)
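
PyKX reads its configuration when it is first imported, so the variable must be in the environment before the import. A minimal sketch, if you set it from Python rather than the shell:

import os

# PYKX_THREADING must be visible before pykx is imported,
# since PyKX reads its configuration at import time.
os.environ["PYKX_THREADING"] = "1"

import pykx as kx  # import deliberately after the variable is set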

🚧 Future: We plan to support an MCP/REST API for Q execution to remove the PyKX dependency requirement.

Optional: API Keys

💡 Note: API keys are only needed for proprietary models (e.g., from OpenAI, Anthropic). You can skip this if you are using open-source Hugging Face models.

export OPENAI_API_KEY="your-key-here"
export ANTHROPIC_API_KEY="your-key-here"

Quick Start

Verify your installation by running an evaluation on an open-source model. This command should work without any API keys configured.

# Make sure you're in the Poetry environment (Poetry 2.0+)
eval $(poetry env activate)

# Run evaluation on an open-source model from Hugging Face
qeval run q-humaneval Qwen/Qwen2-1.5B-Instruct

Usage Guide: Running Evaluations

Use the run command to evaluate models using our standardized framework. This generates and executes Q code solutions in one step.

# Evaluate GPT-4.1 on Q-HumanEval (default: 50 samples for statistical significance)
qeval run q-humaneval gpt-4.1

# Evaluate an open-source model
qeval run q-humaneval google/gemma-3-4b-it

# Specify custom sample size (50 samples recommended for leaderboard submissions)
qeval run q-humaneval your-model --num-samples 50

📊 Evaluation Standard: Use 50 samples per problem for statistically significant results and leaderboard submissions.


Submission Guidelines

Help us grow the leaderboard! Submit your model evaluation results to contribute to the Q/kdb+ AI development community.

📋 Complete Submission Guide →

💬 Questions? Use GitHub Issues →


Project Reference

Dataset: Q-HumanEval

Our primary benchmark adapts HumanEval to Q/kdb+, featuring 164 problems with hand-verified solutions and comprehensive test cases.
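
For orientation, each entry plausibly mirrors the original HumanEval schema (task_id, prompt, canonical_solution, test, entry_point); the field names and the toy Q snippet below are illustrative assumptions, not verbatim dataset content.

# Hypothetical Q-HumanEval entry, assuming the schema mirrors OpenAI's HumanEval.
example_entry = {
    "task_id": "Q-HumanEval/0",                    # illustrative ID
    "prompt": "/ add: return the sum of x and y\nadd:{[x;y]",  # Q stub the model completes
    "canonical_solution": " x+y}",
    "test": "if[not 3=add[1;2]; '`failed]",        # Q assertion: signal an error on failure
    "entry_point": "add",
}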

Contributing

We welcome contributions! Key areas include new datasets, model integrations, and evaluation metrics.

# Development setup
poetry install --with dev
pre-commit install
# Run tests
poetry run pytest

Roadmap

  • Q-MBPP: Basic programming problems in Q.
  • Custom Metrics: Flexible framework for adding custom evaluation metrics beyond Pass@k.
  • Sandboxed Execution: Secure, isolated code evaluation environment.
  • MCP Server for Q: Support an MCP/REST API for Q execution.
  • Native Q Execution: Remove the PyKX dependency.

Troubleshooting

vLLM Issues

Problem: vLLM fails with NVLS-related NCCL errors on single-node setups.

Solution: Disable NVLS so NCCL falls back to a standard transport:

export NCCL_NVLS_ENABLE=0

Citation

@software{q_evaluation_harness,
  title={Q Evaluation Harness: Benchmarking LLMs on Q/kdb+ Code Generation},
  author={Miahi, Erfan and Morrison, Andrew},
  year={2025},
  url={https://github.com/kxsystems/q-evaluation-harness}
}

License

MIT License - see LICENSE for details.
