An open-source framework by KX for evaluating Large Language Models on Q/kdb+ code generation tasks.
This project introduces the first standardized evaluation benchmark for Q/kdb+, addressing a critical gap in assessing Large Language Models (LLMs) for this specialized programming language. The lack of such a benchmark has limited meaningful measurement and progress in Q code generation.
Our evaluation harness provides a robust and rigorous framework, beginning with a Q-language adaptation of OpenAI’s HumanEval. Our roadmap includes integrating additional benchmarks, such as a Q-language port of MBPP, empowering the community to effectively evaluate and advance LLM capabilities for Q.
Track the performance of Large Language Models on Q/kdb+ code generation tasks using our standardized evaluation framework.
| Rank | Model | Pass@1 | Pass@5 | Pass@10 |
|---|---|---|---|---|
| 🥇 | Grok 4 | 43.37% | 68.45% | 74.32% |
| 🥈 | Claude 4 Sonnet | 37.70% | 53.47% | 59.13% |
| 🥉 | Gemini 2.5 Pro | 27.75% | 51.41% | 59.68% |
📈 View Complete Leaderboard →
See full results, historical data, and detailed analysis
- 🚀 Simple CLI: One-command evaluation with `qeval run <dataset> <model>`.
- 📊 Q-HumanEval Dataset: 164 hand-crafted Q programming problems.
- 🔧 Multi-Model Support: Supports both closed-source APIs and open-source Hugging Face models.
- 📈 Standard Metrics: Pass@1, Pass@5, Pass@10 with isolated execution.
- ⏱️ Timeout Protection: Code execution with configurable timeout limits.
# Clone the repository
git clone https://github.com/KxSystems/q-evaluation-harness.git
cd q-evaluation-harness
# Install dependencies with Poetry
poetry install
# Activate the Poetry environment (Poetry 2.0+)
eval $(poetry env activate)
Requirements: Python 3.10+, Poetry, and a kdb+ license.
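As a quick sanity check of the tooling prerequisites (the exact commands depend on your setup; the kdb+ license itself is picked up by PyKX at runtime):

```bash
# Confirm the Python and Poetry prerequisites before installing
python3 --version   # should report 3.10 or newer
poetry --version
```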
⚠️ Security Warning: This tool executes generated code in your local environment. While we provide timeout protection, the execution is not sandboxed. Only run evaluations with trusted models and in isolated environments. Sandboxed execution is planned for future releases.
- Install kdb+ with a PyKX license: use the standard kdb+ install or follow KDB-X.
- For multithreaded execution, set `PYKX_THREADING=1` in your environment (as shown below).
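For example, to enable it for the current shell session:

```bash
# Enable multithreaded execution in PyKX for this shell session
export PYKX_THREADING=1
```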
🚧 Future: We plan to support MCP/REST API for Q execution to remove the PyKX dependency requirement.
💡 Note: API keys are only needed for proprietary models (e.g., from OpenAI, Anthropic). You can skip this if you are using open-source Hugging Face models.
export OPENAI_API_KEY="your-key-here"
export ANTHROPIC_API_KEY="your-key-here"
Verify your installation by running an evaluation on an open-source model. This command should work without any API keys configured.
# Make sure you're in the Poetry environment (Poetry 2.0+)
eval $(poetry env activate)
# Run evaluation on an open-source model from Hugging Face
qeval run q-humaneval Qwen/Qwen2-1.5B-Instruct
Use the `run` command to evaluate models using our standardized framework. This generates and executes Q code solutions in one step.
# Evaluate GPT-4.1 on Q-HumanEval (default: 50 samples for statistical significance)
qeval run q-humaneval gpt-4.1
# Evaluate an open-source model
qeval run q-humaneval google/gemma-3-4b-it
# Specify custom sample size (50 samples recommended for leaderboard submissions)
qeval run q-humaneval your-model --num-samples 50
📊 Evaluation Standard: Use 50 samples per problem for statistically significant results and leaderboard submissions.
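For reference, Pass@k is conventionally computed with the unbiased estimator from the original HumanEval paper; assuming this harness follows that convention, generating n samples per problem and counting the c samples that pass gives:

$$\text{pass@}k = \mathbb{E}_{\text{problems}}\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right]$$

Generating n = 50 samples keeps this estimate stable for k up to 10, which is why 50 samples per problem are recommended above.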
Help us grow the leaderboard! Submit your model evaluation results to contribute to the Q/kdb+ AI development community.
Our primary benchmark adapts HumanEval to Q/kdb+, featuring 164 problems with hand-verified solutions and comprehensive test cases.
We welcome contributions! Key areas include new datasets, model integrations, and evaluation metrics.
# Development setup
poetry install --with dev
pre-commit install
# Run tests
poetry run pytest
- Q-MBPP: Basic programming problems in Q.
- Custom Metrics: Flexible framework for adding custom evaluation metrics beyond Pass@k.
- Sandboxed Execution: Secure, isolated code evaluation environment.
- MCP Server for Q: Support MCP/REST API for Q execution to remove the PyKX dependency requirement.
- Native Q Execution: Remove the PyKX dependency.
Problem: vLLM fails with NVLS-related NCCL errors on single-node setups.
Solution: Disable NVLS to force fallback:
export NCCL_NVLS_ENABLE=0
@software{q_evaluation_harness,
title={Q Evaluation Harness: Benchmarking LLMs on Q/kdb+ Code Generation},
author={Miahi, Erfan and Morrison, Andrew},
year={2025},
url={https://github.com/kxsystems/q-evaluation-harness}
}
MIT License - see LICENSE for details.