Cyber-AutoAgent v0.1.0 Benchmark Results 7-1-25 #12
Summary
The Cyber-AutoAgent v0.1.0, using AWS Bedrock with us.anthropic.claude-3-7-sonnet-20250219-v1:0, achieved **45.92% accuracy** across 98 XBOW benchmarks (6 runs did not execute in our harness for this run). All results from the latest test and evaluation run are in results.zip, and the harness PR from @aggr0cr4g should be up later this week.
While demonstrating strong capabilities in direct exploitation scenarios, the agent faces significant challenges with blind vulnerabilities, cryptographic attacks, and more complex multi-step exploits.
Context For Using XBOW Benchmarks
XBOW benchmarks are vulnerability scenarios created by pentesting firms, closely replicating vulnerability classes from SQL injection to IDOR and SSRF. The closed-source XBOW itself achieves 75% accuracy on these benchmarks; the most senior human pentester (20+ years of experience) solved 85% in 40 hours, while XBOW did so in 28 minutes.
Metrics
Performance Visualization
Overall Accuracy
Failure Mode Analysis
What Worked
1. Direct Vulnerability Exploitation
The agent excels when vulnerabilities have clear, linear exploitation paths:
2. Standard Cyber Tool Integration
When standard cyber tools work out-of-the-box, success rates are high:
3. Adaptive Payload Crafting
Some impressive evasion capabilities: for example, falling back to `onfocus` with `autofocus` after `<script>` and `onerror` were blocked.
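As a purely illustrative sketch of that kind of payload fallback (the target URL, parameter name, and payload list below are assumptions, not details from the benchmark):

```python
import requests

# Hypothetical reflected-XSS test point; not taken from the benchmarks.
TARGET = "http://victim.example/search"

# Ordered fallbacks: if <script> and onerror vectors are filtered,
# try an autofocus/onfocus vector that fires without user interaction.
PAYLOADS = [
    "<script>alert(1)</script>",
    "<img src=x onerror=alert(1)>",
    "<input autofocus onfocus=alert(1)>",
]

for payload in PAYLOADS:
    response = requests.get(TARGET, params={"q": payload})
    if payload in response.text:  # reflected without being stripped or encoded
        print("surviving vector:", payload)
        break
```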
What Didn't Work
1. Meta-Tooling Creation Death Spiral (28% of failures - 15/53)
Pattern: Agent creates Python tools with syntax errors, then creates new tools instead of fixing them.
Real Example - XBEN-103-24:
Root Cause: No error correction loop. Agent treats syntax errors as logic failures.
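For concreteness, a minimal sketch of the missing error-correction loop is below; `repair_fn` stands in for a model call that rewrites the broken tool and is an assumption, not part of the current agent:

```python
import ast

def validate_and_repair(tool_source: str, repair_fn, max_attempts: int = 3) -> str:
    """Syntax-check generated tool code and repair the *same* tool on failure,
    instead of spawning a brand-new tool for every SyntaxError."""
    for _ in range(max_attempts):
        try:
            ast.parse(tool_source)   # cheap syntax check before any execution
            return tool_source       # syntactically valid -> hand off to the executor
        except SyntaxError as err:
            # Feed the exact error back so the fix targets the real problem.
            tool_source = repair_fn(tool_source, err)
    raise RuntimeError(f"tool still broken after {max_attempts} repair attempts")
```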
2. Blind Exploitation (0% success on 3 blind SQLi benchmarks)
Data:
Why It Failed: Agent relies on visible output. When responses are filtered or delayed, it exhausts all automated payloads without adapting strategy.
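For context, time-based probing is one way an agent could adapt when output is invisible. The sketch below is illustrative only; the URL, parameter, and MySQL-style payload are assumptions, not the benchmarks' actual details:

```python
import time
import requests

URL = "http://victim.example/item"   # hypothetical injection point

def is_delayed(payload: str, threshold: float = 4.0) -> bool:
    """True if the injected delay measurably slows the response, which is
    evidence of blind SQLi even when the page content never changes."""
    start = time.monotonic()
    requests.get(URL, params={"id": payload}, timeout=15)
    return time.monotonic() - start > threshold

if not is_delayed("1") and is_delayed("1 AND SLEEP(5)"):
    print("time-based blind SQL injection is likely")
```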
3. Lost Context in Multi-Step Attacks (75% failure rate when 3+ steps are needed for a specific exploit within a run)
Concrete Failure - XBEN-092-24:
Average steps before context loss: 3.3
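One way to picture the missing capability is an explicit attack-chain state object that carries artifacts forward between steps; the sketch below is a hypothetical illustration, not code from the agent:

```python
from dataclasses import dataclass, field

@dataclass
class AttackState:
    """Carries artifacts between exploit steps so a later step can reuse
    what an earlier one recovered (session cookies, leaked IDs, etc.)."""
    objective: str
    completed_steps: list = field(default_factory=list)
    artifacts: dict = field(default_factory=dict)

    def record(self, step: str, **found) -> None:
        self.completed_steps.append(step)
        self.artifacts.update(found)

# Hypothetical three-step chain: each step reads what the previous one stored.
state = AttackState(objective="read admin flag")
state.record("login bypass", session="deadbeef")
state.record("enumerate users", admin_id="42")
# Step 3 builds its request from state.artifacts instead of re-deriving context.
```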
4. Objective Completion Detection
Critical Finding - XBEN-006-24:
Data: At least 4 benchmarks show clear evidence of finding the flag but being marked as failed. This represents ~7.5% of all "failures" that may actually be successes. More review is needed.
Why: The agent's flag extraction logic fails to recognize variations in flag format or location within output.
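A more tolerant extraction pass could recover these cases. The sketch below is illustrative; the flag formats shown are assumptions, not the benchmarks' actual formats:

```python
import re

# Accept common variations: FLAG{...}, flag{...}, or "flag: <token>".
FLAG_PATTERNS = [
    re.compile(r"flag\{[^}]+\}", re.IGNORECASE),
    re.compile(r"flag\s*[:=]\s*[A-Za-z0-9_\-]+", re.IGNORECASE),
]

def extract_flag(output: str):
    """Return the first flag-like match anywhere in tool output, or None."""
    for pattern in FLAG_PATTERNS:
        match = pattern.search(output)
        if match:
            return match.group(0)
    return None

print(extract_flag("...noise... FLAG{example_value} ...more noise..."))
```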
Root Cause Analysis
Time Distribution Impact
Meta Tooling Creation Anti-Pattern
When the agent creates custom tools:
Agentic Improvements (Toward More Autonomous Capabilities)
- Autonomous Tool Debugging Loop: a `debug_tool()` that parses exceptions and auto-corrects
- Self-Installing Dependencies: `pip install` → retry (see the sketch after this list)
- Objective Completion Validator
- Causal Reasoning State Machine
- Adaptive Planning and Reflection
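As a rough illustration of the self-installing dependencies idea (the `ensure_module` helper below is hypothetical, not part of the agent):

```python
import importlib
import subprocess
import sys

def ensure_module(name: str):
    """Import a module, installing it on the fly if missing, then retry:
    the 'missing dependency -> pip install -> retry' loop from the list above."""
    try:
        return importlib.import_module(name)
    except ModuleNotFoundError:
        subprocess.check_call([sys.executable, "-m", "pip", "install", name])
        return importlib.import_module(name)

requests = ensure_module("requests")   # works even if requests wasn't preinstalled
```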
What This Tells Us
Open Questions for the Community
Bottom Line
The Cyber-AutoAgent v0.1.0 captured flags in 45+ of 98 benchmarks. The path toward improved capabilities should include self-corrective debugging for meta-tooling, advanced planning and decomposition, autonomous dependency management, and causal reasoning for multi-step attacks.
With agentic improvements (tool self-repair, auto-installing packages, and adaptive strategy switching), we should be able to reach a 65-70% success rate.
Big thanks to @aggr0cr4g for working on the test and evaluation harness and for providing the data!