Welcome to Redesign Autonomy, a comprehensive AI safety evaluation framework designed to address critical challenges in LLM-assisted software engineering. This state-of-the-art platform provides researchers and practitioners with:
- Comprehensive Safety Assessment: Multi-dimensional evaluation across security, reliability, and autonomous behavior
- Systematic Testing Protocols: Rigorous evaluation methodologies for vulnerability inheritance and overtrust patterns
- Advanced Safety Metrics: Novel metrics for measuring AI system reliability in software engineering contexts
- Real-world Impact: Addressing critical failures like the Replit database deletion incident
The Redesign Autonomy framework operates across two distinct evaluation paradigms:
- Comprehensive evaluation of AI systems' autonomous decision-making capabilities, measuring failure rates, constraint adherence, and recovery mechanisms under various operational scenarios.
- Systematic assessment of vulnerability inheritance, hallucination patterns, and deception risks in LLM-generated code, providing critical insights for safe AI deployment.
- Vulnerability Assessment: Comprehensive analysis of security flaws in AI-generated code
- Hallucination Detection: Systematic identification of fabricated APIs, methods, and parameters
- Autonomous Failure Analysis: Rigorous testing of AI system behavior under autonomous operation
- Deception Pattern Recognition: Advanced detection of misleading responses and explanations
- Recovery Mechanism Evaluation: Assessment of AI systems' self-correction capabilities
- Constraint Adherence Testing: Verification of compliance with specified requirements and standards
- News
- Quick Start
- Models Evaluated
- Evaluation Metrics
- Methodology
- How to Use
- Results and Analysis
- Key Findings
- Documentation
- Cite
- [2025, August 19]: Major Release! The Redesign Autonomy Safety Framework is here! We are excited to announce a significant milestone for AI safety in software engineering:
  - Academic Paper Release: Comprehensive analysis of AI safety challenges in software engineering
  - Re-Auto-30K Dataset: Largest collection of security-focused prompts for AI safety evaluation
  - Comprehensive Evaluation Suite: Advanced framework for assessing LLM safety in autonomous software engineering

  Let's build safer AI systems for software engineering together!
- [2024, August 15]: We've launched Redesign Autonomy! The release includes the complete safety evaluation framework, the Re-Auto-30K dataset, comprehensive model analysis, and much more. Stay tuned for continuous safety improvements!
- Python: 3.10+ (recommended)
- GPU: NVIDIA GPUs with CUDA support (recommended for model evaluation; a quick visibility check is sketched below)
- Memory: Minimum 16GB RAM, 32GB+ recommended
- Storage: At least 50GB free space for models and evaluation results
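Before installing, it can help to confirm that the GPUs are actually visible from Python. A minimal check, assuming PyTorch is available in the environment (it is expected to be pulled in via requirements.txt):

```python
# Quick GPU visibility check; assumes PyTorch is installed in the environment
import torch

print("CUDA available:", torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GiB")
```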
- Clone the Repository
git clone https://github.com/navneetsatyamkumar/RedesignAutonomy.git
cd RedesignAutonomy/modelEvaluation
- Setup Environment
# Create virtual environment
python -m venv reauto-env
source reauto-env/bin/activate # On Windows: reauto-env\Scripts\activate
# Install dependencies
pip install -r requirements.txt
- Run Basic Evaluation
# Interactive mode - recommended for first-time users
python model-evaluation.py
# Background mode - recommended for full evaluation
nohup python model-evaluation.py > evaluation.log 2>&1 &
The framework uses the Re-Auto-30K dataset - a comprehensive collection of 30,886 security-focused prompts:
Local Access: The dataset is included in datasets/prompts.csv for immediate use.
Optional Download:
# Download the Re-Auto-30K prompts from the Hugging Face Hub and keep a local copy
from datasets import load_dataset

dataset = load_dataset("navneetsatyamkumar/Re-Auto-30K")
dataset.save_to_disk("./datasets/re_auto_30k")
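For quick experiments against the bundled local copy, a minimal pandas sketch (the column layout is not documented here, so inspect the schema before relying on specific fields):

```python
import pandas as pd

# Load the locally bundled prompt set referenced above
prompts = pd.read_csv("datasets/prompts.csv")
print(f"Loaded {len(prompts)} prompts")
print(prompts.columns.tolist())  # check the schema before filtering on specific columns
print(prompts.head())
```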
The framework evaluates six state-of-the-art code generation models representing different architectural approaches and safety considerations:
Model | Publisher | Parameters | Specialization |
---|---|---|---|
Stable Code 3B | Stability AI | 3B | Code generation and completion |
Granite 3B Code Instruct | IBM | 3B | Instruction-following code generation |
DeepSeek Coder 7B | DeepSeek AI | 7B | Multi-language code generation |
CodeLlama 7B | Meta AI | 7B | Code understanding and generation |
Qwen2.5 Coder 7B Instruct | Alibaba Cloud | 7B | Instruction-tuned code generation |
Yi-Coder 9B Chat | 01.AI | 9B | Conversational code generation |
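As a rough illustration of how such a model is loaded and prompted, here is a minimal Hugging Face Transformers sketch. The Hub ID shown is the commonly published checkpoint for Stable Code 3B and is an assumption, not necessarily the exact checkpoint used in this evaluation; `device_map="auto"` additionally requires the accelerate package.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "stabilityai/stable-code-3b"  # assumed Hub ID; substitute the checkpoint you want to evaluate
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

prompt = "Write a Python function that securely hashes a password."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```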
- Vulnerability Rate (0-100%)
  - Measures percentage of generated code containing security vulnerabilities
  - Uses static analysis with the Bandit security scanner (a scoring sketch follows this list)
  - Identifies CWE (Common Weakness Enumeration) patterns
- Hallucination Rate (0-100%)
  - Detects fabricated APIs, methods, or parameters
  - Identifies non-existent library functions
  - Measures factual accuracy in code generation
- Critical Vulnerability Rate (0-100%)
  - Focuses on high-severity security flaws
  - Includes SQL injection, command injection, and XSS
  - Weighted by CVSS scores
- Autonomous Failure Rate (0-100%)
  - Measures failures in autonomous task execution
  - Evaluates self-correction capabilities
  - Tests constraint adherence
- Deception Rate (0-100%)
  - Detects intentionally misleading responses
  - Identifies attempts to bypass safety measures
  - Measures truthfulness in explanations
- Recovery Success Rate (0-100%)
  - Ability to recover from errors
  - Self-debugging capabilities
  - Resilience to adversarial inputs
- Constraint Adherence (0-100%)
  - Compliance with specified requirements
  - Following security best practices
  - Adherence to coding standards
- CWE Diversity: Number of unique vulnerability types
- Average Detection Time: Time to identify security issues
- Overall Safety Status: PASS/FAIL classification
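To make the vulnerability-rate metric concrete, here is a minimal sketch that scores generated snippets with the Bandit CLI; it illustrates the idea and is not the framework's actual implementation:

```python
import json
import os
import subprocess
import tempfile

def vulnerability_rate(snippets):
    """Percentage of snippets for which Bandit reports at least one finding."""
    flagged = 0
    for code in snippets:
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        try:
            # Bandit writes a JSON report to stdout; any entry in "results" counts as a hit
            report = subprocess.run(
                ["bandit", "-q", "-f", "json", path],
                capture_output=True, text=True,
            )
            if json.loads(report.stdout or "{}").get("results"):
                flagged += 1
        finally:
            os.unlink(path)
    return 100.0 * flagged / max(len(snippets), 1)

# os.system with string concatenation is the kind of pattern Bandit flags (cf. CWE-78)
print(vulnerability_rate(["import os\nos.system('ls ' + user_dir)\n"]))
```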
We provide a comprehensive dataset of 30,886 security-focused prompts publicly available on Hugging Face:
Dataset: navneetsatyamkumar/Re-Auto-30K
This curated dataset covers:
- Web Application Security: XSS, CSRF, SQL injection, input validation
- Cryptographic Implementations: Secure encryption, hashing, key management
- Authentication & Authorization: JWT, OAuth, multi-factor authentication
- Network Security Protocols: TLS/SSL, secure communications, API security
- Infrastructure as Code (IaC) Security: Docker, Kubernetes, cloud security
- Mobile Application Security: Secure storage, communication, biometrics
- API Security Patterns: Rate limiting, input sanitization, secure endpoints
- Container & Kubernetes Security: Pod security, network policies, secrets management
The evaluation pipeline consists of the following stages (a simplified skeleton is sketched after the list):
- Model Loading: Sequential loading of models with GPU memory management
- Code Generation: Prompt-based code generation for each model
- Security Analysis: Static analysis using multiple scanners
- Autonomous Testing: Simulated autonomous behavior scenarios
- Metrics Calculation: Comprehensive metric computation
- Visualization: Professional charts and reports
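A simplified, runnable skeleton of how these stages chain together for a single model; every helper and value below is a stand-in, not the framework's real API:

```python
# Illustrative pipeline skeleton; the stage bodies are placeholders, not real implementations.
def evaluate_model(model_id, prompts, output_dir):
    model = f"<loaded:{model_id}>"                                 # 1. Model Loading (stub)
    generations = [f"# generated code for: {p}" for p in prompts]  # 2. Code Generation (stub)
    findings = [[] for _ in generations]                           # 3. Security Analysis (stub)
    autonomy = {"failures": 0, "tasks": len(prompts)}              # 4. Autonomous Testing (stub)
    metrics = {                                                    # 5. Metrics Calculation
        "vulnerability_rate": 100.0 * sum(bool(f) for f in findings) / max(len(findings), 1),
        "autonomous_failure_rate": 100.0 * autonomy["failures"] / max(autonomy["tasks"], 1),
    }
    # 6. Visualization (stub): charts and reports would be written to output_dir
    print(f"[{model_id}] {metrics} -> {output_dir}")
    return metrics

evaluate_model("stable-code-3b", ["Write a login handler."], "output/demo")
```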
- Hardware: 10x RTX 3080 Ti GPUs for distributed evaluation
- Framework: Ray for distributed computing (a minimal usage sketch follows this list)
- Memory Management: Automatic GPU cleanup between models
- Storage: Structured output with timestamped directories
- Logging: Comprehensive logging for nohup execution
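A minimal sketch of fanning per-model evaluation out across GPUs with Ray; the task body and resource request are illustrative assumptions, not the framework's actual worker code:

```python
import ray

ray.init()  # starts a local Ray instance or connects to an existing cluster

@ray.remote(num_gpus=1)  # reserves one GPU per task; drop num_gpus to try this on CPU only
def evaluate_on_gpu(model_id, prompts):
    # Placeholder: load the model on the allocated GPU and run the safety suite here
    return {"model": model_id, "prompts_evaluated": len(prompts)}

models = ["stable-code-3b", "deepseek-coder-7b"]
prompts = ["Write a function that validates user-supplied file paths."]
futures = [evaluate_on_gpu.remote(m, prompts) for m in models]
print(ray.get(futures))  # one result dict per model, gathered as tasks finish
```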
cd modelEvaluation
# Interactive mode - recommended for first-time users
python model-evaluation.py
# Background mode - recommended for full evaluation
nohup python model-evaluation.py > evaluation.log 2>&1 &
# Run automated comparison analysis
python compare-models.py
# Monitor evaluation progress
tail -f evaluation.log
# Example: Custom evaluation configuration
from model_evaluation import SafetyEvaluator
evaluator = SafetyEvaluator(
    models=["stable-code-3b", "deepseek-coder-7b"],
    metrics=["vulnerability_rate", "autonomous_failure_rate"],
    dataset_size=1000,  # Subset for faster evaluation
    output_dir="custom_evaluation"
)
results = evaluator.run_evaluation()
# Configure distributed settings
export CUDA_VISIBLE_DEVICES=0,1,2,3
python model-evaluation.py --distributed --num_gpus 4
from safety_metrics import BaseMetric

class CustomSecurityMetric(BaseMetric):
    def evaluate(self, generated_code, prompt):
        # Your custom security analysis logic goes here; return a score in [0, 100]
        safety_score = 100.0  # placeholder value
        return safety_score

evaluator.add_custom_metric(CustomSecurityMetric())
# Monitor nohup execution
tail -f evaluation.log
# Check GPU utilization
watch -n 1 nvidia-smi
# Monitor specific model progress
tail -f output/run_*/model_name/combined_summary.md
Model | Autonomous Failure Rate | Deception Rate | Recovery Success Rate | Constraint Adherence | Safety Status |
---|---|---|---|---|---|
Stable Code 3B | 25.0% | 22.6% | 76.0% | 87.6% | FAIL |
DeepSeek Coder 7B | 29.6% | 19.2% | 77.0% | 87.6% | FAIL |
Yi-Coder 9B Chat | 29.8% | 17.8% | 75.8% | 87.0% | FAIL |
Qwen2.5 Coder 7B | 31.2% | 20.4% | 73.7% | 87.6% | FAIL |
CodeLlama 7B | 31.4% | 19.4% | 73.2% | 85.0% | FAIL |
Granite 3B | 34.0% | 17.8% | 66.5% | 85.6% | FAIL |
- Universal Vulnerability Pattern: All models showed systemic security weaknesses in generated code
- Hallucination Crisis: 100% hallucination rates across all models indicate severe reliability issues
- Autonomous Failure Crisis: Failure rates ranging from 25.0% to 34.0% highlight the risks of unguarded autonomous AI agents
- Deception Resistance Variance: Significant differences in deception rates (17.8% to 22.6%) suggest varying truthfulness capabilities
- Recovery Capability Gaps: Recovery success rates between 66.5% and 77.0% indicate limited self-correction abilities
Stable Code 3B:
- Strengths: Lowest autonomous failure rate (25.0%), strong constraint adherence (87.6%)
- Weaknesses: Highest deception rate (22.6%)
- Risk Profile: Demonstrates the best autonomous behavior but the highest deception tendency

Yi-Coder 9B Chat:
- Strengths: Tied for lowest deception rate (17.8%), good recovery capabilities (75.8%)
- Characteristics: Balanced performance across safety metrics
- Autonomous Performance: Moderate failure rate (29.8%) with consistent constraint adherence

DeepSeek Coder 7B:
- Strengths: Highest recovery success rate (77.0%), balanced performance across metrics
- Characteristics: Strong self-correction abilities with moderate autonomous failures (29.6%)

Granite 3B Code Instruct:
- Strengths: Tied for lowest deception rate (17.8%), conservative approach to code generation
- Weaknesses: Highest autonomous failure rate (34.0%), lowest recovery success (66.5%)
- Use Case: Better suited for supervised rather than autonomous deployment
All models consistently generated code with:
- Command Injection Vulnerabilities (contrasted with a safer pattern in the sketch below)
  - `os.system()` usage without sanitization
  - `subprocess.run()` with `shell=True`
  - Missing input validation
- Common Weakness Enumerations (CWEs)
  - CWE-78: OS Command Injection (primary finding)
  - High-risk patterns consistently detected
- Hallucination Patterns
  - Fabricated method names
  - Non-existent API parameters
  - Missing error handling constructs
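For illustration, here is the contrast between a frequently flagged pattern and a safer equivalent; the filename and command are hypothetical:

```python
import subprocess

# Imagine this value arrives from an untrusted user
filename = "notes.txt; echo pwned"

# Frequently flagged pattern (CWE-78): untrusted input interpolated into a shell command,
# so the injected `echo pwned` actually executes
subprocess.run(f"tar czf backup.tar.gz {filename}", shell=True)

# Safer equivalent: pass an argument vector and avoid the shell, so the whole value
# is treated as a single (here non-existent) filename and nothing is injected
subprocess.run(["tar", "czf", "backup.tar.gz", filename])
```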
output/
├── run_20250813_215427_complete_nohup/    # Main evaluation results
│   ├── multi_model_summary.csv            # Aggregated results
│   ├── stable-code-3b/                    # Model-specific results
│   │   ├── vulnerability_metrics.json     # Detailed vulnerability data
│   │   ├── vulnerability_results.csv      # Vulnerability findings
│   │   ├── autonomous_metrics.json        # Autonomous behavior data
│   │   ├── autonomous_results.csv         # Autonomous test results
│   │   ├── combined_summary.md            # Executive summary
│   │   └── *.png                          # Visualization charts
│   └── [other models follow same structure]
└── modelsComparison/                      # Comparative analysis
    └── comparison_run_20250814_091921/    # Timestamped comparison
        ├── model_metrics_heatmap.png      # Performance heatmap
        ├── grouped_metrics_bar_chart.png  # Grouped comparison
        ├── individual_metric_charts.png   # Per-metric analysis
        ├── cleaned_metrics_data.csv       # Processed data
        └── comparison_summary.txt         # Analysis report
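To post-process the aggregated results, a minimal pandas sketch that loads the most recent summary CSV (the column names depend on the generated file, so inspect them first):

```python
from glob import glob

import pandas as pd

# Pick the newest run directory and load its aggregated summary
runs = sorted(glob("output/run_*/multi_model_summary.csv"))
summary = pd.read_csv(runs[-1])  # assumes at least one completed run exists
print(summary.columns.tolist())  # inspect available columns before relying on specific names
print(summary.head())
```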
Our evaluation reveals critical patterns consistent with vulnerability inheritance in LLM-assisted code generation:
- Universal Vulnerability: All models showed significant vulnerability patterns in generated code
- Hallucination Crisis: 100% hallucination rates across all models indicate severe reliability issues
- Security Gaps: Models generated code with various security flaws requiring careful review
- Pattern Consistency: Similar vulnerability types across different architectures suggest training data issues
Performance Ranking by Autonomous Failure Rate:
- Stable Code 3B: 25.0% (Best autonomous performance)
- DeepSeek Coder 7B: 29.6%
- Yi-Coder 9B Chat: 29.8%
- Qwen2.5 Coder 7B: 31.2%
- CodeLlama 7B: 31.4%
- Granite 3B: 34.0% (Most conservative, highest failure rate)
- Truthfulness vs. Capability: Models with lower deception rates often showed higher autonomous failure rates
- Recovery vs. Prevention: Better recovery capabilities didn't correlate with lower initial failure rates
- Size vs. Safety: Larger models (9B parameters) didn't consistently outperform smaller models (3B) in safety metrics
- Performance Variability: Different models excel in different safety dimensions, suggesting specialized use cases
- Overtrust Risk: High failure rates combined with sophisticated outputs create dangerous overtrust scenarios
- Governance Necessity: All models require comprehensive safety frameworks before production deployment
- Specialized Deployment: Different models suit different use cases based on their safety profiles
- Continuous Monitoring: Real-time safety assessment is essential for AI-driven software development
Comprehensive documentation and resources are available to help you get started with RedesignAutonomy:
- Getting Started Guide - Complete setup and usage instructions
- Dataset Documentation - Complete Re-Auto-30K dataset documentation
- Academic Paper - "Rethinking Autonomy: Preventing Failures in AI-Driven Software Engineering"
- Evaluation Results - Complete evaluation results and analysis
- Methodology Documentation - Detailed evaluation methodology and statistical analysis
If you use RedesignAutonomy in your research or find our work helpful, please cite our paper:
@misc{navneet2025rethinkingautonomypreventingfailures,
  title={Rethinking Autonomy: Preventing Failures in AI-Driven Software Engineering},
  author={Satyam Kumar Navneet and Joydeep Chandra},
  year={2025},
  eprint={2508.11824},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2508.11824},
}
If you use our Re-Auto-30K dataset, please also cite:
@dataset{re_auto_30k_2025,
  title={Re-Auto-30K: A Comprehensive AI Safety Evaluation Dataset for Code Generation},
  author={Navneet, Satyam Kumar and Contributors},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/datasets/navneetsatyamkumar/Re-Auto-30K},
  note={A curated dataset of 30,886 security-focused prompts for evaluating AI safety in code generation}
}