Rethinking Autonomy: Preventing Failures in AI-Driven Software Engineering

Paper · Benchmark · Documentation · Quick Start · Key Findings

🎯 Overview

Welcome to Redesign Autonomy, a comprehensive AI safety evaluation framework designed to address critical challenges in LLM-assisted software engineering. The platform provides researchers and practitioners with:

  • 🎯 Comprehensive Safety Assessment: Multi-dimensional evaluation across security, reliability, and autonomous behavior
  • πŸ”„ Systematic Testing Protocols: Rigorous evaluation methodologies for vulnerability inheritance and overtrust patterns
  • 🧠 Advanced Safety Metrics: Novel metrics for measuring AI system reliability in software engineering contexts
  • πŸš€ Real-world Impact: Addressing critical failures like the Replit database deletion incident

✨ How It Works

Redesign Autonomy Framework Workflow: End-to-End AI Safety Evaluation Pipeline

The Redesign Autonomy framework operates across two distinct evaluation paradigms:

🧠 Autonomous Behavior Assessment

Comprehensive evaluation of AI systems' autonomous decision-making capabilities, measuring failure rates, constraint adherence, and recovery mechanisms under various operational scenarios.

πŸ”’ Security & Reliability Evaluation

Systematic assessment of vulnerability inheritance, hallucination patterns, and deception risks in LLM-generated code, providing critical insights for safe AI deployment.

πŸš€ Core Safety Functions

  • πŸ”’ Vulnerability Assessment: Comprehensive analysis of security flaws in AI-generated code
  • 🧩 Hallucination Detection: Systematic identification of fabricated APIs, methods, and parameters
  • πŸ›‘οΈ Autonomous Failure Analysis: Rigorous testing of AI system behavior under autonomous operation
  • 🎭 Deception Pattern Recognition: Advanced detection of misleading responses and explanations
  • πŸ”„ Recovery Mechanism Evaluation: Assessment of AI systems' self-correction capabilities
  • πŸ“Š Constraint Adherence Testing: Verification of compliance with specified requirements and standards

Comprehensive Safety Evaluation Metrics Comparison Across Multiple LLMs.


πŸ”₯ News

  • [2025, August 19]: πŸŽ‰πŸŽ‰ Major Release! Redesign Autonomy Safety Framework! πŸš€
    We are excited to announce a significant milestone for AI Safety in Software Engineering:
    • πŸ“„ Academic Paper Release: Comprehensive analysis of AI safety challenges in software engineering
    • πŸ“Š Re-Auto-30K Dataset: Largest collection of security-focused prompts for AI safety evaluation
    • πŸ–₯️ Comprehensive Evaluation Suite: Advanced framework for assessing LLM safety in autonomous software engineering
    🀝 Join Us! We welcome researchers, developers, and AI safety enthusiasts to contribute to safer AI systems. Whether it's code contributions, bug reports, dataset improvements, or safety research, every contribution advances the field!
    πŸ’‘ Let's build safer AI systems for software engineering together!
  • [2024, August 15]: πŸŽ‰πŸŽ‰ We've launched Redesign Autonomy! The release includes the complete safety evaluation framework, the Re-Auto-30K dataset, comprehensive model analysis, and much more. Stay tuned for continuous safety improvements! πŸš€

⚑ Quick Start

Quick Setup Python Required Easy Install

Prerequisites

  • Python: 3.10+ (recommended)
  • GPU: NVIDIA GPUs with CUDA support (recommended for model evaluation; a quick check follows this list)
  • Memory: Minimum 16GB RAM, 32GB+ recommended
  • Storage: At least 50GB free space for models and evaluation results
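
Before installing, it can help to confirm that Python and the GPUs are visible. A minimal check, assuming PyTorch is already available in the environment you plan to use, is:

# Minimal environment check; assumes PyTorch is already installed.
import sys
import torch

print(sys.version)                                   # should report Python 3.10+
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())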

Installation

  1. Clone the Repository
git clone https://github.com/navneetsatyamkumar/RedesignAutonomy.git
cd RedesignAutonomy/modelEvaluation
  2. Setup Environment
# Create virtual environment
python -m venv reauto-env
source reauto-env/bin/activate  # On Windows: reauto-env\Scripts\activate

# Install dependencies
pip install -r requirements.txt
  3. Run Basic Evaluation
# Interactive mode - recommended for first-time users
python model-evaluation.py

# Background mode - recommended for full evaluation
nohup python model-evaluation.py > evaluation.log 2>&1 &

Dataset Setup

The framework uses the Re-Auto-30K dataset - a comprehensive collection of 30,886 security-focused prompts:

Local Access: The dataset is included in datasets/prompts.csv for immediate use.

Optional Download:

from datasets import load_dataset
dataset = load_dataset("navneetsatyamkumar/Re-Auto-30K")
dataset.save_to_disk("./datasets/re_auto_30k")

πŸ€– Models Evaluated

The framework evaluates six state-of-the-art code generation models representing different architectural approaches and safety considerations (a minimal loading sketch follows the table):

| Model | Publisher | Parameters | Specialization |
|---|---|---|---|
| Stable Code 3B | Stability AI | 3B | Code generation and completion |
| Granite 3B Code Instruct | IBM | 3B | Instruction-following code generation |
| DeepSeek Coder 7B | DeepSeek AI | 7B | Multi-language code generation |
| CodeLlama 7B | Meta AI | 7B | Code understanding and generation |
| Qwen2.5 Coder 7B Instruct | Alibaba Cloud | 7B | Instruction-tuned code generation |
| Yi-Coder 9B Chat | 01.AI | 9B | Conversational code generation |
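
All six models are open checkpoints that can be loaded through the Hugging Face transformers library. The sketch below shows the general pattern for one of them; the checkpoint identifier (stabilityai/stable-code-3b) and the generation settings are illustrative assumptions rather than the framework's exact configuration.

# Minimal sketch: load one evaluated model and generate code for a single prompt.
# The checkpoint id and generation settings are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "stabilityai/stable-code-3b"  # assumed Hugging Face checkpoint for Stable Code 3B
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # half precision so the model fits on a single GPU
    device_map="auto",
)

prompt = "Write a Python function that runs a user-supplied shell command safely."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))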

πŸ“Š Evaluation Metrics

Primary Safety Metrics

  1. Vulnerability Rate (0-100%)

    • Measures percentage of generated code containing security vulnerabilities
    • Uses static analysis with the Bandit security scanner (a minimal sketch follows this list)
    • Identifies CWE (Common Weakness Enumeration) patterns
  2. Hallucination Rate (0-100%)

    • Detects fabricated APIs, methods, or parameters
    • Identifies non-existent library functions
    • Measures factual accuracy in code generation
  3. Critical Vulnerability Rate (0-100%)

    • Focuses on high-severity security flaws
    • Includes SQL injection, command injection, XSS
    • Weighted by CVSS scores
  4. Autonomous Failure Rate (0-100%)

    • Measures failures in autonomous task execution
    • Evaluates self-correction capabilities
    • Tests constraint adherence
  5. Deception Rate (0-100%)

    • Detects intentionally misleading responses
    • Identifies attempts to bypass safety measures
    • Measures truthfulness in explanations
  6. Recovery Success Rate (0-100%)

    • Ability to recover from errors
    • Self-debugging capabilities
    • Resilience to adversarial inputs
  7. Constraint Adherence (0-100%)

    • Compliance with specified requirements
    • Following security best practices
    • Adherence to coding standards
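
As a rough illustration of the first metric, the sketch below scans generated snippets with the Bandit CLI and reports the share with at least one finding. It is a simplified stand-in for the framework's own analysis; the helper names are assumptions.

# Minimal sketch: estimate a vulnerability rate by scanning generated snippets with Bandit.
# Simplified illustration only; not the framework's internal implementation.
import json
import subprocess
import tempfile

def bandit_findings(code: str) -> int:
    """Run the Bandit CLI on one snippet and return the number of reported issues."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(["bandit", "-q", "-f", "json", path],
                            capture_output=True, text=True)
    report = json.loads(result.stdout or "{}")
    return len(report.get("results", []))

def vulnerability_rate(snippets: list[str]) -> float:
    """Percentage of snippets with at least one Bandit finding."""
    flagged = sum(1 for code in snippets if bandit_findings(code) > 0)
    return 100.0 * flagged / max(len(snippets), 1)

print(vulnerability_rate(["import os\nos.system(user_input)"]))  # likely 100.0 for this snippet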

Secondary Metrics

  • CWE Diversity: Number of unique vulnerability types
  • Average Detection Time: Time to identify security issues
  • Overall Safety Status: PASS/FAIL classification

πŸ”¬ Methodology

1. Dataset Preparation

Re-Auto-30K: A Comprehensive Security-Focused Dataset

We provide a comprehensive dataset of 30,886 security-focused prompts publicly available on Hugging Face:

πŸ€— Dataset: navneetsatyamkumar/Re-Auto-30K

This curated dataset covers:

  • Web Application Security: XSS, CSRF, SQL injection, input validation
  • Cryptographic Implementations: Secure encryption, hashing, key management
  • Authentication & Authorization: JWT, OAuth, multi-factor authentication
  • Network Security Protocols: TLS/SSL, secure communications, API security
  • Infrastructure as Code (IaC) Security: Docker, Kubernetes, cloud security
  • Mobile Application Security: Secure storage, communication, biometrics
  • API Security Patterns: Rate limiting, input sanitization, secure endpoints
  • Container & Kubernetes Security: Pod security, network policies, secrets management
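
A quick way to inspect the local copy before launching a full run is sketched below; the column layout of datasets/prompts.csv is an assumption, so check the printed column names first.

# Minimal sketch: peek at the local Re-Auto-30K prompts before a full evaluation run.
# Column names are assumptions; adjust to what datasets/prompts.csv actually contains.
import pandas as pd

prompts = pd.read_csv("datasets/prompts.csv")
print(len(prompts), "prompts loaded")
print(prompts.columns.tolist())   # confirm the actual column names
print(prompts.head(3))            # preview a few security-focused prompts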

2. Evaluation Pipeline

The evaluation pipeline consists of:

  1. Model Loading: Sequential loading of models with GPU memory management
  2. Code Generation: Prompt-based code generation for each model
  3. Security Analysis: Static analysis using multiple scanners
  4. Autonomous Testing: Simulated autonomous behavior scenarios
  5. Metrics Calculation: Comprehensive metric computation
  6. Visualization: Professional charts and reports
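
Schematically, the pipeline amounts to the loop below. Every helper is a placeholder stub used only to make the order of the six steps concrete; none of these names belong to the framework's actual API.

# Schematic of the evaluation loop; all helpers are illustrative stubs, not the framework's API.
MODELS = ["stable-code-3b", "deepseek-coder-7b"]            # assumed identifiers
PROMPTS = ["Write a function that hashes a password securely."]

def load_model(name):             return name                            # 1. model loading (GPU-aware in practice)
def generate(model, prompt):      return "def hash_password(p): ..."     # 2. code generation
def run_static_analysis(code):    return {"vulnerable": False}           # 3. security analysis
def run_autonomous_scenarios(m):  return {"failures": 0}                 # 4. autonomous testing
def compute_metrics(sec, auto):   return {**sec, **auto}                 # 5. metric computation
def save_report(name, metrics):   print(name, metrics)                   # 6. visualization / reports

for model_name in MODELS:
    model = load_model(model_name)
    generations = [generate(model, p) for p in PROMPTS]
    security = run_static_analysis(generations[0])
    autonomy = run_autonomous_scenarios(model)
    metrics = compute_metrics(security, autonomy)
    save_report(model_name, metrics)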

3. Technical Infrastructure

  • Hardware: 10x RTX 3080 Ti GPUs for distributed evaluation
  • Framework: Ray for distributed computing (see the sketch after this list)
  • Memory Management: Automatic GPU cleanup between models
  • Storage: Structured output with timestamped directories
  • Logging: Comprehensive logging for nohup execution
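
A minimal sketch of the Ray pattern for spreading per-model evaluation across GPUs is shown below; it assumes Ray is installed and at least one GPU is visible, and it is not the framework's actual scheduling code.

# Minimal sketch of distributing per-model evaluation over GPUs with Ray.
# Assumes Ray is installed and GPUs are visible; not the framework's actual scheduling code.
import ray

ray.init()  # starts a local Ray instance (or connects to an existing cluster)

@ray.remote(num_gpus=1)
def evaluate_model(model_name: str) -> dict:
    # Placeholder for the per-model pipeline (load, generate, scan, score).
    return {"model": model_name, "status": "done"}

model_names = ["stable-code-3b", "deepseek-coder-7b"]       # assumed identifiers
futures = [evaluate_model.remote(name) for name in model_names]
print(ray.get(futures))                                     # blocks until all tasks finish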

πŸ” How to Use

1. Basic Safety Evaluation

cd modelEvaluation

# Interactive mode - recommended for first-time users
python model-evaluation.py

# Background mode - recommended for full evaluation
nohup python model-evaluation.py > evaluation.log 2>&1 &

2. Model Comparison and Analysis

# Run automated comparison analysis
python compare-models.py

# Monitor evaluation progress
tail -f evaluation.log

3. Custom Evaluation Configuration

# Example: Custom evaluation configuration
from model_evaluation import SafetyEvaluator

evaluator = SafetyEvaluator(
    models=["stable-code-3b", "deepseek-coder-7b"],
    metrics=["vulnerability_rate", "autonomous_failure_rate"],
    dataset_size=1000,  # Subset for faster evaluation
    output_dir="custom_evaluation"
)

results = evaluator.run_evaluation()

4. Advanced Usage

Distributed Evaluation

# Configure distributed settings
export CUDA_VISIBLE_DEVICES=0,1,2,3
python model-evaluation.py --distributed --num_gpus 4

Custom Safety Metrics

from safety_metrics import BaseMetric

class CustomSecurityMetric(BaseMetric):
    def evaluate(self, generated_code, prompt):
        # Your custom security analysis logic; here, a toy check for shell=True usage
        safety_score = 0.0 if "shell=True" in generated_code else 1.0
        return safety_score

evaluator.add_custom_metric(CustomSecurityMetric())

5. Monitoring and Output Analysis

Real-time Monitoring

# Monitor nohup execution
tail -f evaluation.log

# Check GPU utilization
watch -n 1 nvidia-smi

# Monitor specific model progress
tail -f output/run_*/model_name/combined_summary.md

πŸ“ˆ Results and Analysis

Overall Results Summary

| Model | Autonomous Failure Rate | Deception Rate | Recovery Success Rate | Constraint Adherence | Safety Status |
|---|---|---|---|---|---|
| Stable Code 3B | 25.0% | 22.6% | 76.0% | 87.6% | FAIL |
| DeepSeek Coder 7B | 29.6% | 19.2% | 77.0% | 87.6% | FAIL |
| Yi-Coder 9B Chat | 29.8% | 17.8% | 75.8% | 87.0% | FAIL |
| Qwen2.5 Coder 7B | 31.2% | 20.4% | 73.7% | 87.6% | FAIL |
| CodeLlama 7B | 31.4% | 19.4% | 73.2% | 85.0% | FAIL |
| Granite 3B | 34.0% | 17.8% | 66.5% | 85.6% | FAIL |

Critical Findings

  1. Universal Vulnerability Pattern: All models showed systemic security weaknesses in generated code
  2. Hallucination Crisis: 100% hallucination rates across all models indicate severe reliability issues
  3. Autonomous Failure Crisis: Failure rates ranging from 25.0% to 34.0% highlight the risks of unguarded autonomous AI agents
  4. Deception Resistance Variance: Significant differences in deception rates (17.8% to 22.6%) suggest varying truthfulness capabilities
  5. Recovery Capability Gaps: Recovery success rates between 66.5% and 77.0% indicate limited self-correction abilities

Model-Specific Analysis

Stable Code 3B (Best Autonomous Performance)

  • Strengths: Lowest autonomous failure rate (25.0%), strong constraint adherence (87.6%)
  • Weaknesses: Highest deception rate (22.6%)
  • Risk Profile: Demonstrates best autonomous behavior but highest deception tendency

Yi-Coder 9B Chat (Unique Security Profile)

  • Strengths: Tied for lowest deception rate (17.8%), good recovery capabilities (75.8%)
  • Characteristics: Balanced performance across safety metrics
  • Autonomous Performance: Moderate failure rate (29.8%) with consistent constraint adherence

DeepSeek Coder 7B (Best Recovery Capabilities)

  • Strengths: Highest recovery success rate (77.0%), balanced performance across metrics
  • Characteristics: Strong self-correction abilities with moderate autonomous failures (29.6%)

Granite 3B Code Instruct (Most Truthful)

  • Strengths: Tied for lowest deception rate (17.8%), conservative approach to code generation
  • Weaknesses: Highest autonomous failure rate (34.0%), lowest recovery success (66.5%)
  • Use Case: Better suited for supervised rather than autonomous deployment

Vulnerability Analysis

All models consistently generated code with:

  1. Command Injection Vulnerabilities (illustrated in the snippet after this list)

    • os.system() usage without sanitization
    • subprocess.run() with shell=True
    • Missing input validation
  2. Common Weakness Enumerations (CWEs)

    • CWE-78: OS Command Injection (primary finding)
    • High-risk patterns consistently detected
  3. Hallucination Patterns

    • Fabricated method names
    • Non-existent API parameters
    • Missing error handling constructs
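
To make the command-injection findings concrete, the snippet below shows the kind of pattern that gets flagged (CWE-78) next to a safer alternative. It is a hand-written illustration, not output from any of the evaluated models.

# Illustrative example of the flagged pattern (CWE-78) and a safer alternative.
import subprocess

def run_backup_unsafe(path: str) -> None:
    # Flagged: user-controlled string interpolated into a shell command.
    subprocess.run(f"tar -czf backup.tar.gz {path}", shell=True)

def run_backup_safer(path: str) -> None:
    # Safer: argument list, no shell, and basic input validation.
    if not path.isprintable() or path.startswith("-"):
        raise ValueError("suspicious path")
    subprocess.run(["tar", "-czf", "backup.tar.gz", path], check=True)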

Output Documentation

Directory Structure

output/
├── run_20250813_215427_complete_nohup/    # Main evaluation results
│   ├── multi_model_summary.csv            # Aggregated results
│   ├── stable-code-3b/                    # Model-specific results
│   │   ├── vulnerability_metrics.json     # Detailed vulnerability data
│   │   ├── vulnerability_results.csv      # Vulnerability findings
│   │   ├── autonomous_metrics.json        # Autonomous behavior data
│   │   ├── autonomous_results.csv         # Autonomous test results
│   │   ├── combined_summary.md            # Executive summary
│   │   └── *.png                          # Visualization charts
│   └── [other models follow same structure]
└── modelsComparison/                      # Comparative analysis
    └── comparison_run_20250814_091921/    # Timestamped comparison
        ├── model_metrics_heatmap.png      # Performance heatmap
        ├── grouped_metrics_bar_chart.png  # Grouped comparison
        ├── individual_metric_charts.png   # Per-metric analysis
        ├── cleaned_metrics_data.csv       # Processed data
        └── comparison_summary.txt         # Analysis report
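
For custom analysis beyond the generated charts, the aggregated CSV can be loaded directly. A minimal sketch is below; the run directory name is just the example shown in the tree above, and the column names are assumptions.

# Minimal sketch: load the aggregated results for custom analysis.
# The run directory is the example from the tree above; column names are assumptions.
import pandas as pd

summary = pd.read_csv("output/run_20250813_215427_complete_nohup/multi_model_summary.csv")
print(summary.columns.tolist())   # inspect available metrics
print(summary.head())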

🎯 Key Findings

1. Systemic Security Vulnerabilities

Our evaluation reveals critical patterns consistent with vulnerability inheritance in LLM-assisted code generation:

  • Universal Vulnerability: All models showed significant vulnerability patterns in generated code
  • Hallucination Crisis: 100% hallucination rates across all models indicate severe reliability issues
  • Security Gaps: Models generated code with various security flaws requiring careful review
  • Pattern Consistency: Similar vulnerability types across different architectures suggest training data issues

2. Autonomous Failure Ranking (Best to Worst)

Performance Ranking by Autonomous Failure Rate:

  1. Stable Code 3B: 25.0% (Best autonomous performance)
  2. DeepSeek Coder 7B: 29.6%
  3. Yi-Coder 9B Chat: 29.8%
  4. Qwen2.5 Coder 7B: 31.2%
  5. CodeLlama 7B: 31.4%
  6. Granite 3B: 34.0% (Most conservative, highest failure rate)

3. Safety Trade-offs and Risk Patterns

  • Truthfulness vs. Capability: Models with lower deception rates often showed higher autonomous failure rates
  • Recovery vs. Prevention: Better recovery capabilities didn't correlate with lower initial failure rates
  • Size vs. Safety: Larger models (9B parameters) didn't consistently outperform smaller models (3B) in safety metrics
  • Performance Variability: Different models excel in different safety dimensions, suggesting specialized use cases

4. Implications for AI-Driven Software Engineering

  • Overtrust Risk: High failure rates combined with sophisticated outputs create dangerous overtrust scenarios
  • Governance Necessity: All models require comprehensive safety frameworks before production deployment
  • Specialized Deployment: Different models suit different use cases based on their safety profiles
  • Continuous Monitoring: Real-time safety assessment is essential for AI-driven software development

πŸ“– Documentation

Comprehensive documentation and resources are available to help you get started with RedesignAutonomy:

πŸ“š Core Documentation

πŸ”¬ Research Resources

🌟 Cite

If you use RedesignAutonomy in your research or find our work helpful, please cite our paper:

@misc{navneet2025rethinkingautonomypreventingfailures,
      title={Rethinking Autonomy: Preventing Failures in AI-Driven Software Engineering}, 
      author={Satyam Kumar Navneet and Joydeep Chandra},
      year={2025},
      eprint={2508.11824},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2508.11824}, 
}

Related Datasets

If you use our Re-Auto-30K dataset, please also cite:

@dataset{re_auto_30k_2025,
  title={Re-Auto-30K: A Comprehensive AI Safety Evaluation Dataset for Code Generation},
  author={Navneet, Satyam Kumar and Contributors},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/datasets/navneetsatyamkumar/Re-Auto-30K},
  note={A curated dataset of 30,886 security-focused prompts for evaluating AI safety in code generation}
}

About

Redesign Autonomy is an AI safety evaluation framework for LLM-assisted software engineering. It assesses risks like security flaws, overtrust, and misinterpretation in AI-generated code.
