
An intelligent code debugging agent that automatically fixes broken Python functions using LLM-powered iterative refinement.
Patchwork is a Python framework for automated code debugging and repair. It combines large language models with dynamic tool execution to iteratively analyze, test, and fix broken code. The agent handles complex debugging scenarios, including performance optimization, broken plotting functions, and algorithmic corrections, using a suite of specialized tools.
- LLM-Powered Debugging: Uses GPT-4.1 series models for intelligent code analysis and repair
- Iterative Refinement: Automatically iterates through fix attempts until success or timeout
- Comprehensive Tool Suite: Extensible tool system with 4 specialized debugging tools
- Multi-Level Evaluation: 3-tier evaluation framework (deterministic, objective, and LLM-based scoring)
- Best-of-N Sampling: Generate multiple solutions and pick the best one
- Results Visualization: Built-in plotting and analysis tools for performance comparison
- Flexible Configuration: Support for different models, temperatures, and iteration limits
- Rich Test Dataset: 5 diverse problem types covering common debugging scenarios
```bash
# Clone the repository
git clone https://github.com/tejaskhot/patchwork.git
cd patchwork

# Install dependencies using uv (recommended) or pip
uv sync
# OR
pip install -e .

# Set up your API key
export OPENAI_API_KEY="your_api_key_here"
```
```python
from agent import create_agent, ProblemContext

# Create an agent
agent = create_agent(
    model="gpt-4.1-nano",  # or gpt-4.1-mini, gpt-4.1
    max_iterations=5,
    temperature=0.1
)

# Define your problem
problem = ProblemContext(
    entry_point="fibonacci",
    goal="Fix the fibonacci function to be efficient for large inputs",
    quality_criteria="Must handle n=30 in under 1 second",
    tests_formatted="fibonacci(5) should return 5, fibonacci(10) should return 55",
    broken_code="""
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)  # Too slow!
"""
)

# Run the agent
solution = agent.run(problem)
print(solution)
```
```bash
# Run with built-in test problems
python run_patchwork.py --problem filter_top_students --model gpt-4.1-nano
python run_patchwork.py --problem plot_line_chart --model gpt-4.1-mini

# List all available test problems
python run_patchwork.py --list-problems

# Run with custom parameters
python run_patchwork.py --problem remove_outliers --model gpt-4.1 --max-iterations 10

# Batch testing with multiple problems
python run_patchwork.py --batch --model gpt-4.1-nano
```
```bash
# Generate performance comparison plots
python plot_results.py --model-comparison

# Create per-problem heatmaps
python plot_results.py --heatmap

# Analyze specific model performance
python plot_results.py --model gpt-4.1-nano --problems
```
- `gpt-4.1-nano` - Smallest, fastest, most cost-effective (best for simple problems)
- `gpt-4.1-mini` - Balanced performance and cost (recommended for most use cases)
- `gpt-4.1` - Most capable, highest quality (complex debugging scenarios)
Patchwork supports any model available through LiteLLM, including models from OpenAI, Anthropic, Google, Cohere, Hugging Face, and many other providers. Simply use the appropriate model identifier when creating your agent.
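For example, an Anthropic model can be selected by passing its LiteLLM identifier when creating the agent. The snippet below is illustrative: the exact model name follows LiteLLM's naming scheme, and the corresponding `ANTHROPIC_API_KEY` must be set.

```python
from agent import create_agent

# Any LiteLLM-compatible identifier can be passed as `model`;
# the Anthropic model name here is only an example.
agent = create_agent(
    model="anthropic/claude-3-5-sonnet-20241022",
    max_iterations=5,
    temperature=0.1,
)
```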
- `PatchworkAgent`: Main agent class orchestrating the debugging process via the ReAct pattern (conceptual loop sketched below)
- `ToolRegistry`: Dynamic tool discovery and management system
- `ProblemContext`: Structured representation of debugging problems with validation
- Multi-Level Evaluator: Comprehensive 3-tier evaluation framework
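Conceptually, the agent runs a ReAct-style loop: the model reasons about the current failure, picks a tool, observes the result, and repeats until the tests pass or the iteration limit is reached. The sketch below illustrates that loop in isolation; it is not Patchwork's actual implementation, and the callback names are hypothetical.

```python
def react_debug_loop(choose_action, call_tool, tests_pass, max_iterations=5):
    """Simplified ReAct-style loop (illustrative only, not Patchwork's code).

    choose_action(history) -> (tool_name, tool_args): the LLM reasons and picks a tool.
    call_tool(tool_name, tool_args) -> observation string, e.g. from run_tests or lint.
    tests_pass(observation) -> bool: True once the candidate fix verifies.
    """
    history = []
    for _ in range(max_iterations):
        tool_name, tool_args = choose_action(history)    # Thought + Action
        observation = call_tool(tool_name, tool_args)    # Observation
        history.append((tool_name, tool_args, observation))
        if tests_pass(observation):                      # stop early on success
            break
    return history
```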
The agent automatically discovers and uses these specialized debugging tools:
- `run_tests` - Primary tool for verifying code correctness. Executes code against test cases in a secure, isolated subprocess, compares function output against expected results, and returns formatted test summaries.
- `lint` - Static code analysis using pylint to identify errors, style violations, and potential bugs without executing code. Returns quality scores (0-10) and specific issue reports to help improve code quality and adherence to Python best practices.
- `run_with_debugger` - Deep execution analysis tool that uses Python's `sys.settrace` to provide insight into failing test cases. Pinpoints exact error locations, captures local variable states at failure points, and provides detailed execution traces for understanding why code is broken (see the sketch after this list).
- `inspect_plot` - Specialized matplotlib validation tool that executes plotting code using the non-interactive 'Agg' backend. Inspects generated plot objects for visual properties like titles, axis labels, line colors, and styles without rendering images, enabling automated visual debugging.
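The mechanism behind `run_with_debugger` can be illustrated with a standalone `sys.settrace` sketch. This is not Patchwork's implementation, only the underlying technique, and the function names are illustrative: a trace hook records the line number and local variables at the first point an exception is raised.

```python
import sys

def capture_failure_state(func, *args, **kwargs):
    """Run func and record the line and local variables where it first fails."""
    captured = {}

    def tracer(frame, event, arg):
        # Record only the innermost (first) exception event.
        if event == "exception" and "error" not in captured:
            exc_type, exc_value, _ = arg
            captured["line"] = frame.f_lineno
            captured["locals"] = dict(frame.f_locals)
            captured["error"] = f"{exc_type.__name__}: {exc_value}"
        return tracer  # keep tracing nested frames

    sys.settrace(tracer)
    try:
        func(*args, **kwargs)
    except Exception:
        pass  # the failure details were captured by the tracer
    finally:
        sys.settrace(None)
    return captured

def broken_average(values):
    total = sum(values)
    return total / len(values)  # fails on an empty list

print(capture_failure_state(broken_average, []))
# {'line': ..., 'locals': {'values': [], 'total': 0}, 'error': 'ZeroDivisionError: division by zero'}
```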
- Success Rate: Binary completion (1.0 if all tests pass)
- Completion Rate: Percentage of individual tests passed
- Efficiency Score: Inverse relationship to tool calls used
- Invalid Action Penalty: Deductions for tool errors or failures
- Regression Penalty: Penalties for decreasing test pass rates
- Linter Score: Objective code quality assessment (0-10 scale)
- Code Elegance Score: LLM-based assessment of code quality and style
- Strategic Efficiency Score: LLM evaluation of debugging approach quality
Unified Patchwork Score: Combines all of the above metrics into a single performance measure
The framework includes 5 diverse problem types:
- `filter_top_students` - List filtering and sorting
- `group_by_first_letter` - Dictionary grouping with normalization
- `plot_line_chart` - Matplotlib visualization with styling
- `remove_outliers` - Statistical analysis and outlier detection
- `generate_slug` - String processing and URL slug generation
Each problem includes:
- Broken reference implementation
- Comprehensive test cases
- Quality criteria for evaluation
- Expected outputs for validation
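For illustration, here is a hypothetical sketch of how a problem such as `generate_slug` could be expressed with the `ProblemContext` fields shown in the Quick Start; the actual built-in definition in the repository may differ in wording, tests, and broken code.

```python
from agent import ProblemContext

# Hypothetical problem definition; the real generate_slug entry may differ.
problem = ProblemContext(
    entry_point="generate_slug",
    goal="Fix generate_slug to produce clean, URL-safe slugs",
    quality_criteria="Lowercase output, words separated by single hyphens, no stray punctuation",
    tests_formatted='generate_slug("Hello, World!") should return "hello-world"',
    broken_code='''
def generate_slug(title):
    return title.replace(" ", "-")  # Ignores case and punctuation
''',
)
```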
```bash
export OPENAI_API_KEY="your_openai_key"
export ANTHROPIC_API_KEY="your_anthropic_key"  # Optional, for Claude models
export VERBOSE=1         # Enable detailed logging
export DEBUG_LITELLM=1   # Enable LiteLLM debug logs
```
```python
from tools.registry import ToolRegistry

# The registry automatically discovers tools.
# Add new tools to the tools/ directory following the existing pattern.

def my_custom_tool(code: str, context: str) -> str:
    """
    Custom debugging tool description.

    Args:
        code: The code to analyze
        context: Additional context for analysis

    Returns:
        Analysis results as string
    """
    # Your tool implementation
    return "analysis results"

# Tools are automatically registered when placed in the tools/ directory
```
See `example_usage.py` for comprehensive examples, including:
- Basic fibonacci optimization with error handling
- Custom tool registry usage and dependency injection
- Best-of-N sampling with temperature control (see the sketch after this list)
- Advanced logging and session statistics
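As referenced above, Best-of-N sampling can be approximated with just the APIs shown in this README: run the agent several times at a higher temperature and keep the candidate with the best Patchwork score. The sketch below is illustrative and may differ from the version in `example_usage.py`.

```python
from agent import create_agent
from evals import PatchworkEvaluator

def best_of_n(problem, test_cases, n=3):
    """Illustrative Best-of-N loop; not necessarily how example_usage.py does it."""
    evaluator = PatchworkEvaluator()
    best_solution, best_score = None, float("-inf")
    for _ in range(n):
        # A higher temperature encourages diverse candidate fixes.
        agent = create_agent(model="gpt-4.1-mini", max_iterations=5, temperature=0.7)
        solution = agent.run(problem)
        _, patchwork_score = evaluator.evaluate(
            run_log=agent.get_run_log(),
            test_cases=test_cases,
            original_code=problem.broken_code,
            entry_point=problem.entry_point,
        )
        if patchwork_score.score > best_score:
            best_solution, best_score = solution, patchwork_score.score
    return best_solution, best_score
```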
```python
from evals import PatchworkEvaluator

# Evaluate a single solution
evaluator = PatchworkEvaluator()
metrics, patchwork_score = evaluator.evaluate(
    run_log=agent.get_run_log(),
    test_cases=test_cases,
    original_code=broken_code,
    entry_point="function_name"
)

print(f"Success Rate: {metrics.success_rate:.1%}")
print(f"Patchwork Score: {patchwork_score.score:.4f}")
```
Results are automatically organized by model:
```
results/
├── gpt-4.1-nano/
│   ├── run_log_problem_name_timestamp.json
│   └── batch_summary_timestamp.json
├── gpt-4.1-mini/
└── gpt-4.1/
```
Each result file contains:
- Complete debugging session logs
- Tool call history with structured results
- Evaluation metrics and scores
- Timestamps and metadata
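For quick ad-hoc inspection, a run log is ordinary JSON. The sketch below only assumes the directory layout shown above and a lexicographically sortable timestamp in the filename; it makes no assumptions about the log's internal schema beyond it being a JSON object.

```python
import json
from pathlib import Path

# Pick the most recent run log for a model (assumes timestamps sort lexicographically).
logs = sorted(Path("results/gpt-4.1-nano").glob("run_log_*.json"))
latest = logs[-1]

with latest.open() as f:
    run_log = json.load(f)

print(latest.name)
print(list(run_log.keys()))  # top-level sections of the debugging session log
```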
Core dependencies are managed in `pyproject.toml`:
- LiteLLM: Multi-provider LLM access
- Pydantic: Data validation and settings management
- Matplotlib/Seaborn: Plotting and visualization
- Pandas/NumPy: Data analysis and manipulation
- Pylint: Code quality analysis
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Make your changes following the existing patterns
- Run tests and ensure code formatting with Black
- Submit a pull request
MIT License - see LICENSE file for details.
Generate updated performance visualizations by running:
```bash
python plot_results.py --model-comparison
python plot_results.py --heatmap
```
Results are based on experimental runs with the built-in test dataset. Performance may vary based on problem complexity, model configuration, and other factors.