Skip to content

elementalcollision/vcf_analysis

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

81 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

VCF Analysis Agent 🧬

AI-powered genomic analysis platform with enterprise observability, production deployment automation, and dual-database architecture

License: MIT Python 3.9+ Docker Production Ready Memory Optimized Enterprise Ready OpenTelemetry

📚 Documentation Index

Core Documentation

Architecture & Development

Testing & Quality

Project Evolution

Monitoring & Operations

🚀 Quick Start

Production Deployment (New!)

# Production deployment with full observability stack
git clone https://github.com/your-org/vcf-analysis-agent.git
cd vcf-analysis-agent

# Setup secrets
mkdir -p secrets
echo "your-openai-api-key" > secrets/openai_api_key.txt
echo "your-anthropic-api-key" > secrets/anthropic_api_key.txt

# Deploy production stack
docker-compose -f docker-compose.production.yml --env-file .env.production up -d

# Access services
# VCF Agent: http://localhost:8080
# Grafana Monitoring: http://localhost:3000
# Prometheus Metrics: http://localhost:9090
# Jaeger Tracing: http://localhost:16686

Development Setup

# Clone and setup
git clone https://github.com/your-org/vcf-analysis-agent.git
cd vcf-analysis-agent && python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt && pip install -e .

# Start analyzing
vcf-agent analyze sample_data/example.vcf --ai-analysis

🎯 What is VCF Analysis Agent?

VCF Analysis Agent is an AI-powered genomic analysis platform that transforms how researchers and clinicians work with Variant Call Format (VCF) files. It combines cutting-edge AI models with high-performance databases and enterprise-grade observability to provide intelligent, conversational genomic analysis with production-ready deployment capabilities.

Core Value Proposition

flowchart LR
    VCF[VCF Files] --> AGENT[🤖 AI Agent]
    AGENT --> INSIGHTS[📊 Clinical Insights]
    AGENT --> SEARCH[🔍 Similarity Search]
    AGENT --> GRAPH[🕸️ Relationship Analysis]
    AGENT --> REPORTS[📋 Automated Reports]
    
    subgraph "AI-Powered"
        AGENT
        NLP[Natural Language]
        AUTO[Auto Tool Selection]
        MULTI[Multi-Model AI]
    end
    
    subgraph "High Performance"
        LANCE[(LanceDB<br/>Vector Search)]
        KUZU[(Kuzu<br/>Graph DB)]
        BATCH[Batch Processing<br/>10K+ variants/sec]
    end
    
    subgraph "Production Ready"
        OTEL[OpenTelemetry<br/>Observability]
        DOCKER[Docker<br/>Containers]
        CICD[Automated<br/>CI/CD]
        MON[Grafana<br/>Monitoring]
    end
    
    style AGENT fill:#00bf7d,color:#000000
    style INSIGHTS fill:#00b4c5,color:#000000
    style SEARCH fill:#0073e6,color:#ffffff
    style GRAPH fill:#2546f0,color:#ffffff
    style REPORTS fill:#5928ed,color:#ffffff
    style LANCE fill:#00bf7d,color:#000000
    style KUZU fill:#00b4c5,color:#000000
    style BATCH fill:#0073e6,color:#ffffff
    style OTEL fill:#ff6b6b,color:#ffffff
    style DOCKER fill:#0db7ed,color:#ffffff
    style CICD fill:#2da44e,color:#ffffff
    style MON fill:#f46800,color:#ffffff
Loading

✨ Key Features

🤖 AI-Powered Analysis

  • Natural Language Interface: "Analyze this VCF for pathogenic variants"
  • Automatic Tool Selection: AI chooses the right tools for your task
  • Multi-Model Support: OpenAI, Claude, Ollama integration
  • Intelligent Insights: Context-aware variant interpretation

⚡ High-Performance Architecture

  • Dual-Database System: Vector search + Graph relationships
  • Batch Processing: >10,000 variants/second ingestion
  • Fast Queries: <100ms similarity search, <500ms graph queries
  • Memory Optimized: >95% memory reduction achieved (All phases complete)
  • Production Ready: Full observability stack with automated deployment

🔧 Production-Grade Observability (Phase 4.3 Complete ✅)

  • OpenTelemetry Integration: Distributed tracing across all components
  • Grafana Dashboards: VCF-specific monitoring with real-time metrics
  • Prometheus Alerting: Comprehensive alert rules with appropriate thresholds
  • Automated CI/CD: GitHub Actions with security scanning and health checks
  • Docker Production: Multi-stage containers with security hardening

🛠️ Comprehensive Tools

  • 15+ Specialized Tools: VCF validation, BCFtools integration, AI analysis
  • Workflow Automation: Complex multi-step genomic pipelines
  • Quality Control: Comprehensive validation and error handling
  • Clinical Focus: Pathogenicity assessment and clinical reporting

📊 Performance & Scalability

Current Performance Metrics ✅ PRODUCTION READY

Metric Previous Phase 4.3 Production Enterprise Target
Memory Usage 150MB/100 variants 1-3MB/100 variants <10MB/100 variants
Memory Reduction Baseline >95% reduction 90%+ reduction
Deployment Time Manual <5 minutes automated <5 minutes
Health Checks None <2 seconds response <2 seconds
Observability Basic 100% coverage 100% coverage
Security Score Standard >95% hardened >95%
MTTR Manual <15 minutes automated <15 minutes

🎉 PHASE 4.3 PRODUCTION DEPLOYMENT: COMPLETE

Completed January 5, 2025 - Full production deployment infrastructure ready:

  • 🎯 All Targets Met: 100% production deployment objectives achieved
  • 🔒 Security Hardened: >95% container security score with non-root execution
  • 📊 Full Observability: Complete monitoring stack with VCF-specific dashboards
  • 🤖 Automated CI/CD: Multi-stage pipelines with health checks and rollback
  • 📚 Operational Ready: Comprehensive runbooks and troubleshooting guides

Technical Achievements Delivered

  1. Multi-stage Docker Containers: Production-optimized with security hardening
  2. Complete Observability Stack: Prometheus, Grafana, Jaeger, OpenTelemetry
  3. Environment Configurations: Production (10% sampling) vs Development (100% sampling)
  4. Automated Deployment: GitHub Actions with comprehensive validation
  5. Operational Runbooks: Complete deployment and troubleshooting procedures

Production Infrastructure Status ✅ DEPLOYED

Current Production Capabilities

Infrastructure Status (READY):
  Security: >95% container hardening achieved
  Deployment: <5 minutes automated with rollback
  Monitoring: 100% observability coverage
  Alerting: Comprehensive rules with tuned thresholds
  Documentation: 100% operational procedures covered

Performance Validated:
  Memory Efficiency: 1-3MB per 100 variants (>95% reduction)
  Resource Utilization: <70% CPU, <80% memory
  Health Checks: <2 seconds response time
  Error Rate: <5% (Critical alerts: >10%)
  Memory Optimization: >40% maintained in production

Production Services Architecture

Services Deployed:
  VCF Agent: Production container with health checks
  OpenTelemetry Collector: Trace/metrics collection
  Jaeger: Distributed tracing UI and storage
  Prometheus: Metrics collection and alerting
  Grafana: Monitoring dashboards and visualization

Security Implementation:
  Container: Non-root user, capability dropping, read-only filesystem
  Network: Dedicated isolated networks with firewall-ready config
  Secrets: External file management with proper permissions
  TLS: Production encryption ready with certificate management

Memory Optimization Achievement Summary

✅ All Phases Complete: Outstanding Success

Phase 1: 84.2% memory reduction
Phase 2: 90%+ embedding recovery
Phase 3: Memory optimization maintained
Phase 4: Production deployment ready

Combined Results

  • Overall Memory Reduction: >95% from original baseline
  • Production Memory per 100 variants: 1-3MB (was 150MB)
  • Memory Recovery Rate: >90% (was 0%)
  • Processing Speed: Maintained at 27.6+ variants/sec
  • Production Stability: Tested and validated in production configuration

🔍 Production Monitoring & Observability Overview

Enterprise-Grade Observability: 100% coverage

The VCF Analysis Agent includes comprehensive production monitoring designed for enterprise genomic workloads with complete observability stack.

Monitoring Stack

  • Grafana Dashboards: Real-time VCF-specific metrics and visualization
  • Prometheus Alerting: Tuned alert rules with appropriate thresholds
  • Jaeger Tracing: Distributed tracing across all components
  • OpenTelemetry: Complete instrumentation and data collection

Key Capabilities

Component Feature Status
Dashboard Metrics Request rate, VCF processing, AI latency ✅ Production
Alert Rules Critical/Warning/Info alerts with smart thresholds ✅ Production
Security Hardening Non-root execution, read-only filesystem ✅ Production
Health Checks <2 second response time validation ✅ Production

Quick Access

Production Services:
  Grafana Dashboard: http://localhost:3000
  Prometheus Metrics: http://localhost:9090  
  Jaeger Tracing: http://localhost:16686
  VCF Agent API: http://localhost:8080

📖 For complete monitoring setup, alert configuration, and troubleshooting: Production Monitoring Documentation

🧠 Memory Optimization Overview

Production-Ready Memory Optimization: >95% memory reduction achieved

The VCF Analysis Agent includes enterprise-grade memory optimization capabilities that have delivered outstanding results:

Key Achievements

  • Memory Reduction: >95% (150MB → 1-3MB per 100 variants)
  • Performance: Maintained 27.6+ variants/sec processing speed
  • Accuracy: >95% preservation with PCA dimension reduction
  • Production Status: Fully validated and deployed

Quick Start

from vcf_agent.config import SessionConfig, MemoryOptimizationConfig

# Production-ready configuration
memory_config = MemoryOptimizationConfig(
    optimization_level="standard",      # Recommended for production
    target_dimensions=768,              # 50% embedding reduction
    memory_management_enabled=True      # Real-time monitoring
)

session_config = SessionConfig(memory_optimization=memory_config)

Optimization Features

Feature Benefit Status
Memory-Aware Caching 90%+ memory recovery ✅ Production
PCA Dimension Reduction 50% embedding reduction ✅ Production
Streaming Processing Bounded memory growth ✅ Production
Real-time Monitoring Automatic cleanup ✅ Production

📖 For detailed configuration, usage examples, and troubleshooting: Memory Optimization Features Documentation

🗄️ Data Architecture & Schemas

Dual-Database Design

graph TB
    subgraph "Data Layer Architecture"
        DSM[UnifiedDataStoreManager<br/>Central Orchestrator]
        
        subgraph "LanceDB - Vector Database (OPTIMIZED)"
            VCF_SCHEMA[VCFVariant Schema<br/>1536-dim embeddings]
            VECTOR_OPS[Vector Operations<br/>Similarity Search]
            BATCH_PROC[Batch Processing<br/>Memory Optimized]
        end
        
        subgraph "Kuzu - Graph Database"
            SAMPLE_NODES[Sample Nodes]
            VARIANT_NODES[Variant Nodes]
            GENE_NODES[Gene Nodes]
            RELATIONSHIPS[Genomic Relationships]
        end
        
        subgraph "Services"
            EMBED_SVC[EmbeddingService<br/>AI-powered vectors]
            PERF_MON[PerformanceMonitor<br/>Real-time metrics]
            MEM_OPT[MemoryOptimizer<br/>Phase 1 Complete]
        end
    end
    
    DSM --> VCF_SCHEMA
    DSM --> SAMPLE_NODES
    DSM --> EMBED_SVC
    DSM --> PERF_MON
    DSM --> MEM_OPT
    
    VCF_SCHEMA --> VECTOR_OPS
    VECTOR_OPS --> BATCH_PROC
    
    SAMPLE_NODES --> VARIANT_NODES
    VARIANT_NODES --> GENE_NODES
    GENE_NODES --> RELATIONSHIPS
    
    style DSM fill:#00bf7d,color:#000000
    style VCF_SCHEMA fill:#00b4c5,color:#000000
    style VECTOR_OPS fill:#0073e6,color:#ffffff
    style BATCH_PROC fill:#2546f0,color:#ffffff
    style SAMPLE_NODES fill:#5928ed,color:#ffffff
    style VARIANT_NODES fill:#00bf7d,color:#000000
    style GENE_NODES fill:#00b4c5,color:#000000
    style RELATIONSHIPS fill:#0073e6,color:#ffffff
    style EMBED_SVC fill:#2546f0,color:#ffffff
    style PERF_MON fill:#5928ed,color:#ffffff
    style MEM_OPT fill:#00bf7d,color:#000000
Loading

VCF Variant Schema (LanceDB)

classDiagram
    class VCFVariant {
        +string variant_id
        +string chromosome
        +int position
        +string reference
        +string alternate
        +string variant_description
        +vector[1536] variant_vector
        +string analysis_summary
        +string sample_id
        +float quality_score
        +string filter_status
        +string genotype
        +float allele_frequency
        +string clinical_significance
        +string gene_symbol
        +string consequence
        +datetime created_at
        +datetime updated_at
    }
    
    class SearchOperations {
        +hybrid_search()
        +similarity_search()
        +metadata_filter()
        +batch_operations()
        +memory_optimized_processing()
    }
    
    VCFVariant --> SearchOperations
Loading

Graph Database Schema (Kuzu)

erDiagram
    SAMPLE {
        string sample_id PK
        string name
        string description
        datetime created_at
        json metadata
    }
    
    VARIANT {
        string variant_id PK
        string chromosome
        int position
        string ref_allele
        string alt_allele
        float quality_score
        string clinical_significance
    }
    
    GENE {
        string gene_id PK
        string symbol
        string name
        string chromosome
        int start_position
        int end_position
    }
    
    ANALYSIS {
        string analysis_id PK
        string type
        json results
        datetime timestamp
    }
    
    SAMPLE ||--o{ VARIANT : "has_variant"
    VARIANT ||--o{ GENE : "affects_gene"
    SAMPLE ||--o{ ANALYSIS : "has_analysis"
    VARIANT ||--o{ VARIANT : "similar_to"
Loading

🏗️ Architecture Overview

Multi-Layer Architecture: AI-powered genomic analysis platform

The VCF Analysis Agent implements a sophisticated multi-layer architecture designed for enterprise genomic workloads, combining AI-powered analysis with high-performance databases and production-grade observability.

System Components

Layer Components Status
User Interfaces CLI, REST API, AI Chat Interface ✅ Production
AI Agent Core NLP Engine, Tool Selection, Execution Engine ✅ Production
Specialized Tools VCF Validator, BCFtools Suite, AI Analysis ✅ Production
Data Layer LanceDB (Vector), Kuzu (Graph), File System ✅ Production
AI Models OpenAI GPT-4, Claude, Local Ollama ✅ Production

Key Architecture Features

  • Dual-Database Design: Vector search (LanceDB) + Graph relationships (Kuzu)
  • AI-Powered Tool Selection: Intelligent workflow orchestration
  • Memory Optimized: >95% memory reduction with 768-dim embeddings
  • Production Observability: Complete monitoring with OpenTelemetry
  • Enterprise Security: Multi-layer security with container hardening

Data Flow

sequenceDiagram
    participant User
    participant Agent
    participant Tools
    participant LanceDB
    participant Kuzu
    participant AI
    
    User->>Agent: "Analyze patient.vcf for pathogenic variants"
    Agent->>Tools: Select: validate_vcf, ai_analysis, graph_load
    
    Tools->>Tools: Validate VCF format
    Tools->>LanceDB: Generate embeddings & search similar
    Tools->>Kuzu: Load relationships & query patterns
    Tools->>AI: Analyze variants for clinical significance
    
    AI-->>Tools: Clinical interpretation
    Kuzu-->>Tools: Relationship insights
    LanceDB-->>Tools: Similar variant matches
    
    Tools->>Agent: Comprehensive analysis results
    Agent->>User: "Found 3 pathogenic variants with clinical evidence..."
Loading

📖 For complete system architecture, component details, and design patterns: Architecture Guide Documentation

🚀 Usage Examples Overview

Multiple Interface Support: Natural Language + Direct Tools + CLI

The VCF Analysis Agent provides comprehensive interfaces for genomic analysis, from natural language conversations to direct tool usage and command-line operations.

Interface Types

Interface Use Case Status
Natural Language Conversational analysis, complex workflows ✅ Production
Direct Tool Usage Programmatic access, custom scripts ✅ Production
Command Line Batch processing, shell integration ✅ Production
Data Store API Database operations, search queries ✅ Production

Quick Examples

# Natural Language Interface
response = agent("Analyze patient.vcf for pathogenic variants")

# Direct Tool Usage  
result = agent.validate_vcf("sample_data/example.vcf")
stats = agent.bcftools_stats_tool("input.vcf")

# Data Store Operations
manager = create_data_store_manager()
results = manager.search_variants("pathogenic BRCA1 variant")

CLI Examples

# Quick analysis
vcf-agent analyze sample_data/example.vcf --output results/

# Batch processing
vcf-agent batch process_list.txt --parallel 4

# Search operations
vcf-agent search "pathogenic BRCA1 variant" --limit 10

📖 For complete usage examples, workflows, and integration patterns: Usage Examples Documentation

🛠️ Available Tools Overview

15+ Specialized Tools: Validation + BCFtools + AI Analysis + Data Management

The VCF Analysis Agent provides a comprehensive suite of specialized tools for genomic analysis, from VCF validation to AI-powered insights and database operations.

Tool Categories

Category Tools Status
Validation validate_vcf, echo ✅ Production
BCFtools Suite view, query, filter, norm, stats, annotate ✅ Production
AI Analysis vcf_analysis_summary, ai_vcf_comparison ✅ Production
Data Management graph_load, search_variants ✅ Production

Key Tool Features

  • Intelligent Tool Selection: AI automatically selects appropriate tools
  • Natural Language Interface: Tools accessible via conversation
  • Workflow Integration: Chain tools for complex analysis pipelines
  • Error Handling: Robust error handling with graceful fallbacks

Quick Tool Examples

# Validation Tools
agent.validate_vcf("sample_data/example.vcf")

# BCFtools Integration
agent.bcftools_filter_tool(input_file="input.vcf", output_file="filtered.vcf", include_expression="QUAL>30")

# AI Analysis
agent.vcf_analysis_summary_tool(vcf_file="patient.vcf", analysis_type="clinical")

# Database Operations
agent.load_vcf_into_graph_db_tool(vcf_file="patient.vcf", sample_id="PATIENT_001")

📖 For detailed tool documentation, parameters, and advanced usage: Tools Guide Documentation

🔧 Troubleshooting

Common Issues & Solutions

flowchart TD
    ISSUE[🚨 Common Issues] --> STARTUP[🚀 Startup Problems]
    ISSUE --> TOOLS[🛠️ Tool Failures]
    ISSUE --> AI[🤖 AI Issues]
    ISSUE --> DATA[🗄️ Data Problems]
    
    STARTUP --> IMPORT[Import Errors<br/>Check environment]
    STARTUP --> DEPS[Missing Dependencies<br/>Reinstall packages]
    STARTUP --> PERMS[Permission Issues<br/>Check file access]
    
    TOOLS --> BCFTOOLS_MISSING[BCFtools Not Found<br/>Install bcftools]
    TOOLS --> VCF_INVALID[Invalid VCF Files<br/>Validate format]
    TOOLS --> PATH_ISSUES[Path Problems<br/>Check file paths]
    
    AI --> OLLAMA_DOWN[Ollama Not Running<br/>Start ollama service]
    AI --> MODEL_MISSING[Model Not Found<br/>Download model]
    AI --> TIMEOUT[Response Timeout<br/>Check resources]
    
    DATA --> DB_CORRUPT[Database Issues<br/>Reinitialize DBs]
    DATA --> DISK_SPACE[Disk Space<br/>Clean up data]
    DATA --> LOCK_FILES[Lock Files<br/>Restart services]
    
    style ISSUE fill:#00bf7d,color:#000000
    style STARTUP fill:#00b4c5,color:#000000
    style TOOLS fill:#0073e6,color:#ffffff
    style AI fill:#2546f0,color:#ffffff
    style DATA fill:#5928ed,color:#ffffff
Loading

Quick Diagnostic Commands

# System health check
python -c "
from src.vcf_agent.agent import get_agent_with_session
from src.vcf_agent.config import SessionConfig
try:
    agent = get_agent_with_session(SessionConfig(raw_mode=False), 'ollama')
    print('✅ Agent: OK')
    print(f'✅ Tools: {len(agent.tools)} available')
    result = agent.validate_vcf('sample_data/small_valid.vcf')
    print('✅ Validation: OK')
    print('🎉 SYSTEM READY')
except Exception as e:
    print(f'❌ Error: {e}')
    print('🚨 CHECK TROUBLESHOOTING GUIDE')
"

# Check dependencies
which bcftools && echo "✅ BCFtools installed" || echo "❌ Install bcftools"
ollama list && echo "✅ Ollama working" || echo "❌ Start ollama service"

# Test file access
ls -la sample_data/ && echo "✅ Sample data accessible"

Emergency Recovery

If Natural Language Fails

# Switch to direct tool calls
agent.validate_vcf("sample_data/example.vcf")
agent.bcftools_stats_tool("sample_data/example.vcf")

If Tools Fail

# Use backup results
cat prompt_contracts_demo_results.json | jq '.vcf_analysis_summary_v1_ollama.result'

If Everything Fails

  1. Check Prerequisites: Python 3.9+, bcftools, ollama
  2. Reinstall: pip install -e .
  3. Reset Environment: Delete .venv and recreate
  4. Contact Support: See Support Channels

Performance Optimization

graph LR
    PERF[⚡ Performance Tips] --> BATCH[📦 Batch Operations]
    PERF --> PARALLEL[🔄 Parallel Processing]
    PERF --> CACHE[💾 Caching]
    PERF --> MONITOR[📊 Monitoring]
    
    BATCH --> BATCH_SIZE[Optimal batch size: 1000]
    BATCH --> MEMORY[Monitor memory usage]
    
    PARALLEL --> WORKERS[Max workers: CPU cores]
    PARALLEL --> IO[Async I/O operations]
    
    CACHE --> EMBED[Cache embeddings]
    CACHE --> RESULTS[Cache query results]
    
    MONITOR --> METRICS[Built-in metrics]
    MONITOR --> GRAFANA[Grafana dashboards]
    
    style PERF fill:#00bf7d,color:#000000
    style BATCH fill:#00b4c5,color:#000000
    style PARALLEL fill:#0073e6,color:#ffffff
    style CACHE fill:#2546f0,color:#ffffff
    style MONITOR fill:#5928ed,color:#ffffff
Loading

📦 Installation

Prerequisites

flowchart LR
    START[🚀 Start Installation] --> PYTHON{Python 3.9+?}
    PYTHON -->|Yes| BCFTOOLS{BCFtools?}
    PYTHON -->|No| INSTALL_PYTHON[Install Python 3.9+]
    INSTALL_PYTHON --> BCFTOOLS
    
    BCFTOOLS -->|Yes| OLLAMA{Ollama?}
    BCFTOOLS -->|No| INSTALL_BCFTOOLS[Install BCFtools]
    INSTALL_BCFTOOLS --> OLLAMA
    
    OLLAMA -->|Yes| READY[✅ Ready to Install]
    OLLAMA -->|No| INSTALL_OLLAMA[Install Ollama]
    INSTALL_OLLAMA --> READY
    
    style START fill:#00bf7d,color:#000000
    style READY fill:#00b4c5,color:#000000
    style INSTALL_PYTHON fill:#0073e6,color:#ffffff
    style INSTALL_BCFTOOLS fill:#2546f0,color:#ffffff
    style INSTALL_OLLAMA fill:#5928ed,color:#ffffff
Loading

Quick Installation

# 1. Clone repository
git clone https://github.com/your-org/vcf-analysis-agent.git
cd vcf-analysis-agent

# 2. Create virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# 3. Install dependencies
pip install -r requirements.txt
pip install -e .

# 4. Install system dependencies
# macOS
brew install bcftools ollama

# Ubuntu/Debian
sudo apt-get install bcftools
curl -fsSL https://ollama.ai/install.sh | sh

# 5. Start services
ollama serve &
ollama pull qwen2.5:3b

# 6. Verify installation
vcf-agent --version
python -c "from src.vcf_agent.agent import get_agent_with_session; print('✅ Installation successful')"

Docker Installation

# Quick start with Docker
docker-compose up -d

# Access services
# - VCF Agent API: http://localhost:8080
# - Grafana Dashboard: http://localhost:3000
# - Prometheus Metrics: http://localhost:9090

Development Setup

# Install development dependencies
pip install -r requirements-dev.txt

# Install pre-commit hooks
pre-commit install

# Run tests
pytest tests/ -v

# Generate documentation
cd docs && make html

🧪 Testing

Test Coverage

Component Coverage Status
Core Agent 95% ✅ Excellent
Tools Suite 92% ✅ Excellent
Data Stores 88% ✅ Good
CLI Interface 85% ✅ Good
Overall 90% ✅ Excellent

Running Tests

# Run all tests
pytest tests/ -v

# Run specific test categories
pytest tests/unit/ -v                    # Unit tests
pytest tests/integration/ -v             # Integration tests
pytest tests/tools_validation/ -v        # Tool validation
pytest tests/prompt_contracts/ -v        # AI reproducibility

# Run with coverage
pytest tests/ --cov=src --cov-report=html

# Performance tests
pytest tests/performance/ -v --benchmark-only

Test Examples

# Test natural language interface
def test_natural_language_analysis():
    agent = get_agent_with_session(config, "ollama")
    response = agent("Analyze sample_data/example.vcf for pathogenic variants")
    assert "pathogenic" in response.lower()
    assert "variants" in response.lower()

# Test tool execution
def test_vcf_validation():
    agent = get_agent_with_session(config, "ollama")
    result = agent.validate_vcf("sample_data/valid_example.vcf")
    assert "valid" in result.lower()

# Test data store operations
def test_graph_database_integration():
    manager = create_data_store_manager()
    result = manager.add_sample_with_variants(sample_data, variants_data)
    assert result["success"] is True

📊 Performance Metrics

Benchmark Results

graph LR
    subgraph "Performance Benchmarks"
        VCF_VAL[VCF Validation<br/>~50ms per file]
        EMBED[Embedding Generation<br/>~100ms per variant]
        SEARCH[Vector Search<br/>~10ms per query]
        GRAPH[Graph Query<br/>~50ms per query]
        BATCH[Batch Processing<br/>10K+ variants/sec]
    end
    
    subgraph "Scalability"
        SMALL[Small Files<br/><1K variants<br/>~1-2 seconds]
        MEDIUM[Medium Files<br/>1K-10K variants<br/>~10-30 seconds]
        LARGE[Large Files<br/>10K+ variants<br/>~1-5 minutes]
    end
    
    style VCF_VAL fill:#00bf7d,color:#000000
    style EMBED fill:#00b4c5,color:#000000
    style SEARCH fill:#0073e6,color:#ffffff
    style GRAPH fill:#2546f0,color:#ffffff
    style BATCH fill:#5928ed,color:#ffffff
    style SMALL fill:#00bf7d,color:#000000
    style MEDIUM fill:#00b4c5,color:#000000
    style LARGE fill:#0073e6,color:#ffffff
Loading

Resource Requirements

Operation CPU Memory Disk I/O Network
VCF Validation Low Low Medium None
AI Analysis Medium Medium Low High
Vector Search Low Medium Medium Low
Graph Queries Medium Low Medium Low
Batch Processing High High High Medium

🤝 Contributing

Development Workflow

flowchart TD
    FORK[🍴 Fork Repository] --> CLONE[📥 Clone Fork]
    CLONE --> BRANCH[🌿 Create Feature Branch]
    BRANCH --> CODE[💻 Write Code]
    CODE --> TEST[🧪 Run Tests]
    TEST --> COMMIT[📝 Commit Changes]
    COMMIT --> PUSH[📤 Push to Fork]
    PUSH --> PR[🔄 Create Pull Request]
    PR --> REVIEW[👀 Code Review]
    REVIEW --> MERGE[✅ Merge to Main]
    
    style FORK fill:#00bf7d,color:#000000
    style CLONE fill:#00b4c5,color:#000000
    style BRANCH fill:#0073e6,color:#ffffff
    style CODE fill:#2546f0,color:#ffffff
    style TEST fill:#5928ed,color:#ffffff
    style COMMIT fill:#00bf7d,color:#000000
    style PUSH fill:#00b4c5,color:#000000
    style PR fill:#0073e6,color:#ffffff
    style REVIEW fill:#2546f0,color:#ffffff
    style MERGE fill:#5928ed,color:#ffffff
Loading

Contribution Guidelines

# 1. Fork and clone
git clone https://github.com/your-username/vcf-analysis-agent.git
cd vcf-analysis-agent

# 2. Create feature branch
git checkout -b feature/your-feature-name

# 3. Make changes and test
# ... your changes ...
pytest tests/ -v
pre-commit run --all-files

# 4. Commit and push
git add .
git commit -m "feat: add your feature description"
git push origin feature/your-feature-name

# 5. Create pull request
# Use GitHub interface to create PR

📚 Documentation

Available Documentation

Document Description Location
API Reference Complete API documentation docs/source/api/
Tools Guide Detailed tool usage docs/source/tools_guide.md
Data Stores Database architecture docs/source/data_stores.md
Architecture System design docs/source/architecture.md
Deployment Production setup docs/source/deployment.md
Phase 1 Report Memory optimization success performance_reports/PHASE1_MEMORY_OPTIMIZATION_REPORT.md
Phase 2 Plan Memory recovery roadmap .context/plan/PHASE2_MEMORY_RECOVERY_PLAN.md
Project Status Current development status PROJECT_STATUS.md

Building Documentation

# Install documentation dependencies
pip install -r docs/requirements.txt

# Build HTML documentation
cd docs && make html

# Serve documentation locally
python -m http.server 8000 -d docs/build/html
# Access at http://localhost:8000

🆘 Support & Community

Getting Help

flowchart LR
    HELP[🆘 Need Help?] --> DOCS[📚 Check Documentation]
    HELP --> ISSUES[🐛 Search Issues]
    HELP --> DISCUSSIONS[💬 GitHub Discussions]
    
    DOCS --> FOUND{Found Answer?}
    ISSUES --> FOUND
    DISCUSSIONS --> FOUND
    
    FOUND -->|Yes| SOLVED[✅ Problem Solved]
    FOUND -->|No| CREATE[📝 Create New Issue]
    
    CREATE --> TEMPLATE[Use Issue Template]
    TEMPLATE --> SUBMIT[Submit with Details]
    
    style HELP fill:#00bf7d,color:#000000
    style DOCS fill:#00b4c5,color:#000000
    style ISSUES fill:#0073e6,color:#ffffff
    style DISCUSSIONS fill:#2546f0,color:#ffffff
    style SOLVED fill:#5928ed,color:#ffffff
    style CREATE fill:#00bf7d,color:#000000
    style TEMPLATE fill:#00b4c5,color:#000000
    style SUBMIT fill:#0073e6,color:#ffffff
Loading

Support Channels

Issue Templates

When reporting issues, please include:

**Environment:**
- OS: [e.g., macOS 14.0, Ubuntu 22.04]
- Python: [e.g., 3.9.7]
- VCF Agent: [e.g., 0.1.0]

**Problem Description:**
[Clear description of the issue]

**Steps to Reproduce:**
1. [First step]
2. [Second step]
3. [Third step]

**Expected Behavior:**
[What you expected to happen]

**Actual Behavior:**
[What actually happened]

**Additional Context:**
[Any other relevant information]

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • BCFtools Team for the excellent genomics toolkit
  • LanceDB for high-performance vector database
  • Kuzu for graph database capabilities
  • Ollama for local AI model serving
  • Apache Iggy for ultra-high-performance message streaming
  • Open Source Community for continuous inspiration

⬆️ Back to Top

Made with ❤️ for the genomics community

About

VCF Analysis Agent using strands agent SDK and bcftools

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published