# QuarryCore - AI Training Data Pipeline

**Enterprise-Grade AI Training Data Pipeline**


> Transform raw web content into high-quality AI training datasets with production-grade reliability

Quick Start • Features • Documentation • Architecture • Enterprise


## 🚀 Why QuarryCore?

QuarryCore is a production-grade data pipeline that transforms raw web content into high-quality AI training datasets. Built with modular architecture and protocol-based design, it seamlessly adapts from Raspberry Pi (4GB RAM) to enterprise GPU clusters.

### 🎯 Key Benefits

  • πŸ—οΈ Hardware Adaptive - Automatically optimizes performance from Pi to GPU workstations
  • 🧠 Intelligent Processing - Multi-strategy extraction with ML-powered quality assessment
  • πŸ”’ Enterprise Ready - JWT auth, rate limiting, audit logging, and monitoring built-in
  • πŸš€ Developer Friendly - Clean APIs, comprehensive docs, and extensible architecture
```python
# Extract and process web content in 3 lines
from quarrycore import Pipeline

async with Pipeline() as pipeline:
    result = await pipeline.run(["https://example.com"])
    print(f"βœ… Processed {result['processed_count']} documents")

## ⚡ Quick Start

### Installation

```bash
# Clone the repository
git clone https://github.com/shua-ie/quarrycore
cd quarrycore

# Install in development mode
pip install -e .

# With GPU acceleration
pip install -e ".[gpu]"

# With all development tools
pip install -e ".[dev]"
```

### Your First Dataset (60 seconds)

```python
import asyncio
from quarrycore import Pipeline, Config
from quarrycore.container import DependencyContainer

async def create_dataset():
    # Create configuration
    config = Config()
    config.quality.default.min_overall_score = 0.8  # Only high-quality content
    
    # Create container (settings can also be loaded from a YAML file
    # via DependencyContainer(config_path="config.yaml"), as shown below)
    container = DependencyContainer()
    
    # Create and run pipeline
    pipeline = Pipeline(container)
    
    # Process URLs
    urls = ["https://example.com"]
    result = await pipeline.run(urls)
    
    # Display results
    print(f"βœ… Processed {result['processed_count']} documents")
    print(f"⏱️  Duration: {result.get('duration', 0):.2f} seconds")

# Run it
asyncio.run(create_dataset())
```

### Environment Configuration

QuarryCore supports configuration via environment variables for production deployments:

```bash
# Pipeline checkpointing (AC-06)
export CHECKPOINT_INTERVAL=60.0          # Checkpoint save interval (seconds)
export CHECKPOINT_DIR=/app/checkpoints   # Checkpoint storage directory

# Domain failure backpressure
export DOMAIN_FAILURE_THRESHOLD=5        # Max failures per domain before backoff
export DOMAIN_FAILURE_WINDOW=60.0        # Failure tracking window (seconds)
export DOMAIN_BACKOFF_DURATION=120.0     # Backoff duration (seconds)

# Dead letter queue
export DEAD_LETTER_DB_PATH=/app/dead_letter.db  # Failed URL storage

# Run with environment configuration
python -m quarrycore.cli process urls.txt
```
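
The backpressure variables describe a sliding-window policy: once a domain accumulates `DOMAIN_FAILURE_THRESHOLD` failures within `DOMAIN_FAILURE_WINDOW` seconds, it is backed off for `DOMAIN_BACKOFF_DURATION` seconds. A minimal sketch of that policy (illustrative only, not QuarryCore's internal implementation):

```python
import time
from collections import defaultdict, deque


class DomainBackoff:
    """Sliding-window failure tracking per domain (illustrative sketch)."""

    def __init__(self, threshold: int = 5, window: float = 60.0, backoff: float = 120.0):
        self.threshold = threshold
        self.window = window
        self.backoff = backoff
        self._failures: dict[str, deque] = defaultdict(deque)  # domain -> failure timestamps
        self._blocked_until: dict[str, float] = {}

    def record_failure(self, domain: str) -> None:
        now = time.monotonic()
        events = self._failures[domain]
        events.append(now)
        # Drop failures that fell outside the tracking window
        while events and now - events[0] > self.window:
            events.popleft()
        if len(events) >= self.threshold:
            self._blocked_until[domain] = now + self.backoff
            events.clear()

    def is_blocked(self, domain: str) -> bool:
        return time.monotonic() < self._blocked_until.get(domain, 0.0)
```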

---

## 🎨 Features

### Currently Available

| Feature | Description | Status |
|---------|-------------|---------|
| 🔄 **Multi-Strategy Extraction** | Cascade extraction with multiple fallback strategies | ✅ Available |
| ⚡ **Hardware Adaptation** | Auto-optimization from Pi to GPU clusters | ✅ Available |
| 🔍 **Multi-Level Deduplication** | Hash, MinHash, semantic, and fuzzy matching | ✅ Available |
| 📊 **Quality Assessment** | ML-powered scoring with domain intelligence | ✅ Available |
| 🛡️ **Enterprise Security** | JWT auth, rate limiting, audit logging | ✅ Available |
| 🐳 **Container Deployment** | Docker files for CPU and GPU deployment | ✅ Available |
| 📈 **Prometheus Metrics** | Business KPIs and system monitoring | ✅ Available |
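
As a concrete illustration of the multi-level deduplication row, near-duplicate detection is typically layered on top of exact hashing. A minimal MinHash sketch using the `datasketch` library (it illustrates the technique, not QuarryCore's internal API):

```python
from datasketch import MinHash, MinHashLSH


def minhash(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature from whitespace tokens."""
    m = MinHash(num_perm=num_perm)
    for token in text.lower().split():
        m.update(token.encode("utf-8"))
    return m


# Index documents, then query for near-duplicates above a Jaccard threshold
lsh = MinHashLSH(threshold=0.8, num_perm=128)
lsh.insert("doc-1", minhash("the quick brown fox jumps over the lazy dog"))
candidates = lsh.query(minhash("the quick brown fox jumped over a lazy dog"))
print(candidates)  # ['doc-1'] if the similarity clears the threshold
```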

### Planned Enhancements

| Feature | Description | Status |
|---------|-------------|---------|
| 📊 **Grafana Dashboards** | Pre-built monitoring dashboards with alerts | 🚀 Planned |
| 📚 **Comprehensive Documentation** | Full API and architecture documentation | 🚀 Planned |
| 🔐 **Enterprise SSO** | SAML/OIDC integration | 🚀 Planned |
| 🧩 **Plugin Architecture** | Extensible extractors and processors | 🚀 Planned |
| 🌐 **Multi-Cloud Support** | AWS/GCP/Azure native integrations | 🚀 Planned |

---

## 📖 Documentation

| Resource | Description | Status |
|----------|-------------|---------|
| [πŸ—οΈ Architecture Overview](ANALYSIS_REPORT.md) | System design and component analysis | βœ… Available |
| [πŸš€ Deployment Guide](DEPLOYMENT.md) | Production deployment instructions | βœ… Available |
| [πŸ”§ Configuration Guide](config.example.yaml) | Configuration options and examples | βœ… Available |
| [πŸ”’ Security Guide](SECURITY.md) | Security best practices | βœ… Available |
| [🀝 Contributing Guide](CONTRIBUTING.md) | Developer contribution guidelines | βœ… Available |

---

## 📊 Performance Characteristics

*Performance varies based on hardware capabilities and configuration*

| Hardware | Target Throughput | Memory Usage | Notes |
|----------|-------------------|--------------|-------|
| **Raspberry Pi 4** | 50-200 docs/min | 2-4GB | CPU-only, optimized for memory |
| **MacBook Pro M2** | 200-500 docs/min | 6-8GB | Balanced performance |
| **Workstation** | 500-1000 docs/min | 8-12GB | Multi-core optimization |
| **GPU Server** | 1000+ docs/min | 12-16GB | GPU acceleration for ML tasks |

<details>
<summary><b>Performance Optimization Tips</b></summary>

### Hardware Adaptation
The system automatically detects hardware capabilities and adjusts the following (see the sketch after this list):
- **Concurrency levels** based on CPU cores
- **Batch sizes** based on available memory
- **GPU utilization** for ML-powered quality assessment
- **Storage strategy** based on disk type
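
A sketch of how detected capabilities can drive these knobs (uses `os` and `psutil`; the thresholds and formulas are illustrative assumptions, not QuarryCore's actual tuning rules):

```python
import os

import psutil  # assumed available: pip install psutil

cores = os.cpu_count() or 1
available_gb = psutil.virtual_memory().available / 1024**3

# Illustrative heuristics: scale concurrency with cores, batch size with memory
max_concurrent_requests = min(cores * 4, 50)
batch_size = 256 if available_gb >= 8 else 64

print(f"{cores} cores, {available_gb:.1f} GiB free -> "
      f"concurrency={max_concurrent_requests}, batch={batch_size}")
```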

### Configuration Tuning
```yaml
# config.yaml - Optimize for your use case
crawler:
  max_concurrent_requests: 50  # Adjust based on network
  
quality:
  default:
    min_overall_score: 0.7    # Balance quality vs quantity
    
dataset:
  chunking:
    chunk_size: 2048          # Adjust based on memory
```

</details>

---

## 🏢 Enterprise Features

Security & Compliance

  • πŸ” Authentication - JWT tokens with refresh, API keys, RBAC
  • πŸ›‘οΈ Rate Limiting - Redis-backed distributed enforcement
  • πŸ“ Audit Logging - Complete activity tracking with correlation IDs
  • πŸ”’ Data Security - Input validation and secure processing
  • πŸ“Š Monitoring Ready - Prometheus metrics and health checks

Monitoring & Operations

Current Capabilities

  • βœ… Prometheus metrics export
  • βœ… Health check endpoints
  • βœ… Structured JSON logging
  • βœ… Performance profiling
  • βœ… Resource monitoring

Planned Enhancements

  • πŸš€ Grafana dashboard templates
  • πŸš€ Distributed tracing (Jaeger)
  • πŸš€ Advanced alerting rules
  • πŸš€ SLA monitoring
  • πŸš€ Cost analytics

### Deployment Options

```bash
# Docker deployment
docker build -f docker/Dockerfile.cpu -t quarrycore:cpu .
docker run -p 8000:8000 quarrycore:cpu

# Kubernetes deployment
kubectl apply -f k8s/production/
```

## 💡 Use Cases

### Web Content Processing

```python
from quarrycore import Pipeline, Config
from quarrycore.container import DependencyContainer

# Configure for web scraping
config = Config()
config.crawler.rate_limiter.max_requests = 10  # Respectful crawling
config.quality.default.min_overall_score = 0.8

# Create pipeline
container = DependencyContainer()
pipeline = Pipeline(container)

# Process URLs
urls = ["https://example.com/blog", "https://example.com/docs"]
result = await pipeline.run(urls)

print(f"Processed {result['processed_count']} documents")

### Batch Processing Configuration

```python
# Configure for batch processing
config = Config()
config.crawler.max_concurrent_requests = 50
config.deduplication.enabled_levels = [1, 2, 3]  # Enable all dedup levels
config.storage.hot.pool_size = 20  # Increase connection pool

# Process with custom configuration
container = DependencyContainer(config_path="config.yaml")
pipeline = Pipeline(container, max_concurrency=100)
```

πŸ—οΈ Architecture

<details>
<summary><b>View System Architecture</b></summary>

```mermaid
graph TB
    subgraph "Data Sources"
        W[Websites]
        D[Documents]
        A[APIs]
    end
    
    subgraph "Processing Pipeline"
        CR[Crawler<br/>Rate-limited]
        EX[Extractor<br/>Multi-strategy]
        QA[Quality<br/>ML Scoring]
        DD[Dedup<br/>Multi-level]
        ST[Storage<br/>Tiered]
    end
    
    subgraph "Infrastructure"
        M[Monitoring]
        S[Security]
        C[Config]
    end
    
    W --> CR
    D --> CR
    A --> CR
    CR --> EX
    EX --> QA
    QA --> DD
    DD --> ST
    
    M -.-> CR
    M -.-> EX
    M -.-> QA
    S -.-> CR
    C -.-> CR
    
    style CR fill:#e3f2fd
    style EX fill:#f3e5f5
    style QA fill:#e8f5e9
    style DD fill:#fff3e0
    style ST fill:#fce4ec
```

</details>

### Key Components

| Module | Purpose | Implementation |
|--------|---------|----------------|
| Container | Dependency injection with hot-reload | `src/quarrycore/container.py` |
| Pipeline | Orchestration with checkpointing | `src/quarrycore/pipeline.py` |
| Crawler | Adaptive web crawling | `src/quarrycore/crawler/` |
| Extractor | Multi-strategy content extraction | `src/quarrycore/extractor/` |
| Quality | ML-powered assessment | `src/quarrycore/quality/` |
| Deduplicator | Multi-level deduplication | `src/quarrycore/deduplicator/` |
| Storage | Tiered storage system | `src/quarrycore/storage/` |
| Security | Authentication and rate limiting | `src/quarrycore/security/` |
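
These module boundaries reflect the protocol-based design mentioned above. A minimal sketch of how such boundaries can be expressed with `typing.Protocol` (the interface and class names here are hypothetical, not QuarryCore's actual types):

```python
from typing import Protocol


class ContentExtractor(Protocol):
    """Structural interface: anything with this method can act as an extractor."""

    async def extract(self, html: str, url: str) -> str: ...


class BasicExtractor:
    """Hypothetical strategy; satisfies ContentExtractor without inheriting from it."""

    async def extract(self, html: str, url: str) -> str:
        return html  # a real strategy would parse and clean the markup here


def register(extractor: ContentExtractor) -> None:
    ...  # accepts any structural match, so strategies plug in without subclassing
```

Because the check is structural, new extraction strategies can be dropped in without touching the pipeline itself.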

## 🤝 Community & Support

### Getting Help

| Channel | Response Time | Best For |
|---------|---------------|----------|
| GitHub Issues | 24-48 hours | Bug reports, features |
| Discussions | 2-3 days | Questions, ideas |

### Contributing

We welcome contributions! See our [Contributing Guide](CONTRIBUTING.md) for details.

```bash
# Get started with development
git clone https://github.com/shua-ie/quarrycore
cd quarrycore
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Check code quality
mypy src/
black src/
ruff check src/
```

## 📊 Project Stats


### Code Quality

| Metric | Status | Details |
|--------|--------|---------|
| Type Safety | Mypy | 100% typed |
| Test Coverage | Coverage | Comprehensive test suite |
| Code Quality | Ruff | Zero issues |
| Security | Security-first design | Regular audits |

## 📜 License

MIT License - see LICENSE for details.


Built with ❤️ by engineers who needed better AI training data

⭐ Star us on GitHub • 💡 Request a Feature • 🤝 Contribute

If QuarryCore helps your project, please star this repository!
