
Transform raw web content into high-quality AI training datasets with production-grade reliability
Quick Start • Features • Documentation • Architecture • Enterprise
QuarryCore is a production-grade data pipeline that transforms raw web content into high-quality AI training datasets. Built with modular architecture and protocol-based design, it seamlessly adapts from Raspberry Pi (4GB RAM) to enterprise GPU clusters.
- **Hardware Adaptive** - Automatically optimizes performance from Pi to GPU workstations
- **Intelligent Processing** - Multi-strategy extraction with ML-powered quality assessment
- **Enterprise Ready** - JWT auth, rate limiting, audit logging, and monitoring built-in
- **Developer Friendly** - Clean APIs, comprehensive docs, and extensible architecture
```python
# Extract and process web content in 3 lines
from quarrycore import Pipeline

async with Pipeline() as pipeline:
    result = await pipeline.run(["https://example.com"])
    print(f"✅ Processed {result['processed_count']} documents")
```
```bash
# Clone the repository
git clone https://github.com/shua-ie/quarrycore
cd quarrycore

# Install in development mode
pip install -e .

# With GPU acceleration
pip install -e ".[gpu]"

# With all development tools
pip install -e ".[dev]"
```
```python
import asyncio

from quarrycore import Pipeline, Config
from quarrycore.container import DependencyContainer

async def create_dataset():
    # Create configuration
    config = Config()
    config.quality.default.min_overall_score = 0.8  # Only high-quality content

    # Create container with configuration
    container = DependencyContainer()

    # Create and run pipeline
    pipeline = Pipeline(container)

    # Process URLs
    urls = ["https://example.com"]
    result = await pipeline.run(urls)

    # Display results
    print(f"✅ Processed {result['processed_count']} documents")
    print(f"⏱️ Duration: {result.get('duration', 0):.2f} seconds")

# Run it
asyncio.run(create_dataset())
```
QuarryCore supports configuration via environment variables for production deployments:
```bash
# Pipeline checkpointing (AC-06)
export CHECKPOINT_INTERVAL=60.0               # Checkpoint save interval (seconds)
export CHECKPOINT_DIR=/app/checkpoints        # Checkpoint storage directory

# Domain failure backpressure
export DOMAIN_FAILURE_THRESHOLD=5             # Max failures per domain before backoff
export DOMAIN_FAILURE_WINDOW=60.0             # Failure tracking window (seconds)
export DOMAIN_BACKOFF_DURATION=120.0          # Backoff duration (seconds)

# Dead letter queue
export DEAD_LETTER_DB_PATH=/app/dead_letter.db  # Failed URL storage

# Run with environment configuration
python -m quarrycore.cli process urls.txt
```
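For reference, the same settings can be read programmatically. This is a minimal sketch that mirrors the variable names above with illustrative defaults; it is not QuarryCore's internal config loader:

```python
import os
from pathlib import Path

# Illustrative sketch: defaults and parsing here are assumptions,
# not QuarryCore's actual configuration code.
checkpoint_interval = float(os.environ.get("CHECKPOINT_INTERVAL", "60.0"))
checkpoint_dir = Path(os.environ.get("CHECKPOINT_DIR", "./checkpoints"))
failure_threshold = int(os.environ.get("DOMAIN_FAILURE_THRESHOLD", "5"))
failure_window = float(os.environ.get("DOMAIN_FAILURE_WINDOW", "60.0"))
backoff_duration = float(os.environ.get("DOMAIN_BACKOFF_DURATION", "120.0"))
dead_letter_db = Path(os.environ.get("DEAD_LETTER_DB_PATH", "./dead_letter.db"))

print(f"Checkpointing every {checkpoint_interval}s into {checkpoint_dir}")
```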
---
## Features
### Currently Available
| Feature | Description | Status |
|---------|-------------|--------|
| **Multi-Strategy Extraction** | Cascade extraction with multiple fallback strategies | ✅ Available |
| **Hardware Adaptation** | Auto-optimization from Pi to GPU clusters | ✅ Available |
| **Multi-Level Deduplication** | Hash, MinHash, semantic, and fuzzy matching (see the sketch after this table) | ✅ Available |
| **Quality Assessment** | ML-powered scoring with domain intelligence | ✅ Available |
| **Enterprise Security** | JWT auth, rate limiting, audit logging | ✅ Available |
| **Container Deployment** | Docker files for CPU and GPU deployment | ✅ Available |
| **Prometheus Metrics** | Business KPIs and system monitoring | ✅ Available |
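As an illustration of the first two deduplication levels (exact hashing and MinHash), here is a minimal sketch using the third-party `datasketch` library; QuarryCore's own deduplicator lives in `src/quarrycore/deduplicator/` and may differ:

```python
import hashlib
from datasketch import MinHash, MinHashLSH

def exact_key(text: str) -> str:
    # Level 1: exact-duplicate detection via content hash
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def minhash_of(text: str) -> MinHash:
    # Level 2: near-duplicate detection via MinHash over word tokens
    m = MinHash(num_perm=128)
    for token in text.lower().split():
        m.update(token.encode("utf-8"))
    return m

lsh = MinHashLSH(threshold=0.8, num_perm=128)  # 0.8 Jaccard similarity cutoff
seen_hashes: set[str] = set()

def is_duplicate(doc_id: str, text: str) -> bool:
    if exact_key(text) in seen_hashes:
        return True
    m = minhash_of(text)
    if lsh.query(m):  # any near-duplicate already indexed?
        return True
    seen_hashes.add(exact_key(text))
    lsh.insert(doc_id, m)
    return False
```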
### Planned Enhancements
| Feature | Description | Status |
|---------|-------------|--------|
| **Grafana Dashboards** | Pre-built monitoring dashboards with alerts | Planned |
| **Comprehensive Documentation** | Full API and architecture documentation | Planned |
| **Enterprise SSO** | SAML/OIDC integration | Planned |
| **Plugin Architecture** | Extensible extractors and processors | Planned |
| **Multi-Cloud Support** | AWS/GCP/Azure native integrations | Planned |
---
## Documentation
| Resource | Description | Status |
|----------|-------------|--------|
| [Architecture Overview](ANALYSIS_REPORT.md) | System design and component analysis | ✅ Available |
| [Deployment Guide](DEPLOYMENT.md) | Production deployment instructions | ✅ Available |
| [Configuration Guide](config.example.yaml) | Configuration options and examples | ✅ Available |
| [Security Guide](SECURITY.md) | Security best practices | ✅ Available |
| [Contributing Guide](CONTRIBUTING.md) | Developer contribution guidelines | ✅ Available |
---
## Performance Characteristics
*Performance varies based on hardware capabilities and configuration*
| Hardware | Target Throughput | Memory Usage | Notes |
|----------|-------------------|--------------|-------|
| **Raspberry Pi 4** | 50-200 docs/min | 2-4GB | CPU-only, optimized for memory |
| **MacBook Pro M2** | 200-500 docs/min | 6-8GB | Balanced performance |
| **Workstation** | 500-1000 docs/min | 8-12GB | Multi-core optimization |
| **GPU Server** | 1000+ docs/min | 12-16GB | GPU acceleration for ML tasks |
<details>
<summary><b>Performance Optimization Tips</b></summary>
### Hardware Adaptation
The system automatically detects hardware capabilities and adjusts the following (a minimal sketch follows this list):
- **Concurrency levels** based on CPU cores
- **Batch sizes** based on available memory
- **GPU utilization** for ML-powered quality assessment
- **Storage strategy** based on disk type
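A sketch of this kind of detection logic, assuming `psutil` is available for memory introspection; the names and thresholds below are illustrative, not QuarryCore's actual tuning code:

```python
import os
import psutil

def suggest_settings() -> dict:
    # Derive rough concurrency and batch-size hints from the host.
    # Thresholds here are illustrative assumptions.
    cores = os.cpu_count() or 1
    mem_gb = psutil.virtual_memory().total / 1024**3

    return {
        "max_concurrent_requests": min(cores * 8, 100),
        "batch_size": 16 if mem_gb < 6 else 64 if mem_gb < 16 else 256,
        "use_gpu": False,  # e.g., check torch.cuda.is_available() if installed
    }

print(suggest_settings())
```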
### Configuration Tuning
```yaml
# config.yaml - Optimize for your use case
crawler:
  max_concurrent_requests: 50  # Adjust based on network
quality:
  default:
    min_overall_score: 0.7     # Balance quality vs quantity
dataset:
  chunking:
    chunk_size: 2048           # Adjust based on memory
```

</details>
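The `chunk_size` knob above controls how documents are split for dataset output. As a rough, character-based illustration of chunking with overlap (the unit QuarryCore actually chunks by is not specified here):

```python
def chunk_text(text: str, chunk_size: int = 2048, overlap: int = 128) -> list[str]:
    # Split text into fixed-size chunks with a small overlap so that
    # content spanning a boundary appears in both neighbors.
    # Character-based for illustration; real pipelines often chunk by tokens.
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
    return chunks

print(len(chunk_text("lorem ipsum " * 1000)))  # -> 7 chunks of ~2048 chars
```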
## Enterprise Security

- **Authentication** - JWT tokens with refresh, API keys, RBAC (see the sketch after this list)
- **Rate Limiting** - Redis-backed distributed enforcement
- **Audit Logging** - Complete activity tracking with correlation IDs
- **Data Security** - Input validation and secure processing
- **Monitoring Ready** - Prometheus metrics and health checks
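A minimal sketch of the JWT issue-and-verify pattern using the `PyJWT` package; this illustrates the concept only and is not QuarryCore's security module (`src/quarrycore/security/`):

```python
import datetime
import jwt  # PyJWT

SECRET = "change-me"  # in production, load from a secrets manager

def issue_token(user_id: str, ttl_minutes: int = 15) -> str:
    # Short-lived access token; pair with a refresh token in practice.
    now = datetime.datetime.now(datetime.timezone.utc)
    claims = {
        "sub": user_id,
        "iat": now,
        "exp": now + datetime.timedelta(minutes=ttl_minutes),
    }
    return jwt.encode(claims, SECRET, algorithm="HS256")

def verify_token(token: str) -> dict:
    # Raises jwt.ExpiredSignatureError / jwt.InvalidTokenError on failure.
    return jwt.decode(token, SECRET, algorithms=["HS256"])

token = issue_token("alice")
print(verify_token(token)["sub"])  # -> alice
```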
```bash
# Docker deployment
docker build -f docker/Dockerfile.cpu -t quarrycore:cpu .
docker run -p 8000:8000 quarrycore:cpu

# Kubernetes deployment
kubectl apply -f k8s/production/
```
```python
import asyncio

from quarrycore import Pipeline, Config
from quarrycore.container import DependencyContainer

async def scrape() -> None:
    # Configure for web scraping
    config = Config()
    config.crawler.rate_limiter.max_requests = 10  # Respectful crawling
    config.quality.default.min_overall_score = 0.8

    # Create pipeline
    container = DependencyContainer()
    pipeline = Pipeline(container)

    # Process URLs
    urls = ["https://example.com/blog", "https://example.com/docs"]
    result = await pipeline.run(urls)
    print(f"Processed {result['processed_count']} documents")

asyncio.run(scrape())
```
```python
from quarrycore import Pipeline, Config
from quarrycore.container import DependencyContainer

# Configure for batch processing
config = Config()
config.crawler.max_concurrent_requests = 50
config.deduplication.enabled_levels = [1, 2, 3]  # Enable all dedup levels
config.storage.hot.pool_size = 20                # Increase connection pool

# Process with custom configuration
container = DependencyContainer(config_path="config.yaml")
pipeline = Pipeline(container, max_concurrency=100)
```
<details>
<summary><b>View System Architecture</b></summary>

```mermaid
graph TB
    subgraph "Data Sources"
        W[Websites]
        D[Documents]
        A[APIs]
    end

    subgraph "Processing Pipeline"
        CR[Crawler<br/>Rate-limited]
        EX[Extractor<br/>Multi-strategy]
        QA[Quality<br/>ML Scoring]
        DD[Dedup<br/>Multi-level]
        ST[Storage<br/>Tiered]
    end

    subgraph "Infrastructure"
        M[Monitoring]
        S[Security]
        C[Config]
    end

    W --> CR
    D --> CR
    A --> CR
    CR --> EX
    EX --> QA
    QA --> DD
    DD --> ST

    M -.-> CR
    M -.-> EX
    M -.-> QA
    S -.-> CR
    C -.-> CR

    style CR fill:#e3f2fd
    style EX fill:#f3e5f5
    style QA fill:#e8f5e9
    style DD fill:#fff3e0
    style ST fill:#fce4ec
```

</details>
| Module | Purpose | Implementation |
|--------|---------|----------------|
| **Container** | Dependency injection with hot-reload | `src/quarrycore/container.py` |
| **Pipeline** | Orchestration with checkpointing | `src/quarrycore/pipeline.py` |
| **Crawler** | Adaptive web crawling | `src/quarrycore/crawler/` |
| **Extractor** | Multi-strategy content extraction (sketch after this table) | `src/quarrycore/extractor/` |
| **Quality** | ML-powered assessment | `src/quarrycore/quality/` |
| **Deduplicator** | Multi-level deduplication | `src/quarrycore/deduplicator/` |
| **Storage** | Tiered storage system | `src/quarrycore/storage/` |
| **Security** | Authentication and rate limiting | `src/quarrycore/security/` |
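To make the Extractor's cascade concrete, here is a minimal sketch of the fallback pattern using common extraction libraries (`trafilatura`, `readability-lxml`, `BeautifulSoup`); QuarryCore's actual strategy order and libraries live in `src/quarrycore/extractor/` and may differ:

```python
import trafilatura
from readability import Document
from bs4 import BeautifulSoup

def extract_text(html: str) -> str | None:
    # Strategy 1: trafilatura - strong results on article-like pages
    text = trafilatura.extract(html)
    if text:
        return text

    # Strategy 2: readability - boilerplate removal, then strip tags
    try:
        summary_html = Document(html).summary()
        text = BeautifulSoup(summary_html, "html.parser").get_text(" ", strip=True)
        if text:
            return text
    except Exception:
        pass  # fall through to the last-resort strategy

    # Strategy 3: last resort - raw visible text from the full page
    return BeautifulSoup(html, "html.parser").get_text(" ", strip=True) or None
```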
| Channel | Response Time | Best For |
|---------|---------------|----------|
| GitHub Issues | 24-48 hours | Bug reports, features |
| Discussions | 2-3 days | Questions, ideas |
We welcome contributions! See our [Contributing Guide](CONTRIBUTING.md) for details.
```bash
# Get started with development
git clone https://github.com/shua-ie/quarrycore
cd quarrycore
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Check code quality
mypy src/
black src/
ruff check src/
```
| Metric | Status | Details |
|--------|--------|---------|
| Type Safety | 100% typed | |
| Test Coverage | Comprehensive test suite | |
| Code Quality | Zero issues | |
| Security | Security-first design | Regular audits |
MIT License - see LICENSE for details.
Star us on GitHub • Request a Feature • Contribute
If QuarryCore helps your project, please star this repository!