CrawlAdapter

Universal Proxy Management for Web Scraping


CrawlAdapter is a comprehensive and extensible proxy management library built on Clash, designed for web scraping applications that require intelligent proxy rotation, custom routing rules, and seamless integration with various scraping frameworks.

🚀 Key Features

  • Intelligent Proxy Management: Automatic proxy node fetching, health checking, and rotation
  • Clash Integration: Built on the powerful Clash proxy engine with full configuration control
  • Smart Routing: Rule-based traffic routing with support for domain patterns and custom rules
  • Health Monitoring: Adaptive health checking with multiple strategies and automatic failover
  • Easy Integration: Simple API for integration with existing web scraping projects
  • Extensible Architecture: Modular design supporting custom sources, health strategies, and configurations
  • Production Ready: Comprehensive error handling, logging, and monitoring capabilities

📋 Table of Contents

  • 🔧 Installation
  • 🚀 Quick Start
  • 🏗️ Core Components
  • ⚙️ Configuration
  • 📚 Usage Examples
  • 📖 API Reference
  • 🔧 Advanced Features
  • 🛠️ Troubleshooting
  • 🏗️ Development and Contributing
  • 📄 License
  • 🙏 Acknowledgments
  • 📞 Support

🔧 Installation

Prerequisites

  • Python 3.8 or higher
  • Clash binary (mihomo) - automatically downloaded during setup

Install from Source

git clone https://github.com/graceyangfan/CrawlAdapter.git
cd CrawlAdapter
pip install -e .

Setup Clash Binary

CrawlAdapter requires the Clash (mihomo) binary. Run the setup script to automatically download it:

python setup_clash_binary.py

Or use the utility module:

from utils import download_clash_binary
download_clash_binary()

🚀 Quick Start

Basic Usage

import asyncio
import aiohttp
from crawladapter import ProxyClient

async def main():
    # Initialize the proxy client
    client = ProxyClient()
    
    # Start with custom rules for specific domains
    await client.start(rules=["*.example.com", "*.target-site.com"])
    
    # Get proxy URL for a request
    proxy_url = await client.get_proxy("https://example.com")
    
    # Use the proxy with your HTTP client
    if proxy_url:
        # Make request through proxy
        async with aiohttp.ClientSession() as session:
            async with session.get("https://example.com", proxy=proxy_url) as response:
                content = await response.text()
    
    # Clean shutdown
    await client.stop()

asyncio.run(main())

Simple Client for Quick Setup

from crawladapter import create_simple_client

async def quick_example():
    # Create and start client in one step
    client = await create_simple_client(
        rules=["*.panewslab.com"],
        custom_sources={
            'clash': ['https://example.com/config.yml']
        }
    )
    
    # Use the client
    proxy_url = await client.get_proxy("https://panewslab.com")
    
    await client.stop()
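
As with the basic example, the coroutine still needs an event loop to run:

import asyncio

asyncio.run(quick_example())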

πŸ—οΈ Core Components

Architecture Overview

CrawlAdapter follows a modular architecture with clear separation of concerns:

crawladapter/
├── client.py              # Main ProxyClient interface
├── simple_client.py       # Simplified client for quick setup
├── core.py                # Legacy compatibility layer
├── fetchers.py            # Node fetching from various sources
├── health_checker.py      # Health monitoring and validation
├── health_strategies.py   # Different health checking strategies
├── process_manager.py     # Clash process lifecycle management
├── config_generator.py    # Dynamic configuration generation
├── config_loader.py       # Configuration loading and validation
├── managers.py            # Configuration, proxy, and rule managers
├── rules.py               # Traffic routing rule management
├── types.py               # Type definitions and data models
└── exceptions.py          # Custom exception classes

Key Classes

ProxyClient

The main interface for proxy management:

  • Handles complete proxy lifecycle
  • Integrates all components seamlessly
  • Provides intelligent proxy selection
  • Manages health monitoring

SimpleProxyClient

Simplified interface for quick integration:

  • Minimal configuration required
  • Automatic setup and teardown
  • Perfect for simple use cases

NodeFetcher

Responsible for obtaining proxy nodes:

  • Supports Clash and V2Ray configurations
  • Custom source integration
  • Automatic parsing and validation

HealthChecker

Monitors proxy health and performance:

  • Multiple health checking strategies
  • Adaptive check intervals
  • Automatic failover handling

βš™οΈ Configuration

Configuration Sources

CrawlAdapter supports multiple configuration methods:

  1. Default Configuration: Built-in templates for common use cases
  2. Custom Sources: Fetch nodes from external URLs
  3. Manual Configuration: Direct proxy node specification
  4. Environment Variables: Runtime configuration overrides (see the sketch after this list)
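
One way to apply environment-variable overrides is to read the values at startup and pass them to the documented ProxyClient constructor. This is a minimal sketch; the variable names below are purely illustrative and are not defined by the library:

import os

from crawladapter import ProxyClient

# Hypothetical variable names -- map whatever your deployment uses onto the
# documented constructor arguments (config_dir, proxy_port, api_port).
client = ProxyClient(
    config_dir=os.getenv("CRAWLADAPTER_CONFIG_DIR"),
    proxy_port=int(os.getenv("CRAWLADAPTER_PROXY_PORT", "7890")),
    api_port=int(os.getenv("CRAWLADAPTER_API_PORT", "9090")),
)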

Configuration Templates

The library includes optimized templates for different scenarios:

# Scraping-optimized configuration
clash_templates:
  scraping:
    mode: rule
    log_level: warning
    dns:
      enable: true
      enhanced_mode: fake-ip
    proxy_groups:
      - name: "PROXY"
        type: select
      - name: "AUTO"
        type: url-test
        interval: 300

Custom Rules

Define routing rules for specific domains or patterns:

rules = [
    "*.example.com",           # All subdomains of example.com
    "target-site.com",         # Specific domain
    "DOMAIN-SUFFIX,api.com",   # Clash rule format
    "IP-CIDR,192.168.1.0/24"  # IP range
]

📚 Usage Examples

Web Scraping Integration

import aiohttp
from crawladapter import ProxyClient

class WebScraper:
    def __init__(self):
        self.proxy_client = ProxyClient()

    async def start(self):
        await self.proxy_client.start(rules=["*.target-site.com"])

    async def scrape_url(self, url):
        proxy_url = await self.proxy_client.get_proxy(url)

        async with aiohttp.ClientSession() as session:
            kwargs = {"proxy": proxy_url} if proxy_url else {}
            async with session.get(url, **kwargs) as response:
                return await response.text()

    async def stop(self):
        await self.proxy_client.stop()
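
Typical usage of the scraper class above (the URL is illustrative):

scraper = WebScraper()
await scraper.start()
html = await scraper.scrape_url("https://target-site.com/articles")
await scraper.stop()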

Real-World Example: News Crawler

Based on the included examples/panewslab_crawler.py:

import asyncio
import aiohttp
from crawladapter import ProxyClient

class PanewsLabCrawler:
    """Professional news crawler with proxy support."""

    def __init__(self, custom_sources=None):
        self.proxy_client = ProxyClient()
        self.custom_sources = custom_sources
        self.api_endpoint = "https://www.panewslab.com/webapi/flashnews"

    async def start(self):
        """Initialize the crawler with proxy support."""
        await self.proxy_client.start(
            rules=["*.panewslab.com"],
            custom_sources=self.custom_sources
        )

    async def fetch_news(self, limit=10):
        """Fetch latest news with automatic proxy rotation."""
        proxy_url = await self.proxy_client.get_proxy(self.api_endpoint)

        params = {"LId": 1, "Rn": limit, "tw": 0}

        async with aiohttp.ClientSession() as session:
            kwargs = {"proxy": proxy_url} if proxy_url else {}
            async with session.get(
                self.api_endpoint,
                params=params,
                **kwargs
            ) as response:
                data = await response.json()
                return self._parse_news_data(data)

    def _parse_news_data(self, data):
        """Parse news data from API response."""
        news_items = []
        for item in data.get('data', []):
            news_items.append({
                'title': item.get('title', ''),
                'content': item.get('content', ''),
                'publish_time': item.get('publish_time', ''),
                'symbols': item.get('symbols', [])
            })
        return news_items

    async def stop(self):
        """Clean shutdown."""
        await self.proxy_client.stop()

# Usage
async def main():
    crawler = PanewsLabCrawler(
        custom_sources={
            'clash': ['https://example.com/clash-config.yml']
        }
    )

    await crawler.start()
    news = await crawler.fetch_news(limit=5)

    for item in news:
        print(f"📰 {item['title']}")

    await crawler.stop()

asyncio.run(main())

Custom Node Sources

custom_sources = {
    'clash': [
        'https://example.com/clash-config.yml',
        'https://another-source.com/config.yaml'
    ],
    'v2ray': [
        'https://v2ray-source.com/subscription'
    ]
}

client = ProxyClient()
await client.start(
    rules=["*.target.com"],
    custom_sources=custom_sources
)

Health Monitoring

from crawladapter import ProxyClient
from crawladapter.health_strategies import AdaptiveHealthStrategy

# Use adaptive health checking
client = ProxyClient()
await client.start(
    rules=["*.example.com"],
    health_strategy=AdaptiveHealthStrategy(
        base_interval=60,  # Base check interval
        max_concurrent=10  # Max concurrent checks
    )
)

# Monitor health status
stats = await client.get_proxy_stats()
for proxy_name, proxy_stats in stats.items():
    print(f"{proxy_name}: {proxy_stats.health_score:.2f}")

📖 API Reference

ProxyClient

Constructor

ProxyClient(
    config_dir: Optional[str] = None,
    clash_binary_path: Optional[str] = None,
    proxy_port: int = 7890,
    api_port: int = 9090
)

Methods

start()
async def start(
    rules: Optional[List[str]] = None,
    custom_sources: Optional[Dict] = None,
    config_path: Optional[str] = None
) -> bool

Initialize and start the proxy client with specified rules and sources.

get_proxy()
async def get_proxy(
    url: Optional[str] = None,
    strategy: str = 'health_weighted'
) -> Optional[str]

Get proxy URL for a specific target URL based on routing rules.
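
A minimal sketch of the call against a started client; strategy defaults to 'health_weighted', and because the return value is optional the caller should be prepared to fall back to a direct connection:

proxy_url = await client.get_proxy("https://example.com", strategy="health_weighted")
if proxy_url is None:
    # No proxy selected for this URL -- issue the request directly instead
    ...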

stop()
async def stop() -> None

Clean shutdown of all proxy services and processes.

get_proxy_stats()
async def get_proxy_stats() -> Dict[str, ProxyStats]

Retrieve current health and performance statistics for all proxies.

SimpleProxyClient

create_simple_client()

async def create_simple_client(
    config_dir: Optional[str] = None,
    clash_binary_path: Optional[str] = None,
    proxy_port: int = 7890,
    api_port: int = 9090,
    rules: Optional[List[str]] = None,
    custom_sources: Optional[Dict] = None
) -> SimpleProxyClient

Create and initialize a simple proxy client in one step.

Configuration Types

ProxyNode

@dataclass
class ProxyNode:
    name: str
    type: ProxyType
    server: str
    port: int
    cipher: Optional[str] = None
    password: Optional[str] = None
    # ... additional fields

HealthCheckResult

@dataclass
class HealthCheckResult:
    proxy_name: str
    success: bool
    response_time: float
    overall_score: float
    error_message: Optional[str] = None

🔧 Advanced Features

Custom Health Strategies

Implement custom health checking logic:

from crawladapter.health_strategies import BaseHealthStrategy
from crawladapter.types import HealthCheckResult  # HealthCheckResult lives in types.py (see API Reference)

class CustomHealthStrategy(BaseHealthStrategy):
    async def check_proxy(self, proxy_name: str, clash_api_base: str):
        # Custom health check implementation
        result = await self._perform_custom_test(proxy_name)
        return HealthCheckResult(
            proxy_name=proxy_name,
            success=result.success,
            response_time=result.latency,
            overall_score=self._calculate_score(result)
        )

# Use custom strategy
client = ProxyClient()
client.health_checker.strategy = CustomHealthStrategy()

Rule-Based Routing

Advanced routing configuration:

from crawladapter.rules import RuleManager, RuleCategory

rule_manager = RuleManager()

# Add rules by category
rule_manager.add_rule("*.social-media.com", RuleCategory.SOCIAL)
rule_manager.add_rule("*.news-site.com", RuleCategory.NEWS)
rule_manager.add_rule("DIRECT", RuleCategory.BYPASS)

# Custom rule logic
def custom_rule_logic(url: str) -> bool:
    # Implement custom routing logic
    return "api" in url or "cdn" in url

rule_manager.add_custom_rule(custom_rule_logic)

Performance Monitoring

Monitor and optimize proxy performance:

import time
import aiohttp
from crawladapter import ProxyClient

class PerformanceMonitor:
    def __init__(self, client: ProxyClient):
        self.client = client
        self.metrics = {}

    async def monitor_request(self, url: str):
        start_time = time.time()
        proxy_url = await self.client.get_proxy(url)

        # One way to make the request through the selected proxy (sketch using aiohttp)
        async with aiohttp.ClientSession() as session:
            kwargs = {"proxy": proxy_url} if proxy_url else {}
            async with session.get(url, **kwargs) as response:
                await response.read()

        end_time = time.time()
        self.metrics[url] = {
            'proxy_used': proxy_url is not None,
            'response_time': end_time - start_time,
            'timestamp': start_time
        }

    def get_performance_report(self):
        return self.metrics
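
A short usage sketch of the monitor above, assuming the proxy client has already been started:

monitor = PerformanceMonitor(client)
await monitor.monitor_request("https://example.com")
print(monitor.get_performance_report())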

Integration with Popular Frameworks

Scrapy Integration

import asyncio
import scrapy
from scrapy import Spider
from crawladapter import ProxyClient

class ProxySpider(Spider):
    name = 'proxy_spider'

    def __init__(self):
        super().__init__()
        self.proxy_client = ProxyClient()

    async def start_requests(self):
        await self.proxy_client.start(rules=["*.target-site.com"])

        for url in self.start_urls:
            proxy_url = await self.proxy_client.get_proxy(url)
            meta = {'proxy': proxy_url} if proxy_url else {}
            yield scrapy.Request(url, meta=meta)

    def closed(self, reason):
        asyncio.run(self.proxy_client.stop())
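
Async callbacks in Scrapy 2.x rely on the asyncio Twisted reactor; if it is not already enabled, set it in settings.py:

# settings.py -- run Scrapy's Twisted event loop on top of asyncio
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"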

Requests Integration

import asyncio
import requests
from crawladapter import get_proxy_for_url

def make_request(url: str):
    proxy_url = asyncio.run(get_proxy_for_url(url, rules=["*.example.com"]))

    proxies = {'http': proxy_url, 'https': proxy_url} if proxy_url else None
    response = requests.get(url, proxies=proxies)
    return response

πŸ› οΈ Troubleshooting

Common Issues

1. Clash Binary Not Found

Problem: ClashProcessError: Clash binary not found

Solution:

# Run the setup script
python setup_clash_binary.py

# Or manually specify the path
client = ProxyClient(clash_binary_path="/path/to/mihomo")

2. No Healthy Proxies

Problem: All proxies fail health checks

Solutions:

  • Check network connectivity
  • Verify proxy sources are accessible
  • Adjust health check timeout settings
  • Use different health check URLs

from crawladapter.health_strategies import BasicHealthStrategy

# Use more lenient health checking
strategy = BasicHealthStrategy(
    timeout=30,  # Increase timeout
    test_urls=[
        "http://www.gstatic.com/generate_204",
        "http://httpbin.org/ip"
    ]
)

3. Configuration Errors

Problem: Invalid configuration or parsing errors

Solution:

# Enable debug logging
import logging
logging.basicConfig(level=logging.DEBUG)

# Validate configuration
from crawladapter.config_loader import load_config
config = load_config("path/to/config.yaml")

4. Port Conflicts

Problem: Ports already in use

Solution:

# Use different ports
client = ProxyClient(
    proxy_port=7891,  # Default: 7890
    api_port=9091     # Default: 9090
)

Performance Optimization

1. Health Check Optimization

from crawladapter.health_strategies import AdaptiveHealthStrategy

# Optimize health checking
strategy = AdaptiveHealthStrategy(
    base_interval=120,      # Check every 2 minutes
    max_concurrent=5,       # Limit concurrent checks
    timeout=15,             # Faster timeout
    retry_count=2           # Fewer retries
)

2. Memory Usage

# Limit node count for memory efficiency
client = ProxyClient()
await client.start(
    rules=["*.target.com"],
    max_nodes=50  # Limit to 50 best nodes
)

3. Network Optimization

# Configure for high-throughput scenarios
config = {
    'clash_config': {
        'mixed_port': 7890,
        'allow_lan': False,
        'mode': 'rule',
        'log_level': 'warning',  # Reduce logging overhead
        'external_controller': '127.0.0.1:9090'
    }
}

Debugging

Enable Detailed Logging

import logging

# Configure logging
logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)

# Enable specific module logging
logging.getLogger('crawladapter.health_checker').setLevel(logging.DEBUG)
logging.getLogger('crawladapter.process_manager').setLevel(logging.DEBUG)

Health Check Debugging

# Manual health check
from crawladapter.health_checker import HealthChecker
from crawladapter.health_strategies import BasicHealthStrategy

checker = HealthChecker(BasicHealthStrategy())
# 'proxies' is the collection of nodes to test; the second argument is the Clash API base URL
results = await checker.check_proxies(proxies, "http://127.0.0.1:9090")

for name, result in results.items():
    print(f"{name}: {'✅' if result.success else '❌'} ({result.response_time:.2f}ms)")

πŸ—οΈ Development and Contributing

Development Setup

# Clone the repository
git clone https://github.com/graceyangfan/CrawlAdapter.git
cd CrawlAdapter

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install development dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Run with coverage
pytest tests/ --cov=crawladapter --cov-report=html

Project Structure

CrawlAdapter/
├── crawladapter/           # Main package
│   ├── __init__.py        # Package exports
│   ├── client.py          # Main client interface
│   ├── simple_client.py   # Simplified client
│   ├── core.py            # Legacy compatibility
│   ├── fetchers.py        # Node fetching
│   ├── health_checker.py  # Health monitoring
│   ├── process_manager.py # Process management
│   ├── config_generator.py # Config generation
│   ├── managers.py        # Various managers
│   ├── rules.py           # Routing rules
│   ├── types.py           # Type definitions
│   └── exceptions.py      # Custom exceptions
├── examples/              # Usage examples
├── utils/                 # Utility tools
├── tests/                 # Test suite
├── requirements.txt       # Dependencies
├── setup.py              # Package setup
└── README.md             # This file

Contributing Guidelines

  1. Fork the repository and create a feature branch
  2. Write tests for new functionality
  3. Follow PEP 8 style guidelines
  4. Add documentation for new features
  5. Submit a pull request with clear description

Running Tests

# Run all tests
pytest

# Run specific test file
pytest tests/test_client.py

# Run with coverage
pytest --cov=crawladapter

# Run integration tests (requires clash binary)
pytest tests/integration/ -v

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments

  • Clash/Mihomo: The powerful proxy engine that powers CrawlAdapter
  • Community Contributors: Thanks to all contributors who help improve this project
  • Open Source Libraries: Built on top of excellent open source Python libraries

📞 Support


CrawlAdapter - Making web scraping proxy management simple and reliable.
