This server is a high-performance, distributed web scraping and crawling platform designed for AI-powered data extraction at scale. It provides:
- Concurrent scraping of multiple URLs using Playwright and FastAPI
- Distributed job queueing with Celery and Redis/Upstash
- Real-time job status tracking and resource monitoring
- Dynamic rate limiting and backpressure to adapt to system load
- API endpoints for submitting scrape jobs, checking status, and managing operations
- Metrics and observability via Prometheus integration
- Scalable deployment on Fly.io with multi-process and multi-worker support
- Support for screenshots, PDFs, and JavaScript execution via API
- Adaptive queue management and error tracking for robust operation
The server is suitable for large-scale, production-grade web crawling, data collection, and AI-driven content extraction tasks.
## Tech Stack

A high-performance, distributed web scraping service built with:

- Python
- FastAPI
- Playwright
- Upstash Redis
- Fly.io deployment
## Features

- Concurrent scraping of multiple URLs
- Distributed job queueing
- Scalable architecture
- Real-time job status tracking
## Prerequisites

- Python 3.11+
- Upstash Redis account
- Fly.io account
## Installation

1. Clone the repository.

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Set the Upstash Redis environment variables (a quick connectivity check is sketched after these steps):

   ```bash
   export UPSTASH_REDIS_REST_URL=your_redis_url
   export UPSTASH_REDIS_REST_TOKEN=your_redis_token
   ```

4. Run the API server:

   ```bash
   uvicorn server:app --reload
   ```
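Before starting the server, the credentials can be sanity-checked. Here is a minimal sketch assuming the `upstash-redis` Python SDK (the repo itself may talk to Redis differently); `Redis.from_env()` reads the two variables set above:

```python
# check_redis.py: connectivity check (assumes the upstash-redis package is installed)
from upstash_redis import Redis

# from_env() reads UPSTASH_REDIS_REST_URL and UPSTASH_REDIS_REST_TOKEN
redis = Redis.from_env()

# Round-trip a value to confirm the credentials work
redis.set("healthcheck", "ok")
assert redis.get("healthcheck") == "ok"
print("Upstash Redis connection OK")
```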
## Deployment

Deploy to Fly.io:

```bash
fly launch
```
## API Endpoints

- `POST /scrape` — Submit a scraping job (with URL(s) and options)
- `POST /llm/job` — Submit an LLM extraction job
- `GET /user/data` — Get user data (requires authentication)
- `POST /config/dump` — Evaluate and dump config
- `GET /ws/events` — WebSocket endpoint for real-time events
- `GET /llm/job/{task_id}` — Get LLM job status/result
- `GET /crawl/job/{task_id}` — Get crawl job status/result
- `POST /crawl/job/{task_id}/cancel` — Cancel a running crawl job
- `GET /metrics` — Prometheus metrics endpoint
- `GET /health` — Health check endpoint
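A typical client flow is to submit a job and poll the matching status endpoint. Below is a minimal sketch using `requests`; the payload shape, the `task_id`/`status` field names, and the port are assumptions, so check the request models in `server.py` for the real contract:

```python
# submit_and_poll.py: sketch of a client round trip (field names are assumptions)
import time

import requests

BASE = "http://localhost:8000"

# Submit a scraping job; we assume the response includes a task_id
resp = requests.post(f"{BASE}/scrape", json={"urls": ["https://example.com"]})
resp.raise_for_status()
task_id = resp.json()["task_id"]

# Poll the crawl-job status endpoint until the job settles
while True:
    job = requests.get(f"{BASE}/crawl/job/{task_id}").json()
    if job.get("status") in ("completed", "failed"):
        break
    time.sleep(2)

print(job)
```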
## Configuration

Adjust concurrency and worker settings in `worker.py`; a sketch of the typical knobs follows.
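For orientation, here is a sketch of the kind of Celery settings such a worker module typically exposes. The option names below are standard Celery configuration keys, but the values, the app name, and the `REDIS_URL` variable are illustrative, not necessarily what `worker.py` actually uses:

```python
# worker.py (illustrative sketch, not the repo's actual settings)
import os

from celery import Celery

# Broker URL is a placeholder; Celery expects a redis:// connection string
app = Celery("scraper", broker=os.environ.get("REDIS_URL", "redis://localhost:6379/0"))

app.conf.update(
    worker_concurrency=4,          # parallel worker processes per machine
    worker_prefetch_multiplier=1,  # pull one task at a time for long-running scrapes
    task_acks_late=True,           # re-queue tasks if a worker dies mid-job
)
```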