A high-performance web crawler designed to extract product URLs from e-commerce websites. Built with Python, asynchronous I/O, and PostgreSQL for scalable data storage.
This crawler efficiently scrapes product URLs from websites using:
- Hybrid Fetching: Static pages (HTTP) + dynamic pages (Playwright browser automation).
- Priority Queue: Prioritizes URLs matching product page patterns (e.g., `/p/`, `/dp/`, `p-`, `/buy`).
- Bloom Filter: Tracks visited URLs to avoid redundant requests.
- Rate Limiting: Respects domain crawl delays (configurable per domain).
- PostgreSQL: Stores extracted URLs with domain metadata.
- Asynchronous Crawling: Leverages `asyncio` for concurrent requests.
- JavaScript Rendering: Uses Playwright for Single-Page Applications (SPAs).
- Metrics Tracking: Monitors URLs crawled, errors, and product URLs found.
- Resilient Retries: Auto-retry of failed requests with exponential backoff.
- Dockerized: Preconfigured PostgreSQL and app containers.
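The priority-queue idea can be pictured with a short sketch. This is illustrative only: the class and function names are not the project's actual API, and the patterns simply mirror the examples in the feature list above.

```python
import asyncio

# Hypothetical sketch of product-pattern prioritization; not the real PriorityFrontier API.
PRODUCT_PATTERNS = ("/p/", "/dp/", "p-", "/buy")  # patterns from the feature list above

def url_priority(url: str) -> int:
    """Lower value = higher priority; product-looking URLs jump the queue."""
    return 0 if any(p in url for p in PRODUCT_PATTERNS) else 2

async def demo() -> None:
    frontier: asyncio.PriorityQueue = asyncio.PriorityQueue()
    for url in ("https://shop.example/dp/B0TEST", "https://shop.example/about-us"):
        await frontier.put((url_priority(url), url))
    while not frontier.empty():
        priority, url = await frontier.get()
        print(priority, url)  # the product-looking URL comes out first

asyncio.run(demo())
```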
```mermaid
graph TD
    A[User Request] --> B[FastAPI]
    B --> C{New Crawl?}
    C -->|Yes| D[WebScraper]
    D --> E[PriorityFrontier]
    E --> F[DomainRateLimiter]
    F --> G{Browser Required?}
    G -->|Yes| H[BrowserPool]
    G -->|No| I[aioHTTP]
    H --> J[Playwright]
    I & J --> K[Content Parser]
    K --> L[VisitedURLTracker]
    L --> M[Product Detection]
    M -->|Yes| N[AsyncPostgres]
    M -->|No| O[Link Extractor]
    O --> E
    N --> P[(PostgreSQL)]
    P --> Q[Metrics]
```
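The "Browser Required?" branch above decides between a plain HTTP fetch and a full browser render. A minimal sketch of that decision, assuming aiohttp for the static path and Playwright's async API for the dynamic one (the function signature and the `needs_browser` flag are illustrative, not the real HybridFetcher interface):

```python
import aiohttp
from playwright.async_api import async_playwright

# Illustrative sketch only; the real HybridFetcher's heuristics may differ.
async def fetch(url: str, needs_browser: bool) -> str:
    if not needs_browser:
        # Static path: plain HTTP GET via aiohttp
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as resp:
                return await resp.text()
    # Dynamic path: render JavaScript with a headless Chromium instance
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url)
        html = await page.content()
        await browser.close()
        return html
```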
| Component | Description |
|---|---|
| `WebScraper` | Orchestrates crawling, URL processing, and database batch inserts. |
| `PriorityFrontier` | Manages URL queues (high/medium/low priority) for efficient crawling. |
| `BrowserPool` | Pool of Chromium instances for parallel headless browsing. |
| `VisitedURLTracker` | Uses a Bloom filter + in-memory set to track visited URLs. |
| `DomainRateLimiter` | Enforces per-domain crawl delays to avoid IP bans. |
| `AsyncPostgres` | Async PostgreSQL client for bulk inserts and connection pooling. |
| `ETL` | Initializes database tables and manages schema migrations. |
| `HybridFetcher` | Decides between static (aioHTTP) and dynamic (Playwright) fetching strategies. |
| `CrawlerMetrics` | Tracks and reports key performance indicators (URLs crawled, errors, product URLs detected). |
| `ProductURLsManagement` | Manages PostgreSQL table operations for storing product URLs and domain metadata. |
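For `VisitedURLTracker`, a hand-rolled stand-in shows the "Bloom filter + in-memory set" idea from the table; the real component may rely on a dedicated Bloom filter library instead, and the sizes below are arbitrary.

```python
import hashlib

# Simplified stand-in for the Bloom filter + set combination described above.
class VisitedURLTracker:
    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)
        self.exact = set()  # in-memory set for exact membership

    def _positions(self, url: str):
        # Derive several bit positions from independent hashes of the URL
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{url}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, url: str) -> None:
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)
        self.exact.add(url)

    def seen(self, url: str) -> bool:
        # The Bloom filter rules out most unseen URLs without touching the set
        if not all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(url)):
            return False
        return url in self.exact

tracker = VisitedURLTracker()
tracker.add("https://shop.example/p/123")
print(tracker.seen("https://shop.example/p/123"))  # True
print(tracker.seen("https://shop.example/p/999"))  # False
```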
```mermaid
graph TD
    subgraph User["User Interface"]
        API[FastAPI Endpoints]
        Health[Health Check]
        MetricsAPI[Metrics Dashboard]
    end

    subgraph Core["Core Components"]
        WS[WebScraper]
        PQ[PriorityFrontier]
        BF[VisitedURLTracker]
        RL[DomainRateLimiter]
        HM[HybridFetcher]
        MET[CrawlerMetrics]
    end

    subgraph Browsers["Browser Management"]
        BP[BrowserPool]
        PW[Playwright]
        subgraph Fetch["Fetch Strategies"]
            Static[Static HTTP]
            Dynamic[Dynamic JS]
        end
    end

    subgraph Storage["Data Storage"]
        PG[(PostgreSQL)]
        subgraph Tables
            URLs[product_urls]
            Domains[domain_metadata]
        end
    end

    subgraph Processing["Data Processing"]
        Parser[HTML Parser]
        Extractor[Link Extractor]
        Validator[URL Validator]
        ProdDetect[Product Detector]
    end

    %% Data Flow
    API -->|POST /crawl| WS
    Health -->|GET /| WS
    WS -->|URL management| PQ
    WS -->|Rate limiting| RL
    WS -->|Fetch strategy| HM
    HM -->|Static| Static
    HM -->|Dynamic| BP
    BP --> PW
    PW --> Dynamic
    Static --> Parser
    Dynamic --> Parser
    Parser --> ProdDetect
    ProdDetect -->|Yes| URLs
    ProdDetect -->|No| Extractor
    Extractor --> Validator
    Validator -->|New URLs| PQ
    Validator -->|URL check| BF
    WS -->|Store metrics| MET
    MET --> Domains
    WS -->|Batch insert| URLs

    %% Error Handling
    WS -.->|Error logging| EL[(Error Logs)]
    BP -.->|Crash recovery| EL

    %% Styles
    classDef primary fill:#2563eb,stroke:#1e40af,color:white
    classDef secondary fill:#4b5563,stroke:#374151,color:white
    classDef storage fill:#065f46,stroke:#064e3b,color:white
    classDef browser fill:#7c3aed,stroke:#6d28d9,color:white
    classDef processing fill:#d97706,stroke:#b45309,color:white

    class API,WS,MET primary
    class PQ,BF,RL,HM,Parser,Extractor,Validator,ProdDetect secondary
    class PG,URLs,Domains,EL storage
    class BP,PW,Static,Dynamic browser
    class Processing processing
```
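The Parser, Product Detector, and Link Extractor path in the diagram might look roughly like the sketch below, assuming BeautifulSoup for HTML parsing; the project's actual parser and detection rules may differ, and the pattern list simply reuses the examples from the feature list.

```python
from typing import List
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup

PRODUCT_PATTERNS = ("/p/", "/dp/", "p-", "/buy")

def extract_links(base_url: str, html: str) -> List[str]:
    """Collect absolute http(s) links from a page."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        absolute = urljoin(base_url, anchor["href"])
        if urlparse(absolute).scheme in ("http", "https"):
            links.append(absolute)
    return links

def is_product_url(url: str) -> bool:
    """Very rough product-page heuristic based on URL patterns."""
    return any(pattern in url for pattern in PRODUCT_PATTERNS)

html = '<a href="/p/item-1">Item</a> <a href="/about">About</a>'
links = extract_links("https://shop.example", html)
print([u for u in links if is_product_url(u)])  # ['https://shop.example/p/item-1']
```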
**Prerequisites**

- Python 3.8+
- PostgreSQL 13+
- Playwright browsers
**Run locally**

1. Clone the repository:

   ```bash
   git clone https://github.com/makkarss929/WEB_CRAWLER.git
   ```

2. Create and activate a virtual environment:

   ```bash
   python -m venv venv
   source venv/bin/activate   # Linux/Mac
   # or: .\venv\Scripts\activate   # Windows
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Install Playwright browsers:

   ```bash
   playwright install
   ```

5. Create a `.env` file with database configuration:

   ```env
   DATABASE_HOST=localhost
   DATABASE_PORT=5432
   DATABASE_USER=admin
   DATABASE_PASSWORD=your_password
   DATABASE_NAME=crawler_db
   ```

6. Start the app:

   ```bash
   python -u app.py
   ```
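To verify that the `.env` values are being picked up, a small sketch, assuming `python-dotenv` is available; the app's actual config loading may differ:

```python
import os
from dotenv import load_dotenv  # assumption: python-dotenv is installed

load_dotenv()  # reads the .env file created in the previous step

# Assemble a PostgreSQL connection string from the documented variables
dsn = (
    f"postgresql://{os.getenv('DATABASE_USER')}:{os.getenv('DATABASE_PASSWORD')}"
    f"@{os.getenv('DATABASE_HOST')}:{os.getenv('DATABASE_PORT')}/{os.getenv('DATABASE_NAME')}"
)
print(dsn)
```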
**Run with Docker**

1. Make sure Docker and Docker Compose are installed.
2. Create a `.env` file as described above.
3. Build and start the containers:

   ```bash
   docker compose up -d
   ```
**Usage example**

```python
import requests

# Start a crawl
response = requests.post(
    "http://localhost:5001/crawl",
    json={"domains": ["https://www.flipkart.com/"]},
)

# Check health
health = requests.get("http://localhost:5001/")
```
**API endpoints**

- `GET /` - Health check endpoint.
  - Returns: `{"status": "ok", "active_scrapers": <count>}`
- `POST /crawl` - Start crawling the specified domains.
  - Body: `{"domains": ["domain1.com", "domain2.com"]}`
  - Returns: `{"metrics": {"urls_crawled": 0, "product_urls": 0, "avg_response_time": 0, "error_rate": 0}}`
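Assuming the response body matches the shape documented above, the returned metrics can be read straight from the JSON:

```python
import requests

response = requests.post(
    "http://localhost:5001/crawl",
    json={"domains": ["https://www.flipkart.com/"]},
)
metrics = response.json()["metrics"]
print(f"crawled={metrics['urls_crawled']} "
      f"products={metrics['product_urls']} "
      f"error_rate={metrics['error_rate']}")
```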
**Planned improvements**

- Distributed Task Queue: Use RabbitMQ for message brokering and Celery workers for parallel task processing.
- Visited URL Tracking: Currently an in-memory Bloom filter + set; planned upgrade to RedisBloom.
- Flower Dashboard: Real-time monitoring of Celery tasks.
- Kubernetes Deployment: Auto-scaling.
This project is licensed under the MIT License - see the LICENSE file for details.