Project Status: CLEANED & OPTIMIZED
Recently Cleaned: Removed unnecessary files, optimized the project structure, and enhanced documentation for better performance and maintainability.
August 2024 Update: The Kaler Kantho English edition has been discontinued, and its spider is disabled (renamed with a .disabled extension). Only Bangla content remains at kalerkantho.com.
# 1. Clone and setup
git clone https://github.com/EhsanulHaqueSiam/BDNewsPaperScraper.git
cd BDNewsPaperScraper
chmod +x setup.sh && ./setup.sh --all
# 2. Test with fastest spider
uv run scrapy crawl prothomalo -s CLOSESPIDER_ITEMCOUNT=10
# 3. Run optimized batch (RECOMMENDED)
chmod +x run_spiders_optimized.sh
./run_spiders_optimized.sh prothomalo --monitor
# 4. Check results
./toxlsx.py --list
# 5. Export data
./toxlsx.py --output news_data.xlsx
# 1. Clone and setup (Command Prompt or PowerShell)
git clone https://github.com/EhsanulHaqueSiam/BDNewsPaperScraper.git
cd BDNewsPaperScraper
uv sync
# 2. Test with fastest spider
uv run scrapy crawl prothomalo -s CLOSESPIDER_ITEMCOUNT=10
# 3. Run optimized batch (RECOMMENDED) - Use Python script
python run_spiders_optimized.py prothomalo --monitor
# 4. Check results
python toxlsx.py --list
# 5. Export data
python toxlsx.py --output news_data.xlsx
# All spiders support date filtering!
uv run scrapy crawl prothomalo -a start_date=2024-08-01 -a end_date=2024-08-31
python run_spiders_optimized.py --start-date 2024-08-01 --end-date 2024-08-31
- Python 3.9+ - Modern Python support
- UV Package Manager - Ultra-fast dependency management
- Git - For cloning the repository
git clone https://github.com/EhsanulHaqueSiam/BDNewsPaperScraper.git
cd BDNewsPaperScraper
# Install UV package manager
curl -LsSf https://astral.sh/uv/install.sh | sh
# Reload shell or restart terminal
source ~/.bashrc # or ~/.zshrc for zsh
# Automatic setup (recommended)
chmod +x setup.sh
./setup.sh --all
# OR Manual setup
uv venv --python 3.11
source .venv/bin/activate # On Windows: .venv\Scripts\activate
uv sync
# Install UV if not already installed (PowerShell - run as administrator)
# Option 1: Using PowerShell
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
# Option 2: Using Python pip
pip install uv
# Manual setup (recommended for Windows)
uv venv --python 3.11
.venv\Scripts\activate
uv sync
# OR if you have WSL (Windows Subsystem for Linux)
# Follow the Linux/macOS instructions in WSL
# Check if spiders are available
uv run scrapy list
# Should show:
# BDpratidin
# bangladesh_today
# dailysun
# ittefaq
# prothomalo
# thedailystar
# Test run a single spider to verify everything works
uv run scrapy crawl prothomalo -s CLOSESPIDER_ITEMCOUNT=5
The BDNewsPaper scraper provides 16 different methods to run the project, covering every possible use case:
Method | Use Case | Complexity | Best For |
---|---|---|---|
Method 1: Individual Commands | Development, Testing | ⭐ | Learning, debugging |
Method 2: Enhanced Batch Runner | Production | ⭐⭐ | RECOMMENDED |
Method 3: Selective Running | Targeted scraping | ⭐⭐ | Specific needs |
Method 4: Development & Testing | Debug, development | ⭐⭐ | Development workflow |
Method 5: Scheduled/Cron | Automation | ⭐⭐⭐ | Production automation |
Method 6: Python Scripts | Custom automation | ⭐⭐⭐ | Custom workflows |
Method 7: Container/Docker | Containerized | ⭐⭐⭐⭐ | Cloud deployment |
Method 8: Virtual Environment | Direct execution | ⭐⭐ | Speed optimization |
Method 9: IDE Integration | Development | ⭐⭐ | IDE users |
Method 10: System Service | Background service | ⭐⭐⭐⭐ | Server deployment |
Method 11: Environment-Specific | Multi-environment | ⭐⭐⭐ | Dev/staging/prod |
Method 12: Multi-Instance Parallel | High performance | ⭐⭐⭐⭐⭐ | Maximum speed |
Method 13: Makefile | Build automation | ⭐⭐⭐ | Build systems |
Method 14: CI/CD Pipeline | Automated deployment | ⭐⭐⭐⭐⭐ | DevOps |
Method 15: Remote/Cloud | Cloud execution | ⭐⭐⭐⭐ | Cloud platforms |
Method 16: API/Webhook | Event-driven | ⭐⭐⭐⭐⭐ | Microservices |
1. Enhanced Batch Runner (./run_spiders_optimized.sh)
   - Best performance, monitoring, logging
   - Recommended for 95% of users
2. Individual Commands (uv run scrapy crawl spider)
   - Perfect for development and testing
   - Most flexible for custom settings
3. Scheduled Cron Jobs (cron + optimized runner)
   - Ideal for automated daily/hourly runs
   - Production automation
For Developers:
- Development: Method 1 (Individual Commands)
- Testing: Method 4 (Development & Testing)
- IDE Integration: Method 9
For Production:
- Standard: Method 2 (Enhanced Batch Runner)
- Automation: Method 5 (Scheduled/Cron)
- High Performance: Method 12 (Multi-Instance)
For Cloud/Enterprise:
- Containers: Method 7 (Docker)
- CI/CD: Method 14 (Pipeline)
- Microservices: Method 16 (API/Webhook)
For System Administrators:
- Background Service: Method 10 (System Service)
- Remote Execution: Method 15 (Remote/Cloud)
- Build Systems: Method 13 (Makefile)
# ULTIMATE PERFORMANCE: Multi-instance + Monitoring + Cron
# Terminal 1-3 (parallel execution)
./run_spiders_optimized.sh prothomalo --monitor &
./run_spiders_optimized.sh dailysun --monitor &
./run_spiders_optimized.sh ittefaq --monitor &
# ULTIMATE AUTOMATION: Docker + CI/CD + Webhook
# Containerized, automated, event-driven execution
# ULTIMATE RELIABILITY: System Service + Monitoring
# Background service with performance tracking
This project now provides full Windows support through a cross-platform Python runner script (run_spiders_optimized.py) that offers the same features as the Linux/macOS bash script.
- Install Prerequisites
  # Install Python 3.9+ from python.org
  # Install Git from git-scm.com
  # Install UV package manager (PowerShell as administrator):
  powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
- Clone and Setup
  git clone https://github.com/EhsanulHaqueSiam/BDNewsPaperScraper.git
  cd BDNewsPaperScraper
  uv sync
- Test Run
  # Basic test (minimal output with UV)
  uv run scrapy crawl prothomalo -s CLOSESPIDER_ITEMCOUNT=10
  # Better visibility (shows scraping progress)
  uv run scrapy crawl prothomalo -L INFO -s CLOSESPIDER_ITEMCOUNT=10
- Production Run
  # Best option for Windows (full visibility)
  python run_spiders_optimized.py prothomalo --monitor
💡 Windows Tip: If uv run shows only "Bytecode compiled" and no scraping info, use the -L INFO flag or switch to the Python runner for better visibility!
The Python script provides identical functionality to the bash script but works on Windows:
# Cross-platform runner that works on Windows, macOS, and Linux
python run_spiders_optimized.py [spider_name] [--monitor] [--start-date YYYY-MM-DD] [--end-date YYYY-MM-DD]
# Cross-platform runner that works on Windows, macOS, and Linux
python run_spiders_optimized.py [spider_name] [--monitor] [--start-date YYYY-MM-DD] [--end-date YYYY-MM-DD]
# OR use the Windows batch file for easier access
run_spiders_optimized.bat [spider_name] [--monitor] [--start-date YYYY-MM-DD] [--end-date YYYY-MM-DD]
Python Script Examples:
# Run all spiders with optimized settings
python run_spiders_optimized.py
# Run specific spider
python run_spiders_optimized.py prothomalo
python run_spiders_optimized.py dailysun
# Run with performance monitoring
python run_spiders_optimized.py --monitor
python run_spiders_optimized.py prothomalo --monitor
# Date range filtering
python run_spiders_optimized.py --start-date 2024-01-01 --end-date 2024-01-31
python run_spiders_optimized.py prothomalo --start-date 2024-08-01 --end-date 2024-08-31
# Combined options
python run_spiders_optimized.py dailysun --monitor --start-date 2024-08-01
# Get help
python run_spiders_optimized.py --help
Windows Batch File Examples:
# Easier syntax using the .bat wrapper
run_spiders_optimized.bat
run_spiders_optimized.bat prothomalo --monitor
run_spiders_optimized.bat --start-date 2024-08-01 --end-date 2024-08-31
# Run specific spiders directly
uv run scrapy crawl prothomalo
uv run scrapy crawl dailysun
uv run scrapy crawl ittefaq
uv run scrapy crawl bdpratidin
uv run scrapy crawl thebangladeshtoday
uv run scrapy crawl thedailystar
# WINDOWS TIP: Add -L INFO to see scraping progress (UV can be quiet)
uv run scrapy crawl prothomalo -L INFO
uv run scrapy crawl dailysun -L INFO -s CLOSESPIDER_ITEMCOUNT=10
# With date filtering
uv run scrapy crawl prothomalo -a start_date=2024-01-01 -a end_date=2024-01-31 -L INFO
uv run scrapy crawl dailysun -a start_date=2024-08-01 -L INFO
# With custom settings (always include -L INFO for visibility)
uv run scrapy crawl ittefaq -L INFO -s CLOSESPIDER_ITEMCOUNT=100 -s DOWNLOAD_DELAY=2
# Check scraped data
python toxlsx.py --list
# Export to Excel
python toxlsx.py --output news_data.xlsx
# Export specific newspaper
python toxlsx.py --paper "ProthomAlo" --output prothomalo.xlsx
# Export to CSV
python toxlsx.py --format csv --output news_data.csv
# Export with limits
python toxlsx.py --limit 100 --output recent_news.xlsx
# Run PowerShell as Administrator
# Install UV
irm https://astral.sh/uv/install.ps1 | iex
# Clone and setup project
git clone https://github.com/EhsanulHaqueSiam/BDNewsPaperScraper.git
cd BDNewsPaperScraper
uv sync
# Test run
uv run scrapy crawl prothomalo -s CLOSESPIDER_ITEMCOUNT=5
# Install UV via pip (if PowerShell not available)
pip install uv
# Clone and setup project
git clone https://github.com/EhsanulHaqueSiam/BDNewsPaperScraper.git
cd BDNewsPaperScraper
uv sync
# Test run
uv run scrapy crawl prothomalo -s CLOSESPIDER_ITEMCOUNT=5
# Install WSL first, then follow Linux instructions
wsl --install Ubuntu
# Restart computer
wsl
# Follow Linux/macOS instructions inside WSL
- Open Task Scheduler
- Create Basic Task
- Set trigger (daily, weekly, etc.)
- Set action to run:
python run_spiders_optimized.py
- Set working directory to project folder
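Equivalently, the task can be registered from an elevated Command Prompt with the built-in schtasks tool. A minimal sketch, where the install path and the "BDNewsScraper" task name are placeholders you would adjust:

# Sketch: register a daily 2 AM run (adjust the path; the task name is just an example)
schtasks /Create /TN "BDNewsScraper" /SC DAILY /ST 02:00 /TR "cmd /c cd /d C:\path\to\BDNewsPaperScraper && python run_spiders_optimized.py"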
# Save as daily_scrape.ps1
Set-Location "C:\path\to\BDNewsPaperScraper"
# Run fast spiders
& python run_spiders_optimized.py prothomalo --monitor
& python run_spiders_optimized.py dailysun --monitor
# Export data
& python toxlsx.py --output "daily_news_$(Get-Date -Format 'yyyyMMdd').xlsx"
Write-Output "Daily scraping completed: $(Get-Date)"
# Use Windows Defender exclusions for better performance
# Add project folder to Windows Defender exclusions
# Set high priority for scraping process (CMD as administrator)
wmic process where name="python.exe" call setpriority "high priority"
# Use SSD storage for better database performance
# Ensure adequate RAM (8GB+ recommended for all spiders)
# Adjust concurrent requests for Windows
uv run scrapy crawl prothomalo -s CONCURRENT_REQUESTS=32 -s DOWNLOAD_DELAY=0.5
# Use Windows-friendly log levels
uv run scrapy crawl dailysun -L INFO
# Windows path-safe output files
python toxlsx.py --output "news_data_%date:~10,4%%date:~4,2%%date:~7,2%.xlsx"
Issue: On Windows, uv run often shows only "Bytecode compiled" and minimal output, making it hard to see scraping progress.
Solutions:
- Use Explicit Log Levels (Recommended)
  # Force INFO level logging to see scraping progress
  uv run scrapy crawl prothomalo -L INFO
  uv run scrapy crawl dailysun -L INFO -s CLOSESPIDER_ITEMCOUNT=10
  # For detailed debugging output
  uv run scrapy crawl prothomalo -L DEBUG -s CLOSESPIDER_ITEMCOUNT=5
  # For minimal output (only warnings/errors)
  uv run scrapy crawl prothomalo -L WARNING
- Use the Python Runner (Best for Windows)
  # Python script shows full output by default
  python run_spiders_optimized.py prothomalo
  python run_spiders_optimized.py --monitor              # Shows real-time progress
  # Even better - shows live statistics and progress bars
  python run_spiders_optimized.py prothomalo --monitor
- Direct Scrapy Commands (Without UV)
  # Activate virtual environment first
  .venv\Scripts\activate
  # Run scrapy directly (shows full output)
  scrapy crawl prothomalo -L INFO
  scrapy crawl dailysun -L INFO -s CLOSESPIDER_ITEMCOUNT=10
  # Deactivate when done
  deactivate
- Force Verbose Output with UV
  # Use verbose flags to force output
  uv run --verbose scrapy crawl prothomalo -L INFO
  # Combine with log level and item count for testing
  uv run scrapy crawl prothomalo -L INFO -s CLOSESPIDER_ITEMCOUNT=20
- Monitor Log Files in Real-Time
  # Windows equivalent of tail -f (PowerShell)
  # Terminal 1: Start spider
  uv run scrapy crawl prothomalo -L INFO
  # Terminal 2: Monitor logs (PowerShell)
  Get-Content logs\prothomalo_*.log -Wait -Tail 20
  # OR using Command Prompt with tail equivalent
  powershell "Get-Content logs\prothomalo_*.log -Wait -Tail 20"
For Development/Testing:
# Always use explicit log levels and limits for testing
uv run scrapy crawl prothomalo -L INFO -s CLOSESPIDER_ITEMCOUNT=10
# Use Python runner for better Windows experience
python run_spiders_optimized.py prothomalo --monitor
# Monitor in real-time (separate terminal)
powershell "Get-Content logs\*.log -Wait -Tail 50"
For Production:
# Use Python runner with monitoring (recommended)
python run_spiders_optimized.py --monitor
# Or use UV with explicit logging to file
uv run scrapy crawl prothomalo -L INFO > scraping.log 2>&1
# Monitor progress
powershell "Get-Content scraping.log -Wait -Tail 30"
Quick Progress Check:
# Check how many articles have been scraped so far
python toxlsx.py --list
# Check database directly
sqlite3 news_articles.db "SELECT COUNT(*) FROM articles;"
sqlite3 news_articles.db "SELECT COUNT(*) FROM articles WHERE paper_name = 'ProthomAlo';"
If UV continues to show minimal output, use these alternatives:
- Virtual Environment Method (Most reliable)
  # One-time setup per session
  .venv\Scripts\activate
  # Run commands directly (full output)
  scrapy crawl prothomalo -L INFO
  scrapy crawl dailysun -L INFO -s CLOSESPIDER_ITEMCOUNT=50
  python performance_monitor.py
  # When done
  deactivate
- Python Runner Method (Recommended)
  # Always shows full output and progress
  python run_spiders_optimized.py prothomalo
  python run_spiders_optimized.py --monitor    # Best visibility
  python run_spiders_optimized.py --help       # See all options
- Batch File Method (Easiest)
  # Use the included .bat file
  run_spiders_optimized.bat prothomalo
  run_spiders_optimized.bat --monitor
Issue | Solution |
---|---|
UV shows only "Bytecode compiled" | Use -L INFO flag or switch to Python runner |
Can't see scraping progress | Use python run_spiders_optimized.py --monitor |
'uv' is not recognized | Add UV to PATH or reinstall UV |
Permission denied | Run Command Prompt/PowerShell as Administrator |
SSL certificate verify failed | Update certificates: pip install --upgrade certifi |
ModuleNotFoundError | Run uv sync in project directory |
Access denied to file | Close Excel/other programs using the file |
No output visible | Use explicit log levels: -L INFO or -L DEBUG |
# Check UV installation
uv --version
# Check Python installation
python --version
# Check if Scrapy is available
uv run scrapy version
# Reset virtual environment (if issues)
rmdir /s .venv
uv venv --python 3.11
.venv\Scripts\activate
uv sync
# View logs (Windows)
type logs\prothomalo_*.log
type scrapy.log
# Monitor running processes
tasklist | findstr python
Feature | Windows | Linux/macOS | Notes |
---|---|---|---|
Runner Script | python run_spiders_optimized.py | ./run_spiders_optimized.sh | Same functionality |
Performance | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Slightly slower on Windows |
Automation | Task Scheduler | Cron jobs | Both work well |
Setup | UV + Python | UV + bash | UV works on all platforms |
Monitoring | ✅ Full support | ✅ Full support | Identical features |
Date Filtering | ✅ Full support | ✅ Full support | Identical syntax |
Export Tools | ✅ Full support | ✅ Full support | Same output formats |
The run_spiders_optimized.py script provides:
- ✅ Cross-platform compatibility - Works on Windows, macOS, and Linux
- ✅ All bash script features - Monitoring, logging, progress tracking
- ✅ Same performance optimizations - 64 concurrent requests, smart throttling
- ✅ Windows-native experience - No need for WSL or bash emulation
- ✅ Identical command-line interface - Same arguments and options
- ✅ Real-time output - Live progress and logging
- ✅ Error handling - Robust error detection and reporting
Windows users get the exact same experience as Linux/macOS users!
# Run specific newspapers one by one
uv run scrapy crawl prothomalo # Fastest (API-based)
uv run scrapy crawl dailysun # Enhanced extraction
uv run scrapy crawl ittefaq # Robust pagination
uv run scrapy crawl BDpratidin # Bengali date handling
uv run scrapy crawl bangladesh_today # Multi-format support
uv run scrapy crawl thedailystar # Legacy archive support
# With custom limits and settings
uv run scrapy crawl prothomalo -s CLOSESPIDER_ITEMCOUNT=100 # Limit to 100 articles
uv run scrapy crawl dailysun -s DOWNLOAD_DELAY=2 # Add 2s delay
uv run scrapy crawl ittefaq -s CONCURRENT_REQUESTS=32 # More concurrent requests
# DATE RANGE FILTERING (All Spiders Support This!)
# Scrape articles from specific date ranges
uv run scrapy crawl prothomalo -a start_date=2024-01-01 -a end_date=2024-01-31 # January 2024
uv run scrapy crawl dailysun -a start_date=2024-06-01 -a end_date=2024-06-30 # June 2024
uv run scrapy crawl ittefaq -a start_date=2024-08-01 # From Aug 1 to today
uv run scrapy crawl BDpratidin -a start_date=2024-01-01 -a end_date=2024-12-31 # Entire 2024
uv run scrapy crawl bangladesh_today -a start_date=2024-03-01 -a end_date=2024-03-31 # March 2024
uv run scrapy crawl thedailystar -a start_date=2024-07-01 -a end_date=2024-07-31 # July 2024
# DATE FORMAT: YYYY-MM-DD (ISO format)
# If only start_date is provided, end_date defaults to today
# If only end_date is provided, start_date uses the spider default (usually 6 months back)
# COMBINE DATE FILTERING WITH OTHER OPTIONS
uv run scrapy crawl prothomalo -a start_date=2024-01-01 -a end_date=2024-01-31 -s CLOSESPIDER_ITEMCOUNT=50
uv run scrapy crawl dailysun -a start_date=2024-06-01 -a categories="national,sports" -s DOWNLOAD_DELAY=1
# Make executable first
chmod +x run_spiders_optimized.sh
# Run all spiders with optimized settings
./run_spiders_optimized.sh
# Run specific spider only
./run_spiders_optimized.sh prothomalo
./run_spiders_optimized.sh dailysun
./run_spiders_optimized.sh ittefaq
# Run with performance monitoring
./run_spiders_optimized.sh --monitor
./run_spiders_optimized.sh prothomalo --monitor
# DATE RANGE FILTERING with Enhanced Runner
# Run all spiders for specific date range
./run_spiders_optimized.sh --start-date 2024-01-01 --end-date 2024-01-31
# Run specific spider with date filtering
./run_spiders_optimized.sh prothomalo --start-date 2024-06-01 --end-date 2024-06-30
# Run with both monitoring and date filtering
./run_spiders_optimized.sh --monitor --start-date 2024-08-01 --end-date 2024-08-31
./run_spiders_optimized.sh prothomalo --monitor --start-date 2024-08-01
# Get help and see all options
./run_spiders_optimized.sh --help
# Run all spiders with optimized settings
python run_spiders_optimized.py
# Run specific spider only
python run_spiders_optimized.py prothomalo
python run_spiders_optimized.py dailysun
python run_spiders_optimized.py ittefaq
# Run with performance monitoring
python run_spiders_optimized.py --monitor
python run_spiders_optimized.py prothomalo --monitor
# DATE RANGE FILTERING with Enhanced Runner
# Run all spiders for specific date range
python run_spiders_optimized.py --start-date 2024-01-01 --end-date 2024-01-31
# Run specific spider with date filtering
python run_spiders_optimized.py prothomalo --start-date 2024-06-01 --end-date 2024-06-30
# Run with both monitoring and date filtering
python run_spiders_optimized.py --monitor --start-date 2024-08-01 --end-date 2024-08-31
python run_spiders_optimized.py prothomalo --monitor --start-date 2024-08-01
# Get help and see all options
python run_spiders_optimized.py --help
Both Linux/macOS and Windows versions support the same spiders:
- prothomalo - ProthomAlo (API-based, fastest)
- bdpratidin - BD Pratidin (Bengali handling)
- dailysun - Daily Sun (enhanced extraction)
- ittefaq - Daily Ittefaq (robust pagination)
- thebangladeshtoday - Bangladesh Today (multi-format)
- thedailystar - The Daily Star (legacy support)
# Run only fast spiders (API-based)
uv run scrapy crawl prothomalo
# Run only specific categories
uv run scrapy crawl ittefaq
uv run scrapy crawl dailysun
uv run scrapy crawl BDpratidin
# Run with specific parameters and date ranges
uv run scrapy crawl bangladesh_today -a start_date=2024-01-01 -a end_date=2024-01-31 -s CLOSESPIDER_ITEMCOUNT=50
# DATE-SPECIFIC SCRAPING EXAMPLES
# Last week's news
uv run scrapy crawl prothomalo -a start_date=2024-08-22 -a end_date=2024-08-29
# Monthly archives
uv run scrapy crawl dailysun -a start_date=2024-01-01 -a end_date=2024-01-31 # January
uv run scrapy crawl ittefaq -a start_date=2024-02-01 -a end_date=2024-02-29 # February
uv run scrapy crawl thedailystar -a start_date=2024-03-01 -a end_date=2024-03-31 # March
# Quarterly reports
uv run scrapy crawl BDpratidin -a start_date=2024-01-01 -a end_date=2024-03-31 # Q1 2024
uv run scrapy crawl bangladesh_today -a start_date=2024-04-01 -a end_date=2024-06-30 # Q2 2024
# Test run with minimal data
uv run scrapy crawl prothomalo -s CLOSESPIDER_ITEMCOUNT=5 -L DEBUG
# Monitor performance during run
uv run python performance_monitor.py &
uv run scrapy crawl dailysun
# Run with custom log levels
uv run scrapy crawl ittefaq -L INFO # Less verbose
uv run scrapy crawl BDpratidin -L ERROR # Only errors
The optimized runner script provides the most comprehensive way to run spiders with performance monitoring, logging, and advanced options.
# Make executable (one time only)
chmod +x run_spiders_optimized.sh
# Run all spiders with optimized settings
./run_spiders_optimized.sh
# Run specific spider
./run_spiders_optimized.sh prothomalo
./run_spiders_optimized.sh dailysun
./run_spiders_optimized.sh ittefaq
# Individual spider execution
./run_spiders_optimized.sh prothomalo # ProthomAlo (API-based, fastest)
./run_spiders_optimized.sh bdpratidin # BD Pratidin (Bengali handling)
./run_spiders_optimized.sh dailysun # Daily Sun (enhanced extraction)
./run_spiders_optimized.sh ittefaq # Daily Ittefaq (robust pagination)
./run_spiders_optimized.sh bangladesh_today # Bangladesh Today (multi-format)
./run_spiders_optimized.sh thedailystar # The Daily Star (legacy support)
# Run all spiders with real-time monitoring
./run_spiders_optimized.sh --monitor
# Run specific spider with monitoring
./run_spiders_optimized.sh prothomalo --monitor
./run_spiders_optimized.sh dailysun --monitor
# Monitor provides:
# - Real-time performance metrics
# - Memory and CPU usage tracking
# - Scraping speed statistics
# - Automatic performance report generation
# Filter articles by date range (all spiders support this)
./run_spiders_optimized.sh --start-date 2024-01-01 --end-date 2024-01-31 # All spiders for January 2024
./run_spiders_optimized.sh prothomalo --start-date 2024-06-01 --end-date 2024-06-30 # ProthomAlo for June 2024
# From specific date to today
./run_spiders_optimized.sh dailysun --start-date 2024-08-01
# Up to specific date (from default start)
./run_spiders_optimized.sh ittefaq --end-date 2024-12-31
# Combine with monitoring
./run_spiders_optimized.sh --monitor --start-date 2024-08-01 --end-date 2024-08-31
./run_spiders_optimized.sh prothomalo --monitor --start-date 2024-01-01 --end-date 2024-01-31
# Show all available options and spiders
./run_spiders_optimized.sh --help
./run_spiders_optimized.sh -h
# Output shows:
# - Available spider names
# - Date filtering options
# - Usage examples
# - Parameter explanations
The script automatically applies these performance optimizations:
# Settings applied by the optimized runner:
-s CONCURRENT_REQUESTS=64 # High concurrency
-s DOWNLOAD_DELAY=0.25 # Minimal but respectful delay
-s AUTOTHROTTLE_TARGET_CONCURRENCY=8.0 # Smart throttling
-L INFO # Informative logging level
# Logs are automatically created in logs/ directory
logs/prothomalo_20240829_143022.log # Timestamped logs
logs/dailysun_20240829_143545.log # Per-spider logs
logs/ittefaq_20240829_144012.log # Individual tracking
# View logs in real-time
tail -f logs/prothomalo_*.log
# Script automatically detects and uses:
# - UV package manager (preferred)
# - Fallback to direct scrapy commands
# - Performance monitor integration
# - Error handling and recovery
# Run fastest spider for testing
./run_spiders_optimized.sh prothomalo
# ✅ Uses API, completes in ~2-5 minutes
# Run all spiders with monitoring
./run_spiders_optimized.sh --monitor
# ✅ Comprehensive scraping with performance tracking
# ✅ Automatic report generation
# ✅ Individual logs per spider
# Run only fast/reliable spiders
./run_spiders_optimized.sh prothomalo --monitor
./run_spiders_optimized.sh dailysun --monitor
./run_spiders_optimized.sh ittefaq --monitor
# Test individual spiders during development
./run_spiders_optimized.sh prothomalo # Fast API test
./run_spiders_optimized.sh --help # Check available options
./run_spiders_optimized.sh bangladesh_today --monitor # Full test with monitoring
# Console output includes:
Starting all spiders with optimized settings...
Running spider: prothomalo
Progress: 1/7
✅ Spider prothomalo completed successfully
All spiders completed!
Success: 7/7
Total time: 1234s (20m 34s)
Generating performance report...
# Automatic error detection and reporting:
❌ Spider dailysun failed with exit code 1
⚠️ UV not found, using direct commands
⚠️ Performance monitor not found
Feature | Benefit |
---|---|
High Concurrency | 64 concurrent requests for faster scraping |
Smart Throttling | Automatic speed adjustment to avoid blocking |
UV Integration | Ultra-fast dependency resolution |
Individual Logs | Detailed per-spider tracking |
Progress Tracking | Real-time completion status |
Error Recovery | Continues with remaining spiders on failure |
Performance Reports | Automatic analytics generation |
Method | Speed | Monitoring | Logs | Error Handling | Best For |
---|---|---|---|---|---|
run_spiders_optimized.sh | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Production |
Individual commands | ⭐⭐ | ⭐ | ⭐⭐ | ⭐⭐ | Development |
Custom scripts | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ | Custom needs |
# Add to crontab for automatic daily runs
crontab -e
# Example cron entries:
# Run all spiders daily at 2 AM using optimized runner
0 2 * * * cd /path/to/BDNewsPaperScraper && ./run_spiders_optimized.sh >> /var/log/scraper.log 2>&1
# Run all spiders with monitoring daily at 3 AM
0 3 * * * cd /path/to/BDNewsPaperScraper && ./run_spiders_optimized.sh --monitor >> /var/log/scraper_monitored.log 2>&1
# Run fast spider every 6 hours using optimized runner
0 */6 * * * cd /path/to/BDNewsPaperScraper && ./run_spiders_optimized.sh prothomalo >> /var/log/prothomalo.log 2>&1
# Run specific spiders on weekdays only
0 9 * * 1-5 cd /path/to/BDNewsPaperScraper && ./run_spiders_optimized.sh dailysun --monitor
0 14 * * 1-5 cd /path/to/BDNewsPaperScraper && ./run_spiders_optimized.sh ittefaq --monitor
# Alternative: traditional individual commands
0 */6 * * * cd /path/to/BDNewsPaperScraper && uv run scrapy crawl prothomalo >> /var/log/prothomalo_direct.log 2>&1
# Create a Dockerfile for containerized runs
cat > Dockerfile << 'EOF'
FROM python:3.11-slim
WORKDIR /app
COPY . .
# Install UV and dependencies
RUN pip install uv
RUN uv sync
# Default command
CMD ["./run_spiders_optimized.sh", "--monitor"]
EOF
# Build and run in container
docker build -t bdnewspaper-scraper .
docker run -v $(pwd)/data:/app/data bdnewspaper-scraper
# Or with specific spider
docker run bdnewspaper-scraper ./run_spiders_optimized.sh prothomalo
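For repeatable container runs, the same image can also be driven from a docker-compose file. A minimal sketch (the service name and data volume are illustrative, not files shipped with the repo):

# docker-compose.yml (sketch)
services:
  scraper:
    build: .
    volumes:
      - ./data:/app/data     # persist the database/exports outside the container
    command: ["./run_spiders_optimized.sh", "--monitor"]

Start it with docker compose up --build.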
# Activate virtual environment and run directly
source .venv/bin/activate # or .venv\Scripts\activate on Windows
# Run without uv prefix (faster for multiple commands)
scrapy crawl prothomalo -s CLOSESPIDER_ITEMCOUNT=100
scrapy crawl dailysun -s DOWNLOAD_DELAY=2
python performance_monitor.py
# Deactivate when done
deactivate
# VS Code launch.json configuration
cat > .vscode/launch.json << 'EOF'
{
"version": "0.2.0",
"configurations": [
{
"name": "Run Prothomalo Spider",
"type": "python",
"request": "launch",
"program": "${workspaceFolder}/.venv/bin/scrapy",
"args": ["crawl", "prothomalo", "-s", "CLOSESPIDER_ITEMCOUNT=10"],
"console": "integratedTerminal"
}
]
}
EOF
# PyCharm run configuration:
# Script path: .venv/bin/scrapy
# Parameters: crawl prothomalo -s CLOSESPIDER_ITEMCOUNT=10
# Working directory: /path/to/BDNewsPaperScraper
# Create systemd service for automatic runs
sudo tee /etc/systemd/system/bdnewspaper.service > /dev/null << 'EOF'
[Unit]
Description=BD Newspaper Scraper
After=network.target
[Service]
Type=oneshot
User=your-username
WorkingDirectory=/path/to/BDNewsPaperScraper
ExecStart=/path/to/BDNewsPaperScraper/run_spiders_optimized.sh --monitor
Environment=PATH=/usr/local/bin:/usr/bin:/bin
[Install]
WantedBy=multi-user.target
EOF
# Enable and start service
sudo systemctl daemon-reload
sudo systemctl enable bdnewspaper.service
sudo systemctl start bdnewspaper.service
# Create timer for periodic runs
sudo tee /etc/systemd/system/bdnewspaper.timer > /dev/null << 'EOF'
[Unit]
Description=Run BD Newspaper Scraper daily
Requires=bdnewspaper.service
[Timer]
OnCalendar=daily
Persistent=true
[Install]
WantedBy=timers.target
EOF
sudo systemctl enable --now bdnewspaper.timer
# Development environment
export SCRAPY_SETTINGS_MODULE=BDNewsPaper.settings_dev
uv run scrapy crawl prothomalo -s CLOSESPIDER_ITEMCOUNT=5 -L DEBUG
# Staging environment
export SCRAPY_SETTINGS_MODULE=BDNewsPaper.settings_staging
./run_spiders_optimized.sh prothomalo --monitor
# Production environment
export SCRAPY_SETTINGS_MODULE=BDNewsPaper.settings_prod
./run_spiders_optimized.sh --monitor
# Testing environment with mock data
export SCRAPY_SETTINGS_MODULE=BDNewsPaper.settings_test
uv run scrapy crawl prothomalo -s DOWNLOAD_DELAY=0 -s ROBOTSTXT_OBEY=False
# Run multiple spiders in parallel (advanced users)
# Terminal 1
./run_spiders_optimized.sh prothomalo --monitor &
# Terminal 2
./run_spiders_optimized.sh dailysun --monitor &
# Terminal 3
./run_spiders_optimized.sh ittefaq --monitor &
# Wait for all to complete
wait
# Or using GNU parallel
parallel -j 3 './run_spiders_optimized.sh {} --monitor' ::: prothomalo dailysun ittefaq
# Create Makefile for easy commands
# NOTE: recipe lines in a Makefile must be indented with tab characters
cat > Makefile << 'EOF'
.PHONY: install test run-all run-fast export clean stats

install:
	uv sync

test:
	./run_spiders_optimized.sh prothomalo --monitor

run-all:
	./run_spiders_optimized.sh --monitor

run-fast:
	./run_spiders_optimized.sh prothomalo

export:
	./toxlsx.py --output "export_$$(date +%Y%m%d).xlsx"

clean:
	rm -rf logs/* *.log
	rm -rf .scrapy/

stats:
	./toxlsx.py --list
EOF
# Use with make commands
make install
make test
make run-all
make export
# GitHub Actions workflow (.github/workflows/scraper.yml)
cat > .github/workflows/scraper.yml << 'EOF'
name: News Scraper
on:
  schedule:
    - cron: '0 2 * * *'  # Daily at 2 AM
  workflow_dispatch:

jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Install UV
        run: curl -LsSf https://astral.sh/uv/install.sh | sh
      - name: Setup project
        run: |
          source ~/.bashrc
          uv sync
      - name: Run scraper
        run: ./run_spiders_optimized.sh --monitor
      - name: Upload artifacts
        uses: actions/upload-artifact@v3
        with:
          name: scraped-data
          path: news_articles.db
EOF
# GitLab CI (.gitlab-ci.yml)
cat > .gitlab-ci.yml << 'EOF'
stages:
  - scrape

scrape_news:
  stage: scrape
  image: python:3.11
  script:
    - curl -LsSf https://astral.sh/uv/install.sh | sh
    - source ~/.bashrc
    - uv sync
    - ./run_spiders_optimized.sh --monitor
  artifacts:
    paths:
      - news_articles.db
    expire_in: 1 week
  only:
    - schedules
EOF
# Custom Python script approach
cat > custom_runner.py << 'EOF'
#!/usr/bin/env python3
import subprocess
import sys
spiders = ['prothomalo', 'dailysun', 'ittefaq', 'BDpratidin']

for spider in spiders:
    print(f"Running {spider}...")
    result = subprocess.run([
        'uv', 'run', 'scrapy', 'crawl', spider,
        '-s', 'CLOSESPIDER_ITEMCOUNT=100'
    ], capture_output=True, text=True)
    if result.returncode == 0:
        print(f"✅ {spider} completed successfully")
    else:
        print(f"❌ {spider} failed: {result.stderr}")
EOF
chmod +x custom_runner.py
python custom_runner.py
# Run specific newspaper spiders
uv run scrapy crawl prothomalo # ProthomAlo (fastest, API-based)
uv run scrapy crawl dailysun # Daily Sun
uv run scrapy crawl ittefaq # Daily Ittefaq
# NOTE: kalerKantho spider discontinued (Aug 2024) - English edition no longer published
uv run scrapy crawl BDpratidin # BD Pratidin
uv run scrapy crawl bangladesh_today # Bangladesh Today
uv run scrapy crawl thedailystar # The Daily Star
# Run with custom settings
uv run scrapy crawl prothomalo -s CLOSESPIDER_ITEMCOUNT=100 # Limit to 100 articles
uv run scrapy crawl dailysun -s DOWNLOAD_DELAY=2 # Add 2s delay between requests
# Enhanced runner with optimizations (recommended)
chmod +x run_spiders_optimized.sh
./run_spiders_optimized.sh
# Run specific spider with monitoring
uv run scrapy crawl prothomalo \
-s CLOSESPIDER_ITEMCOUNT=500 \
-s DOWNLOAD_DELAY=1 \
-s CONCURRENT_REQUESTS=16 \
-L INFO
# Run spider for specific date range (if supported)
uv run scrapy crawl ittefaq -a start_date=2024-01-01 -a end_date=2024-01-31
# Check scraped data immediately
./toxlsx.py --list
# Example output:
# Shared News Articles Database
# ========================================
# Database file: news_articles.db
# Total articles: 1,234
# Date range: 2024-01-01 to 2024-12-31
#
# Articles by newspaper:
# ------------------------------
# ProthomAlo: 456 articles
# The Daily Ittefaq: 321 articles
# Daily Sun: 234 articles
# Kaler Kantho: 123 articles
# Install pandas for export functionality (one time only)
uv add pandas openpyxl # For Excel export
# OR
uv add pandas # For CSV export only
# Export all articles to Excel
./toxlsx.py --output all_news.xlsx
# Export all articles to CSV
./toxlsx.py --format csv --output all_news.csv
# Export specific newspaper articles
./toxlsx.py --paper "ProthomAlo" --output prothomalo.xlsx
./toxlsx.py --paper "Daily Sun" --output dailysun.xlsx
./toxlsx.py --paper "The Daily Ittefaq" --output ittefaq.xlsx
./toxlsx.py --paper "Kaler Kantho" --output kalerkantho.xlsx
./toxlsx.py --paper "BD Pratidin" --output bdpratidin.xlsx
./toxlsx.py --paper "Bangladesh Today" --output bangladesh_today.xlsx
./toxlsx.py --paper "The Daily Star" --output thedailystar.xlsx
# Export as CSV instead of Excel
./toxlsx.py --paper "ProthomAlo" --format csv --output prothomalo.csv
# Latest articles from all newspapers
./toxlsx.py --limit 100 --output recent_news.xlsx
./toxlsx.py --limit 500 --format csv --output recent_500.csv
# Latest from specific newspaper
./toxlsx.py --paper "ProthomAlo" --limit 50 --output latest_prothomalo.xlsx
./toxlsx.py --paper "Daily Sun" --limit 25 --format csv --output latest_dailysun.csv
# Count articles by newspaper
sqlite3 news_articles.db "SELECT paper_name, COUNT(*) as count FROM articles GROUP BY paper_name ORDER BY count DESC;"
# Recent headlines from all newspapers
sqlite3 news_articles.db "SELECT headline, paper_name, publication_date FROM articles ORDER BY scraped_at DESC LIMIT 20;"
# Search for specific topics
sqlite3 news_articles.db "SELECT headline, paper_name FROM articles WHERE headline LIKE '%politics%' LIMIT 10;"
# Articles from today
sqlite3 news_articles.db "SELECT COUNT(*) FROM articles WHERE date(scraped_at) = date('now');"
# Export query results to CSV
sqlite3 -header -csv news_articles.db "SELECT * FROM articles WHERE paper_name = 'ProthomAlo' LIMIT 100;" > prothomalo_latest.csv
# View real-time logs (in another terminal)
tail -f scrapy.log
# Monitor with performance tool
uv run python performance_monitor.py
# Show database information and statistics
./toxlsx.py --list
# Check article counts by newspaper
sqlite3 news_articles.db "SELECT paper_name, COUNT(*) FROM articles GROUP BY paper_name;"
# View recent headlines
sqlite3 news_articles.db "SELECT headline, paper_name FROM articles ORDER BY scraped_at DESC LIMIT 10;"
# Show database stats and newspaper breakdown
./toxlsx.py --list
# Output example:
# Shared News Articles Database
# ========================================
# Database file: news_articles.db
# Total articles: 1,234
# Date range: 2024-01-01 to 2024-12-31
#
# Articles by newspaper:
# ------------------------------
# ProthomAlo: 456 articles
# The Daily Ittefaq: 321 articles
# Daily Sun: 234 articles
# ...
# Install pandas for export functionality (one time only)
uv add pandas openpyxl # For Excel export
# OR
uv add pandas # For CSV export only
# Export all articles to Excel
./toxlsx.py --output all_news.xlsx
# Export all articles to CSV
./toxlsx.py --format csv --output all_news.csv
# Export only ProthomAlo articles
./toxlsx.py --paper "ProthomAlo" --output prothomalo.xlsx
# Export latest 100 articles from all newspapers
./toxlsx.py --limit 100 --output recent_news.xlsx
# Export latest 50 Daily Sun articles as CSV
./toxlsx.py --paper "Daily Sun" --limit 50 --format csv
# Export latest Ittefaq articles
./toxlsx.py --paper "The Daily Ittefaq" --limit 25 --output ittefaq_latest.xlsx
# See all available options
./toxlsx.py --help
# Available filters:
# --paper "newspaper_name" # Filter by specific newspaper
# --limit N # Limit to N most recent articles
# --format excel|csv # Output format
# --output filename # Custom output filename
# Direct SQLite queries for advanced analysis
sqlite3 news_articles.db "SELECT paper_name, COUNT(*) FROM articles GROUP BY paper_name;"
sqlite3 news_articles.db "SELECT headline, article FROM articles WHERE paper_name = 'ProthomAlo' LIMIT 5;"
sqlite3 news_articles.db "SELECT COUNT(*) FROM articles WHERE publication_date LIKE '2024%';"
# Export specific spider data
./toxlsx.py --spider prothomalo # Excel format
./toxlsx.py --spider dailysun --format csv # CSV format
./toxlsx.py --spider ittefaq --output custom.xlsx # Custom filename
# Export all available data
./toxlsx.py --spider legacy --output all_news.xlsx
# See all export options
./toxlsx.py --help
# Export with custom table
./toxlsx.py --db custom.db --table my_articles --output data.xlsx
Spider Name | Command | Website | Features |
---|---|---|---|
prothomalo | uv run scrapy crawl prothomalo | ProthomAlo | ✅ API-based, Fast, JSON responses, Date filtering |
dailysun | uv run scrapy crawl dailysun | Daily Sun | ✅ Enhanced extraction, Bengali support, Date filtering |
ittefaq | uv run scrapy crawl ittefaq | Daily Ittefaq | ✅ Robust pagination, Date filtering |
BDpratidin | uv run scrapy crawl BDpratidin | BD Pratidin | ✅ Bengali date handling, Categories, Date filtering |
bangladesh_today | uv run scrapy crawl bangladesh_today | Bangladesh Today | ✅ Multi-format support, English content, Date filtering |
thedailystar | uv run scrapy crawl thedailystar | The Daily Star | ✅ Legacy support, Large archive, Date filtering |
kalerKantho | ❌ DISCONTINUED | Kaler Kantho | ❌ English version discontinued Aug 2024, now Bangla-only |
All spiders now support date range filtering! You can scrape articles from specific time periods using the start_date and end_date parameters.
# Scrape articles from January 2024
uv run scrapy crawl prothomalo -a start_date=2024-01-01 -a end_date=2024-01-31
# Scrape from specific date to today
uv run scrapy crawl dailysun -a start_date=2024-06-01
# Scrape up to specific date (from default start)
uv run scrapy crawl ittefaq -a end_date=2024-12-31
# MONTHLY ARCHIVES
uv run scrapy crawl prothomalo -a start_date=2024-01-01 -a end_date=2024-01-31 # January 2024
uv run scrapy crawl dailysun -a start_date=2024-02-01 -a end_date=2024-02-29 # February 2024
uv run scrapy crawl ittefaq -a start_date=2024-03-01 -a end_date=2024-03-31 # March 2024
# QUARTERLY REPORTS
uv run scrapy crawl BDpratidin -a start_date=2024-01-01 -a end_date=2024-03-31 # Q1 2024
uv run scrapy crawl bangladesh_today -a start_date=2024-04-01 -a end_date=2024-06-30 # Q2 2024
uv run scrapy crawl thedailystar -a start_date=2024-07-01 -a end_date=2024-09-30 # Q3 2024
# RECENT NEWS
uv run scrapy crawl thedailystar -a start_date=2024-08-22 -a end_date=2024-08-29 # Last week
uv run scrapy crawl prothomalo -a start_date=2024-08-01 # This month
# COMBINED WITH OTHER FILTERS
uv run scrapy crawl dailysun -a start_date=2024-01-01 -a end_date=2024-01-31 -s CLOSESPIDER_ITEMCOUNT=100
uv run scrapy crawl prothomalo -a start_date=2024-06-01 -a categories="Bangladesh,Sports" -s DOWNLOAD_DELAY=1
- Format: YYYY-MM-DD (ISO 8601 standard)
- Timezone: All dates are interpreted in the Dhaka timezone (Asia/Dhaka)
- Default start_date: Usually 6 months back (varies by spider)
- Default end_date: Today's date
- Range: Only articles published within the specified range are scraped
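For reference, the sketch below shows one way a spider can turn these arguments into a concrete date window (standard library only, interpreting dates in Asia/Dhaka and applying the defaults listed above). It is an illustration, not the exact code used by the repo's spiders.

# Sketch: resolving start_date/end_date spider arguments (illustrative, not the repo's exact code)
from datetime import datetime, timedelta
from typing import Optional
from zoneinfo import ZoneInfo  # Python 3.9+

DHAKA = ZoneInfo("Asia/Dhaka")

def resolve_date_range(start_date: Optional[str], end_date: Optional[str]):
    """Return (start, end) datetimes in Dhaka time, applying the documented defaults."""
    today = datetime.now(DHAKA)
    end = datetime.strptime(end_date, "%Y-%m-%d").replace(tzinfo=DHAKA) if end_date else today
    if start_date:
        start = datetime.strptime(start_date, "%Y-%m-%d").replace(tzinfo=DHAKA)
    else:
        start = today - timedelta(days=180)  # "usually 6 months back" default
    return start, end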
# ✅ RECOMMENDED: Use specific date ranges for faster scraping
uv run scrapy crawl prothomalo -a start_date=2024-08-01 -a end_date=2024-08-31
# ✅ PERFORMANCE: Shorter date ranges = faster completion
uv run scrapy crawl dailysun -a start_date=2024-08-25 -a end_date=2024-08-29
# ✅ ARCHIVAL: For historical data, use longer ranges
uv run scrapy crawl thedailystar -a start_date=2024-01-01 -a end_date=2024-12-31
# ❌ AVOID: Very large date ranges without limits (may take hours)
# uv run scrapy crawl ittefaq -a start_date=2020-01-01 -a end_date=2024-12-31
# ✅ BETTER: Use limits with large ranges
uv run scrapy crawl ittefaq -a start_date=2023-01-01 -a end_date=2024-12-31 -s CLOSESPIDER_ITEMCOUNT=1000
All spiders now write to a single shared database (news_articles.db) containing only the essential fields:
CREATE TABLE articles (
id INTEGER PRIMARY KEY AUTOINCREMENT,
url TEXT UNIQUE NOT NULL,
paper_name TEXT NOT NULL,
headline TEXT NOT NULL,
article TEXT NOT NULL,
publication_date TEXT,
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
- url - Article URL (unique identifier)
- paper_name - Newspaper name (e.g., "ProthomAlo", "The Daily Ittefaq")
- headline - Article title/headline
- article - Full article content (cleaned text)
- publication_date - When the article was published
- scraped_at - When the article was scraped (automatic timestamp)
- ✅ Single database file for all newspapers
- ✅ Essential fields only - no unnecessary data
- ✅ Fast queries with proper indexing
- ✅ Automatic duplicate prevention by URL
- ✅ Clean, normalized content
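Beyond the toxlsx.py tool and the sqlite3 CLI queries shown elsewhere in this README, the shared database can also be read directly from Python with the standard-library sqlite3 module; a small sketch against the schema above:

# Sketch: reading the shared database with Python's built-in sqlite3 module
import sqlite3

with sqlite3.connect("news_articles.db") as conn:
    rows = conn.execute(
        "SELECT paper_name, COUNT(*) FROM articles GROUP BY paper_name ORDER BY COUNT(*) DESC"
    ).fetchall()

for paper, count in rows:
    print(f"{paper}: {count} articles")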
# Create custom settings file
cp BDNewsPaper/settings.py BDNewsPaper/settings_custom.py
# Run with custom settings (Scrapy selects the settings module via the SCRAPY_SETTINGS_MODULE environment variable)
SCRAPY_SETTINGS_MODULE=BDNewsPaper.settings_custom uv run scrapy crawl prothomalo
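The values you edit in settings_custom.py are ordinary Scrapy settings. For example, the throttling defaults used by the optimized runner could be persisted there instead of being passed as -s flags; a sketch with standard Scrapy setting names (the repo's actual defaults live in BDNewsPaper/settings.py):

# Sketch: possible overrides inside BDNewsPaper/settings_custom.py (standard Scrapy settings)
CONCURRENT_REQUESTS = 64                # high concurrency
DOWNLOAD_DELAY = 0.25                   # minimal but respectful delay
AUTOTHROTTLE_ENABLED = True             # required for the target concurrency below to take effect
AUTOTHROTTLE_TARGET_CONCURRENCY = 8.0   # smart throttling
LOG_LEVEL = "INFO"                      # informative logging level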
# Format code
uv run black BDNewsPaper/
# Sort imports
uv run isort BDNewsPaper/
# Lint code
uv run flake8 BDNewsPaper/
# Run all quality checks
uv run black . && uv run isort . && uv run flake8 .
# Monitor spider performance in real-time
uv run python performance_monitor.py
# View statistics
uv run python performance_monitor.py stats
# Generate detailed report
uv run python performance_monitor.py report
# Fastest spiders (API-based, recommended for frequent runs)
uv run scrapy crawl prothomalo # Uses API, very fast
# Medium speed spiders (good balance)
uv run scrapy crawl dailysun # Enhanced extraction
uv run scrapy crawl ittefaq # Robust pagination
# Comprehensive spiders (slower but thorough)
uv run scrapy crawl BDpratidin # Bengali date handling
uv run scrapy crawl bangladesh_today # Multi-format support
uv run scrapy crawl thedailystar # Large archive
# Limit articles for faster testing
uv run scrapy crawl prothomalo -s CLOSESPIDER_ITEMCOUNT=50
# Increase concurrent requests for faster scraping
uv run scrapy crawl dailysun -s CONCURRENT_REQUESTS=32
# Add delays to be respectful to servers
uv run scrapy crawl ittefaq -s DOWNLOAD_DELAY=1
# Disable unnecessary features for speed
uv run scrapy crawl ittefaq -s COOKIES_ENABLED=False -s RETRY_ENABLED=False
# Real-time monitoring
tail -f scrapy.log | grep -E "(Spider opened|items|Spider closed)"
# Database size monitoring
ls -lh news_articles.db*
# Performance monitoring
uv run python performance_monitor.py
# Resume interrupted scraping (spiders handle duplicates automatically)
uv run scrapy crawl prothomalo # Will skip existing URLs
# Clear specific spider data if needed
sqlite3 news_articles.db "DELETE FROM articles WHERE paper_name = 'ProthomAlo';"
# Backup database before major runs
cp news_articles.db news_articles_backup_$(date +%Y%m%d).db
This section covers all the ways to create detailed log files and monitor your scraping activities across all platforms and runner methods.
Platform | Method | Command |
---|---|---|
Linux/macOS | Individual Spider | uv run scrapy crawl prothomalo -L DEBUG > detailed.log 2>&1 |
Linux/macOS | Enhanced Runner | ./run_spiders_optimized.sh prothomalo --monitor > full.log 2>&1 |
Windows | Individual Spider | uv run scrapy crawl prothomalo -L DEBUG > detailed.log 2>&1 |
Windows | Python Runner | python run_spiders_optimized.py prothomalo --monitor > full.log 2>&1 |
Windows | Batch Runner | run_spiders_optimized.bat prothomalo > full.log 2>&1 |
# ✅ RECOMMENDED: INFO level shows scraping progress
uv run scrapy crawl prothomalo -L INFO
# DEBUG level shows detailed technical information
uv run scrapy crawl prothomalo -L DEBUG
# ⚠️ WARNING level shows only warnings and errors
uv run scrapy crawl prothomalo -L WARNING
# ❌ ERROR level shows only critical errors
uv run scrapy crawl prothomalo -L ERROR
# π BASIC: Save all output to file
uv run scrapy crawl prothomalo -L INFO > scraping.log 2>&1
# π DETAILED: Save with timestamps and full debug info
uv run scrapy crawl prothomalo -L DEBUG > "prothomalo_detailed_$(date +%Y%m%d_%H%M%S).log" 2>&1
# π― PRODUCTION: Save with specific spider and date
uv run scrapy crawl dailysun -L INFO > "logs/dailysun_$(date +%Y%m%d).log" 2>&1
# π SPLIT: Save errors separately
uv run scrapy crawl ittefaq -L INFO > scraping.log 2> errors.log
# π COMPREHENSIVE: Full logging with all details
uv run scrapy crawl prothomalo \
-L DEBUG \
-s LOG_FILE="logs/prothomalo_full_$(date +%Y%m%d_%H%M%S).log" \
-s LOG_LEVEL=DEBUG \
-s CLOSESPIDER_ITEMCOUNT=100 \
> "console_output_$(date +%Y%m%d_%H%M%S).log" 2>&1
# π PERFORMANCE: Include performance metrics
uv run scrapy crawl dailysun \
-L INFO \
-s STATS_CLASS=scrapy.statscollectors.MemoryStatsCollector \
-s LOG_FILE="logs/dailysun_performance_$(date +%Y%m%d).log" \
> "dailysun_console_$(date +%Y%m%d).log" 2>&1
# π MONITORING: Real-time progress with detailed stats
uv run scrapy crawl ittefaq \
-L INFO \
-s LOGSTATS_INTERVAL=10 \
-s STATS_CLASS=scrapy.statscollectors.MemoryStatsCollector \
> "ittefaq_realtime_$(date +%Y%m%d_%H%M%S).log" 2>&1
# DATE-SPECIFIC: Log scraping for specific periods
uv run scrapy crawl prothomalo \
-a start_date=2024-01-01 \
-a end_date=2024-01-31 \
-L INFO \
> "prothomalo_january2024_$(date +%Y%m%d).log" 2>&1
# π QUARTERLY: Log quarterly data collection
uv run scrapy crawl bdpratidin \
-a start_date=2024-01-01 \
-a end_date=2024-03-31 \
-L DEBUG \
-s LOG_FILE="logs/bdpratidin_Q1_2024.log" \
> "bdpratidin_Q1_console.log" 2>&1
# π BASIC: Standard logging with monitoring
./run_spiders_optimized.sh prothomalo --monitor > "runner_$(date +%Y%m%d).log" 2>&1
# π DETAILED: Full debug logging for all spiders
./run_spiders_optimized.sh --monitor > "full_scrape_$(date +%Y%m%d_%H%M%S).log" 2>&1
# π PERFORMANCE: Detailed performance monitoring
./run_spiders_optimized.sh prothomalo --monitor > "performance_$(date +%Y%m%d).log" 2>&1
# DATE-FILTERED: Log specific date range scraping
./run_spiders_optimized.sh \
--start-date 2024-08-01 \
--end-date 2024-08-31 \
--monitor \
> "august2024_scrape_$(date +%Y%m%d).log" 2>&1
# π― SELECTIVE: Log specific spiders only
./run_spiders_optimized.sh prothomalo --monitor > prothomalo_detailed.log 2>&1
./run_spiders_optimized.sh dailysun --monitor > dailysun_detailed.log 2>&1
./run_spiders_optimized.sh ittefaq --monitor > ittefaq_detailed.log 2>&1
# π BASIC: Standard logging with monitoring
python run_spiders_optimized.py prothomalo --monitor > runner_%date:~10,4%%date:~4,2%%date:~7,2%.log 2>&1
# π DETAILED: Full debug logging for all spiders
python run_spiders_optimized.py --monitor > full_scrape_%date:~10,4%%date:~4,2%%date:~7,2%_%time:~0,2%%time:~3,2%.log 2>&1
# π PERFORMANCE: Detailed performance monitoring
python run_spiders_optimized.py prothomalo --monitor > performance_%date:~10,4%%date:~4,2%%date:~7,2%.log 2>&1
# DATE-FILTERED: Log specific date range scraping
python run_spiders_optimized.py --start-date 2024-08-01 --end-date 2024-08-31 --monitor > august2024_scrape.log 2>&1
# π― SELECTIVE: Log specific spiders only
python run_spiders_optimized.py prothomalo --monitor > prothomalo_detailed.log 2>&1
python run_spiders_optimized.py dailysun --monitor > dailysun_detailed.log 2>&1
python run_spiders_optimized.py ittefaq --monitor > ittefaq_detailed.log 2>&1
# π BASIC: Standard logging with monitoring
run_spiders_optimized.bat prothomalo --monitor > runner_log.txt 2>&1
# π DETAILED: Full debug logging for all spiders
run_spiders_optimized.bat --monitor > full_scrape_log.txt 2>&1
# π PERFORMANCE: Detailed performance monitoring
run_spiders_optimized.bat prothomalo --monitor > performance_log.txt 2>&1
# DATE-FILTERED: Log specific date range scraping
run_spiders_optimized.bat --start-date 2024-08-01 --end-date 2024-08-31 --monitor > august2024_log.txt 2>&1
# π REAL-TIME: Monitor logs as they're created
# Terminal 1: Start scraping
./run_spiders_optimized.sh prothomalo --monitor > live_scraping.log 2>&1 &
# Terminal 2: Monitor in real-time
tail -f live_scraping.log
# π― FILTERED: Monitor specific events
tail -f live_scraping.log | grep -E "(scraped|ERROR|WARNING|Spider opened|Spider closed)"
# π STATS: Monitor statistics only
tail -f live_scraping.log | grep -E "(Crawled|Scraped|pages/min|items/min)"
# π DETAILED: Monitor with color highlighting (if you have ccze)
tail -f live_scraping.log | ccze -A
# π REAL-TIME: Monitor logs as they're created (PowerShell)
# Terminal 1: Start scraping
python run_spiders_optimized.py prothomalo --monitor > live_scraping.log 2>&1
# Terminal 2: Monitor in real-time (PowerShell)
Get-Content live_scraping.log -Wait -Tail 50
# π― FILTERED: Monitor specific events (PowerShell)
Get-Content live_scraping.log -Wait -Tail 50 | Select-String "scraped|ERROR|WARNING|Spider opened|Spider closed"
# π STATS: Monitor statistics only (PowerShell)
Get-Content live_scraping.log -Wait -Tail 50 | Select-String "Crawled|Scraped|pages/min|items/min"
# π COMMAND PROMPT: Alternative monitoring
powershell "Get-Content live_scraping.log -Wait -Tail 30"
# ποΈ CREATE: Organized logging structure
mkdir -p logs/{daily,individual,performance,errors,archive}
# DAILY: Daily organized logging
./run_spiders_optimized.sh --monitor > "logs/daily/scrape_$(date +%Y%m%d).log" 2>&1
# π·οΈ INDIVIDUAL: Per-spider logging
uv run scrapy crawl prothomalo -L INFO > "logs/individual/prothomalo_$(date +%Y%m%d_%H%M%S).log" 2>&1
uv run scrapy crawl dailysun -L INFO > "logs/individual/dailysun_$(date +%Y%m%d_%H%M%S).log" 2>&1
# π PERFORMANCE: Performance-focused logging
./run_spiders_optimized.sh prothomalo --monitor > "logs/performance/prothomalo_perf_$(date +%Y%m%d).log" 2>&1
# β ERRORS: Error-only logging
uv run scrapy crawl ittefaq -L ERROR > "logs/errors/ittefaq_errors_$(date +%Y%m%d).log" 2>&1
# π CUSTOM: Create custom logging configuration
cat > custom_logging_settings.py << 'EOF'
# Custom logging settings for detailed output
from datetime import datetime

LOG_LEVEL = 'DEBUG'
LOG_FILE = f'logs/custom_scrape_{datetime.now().strftime("%Y%m%d_%H%M%S")}.log'
LOG_STDOUT = True
LOG_FORMAT = '%(levelname)s: %(message)s'
# Statistics configuration
STATS_CLASS = 'scrapy.statscollectors.MemoryStatsCollector'
LOGSTATS_INTERVAL = 30
# Download statistics
DOWNLOADER_STATS = True
DOWNLOAD_DELAY = 1
CONCURRENT_REQUESTS = 32
EOF
# USE: Apply the custom settings (Scrapy reads the settings module from the SCRAPY_SETTINGS_MODULE environment variable)
SCRAPY_SETTINGS_MODULE=custom_logging_settings uv run scrapy crawl prothomalo > custom_output.log 2>&1
# π PRODUCTION: Complete production logging setup
production_log() {
local spider_name=$1
local log_dir="logs/production/$(date +%Y/%m)"
local timestamp=$(date +%Y%m%d_%H%M%S)
# Create directory structure
mkdir -p "$log_dir"
# Run with comprehensive logging
./run_spiders_optimized.sh "$spider_name" --monitor \
> "$log_dir/${spider_name}_${timestamp}_full.log" 2>&1
# Create summary log
echo "=== Scraping Summary for $spider_name ===" > "$log_dir/${spider_name}_${timestamp}_summary.log"
echo "Start time: $(date)" >> "$log_dir/${spider_name}_${timestamp}_summary.log"
echo "Log file: $log_dir/${spider_name}_${timestamp}_full.log" >> "$log_dir/${spider_name}_${timestamp}_summary.log"
}
# π USAGE: Call the function
production_log prothomalo
production_log dailysun
# π STATS: Extract scraping statistics from logs
extract_stats() {
local log_file=$1
echo "=== Scraping Statistics ==="
grep -E "(Spider opened|Spider closed|Crawled.*pages|Scraped.*items)" "$log_file"
echo ""
echo "=== Error Summary ==="
grep -E "(ERROR|CRITICAL)" "$log_file" | head -10
echo ""
echo "=== Performance Metrics ==="
grep -E "(pages/min|items/min|items/sec)" "$log_file" | tail -5
}
# π ANALYZE: Analyze a log file
extract_stats "logs/prothomalo_detailed.log"
# π REPORT: Generate daily scraping report
generate_daily_report() {
local date_str=$(date +%Y%m%d)
local report_file="reports/daily_report_$date_str.txt"
mkdir -p reports
echo "=== Daily Scraping Report - $(date +%Y-%m-%d) ===" > "$report_file"
echo "" >> "$report_file"
# Database statistics
echo "=== Database Statistics ===" >> "$report_file"
sqlite3 news_articles.db "SELECT paper_name, COUNT(*) as articles FROM articles GROUP BY paper_name ORDER BY articles DESC;" >> "$report_file"
echo "" >> "$report_file"
# Recent logs summary
echo "=== Recent Activity ===" >> "$report_file"
find logs/ -name "*$date_str*.log" -exec echo "Log: {}" \; -exec tail -5 {} \; >> "$report_file"
echo "Report saved: $report_file"
}
# π USAGE: Generate report
generate_daily_report
# β ERROR-FOCUSED: Log only errors and warnings
uv run scrapy crawl prothomalo -L WARNING > "errors_only_$(date +%Y%m%d).log" 2>&1
# π DEBUGGING: Ultra-detailed error debugging
uv run scrapy crawl dailysun \
-L DEBUG \
-s DOWNLOAD_DELAY=3 \
-s RETRY_TIMES=5 \
-s CLOSESPIDER_ITEMCOUNT=10 \
> "debug_errors_$(date +%Y%m%d_%H%M%S).log" 2>&1
# π ERROR ANALYSIS: Extract and categorize errors
analyze_errors() {
local log_file=$1
echo "=== Error Analysis Report ===" > "${log_file%.log}_error_analysis.txt"
echo "Total Errors: $(grep -c ERROR "$log_file")" >> "${log_file%.log}_error_analysis.txt"
echo "Total Warnings: $(grep -c WARNING "$log_file")" >> "${log_file%.log}_error_analysis.txt"
echo "" >> "${log_file%.log}_error_analysis.txt"
echo "=== Most Common Errors ===" >> "${log_file%.log}_error_analysis.txt"
grep ERROR "$log_file" | sort | uniq -c | sort -nr | head -10 >> "${log_file%.log}_error_analysis.txt"
}
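# USAGE: analyze a log file (the file name below is just a placeholder)
analyze_errors "debug_errors_20240830_143000.log"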
# π€ AUTOMATED: Complete logging script
cat > complete_logging.sh << 'EOF'
#!/bin/bash
# Configuration
LOG_BASE_DIR="logs/$(date +%Y/%m/%d)"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
SPIDERS=("prothomalo" "dailysun" "ittefaq" "bdpratidin")
# Create directory structure
mkdir -p "$LOG_BASE_DIR"/{individual,combined,errors,performance}
echo "π Starting comprehensive logging session: $TIMESTAMP"
# Log all spiders individually
for spider in "${SPIDERS[@]}"; do
echo "π° Logging spider: $spider"
./run_spiders_optimized.sh "$spider" --monitor \
> "$LOG_BASE_DIR/individual/${spider}_${TIMESTAMP}.log" 2>&1 &
done
# Wait for all to complete
wait
# Run combined session
echo "π Running combined session with monitoring"
./run_spiders_optimized.sh --monitor \
> "$LOG_BASE_DIR/combined/all_spiders_${TIMESTAMP}.log" 2>&1
# Generate summary
echo "π Generating summary report"
python toxlsx.py --list > "$LOG_BASE_DIR/database_summary_${TIMESTAMP}.txt"
echo "β
Logging session completed. Logs saved in: $LOG_BASE_DIR"
EOF
chmod +x complete_logging.sh
./complete_logging.sh
# ✅ RECOMMENDED: Standard production logging
./run_spiders_optimized.sh prothomalo --monitor > "logs/production_$(date +%Y%m%d).log" 2>&1
# π― DEVELOPMENT: Detailed debugging with limits
uv run scrapy crawl prothomalo -L DEBUG -s CLOSESPIDER_ITEMCOUNT=20 > "debug_$(date +%Y%m%d_%H%M%S).log" 2>&1
# π MONITORING: Real-time monitoring with statistics
./run_spiders_optimized.sh --monitor | tee "live_$(date +%Y%m%d_%H%M%S).log"
# π SCHEDULED: Automated logging for cron jobs
0 2 * * * cd /path/to/BDNewsPaperScraper && ./run_spiders_optimized.sh --monitor > "logs/daily/auto_$(date +%Y%m%d).log" 2>&1
# π NAMING: Consistent log file naming
# Format: [spider]_[purpose]_[date]_[time].log
# Examples:
prothomalo_production_20240830_143022.log # Production run
dailysun_debug_20240830_144530.log # Debug session
all_spiders_monitoring_20240830_150000.log # All spiders with monitoring
ittefaq_performance_20240830_151234.log # Performance testing
quarterly_archive_Q3_2024.log # Quarterly archive
This logging guide covers the most common scenarios for detailed logging across all platforms and runner methods, so you can create detailed log files for analysis, debugging, monitoring, and production use.
# Limit articles per spider
uv run scrapy crawl prothomalo -s CLOSESPIDER_ITEMCOUNT=100
# Add delays between requests
uv run scrapy crawl dailysun -s DOWNLOAD_DELAY=2
# Increase concurrent requests
uv run scrapy crawl ittefaq -s CONCURRENT_REQUESTS=32
# Set log level
uv run scrapy crawl BDpratidin -L DEBUG
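# Settings can be combined in one run, e.g. a capped, throttled test crawl:
uv run scrapy crawl prothomalo -s CLOSESPIDER_ITEMCOUNT=100 -s DOWNLOAD_DELAY=2 -L INFO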
# All spiders now write to a single shared database:
# news_articles.db (contains all newspaper articles)
# Check database content:
sqlite3 news_articles.db "SELECT paper_name, COUNT(*) FROM articles GROUP BY paper_name;"
# View recent articles:
sqlite3 news_articles.db "SELECT headline, paper_name FROM articles ORDER BY scraped_at DESC LIMIT 10;"
- Enhanced error handling with comprehensive try-catch blocks
- Single shared database for all newspapers with essential fields only
- Duplicate URL prevention with automatic checking
- Smart content extraction with multiple fallback methods
- Bengali date conversion with optimized processing
- Automatic data cleaning and text normalization
- Simplified data structure focusing on core content
- Fast export tools supporting Excel and CSV formats
BDNewsPaperScraper/
├── BDNewsPaper/                        # Main Scrapy project
│   ├── spiders/                        # Enhanced spider implementations
│   │   ├── prothomalo.py               # ProthomAlo spider (API-based)
│   │   ├── dailysun.py                 # Daily Sun spider
│   │   ├── ittefaq.py                  # Daily Ittefaq spider
│   │   ├── kalerkantho.py.disabled     # Kaler Kantho spider (DISCONTINUED)
│   │   ├── bdpratidin.py               # BD Pratidin spider
│   │   ├── thebangladeshtoday.py       # Bangladesh Today spider
│   │   └── thedailystar.py             # The Daily Star spider
│   ├── items.py                        # Advanced data models with auto-processing
│   ├── pipelines.py                    # Data processing and storage pipelines
│   ├── settings.py                     # Scrapy configuration and optimizations
│   ├── middlewares.py                  # Custom middlewares and error handling
│   └── bengalidate_to_englishdate.py   # Bengali date conversion utility
├── pyproject.toml                      # UV project configuration
├── uv.toml                             # UV workspace settings
├── setup.sh                            # Automated setup script (Linux/macOS)
├── run_spiders_optimized.sh            # Enhanced multi-spider runner (Linux/macOS)
├── run_spiders_optimized.py            # Cross-platform Python runner (Windows/Linux/macOS) (NEW)
├── run_spiders_optimized.bat           # Windows batch file wrapper (NEW)
├── performance_monitor.py              # Performance monitoring and analytics
├── toxlsx.py                           # Enhanced data export tool (Excel/CSV)
├── news_articles.db                    # Shared database for all newspapers
├── scrapy.cfg                          # Scrapy deployment configuration
└── README.md                           # This comprehensive documentation
File | Platform | Purpose |
---|---|---|
`run_spiders_optimized.sh` | Linux/macOS | Bash script with full features |
`run_spiders_optimized.py` | All Platforms (recommended) | Python script with identical features |
`run_spiders_optimized.bat` | Windows | Batch wrapper for easier Windows usage |
`setup.sh` | Linux/macOS | Automated setup |
`toxlsx.py` | All Platforms | Data export tool |
`performance_monitor.py` | All Platforms | Performance monitoring |
# Check UV installation
uv --version
# Check Python version
python --version # Should be 3.9+
# Verify project setup
./setup.sh --check
# Clean installation
./setup.sh --clean && ./setup.sh --all
# Test spider imports
uv run python -c "from BDNewsPaper.spiders.prothomalo import *; print('OK')"
# Run spider with debug logging
uv run scrapy crawl prothomalo -L DEBUG
# Check database creation
ls -la *.db
# View recent articles
sqlite3 news_articles.db "SELECT headline, publication_date FROM articles ORDER BY id DESC LIMIT 5;"
- "UV not found": Install UV using
curl -LsSf https://astral.sh/uv/install.sh | sh
- "Import errors": Run
uv sync
to install dependencies - "No articles scraped": Check internet connection and website accessibility
- "Database locked": Stop all running spiders and wait a few seconds
- "Spider not found": Use
uv run scrapy list
to see available spiders
# Check UV installation
uv --version
# Check Python version
python --version # Should be 3.9+
# Verify project setup
./setup.sh --check
# Clean installation
rm -rf .venv uv.lock
./setup.sh --clean && ./setup.sh --all
# Manual environment setup
uv venv --python 3.11
source .venv/bin/activate
uv sync
# Test spider imports
uv run python -c "from BDNewsPaper.spiders.prothomalo import ProthomAloSpider; print('Import OK')"
# Run spider with debug logging
uv run scrapy crawl prothomalo -L DEBUG -s CLOSESPIDER_ITEMCOUNT=2
# Check scrapy configuration
uv run scrapy check
# List all available spiders
uv run scrapy list
# Test spider with minimal settings
uv run scrapy crawl prothomalo -s ROBOTSTXT_OBEY=False -s CLOSESPIDER_ITEMCOUNT=1
# Check database creation and permissions
ls -la *.db
sqlite3 news_articles.db ".tables"
sqlite3 news_articles.db ".schema articles"
# Check recent articles
sqlite3 news_articles.db "SELECT COUNT(*) FROM articles;"
sqlite3 news_articles.db "SELECT headline, paper_name FROM articles ORDER BY id DESC LIMIT 5;"
# Fix database permissions
chmod 664 news_articles.db
# Repair database if corrupted
sqlite3 news_articles.db ".recover" | sqlite3 news_articles_recovered.db
# Test website connectivity
curl -I https://www.prothomalo.com/
curl -I https://www.dailysun.com/
# Test with different user agent
uv run scrapy crawl prothomalo -s USER_AGENT="Mozilla/5.0 (compatible; Bot)"
# Increase timeouts for slow networks
uv run scrapy crawl dailysun -s DOWNLOAD_TIMEOUT=30 -s DOWNLOAD_DELAY=3
# Disable SSL verification if needed (not recommended for production)
uv run scrapy crawl ittefaq -s 'DOWNLOAD_HANDLERS_BASE={"https": "scrapy.core.downloader.handlers.http.HTTPDownloadHandler"}'
# Reduce concurrent requests
uv run scrapy crawl thedailystar -s CONCURRENT_REQUESTS=1 -s CONCURRENT_REQUESTS_PER_DOMAIN=1
# Monitor memory usage
uv run python performance_monitor.py &
uv run scrapy crawl BDpratidin
# Clear logs and cache
rm -rf logs/* .scrapy/
# Install pandas for toxlsx.py
uv add pandas openpyxl
# Test export functionality
./toxlsx.py --list
./toxlsx.py --paper "ProthomAlo" --limit 5 --output test.xlsx
# Check export file permissions
ls -la *.xlsx *.csv
# Manual CSV export
sqlite3 -header -csv news_articles.db "SELECT * FROM articles LIMIT 10;" > test_export.csv
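# Date-filtered manual export (assumes publication_date values begin with YYYY-MM):
sqlite3 -header -csv news_articles.db "SELECT * FROM articles WHERE publication_date LIKE '2024-08%';" > august_2024_export.csv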
Error | Solution |
---|---|
`ModuleNotFoundError: No module named 'scrapy'` | Run `uv sync` to install dependencies |
`command not found: uv` | Install UV: `curl -LsSf https://astral.sh/uv/install.sh \| sh` |
`ImportError: No module named 'BDNewsPaper'` | Run from the project root directory |
`DatabaseError: database is locked` | Stop all running spiders, wait 10 seconds |
`SSL certificate verify failed` | Add the `-s DOWNLOAD_HANDLERS_BASE={...}` flag |
No articles scraped | Check internet connection, try with `-L DEBUG` |
`Permission denied` | Check file permissions with `ls -la` |
`[Errno 111] Connection refused` | Website may be down, try later |
# Check scrapy version and configuration
uv run scrapy version
uv run scrapy settings --get BOT_NAME
# Generate detailed logs
uv run scrapy crawl prothomalo -L DEBUG 2>&1 | tee debug.log
# Monitor system resources
top -p $(pgrep -f scrapy)
# Install scrapyd and the deployment client
uv add scrapyd scrapyd-client
# Start scrapyd server
uv run scrapyd
# Deploy project
uv run scrapyd-deploy
# Schedule spider runs
curl http://localhost:6800/schedule.json -d project=BDNewsPaper -d spider=prothomalo
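# Check scheduled, running, and finished jobs via scrapyd's listjobs.json endpoint:
curl "http://localhost:6800/listjobs.json?project=BDNewsPaper"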
# Add to crontab for daily runs
# crontab -e
# Run all spiders daily at 2 AM
0 2 * * * cd /path/to/BDNewsPaperScraper && ./run_spiders_optimized.sh
# Run specific spider every 6 hours
0 */6 * * * cd /path/to/BDNewsPaperScraper && uv run scrapy crawl prothomalo
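# OPTIONAL (sketch): weekly Excel export every Sunday at 3 AM; the path is a
# placeholder, and % is escaped as \% because cron treats % specially
0 3 * * 0 cd /path/to/BDNewsPaperScraper && python toxlsx.py --output "weekly_news_$(date +\%Y\%m\%d).xlsx"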
- Fork the repository
- Create a feature branch: `git checkout -b feature-name`
- Install dependencies: `uv sync`
- Make your changes following existing patterns
- Test your changes: `uv run scrapy crawl <spider_name> -s CLOSESPIDER_ITEMCOUNT=5`
- Format code: `uv run black . && uv run isort .`
- Submit a pull request
- Create a new spider file in `BDNewsPaper/spiders/`
- Follow existing spider patterns and error handling
- Add database configuration for the new spider
- Update this README with the new spider information
- Test thoroughly with small item counts
MIT License - see LICENSE file for details.
Q: Which spider should I run first?
A: Start with `prothomalo`; it's the fastest (API-based) and most reliable for testing.

Q: How long does it take to scrape all newspapers?
A: It depends on the limits you set:
- With limits (100 articles each): ~10-15 minutes
- Without limits (full scrape): 1-3 hours depending on network

Q: Can I run multiple spiders simultaneously?
A: Yes, but be respectful. Run 2-3 at most to avoid overwhelming servers.

Q: Do I need to delete old data before running again?
A: No, spiders automatically handle duplicates by URL. Old data is preserved.

Q: Where is my scraped data stored?
A: Everything goes into a single database: `news_articles.db`

Q: What formats can I export to?
A: Excel (.xlsx) and CSV (.csv) using the `./toxlsx.py` tool.

Q: How do I view data without exporting?
A: Use `./toxlsx.py --list` for a quick overview, or SQLite commands for detailed queries.

Q: Can I filter data by date?
A: Yes, in two ways:
- During scraping, with date arguments: `uv run scrapy crawl prothomalo -a start_date=2024-01-01 -a end_date=2024-01-31`
- After scraping, with SQLite queries: `sqlite3 news_articles.db "SELECT * FROM articles WHERE publication_date LIKE '2024-01%';"`

Q: How do I scrape articles from specific dates?
A: All spiders support date filtering with these arguments:
- `start_date=YYYY-MM-DD` - start from this date
- `end_date=YYYY-MM-DD` - end at this date

Example: `uv run scrapy crawl dailysun -a start_date=2024-08-01 -a end_date=2024-08-31`

Q: My spider isn't finding any articles, what's wrong?
A:
- Check your internet connection
- Run with debug logging: `uv run scrapy crawl <spider> -L DEBUG -s CLOSESPIDER_ITEMCOUNT=2`
- Verify the website is accessible: `curl -I <website-url>`

Q: Can I modify the scraped fields?
A: Yes, edit `BDNewsPaper/items.py` and the corresponding spider files, but the current structure is optimized for essential data.

Q: How do I speed up scraping?
A:
- Use `prothomalo` (the fastest spider)
- Increase concurrent requests: `-s CONCURRENT_REQUESTS=32`
- Set limits: `-s CLOSESPIDER_ITEMCOUNT=100`

Q: Is this legal?
A: This scraper respects robots.txt and includes delays. Always check the website's terms of service.

Q: "ModuleNotFoundError" errors?
A: Run `uv sync` to install all dependencies.

Q: "Database is locked" error?
A: Stop all running spiders and wait 10 seconds before retrying.

Q: Spider runs but gets 0 articles?
A: The website structure may have changed. Check with the `-L DEBUG` flag and update selectors if needed.
# BEST PRACTICE: Use run_spiders_optimized.sh for all production runs
./run_spiders_optimized.sh # All spiders with optimizations
./run_spiders_optimized.sh prothomalo --monitor # Single spider with monitoring
# Why it's better:
#   - 64 concurrent requests (vs 16 default)
#   - Smart auto-throttling
#   - Individual timestamped logs
#   - Real-time progress tracking
#   - Automatic performance reports
#   - Built-in error handling
#   - UV auto-detection
# Fast test run (recommended for development)
uv run scrapy crawl prothomalo -s CLOSESPIDER_ITEMCOUNT=10
# Production run with optimal settings
uv run scrapy crawl prothomalo -s CONCURRENT_REQUESTS=16 -s DOWNLOAD_DELAY=1
# Monitor while running
tail -f scrapy.log | grep -E "(scraped|items)"
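# OPTIONAL: watch per-paper row counts grow while spiders run
# (requires the `watch` utility; the 30-second interval is arbitrary)
watch -n 30 'sqlite3 news_articles.db "SELECT paper_name, COUNT(*) FROM articles GROUP BY paper_name;"'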
# Create a daily scraping script
cat > daily_scrape.sh << 'EOF'
#!/bin/bash
cd /path/to/BDNewsPaperScraper
source .venv/bin/activate
# Run fast spiders daily
uv run scrapy crawl prothomalo -s CLOSESPIDER_ITEMCOUNT=200
uv run scrapy crawl dailysun -s CLOSESPIDER_ITEMCOUNT=100
# Export latest data
./toxlsx.py --limit 500 --output "daily_news_$(date +%Y%m%d).xlsx"
echo "Daily scraping completed: $(date)"
EOF
chmod +x daily_scrape.sh
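# OPTIONAL (sketch): register the script with cron, e.g. daily at 3 AM
# (the path is a placeholder; point it at your clone)
(crontab -l 2>/dev/null; echo "0 3 * * * /path/to/BDNewsPaperScraper/daily_scrape.sh >> /path/to/BDNewsPaperScraper/logs/daily_scrape.log 2>&1") | crontab -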
# Quick statistics
sqlite3 news_articles.db "
SELECT
paper_name,
COUNT(*) as total_articles,
MIN(publication_date) as earliest,
MAX(publication_date) as latest
FROM articles
GROUP BY paper_name
ORDER BY total_articles DESC;"
# Find trending topics
sqlite3 news_articles.db "
SELECT
substr(headline, 1, 50) as headline_preview,
paper_name,
publication_date
FROM articles
WHERE headline LIKE '%economy%'
OR headline LIKE '%politics%'
ORDER BY publication_date DESC
LIMIT 20;"
- Documentation: Check this README and inline code comments
- Issues: Check database files and log outputs
- Performance: Use the performance monitor tool
- Custom needs: Modify spider settings and configurations