Distributed scraping system with caching for business information extraction.
- Scalable distributed architecture
- Configurable multi-threaded processing
- Advanced logging system with customizable levels
- Intelligent error handling and automatic recovery
- Efficient system resource management
- Support for multiple backends (Redis, Memcached)
- Configurable compression and encryption
- Policy-based automatic cleanup system
- Configurable replication and consistency
- Intelligent memory and space management
- Real-time monitoring of critical events
- Configurable severity levels
- Detailed alert history with metadata
- Detection and grouping of duplicate alerts
- Integration with metrics system
- Automatic system metrics collection
- Performance and resource monitoring
- Detailed operation statistics
- Configurable log rotation system
- Standard format metrics export
- Configurable authentication system
- Protection against brute force attacks
- Token and session management
- Sensitive data encryption
- Configurable access policies
- OCR Integration (Tesseract)
- AI capabilities with LLaMA model
- Parallel data processing
- Configurable extraction pipeline
- Data validation and cleaning
- Automatic temporary resource cleanup
- Configurable backup management
- CPU and memory usage control
- Disk space monitoring
- Automatic failure recovery
scraper/
├── __init__.py
├── alerts/
│ ├── __init__.py
│ ├── manager.py
│ ├── handlers.py
│ └── metrics.py
├── cache/
│ ├── __init__.py
│ ├── distributed.py
│ ├── cleaner.py
│ ├── compression.py
│ ├── encryption.py
│ └── priority.py
├── core/
│ ├── __init__.py
│ ├── config.py
│ ├── logging.py
│ ├── metrics.py
│ └── utils.py
├── db/
│ ├── __init__.py
│ ├── models.py
│ ├── session.py
│ └── operations.py
├── extractors/
│ ├── __init__.py
│ ├── base.py
│ ├── text.py
│ ├── ocr.py
│ └── ai.py
├── monitor/
│ ├── __init__.py
│ ├── system.py
│ ├── resources.py
│ └── alerts.py
├── security/
│ ├── __init__.py
│ ├── auth.py
│ ├── encryption.py
│ └── tokens.py
└── utils/
├── __init__.py
├── validation.py
├── formatting.py
└── helpers.py
config/
├── logging.yaml
├── cache.yaml
├── alerts.yaml
├── metrics.yaml
└── security.yaml
tests/
├── unit/
├── integration/
└── performance/
docs/
├── api/
├── setup/
└── examples/
- Authentication: Role and token-based access control
- Compression: Automatic compression based on data type and size
- Encryption: Transparent sensitive data encryption
- Events: Pub/sub system for monitoring and reaction
- Partitioning: Consistent data distribution
- Replication: Redundant copies for high availability
- Circuit Breakers: Protection against cascade failures (see the sketch after this list)
- Cleanup: Automatic data aging management
- Error Handling: Unified system with:
  - Detailed logging
  - Error metrics
  - Automatic notifications
  - Intelligent recovery
- Resource Management:
  - Automatic connection closure
  - Resource cleanup
  - Context managers
  - Lifecycle management
- Statistics:
  - Node performance
  - Resource usage
  - Operations by type
  - Temporal analysis
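To make the circuit-breaker protection mentioned in the list above concrete, here is a minimal sketch of one way a breaker could wrap a cache node call. The class name, thresholds, and `call` helper are illustrative assumptions, not the project's actual API.

```python
import time

class CircuitBreaker:
    """Minimal circuit-breaker sketch (illustrative, not the project's API)."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # timestamp when the breaker opened

    def call(self, func, *args, **kwargs):
        # While open, reject calls until the reset timeout has elapsed.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: skipping call")
            self.opened_at = None  # half-open: allow one trial call
            self.failures = 0
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```

A node client could then wrap each network operation, e.g. `breaker.call(redis_client.get, key)`, so repeated failures stop hammering an unhealthy node.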
The system uses a centralized event manager to monitor and react to different situations:
- Critical (High Priority):
  - Errors
  - Node failures
  - Recovery/migration failures
- Operational (Medium Priority):
  - Warnings
  - Migrations
  - Rebalancing
  - Backups/Restorations
- Informational (Low Priority):
  - GET/SET operations
  - Informational logs
  - Metrics
- Configuration:
  - Customizable thresholds by alert type
  - Configurable severity levels
  - Related alert grouping
  - Configurable deduplication windows
- Monitoring:
  - Detailed alert history
  - Severity statistics
  - Filtering and search
  - Alert metrics
  - Automatic history cleanup
- Notifications:
  - System event integration
  - Similar alert aggregation
  - Alert storm prevention
  - Duplicate detection
  - Silence windows
- Resource Management:
  - Automatic periodic cleanup
  - Memory management
  - Context managers
  - Orderly shutdown
- Statistics:
  - Period summaries
  - Severity distribution
  - Trend analysis
  - Deduplication metrics
  - Cleanup efficiency
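A rough sketch of how the duplicate detection and severity handling described above might fit together; the class, its method names, and the 60-second window are assumptions made for illustration only.

```python
import time
from collections import defaultdict

class AlertManager:
    """Illustrative sketch of severity handling and duplicate suppression."""

    def __init__(self, dedup_window=60.0):
        self.dedup_window = dedup_window      # seconds to treat repeats as duplicates
        self.last_seen = {}                   # (severity, message) -> last timestamp
        self.history = []                     # stored alerts with metadata
        self.duplicates = defaultdict(int)    # duplicate counters per alert key

    def emit(self, severity, message):
        key = (severity, message)
        now = time.time()
        last = self.last_seen.get(key)
        if last is not None and now - last < self.dedup_window:
            self.duplicates[key] += 1         # grouped instead of re-notified
            return None
        self.last_seen[key] = now
        alert = {"severity": severity, "message": message, "time": now}
        self.history.append(alert)
        return alert

manager = AlertManager()
manager.emit("critical", "node cache-2 unreachable")
manager.emit("critical", "node cache-2 unreachable")  # suppressed as a duplicate
```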
- Real-time Metrics:
  - Operation latency
  - Success/error rates
  - Resource usage
  - Node statistics
  - Access patterns
- Configurable Alerts:
  - Dynamic thresholds
  - Event correlation
  - Trend analysis
- Reports:
  - Historical performance
  - Error analysis
  - Resource usage
  - Access patterns
  - Periodic summaries
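As one way to picture the real-time metrics listed above, the sketch below records operation latency and success/error counts in memory. The registry and its method names are assumed for illustration, not taken from the codebase.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class MetricsRegistry:
    """Illustrative in-memory metrics: latency samples plus success/error counters."""

    def __init__(self):
        self.latencies = defaultdict(list)   # operation name -> list of durations (s)
        self.counters = defaultdict(int)     # e.g. "cache.get.success", "cache.get.error"

    @contextmanager
    def timer(self, operation):
        start = time.perf_counter()
        try:
            yield
            self.counters[f"{operation}.success"] += 1
        except Exception:
            self.counters[f"{operation}.error"] += 1
            raise
        finally:
            self.latencies[operation].append(time.perf_counter() - start)

metrics = MetricsRegistry()
with metrics.timer("cache.get"):
    pass  # the real cache call would go here
```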
- Python 3.8+
- Redis 6.0+ or Memcached 1.6+
- PostgreSQL 12+ (optional)
- Tesseract 4.1+ (optional for OCR)
- CUDA 11.0+ (optional for AI)
# Create virtual environment
python -m venv venv
source venv/bin/activate # Linux/Mac
.\venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt
# Initial setup
python setup.py install
# OCR
pip install -r requirements-ocr.txt
# AI
pip install -r requirements-ai.txt
# Database
pip install -r requirements-db.txt
- Copy example files:
for f in config/*.yaml.example; do cp "$f" "${f%.example}"; done
- Configure environment variables:
cp .env.example .env
# Edit .env with your values
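If the project reads these variables with python-dotenv (an assumption; check requirements.txt), loading them looks like this:

```python
import os
from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # reads .env from the working directory
redis_url = os.getenv("REDIS_URL", "redis://localhost:6379/0")  # variable name is illustrative
```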
- Choose backend (Redis/Memcached)
- Configure parameters in config/cache.yaml (see the loading sketch below)
- Adjust related environment variables
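A minimal sketch of reading that file and picking a backend. The keys (`backend`, `host`, `port`) and the client libraries are assumptions made for illustration, not the project's actual schema.

```python
import yaml  # PyYAML

with open("config/cache.yaml") as fh:
    cfg = yaml.safe_load(fh)

backend = cfg.get("backend", "redis")  # assumed key layout
if backend == "redis":
    import redis
    client = redis.Redis(host=cfg.get("host", "localhost"), port=cfg.get("port", 6379))
else:
    from pymemcache.client.base import Client  # assumes pymemcache for Memcached
    client = Client((cfg.get("host", "localhost"), cfg.get("port", 11211)))
```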
- Define severity levels
- Configure thresholds in config/alerts.yaml
- Set notification policies
- Enable metrics collection
- Configure intervals in config/metrics.yaml
- Define log rotation policies (a rotation sketch follows below)
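Rotation itself can be handled with the standard library. This is a generic example rather than the project's actual logging setup; the file path and size limits are assumptions.

```python
import logging
from logging.handlers import RotatingFileHandler

handler = RotatingFileHandler(
    "logs/scraper.log",          # path assumed from the troubleshooting section
    maxBytes=10 * 1024 * 1024,   # rotate after roughly 10 MB
    backupCount=5,               # keep five rotated files
)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s"))
logging.getLogger("scraper").addHandler(handler)
```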
- Generate encryption keys (one way to do this is sketched below)
- Configure policies in config/security.yaml
- Set authentication parameters
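If the encryption layer is based on Fernet from the cryptography package (an assumption; the project may use a different scheme), a key can be generated and verified like this:

```python
from cryptography.fernet import Fernet

key = Fernet.generate_key()                    # URL-safe base64-encoded 32-byte key
with open("config/secret.key", "wb") as fh:    # destination path is illustrative
    fh.write(key)

# Quick round-trip check
token = Fernet(key).encrypt(b"sensitive value")
assert Fernet(key).decrypt(token) == b"sensitive value"
```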
# Start the web interface
streamlit run app.py
# Run the scraper only
python run_scraper.py
# View real-time metrics
python -m scraper.monitor metrics
# View system status
python -m scraper.monitor status
# View active alerts
python -m scraper.monitor alerts
# Clean cache
python -m scraper.cache clean
# Rotate logs
python -m scraper.utils rotate-logs
# Data backup
python -m scraper.utils backup
# Unit tests
python -m pytest tests/unit
# Integration tests
python -m pytest tests/integration
# Performance tests
python -m pytest tests/performance
# All tests with coverage
python -m pytest --cov=scraper tests/
# Cache system tests
python -m pytest tests/unit/test_cache.py
# Alert system tests
python -m pytest tests/unit/test_alerts.py
# Cache performance tests
python -m pytest tests/performance/test_cache_performance.py
# Static analysis
flake8 scraper
# Type checking
mypy scraper
# Code formatting
black scraper
- Fork the repository
- Create a branch for your feature:
git checkout -b feature/feature-name
- Implement your changes following style guides
- Ensure all tests pass
- Update documentation if necessary
- Create a pull request
- Follow PEP 8 for Python code style
- Document all functions and classes with docstrings
- Maintain test coverage > 80%
- Use type hints in all functions
- Maintain cyclomatic complexity < 10
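For example, a small helper written to these guidelines would look roughly like this (the function itself is only an illustration):

```python
def normalize_name(raw_name: str) -> str:
    """Return a business name with collapsed whitespace and title-case capitalization.

    Args:
        raw_name: The name exactly as read from the input CSV.

    Returns:
        A cleaned, consistently capitalized name.
    """
    return " ".join(raw_name.split()).title()
```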
- Create issue describing the change
- Discuss implementation in the issue
- Implement changes in a branch
- Run complete test suite
- Create pull request
- Code review and approval
- Merge to main
- Use GitHub's issue system
- Include steps to reproduce
- Attach relevant logs
- Specify system version
- Describe expected vs actual behavior
This project is licensed under the MIT License - see the LICENSE file for details.
- GitHub Issues: For bug reports and feature requests
- Discussions: For general questions and discussions
- Wiki: For extended documentation and guides
- Keep code updated
- Review pull requests
- Respond to issues
- Update documentation
Note: This project is in active development. Contributions are welcome.
- Python 3.8+
- Google Chrome Browser
- Git
- Clone the repository:
git clone <repository-url>
cd business-address-scrapper
- Create and activate virtual environment:
Windows:
python -m venv venv
.\venv\Scripts\activate
Linux/Mac:
python -m venv venv
source venv/bin/activate
- Install basic dependencies:
pip install -r requirements.txt
- Configure environment variables:
# Windows
copy .env.example .env
# Linux/Mac
cp .env.example .env
- Prepare input CSV file with business names in the first column
- Run the scraper:
python simple_scraper.py input.csv output.csv
- Additional options:
python simple_scraper.py --input input.csv --output output.csv --retries 3 --wait 5
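For reference, the input is simply a CSV whose first column holds the business name. A minimal way to read it (illustrative only, not the scraper's internal code):

```python
import csv

with open("input.csv", newline="", encoding="utf-8") as fh:
    business_names = [row[0] for row in csv.reader(fh) if row]

print(business_names[:5])
```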
The scraper can run in two modes:
- Local Mode: Uses local Chrome and webdriver-manager
- Container Mode: Uses pre-configured Chrome and ChromeDriver
To configure the mode:
- Edit .env:
# Execution mode
EXECUTION_ENV=local # or 'container'
# Browser settings
CHROME_BINARY_PATH= # Leave empty for local
CHROME_DRIVER_PATH= # Leave empty for local
HEADLESS_MODE=false # true/false
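To illustrate what these settings imply, a driver factory might branch on EXECUTION_ENV roughly as below. This is a sketch under the assumption that Selenium is used together with webdriver-manager, as the local mode description suggests; it is not the project's actual code.

```python
import os
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

def make_driver():
    options = Options()
    if os.getenv("HEADLESS_MODE", "false").lower() == "true":
        options.add_argument("--headless=new")
    if os.getenv("EXECUTION_ENV", "local") == "container":
        # Container mode: Chrome and ChromeDriver are pre-installed at fixed paths.
        options.binary_location = os.environ["CHROME_BINARY_PATH"]
        service = Service(os.environ["CHROME_DRIVER_PATH"])
    else:
        # Local mode: let webdriver-manager download a matching ChromeDriver.
        from webdriver_manager.chrome import ChromeDriverManager
        service = Service(ChromeDriverManager().install())
    return webdriver.Chrome(service=service, options=options)
```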
- Chrome/ChromeDriver Issues:
  - Ensure Chrome is installed
  - Update Chrome to the latest version
  - Clear browser cache/cookies
- Permission Issues:
  - Verify write permissions in the output directory
  - Run with appropriate privileges
- Resource Issues:
  - Increase system memory allocation
  - Adjust scraping delays in .env
- Simple Scraper Logs:
# View recent logs
tail -f logs/scraper.log