Distributed scraping system with caching for business information extraction.
- Scalable distributed architecture
- Configurable multi-threaded processing
- Advanced logging system with customizable levels
- Intelligent error handling and automatic recovery
- Efficient system resource management
- Support for multiple backends (Redis, Memcached)
- Configurable compression and encryption
- Policy-based automatic cleanup system
- Configurable replication and consistency
- Intelligent memory and space management
- Real-time monitoring of critical events
- Configurable severity levels
- Detailed alert history with metadata
- Detection and grouping of duplicate alerts
- Integration with metrics system
- Automatic system metrics collection
- Performance and resource monitoring
- Detailed operation statistics
- Configurable log rotation system
- Standard format metrics export
- Configurable authentication system
- Protection against brute force attacks
- Token and session management
- Sensitive data encryption
- Configurable access policies
- OCR Integration (Tesseract)
- AI capabilities with LLaMA model
- Parallel data processing
- Configurable extraction pipeline
- Data validation and cleaning
- Automatic temporary resource cleanup
- Configurable backup management
- CPU and memory usage control
- Disk space monitoring
- Automatic failure recovery
scraper/
├── __init__.py
├── alerts/
│ ├── __init__.py
│ ├── manager.py
│ ├── handlers.py
│ └── metrics.py
├── cache/
│ ├── __init__.py
│ ├── distributed.py
│ ├── cleaner.py
│ ├── compression.py
│ ├── encryption.py
│ └── priority.py
├── core/
│ ├── __init__.py
│ ├── config.py
│ ├── logging.py
│ ├── metrics.py
│ └── utils.py
├── db/
│ ├── __init__.py
│ ├── models.py
│ ├── session.py
│ └── operations.py
├── extractors/
│ ├── __init__.py
│ ├── base.py
│ ├── text.py
│ ├── ocr.py
│ └── ai.py
├── monitor/
│ ├── __init__.py
│ ├── system.py
│ ├── resources.py
│ └── alerts.py
├── security/
│ ├── __init__.py
│ ├── auth.py
│ ├── encryption.py
│ └── tokens.py
└── utils/
├── __init__.py
├── validation.py
├── formatting.py
└── helpers.py
config/
├── logging.yaml
├── cache.yaml
├── alerts.yaml
├── metrics.yaml
└── security.yaml
tests/
├── unit/
├── integration/
└── performance/
docs/
├── api/
├── setup/
└── examples/
- Authentication: Role and token-based access control
- Compression: Automatic compression based on data type and size
- Encryption: Transparent sensitive data encryption
- Events: Pub/sub system for monitoring and reaction
- Partitioning: Consistent data distribution
- Replication: Redundant copies for high availability
- Circuit Breakers: Protection against cascade failures (see the sketch after this list)
- Cleanup: Automatic data aging management
- Error Handling: Unified system with:
  - Detailed logging
  - Error metrics
  - Automatic notifications
  - Intelligent recovery
- Resource Management:
  - Automatic connection closure
  - Resource cleanup
  - Context managers
  - Lifecycle management
- Statistics:
  - Node performance
  - Resource usage
  - Operations by type
  - Temporal analysis
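To make the circuit-breaker protection mentioned in the list above concrete, here is a minimal sketch of one way a breaker could wrap a cache node call. The class name, thresholds, and `call` helper are illustrative assumptions, not the project's actual API.

```python
import time

class CircuitBreaker:
    """Minimal circuit-breaker sketch (illustrative, not the project's API)."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # timestamp when the breaker opened

    def call(self, func, *args, **kwargs):
        # While open, reject calls until the reset timeout has elapsed.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: skipping call")
            self.opened_at = None  # half-open: allow one trial call
            self.failures = 0
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```

A node client could then wrap each network operation, e.g. `breaker.call(redis_client.get, key)`, so repeated failures stop hammering an unhealthy node.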
The system uses a centralized event manager to monitor and react to different situations:
- Critical (High Priority):
  - Errors
  - Node failures
  - Recovery/migration failures
- Operational (Medium Priority):
  - Warnings
  - Migrations
  - Rebalancing
  - Backups/Restorations
- Informational (Low Priority):
  - GET/SET operations
  - Informational logs
  - Metrics
- Configuration:
  - Customizable thresholds by alert type
  - Configurable severity levels
  - Related alert grouping
  - Configurable deduplication windows
- Monitoring:
  - Detailed alert history
  - Severity statistics
  - Filtering and search
  - Alert metrics
  - Automatic history cleanup
- Notifications:
  - System event integration
  - Similar alert aggregation
  - Alert storm prevention
  - Duplicate detection
  - Silence windows
- Resource Management:
  - Automatic periodic cleanup
  - Memory management
  - Context managers
  - Orderly shutdown
- Statistics:
  - Period summaries
  - Severity distribution
  - Trend analysis
  - Deduplication metrics
  - Cleanup efficiency
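A rough sketch of how the duplicate detection and severity handling described above might fit together; the class, its method names, and the 60-second window are assumptions made for illustration only.

```python
import time
from collections import defaultdict

class AlertManager:
    """Illustrative sketch of severity handling and duplicate suppression."""

    def __init__(self, dedup_window=60.0):
        self.dedup_window = dedup_window      # seconds to treat repeats as duplicates
        self.last_seen = {}                   # (severity, message) -> last timestamp
        self.history = []                     # stored alerts with metadata
        self.duplicates = defaultdict(int)    # duplicate counters per alert key

    def emit(self, severity, message):
        key = (severity, message)
        now = time.time()
        last = self.last_seen.get(key)
        if last is not None and now - last < self.dedup_window:
            self.duplicates[key] += 1         # grouped instead of re-notified
            return None
        self.last_seen[key] = now
        alert = {"severity": severity, "message": message, "time": now}
        self.history.append(alert)
        return alert

manager = AlertManager()
manager.emit("critical", "node cache-2 unreachable")
manager.emit("critical", "node cache-2 unreachable")  # suppressed as a duplicate
```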
- Real-time Metrics:
  - Operation latency
  - Success/error rates
  - Resource usage
  - Node statistics
  - Access patterns
- Configurable Alerts:
  - Dynamic thresholds
  - Event correlation
  - Trend analysis
- Reports:
  - Historical performance
  - Error analysis
  - Resource usage
  - Access patterns
  - Periodic summaries
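As one way to picture the real-time metrics listed above, the sketch below records operation latency and success/error counts in memory. The registry and its method names are assumed for illustration, not taken from the codebase.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class MetricsRegistry:
    """Illustrative in-memory metrics: latency samples plus success/error counters."""

    def __init__(self):
        self.latencies = defaultdict(list)   # operation name -> list of durations (s)
        self.counters = defaultdict(int)     # e.g. "cache.get.success", "cache.get.error"

    @contextmanager
    def timer(self, operation):
        start = time.perf_counter()
        try:
            yield
            self.counters[f"{operation}.success"] += 1
        except Exception:
            self.counters[f"{operation}.error"] += 1
            raise
        finally:
            self.latencies[operation].append(time.perf_counter() - start)

metrics = MetricsRegistry()
with metrics.timer("cache.get"):
    pass  # the real cache call would go here
```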
- Python 3.8+
- Redis 6.0+ or Memcached 1.6+
- PostgreSQL 12+ (optional)
- Tesseract 4.1+ (optional for OCR)
- CUDA 11.0+ (optional for AI)
# Create virtual environment
python -m venv venv
source venv/bin/activate # Linux/Mac
.\venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt
# Initial setup
python setup.py install
# OCR
pip install -r requirements-ocr.txt
# AI
pip install -r requirements-ai.txt
# Database
pip install -r requirements-db.txt
- Copy example files:
for f in config/*.yaml.example; do cp "$f" "${f%.example}"; done
- Configure environment variables:
cp .env.example .env
# Edit .env with your values
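If the project reads these variables with python-dotenv (an assumption; check requirements.txt), loading them looks like this:

```python
import os
from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # reads .env from the working directory
redis_url = os.getenv("REDIS_URL", "redis://localhost:6379/0")  # variable name is illustrative
```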
- Choose backend (Redis/Memcached)
- Configure parameters in config/cache.yaml (see the loading sketch below)
- Adjust related environment variables
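A minimal sketch of reading that file and picking a backend. The keys (`backend`, `host`, `port`) and the client libraries are assumptions made for illustration, not the project's actual schema.

```python
import yaml  # PyYAML

with open("config/cache.yaml") as fh:
    cfg = yaml.safe_load(fh)

backend = cfg.get("backend", "redis")  # assumed key layout
if backend == "redis":
    import redis
    client = redis.Redis(host=cfg.get("host", "localhost"), port=cfg.get("port", 6379))
else:
    from pymemcache.client.base import Client  # assumes pymemcache for Memcached
    client = Client((cfg.get("host", "localhost"), cfg.get("port", 11211)))
```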
- Define severity levels
- Configure thresholds in config/alerts.yaml
- Set notification policies
- Enable metrics collection
- Configure intervals in config/metrics.yaml
- Define log rotation policies (a rotation sketch follows below)
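Rotation itself can be handled with the standard library. This is a generic example rather than the project's actual logging setup; the file path and size limits are assumptions.

```python
import logging
from logging.handlers import RotatingFileHandler

handler = RotatingFileHandler(
    "logs/scraper.log",          # path assumed from the troubleshooting section
    maxBytes=10 * 1024 * 1024,   # rotate after roughly 10 MB
    backupCount=5,               # keep five rotated files
)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s"))
logging.getLogger("scraper").addHandler(handler)
```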
- Generate encryption keys (one way to do this is sketched below)
- Configure policies in config/security.yaml
- Set authentication parameters
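If the encryption layer is based on Fernet from the cryptography package (an assumption; the project may use a different scheme), a key can be generated and verified like this:

```python
from cryptography.fernet import Fernet

key = Fernet.generate_key()                    # URL-safe base64-encoded 32-byte key
with open("config/secret.key", "wb") as fh:    # destination path is illustrative
    fh.write(key)

# Quick round-trip check
token = Fernet(key).encrypt(b"sensitive value")
assert Fernet(key).decrypt(token) == b"sensitive value"
```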
# Start the web interface
streamlit run app.py
# Run the scraper only
python run_scraper.py
# View real-time metrics
python -m scraper.monitor metrics
# View system status
python -m scraper.monitor status
# View active alerts
python -m scraper.monitor alerts
# Clean cache
python -m scraper.cache clean
# Rotate logs
python -m scraper.utils rotate-logs
# Data backup
python -m scraper.utils backup
# Unit tests
python -m pytest tests/unit
# Integration tests
python -m pytest tests/integration
# Performance tests
python -m pytest tests/performance
# All tests with coverage
python -m pytest --cov=scraper tests/
# Cache system tests
python -m pytest tests/unit/test_cache.py
# Alert system tests
python -m pytest tests/unit/test_alerts.py
# Cache performance tests
python -m pytest tests/performance/test_cache_performance.py
# Static analysis
flake8 scraper
# Type checking
mypy scraper
# Code formatting
black scraper
- Fork the repository
- Create a branch for your feature:
git checkout -b feature/feature-name
- Implement your changes following style guides
- Ensure all tests pass
- Update documentation if necessary
- Create a pull request
- Follow PEP 8 for Python code style
- Document all functions and classes with docstrings
- Maintain test coverage > 80%
- Use type hints in all functions
- Maintain cyclomatic complexity < 10
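For example, a small helper written to these guidelines would look roughly like this (the function itself is only an illustration):

```python
def normalize_name(raw_name: str) -> str:
    """Return a business name with collapsed whitespace and title-case capitalization.

    Args:
        raw_name: The name exactly as read from the input CSV.

    Returns:
        A cleaned, consistently capitalized name.
    """
    return " ".join(raw_name.split()).title()
```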
- Create issue describing the change
- Discuss implementation in the issue
- Implement changes in a branch
- Run complete test suite
- Create pull request
- Code review and approval
- Merge to main
- Use GitHub's issue system
- Include steps to reproduce
- Attach relevant logs
- Specify system version
- Describe expected vs actual behavior
This project is licensed under the MIT License - see the LICENSE file for details.
- GitHub Issues: For bug reports and feature requests
- Discussions: For general questions and discussions
- Wiki: For extended documentation and guides
- Keep code updated
- Review pull requests
- Respond to issues
- Update documentation
Note: This project is in active development. Contributions are welcome.
- Python 3.8+
- Google Chrome Browser
- Git
- Clone the repository:
git clone <repository-url>
cd business-address-scrapper
- Create and activate virtual environment:
Windows:
python -m venv venv
.\venv\Scripts\activate
Linux/Mac:
python -m venv venv
source venv/bin/activate
- Install basic dependencies:
pip install -r requirements.txt
- Configure environment variables:
# Windows
copy .env.example .env
# Linux/Mac
cp .env.example .env
- Prepare input CSV file with business names in the first column
- Run the scraper:
python simple_scraper.py input.csv output.csv
- Additional options:
python simple_scraper.py --input input.csv --output output.csv --retries 3 --wait 5
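For reference, the input is simply a CSV whose first column holds the business name. A minimal way to read it (illustrative only, not the scraper's internal code):

```python
import csv

with open("input.csv", newline="", encoding="utf-8") as fh:
    business_names = [row[0] for row in csv.reader(fh) if row]

print(business_names[:5])
```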
The scraper can run in two modes:
- Local Mode: Uses local Chrome and webdriver-manager
- Container Mode: Uses pre-configured Chrome and ChromeDriver
To configure the mode:
- Edit .env:
# Execution mode
EXECUTION_ENV=local # or 'container'
# Browser settings
CHROME_BINARY_PATH= # Leave empty for local
CHROME_DRIVER_PATH= # Leave empty for local
HEADLESS_MODE=false # true/false
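To illustrate what these settings imply, a driver factory might branch on EXECUTION_ENV roughly as below. This is a sketch under the assumption that Selenium is used together with webdriver-manager, as the local mode description suggests; it is not the project's actual code.

```python
import os
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

def make_driver():
    options = Options()
    if os.getenv("HEADLESS_MODE", "false").lower() == "true":
        options.add_argument("--headless=new")
    if os.getenv("EXECUTION_ENV", "local") == "container":
        # Container mode: Chrome and ChromeDriver are pre-installed at fixed paths.
        options.binary_location = os.environ["CHROME_BINARY_PATH"]
        service = Service(os.environ["CHROME_DRIVER_PATH"])
    else:
        # Local mode: let webdriver-manager download a matching ChromeDriver.
        from webdriver_manager.chrome import ChromeDriverManager
        service = Service(ChromeDriverManager().install())
    return webdriver.Chrome(service=service, options=options)
```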
- Chrome/ChromeDriver Issues:
  - Ensure Chrome is installed
  - Update Chrome to the latest version
  - Clear browser cache/cookies
- Permission Issues:
  - Verify write permissions in the output directory
  - Run with appropriate privileges
- Resource Issues:
  - Increase system memory allocation
  - Adjust scraping delays in .env
- Simple Scraper Logs:
# View recent logs
tail -f logs/scraper.log