A tool for analyzing Dutch companies based on their branch structure and collecting detailed company information using OpenCorporates and Perplexity.
This project consists of three main phases:
- Phase 1 (Branch Analysis): Identifies "big" companies by analyzing their branch/subsidiary structure using OpenCorporates data
- Phase 2 (Company Details): Collects detailed information about identified big companies using Perplexity, including industry, employee estimates, and business intelligence
- Phase 3 (Export & Visualization): Exports and visualizes the enriched data through Excel reports and interactive web dashboards
- Python 3.7+
- Chrome browser installed
- Required Python packages (see requirements.txt)
- Clone this repository
- Install required packages:
pip install -r requirements.txt
- Ensure Chrome browser is installed
- Create required directories:
mkdir -p logs db
Basic usage:
python src/main.py input.csv
Options:
- --db-path: Specify SQLite database path (default: ./db/companies.db)
- --start-index: Starting row index to process (inclusive)
- --end-index: Ending row index to process (exclusive)
- --log-dir: Directory to store log files (default: ./logs/kvk_scraper_TIMESTAMP_pidNUM/)
- --retry-failed: Retry processing companies that previously failed
Example:
python src/main.py companies.csv --start-index 100 --retry-failed
Process companies with branches to get detailed information:
python src/phase2_processor.py
Options:
- --phase1-db: Path to Phase 1 database (default: ./db/companies.db)
- --phase2-db: Path to Phase 2 database (default: ./db/company_details.db)
- --max-companies: Maximum number of companies to process
- --delay: Delay between API calls in seconds (default: 1.0)
- --log-dir: Directory for log files
Examples:
# Process all companies with branches
python src/phase2_processor.py
# Process only 10 companies with 2-second delays
python src/phase2_processor.py --max-companies 10 --delay 2.0
# Use custom database paths
python src/phase2_processor.py --phase1-db ./data/companies.db --phase2-db ./data/details.db
Note: Before running Phase 2, ensure you have:
- A .env file with your Perplexity API key:
PERPLEXITY_API_KEY=your_api_key_here
PERPLEXITY_MODEL=sonar
- Completed Phase 1 processing with companies that have branches
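As a rough illustration of how the two .env settings above could be read, here is a minimal sketch using only the standard library. The function names `load_env_file` and `get_perplexity_settings` are hypothetical, and falling back to `sonar` when no model is set is an assumption taken from the example value above; the project's own loader may work differently.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines; skip blank lines and # comments."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_perplexity_settings(path=".env"):
    """Return (api_key, model); real environment variables win over the file."""
    file_values = load_env_file(path)
    api_key = os.environ.get("PERPLEXITY_API_KEY") or file_values.get("PERPLEXITY_API_KEY")
    # Assumption: fall back to "sonar" when no model is configured.
    model = os.environ.get("PERPLEXITY_MODEL") or file_values.get("PERPLEXITY_MODEL", "sonar")
    if not api_key:
        raise RuntimeError("PERPLEXITY_API_KEY is not set")
    return api_key, model
```

Failing early when the key is missing avoids starting a long Phase 2 run that would fail on the first API call.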
Export and visualize the enriched company data from Phase 2:
Export company details to Excel with multiple sheets for analysis:
python src/export_to_excel.py
Options:
- --db-path: Path to company details database (default: ./db/company_details.db)
- --output: Output Excel filename (default: company_details.xlsx)
Example:
python src/export_to_excel.py --db-path ./db/company_details.db --output my_companies.xlsx
The Excel file includes:
- Company Details: Main data with parsed industries
- Summary: Processing statistics and metrics
- Industries: Industry breakdown and counts
- Employee Ranges: Employee range distribution
Launch an interactive web dashboard to explore and filter company data:
pip install streamlit plotly
streamlit run src/web_dashboard.py
The dashboard features:
- Real-time filtering by confidence score, employee range, and industries
- Interactive charts showing industry and confidence score distributions
- Downloadable filtered results as CSV
- Customizable column display
- Company metrics and statistics
Note: The web dashboard will open in your browser at http://localhost:8501
Deploy your dashboard to Streamlit Cloud for private sharing:
- Encode your database:
python src/encode_db.py ./db/company_details.db
- Set up secrets: Copy the encoded output to your Streamlit Cloud app secrets
- Deploy: Use web_dashboard_secrets.py for deployment:
streamlit run src/web_dashboard_secrets.py
The deployed app will automatically load data from secrets without requiring file uploads.
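The encoding used by encode_db.py is not described here. One common way to carry a binary SQLite file through a text-only secrets store is base64, so the sketch below assumes that scheme; `encode_db` and `decode_db` are hypothetical helper names, not the project's actual functions.

```python
import base64
from pathlib import Path


def encode_db(db_path):
    """Return the database file's bytes as an ASCII base64 string."""
    return base64.b64encode(Path(db_path).read_bytes()).decode("ascii")


def decode_db(encoded, out_path):
    """Write the decoded bytes back to disk and return the path."""
    out = Path(out_path)
    out.write_bytes(base64.b64decode(encoded))
    return out
```

Under this assumption, the deployed app would decode the secret into a temporary file once at startup and open that file with sqlite3 as usual.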
The input CSV file should contain at least these columns:
- kvk_number: KvK registration number
- company_name: Company name
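To make the expected input concrete, the following sketch reads the two required columns, applies the inclusive start and exclusive end row indexes, and normalizes KvK numbers to eight digits. The normalization rule (drop non-digit characters, left-pad with zeros) is an assumption about what "various formats" means, and `normalize_kvk` and `read_companies` are hypothetical names.

```python
import csv
import io
import re


def normalize_kvk(raw):
    """Strip non-digits and left-pad to 8 digits; return None if unusable."""
    digits = re.sub(r"\D", "", str(raw))
    if not digits or len(digits) > 8:
        return None
    return digits.zfill(8)


def read_companies(csv_file, start_index=0, end_index=None):
    """Yield (kvk_number, company_name) for rows in [start_index, end_index)."""
    reader = csv.DictReader(csv_file)
    missing = {"kvk_number", "company_name"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    for index, row in enumerate(reader):
        if index < start_index:
            continue
        if end_index is not None and index >= end_index:
            break
        yield normalize_kvk(row["kvk_number"]), row["company_name"].strip()


# Small in-memory sample standing in for input.csv.
sample = io.StringIO(
    "kvk_number,company_name\n"
    "1234567,Alpha BV\n"
    "12.34.56.78,Beta NV\n"
    "KvK 87654321,Gamma BV\n"
)
rows = list(read_companies(sample, start_index=1, end_index=3))
```

With `start_index=1` and `end_index=3`, the first data row is skipped and the remaining two are returned, which mirrors the inclusive and exclusive semantics of `--start-index` and `--end-index`.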
The script:
- Stores results in an SQLite database with company information:
- Company name
- KvK number
- Has branches status (true/false/-1 for failed checks)
- Generates detailed logs in the logs directory
- Provides processing statistics at completion
- Automatic handling of various KvK number formats
- Persistent storage in SQLite database
- Failed result tracking (-1 in database)
- Ability to retry previously failed checks
- Detailed logging with timestamp-based filenames
- Progress bar with live statistics
The script creates separate log files for each component:
- scraper.log: Company scraping and branch detection logs
- database.log: Database operations and storage logs
- proxy.log: Proxy fetching, validation and rotation logs
All logs are stored in a timestamped directory:
logs/
  kvk_scraper_YYYYMMDD_HHMMSS_pidNUM/
    scraper.log
    database.log
    proxy.log
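The layout above could be produced roughly as follows. This is a sketch under assumptions: the helper names, the `kvk.<component>` logger names, and the log line format are illustrative and not taken from the project.

```python
import logging
import os
from datetime import datetime
from pathlib import Path


def make_log_dir(base="logs", now=None, pid=None):
    """Create and return logs/kvk_scraper_YYYYMMDD_HHMMSS_pidNUM/."""
    now = now or datetime.now()
    pid = os.getpid() if pid is None else pid
    log_dir = Path(base) / f"kvk_scraper_{now:%Y%m%d_%H%M%S}_pid{pid}"
    log_dir.mkdir(parents=True, exist_ok=True)
    return log_dir


def make_component_logger(log_dir, component):
    """Return a logger that writes to <log_dir>/<component>.log."""
    logger = logging.getLogger(f"kvk.{component}")
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(Path(log_dir) / f"{component}.log", encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

Including the process ID in the directory name keeps parallel runs started in the same second from writing to the same files. Note that calling `make_component_logger` twice for the same component would attach a second handler, so it is meant to be called once per run.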
Run all tests:
python -m pytest
Run specific test categories using markers:
pytest -m rate_limit # Only rate limit tests
pytest -m branches # Only branch detection tests
pytest -m phase2 # Only phase 2 processing tests
Run tests by name matching:
pytest -k "rate" # Run any test with "rate" in the name
pytest -k "TestPhase2" # Run Phase 2 processor tests
pytest -k "phase2" # Run all Phase 2 related tests
Test files:
- test_scraper.py: Tests for scraping and rate limit detection
- test_proxy_manager.py: Tests for proxy handling
- test_phase2.py: Tests for Phase 2 processing, Perplexity integration, and data models
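Custom markers such as rate_limit, branches, and phase2 are normally registered with pytest so that `pytest -m` does not emit unknown-marker warnings. How this project registers them is not shown; the sketch below assumes a conftest.py hook, and the marker descriptions are illustrative.

```python
# conftest.py (assumed location; the project may use pytest.ini instead)
MARKERS = {
    "rate_limit": "rate limit detection tests",
    "branches": "branch detection tests",
    "phase2": "Phase 2 processing tests",
}


def pytest_configure(config):
    """Register custom markers so `pytest -m <name>` runs without warnings."""
    for name, description in MARKERS.items():
        config.addinivalue_line("markers", f"{name}: {description}")
```

A test then opts in with a decorator such as `@pytest.mark.rate_limit`, and `pytest -m rate_limit` selects only those tests.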
- Company size determination through branch analysis
- Persistent SQLite storage of results
- Failed result tracking and retry capability
- Detailed logging system
- Progress tracking and statistics
- Phase 2: Perplexity integration for detailed company analysis
- Structured data extraction with confidence scoring
- Integration with Perplexity API for detailed company research
- Industry classification from predefined categories
- Employee count estimation in structured ranges
- Headquarters location identification
- Business description generation
- Confidence scoring for data quality assessment
- Separate database for enriched company data
The SQLite database currently stores:
- Company name
- KvK number
- Branch status (true/false/-1 for failed checks)
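The three-valued branch status can be illustrated with a small sqlite3 sketch. The table name, column names, and helper functions here are assumptions made for the example; only the true/false/-1 convention comes from the description above.

```python
import sqlite3

FAILED = -1  # sentinel stored when a branch check returned None


def to_db_status(has_branches):
    """Map True/False/None to 1/0/-1."""
    if has_branches is None:
        return FAILED
    return 1 if has_branches else 0


def open_db(path=":memory:"):
    """Open the database and create the (assumed) companies table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS companies ("
        "kvk_number TEXT PRIMARY KEY, "
        "company_name TEXT NOT NULL, "
        "has_branches INTEGER NOT NULL)"
    )
    return conn


def save_result(conn, kvk_number, company_name, has_branches):
    """Insert or overwrite the result for one company."""
    conn.execute(
        "INSERT OR REPLACE INTO companies VALUES (?, ?, ?)",
        (kvk_number, company_name, to_db_status(has_branches)),
    )
    conn.commit()


def failed_companies(conn):
    """Return the rows a --retry-failed run would need to reprocess."""
    return conn.execute(
        "SELECT kvk_number, company_name FROM companies "
        "WHERE has_branches = ? ORDER BY kvk_number",
        (FAILED,),
    ).fetchall()
```

Storing failures as a distinct value rather than skipping the row means a retry can be driven entirely from the database, without re-reading the input CSV.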
Extended company details database includes:
- KvK number (cross-reference key)
- Company name
- Industry classifications (1-3 categories)
- Employee range estimates
- Headquarters location
- Business description
- Confidence score (0.0-1.0)
- Timestamps for data tracking
Predefined industry categories: Technology & Software, Financial Services, Manufacturing, Healthcare & Pharmaceuticals, Energy & Utilities, Construction & Real Estate, Transportation & Logistics, Retail & E-commerce, Food & Beverages, Education, Professional Services, Media & Entertainment, Telecommunications, Agriculture, Tourism & Hospitality, Automotive, Chemical & Materials, Aerospace & Defense, Government & Public Sector, Non-profit
Employee ranges: 1-10, 11-50, 51-200, 201-500, 501-1000, 1001-5000, 5000+
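The constraints on the enriched record (one to three industries from the fixed list, an employee range from the fixed set, a confidence score between 0.0 and 1.0) can be expressed as a validating data model. This is only a sketch: the class name `CompanyDetails` and its field names are hypothetical, and timestamps are left out for brevity.

```python
from dataclasses import dataclass
from typing import List

INDUSTRIES = {
    "Technology & Software", "Financial Services", "Manufacturing",
    "Healthcare & Pharmaceuticals", "Energy & Utilities",
    "Construction & Real Estate", "Transportation & Logistics",
    "Retail & E-commerce", "Food & Beverages", "Education",
    "Professional Services", "Media & Entertainment", "Telecommunications",
    "Agriculture", "Tourism & Hospitality", "Automotive",
    "Chemical & Materials", "Aerospace & Defense",
    "Government & Public Sector", "Non-profit",
}
EMPLOYEE_RANGES = ("1-10", "11-50", "51-200", "201-500", "501-1000", "1001-5000", "5000+")


@dataclass
class CompanyDetails:
    kvk_number: str
    company_name: str
    industries: List[str]
    employee_range: str
    headquarters: str
    description: str
    confidence: float

    def __post_init__(self):
        # Reject records that fall outside the documented value sets.
        if not 1 <= len(self.industries) <= 3:
            raise ValueError("expected 1-3 industry categories")
        unknown = [name for name in self.industries if name not in INDUSTRIES]
        if unknown:
            raise ValueError(f"unknown industries: {unknown}")
        if self.employee_range not in EMPLOYEE_RANGES:
            raise ValueError(f"unknown employee range: {self.employee_range}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0.0 and 1.0")
```

Validating at construction time means a malformed model response is caught before it reaches the details database.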
- Processing speed is limited due to web scraping
- Failed checks (None results) are stored as -1 in the database
- Use --retry-failed to reprocess previously failed checks
- Logs are automatically stored in ./logs directory with timestamps