This repository contains Python-based scrapers for extracting company data from GoodFirms. These scrapers leverage the Crawlbase Crawling API to handle JavaScript rendering, CAPTCHA challenges, and anti-bot protections. The extracted data provides valuable insights into various businesses, including company names, locations, ratings, services, and profile details.
➡ Read the full blog here to learn more.
The GoodFirms Search Listings Scraper (goodfirms_serp_scraper.py) extracts structured company information from search listings, including:
- Company Name
- Location
- Service Category
- Rating
- Company Profile URL
It supports pagination, ensuring that multiple pages of search results can be scraped efficiently. Extracted data is stored in a structured JSON file.
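The listing-parsing step can be sketched roughly as follows. This is an illustrative sketch, not the script's exact code: the CSS selectors (`div.firm-wrapper`, `h3.firm-name`, and so on) are placeholder assumptions, since the real class names on GoodFirms may differ.

```python
from bs4 import BeautifulSoup


def parse_listings(html):
    """Parse one page of GoodFirms search listings into a list of dicts.

    NOTE: the selectors below are illustrative assumptions, not the
    exact ones used by goodfirms_serp_scraper.py.
    """
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for card in soup.select("div.firm-wrapper"):  # assumed listing-card class
        link = card.select_one("h3.firm-name a")
        results.append({
            "company_name": link.get_text(strip=True) if link else None,
            "profile_url": link.get("href") if link else None,
            "location": _text(card, "div.firm-location"),
            "service_category": _text(card, "div.firm-category"),
            "rating": _text(card, "span.rating-number"),
        })
    return results


def _text(card, selector):
    """Return stripped text for the first match of selector, or None."""
    tag = card.select_one(selector)
    return tag.get_text(strip=True) if tag else None
```

Pagination would then amount to a loop that fetches each `?page=N` URL through Crawlbase, calls `parse_listings` on every response, and finally dumps the accumulated list to a JSON file.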
The GoodFirms Company Profile Scraper (goodfirms_company_page_scraper.py) extracts detailed company data from individual profile pages, including:
- Company Name
- Description
- Hourly Rate
- Number of Employees
- Year Founded
- Services Offered
It takes profile URLs from the search listings scraper and extracts detailed business information, saving the data in a JSON file.
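The profile-parsing step follows the same pattern. Again, this is a hedged sketch: the selectors (`h1.company-name`, `span.hourly-rate`, etc.) are assumptions standing in for whatever the script actually targets on the live page.

```python
from bs4 import BeautifulSoup


def parse_profile(html):
    """Extract company details from a single GoodFirms profile page.

    NOTE: the selectors are illustrative assumptions, not the exact
    ones used by goodfirms_company_page_scraper.py.
    """
    soup = BeautifulSoup(html, "html.parser")

    def text(selector):
        tag = soup.select_one(selector)
        return tag.get_text(strip=True) if tag else None

    return {
        "company_name": text("h1.company-name"),
        "description": text("div.company-description"),
        "hourly_rate": text("span.hourly-rate"),
        "employees": text("span.employees"),
        "year_founded": text("span.founded"),
        "services": [li.get_text(strip=True)
                     for li in soup.select("ul.services li")],
    }
```

Running this over every `profile_url` collected by the search listings scraper yields the per-company records that get written to JSON.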
Ensure that Python is installed on your system. Check the version using:
# Use python3 if you're on Linux/macOS
```bash
python --version
```
Install the required dependencies:
```bash
pip install crawlbase beautifulsoup4
```
- Crawlbase – Handles JavaScript rendering and bypasses bot protections.
- BeautifulSoup – Parses and extracts structured data from HTML.
Get Your Crawlbase Access Token
- Sign up for Crawlbase here to get an API token.
- Replace "YOUR_CRAWLBASE_TOKEN" in the script with your Crawlbase Token.
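The token is what authenticates requests against the Crawling API. A minimal sketch of the client setup the scripts perform, assuming the standard Crawlbase Python library interface (the `fetch` helper is illustrative, not a function from the scripts):

```python
TOKEN = "YOUR_CRAWLBASE_TOKEN"  # replace with your actual Crawlbase token


def fetch(url, token=TOKEN):
    """Fetch a URL through the Crawlbase Crawling API and return the body.

    The import is done lazily here so this sketch fails with a clear
    message if the crawlbase package is not installed.
    """
    from crawlbase import CrawlingAPI
    api = CrawlingAPI({"token": token})
    response = api.get(url)
    # Crawlbase returns a dict-like response; "body" holds the rendered HTML
    return response["body"]
```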
Run the Scraper
```bash
# Use python3 if required (for Linux/macOS)
python SCRAPER_FILE_NAME.py
```
Replace "SCRAPER_FILE_NAME.py" with the actual script name (goodfirms_serp_scraper.py or goodfirms_company_page_scraper.py).
- Extend scrapers to extract additional company details like contact information and portfolios.
- Optimize the scraping process for better performance.
- Implement multi-threading for large-scale data extraction.
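The multi-threading idea above can be sketched with only the standard library. The `fetch` argument is an assumed stand-in for any callable that downloads and parses one URL (for example, a Crawlbase-backed fetcher); it is injected so the pattern stays testable without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_concurrently(urls, fetch, max_workers=5):
    """Apply `fetch` across many URLs in parallel threads.

    Keep max_workers modest: scraping through an API usually means
    staying within the provider's rate limits.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results align with urls
        return list(pool.map(fetch, urls))
```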
- Bypasses anti-bot protections using Crawlbase.
- Handles JavaScript-rendered content efficiently.
- Extracts structured company data for business analysis.