This project provides a high-throughput data pipeline for scraping and transforming company-level ESG (Environmental, Social, Governance) ratings from Sustainalytics' public website. It is designed to retrieve structured sustainability data across thousands of publicly listed companies and persist the results into a SQL Server database for analytical use.
This tool is optimized for enterprise users seeking periodic access to Sustainalytics ESG risk scores for:
- Internal reporting
- Risk and exposure modeling
- ESG-aligned portfolio construction
- Sustainability compliance workflows
The scraper leverages multiprocessing and multithreading to handle large-scale data extraction in parallel, enabling timely and reliable ingestion of ESG data.
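The concrete worker layout lives in `scraper/sustainalytics.py`; as a rough illustration of the pattern only (the function names here are hypothetical), each worker process handles one batch of URLs and fans out over it with a thread pool:

```python
# Illustrative process/thread layout only; not the repository's actual code.
import os
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

THREAD_COUNT = int(os.getenv("THREAD_COUNT", "8"))

def fetch_profile(url: str) -> dict:
    # Placeholder: the real step issues an HTTP request and parses the HTML.
    return {"url": url}

def scrape_batch(urls: list[str]) -> list[dict]:
    # Profile fetching is I/O-bound, so each process fans out with threads.
    with ThreadPoolExecutor(max_workers=THREAD_COUNT) as pool:
        return list(pool.map(fetch_profile, urls))

def scrape_all(batches: list[list[str]]) -> list[dict]:
    # One process per batch adds CPU-level parallelism for HTML parsing.
    with ProcessPoolExecutor() as pool:
        return [rec for batch in pool.map(scrape_batch, batches) for rec in batch]
```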
Primary Endpoint:
https://www.sustainalytics.com/sustapi/companyratings/getcompanyratings
The scraper:
- Iteratively posts to this endpoint to fetch batches of company profile URLs (an illustrative request sketch follows this list).
- Visits each profile URL and extracts ESG details using structured HTML parsing.
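A minimal sketch of the discovery request. The paging keys in the payload (`page`, `pageSize`) are assumptions about the endpoint's parameters, not documented fields; in practice the client also rotates user agents from `useragents.txt` and applies the retry settings described below.

```python
# Hypothetical discovery request; payload keys are assumed, not documented.
import requests

ENDPOINT = "https://www.sustainalytics.com/sustapi/companyratings/getcompanyratings"

def discover_page(page: int, page_size: int = 100) -> str:
    """Fetch one page of company profile links as an HTML fragment."""
    resp = requests.post(
        ENDPOINT,
        data={"page": page, "pageSize": page_size},  # assumed paging keys
        headers={"User-Agent": "Mozilla/5.0"},  # rotated via useragents.txt in practice
        timeout=30,
    )
    resp.raise_for_status()
    return resp.text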
Each record in the transformed output includes the following fields (an illustrative record follows the list):
- `name`: Company name
- `sustainalytics_ticker`: Unique Sustainalytics identifier
- `ticker`: Parsed ticker symbol
- `exchange`: Exchange suffix
- `industry`, `country`: Entity classification
- `desc`: Business description
- `ftemployees`: Number of full-time employees
- `risk_rating`: Quantitative ESG risk score
- `risk_assessment`: Qualitative risk rating category
- `ind_pos`, `ind_pos_total`: Industry rank and peer count
- `uni_pos`, `uni_pos_total`: Global rank and peer count
- `exposure_assessment`, `risk_management_assessment`: ESG component breakdown
- `last_update`: Most recent update date (parsed)
- `timestamp_created_utc`: Time of data ingestion
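For example, a single transformed record might look like this; every value below is invented for illustration:

```python
# Illustrative record shape only; values are made up, not real Sustainalytics data.
example_record = {
    "name": "Acme Corp",
    "sustainalytics_ticker": "NAS:ACME",
    "ticker": "ACME",
    "exchange": "NAS",
    "industry": "Software & Services",
    "country": "United States",
    "desc": "Acme Corp develops enterprise software.",
    "ftemployees": 12500,
    "risk_rating": 18.4,
    "risk_assessment": "Low Risk",
    "ind_pos": 42, "ind_pos_total": 950,
    "uni_pos": 1200, "uni_pos_total": 14000,
    "exposure_assessment": "Medium",
    "risk_management_assessment": "Strong",
    "last_update": "2024-01-15",
    "timestamp_created_utc": "2024-02-01T08:30:00Z",
}
```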
- **Ticker Discovery**: Posts to Sustainalytics' endpoint to discover available company profiles.
- **Profile Scraping**: Runs a combination of processes and threads to fetch ESG profile pages concurrently.
- **Data Extraction**: Uses `BeautifulSoup` to extract ESG ratings and company metadata from HTML elements.
- **Transformation**: Normalizes and flattens nested data into a single table using `pandas`.
- **Database Insertion**: Writes the data to the configured Microsoft SQL Server table via batch insertions. (The extraction, transformation, and insertion steps are sketched below.)
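A condensed sketch of those three steps. The CSS selectors, column names, and connection details are assumptions; the real logic lives in `scraper/sustainalytics.py`, `transformer/agent.py`, and `database/mssql.py`:

```python
# Hypothetical end-to-end sketch; selectors and SQL details are assumptions.
import os

import pandas as pd
import pyodbc
from bs4 import BeautifulSoup

def extract(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    # The CSS classes below are placeholders for whatever the profile page uses.
    return {
        "name": soup.select_one(".company-name").get_text(strip=True),
        "risk_rating": soup.select_one(".risk-rating-score").get_text(strip=True),
    }

def transform(records: list[dict]) -> pd.DataFrame:
    # Flatten the parsed records into a single table and coerce types.
    df = pd.DataFrame(records)
    df["risk_rating"] = pd.to_numeric(df["risk_rating"], errors="coerce")
    df["timestamp_created_utc"] = pd.Timestamp.now(tz="UTC")
    return df

def insert(df: pd.DataFrame) -> None:
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        f"SERVER={os.environ['MSSQL_SERVER']};"
        f"DATABASE={os.environ['MSSQL_DATABASE']};"
        f"UID={os.environ['MSSQL_USERNAME']};"
        f"PWD={os.environ['MSSQL_PASSWORD']}"
    )
    with conn:  # commits on successful exit
        cur = conn.cursor()
        cur.fast_executemany = True  # batch the INSERTs for throughput
        cols = ", ".join(df.columns)
        marks = ", ".join("?" * len(df.columns))
        cur.executemany(
            f"INSERT INTO {os.environ['OUTPUT_TABLE']} ({cols}) VALUES ({marks})",
            list(df.itertuples(index=False, name=None)),
        )
```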
```
sustainalytics-client-main/
├── main.py               # Pipeline runner
├── scraper/              # URL discovery and HTML data scraper
│   ├── sustainalytics.py
│   ├── request.py
│   └── useragents.txt
├── transformer/          # Data transformation logic
│   └── agent.py
├── database/             # MSSQL integration layer
│   └── mssql.py
├── config/               # Settings and logger setup
│   ├── settings.py
│   └── logger.py
├── .env.sample           # Environment variable template
├── Dockerfile            # Containerization support
└── requirements.txt      # Python dependencies
```
Duplicate `.env.sample` as `.env` and populate the following:
| Variable | Description |
|---|---|
| `THREAD_COUNT` | Number of threads per process |
| `LOG_LEVEL` | Log verbosity (e.g. `INFO`, `DEBUG`) |
| `OUTPUT_TABLE` | SQL Server table to store ESG data |
| `INSERTER_MAX_RETRIES` | Maximum database insertion attempts |
| `REQUEST_MAX_RETRIES`, `REQUEST_BACKOFF_FACTOR` | Scraper retry strategy |
| `MSSQL_SERVER`, `MSSQL_DATABASE`, `MSSQL_USERNAME`, `MSSQL_PASSWORD` | SQL Server connection details |
| `BRIGHTDATA_*` (optional) | Proxy configuration for external IP rotation |
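As a starting point, a populated `.env` might look like this (all values illustrative):

```
THREAD_COUNT=8
LOG_LEVEL=INFO
OUTPUT_TABLE=esg_ratings
INSERTER_MAX_RETRIES=3
REQUEST_MAX_RETRIES=5
REQUEST_BACKOFF_FACTOR=0.5
MSSQL_SERVER=sql.example.internal
MSSQL_DATABASE=esg
MSSQL_USERNAME=pipeline_user
MSSQL_PASSWORD=change-me
# Optional: BRIGHTDATA_* proxy settings for IP rotation
```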
Run with Docker:

```
docker build -t sustainalytics-client .
docker run --env-file .env sustainalytics-client
```

Or run locally:

```
pip install -r requirements.txt
python main.py
```
Execution logs provide details on:
- Number of discovered tickers
- Thread/process progress
- Records fetched and inserted
- Errors and retries (if applicable)
This software is provided under the MIT License. Access to Sustainalytics data is subject to their public content use policies.