Payer Plan Scrape Project

This project is designed to identify and ingest healthcare provider network information from various payer (insurance company) sources. The process is broken down into several steps, from discovering potential API endpoints to downloading and normalizing the data from those endpoints.

The project consists of two main data acquisition workflows for discovering API endpoints, and a final workflow for ingesting and processing the data from confirmed endpoints.

Project Workflow

The overall workflow is as follows:

  1. Discovery (Workflows A & B): Identify potential FHIR API endpoints using two different strategies.
  2. Curation (Manual Step): Analyze the results from the discovery phase to create a definitive list of working FHIR endpoints.
  3. Ingestion (Workflow C): Connect to the curated list of FHIR APIs, download the provider network data, and save it as a series of structured CSV files.

Workflow A: Organization-Based Search for Relevant Insurance Companies

This workflow discovers potential API endpoints by searching for company names.

  • Step10_create_target_list.py

    • Purpose: To create a master list of unique "Parent Organizations" to search for.
    • Input:
      • ./local_data/partc_source_data/2025_partc_star_ratings.csv
      • ./local_data/partc_source_data/MA_Contract_directory_2025_06.csv
    • Output: search_these.csv
    • Process: Reads the two source files, extracts all unique "Parent Organization" names, and saves them into a new CSV file (a sketch of this step appears after this list).
  • Step20_Serp_Scrape.py

    • Purpose: To perform a broad web search for each parent organization to find their provider directory API.
    • Input: search_these.csv
    • Output: A JSON file for each organization in ./local_data/scrape_results/.
    • Process: For each organization in the input file, it uses the SerpApi service to perform a Google search for {organization_name} Medicare Advantage "PROVIDER DIRECTORY" API "FHIR". The raw JSON search results are saved. Support for additional search strings that target the same information is planned. (A sketch of this search loop also appears after this list.)
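A minimal sketch of the Step10 target-list construction, assuming both source files expose a "Parent Organization" column and that pandas is available; the exact column header and output layout used by the real script may differ.

```python
# Hypothetical sketch of Step10: collect unique "Parent Organization" names
# from the two Part C source files and write them to search_these.csv.
import pandas as pd

star_ratings = pd.read_csv("./local_data/partc_source_data/2025_partc_star_ratings.csv")
contract_dir = pd.read_csv("./local_data/partc_source_data/MA_Contract_directory_2025_06.csv")

parents = (
    pd.concat([star_ratings["Parent Organization"], contract_dir["Parent Organization"]])
    .dropna()
    .drop_duplicates()
    .sort_values()
)
parents.to_frame(name="Parent Organization").to_csv("search_these.csv", index=False)
```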
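And a hedged sketch of the Step20 search loop, assuming the SerpApi Python client (the google-search-results package), a SERPAPI_KEY environment variable, and a "Parent Organization" column in search_these.csv; filename sanitization, rate limiting, and error handling are left out.

```python
# Sketch of Step20: one SerpApi Google search per parent organization,
# saving the raw JSON response under ./local_data/scrape_results/.
import json
import os
from pathlib import Path

import pandas as pd
from serpapi import GoogleSearch  # provided by the google-search-results package

out_dir = Path("./local_data/scrape_results")
out_dir.mkdir(parents=True, exist_ok=True)

for org in pd.read_csv("search_these.csv")["Parent Organization"]:
    query = f'{org} Medicare Advantage "PROVIDER DIRECTORY" API "FHIR"'
    results = GoogleSearch({"q": query, "api_key": os.environ["SERPAPI_KEY"]}).get_dict()
    # A real script would sanitize the organization name before using it as a filename.
    with open(out_dir / f"{org}.json", "w") as f:
        json.dump(results, f, indent=2)
```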

Workflow B: Domain-Based Site Search for FHIR Endpoints

This workflow discovers potential API endpoints by searching for company domain names found in contact email addresses.

  • Step30_extract_email_domains.py

    • Purpose: To create a list of unique company domain names.
    • Input: ./local_data/partc_source_data/MA_Contract_directory_2025_06.csv
    • Output: plan_domain_names.csv
    • Process: Reads the source file, finds all email addresses in the "Directory Contact Email" column, extracts the unique domain names (e.g., aetna.com), and saves them to a new CSV file (a sketch appears after this list).
  • Step40_domain_serp_scrape.py

    • Purpose: To perform a targeted, site-specific search for each domain to find their provider directory API.
    • Input: plan_domain_names.csv
    • Output: A JSON file for each domain in ./local_data/email_scrape_results/.
    • Process: For each domain, it uses SerpApi to perform a Google search limited to that domain: site:{domain} "PROVIDER DIRECTORY" "FHIR". This yields more targeted results than Workflow A.
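A rough sketch of the Workflow B steps, assuming the contact addresses live in a "Directory Contact Email" column with one address per cell; the output column name (domain) is illustrative, not confirmed from the script.

```python
# Step30 sketch: pull unique domain names out of the directory contact emails.
import pandas as pd

contract_dir = pd.read_csv("./local_data/partc_source_data/MA_Contract_directory_2025_06.csv")

domains = (
    contract_dir["Directory Contact Email"]
    .dropna()
    .str.strip()
    .str.split("@").str[-1]   # keep everything after the @, e.g. aetna.com
    .str.lower()
    .drop_duplicates()
    .sort_values()
)
domains.to_frame(name="domain").to_csv("plan_domain_names.csv", index=False)

# Step40 then issues one site-restricted query per domain, for example:
#   site:aetna.com "PROVIDER DIRECTORY" "FHIR"
# using the same SerpApi call pattern shown for Step20 above.
```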

Manual Step: Curating Endpoints

The JSON files generated by Step 20 and Step 40 must be manually reviewed to identify actual, working FHIR API base URLs. These URLs should be compiled into good_payer_endpoints.csv with the columns payer_name, payer_stub, and payer_provider_directory_fhir_url. This file is the critical input for the final ingestion step.
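For illustration only, a curated good_payer_endpoints.csv might look like the following; the payer name, stub, and URL are placeholders, not real endpoints.

```csv
payer_name,payer_stub,payer_provider_directory_fhir_url
Example Health Plan,example_health,https://fhir.example.com/provider-directory/R4
```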


Workflow C: Data Ingestion and Normalization

This workflow consumes the curated list of FHIR APIs and processes the data.

  • Step70_SlurpPayerProviderNetworks.py
    • Purpose: To connect to known FHIR endpoints, download all provider network data, and normalize it into a structured, relational format.
    • Input: good_payer_endpoints.csv
    • Output: A new directory for each payer in ./local_data/payer_slurp_results/{payer_stub}/, containing seven distinct CSV files:
      1. org_to_pr.csv: Links Organizations to PractitionerRoles.
      2. org.csv: Unique Organizations.
      3. location_to_pr.csv: Links Locations to PractitionerRoles.
      4. location.csv: Unique Locations.
      5. p_to_pr.csv: Links Practitioners to PractitionerRoles, including NPI.
      6. spec_to_pr.csv: Links Specialties to PractitionerRoles.
      7. tele_to_pr.csv: Links Telecom information to PractitionerRoles.
    • Process: For each payer endpoint, the script fetches all PractitionerRole resources, handling pagination. It then recursively follows the links within each PractitionerRole to fetch the associated Practitioner, Organization, and Location resources. Finally, it parses all the retrieved data and writes it out to the seven CSV files. The script includes a --test flag for development, which limits the number of records processed.
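A hedged sketch of the paging loop at the core of Step70, using only the requests library and the standard FHIR Bundle "next" link convention; the reference traversal, the --test limit, and the seven CSV writers are omitted, and the function name is illustrative.

```python
# Sketch: page through every PractitionerRole exposed by a FHIR provider directory.
import requests

def fetch_practitioner_roles(fhir_base_url: str, page_size: int = 100):
    """Yield each PractitionerRole resource, following Bundle "next" links."""
    url = f"{fhir_base_url.rstrip('/')}/PractitionerRole?_count={page_size}"
    while url:
        bundle = requests.get(
            url, headers={"Accept": "application/fhir+json"}, timeout=60
        ).json()
        for entry in bundle.get("entry", []):
            yield entry["resource"]
        # The server advertises the next page (if any) in the Bundle's link array.
        url = next(
            (link["url"] for link in bundle.get("link", []) if link.get("relation") == "next"),
            None,
        )

# Usage: roles = list(fetch_practitioner_roles(row["payer_provider_directory_fhir_url"]))
```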

Policies

Open Source Policy

We adhere to the CMS Open Source Policy. If you have any questions, just shoot us an email.

Security and Responsible Disclosure Policy

Submit a vulnerability: Vulnerability reports can be submitted through Bugcrowd. Reports may be submitted anonymously. If you share contact information, we will acknowledge receipt of your report within 3 business days.

Software Bill of Materials (SBOM)

A Software Bill of Materials (SBOM) is a formal record containing the details and supply chain relationships of various components used in building software.

In the spirit of Executive Order 14028 - Improving the Nation's Cybersecurity, an SBOM for this repository is provided here: https://github.com/DSACMS/npd_plan_scrape/network/dependencies.

For more information and resources about SBOMs, visit: https://www.cisa.gov/sbom.

Public domain

This project is in the public domain within the United States, and copyright and related rights in the work worldwide are waived through the CC0 1.0 Universal public domain dedication as indicated in LICENSE.

All contributions to this project will be released under the CC0 dedication. By submitting a pull request or issue, you are agreeing to comply with this waiver of copyright interest.
