Payer Plan Scrape Project

This project is designed to identify and ingest healthcare provider network information from various payer (insurance company) sources. The process is broken down into several steps, from discovering potential API endpoints to downloading and normalizing the data from those endpoints.

The project consists of two main data acquisition workflows for discovering API endpoints, and a final workflow for ingesting and processing the data from confirmed endpoints.

Project Workflow

The overall workflow is as follows:

  1. Discovery (Workflows A & B): Identify potential FHIR API endpoints using two different strategies.
  2. Curation (Manual Step): Analyze the results from the discovery phase to create a definitive list of working FHIR endpoints.
  3. Ingestion (Workflow C): Connect to the curated list of FHIR APIs, download the provider network data, and save it as a series of structured CSV files.

Workflow A: Organization-Based Search for Relevant Insurance Companies

This workflow discovers potential API endpoints by searching for company names.

  • Step10_create_target_list.py

    • Purpose: To create a master list of unique "Parent Organizations" to search for.
    • Input:
      • ./local_data/partc_source_data/2025_partc_star_ratings.csv
      • ./local_data/partc_source_data/MA_Contract_directory_2025_06.csv
    • Output: search_these.csv
    • Process: Reads the two source files, extracts all unique "Parent Organization" names, and saves them into a new CSV file (a sketch of this step appears after this list).
  • Step20_Serp_Scrape.py

    • Purpose: To perform a broad web search for each parent organization to find their provider directory API.
    • Input: search_these.csv
    • Output: A JSON file for each organization in ./local_data/scrape_results/.
    • Process: For each organization in the input file, it uses the SerpApi service to perform a Google search for {organization_name} Medicare Advantage "PROVIDER DIRECTORY" API "FHIR". The raw JSON search results are saved. Support for additional search strings that target the same information is planned. (A sketch of this search loop also appears after this list.)
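A minimal sketch of the Step10 target-list construction, assuming both source files expose a "Parent Organization" column and that pandas is available; the exact column header and output layout used by the real script may differ.

```python
# Hypothetical sketch of Step10: collect unique "Parent Organization" names
# from the two Part C source files and write them to search_these.csv.
import pandas as pd

star_ratings = pd.read_csv("./local_data/partc_source_data/2025_partc_star_ratings.csv")
contract_dir = pd.read_csv("./local_data/partc_source_data/MA_Contract_directory_2025_06.csv")

parents = (
    pd.concat([star_ratings["Parent Organization"], contract_dir["Parent Organization"]])
    .dropna()
    .drop_duplicates()
    .sort_values()
)
parents.to_frame(name="Parent Organization").to_csv("search_these.csv", index=False)
```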
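And a hedged sketch of the Step20 search loop, assuming the SerpApi Python client (the google-search-results package), a SERPAPI_KEY environment variable, and a "Parent Organization" column in search_these.csv; filename sanitization, rate limiting, and error handling are left out.

```python
# Sketch of Step20: one SerpApi Google search per parent organization,
# saving the raw JSON response under ./local_data/scrape_results/.
import json
import os
from pathlib import Path

import pandas as pd
from serpapi import GoogleSearch  # provided by the google-search-results package

out_dir = Path("./local_data/scrape_results")
out_dir.mkdir(parents=True, exist_ok=True)

for org in pd.read_csv("search_these.csv")["Parent Organization"]:
    query = f'{org} Medicare Advantage "PROVIDER DIRECTORY" API "FHIR"'
    results = GoogleSearch({"q": query, "api_key": os.environ["SERPAPI_KEY"]}).get_dict()
    # A real script would sanitize the organization name before using it as a filename.
    with open(out_dir / f"{org}.json", "w") as f:
        json.dump(results, f, indent=2)
```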

Workflow B: Domain-Based Site Search for FHIR Endpoints

This workflow discovers potential API endpoints by searching for company domain names found in contact email addresses.

  • Step30_extract_email_domains.py

    • Purpose: To create a list of unique company domain names.
    • Input: ./local_data/partc_source_data/MA_Contract_directory_2025_06.csv
    • Output: plan_domain_names.csv
    • Process: Reads the source file, finds all email addresses in the "Directory Contact Email" column, extracts the unique domain names (e.g., aetna.com), and saves them to a new CSV file (a sketch appears after this list).
  • Step40_domain_serp_scrape.py

    • Purpose: To perform a targeted, site-specific search for each domain to find their provider directory API.
    • Input: plan_domain_names.csv
    • Output: A JSON file for each domain in ./local_data/email_scrape_results/.
    • Process: For each domain, it uses SerpApi to perform a Google search limited to that domain: site:{domain} "PROVIDER DIRECTORY" "FHIR". This yields more targeted results than Workflow A.
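A rough sketch of the Workflow B steps, assuming the contact addresses live in a "Directory Contact Email" column with one address per cell; the output column name (domain) is illustrative, not confirmed from the script.

```python
# Step30 sketch: pull unique domain names out of the directory contact emails.
import pandas as pd

contract_dir = pd.read_csv("./local_data/partc_source_data/MA_Contract_directory_2025_06.csv")

domains = (
    contract_dir["Directory Contact Email"]
    .dropna()
    .str.strip()
    .str.split("@").str[-1]   # keep everything after the @, e.g. aetna.com
    .str.lower()
    .drop_duplicates()
    .sort_values()
)
domains.to_frame(name="domain").to_csv("plan_domain_names.csv", index=False)

# Step40 then issues one site-restricted query per domain, for example:
#   site:aetna.com "PROVIDER DIRECTORY" "FHIR"
# using the same SerpApi call pattern shown for Step20 above.
```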

Manual Step: Curating Endpoints

The JSON files generated by Step 20 and Step 40 must be manually reviewed to identify actual, working FHIR API base URLs. These URLs should be compiled into good_payer_endpoints.csv with the columns payer_name, payer_stub, and payer_provider_directory_fhir_url. This file is the critical input for the final ingestion step.
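For illustration only, a curated good_payer_endpoints.csv might look like the following; the payer name, stub, and URL are placeholders, not real endpoints.

```csv
payer_name,payer_stub,payer_provider_directory_fhir_url
Example Health Plan,example_health,https://fhir.example.com/provider-directory/R4
```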


Workflow C: Data Ingestion and Normalization

This workflow consumes the curated list of FHIR APIs and processes the data.

  • Step70_SlurpPayerProviderNetworks.py
    • Purpose: To connect to known FHIR endpoints, download all provider network data, and normalize it into a structured, relational format.
    • Input: good_payer_endpoints.csv
    • Output: A new directory for each payer in ./local_data/payer_slurp_results/{payer_stub}/, containing seven distinct CSV files:
      1. org_to_pr.csv: Links Organizations to PractitionerRoles.
      2. org.csv: Unique Organizations.
      3. location_to_pr.csv: Links Locations to PractitionerRoles.
      4. location.csv: Unique Locations.
      5. p_to_pr.csv: Links Practitioners to PractitionerRoles, including NPI.
      6. spec_to_pr.csv: Links Specialties to PractitionerRoles.
      7. tele_to_pr.csv: Links Telecom information to PractitionerRoles.
    • Process: For each payer endpoint, the script fetches all PractitionerRole resources, handling pagination. It then recursively follows the links within each PractitionerRole to fetch the associated Practitioner, Organization, and Location resources. Finally, it parses all the retrieved data and writes it out to the seven CSV files. The script includes a --test flag for development, which limits the number of records processed.
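A hedged sketch of the paging loop at the core of Step70, using only the requests library and the standard FHIR Bundle "next" link convention; the reference traversal, the --test limit, and the seven CSV writers are omitted, and the function name is illustrative.

```python
# Sketch: page through every PractitionerRole exposed by a FHIR provider directory.
import requests

def fetch_practitioner_roles(fhir_base_url: str, page_size: int = 100):
    """Yield each PractitionerRole resource, following Bundle "next" links."""
    url = f"{fhir_base_url.rstrip('/')}/PractitionerRole?_count={page_size}"
    while url:
        bundle = requests.get(
            url, headers={"Accept": "application/fhir+json"}, timeout=60
        ).json()
        for entry in bundle.get("entry", []):
            yield entry["resource"]
        # The server advertises the next page (if any) in the Bundle's link array.
        url = next(
            (link["url"] for link in bundle.get("link", []) if link.get("relation") == "next"),
            None,
        )

# Usage: roles = list(fetch_practitioner_roles(row["payer_provider_directory_fhir_url"]))
```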

Policies

Open Source Policy

We adhere to the CMS Open Source Policy. If you have any questions, just shoot us an email.

Security and Responsible Disclosure Policy

Submit a vulnerability: Vulnerability reports can be submitted through Bugcrowd. Reports may be submitted anonymously. If you share contact information, we will acknowledge receipt of your report within 3 business days.

Software Bill of Materials (SBOM)

A Software Bill of Materials (SBOM) is a formal record containing the details and supply chain relationships of various components used in building software.

In the spirit of Executive Order 14028 - Improving the Nation's Cybersecurity, an SBOM for this repository is provided here: https://github.com/DSACMS/npd_plan_scrape/network/dependencies.

For more information and resources about SBOMs, visit: https://www.cisa.gov/sbom.

Public domain

This project is in the public domain within the United States, and copyright and related rights in the work worldwide are waived through the CC0 1.0 Universal public domain dedication as indicated in LICENSE.

All contributions to this project will be released under the CC0 dedication. By submitting a pull request or issue, you are agreeing to comply with this waiver of copyright interest.
