This project is designed to identify and ingest healthcare provider network information from various payer (insurance company) sources. The process is broken down into several steps, from discovering potential API endpoints to downloading and normalizing the data from those endpoints.
The project consists of two main data acquisition workflows for discovering API endpoints, and a final workflow for ingesting and processing the data from confirmed endpoints.
The overall workflow is as follows:
- Discovery (Workflows A & B): Identify potential FHIR API endpoints using two different strategies.
- Curation (Manual Step): Analyze the results from the discovery phase to create a definitive list of working FHIR endpoints.
- Ingestion (Workflow C): Connect to the curated list of FHIR APIs, download the provider network data, and save it as a series of structured CSV files.
This workflow (Workflow A) discovers potential API endpoints by searching for company names.
- `Step10_create_target_list.py`
  - Purpose: To create a master list of unique "Parent Organization" names to search for.
  - Input: `./local_data/partc_source_data/2025_partc_star_ratings.csv` and `./local_data/partc_source_data/MA_Contract_directory_2025_06.csv`
  - Output: `search_these.csv`
  - Process: Reads the two source files, extracts all unique "Parent Organization" names, and saves them into a new CSV file (see the sketch below).
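For reference, the Step10 logic amounts to a small pandas routine. The sketch below is an approximation, assuming both source files carry a "Parent Organization" column; the output column header is illustrative, not necessarily the script's exact choice.

```python
# Minimal sketch of Step10: gather unique "Parent Organization" names.
# The output column name ("parent_organization") is an assumption.
import pandas as pd

SOURCES = [
    "./local_data/partc_source_data/2025_partc_star_ratings.csv",
    "./local_data/partc_source_data/MA_Contract_directory_2025_06.csv",
]


def build_target_list(output_path: str = "search_these.csv") -> None:
    names: set[str] = set()
    for path in SOURCES:
        df = pd.read_csv(path, dtype=str)
        # Collect every non-empty Parent Organization value from this file.
        names.update(df["Parent Organization"].dropna().str.strip())
    pd.DataFrame(sorted(names), columns=["parent_organization"]).to_csv(
        output_path, index=False
    )


if __name__ == "__main__":
    build_target_list()
```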
- `Step20_Serp_Scrape.py`
  - Purpose: To perform a broad web search for each parent organization to find its provider directory API.
  - Input: `search_these.csv`
  - Output: A JSON file for each organization in `./local_data/scrape_results/`
  - Process: For each organization in the input file, it uses the SERPapi service to perform the Google search `{organization_name} Medicare Advantage "PROVIDER DIRECTORY" API "FHIR"` and saves the raw JSON search results (sketched below). Eventually this step will support alternative search strings that target the same information.
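A hedged sketch of the Step20 loop follows. It assumes the `google-search-results` (SERPapi) Python client, a `SERPAPI_API_KEY` environment variable, and an input column named `parent_organization`; all of these are assumptions about the actual script.

```python
# Hedged sketch of Step20: one SERPapi query per parent organization,
# raw JSON saved to ./local_data/scrape_results/. Column and file names
# are illustrative.
import csv
import json
import os
from pathlib import Path

from serpapi import GoogleSearch  # provided by the google-search-results package

OUT_DIR = Path("./local_data/scrape_results")


def scrape_organizations(input_csv: str = "search_these.csv") -> None:
    OUT_DIR.mkdir(parents=True, exist_ok=True)
    with open(input_csv, newline="") as handle:
        for row in csv.DictReader(handle):
            org = row["parent_organization"]  # assumed column name
            query = f'{org} Medicare Advantage "PROVIDER DIRECTORY" API "FHIR"'
            results = GoogleSearch(
                {"q": query, "api_key": os.environ["SERPAPI_API_KEY"]}
            ).get_dict()
            # One JSON file per organization, named from a filesystem-safe slug.
            slug = org.lower().replace(" ", "_").replace("/", "_")
            (OUT_DIR / f"{slug}.json").write_text(json.dumps(results, indent=2))


if __name__ == "__main__":
    scrape_organizations()
```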
This workflow (Workflow B) discovers potential API endpoints by searching for company domain names found in contact email addresses.
- `Step30_extract_email_domains.py`
  - Purpose: To create a list of unique company domain names.
  - Input: `./local_data/partc_source_data/MA_Contract_directory_2025_06.csv`
  - Output: `plan_domain_names.csv`
  - Process: Reads the source file, finds all email addresses in the "Directory Contact Email" column, extracts the unique domain names (e.g., `aetna.com`), and saves them to a new CSV file (see the sketch below).
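The domain extraction in Step30 can be sketched as follows, assuming the source column is literally named "Directory Contact Email" as described above; the output column name is an assumption.

```python
# Minimal sketch of Step30: pull unique email domains from the contact column.
import pandas as pd

SOURCE = "./local_data/partc_source_data/MA_Contract_directory_2025_06.csv"


def extract_domains(output_path: str = "plan_domain_names.csv") -> None:
    df = pd.read_csv(SOURCE, dtype=str)
    emails = df["Directory Contact Email"].dropna().str.strip()
    # Keep values that look like email addresses, then take everything
    # after the "@" and lowercase it (e.g. "aetna.com").
    domains = (
        emails[emails.str.contains("@")]
        .str.split("@")
        .str[-1]
        .str.lower()
        .unique()
    )
    # The output column name ("domain") is an assumption.
    pd.DataFrame(sorted(domains), columns=["domain"]).to_csv(output_path, index=False)


if __name__ == "__main__":
    extract_domains()
```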
- `Step40_domain_serp_scrape.py`
  - Purpose: To perform a targeted, site-specific search for each domain to find its provider directory API.
  - Input: `plan_domain_names.csv`
  - Output: A JSON file for each domain in `./local_data/email_scrape_results/`
  - Process: For each domain, it uses SERPapi to perform a Google search limited to that domain: `site:{domain} "PROVIDER DIRECTORY" "FHIR"` (sketched below). This yields more targeted results than Workflow A.
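The Step40 loop mirrors Step20; only the query changes, using Google's `site:` operator. A sketch under the same assumptions (SERPapi client, illustrative column and file names):

```python
# Hedged sketch of Step40: a site-restricted SERPapi query per domain.
import csv
import json
import os
from pathlib import Path

from serpapi import GoogleSearch

OUT_DIR = Path("./local_data/email_scrape_results")


def scrape_domains(input_csv: str = "plan_domain_names.csv") -> None:
    OUT_DIR.mkdir(parents=True, exist_ok=True)
    with open(input_csv, newline="") as handle:
        for row in csv.DictReader(handle):
            domain = row["domain"]  # assumed column name
            # Restrict the search to the payer's own site.
            query = f'site:{domain} "PROVIDER DIRECTORY" "FHIR"'
            results = GoogleSearch(
                {"q": query, "api_key": os.environ["SERPAPI_API_KEY"]}
            ).get_dict()
            (OUT_DIR / f"{domain}.json").write_text(json.dumps(results, indent=2))


if __name__ == "__main__":
    scrape_domains()
```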
The JSON files generated by Step 20 and Step 40 must be manually reviewed to identify actual, working FHIR API base URLs. These URLs should be compiled into `good_payer_endpoints.csv` with the columns `payer_name`, `payer_stub`, and `payer_provider_directory_fhir_url`. This file is the critical input for the final ingestion step.
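An illustrative `good_payer_endpoints.csv` might look like the following; the row values are placeholders, and only the column names come from the description above.

```csv
payer_name,payer_stub,payer_provider_directory_fhir_url
Example Health Plan,example_plan,https://fhir.example.com/ProviderDirectory/R4
```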
This workflow (Workflow C) consumes the curated list of FHIR APIs and processes the data.
- `Step70_SlurpPayerProviderNetworks.py`
  - Purpose: To connect to known FHIR endpoints, download all provider network data, and normalize it into a structured, relational format.
  - Input: `good_payer_endpoints.csv`
  - Output: A new directory for each payer in `./local_data/payer_slurp_results/{payer_stub}/`, containing seven distinct CSV files:
    - `org_to_pr.csv`: Links Organizations to PractitionerRoles.
    - `org.csv`: Unique Organizations.
    - `location_to_pr.csv`: Links Locations to PractitionerRoles.
    - `location.csv`: Unique Locations.
    - `p_to_pr.csv`: Links Practitioners to PractitionerRoles, including NPI.
    - `spec_to_pr.csv`: Links Specialties to PractitionerRoles.
    - `tele_to_pr.csv`: Links Telecom information to PractitionerRoles.
  - Process: For each payer endpoint, the script fetches all `PractitionerRole` resources, handling pagination. It then recursively follows the links within each `PractitionerRole` to fetch the associated `Practitioner`, `Organization`, and `Location` resources. Finally, it parses all the retrieved data and writes it out to the seven CSV files. The script includes a `--test` flag for development, which limits the number of records processed (see the pagination sketch below).
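At its core, the ingestion is a standard FHIR Bundle paging loop. The sketch below shows how the `PractitionerRole` fetch might work; the function name, request headers, and limit handling are illustrative rather than the script's actual implementation.

```python
# Hedged sketch of the Step70 paging loop: follow the Bundle's "next" link
# until the server stops returning one.
from typing import Iterator, Optional

import requests


def fetch_all_practitioner_roles(
    fhir_base_url: str, limit: Optional[int] = None
) -> Iterator[dict]:
    """Yield every PractitionerRole resource from a FHIR provider directory."""
    url = f"{fhir_base_url.rstrip('/')}/PractitionerRole"
    fetched = 0
    while url:
        bundle = requests.get(
            url, headers={"Accept": "application/fhir+json"}, timeout=60
        ).json()
        for entry in bundle.get("entry", []):
            yield entry["resource"]
            fetched += 1
            if limit is not None and fetched >= limit:  # mirrors the --test idea
                return
        # A searchset Bundle advertises the next page via its "link" array.
        url = next(
            (link["url"] for link in bundle.get("link", []) if link.get("relation") == "next"),
            None,
        )
```

Each yielded `PractitionerRole` can then be walked for its `practitioner`, `organization`, and `location` references before the results are flattened into the seven CSV files listed above.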
We adhere to the CMS Open Source Policy. If you have any questions, just shoot us an email.
Submit a vulnerability: Vulnerability reports can be submitted through Bugcrowd. Reports may be submitted anonymously. If you share contact information, we will acknowledge receipt of your report within 3 business days.
A Software Bill of Materials (SBOM) is a formal record containing the details and supply chain relationships of various components used in building software.
In the spirit of Executive Order 14028 - Improving the Nation's Cybersecurity, an SBOM for this repository is provided here: https://github.com/{{ cookiecutter.project_org }}/{{ cookiecutter.project_repo_name }}/network/dependencies.
For more information and resources about SBOMs, visit: https://www.cisa.gov/sbom.
This project is in the public domain within the United States, and copyright and related rights in the work worldwide are waived through the CC0 1.0 Universal public domain dedication as indicated in LICENSE.
All contributions to this project will be released under the CC0 dedication. By submitting a pull request or issue, you are agreeing to comply with this waiver of copyright interest.