Tools for unifying personal electronic health record (EHR) exports into a local SQLite database and exploring them with a Streamlit dashboard. The repository contains no protected health information; the ingest pipeline expects you to provide your own CCD exports. Portions of the scaffolding were drafted with generative AI and reviewed by human maintainers - see the full AI disclosure for details.
- Python 3.12 or newer
- SQLite (bundled with Python)
- Streamlit-compatible browser (Chrome, Edge, Firefox, Safari)
git clone <repo-url>
cd Health-Records-Collection
python -m venv .venv
.venv\Scripts\Activate.ps1 # Windows PowerShell
# or: source .venv/bin/activate # macOS/Linux
pip install --upgrade pip
pip install -r requirements.txt

- Drop each CCD ZIP export into data/raw/.
- Run the ingestion workflow:
  python ingest.py
  This creates or refreshes db/health_records.db, extracts ZIP contents into data/parsed/, and populates all supported tables.
  - Add --log-level debug to surface detailed troubleshooting messages while you iterate:
    python ingest.py --log-level debug
  - To capture logs without printing patient identifiers to the console, direct output to a file:
    python ingest.py --log-level info --log-file logs/ingest.log
    Debug logs include richer context, so avoid enabling them on shared systems.
- Launch the dashboard:
  streamlit run frontend/app.py
  Streamlit opens at http://localhost:8501 with an encounter overview, table browser, and SQL scratchpad.
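
To confirm that ingestion populated the database before opening the dashboard, a quick row count per table can help. The following is a minimal sketch using only the standard library; table_row_counts is a hypothetical helper for illustration and is not part of this repository.

```python
import sqlite3


def table_row_counts(conn: sqlite3.Connection) -> dict[str, int]:
    """Return {table_name: row_count} for every user table in the database."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
            "ORDER BY name"
        )
    ]
    # Names come straight from sqlite_master, so double-quoting them is enough.
    return {
        name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
        for name in names
    }
```

One way to use it is to open the database read-only, for example with sqlite3.connect("file:db/health_records.db?mode=ro", uri=True), and print the returned dictionary. Opening with mode=ro avoids creating an empty file if ingestion has not run yet.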
Ingestion pipeline (ingest.py)
- Unzips CCD packages from data/raw/ into data/parsed/ (skipping extracts that already exist).
- Parses XML with lxml using modular parsers in parsers/ for patients, encounters, allergies, conditions, medications, labs, procedures, vitals, immunizations, progress notes, and insurance coverage.
- Records file-level provenance in the data_source table (original filename, archive, SHA-256 hash, creation time, repository ID, and author institution pulled from XDMMETADATA.XML) and threads the resulting identifier through every downstream insert.
- Normalizes providers, deduplicates medications and immunizations, and invokes service modules in services/ to load data into SQLite.
- Applies schema migrations on the fly via db/schema.py to keep older databases compatible.
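
The unzip-and-hash steps above can be sketched as follows. This is an illustration under stated assumptions, not the code in ingest.py: the function names are hypothetical, and naming the extract directory after the archive stem is a guess at the layout.

```python
import hashlib
import zipfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives are never loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def extract_if_needed(archive: Path, parsed_root: Path) -> tuple[Path, bool]:
    """Extract archive into parsed_root/<archive stem> unless it already exists.

    Returns the target directory and whether extraction actually happened.
    The directory naming is an assumption made for this sketch.
    """
    target = parsed_root / archive.stem
    if target.exists():
        return target, False  # skip extracts that already exist
    target.mkdir(parents=True)
    with zipfile.ZipFile(archive) as bundle:
        bundle.extractall(target)
    return target, True
```

Hashing the archive before extraction gives a stable identifier that stays the same even if the file is renamed, which is what makes hash-based duplicate detection possible.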
Streamlit dashboard (frontend/)
- views.py renders an Encounter Overview with expandable visit summaries, including diagnoses and medications.
- Sidebar controls let you pick tables to preview using reusable widgets in ui_components.py.
- A SQL query box allows ad-hoc exploration; results render with native Streamlit dataframes.
- Connection utilities in db_utils.py keep the UI responsive with row limits and read-only access.
- XML files are rendered using the HL7 CDA Core Stylesheet, automatically updated weekly from the official repository with proper attribution.
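
Read-only access with a row limit, as described for db_utils.py, could look roughly like the sketch below. The function names and the default limit of 500 are assumptions for illustration; the actual utilities may differ.

```python
import sqlite3
from pathlib import Path


def open_read_only(db_path: str) -> sqlite3.Connection:
    """Open the database so SQLite itself rejects any write."""
    # as_uri() percent-encodes the path; mode=ro also refuses to create
    # a new file when the database is missing.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)


def run_limited(conn: sqlite3.Connection, sql: str, max_rows: int = 500):
    """Run an ad-hoc query and fetch at most max_rows rows."""
    cursor = conn.execute(sql)
    # description is None for statements that return no result set.
    columns = [col[0] for col in cursor.description or []]
    return columns, cursor.fetchmany(max_rows)
```

Capping rows with fetchmany rather than rewriting the user's SQL keeps the scratchpad query untouched while still bounding how much data reaches the UI.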
Schema & services (schema.sql, services/)
- schema.sql defines core tables for patients, providers, encounters, medications, lab results, allergies, insurance coverage, conditions (with codes), procedures, vitals, immunizations, attachments, progress notes, and the ingested_archive registry used to track archive hashes and ingestion counts, each linking back to enriched data_source metadata (now including source_archive_id foreign keys to ingested_archive).
- Service modules encapsulate insert logic, deduplication, and foreign key wiring for each domain.
- services/data_sources.py manages provenance rows so other modules can reference a shared data_source_id, while services/archives.py records archive hashes so duplicate uploads can be flagged safely.
- db/schema.py backfills missing columns, normalizes provider records, and adds protective indexes.
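
Backfilling a missing column on an older database is commonly done by inspecting the table and altering it only when needed. The sketch below shows that general technique; ensure_column is a hypothetical helper and is not taken from db/schema.py.

```python
import sqlite3


def ensure_column(
    conn: sqlite3.Connection, table: str, column: str, definition: str
) -> bool:
    """Add column to table if it is missing; return True when it was added."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    existing = {row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')}
    if column in existing:
        return False
    conn.execute(f'ALTER TABLE "{table}" ADD COLUMN "{column}" {definition}')
    return True
```

For example, ensure_column(conn, "data_source", "source_archive_id", "INTEGER") would be a no-op on a current database and add the column on an older one. The check makes the migration safe to run on every startup.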
Use the Settings view in the Streamlit sidebar to update the raw, parsed, and database paths. Overrides are saved to user/settings.yaml and the app automatically reloads after changes.
- CDA Rendering
  - This project uses the HL7 CDA Core Stylesheet for rendering CDA XML documents, which is maintained in a separate repository and automatically updated via GitHub Actions. The stylesheet files are included under the Apache 2.0 license with proper attribution.
- Color Palette - Coolors.co
data/             Raw ZIP exports (`raw/`) and extracted XML (`parsed/`)
db/               SQLite artifacts (`health_records.db`) and schema helpers
frontend/         Streamlit application entry point, views, and utilities
parsers/          CCD XML parsers grouped by domain
services/         Persistence helpers for each domain table
tests/            Pytest suite covering parsers, services, schema, and ingest flow
ingest.py         Command-line ingestion workflow
schema.sql        Canonical database definition
requirements.txt  Locked Python dependencies
- Update frontend/config.yaml to change the dashboard title, layout, database path, or default row limits.
- Extend parsing coverage by adding new modules in parsers/ and wiring them into ingest.py.
- Modify or append tables by editing schema.sql and enhancing db/schema.py to enforce migrations.
- Regenerate the database at any time by deleting db/health_records.db and rerunning python ingest.py.
- Control ingestion verbosity per run with --log-level {error,warning,info,debug} and optionally persist output via --log-file path/to/logs.txt.
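
For contributors extending the command line, the two logging flags above could be wired with argparse and the standard logging module roughly as follows. This is a sketch, not the parser in ingest.py: the default level of info, the logger name, and the message format are all assumptions.

```python
import argparse
import logging
from pathlib import Path
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Ingestion logging sketch")
    parser.add_argument(
        "--log-level",
        choices=["error", "warning", "info", "debug"],
        default="info",  # assumed default, not confirmed by the project
    )
    parser.add_argument("--log-file", default=None)
    return parser


def configure_logging(level: str, log_file: Optional[str]) -> logging.Logger:
    """Send records to log_file when given, otherwise to the console."""
    logger = logging.getLogger("ingest_sketch")  # hypothetical logger name
    logger.setLevel(getattr(logging, level.upper()))
    logger.handlers.clear()
    if log_file:
        # Create the parent directory so paths like logs/ingest.log work.
        Path(log_file).parent.mkdir(parents=True, exist_ok=True)
        handler: logging.Handler = logging.FileHandler(log_file)
    else:
        handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.propagate = False  # keep records out of the root logger's console output
    return logger
```

Routing records to exactly one handler is what keeps patient identifiers off the console when a log file is requested.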
- Run the automated tests with:
  pytest
- The project targets Python 3.12; please keep new dependencies pinned in requirements.txt.
- Follow the contributor guidelines in CONTRIBUTING.md and report security concerns per SECURITY.md.
MIT License. See LICENSE for full terms.