feat(nlp): add a variety of --cohort-* args to filter notes #434

mikix · 2025-08-19T14:51:07Z

Three main new arguments:

* --cohort-csv: a csv with a column like patient_id or note_ref
* --cohort-anon-csv: same but with anonymized IDs
* --cohort-athena-table: same but points at an Athena table

To support the Athena option, we have some other new args:

* --athena-workgroup: specify the workgroup to use
* --athena-database: specify the database to use
* --allow-large-cohort: if the table is gigantic, use it anyway (this is here because we have a typo-guard in there - if you accidentally point at the base observation table, we're gonna stop you from downloading a terabyte of data)
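
For illustration only, the typo-guard idea could look roughly like the sketch below. The limit value, the error class, and the function name are all hypothetical and not taken from this PR; the only thing drawn from the description is that an oversized table is refused unless --allow-large-cohort is given.

```python
# Hypothetical limit; the real threshold used by the ETL is not stated here.
MAX_COHORT_ROWS = 20_000


class CohortTooLargeError(Exception):
    """Raised when a cohort table looks too big to be intentional."""


def check_cohort_size(row_count, allow_large_cohort=False, limit=MAX_COHORT_ROWS):
    """Refuse suspiciously large cohorts unless the user opted in."""
    if row_count > limit and not allow_large_cohort:
        raise CohortTooLargeError(
            f"Cohort has {row_count:,} rows (limit {limit:,}). "
            "Pass --allow-large-cohort if this is intentional."
        )
    return row_count
```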

You can also specify the database in the table arg itself, separated from the table name by a period (database.table). The workgroup can be specified via env var or CLI.
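
A minimal sketch of how those two lookups might be resolved, under a few assumptions that this description does not confirm: that a database given in the table arg wins over --athena-database, that the CLI value wins over the env var, and that the env var is AWS_ATHENA_WORK_GROUP (the name that appears in the compose.yaml change). The function names are hypothetical.

```python
import os


def split_table_arg(table_arg, default_database=None):
    """Split 'database.table' into (database, table).

    Assumption: a database embedded in the table arg takes priority over
    the separately supplied default (e.g. from --athena-database).
    """
    database, _, table = table_arg.rpartition(".")
    database = database or default_database
    if not database or not table:
        raise ValueError(f"No Athena database given for table {table_arg!r}")
    return database, table


def resolve_workgroup(cli_value=None, environ=None):
    """Pick the workgroup from the CLI value, else from the environment.

    Assumption: the CLI value wins when both are present.
    """
    environ = os.environ if environ is None else environ
    return cli_value or environ.get("AWS_ATHENA_WORK_GROUP")
```

With these helpers, `split_table_arg("mydb.cohort")` and `split_table_arg("cohort", "mydb")` both resolve to the same database and table pair.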

If we find a docref/dxreport ID/ref column, we'll use that. Otherwise, we'll use a patient ID column and grab all notes for those patients.
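
As a rough sketch of that column preference (not the code in this PR): look for a note-level column first and fall back to a patient-level one. Only patient_id and note_ref are named in the description; the other column names and both function names below are hypothetical.

```python
import csv
import io

# Only "note_ref" and "patient_id" come from the PR description;
# the remaining names are illustrative guesses.
NOTE_COLUMNS = ("note_ref", "docref_id", "dxreport_id")
PATIENT_COLUMNS = ("patient_id", "subject_ref")


def pick_cohort_column(fieldnames):
    """Prefer a note-level column; otherwise fall back to a patient column."""
    for kind, candidates in (("note", NOTE_COLUMNS), ("patient", PATIENT_COLUMNS)):
        for name in candidates:
            if name in fieldnames:
                return kind, name
    raise ValueError("No note or patient ID column found in cohort")


def read_cohort(csv_text):
    """Return ('note' or 'patient', set of IDs) from CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    kind, column = pick_cohort_column(reader.fieldnames or [])
    ids = {row[column] for row in reader if row[column]}
    return kind, ids
```

A "patient" result would then mean "grab all notes for these patients", while a "note" result selects exactly the listed notes.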

This cohort filtering replaces, rather than augments, the previous default filtering of "final" status notes (i.e. skipping draft or superseded notes). If the user is specifying the IDs manually for us, they must know what they want, so we don't need to do the status check for them.
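
To make the "replaces, not augments" behavior concrete, here is a simplified sketch. It uses plain dicts with a single status field rather than real FHIR resources (which keep status in resource-specific fields), and the function name is hypothetical.

```python
def select_notes(notes, cohort_note_ids=None):
    """Pick which notes to process.

    With a cohort, membership is the only test and status is ignored.
    Without one, fall back to the default of keeping only "final" notes.
    """
    if cohort_note_ids is not None:
        return [note for note in notes if note["id"] in cohort_note_ids]
    return [note for note in notes if note.get("status") == "final"]
```

So a draft note that is explicitly listed in the cohort is still processed, while the same note would be skipped when no cohort is given.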

Checklist

Consider if documentation (like in docs/) needs to be updated
Consider if tests should be added

github-actions · 2025-08-19T17:29:59Z

☂️ Python Coverage

current status: ✅

Overall Coverage

Lines	Covered	Coverage	Threshold	Status
4120	4078	99%	98%	🟢

New Files

No new covered files...

Modified Files

File	Coverage	Status
cumulus_etl/cli_utils.py	100%	🟢
cumulus_etl/errors.py	100%	🟢
cumulus_etl/etl/cli.py	100%	🟢
cumulus_etl/etl/config.py	100%	🟢
cumulus_etl/etl/nlp/cli.py	100%	🟢
cumulus_etl/etl/pipeline.py	100%	🟢
cumulus_etl/etl/tasks/base.py	100%	🟢
cumulus_etl/etl/tasks/nlp_task.py	100%	🟢
cumulus_etl/etl/tasks/task_factory.py	100%	🟢
cumulus_etl/export/cli.py	100%	🟢
cumulus_etl/nlp/utils.py	100%	🟢
cumulus_etl/upload_notes/cli.py	100%	🟢
cumulus_etl/upload_notes/downloader.py	100%	🟢
cumulus_etl/upload_notes/selector.py	100%	🟢
TOTAL	100%	🟢

updated for commit: 5b27f89 by action🐍

dogversioning · 2025-08-19T18:02:33Z

compose.yaml

+      - AWS_ATHENA_WORK_GROUP
      - AWS_DEFAULT_PROFILE
+      - AWS_DEFAULT_REGION
      - AWS_PROFILE
+      - AWS_REGION


just noting here that the library uses different names for these same fields - we might want to natter about harmonizing on one (probably this set)

mikix force-pushed the mikix/cohort-args branch from 9bfeb52 to 94c3e04 Compare August 19, 2025 17:15

mikix force-pushed the mikix/cohort-args branch from 94c3e04 to cb3c027 Compare August 19, 2025 17:51

mikix marked this pull request as ready for review August 19, 2025 17:52

mikix changed the title ~~WIP: feat(nlp): add a variety of --cohort-* args to filter notes~~ feat(nlp): add a variety of --cohort-* args to filter notes Aug 19, 2025

mikix force-pushed the mikix/cohort-args branch from cb3c027 to 5b27f89 Compare August 19, 2025 17:53

dogversioning approved these changes Aug 19, 2025

mikix merged commit 3cd9947 into main Aug 19, 2025
3 checks passed

mikix deleted the mikix/cohort-args branch August 19, 2025 18:34
