Contributor

@mikix mikix commented Aug 19, 2025

Three main new arguments:

  • --cohort-csv: a CSV file with a column like patient_id or note_ref
  • --cohort-anon-csv: same but with anonymized IDs
  • --cohort-athena-table: same but points at an Athena table

To support the Athena option, we have some other new args:

  • --athena-workgroup: specify the workgroup to use
  • --athena-database: specify the database to use
  • --allow-large-cohort: if the table is gigantic, use it anyway (this is here because we have a typo-guard in there - if you accidentally point at the base observation table, we're gonna stop you from downloading a terabyte of data)
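
For reviewers, the typo-guard idea above could be sketched roughly like this. This is only an illustration, not the code in this PR: the row limit, the exception name, and `check_cohort_size` are all hypothetical, and the PR does not state the real threshold.

```python
# Hypothetical limit: the real threshold used by the typo-guard is not stated here.
MAX_COHORT_ROWS = 1_000_000


class CohortTooLargeError(Exception):
    """Raised when a cohort table looks too big to be intentional."""


def check_cohort_size(row_count: int, allow_large_cohort: bool = False) -> None:
    """Refuse suspiciously large cohort tables unless the user opted in."""
    if row_count > MAX_COHORT_ROWS and not allow_large_cohort:
        raise CohortTooLargeError(
            f"Cohort table has {row_count:,} rows (limit {MAX_COHORT_ROWS:,}). "
            "If that is intentional, pass --allow-large-cohort."
        )
```

The point of the opt-in flag is that a typo fails fast with a clear message, while a deliberately large cohort still works.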

You can specify the database in the table arg with a period. And the workgroup can be specified via env var or CLI.
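
A minimal sketch of how that resolution might look, under a few assumptions: the table arg takes the form `database.table`, the CLI value wins over the env var, and the env var is the `AWS_ATHENA_WORK_GROUP` name that shows up in this PR's diff. The helper names are hypothetical.

```python
import os
from typing import Mapping, Optional, Tuple


def split_table_arg(
    table_arg: str, default_database: Optional[str] = None
) -> Tuple[Optional[str], str]:
    """Split 'database.table' into its parts, falling back to --athena-database."""
    database, sep, table = table_arg.rpartition(".")
    if sep:
        return database, table
    return default_database, table_arg


def resolve_workgroup(
    cli_value: Optional[str], environ: Optional[Mapping[str, str]] = None
) -> Optional[str]:
    """Prefer the CLI value, then the env var (this precedence is an assumption)."""
    environ = os.environ if environ is None else environ
    return cli_value or environ.get("AWS_ATHENA_WORK_GROUP")
```

For example, `split_table_arg("mydb.cohort")` gives `("mydb", "cohort")`, while `split_table_arg("cohort", "mydb")` falls back to the separately supplied database.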

If we find a docref/dxreport ID/ref column, we'll use that. Otherwise, we'll use a patient ID column and grab all notes for those patients.
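
The column preference described above can be sketched as follows. Only `patient_id` and `note_ref` come from this PR description; the other candidate column names, and both function names, are hypothetical stand-ins rather than the list the code actually checks.

```python
import csv
import io
from typing import Iterable, Set, Tuple

# Candidate column names, checked in order. Only "note_ref" and "patient_id"
# are named in the PR description; the rest are illustrative guesses.
NOTE_COLUMNS = ("note_ref", "documentreference_ref", "diagnosticreport_ref")
PATIENT_COLUMNS = ("patient_id", "patient_ref", "subject_ref")


def pick_filter_column(fieldnames: Iterable[str]) -> Tuple[str, str]:
    """Prefer a note-level column; otherwise fall back to a patient column."""
    names = list(fieldnames)
    for candidate in NOTE_COLUMNS:
        if candidate in names:
            return "note", candidate
    for candidate in PATIENT_COLUMNS:
        if candidate in names:
            return "patient", candidate
    raise ValueError("No note or patient column found in cohort")


def read_cohort_ids(csv_text: str) -> Tuple[str, Set[str]]:
    """Return which kind of ID the cohort holds, plus the unique IDs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    kind, column = pick_filter_column(reader.fieldnames or [])
    return kind, {row[column] for row in reader if row[column]}
```

With a cohort that has both columns, the note column wins; with only a patient column, every note for those patients would be in scope.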

This cohort filtering replaces, rather than augments, the previous default filtering of "final" status notes (i.e. skipping draft or superseded notes). If the user is specifying the IDs manually, they presumably know what they want, so we don't need to do the status check for them.
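
To make the "replaces, not augments" behavior concrete, here is a simplified sketch. It assumes each note is a plain dict with hypothetical `id` and `status` keys; real FHIR resources spread status across more than one field, so the actual check is likely more involved.

```python
from typing import Iterable, List, Optional, Set


def filter_notes(
    notes: Iterable[dict], cohort_ids: Optional[Set[str]] = None
) -> List[dict]:
    """Pick notes by cohort when one is given, otherwise by 'final' status."""
    if cohort_ids is not None:
        # A cohort was supplied: trust it and skip the status check entirely.
        return [note for note in notes if note["id"] in cohort_ids]
    # No cohort: fall back to the default of keeping only final notes.
    return [note for note in notes if note.get("status") == "final"]
```

Note that a draft note listed in the cohort is kept, which is the intended trade-off here.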

Checklist

  • Consider if documentation (like in docs/) needs to be updated
  • Consider if tests should be added

@mikix mikix force-pushed the mikix/cohort-args branch from 9bfeb52 to 94c3e04 Compare August 19, 2025 17:15
github-actions bot commented Aug 19, 2025

☂️ Python Coverage

current status: ✅

Overall Coverage

| Lines | Covered | Coverage | Threshold | Status |
|-------|---------|----------|-----------|--------|
| 4120  | 4078    | 99%      | 98%       | 🟢     |

New Files

No new covered files...

Modified Files

File Coverage Status
cumulus_etl/cli_utils.py 100% 🟢
cumulus_etl/errors.py 100% 🟢
cumulus_etl/etl/cli.py 100% 🟢
cumulus_etl/etl/config.py 100% 🟢
cumulus_etl/etl/nlp/cli.py 100% 🟢
cumulus_etl/etl/pipeline.py 100% 🟢
cumulus_etl/etl/tasks/base.py 100% 🟢
cumulus_etl/etl/tasks/nlp_task.py 100% 🟢
cumulus_etl/etl/tasks/task_factory.py 100% 🟢
cumulus_etl/export/cli.py 100% 🟢
cumulus_etl/nlp/utils.py 100% 🟢
cumulus_etl/upload_notes/cli.py 100% 🟢
cumulus_etl/upload_notes/downloader.py 100% 🟢
cumulus_etl/upload_notes/selector.py 100% 🟢
TOTAL 100% 🟢

updated for commit: 5b27f89 by action🐍

@mikix mikix force-pushed the mikix/cohort-args branch from 94c3e04 to cb3c027 Compare August 19, 2025 17:51
@mikix mikix marked this pull request as ready for review August 19, 2025 17:52
@mikix mikix changed the title WIP: feat(nlp): add a variety of --cohort-* args to filter notes feat(nlp): add a variety of --cohort-* args to filter notes Aug 19, 2025
@mikix mikix force-pushed the mikix/cohort-args branch from cb3c027 to 5b27f89 Compare August 19, 2025 17:53
Comment on lines +36 to +40
- AWS_ATHENA_WORK_GROUP
- AWS_DEFAULT_PROFILE
- AWS_DEFAULT_REGION
- AWS_PROFILE
- AWS_REGION
Contributor
just noting here that the library uses different names for these same fields - we might want to natter about harmonizing on one (probably this set)

@mikix mikix merged commit 3cd9947 into main Aug 19, 2025
3 checks passed
@mikix mikix deleted the mikix/cohort-args branch August 19, 2025 18:34
2 participants