---
title: Example Workflow
parent: NLP
grand_parent: ETL
nav_order: 1
# audience: engineer familiar with the project
# type: tutorial
---

# An Example NLP Workflow

Let's work through an end-to-end NLP workflow, as if you were doing a real study.
But instead of a real study, we'll use an example study shipped with the ETL for testing purposes.

This will take us from running the NLP itself, to chart review,
and finally to analyzing accuracy.

You don't need to prepare your own clinical notes for this run-through.
We'll use synthetic notes shipped with Cumulus ETL for this very purpose.

The example study we'll use is a very simple age range study:
the NLP is only tasked with extracting an age from each clinical note.

## The NLP Itself

Before we start, you'll need to have Cumulus ETL and your AWS infrastructure ready.
Follow the [setup instructions](../setup) if you haven't done so already, then come back here.

### Model Setup

You have a choice of model for this.
Real studies might require one specific model or another,
but this example task is fairly flexible.

Here are the options, along with the task name to use for each.

#### Azure Cloud Options
For these, you'll want to set a couple of environment variables first:
```sh
export AZURE_OPENAI_API_KEY=xxx
export AZURE_OPENAI_ENDPOINT=https://xxx.openai.azure.com/
```
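
Before running the ETL, it can be worth a quick sanity check that both variables are
actually set in the shell you'll run Docker from. This isn't required by Cumulus ETL,
just a convenience:
```sh
# Fails loudly if the endpoint is missing; avoids echoing the key itself
echo "${AZURE_OPENAI_ENDPOINT:?AZURE_OPENAI_ENDPOINT is not set}"
test -n "$AZURE_OPENAI_API_KEY" && echo "AZURE_OPENAI_API_KEY is set"
```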

Task names:
- GPT4: `example__nlp_gpt4`
- GPT4o: `example__nlp_gpt4o`
- GPT5: `example__nlp_gpt5`

#### Local (On-Prem) Options
For these, you'll need to start up the appropriate model on your machine:
```sh
docker compose up --wait gpt-oss-120b
```
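
The first start can take a while, since the model has to be pulled and loaded.
If you want to confirm the service is actually up before moving on, the usual
Docker Compose commands work (the service name matches the compose file above):
```sh
# Show the status of the model container
docker compose ps gpt-oss-120b

# Tail its recent logs to watch it finish loading
docker compose logs --tail 50 gpt-oss-120b
```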

Task names:
- GPT-OSS 120B (needs 80GB of GPU memory): `example__nlp_gpt_oss_120b`

### Running the ETL

Now that your model is ready, let's run the ETL on some notes!

Below is the command line to use.
You'll need to change the bucket names and paths to wherever you set up your AWS infrastructure,
and you'll want to change the task name as appropriate for your model.
Leave the odd-looking `%EXAMPLE%` bit in place;
it just tells Cumulus ETL to use its built-in example documents as the input.

The output and PHI bucket locations should be the same ones you use for normal ETL runs on raw FHIR data.
There's no actual PHI in this example run, since the data is synthetic,
but normally there would be, and the PHI bucket is where Cumulus ETL keeps its caches of NLP results.

```sh
docker compose run --rm \
  cumulus-etl nlp \
  %EXAMPLE% \
  s3://my-output-bucket/ \
  s3://my-phi-bucket/ \
  --task example__nlp_gpt4
```

(If this were a real study, you'd probably do this a bit differently.
You'd point at your real DocumentReference resources, for example,
and you'd probably restrict the set of documents you run NLP on with an argument
like `--cohort-athena-table study__my_cohort`; there's a sketch of that below.
But for this run-through, we're going to hand-wave all the document selection pieces.)
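
For the curious, such a real-study invocation might look roughly like the sketch below.
The input path, bucket names, and cohort table here are all hypothetical placeholders;
adjust them to however you normally feed DocumentReference resources into the ETL.
```sh
# A rough sketch only: the input folder and cohort table name are made up for illustration
docker compose run --rm \
  cumulus-etl nlp \
  s3://my-input-bucket/documentreferences/ \
  s3://my-output-bucket/ \
  s3://my-phi-bucket/ \
  --task example__nlp_gpt4 \
  --cohort-athena-table study__my_cohort
```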
 | 86 | + | 
### Running the Crawler

Whenever you write a new table to S3, you'll want to run your AWS Glue crawler again,
so that the table's schema gets set correctly in Athena.

First, confirm that your AWS CloudFormation templates have the `example__nlp_*` tables
configured in them. If not, try copying the Glue crawler definition from
[the sample template we provide](../setup/aws.md).

Then go to the AWS console, open the AWS Glue service, and in the sidebar under Data Catalog,
choose Crawlers.
You should see your crawler listed there. Select it, click Run, and wait for it to finish.
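
If you'd rather not click through the console, the AWS CLI can kick off the same crawler.
The crawler name below is a placeholder; use whichever name your CloudFormation template created:
```sh
# Start the crawler (name is hypothetical)
aws glue start-crawler --name my-cumulus-crawler

# Poll until the crawler's state returns to READY
aws glue get-crawler --name my-cumulus-crawler --query Crawler.State
```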

### Confirm the Data in Athena

While you're in the AWS console, switch to the Athena service and select the appropriate
Cumulus workgroup and database.

Then if you run a query like the one below (assuming you used the GPT4 model),
you should see eight results with extracted ages.
```sql
select * from example__nlp_gpt4
```
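
The same check can also be run from the command line through the Athena API, if you prefer.
The workgroup and database names below are placeholders for your own Cumulus setup:
```sh
# Submit the query; this prints a query execution ID
aws athena start-query-execution \
  --work-group my-cumulus-workgroup \
  --query-execution-context Database=my_cumulus_database \
  --query-string 'select count(*) from example__nlp_gpt4'

# Once it finishes, fetch the results (the count should be eight)
aws athena get-query-results --query-execution-id <id-from-above>
```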

**Congratulations!**
You've now run NLP on some synthetic clinical notes and uploaded the results to Athena.
Those extracted ages could now be post-processed by the `example` study to calculate age ranges,
and then confirmed with chart review by humans.

At least, that's the flow you'd use for a real study.