Conversation

@mikix mikix commented Aug 22, 2025

These are bare-bones prompts that can run on the shipped sample data. They are meant to be used when following docs that step through the NLP workflow.

Checklist

  • Consider if documentation (like in docs/) needs to be updated
  • Consider if tests should be added

Comment on lines -78 to +81
-    return openai.AsyncAzureOpenAI(api_version="2024-06-01")
+    return openai.AsyncAzureOpenAI(api_version="2024-10-21")
Contributor Author (mikix):

This gets us onto the latest API version - I believe the only changes are deprecating something we aren't using and adding batch processing (which we'll be interested in at some point).

Contributor:

Should this be a CLI arg with a default value?

Contributor Author (mikix):

Naw, because this isn't a user-visible thing. Even if it were, I think you'd still want to ratchet it upward, because who knows when they'll drop an old API version. Mostly it's just for us to know how to call into it.
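Since the version is pinned in code rather than exposed to users, one way to keep that ratchet obvious is a single module-level constant. A sketch with hypothetical names, not the actual cumulus-etl code:

```python
# Hypothetical sketch: keep the pinned Azure API version in one obvious place,
# so ratcheting it upward is a one-line change rather than a CLI concern.
AZURE_API_VERSION = "2024-10-21"


def make_client():
    # Imported lazily so the constant is importable without the SDK installed;
    # the real client also needs the usual Azure credential environment variables.
    import openai

    return openai.AsyncAzureOpenAI(api_version=AZURE_API_VERSION)
```

Every call site then shares the same version string, and a future bump touches only the constant.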

# 3.5 doesn't support a pydantic JSON schema, so we do some work to keep it using the same API
# as the rest of our code.
Contributor Author (mikix):

I realized this issue while testing the model for this branch - I think it's broken right now in main for the covid study (but 🤷) - this gets it working again.
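A minimal sketch of the kind of shim that comment describes - assuming a hypothetical schema, and using only the stdlib where the real code uses pydantic - for parsing a plain-JSON completion from a model without structured-output support:

```python
import json
from dataclasses import dataclass, field


@dataclass
class SymptomResult:  # hypothetical schema; the real code defines pydantic models
    has_symptoms: bool
    spans: list = field(default_factory=list)


def parse_legacy_completion(text: str) -> SymptomResult:
    """Validate a plain-JSON reply from a model lacking structured-output support."""
    raw = json.loads(text)
    return SymptomResult(
        has_symptoms=bool(raw["has_symptoms"]),
        spans=list(raw.get("spans", [])),
    )
```

Callers get the same typed result object either way, so the rest of the pipeline doesn't need to know which model produced the text.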

Contributor:

TBH we can probably remove 3.5 at this point?

Contributor Author (mikix):

Well, that becomes a question of "when do we feel comfortable deleting code for the covid study?" and an issue of reproducibility.

I realize 3.5 is harder to hit today, but not impossible (we can still do it through Azure at BCH).

Contributor:

we should maybe talk about that at a weekly meeting? but that shouldn't block this PR

Comment on lines +177 to +178
+ if args.dir_input == "%EXAMPLE%" and not os.path.exists(args.dir_input):
+     args.dir_input = os.path.join(os.path.dirname(__file__), "studies/example/ndjson")
Contributor Author (mikix):

Is this %EXAMPLE% approach gross? Got a better idea?

Contributor:

Two approaches come to mind:

  • have a special CLI arg which will use a specific on-disk path just for this
  • make this study a separate git repo, and then have someone install it in some location they can get easy path access to

Contributor Author (mikix):

> have a special CLI arg which will use a specific on-disk path just for this

Like --use-example-data or something? But then what do we do with the normally-required input-folder positional CLI arg? Make it not required if we see the other argument? Adds some complexity to the code, but could work. And adds a new CLI arg to the pile.

> make this study a separate git repo, and then have someone install it in some location they can get easy path access to

Yeah, that would be the traditional approach, and more similar to "real" studies, which is nice. But... it adds several steps to the instructions, and it means the docker compose lines need the extra complexity of "OK, now volume-mount that folder and refer to it on the command line from the mounted location" - which, again, they'll have to do eventually for real data. But I was hoping to avoid that for the simple workflow case.

Were you just brainstorming, and/or do you dislike %EXAMPLE%? Like, how much on a scale of 1-10? 😄 I dislike it about a 3. And I slightly prefer it to the above, which I'd put at 4-5 maybe.
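For comparison, the flag variant discussed above could look roughly like this - a hypothetical sketch, with made-up argument names and an example path, not the real CLI:

```python
import argparse

# Hypothetical path to the shipped sample data, relative to the package
EXAMPLE_DATA_DIR = "studies/example/ndjson"


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # The positional becomes optional so the flag can stand in for it.
    parser.add_argument("dir_input", nargs="?", help="input folder of ndjson files")
    parser.add_argument("--use-example-data", action="store_true",
                        help="run against the shipped sample data")
    args = parser.parse_args(argv)
    if args.use_example_data:
        args.dir_input = EXAMPLE_DATA_DIR
    elif args.dir_input is None:
        parser.error("dir_input is required unless --use-example-data is given")
    return args
```

This is the "adds some complexity" trade-off in concrete form: the positional loses its built-in required-ness, and the validation moves into hand-written code.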

Contributor:

It's not my favorite, but I think it's... fine?

I like the real-study approach, but only paired with the idea of "get everything to use the example study for demonstration purposes" - which is maybe a bit more of a lift, but I don't think it's a must-do.

Contributor Author (mikix):

Ah, that's an interesting thought - to re-use this elsewhere. I'd have a concern about viability, though. These docs were chosen to match what the NLP task needs - certain "final" status codes, a clear age listed in the doc, etc. If we use them elsewhere, we might struggle to make the data fit all use cases - and/or add more data that would turn this small-token test into a larger one.

HOWEVER, I do intend to make a little example study for the Library - I was thinking of having it built in like core and discovery, again preferring ease of use over realism for the non-NLP bits of this workflow. If that approach is no good, we'd make a new repo for it, and that might be a reasonable place to put this data.

Contributor Author (mikix):

We talked about this separately. I think we're fine with this current approach for now, but we want this document and its parent "NLP overview for execs" doc to expand a bit and likely move to the global cumulus docs repo. But this hand-wavy approach around "where do the docs come from" is fine for this "NLP overview for engineers" doc.

@mikix mikix marked this pull request as ready for review August 22, 2025 14:57
@github-actions

github-actions bot commented Aug 22, 2025

☂️ Python Coverage

current status: ✅

Overall Coverage

| Lines | Covered | Coverage | Threshold | Status |
|-------|---------|----------|-----------|--------|
| 4186  | 4144    | 99%      | 98%       | 🟢     |

New Files

| File | Coverage | Status |
|------|----------|--------|
| cumulus_etl/etl/studies/example/__init__.py | 100% | 🟢 |
| cumulus_etl/etl/studies/example/example_tasks.py | 100% | 🟢 |
| TOTAL | 100% | 🟢 |

Modified Files

| File | Coverage | Status |
|------|----------|--------|
| cumulus_etl/etl/pipeline.py | 100% | 🟢 |
| cumulus_etl/etl/tasks/nlp_task.py | 100% | 🟢 |
| cumulus_etl/etl/tasks/task_factory.py | 100% | 🟢 |
| cumulus_etl/nlp/openai.py | 100% | 🟢 |
| TOTAL | 100% | 🟢 |

updated for commit: 7a0cee5 by action🐍


Contributor:

It might be nice to give a rough idea of the cost of running this, just so folks realize it's trivial?

Contributor Author (mikix):

Good point... Let me ask Dylan how to figure that out 😄

Contributor Author (mikix):

OK, after talking to him, I want to make a larger change that should probably land separately.

I can collect and print the total token costs of an NLP job (and the run time). That should let anyone figure out how much any given NLP run cost them: tokens for cloud jobs, run time for local models.

And then I can come back and fill this bit in with an exact cost for the sample data (rather than an estimated one).

Contributor Author (mikix):

Though... I suppose I could go collect that data real quick right now and put it in before doing the full cost thing. Let me do that.

Contributor Author (mikix):

Done. Less than 15 cents for GPT-4 - which, oddly, is crazy expensive ($30/million tokens, compared to $2-$3/million for most of the other APIs).
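As a sanity check on that figure (the token count here is a hypothetical round number, and this ignores any prompt/completion price split):

```python
PRICE_PER_MILLION = 30.00  # rough GPT-4 rate mentioned above, $/million tokens
tokens_used = 4_000        # hypothetical token count for the sample-data run
cost = tokens_used * PRICE_PER_MILLION / 1_000_000
assert cost < 0.15  # well under 15 cents
```

At that rate, anything below about 5,000 tokens stays under the 15-cent mark.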

@mikix mikix merged commit 36f2923 into main Aug 25, 2025
3 checks passed
@mikix mikix deleted the mikix/example branch August 25, 2025 16:12