Conversation

@mikix mikix commented Aug 22, 2025

These are bare-bones prompts that can run on the shipped sample data. They are meant to be used when following docs that step through the NLP workflow.

Checklist

  • Consider if documentation (like in docs/) needs to be updated
  • Consider if tests should be added

Comment on lines -78 to +81
-    return openai.AsyncAzureOpenAI(api_version="2024-06-01")
+    return openai.AsyncAzureOpenAI(api_version="2024-10-21")
Contributor Author (mikix):

This gets us onto the latest API version - I believe the only changes are deprecating something we aren't using and adding batch processing (which we'll be interested in at some point).

Contributor:

Should this be a CLI arg with a default value?

Contributor Author (mikix):

Naw, because this isn't a user-visible thing. Even if it were, I think you'd still want to ratchet it upward, because who knows when they'll drop an old API version. Mostly it's just for us to know how to call into it.
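Since the version is pinned in code rather than exposed to users, one way to keep that ratchet obvious is a single module-level constant. A sketch with hypothetical names, not the actual cumulus-etl code:

```python
# Hypothetical sketch: keep the pinned Azure API version in one obvious place,
# so ratcheting it upward is a one-line change rather than a CLI concern.
AZURE_API_VERSION = "2024-10-21"


def make_client():
    # Imported lazily so the constant is importable without the SDK installed;
    # the real client also needs the usual Azure credential environment variables.
    import openai

    return openai.AsyncAzureOpenAI(api_version=AZURE_API_VERSION)
```

Every call site then shares the same version string, and a future bump touches only the constant.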

# 3.5 doesn't support a pydantic JSON schema, so we do some work to keep it using the same API
# as the rest of our code.
Contributor Author (mikix):

I realized this issue while testing the model for this branch - I think it's broken right now in main for the covid study (but 🤷) - this gets it working again.
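A minimal sketch of the kind of shim that comment describes - assuming a hypothetical schema, and using only the stdlib where the real code uses pydantic - for parsing a plain-JSON completion from a model without structured-output support:

```python
import json
from dataclasses import dataclass, field


@dataclass
class SymptomResult:  # hypothetical schema; the real code defines pydantic models
    has_symptoms: bool
    spans: list = field(default_factory=list)


def parse_legacy_completion(text: str) -> SymptomResult:
    """Validate a plain-JSON reply from a model lacking structured-output support."""
    raw = json.loads(text)
    return SymptomResult(
        has_symptoms=bool(raw["has_symptoms"]),
        spans=list(raw.get("spans", [])),
    )
```

Callers get the same typed result object either way, so the rest of the pipeline doesn't need to know which model produced the text.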

Contributor:

TBH we can probably remove 3.5 at this point?

Contributor Author (mikix):

Well, that becomes a question of "when do we feel comfortable deleting code for the covid study?" and an issue of reproducibility.

I realize 3.5 is harder to hit today, but not impossible (we can still do it through Azure at BCH).

Contributor:

we should maybe talk about that at a weekly meeting? but that shouldn't block this PR

Comment on lines +177 to +178
+ if args.dir_input == "%EXAMPLE%" and not os.path.exists(args.dir_input):
+     args.dir_input = os.path.join(os.path.dirname(__file__), "studies/example/ndjson")
Contributor Author (mikix):

Is this %EXAMPLE% approach gross? Got a better idea?

Contributor:

Two approaches come to mind:

  • have a special CLI arg which will use a specific on-disk path just for this
  • make this study a separate git repo, and then have someone install it in some location they can get easy path access to

Contributor Author (mikix):

> have a special CLI arg which will use a specific on-disk path just for this

Like --use-example-data or something? But then what do we do with the normally-required input-folder positional CLI arg? Make it not required if we see the other argument? Adds some complexity to the code, but could work. And adds a new CLI arg to the pile.

> make this study a separate git repo, and then have someone install it in some location they can get easy path access to

Yeah, that would be the traditional approach, and more similar to "real" studies, which is nice. But... it adds several steps to the instructions, and it means the docker compose lines need the extra complexity of "OK, now volume-mount that folder and refer to it on the command line from the mounted location" - which, again, they'll have to do eventually for real data. But I was hoping to avoid that for the simple workflow case.

Were you just brainstorming, and/or do you dislike %EXAMPLE%? Like, how much on a scale of 1-10? 😄 I dislike it about a 3. And I slightly prefer it to the above, which I'd put at 4-5 maybe.
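For comparison, the flag variant discussed above could look roughly like this - a hypothetical sketch, with made-up argument names and an example path, not the real CLI:

```python
import argparse

# Hypothetical path to the shipped sample data, relative to the package
EXAMPLE_DATA_DIR = "studies/example/ndjson"


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # The positional becomes optional so the flag can stand in for it.
    parser.add_argument("dir_input", nargs="?", help="input folder of ndjson files")
    parser.add_argument("--use-example-data", action="store_true",
                        help="run against the shipped sample data")
    args = parser.parse_args(argv)
    if args.use_example_data:
        args.dir_input = EXAMPLE_DATA_DIR
    elif args.dir_input is None:
        parser.error("dir_input is required unless --use-example-data is given")
    return args
```

This is the "adds some complexity" trade-off in concrete form: the positional loses its built-in required-ness, and the validation moves into hand-written code.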

Contributor:

It's not my favorite, but I think it's... fine?

I like the real-study approach, but only paired with the idea of "get everything to use the example study for demonstration purposes" - which is maybe a bit more of a lift, but I don't think it's a must-do.

Contributor Author (mikix):

Ah, that's an interesting thought - to re-use this elsewhere. I'd have a concern about viability, though. These docs were chosen to match what the NLP task needs - certain "final" status codes, a clear age listed in the doc, etc. If we use them elsewhere, we might struggle to make the data fit all use cases - and/or add more data that would turn this small-token test into a larger one.

HOWEVER, I do intend to make a little example study for the Library - I was thinking of having it built in like core and discovery, again preferring ease of use over realism for the non-NLP bits of this workflow. If that approach is no good, we'd make a new repo for it, and that might be a reasonable place to put this data.

Contributor Author (mikix):

We talked about this separately. I think we're fine with this current approach for now, but we want this document and its parent "NLP overview for execs" doc to expand a bit and likely move to the global cumulus docs repo. But this hand-wavy approach around "where do the docs come from" is fine for this "NLP overview for engineers" doc.

@mikix mikix marked this pull request as ready for review August 22, 2025 14:57
@github-actions

github-actions bot commented Aug 22, 2025

☂️ Python Coverage

current status: ✅

Overall Coverage

| Lines | Covered | Coverage | Threshold | Status |
|-------|---------|----------|-----------|--------|
| 4186  | 4144    | 99%      | 98%       | 🟢     |

New Files

| File | Coverage | Status |
|------|----------|--------|
| cumulus_etl/etl/studies/example/__init__.py | 100% | 🟢 |
| cumulus_etl/etl/studies/example/example_tasks.py | 100% | 🟢 |
| TOTAL | 100% | 🟢 |

Modified Files

| File | Coverage | Status |
|------|----------|--------|
| cumulus_etl/etl/pipeline.py | 100% | 🟢 |
| cumulus_etl/etl/tasks/nlp_task.py | 100% | 🟢 |
| cumulus_etl/etl/tasks/task_factory.py | 100% | 🟢 |
| cumulus_etl/nlp/openai.py | 100% | 🟢 |
| TOTAL | 100% | 🟢 |

updated for commit: 7a0cee5 by action🐍


Contributor:

It might be nice to give a rough idea of the cost of running this, just so folks realize it's trivial?

Contributor Author (mikix):

Good point... Let me ask Dylan how to figure that out 😄

Contributor Author (mikix):

OK, after talking to him, I want to make a larger change that should probably land separately.

I can collect and print the total token costs of an NLP job (and the run time). That should let anyone figure out how much any given NLP run cost them: tokens for cloud jobs, run time for local models.

And then I can come back and fill this bit in with an exact cost for the sample data (rather than an estimated one).

Contributor Author (mikix):

Though... I suppose I could go collect that data real quick right now and put it in before doing the full cost thing. Let me do that.

Contributor Author (mikix):

Done. Less than 15 cents for GPT-4 - which, oddly, is crazy expensive ($30/million tokens, compared to $2-$3/million for most of the other APIs).
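As a sanity check on that figure (the token count here is a hypothetical round number, and this ignores any prompt/completion price split):

```python
PRICE_PER_MILLION = 30.00  # rough GPT-4 rate mentioned above, $/million tokens
tokens_used = 4_000        # hypothetical token count for the sample-data run
cost = tokens_used * PRICE_PER_MILLION / 1_000_000
assert cost < 0.15  # well under 15 cents
```

At that rate, anything below about 5,000 tokens stays under the 15-cent mark.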

@mikix mikix merged commit 36f2923 into main Aug 25, 2025
3 checks passed
@mikix mikix deleted the mikix/example branch August 25, 2025 16:12