feat: add example__* tasks #437
Conversation
```diff
- return openai.AsyncAzureOpenAI(api_version="2024-06-01")
+ return openai.AsyncAzureOpenAI(api_version="2024-10-21")
```
This gets us onto the latest version - I believe the only changes are deprecating something we aren't using and adding batch processing (which we'll be interested in at some point).
should this be a cli arg with a default value?
Naw, because this isn't a user-visible thing. Even if it were, I think you'd still want to ratchet it upward, because who knows when they'll drop an old API version. But mostly it's just for us to know how to call into it.
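For illustration, a minimal sketch of the "pin it in code and ratchet upward" approach (the AZURE_API_VERSION constant and get_client() helper are hypothetical names, not this repo's actual code):

```python
import openai

# Hypothetical: pin the Azure OpenAI API version in one place,
# so ratcheting it upward later is a one-line change.
AZURE_API_VERSION = "2024-10-21"

def get_client() -> openai.AsyncAzureOpenAI:
    # Endpoint and credentials are assumed to come from the standard
    # AZURE_OPENAI_ENDPOINT / AZURE_OPENAI_API_KEY environment variables.
    return openai.AsyncAzureOpenAI(api_version=AZURE_API_VERSION)
```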
```python
# 3.5 doesn't support a pydantic JSON schema, so we do some work to keep it using the same API
# as the rest of our code.
```
I ran into this issue while testing the model for this branch - I think it's broken right now in main for the covid study (but 🤷) - this change gets it working again.
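A rough sketch of the kind of shim being discussed - generic JSON mode for 3.5, pydantic-aware structured outputs for newer models, both behind one call (the ChartAnswer schema and ask() helper are illustrative, not this repo's code):

```python
import openai
import pydantic

class ChartAnswer(pydantic.BaseModel):
    # Hypothetical response schema, just for illustration.
    has_covid: bool

async def ask(client: openai.AsyncAzureOpenAI, model: str, messages: list) -> ChartAnswer:
    if model.startswith("gpt-3.5"):
        # 3.5 only supports generic JSON mode, so validate the JSON by hand,
        # keeping the same return type as the newer-model path below.
        response = await client.chat.completions.create(
            model=model, messages=messages, response_format={"type": "json_object"}
        )
        return ChartAnswer.model_validate_json(response.choices[0].message.content)
    # Newer models accept the pydantic schema directly.
    response = await client.beta.chat.completions.parse(
        model=model, messages=messages, response_format=ChartAnswer
    )
    return response.choices[0].message.parsed
```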
TBH we can probably remove 3.5 at this point?
Well, that becomes a question of "when do we feel comfortable deleting code for the covid study?" and an issue of reproducibility.
I realize 3.5 is harder to hit today, but not impossible (we can still do it through Azure at BCH).
we should maybe talk about that at a weekly meeting? but that shouldn't block this PR
```python
if args.dir_input == "%EXAMPLE%" and not os.path.exists(args.dir_input):
    args.dir_input = os.path.join(os.path.dirname(__file__), "studies/example/ndjson")
```
Is this %EXAMPLE% approach gross? Got a better idea?
two approaches come to mind:
- have a special CLI arg which will use a specific on-disk path just for this
- make this study a separate git repo, and then have someone install it in some location where they can easily get path access
> have a special CLI arg which will use a specific on-disk path just for this
Like --use-example-data or something? But then what do we do with the normally-required positional input-folder CLI arg? Make it not required when we see the other argument? That adds some complexity to the code but could work (see the sketch below) - and adds a new CLI arg to the pile.
> make this study a separate git repo, and then have someone install it in some location where they can easily get path access
Yeah, that would be the traditional approach, and more similar to "real" studies, which is nice. But... it adds several steps to the instructions, and it means the docker compose lines pick up the extra complexity of "OK, now volume-mount where you put that folder and refer to it on the command line from the mounted location" - which, again, folks will have to do eventually for real data, but which I was hoping to avoid for the simple workflow case.
Were you just brainstorming, and/or do you dislike %EXAMPLE%? Like, how much on a scale of 1-10? 😄 I dislike it about a 3. And slightly prefer it to the above I think, which I'd put at 4-5 maybe.
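For comparison, a sketch of what that --use-example-data alternative might have looked like (the flag name and the optional-positional wiring are hypothetical):

```python
import argparse
import os

parser = argparse.ArgumentParser()
# Make the normally-required input folder optional...
parser.add_argument("dir_input", nargs="?")
# ...and add a flag that swaps in the shipped sample data instead.
parser.add_argument("--use-example-data", action="store_true")
args = parser.parse_args()

if args.use_example_data:
    args.dir_input = os.path.join(os.path.dirname(__file__), "studies/example/ndjson")
elif not args.dir_input:
    parser.error("an input folder is required (or pass --use-example-data)")
```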
it's not my favorite, but I think it's... fine?
I like the real-study approach, but only paired with the idea of "get everything to use the example study for demonstration purposes" - which is maybe a bit more of a lift, but I don't think it's a must-do.
Ah, that's an interesting thought - re-using this elsewhere. I'd have a concern about viability though. These docs are chosen to match what the NLP task needs - certain "final" status codes, a clear age listed in the doc, etc. If we use them elsewhere, we might struggle to make the data fit all use cases - and/or add more data that would turn this small token test into a larger one.
HOWEVER, I do intend to make a little example study for the Library - I was thinking of having it built-in like core and discovery, again preferring ease-of-use over some realism for the non-NLP bits of this workflow. If that approach is no good, we would make a new repo for it, and that might be a reasonable place to put this data.
We talked about this separately. I think we're fine with this current approach for now, but we want this document and its parent "NLP overview for execs" doc to expand a bit and likely move to the global cumulus docs repo. But this hand-wavy approach around "where do the docs come from" is fine for this "NLP overview for engineers" doc.
it might be nice to give a rough idea of the cost of running this, just so folks realize it's trivial?
Good point... Let me ask Dylan how to figure that out 😄
OK after talking to him, I want to make a larger change that should probably land separately.
I can collect and print the total token costs of an NLP job (and the run time). This should let anyone figure out how much any given NLP run cost them. Tokens for cloud jobs, run time for local models.
And then I can come back and fill this bit in with an exact cost for the sample data (rather than an estimated one).
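A minimal sketch of that kind of accounting, assuming the usage object the OpenAI SDK attaches to each response (the UsageTally class itself is hypothetical):

```python
import time

class UsageTally:
    """Hypothetical accumulator for the total token cost of an NLP job."""

    def __init__(self) -> None:
        self.prompt_tokens = 0
        self.completion_tokens = 0
        self.start = time.monotonic()

    def add(self, usage) -> None:
        # `usage` is response.usage from each chat completion call
        self.prompt_tokens += usage.prompt_tokens
        self.completion_tokens += usage.completion_tokens

    def report(self) -> str:
        elapsed = time.monotonic() - self.start
        return (
            f"{self.prompt_tokens} prompt + {self.completion_tokens} completion "
            f"tokens in {elapsed:.0f}s"
        )
```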
Though... I suppose I could go collect that data real quick rn and put it in before doing the full cost thing. Let me do that
Done. Less than 15 cents for GPT-4, which oddly is crazy expensive ($30/million tokens, compared to like $2-$3/million for most of the other APIs).
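As a back-of-the-envelope check on that figure (assuming a flat $30/million-token rate and ignoring GPT-4's separate input/output pricing):

```python
dollars_per_million_tokens = 30
cost_dollars = 0.15  # the "less than 15 cents" quoted above
tokens = cost_dollars / dollars_per_million_tokens * 1_000_000  # under 5,000 tokens
```

Which suggests the whole sample-data run is on the order of 5,000 tokens.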
These are bare-bones prompts, which can run on the shipped sample data. They are meant to be used when following docs on how to step through the NLP workflow.
Checklist
- Documentation (in docs/) needs to be updated