Clean extracted schema json #432
Open
Description
I am experimenting with different models and approaches.
An area I am struggling with is schema extraction, specifically models returning malformed JSON that fails to parse, which causes the whole process to fail. This feels particularly brittle.
Regardless of the JSON rules in the prompt, models still return badly formatted JSON, particularly by wrapping it in markdown.
This is particularly prevalent with gpt-4o and the OpenAI open source models, e.g. gpt-oss-20b.
Depending on the model, I can include response parameters such as "response_format": {"type": "json_object"}, but not all models support it.
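For illustration only, the conditional use of that parameter could look roughly like the sketch below. The helper name build_chat_payload and the supports_json_mode flag are hypothetical and not part of this PR; which models accept the parameter is left to the caller.

```python
from typing import Any


def build_chat_payload(
    model: str,
    messages: list[dict[str, str]],
    supports_json_mode: bool,
) -> dict[str, Any]:
    """Build a chat request body, adding JSON mode only when supported.

    `supports_json_mode` is a hypothetical flag supplied by the caller;
    this sketch does not try to detect model capabilities itself.
    """
    payload: dict[str, Any] = {"model": model, "messages": messages}
    if supports_json_mode:
        # The response parameter quoted in the description above.
        payload["response_format"] = {"type": "json_object"}
    return payload
```

Because the parameter is optional here, the same call path works for models that reject it, which is why a separate cleaning step is still useful.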
I have experimented with adding a step, here in the schema extraction, that cleanses the response by removing known or common problems (e.g. the markdown wrapping) from the JSON before loading it. This has worked really well.
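A minimal sketch of what such a cleansing step might look like, assuming the most common problem is a markdown code fence (optionally tagged json) or stray prose around the payload. The function names clean_json_text and load_schema_json are illustrative and not necessarily the ones used in this change.

```python
import json
import re
from typing import Any

# Matches a response that is entirely wrapped in a markdown code fence,
# with an optional language tag on the opening fence.
_FENCE_RE = re.compile(r"^```[a-zA-Z0-9_-]*\s*\n(.*?)\n?```\s*$", re.DOTALL)


def clean_json_text(raw: str) -> str:
    """Strip common wrappers from a model response before JSON parsing."""
    text = raw.strip()
    match = _FENCE_RE.match(text)
    if match:
        text = match.group(1).strip()
    # Fallback: if prose still surrounds the payload, keep only the span
    # from the first opening bracket to the last closing bracket.
    if not text.startswith(("{", "[")):
        starts = [i for i in (text.find("{"), text.find("[")) if i != -1]
        end = max(text.rfind("}"), text.rfind("]"))
        if starts and end > min(starts):
            text = text[min(starts): end + 1]
    return text


def load_schema_json(raw: str) -> Any:
    """Clean the response, then parse it; json.JSONDecodeError still
    propagates if the cleaned text is not valid JSON."""
    return json.loads(clean_json_text(raw))
```

The cleaning is deliberately conservative: it only removes wrappers and never rewrites the JSON itself, so a genuinely invalid payload still fails loudly rather than being silently altered.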
Type of Change
Complexity
Complexity: Low
How Has This Been Tested?
Checklist
The following requirements should have been met (depending on the changes in the branch):