Skip to content

Make sure we correctly support additionalProperties set to True #150

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
rlouf opened this issue Jan 28, 2025 · 5 comments
Open

Make sure we correctly support additionalProperties set to True #150

rlouf opened this issue Jan 28, 2025 · 5 comments
Labels
bug Something isn't working json schema

Comments

@rlouf
Copy link
Member

rlouf commented Jan 28, 2025

Setting the additionalProperties keyword to True should allow generation of properties whose name is not listed in properties, but it is currently not the case:

import json
import re
from outlines_core.fsm.json_schema import build_regex_from_schema

schema = {
    "$schema": "http://json-schema.org/draft-04/schema#",
    "type": "object",
    "properties": {
      "a": {
        "type": "string",
      }
    },
    "required": [
      "marathonJson"
    ],
    "additionalProperties": True
}

data = json.dumps({
    "a": "aaa",
})


regex_str = build_regex_from_schema(json.dumps(schema))
print(regex_str)
# '\\{([ ]?"a"[ ]?:[ ]?"([^"\\\\\\x00-\\x1F\\x7F-\\x9F]|\\\\["\\\\])*")?[ ]?\\}'

print(re.match(regex_str, data)
# <re.Match object; span=(0, 12), match='{"a": "aaa"}'>

But:

data = json.dumps({
    "a": "aaa",
    "id": "1"
})

print(re.match(regex_str, data)
# None

The immediate solution is to return an error when additionalProperties is set to True, to avoid surprises. Once that's done we can open an issue to add it to the implementation.

@rlouf rlouf added the bug Something isn't working label Jan 28, 2025
@rlouf rlouf removed the json schema label Jan 29, 2025
@rlouf rlouf changed the title Return compilation error when additionalProperties is set to True Support additionalProperties set to True Feb 18, 2025
@rlouf rlouf changed the title Support additionalProperties set to True Make sure we correctly support additionalProperties set to True Feb 18, 2025
@yvan-sraka
Copy link
Contributor

Okay, so I ran a lot of tests, read the JSON Schema specification, and also looked into this related issue: #149

AFAIU the spec states that having "additionalProperties": true is redundant since it's the default behavior. A quick fix could have been to simply forbid "additionalProperties": false, adjust a few tests, and be done with it!

However, the issue that @rlouf pointed out is actually different. Even though the current implementation looks correct, there's still a problem that can be demonstrated with this minimal example:

import json
import re
from outlines_core.json_schema import build_regex_from_schema

schema = {
    "type": "object",
    "properties": {
        "a": {
            "type": "string",
        }
        # No need to even mention `additionalProperties`
    },
}
regex_str = build_regex_from_schema(json.dumps(schema))

data = json.dumps({
    "a": "aaa",
})
print(re.match(regex_str, data))

data = json.dumps({
    "a": "aaa",
    "id": "1"
})
print(re.match(regex_str, data))  # Shouldn't be `None`

The issue with this code is that it doesn't even trigger the line of code mentioned... IIUC the library default behavior is wrong, and that's what need to be changed, not just tweaking additionalProperties-related code!

@rlouf
Copy link
Member Author

rlouf commented Feb 19, 2025

I think that it makes sense for generation to set the default value to False (as long as it's documented). OpenAI forces you to set it to False manually, but I think it does match people's intuition when it comes to generation.

@torymur
Copy link
Contributor

torymur commented Feb 19, 2025

@yvan-sraka So to_regex is an entry point, then it branches out depending on the key. If both type & properties are defined, only properties will be executed (because it's a first match, type is almost the last): https://github.com/dottxt-ai/outlines-core/blob/main/src/json_schema/parsing.rs#L54-L62

And additionalProperties logic belongs to parse_type branch only, while parse_properties doesn't have additionalProperties logic at all.

In other words, if you change @rlouf example by dropping properties it will work:

schema = {
    "$schema": "http://json-schema.org/draft-04/schema#",
    "type": "object",
    },
    "required": [
      "marathonJson"
    ],
    "additionalProperties": True
}

So that's what this bug is about, parse_properties needs to include this additionalProperties logic from parse_type => parse_type_string => parse_object_type

@yvan-sraka
Copy link
Contributor

I think that it makes sense for generation to set the default value to False (as long as it's documented). OpenAI forces you to set it to False manually, but I think it does match people's intuition when it comes to generation.

I would also prefer an error that forces the user to set it to False explicitly; otherwise, it’s a bit unclear and counterintuitive! If we choose a different default than the JSON Schema specification, where should we document it to ensure users don’t miss it?

@rlouf
Copy link
Member Author

rlouf commented Feb 19, 2025

"Unintuitive" if you know the JSON specification inside and out. We are not using the spec for what it was intended to do (parsing), so we can take small liberties when a default of the original spec doesn't match users' common use cases or mental models.

I am yet to meet anyone who complained that outlines does not generate properties that they did not specify in the schema. I bet most people's mental model correspond to Zod and Pydantic's "if it's not explicitly defined it's not allowed".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working json schema
Projects
None yet
Development

No branches or pull requests

3 participants