
performance problem #7

@michaelfruth


Hello,
I noticed a performance problem as soon as the schema contains the following structure:

... "anyOf": [ {"enum": ["aa", "bb", "cc"]}, {"pattern": "pattern1"}, {"pattern": "pattern2"}, {"pattern": "pattern3"}, ... ] ...

Performance can be improved massively by preprocessing the schema. All enum values and patterns should be combined into a single pattern, as shown in the example below:

... "anyOf": [ {"pattern": "^aa$|^bb$|^cc$|pattern1|pattern2|pattern3"} ] ...
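The merge described above can be sketched in a few lines of Python. This is a minimal illustration, not the attached smaller_anyOf.py script: the function name `merge_anyof_strings` is hypothetical, it only handles a top-level "anyOf", and it assumes the branches are meant to constrain strings (a lone "enum" branch turned into a "pattern" no longer rejects non-string values on its own).

```python
import re


def merge_anyof_strings(schema):
    """Merge string-enum and pattern branches of a top-level "anyOf" into one pattern.

    Hypothetical helper for illustration; branches that are neither a pure
    string enum nor a pure pattern are kept unchanged after the merged one.
    """
    branches = schema.get("anyOf")
    if not isinstance(branches, list):
        return schema

    alternatives = []  # regex alternatives, in the original branch order
    untouched = []     # branches that cannot be merged

    for branch in branches:
        is_dict = isinstance(branch, dict)
        if (is_dict and set(branch) == {"enum"}
                and all(isinstance(v, str) for v in branch["enum"])):
            # Anchor and escape each enum value so it only matches literally.
            alternatives.extend("^" + re.escape(v) + "$" for v in branch["enum"])
        elif is_dict and set(branch) == {"pattern"}:
            alternatives.append(branch["pattern"])
        else:
            untouched.append(branch)

    if not alternatives:
        return schema

    merged = dict(schema)
    merged["anyOf"] = [{"pattern": "|".join(alternatives)}] + untouched
    return merged


example = {
    "anyOf": [
        {"enum": ["aa", "bb", "cc"]},
        {"pattern": "pattern1"},
        {"pattern": "pattern2"},
        {"pattern": "pattern3"},
    ]
}
print(merge_anyof_strings(example))
```

Applied to the structure from the first snippet, this yields a single branch with the pattern "^aa$|^bb$|^cc$|pattern1|pattern2|pattern3".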

Currently, the implementation iteratively appends the enum values and regex patterns to a single regex and, on every iteration, computes the intersection between the current pattern and ".*". This is very expensive and results in bad performance (for this specific kind of schema).

I attached an example JSON file (anyOf.json) that shows the problem. On my machine, checking the file against itself (command jsonsubschema anyOf.json anyOf.json) takes about 50-60 seconds to produce the result (LHS :< RHS and RHS :< LHS). With preprocessing applied, it takes about 0.04 seconds. I also attached a Python script (smaller_anyOf.py) that performs the preprocessing: it combines the string enum values and all patterns into a single pattern, as shown in the example above.

AnyOf.zip

When string enum values are transformed to a regex, special regex characters (e.g. ".", "-", ...) are escaped so that the resulting regex matches exactly the original string.

... "enum": ["ab-c"] ...
will be transformed to
... "pattern": "^ab\\-c$" ...
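A short sketch of why the escaping matters, using Python's standard `re.escape`. The helper name `enum_value_to_pattern` is hypothetical and is not taken from the attached script. Note that the doubled backslash in the JSON snippet above is JSON string escaping, and the regex itself contains a single backslash.

```python
import json
import re


def enum_value_to_pattern(value):
    """Turn one string enum value into an anchored, literal regex (hypothetical helper)."""
    return "^" + re.escape(value) + "$"


pattern = enum_value_to_pattern("ab-c")
print(pattern)              # the regex as Python sees it (single backslash)
print(json.dumps(pattern))  # the same regex as it appears inside a JSON file

# Without escaping, "." would match any character, so "a.c" would also accept "abc".
unescaped = "^a.c$"
escaped = enum_value_to_pattern("a.c")
print(bool(re.search(unescaped, "abc")))  # unescaped pattern is too permissive
print(bool(re.search(escaped, "abc")))    # escaped pattern rejects "abc"
print(bool(re.search(escaped, "a.c")))    # and still accepts the literal value
```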

Be careful: this can currently lead to another problem, see #6.

Best Regards
Michael
