-
Notifications
You must be signed in to change notification settings - Fork 122
Polymorphism w type designator: make linkml-validate work by adding support in dataclasses + enable include_range_class_descendants for json schema validator #1257
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
not yet finished. gen-json-schema will need some refactoring, i think we will need a function that will expands the slot range to a list of all possible classes/enums, for any combination of the bug in the current PR version is this: i look up all class descendants of slot.range to generate json-schema anyOf which is not correct for enums (there the range is string or even None) i wonder if there are still other edge cases (combinations of anyOf / allOf and ranges) that will be hard to support. |
9bf4319
to
4b99a79
Compare
i just found out about include_range_class_descendants, and my test passes without my json schema changes (so i'll revert them) if i enable it. But |
4b99a79
to
8e56f2a
Compare
Codecov ReportPatch coverage:
Additional details and impacted files@@ Coverage Diff @@
## main #1257 +/- ##
==========================================
+ Coverage 79.95% 80.31% +0.36%
==========================================
Files 93 77 -16
Lines 9711 8906 -805
Branches 2382 2164 -218
==========================================
- Hits 7764 7153 -611
+ Misses 1502 1333 -169
+ Partials 445 420 -25
☔ View full report in Codecov by Sentry. |
8e56f2a
to
484a392
Compare
484a392
to
935b8a6
Compare
rlines.append(f""" | ||
type_designator = "{aliased_slot_name}" | ||
if not type_designator in kwargs: | ||
return super().__new__(cls,*args,**kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're a little nervous about whether we have enough tests that take advantage of designates_type to catch unintended side effects. One thought I had was that even though this is likely perfectly safe to just pass up to the super constructor without arguments, it might be a step safer to just return an empty string in the case that none of the subclasses has a designates_type slot.
(I think maybe on/before line 782 you might be able to use a list comprehension to count the number of descendants with designates_type slots, rather than the total number of descendants)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello! I'm not sure if i understood everything correctly..
For the list comprehension: don't all child classes of a class having a type designator have a type designator too ? Would there be a chance that the list comprehension would give less classes than the full class descendants list ?
Btw, for the pydantic generator i did the same thing (take full list of class descendants to build the Union[])..
Thanks!
Frank
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In any case, i will add an extra test that verifies the correct working if we create an instance of a class having a type designator, without actually passing the value of the type designator (this used to work before, so it should still work).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the extra test uncovered some issues:
-
I only generate a new when the type designator is a class slot and not an induced slot. This is wrong and breaks if i just instantiate an object of a subclass: then the type designator is not initialized correctly.
-
When i have a class hierarchy A --> B --> C and i instantiate a B that is actually a C (because of its type designator value), it remains at B because the type designator was defined at A and so only A has a new
going to fix these
935b8a6
to
6c074d7
Compare
I played with it interactively and it feels like magic! |
You don't need to emit that mapping table -- all of the URI and CURIE information is already available in the generated python as I think that the above approach allows subclasses to be added that are defined in different models and imported, which should be a bit more versitile? I noticed that you were using the textual curies in the generated python. I think that there would be some interesting use cases for using external URI's (those declared in the model) vs. the model URI's or curies themselves. The generated code could then be reduced to: type_designator = "category"
if not type_designator in kwargs:
return super().__new__(cls,*args,**kwargs)
else:
return super().__new__(cls._class_class_curie(kwargs[type_designator]),*args,**kwargs) I'm not super familiar with the type_designator -- can it be a curie OR a uri? I can see a lot of uses for the latter!! We still need to address the issue of a type_designator that isn't in the table. At the moment, the functions defined above return None if there is no match. We should probably detect this and emit a sensible error -- I'm sure the |
Type designator is Boolean, but the slot it applies to has a range applied like any other, so you can have string fields be type designators, curie field... |
See the slides from the last community meeting https://docs.google.com/presentation/d/1yIOHGZCNRmk6DYYZSk9Rl_JI1atjZ4Y-fbVydtlC1Ig/edit |
Hello @hsolbrig, i tried to implement your changes:
they work and the tests pass but there is a problem in case the class has a the specification on the range of the type designator slots done by @cmungall in #945 doesn't mention which one should be used, so for pydanticgen i used the "native" one. i'd like to do that here to so the json would be interoperable. When i define a class_uri slot, the class_class_curie is empty (because there is no prefix for the extneral URL i use in my class_uri). This breaks my code now. So i would need a I'm now just pushing commits to get your feedback. Once you think it is fine, i will rebase the commits again to have a series of 3 to 4 logical commits. |
pipeline failure will be fixed by #1275 hopefully |
@kervel - I think that you've touched on something that we may want to discuss just a bit more. There is no reason that every URI should have a corresponding CURIE and, in fact, there is nothing besides convention that prevents a URI from being represented by more than one CURIE. As an example, even IF we assign the prefix, My own opinion on the matter is that the lookup should always be done by URI, and either the caller or the implementation code needs to do the transformation before doing the lookup. If there isn't a LinkML method available already that converts a CURIE to a URI, I would propose that we need to get one into place. The CurieNamespaces in the generated python module are intended to address part of that problem, but we still need to be able to find the class associated with Another option that we might consider is to deal with two type identifier thingiess: strings and URI's. If a string looks like a URI, treat it as such. If not, assume it is the target class name? We do have the default prefix at our disposal, so we could also prepend it, map it to a URI and go from there. WRT model uri/curie vs class uri/curie: class uri will be the same as model uri if there is no I had sort of intended the |
Hello @hsolbrig, I understand the logic of first transforming anything in the type designator value to an URI and then doing the lookup. If i understand correctly:
So the logic would be to always look at the uri classvar (except for the string range case, which would look at the class name case) and optionally try to translate curies to uri's. It would probably also be sensible to understand both native and non-native uri's/curies (so both class_class_uri and class_model_uri). I would need some guidance here on doing the curie->uri translation, then i could give it a try. That still leaves the opposite problem open: i just instantiate a Person() without specifying the type designator value, i would then expect the type designator value to be auto-initialized to the correct value. And then we need a single choice, and i think it makes sense to (in that case) always take the native uri or curie, because those are guaranteed to be always defined (there is always a default_prefix) For the pydantic generator, i don't think we can completely do the same implementation strategy: the pydantic classes don't have any linkml(runtime) dependency (and that has its advantages too in some cases). That would mean that we would need to build a list of acceptable values for the type designator value for every class at generation time. Current implementation only has a single value, but we could also make it support for instance both the native/non-native and both uri and curie. |
Just toying around with "That still leaves the opposite problem open: i just instantiate a Person() without specifying the type designator value, i would then expect the type designator value to be auto-initialized to the correct value. And then we need a single choice, and i think it makes sense to (in that case) always take the native uri or curie, because those are guaranteed to be always defined (there is always a default_prefix)" Wrt single choice, I'd argue for A proposed riff on your existing code: @dataclass
class NamedThing(YAMLRoot):
_inherited_slots: ClassVar[List[str]] = []
class_class_uri: ClassVar[URIRef] = URIRef("http://example.org/NamedThing")
class_class_curie: ClassVar[str] = None
class_name: ClassVar[str] = "NamedThing"
class_model_uri: ClassVar[URIRef] = URIRef("http://example.org/NamedThing")
id: Union[str, NamedThingId] = None
full_name: Optional[str] = None
thingtype: URIorCURIE = dataclasses.field(init=False)
@property
def thingtype(self) -> URIorCURIE:
return URIorCURIE(self.class_class_uri)
@thingtype.setter
def thingtype(self, v: Any) -> None:
pass
def __post_init__(self, *_: List[str], **kwargs: Dict[str, Any]):
if self._is_empty(self.id):
self.MissingRequiredField("id")
if not isinstance(self.id, NamedThingId):
self.id = NamedThingId(self.id)
if self.full_name is not None and not isinstance(self.full_name, str):
self.full_name = str(self.full_name)
super().__post_init__(**kwargs)
def __new__(cls, *args, **kwargs):
uri = kwargs.pop("thingtype", cls.class_class_uri)
target_cls = self._class_for_uri(uri)
if target_cls is None:
# Should this be a warning instead? If so, what do we do with the new URI?
raise ValueError(
f"Wrong type designator value: class NamedThing has no subclass with class_class_uri='{uri}'")
return super().__new__(target_cls, *args, **kwargs) I popped "thingtype" from the kwargs because we don't want to try to be setting it. That said, we might get more clever if we're going to look up by name or curie. Just pfutzing around... when I run: p = Person(id="Fred")
print(repr(p))
p.thingtype = 'abc' I get
Which, despite the "not-the-uri-you-expect", I think that it may be the one we wanted. The setter is there because I can't convince the dataclass innards from trying to set it. It does leave us with an unfortunate side effect in that |
Just to double check -- by "native", we mean the URI that is assigned to the class by combining the default namespace of the model with the class name. I'll use "assigned" to refer to a second URI that can be assigned to a class by the modeler. The purpose of the native URI's was to assign a default URI to every class, slot and type in the model. This allowed someone to develop an RDF based model without having to know much about RDF. In the python Switching to
While the Class,
One of the goals of this assignment was to address the sort of model used in the HL7 FHIR Observation model, where the LOINC code (a URI) could be used to identify the particular model instance. The reason that I'm pushing back on using the "native" URI is that it prevents this second situation -- if you have a class instance that is derived from and external model that uses LOINC, you will need to maintain an external map between the LOINC URI's and corresponding native model uri's, even though that map is already intrinsically available. Thinking about the Observation model, however, it occurs to me that what we may want to think about doing is to associate the type designator with an enumeration and, perhaps, also consider semi-automatic generation of classes. |
Hello, I think i'm in a similar case: i'm trying to create a linkml model that uses parts of RSM (https://rsm.uic.org/doc/rsm/rsm-1-2/). There, the classes and slots have URI's but they are generated by (i think) enterprise architect, and not human readable. For instance, this is the uri for the "latitude" slot: http://rsm.uic.org/RSM12#EAID_ABEB1DF5_8277_452c_9B67_6786E4FA9620. The json serialisation format (and the python classes) are things i want to use internally, and for external interop i could convert to json-ld or rdf. So i would get a slight edge by using the native uri's in my case because the json would be more human readable. However, since i am in control over the json format, switching is not a dealbreaker for me, and i totally understand your arguments, and i'm happy to switch (i will then also switch the pydantic implementation to be consistent). I then think the remaining thing is to have some (maybe generated) class that can convert between uri's and curies for all known namespaces. On the slack channel there was a reminder not to create new uri/curie converters and using https://github.com/cthoyt/curies for that task. It is already a dependency of linkml-runtime. Would it be an idea to generate the creation of a converter with all known namespaces ? a the generated code would then look like so: # Namespaces
LINKML = CurieNamespace('linkml', 'https://w3id.org/linkml/')
SHEX = CurieNamespace('shex', 'http://www.w3.org/ns/shex#')
X = CurieNamespace('x', 'http://example.org/')
XSD = CurieNamespace('xsd', 'http://www.w3.org/2001/XMLSchema#')
DEFAULT_ = X
# converter (we would probably need to extend this class to provide a uri_from_uriorcurie function
NamespaceConverter = curies.Converter.from_prefix_map({
LINKML.prefix: LINKML,
SHEX.prefix: SHEX,
X.prefix: X,
XSD.prefix, XSD
}) so the logic would then be:
would that make sense ? |
Actually, CurieNamespace was intended to handle exactly that problem already. We never completed it. Here is a proposed solution: from typing import Optional, Union, Dict
from rdflib import Namespace, URIRef
class CurieNamespace(Namespace):
catalog: Dict[str, "CurieNamespace"] = dict()
@classmethod
def to_curie(cls, uri: Union[str, URIRef]) -> str:
uri = str(uri)
candidate_ns = ""
for prefix, ns in cls.catalog.items():
if uri.startswith(ns) and len(ns) > len(candidate_ns):
candidate_ns = ns
if candidate_ns:
return candidate_ns.curie(uri[len(candidate_ns):])
return None
@classmethod
def to_uri(cls, curie: str) -> Optional[URIRef]:
prefix, localname = curie.split(':', 1)
ns = CurieNamespace.catalog.get(prefix, None)
return ns[localname] if ns else None
def __new__(cls, prefix: str, ns: Union[str, bytes, URIRef]) -> "CurieNamespace":
rt = Namespace.__new__(cls, str(ns) if not isinstance(ns, bytes) else ns)
rt.prefix = prefix
CurieNamespace.catalog[prefix] = rt
return rt
def curie(self, reference: Optional[str] = '') -> str:
return self.prefix + ':' + reference As a test,
If this looks ok, I'll put in a PR on linkml-runtime for this tomorrow. There are some outstanding issues we would want to consider to make this robust. It looks like we might want to leverage the As far as issues, one of the interesting things is, as pprint(CurieNamespace.catalog)
{'': Namespace('http://example.org/'),
'NCIT': Namespace('http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#'),
'OIO': Namespace('http://www.geneontology.org/formats/oboInOwl#'),
'bibo': Namespace('http://purl.org/ontology/bibo/'),
'dcterms': Namespace('http://purl.org/dc/terms/'),
'linkml': Namespace('https://w3id.org/linkml/'),
'oslc': Namespace('http://open-services.net/ns/core#'),
'owl': Namespace('http://www.w3.org/2002/07/owl#'),
'pav': Namespace('http://purl.org/pav/'),
'prov': Namespace('http://www.w3.org/ns/prov#'),
'qb': Namespace('http://purl.org/linked-data/cube#'),
'qudt': Namespace('http://qudt.org/schema/qudt/'),
'rdf': Namespace('http://www.w3.org/1999/02/22-rdf-syntax-ns#'),
'rdfs': Namespace('http://www.w3.org/2000/01/rdf-schema#'),
'schema': Namespace('http://schema.org/'),
'sh': Namespace('https://w3id.org/shacl/'),
'shex': Namespace('http://www.w3.org/ns/shex#'),
'skos': Namespace('http://www.w3.org/2004/02/skos/core#'),
'skosxl': Namespace('http://www.w3.org/2008/05/skos-xl#'),
'swrl': Namespace('http://www.w3.org/2003/11/swrl#'),
'tccm': Namespace('https://ontologies.r.us/tccm/'),
'vann': Namespace('https://vocab.org/vann/'),
'xsd': Namespace('http://www.w3.org/2001/XMLSchema#')} I'd recommend that we detect collisions ( |
Looks good to me. I'd prefer trying to stick to uri's when it comes to data representation. If you start putting curies into databases, you're sort of obligated to include the accompanying prefix maps and, when it comes to comparisons uri <--> uri comparisons always work. curie <--> curie comparisons are dependent on the maps being available and permanent. |
hello @hsolbrig ,
|
I've proposed a CurieNamespace change (linkml/linkml-runtime#244). As noted in the PR, it doesn't use the |
be5c32f
to
65c81b8
Compare
65c81b8
to
968497f
Compare
test will fail until we merge linkml/linkml-runtime#251 there is also a docgen test that fails on my system, but i'm running with linkml-runtime from the main branch with PR 251 applied, maybe that's the cause ? |
from .test_pydantic_polymorphism import data_str, gen_test_schema | ||
import json | ||
|
||
class PydanticPolymorphismTestCase(TestEnvironmentTestCase): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this mis-named? this seems to be testing PythonGenerator (i.e dataclass generation)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it was originally meant to test the whole linkml-validate flow for polymorphism (which includes python generator but also json schema validation), but since the json schema validation is now a separate PR, it makes sense to rename the file. will do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit confused by the test cases
- test_pydantic_polymorphism changes look OK but seem unrelated to what is now the focus of the PR, which is pythongen
- the 2nd test does seem to test pythongen but has pydantic in the name
I tried to make pydantic and python tests use the same data ... i made this more clear now by having them import from a common module instead of the one importing the other. As a bonus i added a type hierarchy test just like we had in pydantic ... which fails now :( so more work needed. |
…ion of range of type designators in #945)
Because the pydantic and the python tests now have similar test cases, i refactored the pydantic test so that they can use common test schema's and data.
0dd6e94
to
8715683
Compare
Any progress on this effort? Looking forward to this feature being added to the linkml toolchain |
i'm stuck on this one: linkml/linkml-runtime#251 .. once i get the feedback i dont think there is much work left |
@@ -162,7 +169,7 @@ def gen_schema(self) -> str: | |||
from linkml_runtime.utils.formatutils import camelcase, underscore, sfx | |||
{handlerimport} | |||
from rdflib import Namespace, URIRef | |||
from linkml_runtime.utils.curienamespace import CurieNamespace | |||
from linkml_runtime.utils.curienamespace import CurieNamespace, CurieNamespaceCatalog |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We think this PR should be decoupled from changes in linkml-runtime
assert len([x for x in output_subset if "x:Person" in x]) == 1 | ||
|
||
gen = PydanticGenerator(schema_str.replace("uriorcurie", "uri")) | ||
output = gen.serialize() | ||
output_subset = [line for line in output.splitlines() if "thingtype" in line] | ||
assert len(output_subset) > 0 | ||
assert len([x for x in output_subset if "http://example.org/Person" in x]) == 1 | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
something odd is going on with the diff - in the intervening time between creation of this PR and now we refactored to use pytest. This is showing up here like unittest except unindented...
Sorry for jumping into the conversation this late, but this MR is touching a topics that I'm interested on.
As stated in this comment to issue #258, I find the usage of CURIEs/URIs in LinkML needs some improvements:
WRT |
see #1617 |
New version of pr #1212 , rebased and using a source branch in the linkml repo. Should me merged only after #1224
"controversial" change: i set
include_range_class_descendants
to true for the schema's generated for jsonschemavalidator. it doesn't break any test in the test suite.