Skip to content

Polymorphism w type designator: make linkml-validate work by adding support in dataclasses + enable include_range_class_descendants for json schema validator #1257

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from

Conversation

kervel
Copy link
Collaborator

@kervel kervel commented Jan 21, 2023

New version of pr #1212 , rebased and using a source branch in the linkml repo. Should me merged only after #1224

"controversial" change: i set include_range_class_descendants to true for the schema's generated for jsonschemavalidator. it doesn't break any test in the test suite.

@kervel kervel marked this pull request as draft January 21, 2023 15:24
@kervel
Copy link
Collaborator Author

kervel commented Jan 22, 2023

not yet finished. gen-json-schema will need some refactoring, i think we will need a function that will expands the slot range to a list of all possible classes/enums, for any combination of any_of and polymorphism classes. other generators could also benefit from this ?

the bug in the current PR version is this: i look up all class descendants of slot.range to generate json-schema anyOf which is not correct for enums (there the range is string or even None)

i wonder if there are still other edge cases (combinations of anyOf / allOf and ranges) that will be hard to support.

@kervel kervel force-pushed the feature/poly_dataclasses branch from 9bf4319 to 4b99a79 Compare January 22, 2023 21:20
@kervel
Copy link
Collaborator Author

kervel commented Jan 22, 2023

i just found out about include_range_class_descendants, and my test passes without my json schema changes (so i'll revert them) if i enable it.

But include_range_class_descendants is set to False for jsonschemavalidator, is there a reason for this ? in the validator there is no way to set this to true. On first sight a default of True makes more sense (since not including the class descendants makes a schema that is more strict than what linkml expresses), do we know why the default was set to False ?

@kervel kervel force-pushed the feature/poly_dataclasses branch from 4b99a79 to 8e56f2a Compare January 23, 2023 07:52
@codecov-commenter
Copy link

codecov-commenter commented Jan 23, 2023

Codecov Report

Patch coverage: 90.90% and project coverage change: +0.36% 🎉

Comparison is base (20da3a9) 79.95% compared to head (6c074d7) 80.31%.
Report is 25 commits behind head on main.

❗ Current head 6c074d7 differs from pull request most recent head 3411769. Consider uploading reports for the commit 3411769 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1257      +/-   ##
==========================================
+ Coverage   79.95%   80.31%   +0.36%     
==========================================
  Files          93       77      -16     
  Lines        9711     8906     -805     
  Branches     2382     2164     -218     
==========================================
- Hits         7764     7153     -611     
+ Misses       1502     1333     -169     
+ Partials      445      420      -25     
Files Changed Coverage Δ
linkml/validators/jsonschemavalidator.py 52.94% <66.66%> (-6.89%) ⬇️
linkml/generators/pythongen.py 88.94% <93.10%> (+0.48%) ⬆️
linkml/generators/jsonschemagen.py 87.84% <100.00%> (-1.99%) ⬇️

... and 80 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@kervel kervel force-pushed the feature/poly_dataclasses branch from 8e56f2a to 484a392 Compare January 23, 2023 08:46
@kervel kervel marked this pull request as ready for review January 23, 2023 09:39
@sierra-moxon sierra-moxon requested review from sierra-moxon and kevinschaper and removed request for kevinschaper and sierra-moxon January 24, 2023 21:49
@kervel kervel changed the title Polymorphism using type designator: add support in python and jsonschema generators, so that linkml-validate works with polymorphic data #1212 Polymorphism w type designator make linkml-validate work by adding support in dataclasses + enable include_range_class_descendants for json schema validator Jan 25, 2023
@kervel kervel changed the title Polymorphism w type designator make linkml-validate work by adding support in dataclasses + enable include_range_class_descendants for json schema validator Polymorphism w type designator: make linkml-validate work by adding support in dataclasses + enable include_range_class_descendants for json schema validator Jan 25, 2023
@cmungall cmungall requested a review from hsolbrig January 26, 2023 18:58
@kervel kervel force-pushed the feature/poly_dataclasses branch from 484a392 to 935b8a6 Compare January 26, 2023 20:41
rlines.append(f"""
type_designator = "{aliased_slot_name}"
if not type_designator in kwargs:
return super().__new__(cls,*args,**kwargs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're a little nervous about whether we have enough tests that take advantage of designates_type to catch unintended side effects. One thought I had was that even though this is likely perfectly safe to just pass up to the super constructor without arguments, it might be a step safer to just return an empty string in the case that none of the subclasses has a designates_type slot.

(I think maybe on/before line 782 you might be able to use a list comprehension to count the number of descendants with designates_type slots, rather than the total number of descendants)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello! I'm not sure if i understood everything correctly..

For the list comprehension: don't all child classes of a class having a type designator have a type designator too ? Would there be a chance that the list comprehension would give less classes than the full class descendants list ?

Btw, for the pydantic generator i did the same thing (take full list of class descendants to build the Union[])..

Thanks!
Frank

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In any case, i will add an extra test that verifies the correct working if we create an instance of a class having a type designator, without actually passing the value of the type designator (this used to work before, so it should still work).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the extra test uncovered some issues:

  • I only generate a new when the type designator is a class slot and not an induced slot. This is wrong and breaks if i just instantiate an object of a subclass: then the type designator is not initialized correctly.

  • When i have a class hierarchy A --> B --> C and i instantiate a B that is actually a C (because of its type designator value), it remains at B because the type designator was defined at A and so only A has a new

going to fix these

@kervel kervel force-pushed the feature/poly_dataclasses branch from 935b8a6 to 6c074d7 Compare January 30, 2023 10:19
@kevinschaper
Copy link
Contributor

I played with it interactively and it feels like magic!

@hsolbrig
Copy link
Contributor

hsolbrig commented Jan 31, 2023

You don't need to emit that mapping table -- all of the URI and CURIE information is already available in the generated python as class_class_curie, class_class_uri and class_model_uri. Take a look at linkml/linkml-runtime#243 - using that, the __new__ function can use the cls._class_for_uri or cls._class_for_curie functions.

I think that the above approach allows subclasses to be added that are defined in different models and imported, which should be a bit more versitile?

I noticed that you were using the textual curies in the generated python. I think that there would be some interesting use cases for using external URI's (those declared in the model) vs. the model URI's or curies themselves.

The generated code could then be reduced to:

       type_designator = "category"
        if not type_designator in kwargs:
            return super().__new__(cls,*args,**kwargs)
        else:
            return super().__new__(cls._class_class_curie(kwargs[type_designator]),*args,**kwargs)

I'm not super familiar with the type_designator -- can it be a curie OR a uri? I can see a lot of uses for the latter!!

We still need to address the issue of a type_designator that isn't in the table. At the moment, the functions defined above return None if there is no match. We should probably detect this and emit a sensible error -- I'm sure the None from the return above would generate a really (really) opaque error message ;-)

@cmungall
Copy link
Member

Type designator is Boolean, but the slot it applies to has a range applied like any other, so you can have string fields be type designators, curie field...

@cmungall
Copy link
Member

See the slides from the last community meeting https://docs.google.com/presentation/d/1yIOHGZCNRmk6DYYZSk9Rl_JI1atjZ4Y-fbVydtlC1Ig/edit

@kervel
Copy link
Collaborator Author

kervel commented Jan 31, 2023

Hello @hsolbrig, i tried to implement your changes:

  • elimination of the mapping table
  • raise ValueError when the type designator has a value we don't understand

they work and the tests pass but there is a problem in case the class has a class_uri defined. In that case , actually, a class has two uri's: the native one (generated with prefix and class name) and the one in class_uri. in owlGen the two uris are present and have a skos:exactmatch relation, which seems logical.

the specification on the range of the type designator slots done by @cmungall in #945 doesn't mention which one should be used, so for pydanticgen i used the "native" one. i'd like to do that here to so the json would be interoperable.

When i define a class_uri slot, the class_class_curie is empty (because there is no prefix for the extneral URL i use in my class_uri). This breaks my code now. So i would need a class_model_curie classvar. I tried to do this, but i'm not sure if i'm using the Namespaces object correctly.

I'm now just pushing commits to get your feedback. Once you think it is fine, i will rebase the commits again to have a series of 3 to 4 logical commits.

@kervel
Copy link
Collaborator Author

kervel commented Jan 31, 2023

pipeline failure will be fixed by #1275 hopefully

@hsolbrig
Copy link
Contributor

@kervel - I think that you've touched on something that we may want to discuss just a bit more. There is no reason that every URI should have a corresponding CURIE and, in fact, there is nothing besides convention that prevents a URI from being represented by more than one CURIE. As an example, even IF we assign the prefix, x: to http://example.org/ in the model, there is no reason that a block of data might use ex: or even ns1:.

My own opinion on the matter is that the lookup should always be done by URI, and either the caller or the implementation code needs to do the transformation before doing the lookup. If there isn't a LinkML method available already that converts a CURIE to a URI, I would propose that we need to get one into place. The CurieNamespaces in the generated python module are intended to address part of that problem, but we still need to be able to find the class associated with x:Organization from data that uses ex:Organization instead. To be honest, I'd been a bit hesitant to include curie's in the class information at all for this very reason. While there has been a push for canonical curie's in the bio space, the majority of the communities I'm familiar with deal strictly with uri's and make very little effort to render curie's globally unique -- x:, ex:, 'ns1:' and other variations all show up in a whole lot of test cases, randomly assigned to a variety of different base URI's.

Another option that we might consider is to deal with two type identifier thingiess: strings and URI's. If a string looks like a URI, treat it as such. If not, assume it is the target class name? We do have the default prefix at our disposal, so we could also prepend it, map it to a URI and go from there.

WRT model uri/curie vs class uri/curie: class uri will be the same as model uri if there is no class_uri assignment. If there is, we could either a) assume that class_uri is the correct one, as the developers did go to the trouble of saying as much or b) try class_uri first and, if no match, `model_uri.

I had sort of intended the _class_for_curie and _class_for_uri methods to be the entry points w/ class_for being the common function. I know - not obvious when they ALL start with '_', but I don't want to pollute the namespace.

@kervel
Copy link
Collaborator Author

kervel commented Jan 31, 2023

Hello @hsolbrig,

I understand the logic of first transforming anything in the type designator value to an URI and then doing the lookup. If i understand correctly:

  • if the range is uriorcurie then specifying an uri or a curie should both work. Specifying a class name would not make sense in this case, and i would not expect it to work.
  • if the range is uri then specifying an uri would work, and we could disable the curie->uri translation. Specifying a class name makes no sense either.
  • if the range is a string, then the class name should be specified, and neither uri or curie would make sense here.

So the logic would be to always look at the uri classvar (except for the string range case, which would look at the class name case) and optionally try to translate curies to uri's. It would probably also be sensible to understand both native and non-native uri's/curies (so both class_class_uri and class_model_uri).

I would need some guidance here on doing the curie->uri translation, then i could give it a try.

That still leaves the opposite problem open: i just instantiate a Person() without specifying the type designator value, i would then expect the type designator value to be auto-initialized to the correct value. And then we need a single choice, and i think it makes sense to (in that case) always take the native uri or curie, because those are guaranteed to be always defined (there is always a default_prefix)

For the pydantic generator, i don't think we can completely do the same implementation strategy: the pydantic classes don't have any linkml(runtime) dependency (and that has its advantages too in some cases). That would mean that we would need to build a list of acceptable values for the type designator value for every class at generation time. Current implementation only has a single value, but we could also make it support for instance both the native/non-native and both uri and curie.

@hsolbrig
Copy link
Contributor

hsolbrig commented Feb 1, 2023

Just toying around with "That still leaves the opposite problem open: i just instantiate a Person() without specifying the type designator value, i would then expect the type designator value to be auto-initialized to the correct value. And then we need a single choice, and i think it makes sense to (in that case) always take the native uri or curie, because those are guaranteed to be always defined (there is always a default_prefix)"

Wrt single choice, I'd argue for class_class_uri because it will always be there and, while defaulting to the same as the model, it can be overriden.

A proposed riff on your existing code:

@dataclass
class NamedThing(YAMLRoot):
    _inherited_slots: ClassVar[List[str]] = []

    class_class_uri: ClassVar[URIRef] = URIRef("http://example.org/NamedThing")
    class_class_curie: ClassVar[str] = None
    class_name: ClassVar[str] = "NamedThing"

    class_model_uri: ClassVar[URIRef] = URIRef("http://example.org/NamedThing")

    id: Union[str, NamedThingId] = None
    full_name: Optional[str] = None
    thingtype: URIorCURIE = dataclasses.field(init=False)

    @property
    def thingtype(self) -> URIorCURIE:
        return URIorCURIE(self.class_class_uri)

    @thingtype.setter
    def thingtype(self, v: Any) -> None:
        pass

    def __post_init__(self, *_: List[str], **kwargs: Dict[str, Any]):
        if self._is_empty(self.id):
            self.MissingRequiredField("id")
        if not isinstance(self.id, NamedThingId):
            self.id = NamedThingId(self.id)

        if self.full_name is not None and not isinstance(self.full_name, str):
            self.full_name = str(self.full_name)

        super().__post_init__(**kwargs)



    def __new__(cls, *args, **kwargs):
        uri = kwargs.pop("thingtype", cls.class_class_uri)
        target_cls = self._class_for_uri(uri)
        if target_cls is None:
            # Should this be a warning instead?  If so, what do we do with the new URI?
            raise ValueError(
                f"Wrong type designator value: class NamedThing has no subclass with class_class_uri='{uri}'")
        return super().__new__(target_cls, *args, **kwargs)

I popped "thingtype" from the kwargs because we don't want to try to be setting it. That said, we might get more clever if we're going to look up by name or curie.

Just pfutzing around...

when I run:

p = Person(id="Fred")
print(repr(p))
p.thingtype = 'abc'

I get

Person(id='Fred', full_name=None, thingtype='http://testbreaker/not-the-uri-you-expect', height=None)

Which, despite the "not-the-uri-you-expect", I think that it may be the one we wanted.

The setter is there because I can't convince the dataclass innards from trying to set it. It does leave us with an unfortunate side effect in that p.thingtype = 'abc' is a no-op. I suppose we could get slightly more clever and throw
an error if the input argument isn't none, but I'm curious whether this is still more or less on track.

@hsolbrig
Copy link
Contributor

hsolbrig commented Feb 1, 2023

Just to double check -- by "native", we mean the URI that is assigned to the class by combining the default namespace of the model with the class name. I'll use "assigned" to refer to a second URI that can be assigned to a class by the modeler.

The purpose of the native URI's was to assign a default URI to every class, slot and type in the model. This allowed someone to develop an RDF based model without having to know much about RDF. In the python dataclass based Class, this URI is referred to as class_model_uri. The class_uri (and slot_uri and type_uri - apologies for the less than consistent naming) were intended for use by modelers who had RDF integration in mind -- they could override the default URI and assign an external one instead. The class_class_uri property carries the native URI if a class_uri hasn't been specified, otherwise it would be the override.

Switching to type_definition because it is already taking full advantage of this, we have stated that RDF will refer to a LinkML integer as xsd:integer, even though the LinkML model treats it as `linkml:integer'

class Integer(int):
    """ An integer """
    type_class_uri = XSD.integer
    type_class_curie = "xsd:integer"
    type_name = "integer"
    type_model_uri = LINKML.Integer

While the Class, SubSetDefinition assumes the native LinkML URI:

@dataclass
class SubsetDefinition(Element):
    """
    the name and description of a subset
    """
    _inherited_slots: ClassVar[List[str]] = []

    class_class_uri: ClassVar[URIRef] = LINKML.SubsetDefinition
    class_class_curie: ClassVar[str] = "linkml:SubsetDefinition"
    class_name: ClassVar[str] = "subset_definition"
    class_model_uri: ClassVar[URIRef] = LINKML.SubsetDefinition

One of the goals of this assignment was to address the sort of model used in the HL7 FHIR Observation model, where the LOINC code (a URI) could be used to identify the particular model instance.

The reason that I'm pushing back on using the "native" URI is that it prevents this second situation -- if you have a class instance that is derived from and external model that uses LOINC, you will need to maintain an external map between the LOINC URI's and corresponding native model uri's, even though that map is already intrinsically available.

Thinking about the Observation model, however, it occurs to me that what we may want to think about doing is to associate the type designator with an enumeration and, perhaps, also consider semi-automatic generation of classes.

@kervel
Copy link
Collaborator Author

kervel commented Feb 1, 2023

Hello,

I think i'm in a similar case: i'm trying to create a linkml model that uses parts of RSM (https://rsm.uic.org/doc/rsm/rsm-1-2/). There, the classes and slots have URI's but they are generated by (i think) enterprise architect, and not human readable. For instance, this is the uri for the "latitude" slot: http://rsm.uic.org/RSM12#EAID_ABEB1DF5_8277_452c_9B67_6786E4FA9620.
Setting them as class_uri and slot_uri makes the owl generator emit an exactmatch between the native and the assigned uri.

The json serialisation format (and the python classes) are things i want to use internally, and for external interop i could convert to json-ld or rdf. So i would get a slight edge by using the native uri's in my case because the json would be more human readable. However, since i am in control over the json format, switching is not a dealbreaker for me, and i totally understand your arguments, and i'm happy to switch (i will then also switch the pydantic implementation to be consistent).

I then think the remaining thing is to have some (maybe generated) class that can convert between uri's and curies for all known namespaces. On the slack channel there was a reminder not to create new uri/curie converters and using https://github.com/cthoyt/curies for that task. It is already a dependency of linkml-runtime. Would it be an idea to generate the creation of a converter with all known namespaces ? a Converter has no function yet to convert an uriorcurie (it needs to be either an uri or a curie) so we would need to look for the presence of :/ which is a bit ugly.

the generated code would then look like so:

# Namespaces
LINKML = CurieNamespace('linkml', 'https://w3id.org/linkml/')
SHEX = CurieNamespace('shex', 'http://www.w3.org/ns/shex#')
X = CurieNamespace('x', 'http://example.org/')
XSD = CurieNamespace('xsd', 'http://www.w3.org/2001/XMLSchema#')
DEFAULT_ = X
# converter (we would probably need to extend this class to provide a uri_from_uriorcurie function
NamespaceConverter = curies.Converter.from_prefix_map({
     LINKML.prefix: LINKML,
     SHEX.prefix: SHEX,
     X.prefix: X,
     XSD.prefix, XSD
})

so the logic would then be:

  • for the "instantiate a blank object" case --> we use class_class_uri or class_class_curie (with a fallback to class_class_uri if there is no curie

  • for the "load json" case --> we take whatever is the input, convert it to an uri and then try to find a class that has either a matching class_class_uri or a matching class_model_uri (this way we accept all know uri's/curies for a class as type designators)

would that make sense ?

@hsolbrig
Copy link
Contributor

hsolbrig commented Feb 2, 2023

Actually, CurieNamespace was intended to handle exactly that problem already. We never completed it. Here is a proposed solution:

from typing import Optional, Union, Dict

from rdflib import Namespace, URIRef


class CurieNamespace(Namespace):
    catalog: Dict[str, "CurieNamespace"] = dict()

    @classmethod
    def to_curie(cls, uri: Union[str, URIRef]) -> str:
        uri = str(uri)
        candidate_ns = ""
        for prefix, ns in cls.catalog.items():
            if uri.startswith(ns) and len(ns) > len(candidate_ns):
                candidate_ns = ns
        if candidate_ns:
            return candidate_ns.curie(uri[len(candidate_ns):])
        return None

    @classmethod
    def to_uri(cls, curie: str) -> Optional[URIRef]:
        prefix, localname = curie.split(':', 1)
        ns = CurieNamespace.catalog.get(prefix, None)
        return ns[localname] if ns else None

    def __new__(cls, prefix: str, ns: Union[str, bytes, URIRef]) -> "CurieNamespace":
        rt = Namespace.__new__(cls, str(ns) if not isinstance(ns, bytes) else ns)
        rt.prefix = prefix
        CurieNamespace.catalog[prefix] = rt
        return rt

    def curie(self, reference: Optional[str] = '') -> str:
        return self.prefix + ':' + reference

As a test,

print(CurieNamespace.to_curie(XSD.foo))
xsd:foo

print(CurieNamespace.to_uri("linkml:stuff"))
https://w3id.org/linkml/stuff

If this looks ok, I'll put in a PR on linkml-runtime for this tomorrow. There are some outstanding issues we would want to consider to make this robust. It looks like we might want to leverage the curiemap you had above a bit more directly?

As far as issues, one of the interesting things is, as catalog is a class variable, we end up accumulating every namespace defined in a particular module. As an example, using that polymorphism test model that you created, we get:

pprint(CurieNamespace.catalog)

{'': Namespace('http://example.org/'),
 'NCIT': Namespace('http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#'),
 'OIO': Namespace('http://www.geneontology.org/formats/oboInOwl#'),
 'bibo': Namespace('http://purl.org/ontology/bibo/'),
 'dcterms': Namespace('http://purl.org/dc/terms/'),
 'linkml': Namespace('https://w3id.org/linkml/'),
 'oslc': Namespace('http://open-services.net/ns/core#'),
 'owl': Namespace('http://www.w3.org/2002/07/owl#'),
 'pav': Namespace('http://purl.org/pav/'),
 'prov': Namespace('http://www.w3.org/ns/prov#'),
 'qb': Namespace('http://purl.org/linked-data/cube#'),
 'qudt': Namespace('http://qudt.org/schema/qudt/'),
 'rdf': Namespace('http://www.w3.org/1999/02/22-rdf-syntax-ns#'),
 'rdfs': Namespace('http://www.w3.org/2000/01/rdf-schema#'),
 'schema': Namespace('http://schema.org/'),
 'sh': Namespace('https://w3id.org/shacl/'),
 'shex': Namespace('http://www.w3.org/ns/shex#'),
 'skos': Namespace('http://www.w3.org/2004/02/skos/core#'),
 'skosxl': Namespace('http://www.w3.org/2008/05/skos-xl#'),
 'swrl': Namespace('http://www.w3.org/2003/11/swrl#'),
 'tccm': Namespace('https://ontologies.r.us/tccm/'),
 'vann': Namespace('https://vocab.org/vann/'),
 'xsd': Namespace('http://www.w3.org/2001/XMLSchema#')}

I'd recommend that we detect collisions (x:, ex: and : being prime examples) and decide what to do about them.

@hsolbrig
Copy link
Contributor

hsolbrig commented Feb 2, 2023

for the "instantiate a blank object" case --> we use class_class_uri or class_class_curie (with a fallback to class_class_uri if there is no curie

for the "load json" case --> we take whatever is the input, convert it to an uri and then try to find a class that has either a matching class_class_uri or a matching class_model_uri (this way we accept all know uri's/curies for a class as type designators)

Looks good to me. I'd prefer trying to stick to uri's when it comes to data representation. If you start putting curies into databases, you're sort of obligated to include the accompanying prefix maps and, when it comes to comparisons uri <--> uri comparisons always work. curie <--> curie comparisons are dependent on the maps being available and permanent.

@kervel
Copy link
Collaborator Author

kervel commented Feb 2, 2023

hello @hsolbrig ,

  • About the "curies in databases" remark: i guess this translates to "do curies make sense as type designator or not". And there i think both @cmungall and @sierra-moxon argued for yes (see [WIP] Pydanticgen: support polymorphism using type designators #1180 (comment) and the description of Add support for designates_type in all generators #945)

  • Didn't know about curienamespace, but as far as i can judge it would be fine for what we need. I'll try to follow up your PR with a new version of this one, and with a new PR for pydantic. Not certain about the prefixmap collisions, we could start by introducing a testcase that purposely introduces a collision. I don't know the other generators, the problem might be bigger than just the python generator then. Is requiring the prefixes to be globally unique in case of nested (imported) linkml schema's not limiting flexibility ? On the other hand, if we don't scope the prefixes, then we'd better have an explicit error for collisions.

@hsolbrig
Copy link
Contributor

hsolbrig commented Feb 2, 2023

I've proposed a CurieNamespace change (linkml/linkml-runtime#244). As noted in the PR, it doesn't use the curies package, because I couldn't figure out how to incrementally update the curies vs. in one big batch. Any solution which passes test_curienamespace.py should be good.

@kervel
Copy link
Collaborator Author

kervel commented Mar 23, 2023

test will fail until we merge linkml/linkml-runtime#251

there is also a docgen test that fails on my system, but i'm running with linkml-runtime from the main branch with PR 251 applied, maybe that's the cause ?

from .test_pydantic_polymorphism import data_str, gen_test_schema
import json

class PydanticPolymorphismTestCase(TestEnvironmentTestCase):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this mis-named? this seems to be testing PythonGenerator (i.e dataclass generation)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it was originally meant to test the whole linkml-validate flow for polymorphism (which includes python generator but also json schema validation), but since the json schema validation is now a separate PR, it makes sense to rename the file. will do.

Copy link
Member

@cmungall cmungall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused by the test cases

  1. test_pydantic_polymorphism changes look OK but seem unrelated to what is now the focus of the PR, which is pythongen
  2. the 2nd test does seem to test pythongen but has pydantic in the name

@kervel
Copy link
Collaborator Author

kervel commented Mar 25, 2023

I'm a bit confused by the test cases

  1. test_pydantic_polymorphism changes look OK but seem unrelated to what is now the focus of the PR, which is pythongen
  2. the 2nd test does seem to test pythongen but has pydantic in the name

I tried to make pydantic and python tests use the same data ... i made this more clear now by having them import from a common module instead of the one importing the other. As a bonus i added a type hierarchy test just like we had in pydantic ... which fails now :( so more work needed.

kervel added 2 commits March 28, 2023 10:40
Because the pydantic and the python tests now have similar test cases, i refactored the pydantic test so that they can use common test schema's and data.
@kervel kervel force-pushed the feature/poly_dataclasses branch from 0dd6e94 to 8715683 Compare March 28, 2023 08:41
@capsulecorplab
Copy link

Any progress on this effort? Looking forward to this feature being added to the linkml toolchain

@kervel
Copy link
Collaborator Author

kervel commented May 13, 2023

i'm stuck on this one: linkml/linkml-runtime#251 .. once i get the feedback i dont think there is much work left

@@ -162,7 +169,7 @@ def gen_schema(self) -> str:
from linkml_runtime.utils.formatutils import camelcase, underscore, sfx
{handlerimport}
from rdflib import Namespace, URIRef
from linkml_runtime.utils.curienamespace import CurieNamespace
from linkml_runtime.utils.curienamespace import CurieNamespace, CurieNamespaceCatalog
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We think this PR should be decoupled from changes in linkml-runtime

assert len([x for x in output_subset if "x:Person" in x]) == 1

gen = PydanticGenerator(schema_str.replace("uriorcurie", "uri"))
output = gen.serialize()
output_subset = [line for line in output.splitlines() if "thingtype" in line]
assert len(output_subset) > 0
assert len([x for x in output_subset if "http://example.org/Person" in x]) == 1

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

something odd is going on with the diff - in the intervening time between creation of this PR and now we refactored to use pytest. This is showing up here like unittest except unindented...

cmungall added a commit that referenced this pull request Sep 12, 2023
This PR incorporates work done in #1257 by @kervel
@Silvanoc
Copy link
Contributor

Sorry for jumping into the conversation this late, but this MR is touching a topics that I'm interested on.

@hsolbrig wrote

My own opinion on the matter is that the lookup should always be done by URI, and either the caller or the implementation code needs to do the transformation before doing the lookup. If there isn't a LinkML method available already that converts a CURIE to a URI, I would propose that we need to get one into place. The CurieNamespaces in the generated python module are intended to address part of that problem, but we still need to be able to find the class associated with x:Organization from data that uses ex:Organization instead. To be honest, I'd been a bit hesitant to include curie's in the class information at all for this very reason. While there has been a push for canonical curie's in the bio space, the majority of the communities I'm familiar with deal strictly with uri's and make very little effort to render curie's globally unique -- x:, ex:, 'ns1:' and other variations all show up in a whole lot of test cases, randomly assigned to a variety of different base URI's.

As stated in this comment to issue #258, I find the usage of CURIEs/URIs in LinkML needs some improvements:

  1. I agree with @hsolbrig that LinkML should use internally only URIs. CURIEs should be only syntactic sugar for the users that LinkML would always convert into URIs. It would imply that two CURIEs with different prefixes, but the same suffix, might end up being the same (like x:Organization and ex:Organization in the above example).
  2. As long as LinkML doesn't use SafeCURIEs, there is some space for ambiguity between URIs and CURIEs. As an alternative JSON-LD Compact IRIs significantly reduce the room for ambiguities, because only listed prefixes are interpreted as CURIEs, otherwise URIs are expected.

WRT designates_type, I don't see how a string can be used to validate, unless it's clear that it can only be a class name (would it then be expected to be a CURIE with the default prefix?). Otherwise continuing the above logic I would say that a CURIE (even when relying on the default prefix) should only be valid if the resulting URI identifies a valid class.

@kervel
Copy link
Collaborator Author

kervel commented Sep 26, 2023

see #1617

@kervel kervel closed this Sep 26, 2023
@pkalita-lbl pkalita-lbl deleted the feature/poly_dataclasses branch December 22, 2023 17:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants