Division between array serialization and specification #6

Open
@sneakers-the-rat

I've said this a few times when we've talked on Zoom during the hackathons, so I don't mean to sound like a broken record, but one place where a lot of prior schema languages have gone wrong with array specification is in taking on too much of the weight of specifying the actual encoding of the arrays, rather than being a schematic description that is generic across serializations.

The generality of the current form is pretty good! One way that I see us buying more complexity than we need, though, is the GroupingByArrayOrder idea:
https://github.com/linkml/linkml-model/blob/aab9842be0e230c0040688dfc6ffa26696c97827/linkml_model/model/schema/array.yaml#L67-L94

That's an implementation detail of how arrays are stored and indexed. I don't think we should touch the storage part in the schema, and the indexing part is handled by the rest of the array specification, right? I could be missing something that requires it to be specified in the schema, but in general I think it would be good to make a clear separation of concerns here. A decent test is: "can this array specification be satisfied in such a way that the schema knows absolutely nothing about the way that the array is serialized?" Under that test, getting the array ordering correct is the responsibility of the dumper/loader, just as we would expect the dumper/loader to correctly handle chunking and other serialization details.
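To make that test concrete, here is a minimal sketch in plain Python. Everything in it is hypothetical and only for illustration: `ArraySpec`, `load_row_major`, `load_column_major`, and `satisfies` are made-up names, not anything in linkml-model. The point is just that the spec carries a logical shape and nothing else, while each loader owns its own memory order and still yields the same logical array.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArraySpec:
    """Hypothetical schema-level description: logical shape only, no storage order."""
    shape: tuple  # (rows, cols)


def load_row_major(flat, shape):
    """Hypothetical loader for a buffer stored row by row (C order)."""
    rows, cols = shape
    return [[flat[r * cols + c] for c in range(cols)] for r in range(rows)]


def load_column_major(flat, shape):
    """Hypothetical loader for a buffer stored column by column (Fortran order)."""
    rows, cols = shape
    return [[flat[c * rows + r] for c in range(cols)] for r in range(rows)]


def satisfies(spec, array):
    """Check the logical shape only; the spec never sees how the data was stored."""
    rows, cols = spec.shape
    return len(array) == rows and all(len(row) == cols for row in array)


spec = ArraySpec(shape=(2, 3))

# The same logical 2x3 array, laid out in two different storage orders.
row_major_buffer = [1, 2, 3, 4, 5, 6]
column_major_buffer = [1, 4, 2, 5, 3, 6]

from_row_major = load_row_major(row_major_buffer, spec.shape)
from_column_major = load_column_major(column_major_buffer, spec.shape)

# Both loaders recover the same array, and both satisfy the same spec.
assert from_row_major == from_column_major
assert satisfies(spec, from_row_major)
assert satisfies(spec, from_column_major)
```

In this sketch the ordering knowledge lives entirely in the two loader functions, so nothing like GroupingByArrayOrder needs to appear in `ArraySpec`.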

This is actually what I want to work on at the hackashop: a second set of specifications for declaring serializations, so that in a linked data context one would be able to say "this particular array has n linked serializations - this numpy format, that zarr format, etc." without having that be specified in the array's schema. That is, a way of saying "this particular hash of a binary stream is annotated with being a numpy ndarray with shape (x,y)", plus all the other details needed to handle the serialization/deserialization, which could then be consumed by generalized dumpers/loaders. So we may want to just talk about this next week :)
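A rough sketch of what such an annotation could look like, using only the Python standard library. This is an assumption-laden illustration, not a proposed design: `SerializationAnnotation`, `annotate`, `load`, and the `"raw-float64-le"` format label are all invented here, and a real version would point at actual formats like numpy or zarr rather than a raw packed buffer.

```python
import hashlib
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class SerializationAnnotation:
    """Hypothetical record describing one binary stream, kept outside the array schema."""
    sha256: str   # content hash identifying the binary stream
    format: str   # made-up format label for this sketch
    shape: tuple  # logical (rows, cols)
    order: str    # "C" (row-major) or "F" (column-major)


def annotate(payload, shape, order):
    """Attach serialization details to a payload by its content hash."""
    digest = hashlib.sha256(payload).hexdigest()
    return SerializationAnnotation(digest, "raw-float64-le", shape, order)


def load(payload, note):
    """Generic loader driven only by the annotation, not by the schema."""
    if hashlib.sha256(payload).hexdigest() != note.sha256:
        raise ValueError("payload does not match the annotated hash")
    if note.format != "raw-float64-le":
        raise ValueError(f"no loader available for {note.format!r}")
    rows, cols = note.shape
    flat = struct.unpack(f"<{rows * cols}d", payload)
    if note.order == "C":
        return [[flat[r * cols + c] for c in range(cols)] for r in range(rows)]
    return [[flat[c * rows + r] for c in range(cols)] for r in range(rows)]


# One logical array with two linked serializations in different storage orders.
c_payload = struct.pack("<6d", 1.0, 2.0, 3.0, 4.0, 5.0, 6.0)
f_payload = struct.pack("<6d", 1.0, 4.0, 2.0, 5.0, 3.0, 6.0)

serializations = [
    (c_payload, annotate(c_payload, (2, 3), "C")),
    (f_payload, annotate(f_payload, (2, 3), "F")),
]

loaded = [load(payload, note) for payload, note in serializations]
assert loaded[0] == loaded[1]
```

The array's schema never appears in this sketch at all: the loader gets everything it needs from the annotation attached to the hash, which is the separation being argued for above.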
