Refactoring the backend storage classes to support new use cases and scaling #193
Replies: 10 comments 1 reply
-
I'm starting to think that we would never support the notion of allele2 on GRCh38 mapping to 2 different locations on GRCh37. I don't believe genome viewers support this, and for good reason: if there isn't one definitive mapping then it simply can't be mapped with certainty. @ahwagner what say you?
-
I think we should support multiple mappings in the mapping tables, but provide functions that access this content and make decisions for us for specific downstream tasks, be it selecting a “preferred” / “primary” mapping, or returning no mappings if the mapping is ambiguous.
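A minimal sketch of what those accessor decisions could look like at the SQL level, assuming the `liftover_mapping` table proposed below plus a hypothetical `is_preferred` flag (none of these column names are settled):

```sql
-- "Preferred" mapping: pick the row flagged as preferred for a given allele.
SELECT rhs_allele_id
  FROM liftover_mapping
 WHERE lhs_allele_id = 'allele1'   -- placeholder id
   AND is_preferred;               -- hypothetical flag column

-- "Unambiguous only": return a mapping only when exactly one exists.
SELECT max(rhs_allele_id) AS rhs_allele_id
  FROM liftover_mapping
 WHERE lhs_allele_id = 'allele1'   -- placeholder id
HAVING count(*) = 1;               -- zero rows if there are 0 or 2+ mappings
```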
-
I think an arbitrary/unstructured annotation store is still a pretty firm requirement. We're envisioning its use in the GREGoR pilot as a way to support retrieval of the sample data from the original VCFs by providing the corresponding VCF coords for every allele (so that you can tabix back into the originating row).
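As a loose illustration of that idea (the `annotations` table shape and every column name below are hypothetical, not the current schema):

```sql
-- Sketch only: each allele keeps the VCF coordinates it came from, so a
-- client can tabix back into the originating row, e.g.
--   tabix sample1.vcf.gz chr1:12345-12345
INSERT INTO annotations (object_id, annotation_type, annotation)
VALUES (
    'ga4gh:VA.example1',  -- placeholder VRS allele id
    'vcf_origin',         -- hypothetical annotation type
    '{"vcf": "sample1.vcf.gz", "chrom": "chr1", "pos": 12345, "ref": "A", "alt": "G"}'
);
```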
-
w/r/t liftover -- one thing I wonder about is if there should be some kind of (we talked about it and figured this could probably just fit into the relationship_type column if it were really necessary to specify)
-
The mappings table will do a lot of the heavy lifting for many categorical variation constraints. One open question is how to support the feature context constraint. Specifically, this probably means having some kind of access to feature (gene) coordinates -- where will that come from?
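One possible shape for an answer, purely as a sketch: load gene coordinates into a small local table and join it against the proposed `location` table. All of the names below (`gene_feature`, its columns, and the `allele`/`location` column shapes) are assumptions, not anything that exists today.

```sql
-- Hypothetical gene coordinate table, populated from some external source.
CREATE TABLE IF NOT EXISTS gene_feature (
    gene_symbol TEXT PRIMARY KEY,
    sequence_id TEXT NOT NULL,     -- e.g. a VRS/refget sequence identifier
    start_pos   BIGINT NOT NULL,
    end_pos     BIGINT NOT NULL
);

-- Alleles whose location falls within a gene's span (interbase coords assumed).
SELECT a.id
  FROM allele a
  JOIN location l ON l.id = a.location_id
  JOIN gene_feature g ON g.sequence_id = l.sequence_id
 WHERE g.gene_symbol = 'EXAMPLE1'
   AND l.start_pos >= g.start_pos
   AND l.end_pos   <= g.end_pos;
```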
-
As a note for our future selves: We decided to nix the
-
We also discussed creating a
-
I understand this issue is blocking, so providing some direction here in advance of our AnyVar meeting in case it helps move things along. This proposal seems like a significant restructuring of the backend. I think we have vacillated between postgres-only and multi-backend abstraction several times through the development of the product. I would want to make sure that whatever direction we go would support use cases like @ehclark's; it would be a loss if, in the process of enhancing AnyVar, we made it unsuitable for our existing user base. The variant mapping table proposal is a good structure; I like it.
-
Will try to sketch out some thoughts here...
It seems to me that there are two very separate use cases:
For the first, I would probably move towards being able to run AnyVar without a backend, or at least being able to have it not write computed VRS IDs to a backend. For the second, here are some thoughts about how we might integrate AnyVar the Knowledge Repository into our infrastructure using the GDH as a backend.
And lastly to comment on some of the proposed data model changes as they compare to our GDH:
Hope that is helpful.
-
Right now the database schema initialization code is embedded within the Python `ObjectStore` classes. There is one for `vrs_objects` and one for `annotations`, and most of these classes are written to be very abstract about what is being put in them.

To support scaling (including table indexing on more columns and on fields in sub-objects like `Location`) and to support more use cases such as additional variation types and liftovers, we will need to split more object types out into different tables and add mapping tables for relationships like liftover/transcription/translation.

In initial brainstorming we came up with some tables we need now to support alleles, and some we will need to keep in mind as they will need to be added later.
For this refactoring we will focus on supporting PostgreSQL and will remove additional database support (at least at first). Support for other databases can be added through forks or client libraries that implement the same Python Storage interface.
Lucidchart: https://lucid.app/lucidchart/c45c8743-3bc0-494d-9680-fabd564f62d5/edit?viewport_loc=-81%2C435%2C2106%2C1229%2COVZxwuj0N43Y&invitationId=inv_84080742-db4f-4fdc-9842-2afa61d2e990
New tables (a rough DDL sketch follows below):
- `allele` (need a `state` table?)
- `location`
- `liftover_mapping` (g-g mappings)
- `projection_mapping` (pairwise g-c-r-p etc. mappings)

Later:
- `copy_number`
- `categorical_variation` (needs some modeling work; store constraints in their own table? What will we be querying on for each categorical variation type?)
- `annotations` (we will probably still need a way to store arbitrary content)
- `sequence_reference` (maybe, if we want to store things like NCBI/GRCh identifiers internally instead of going out to SeqRepo to resolve them prior to internal AnyVar queries that need them)
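A rough DDL sketch of the initial tables, assuming VRS identifiers as primary keys; the column names and types below are illustrative only and would change as the modeling work proceeds:

```sql
CREATE TABLE location (
    id          TEXT PRIMARY KEY,        -- VRS location identifier
    sequence_id TEXT NOT NULL,
    start_pos   BIGINT NOT NULL,
    end_pos     BIGINT NOT NULL
);
CREATE INDEX location_seq_range_idx ON location (sequence_id, start_pos, end_pos);

CREATE TABLE allele (
    id          TEXT PRIMARY KEY,        -- VRS allele identifier
    location_id TEXT NOT NULL REFERENCES location (id),
    state       TEXT NOT NULL            -- or a separate state table, per the open question above
);

CREATE TABLE liftover_mapping (          -- g-g mappings
    lhs_allele_id     TEXT NOT NULL REFERENCES allele (id),
    rhs_allele_id     TEXT NOT NULL REFERENCES allele (id),
    relationship_type TEXT NOT NULL,     -- e.g. 'liftover'
    PRIMARY KEY (lhs_allele_id, rhs_allele_id, relationship_type)
);

CREATE TABLE projection_mapping (        -- pairwise g/c/r/p mappings
    lhs_allele_id     TEXT NOT NULL REFERENCES allele (id),
    rhs_allele_id     TEXT NOT NULL REFERENCES allele (id),
    relationship_type TEXT NOT NULL,     -- e.g. 'transcription', 'translation'
    PRIMARY KEY (lhs_allele_id, rhs_allele_id, relationship_type)
);
```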
Mappings

For these tables:
- `liftover_mapping`
- `projection_mapping`
We can, for an input allele, compute the liftovers, transcriptions, and translations, and store them all with the input allele as the left-hand side of each entry in the mapping table. For relationships that are bidirectional, we can store an additional mapping record that uses the right-hand side from those records as the left-hand side, with the input allele as the right-hand side.
If allele2 on GRCh38 also maps back to allele1 on GRCh37, we add that reverse row. But there are cases where an allele on GRCh38 (e.g. allele2) may ambiguously map back to two possible alleles on GRCh37 (in this case allele1 and an allele4), which can be represented by adding a row for each of them:
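Purely illustrative rows (placeholder identifiers, not real variants), using the column names from the DDL sketch above:

```sql
INSERT INTO liftover_mapping (lhs_allele_id, rhs_allele_id, relationship_type) VALUES
    ('allele1', 'allele2', 'liftover'),   -- input allele1 (GRCh37) -> allele2 (GRCh38)
    ('allele2', 'allele1', 'liftover'),   -- reverse row when the mapping round-trips
    ('allele2', 'allele4', 'liftover');   -- second reverse row when the GRCh37 mapping is ambiguous
```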
(TODO: concrete examples)