Refactoring the backend storage classes to support new use cases and scaling #193
Replies: 10 comments 1 reply
-
I'm starting to think that we would never support the notion of allele2 on GRCh38 mapping to 2 different locations on GRCh37. I don't believe genome viewers support this, and for good reason: if there isn't one definitive mapping then it simply can't be mapped with certainty. @ahwagner what say you?
-
I think we should support multiple mappings in the mapping tables, but provide functions that access this content and make decisions for us for specific downstream tasks, be it selecting a “preferred” / “primary” mapping, or returning no mappings if the mapping is ambiguous.
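A minimal sketch of what those accessor decisions could look like at the SQL level, assuming the `liftover_mapping` table proposed below plus a hypothetical `is_preferred` flag (none of these column names are settled):

```sql
-- "Preferred" mapping: pick the row flagged as preferred for a given allele.
SELECT rhs_allele_id
  FROM liftover_mapping
 WHERE lhs_allele_id = 'allele1'   -- placeholder id
   AND is_preferred;               -- hypothetical flag column

-- "Unambiguous only": return a mapping only when exactly one exists.
SELECT max(rhs_allele_id) AS rhs_allele_id
  FROM liftover_mapping
 WHERE lhs_allele_id = 'allele1'   -- placeholder id
HAVING count(*) = 1;               -- zero rows if there are 0 or 2+ mappings
```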
-
I think an arbitrary/unstructured annotation store is still a pretty firm requirement. We're envisioning its use in the GREGoR pilot as a way to support retrieval of the sample data from the original VCFs by providing the corresponding VCF coords for every allele (so that you can tabix back into the originating row).
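As a loose illustration of that idea (the `annotations` table shape and every column name below are hypothetical, not the current schema):

```sql
-- Sketch only: each allele keeps the VCF coordinates it came from, so a
-- client can tabix back into the originating row, e.g.
--   tabix sample1.vcf.gz chr1:12345-12345
INSERT INTO annotations (object_id, annotation_type, annotation)
VALUES (
    'ga4gh:VA.example1',  -- placeholder VRS allele id
    'vcf_origin',         -- hypothetical annotation type
    '{"vcf": "sample1.vcf.gz", "chrom": "chr1", "pos": 12345, "ref": "A", "alt": "G"}'
);
```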
-
w/r/t liftover -- one thing I wonder about is if there should be some kind of (we talked about it and figured this could probably just fit into the relationship_type column if it were really necessary to specify)
-
The mappings table will do a lot of the heavy lifting for many categorical variation constraints. One open question is how to support the feature context constraint. Specifically, this probably means having some kind of access to feature (gene) coordinates -- where will that come from?
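One possible shape for an answer, purely as a sketch: load gene coordinates into a small local table and join it against the proposed `location` table. All of the names below (`gene_feature`, its columns, and the `allele`/`location` column shapes) are assumptions, not anything that exists today.

```sql
-- Hypothetical gene coordinate table, populated from some external source.
CREATE TABLE IF NOT EXISTS gene_feature (
    gene_symbol TEXT PRIMARY KEY,
    sequence_id TEXT NOT NULL,     -- e.g. a VRS/refget sequence identifier
    start_pos   BIGINT NOT NULL,
    end_pos     BIGINT NOT NULL
);

-- Alleles whose location falls within a gene's span (interbase coords assumed).
SELECT a.id
  FROM allele a
  JOIN location l ON l.id = a.location_id
  JOIN gene_feature g ON g.sequence_id = l.sequence_id
 WHERE g.gene_symbol = 'EXAMPLE1'
   AND l.start_pos >= g.start_pos
   AND l.end_pos   <= g.end_pos;
```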
-
As a note for our future selves: We decided to nix the
-
We also discussed creating a
-
I understand this issue is blocking, so providing some direction here in advance of our AnyVar meeting in case it helps move things along. This proposal seems like a significant restructuring of the backend. I think we have vacillated between postgres-only and multi-backend abstraction several times through the development of the product. I would want to make sure that whatever direction we go would support use cases like @ehclark's; it would be a loss if, in the process of enhancing AnyVar, we made it unsuitable for our existing user base. The variant mapping table proposal is a good structure; I like it.
-
Will try to sketch out some thoughts here...
It seems to me that there are two very separate use cases:
For the first, I would probably move towards being able to run AnyVar without a backend, or at least being able to have it not write computed VRS IDs to a backend. For the second, here are some thoughts about how we might integrate AnyVar the Knowledge Repository into our infrastructure using the GDH as a backend.
And lastly to comment on some of the proposed data model changes as they compare to our GDH:
Hope that is helpful.
-
Right now the database schema initialization code is embedded within the Python `ObjectStore` classes. There is one for `vrs_objects` and one for `annotations`, and most of these classes are written to be very abstract about what is being put in them.

To support scaling (including table indexing on more columns and on fields in sub-objects like `Location`) and to support more use cases such as additional variation types and liftovers, we will need to split more object types out into different tables and add mapping tables for relationships like liftover/transcription/translation.

In initial brainstorming we came up with some tables we need now to support alleles, and some we will need to keep in mind as they will need to be added later.
For this refactoring we will focus on supporting PostgreSQL and will remove additional database support (at least at first). Support for other databases can be added through forks or client libraries that implement the same Python Storage interface.
Lucidchart: https://lucid.app/lucidchart/c45c8743-3bc0-494d-9680-fabd564f62d5/edit?viewport_loc=-81%2C435%2C2106%2C1229%2COVZxwuj0N43Y&invitationId=inv_84080742-db4f-4fdc-9842-2afa61d2e990
New tables (a rough DDL sketch follows below):
- `allele` (need a `state` table?)
- `location`
- `liftover_mapping` (g-g mappings)
- `projection_mapping` (pairwise g-c-r-p etc. mappings)

Later:
- `copy_number`
- `categorical_variation` (needs some modeling work; store constraints in their own table? What will we be querying on for each categorical variation type?)
- `annotations` (we will probably still need a way to store arbitrary content)
- `sequence_reference` (maybe, if we want to store things like NCBI/GRCh identifiers internally instead of going out to SeqRepo to resolve them prior to internal AnyVar queries that need them)
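A rough DDL sketch of the initial tables, assuming VRS identifiers as primary keys; the column names and types below are illustrative only and would change as the modeling work proceeds:

```sql
CREATE TABLE location (
    id          TEXT PRIMARY KEY,        -- VRS location identifier
    sequence_id TEXT NOT NULL,
    start_pos   BIGINT NOT NULL,
    end_pos     BIGINT NOT NULL
);
CREATE INDEX location_seq_range_idx ON location (sequence_id, start_pos, end_pos);

CREATE TABLE allele (
    id          TEXT PRIMARY KEY,        -- VRS allele identifier
    location_id TEXT NOT NULL REFERENCES location (id),
    state       TEXT NOT NULL            -- or a separate state table, per the open question above
);

CREATE TABLE liftover_mapping (          -- g-g mappings
    lhs_allele_id     TEXT NOT NULL REFERENCES allele (id),
    rhs_allele_id     TEXT NOT NULL REFERENCES allele (id),
    relationship_type TEXT NOT NULL,     -- e.g. 'liftover'
    PRIMARY KEY (lhs_allele_id, rhs_allele_id, relationship_type)
);

CREATE TABLE projection_mapping (        -- pairwise g/c/r/p mappings
    lhs_allele_id     TEXT NOT NULL REFERENCES allele (id),
    rhs_allele_id     TEXT NOT NULL REFERENCES allele (id),
    relationship_type TEXT NOT NULL,     -- e.g. 'transcription', 'translation'
    PRIMARY KEY (lhs_allele_id, rhs_allele_id, relationship_type)
);
```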
Mappings

For these tables:
- `liftover_mapping`
- `projection_mapping`
We can, for an input allele, compute the liftovers, transcriptions, and translations, and store them all with the input allele as the left-hand side of each entry in the mapping table. For relationships that are bidirectional, we can store an additional mapping record that uses the right-hand side from those records as the left-hand side, with the input allele as the right-hand side.
If allele2 on GRCh38 also maps back to allele1 on GRCh37, we add that reverse row. But there are cases where an allele on GRCh38 (e.g. allele2) may ambiguously map back to two possible alleles on GRCh37 (in this case allele1 and an allele4), which can be represented by adding a row for each of them:
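Purely illustrative rows (placeholder identifiers, not real variants), using the column names from the DDL sketch above:

```sql
INSERT INTO liftover_mapping (lhs_allele_id, rhs_allele_id, relationship_type) VALUES
    ('allele1', 'allele2', 'liftover'),   -- input allele1 (GRCh37) -> allele2 (GRCh38)
    ('allele2', 'allele1', 'liftover'),   -- reverse row when the mapping round-trips
    ('allele2', 'allele4', 'liftover');   -- second reverse row when the GRCh37 mapping is ambiguous
```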
(TODO: concrete examples)