Question about handling diploid variation in GPN-MSA reference genome training #51
Replies: 1 comment
-
Hi, thanks for the interest in our work! I agree that having a single reference is suboptimal. You could potentially train on multiple genomes/haplotypes, e.g. at each iteration sample one of your genomes and use it as the prediction target. Instead of predicting a single one-hot encoding of the allele, you could even predict the allele frequency across your genomes. In practice, we have found that this can improve performance to some degree, but it is not necessary for good performance, perhaps because, averaged over the entire genome, the model sees enough variation in each genomic context. One important note, though: if the model were memorizing the training data perfectly, it would output A with 100% probability in your example. This is not what we have observed so far. The models seem to be using the context rather than memorizing. There might still be some bias toward the reference, depending on the degree of overfitting.
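To make the two options concrete, here is a minimal NumPy sketch, not the GPN-MSA training code. It assumes you already have aligned, equal-length haplotype sequences; the function names (`one_hot`, `sample_haplotype_target`, `allele_frequency_target`, `cross_entropy`) and the toy sequences are hypothetical and only for illustration.

```python
import numpy as np

NUCLEOTIDES = "ACGT"

def one_hot(seqs):
    """Encode equal-length DNA strings as an array of shape (n_haplotypes, length, 4)."""
    idx = np.array([[NUCLEOTIDES.index(c) for c in s] for s in seqs])
    return np.eye(4)[idx]

def sample_haplotype_target(encoded, rng):
    """Option 1: at each iteration, pick one haplotype at random as the hard target."""
    return encoded[rng.integers(encoded.shape[0])]

def allele_frequency_target(encoded):
    """Option 2: average over haplotypes to get a soft (allele frequency) target."""
    return encoded.mean(axis=0)

def cross_entropy(target, probs, eps=1e-9):
    """Mean cross-entropy between per-position targets and predicted probabilities."""
    return float(-(target * np.log(probs + eps)).sum(axis=-1).mean())

# Toy data (made up): the last position is A in half the haplotypes and T in the rest.
haplotypes = ["ACGA", "ACGT", "ACGA", "ACGT"]
encoded = one_hot(haplotypes)

rng = np.random.default_rng(0)
hard_target = sample_haplotype_target(encoded, rng)  # shape (4, 4), one-hot rows
soft_target = allele_frequency_target(encoded)       # shape (4, 4), rows sum to 1

# A prediction that always outputs the reference allele (A at the last position)
# versus one that matches the observed allele frequencies.
reference_only = one_hot(["ACGA"])[0]
loss_reference_only = cross_entropy(soft_target, reference_only)
loss_frequency_aware = cross_entropy(soft_target, soft_target)
```

With the soft target, a prediction that puts all its mass on the reference allele is penalized at the polymorphic site, while a prediction that matches the allele frequencies is not. With the sampled hard target, the same effect arises in expectation over iterations.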
-
While applying GPN-MSA to our datasets, I have a question about how the model handles diploid organisms.
Since reference genomes typically represent only one haplotype, could this bias the model's understanding of conservation? For example, if two alleles (A/T) are equally frequent and neutral across species, but the reference genomes all carry A, might the model incorrectly consider T non-conserved?
Thanks for your excellent work!