Question about handling diploid variation in GPN-MSA reference genome training #51

duzhch · 2025-04-16T15:09:35Z

While applying GPN-MSA to our datasets, I have a question about how the model handles diploid organisms.
Since a reference genome typically represents only a single haploid sequence, could this bias the model's understanding of conservation? For example, if two alleles (A/T) are equally frequent and neutral across species, but the reference genomes all carry A, might the model incorrectly treat T as non-conserved?

Thanks for your excellent work!

gonzalobenegas (Maintainer) · 2025-04-16T23:17:28Z

Hi, thanks for your interest in our work! I agree that having a single reference is suboptimal. You could potentially train on multiple genomes/haplotypes, e.g. at each iteration, sample one of your genomes and use it as the prediction target. Instead of predicting a single one-hot encoding of the allele, you could even predict the allele frequency across your genomes. In practice, we have found that this can improve performance to some degree, but it is not necessary for good performance, perhaps because, averaged over the entire genome, the model sees enough variation at each genomic context.
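To make the two target options above concrete, here is a minimal sketch in plain Python. It is not taken from the GPN-MSA codebase: the function names, the toy haplotypes, and the use of a soft-label cross-entropy are all assumptions made for illustration.

```python
import math
import random

NUCLEOTIDES = "ACGT"

def sampled_one_hot_target(haplotypes, pos, rng):
    """Option 1 (hypothetical helper): pick one haplotype at random for this
    training step and use its allele at `pos` as a one-hot target."""
    hap = rng.choice(haplotypes)
    return [1.0 if n == hap[pos] else 0.0 for n in NUCLEOTIDES]

def allele_frequency_target(haplotypes, pos):
    """Option 2 (hypothetical helper): use the allele frequency across all
    haplotypes at `pos` as a soft target."""
    alleles = [hap[pos] for hap in haplotypes]
    return [alleles.count(n) / len(alleles) for n in NUCLEOTIDES]

def cross_entropy(target, logits):
    """Cross-entropy between a (possibly soft) target and softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum(t * (x - log_z) for t, x in zip(target, logits))

# Toy data: four haplotypes that differ only at the last position (A vs T).
haplotypes = ["ACGA", "ACGT", "ACGA", "ACGT"]
rng = random.Random(0)

hard = sampled_one_hot_target(haplotypes, 3, rng)  # one-hot for A or T
soft = allele_frequency_target(haplotypes, 3)      # half A, half T
loss = cross_entropy(soft, [0.0, 0.0, 0.0, 0.0])   # loss under uniform logits
```

Over many iterations the sampled one-hot target has the allele frequency as its expectation, so the two options mainly differ in gradient noise rather than in what they ask the model to learn.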

One important note, though. If the model were memorizing the training data perfectly, it would output A with 100% probability. This is not what we have observed so far: the models seem to be using the context rather than memorizing. There might still be some bias toward the reference, though, depending on the degree of overfitting.
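As a rough illustration of the difference between memorizing and using context, the sketch below compares two made-up output distributions at the A/T site from the question. The probabilities are invented for the example, not model outputs, and the log-likelihood ratio is used here only as one common way to compare the alternate and reference alleles.

```python
import math

def log_likelihood_ratio(probs, ref, alt):
    """log P(alt) - log P(ref) under the model's predicted distribution."""
    return math.log(probs[alt]) - math.log(probs[ref])

# Hypothetical distribution from a model that memorized the reference (A).
memorized = {"A": 1.0 - 3e-9, "C": 1e-9, "G": 1e-9, "T": 1e-9}

# Hypothetical distribution from a model that uses context, with a mild
# bias toward the reference allele.
contextual = {"A": 0.55, "C": 0.02, "G": 0.03, "T": 0.40}

llr_memorized = log_likelihood_ratio(memorized, ref="A", alt="T")
llr_contextual = log_likelihood_ratio(contextual, ref="A", alt="T")
```

A strongly negative ratio at a site where both alleles are common would point toward memorization of the reference, while a ratio near zero suggests the model treats the two alleles as similarly plausible given the context.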
