Hello,
I am looking for some advice on how to run findGSE and how to interpret the data when dealing with heterozygosity and plant genomes.
I summarized the runs on 4 species in this file:
FindGSE_tests_220124.pdf
The results show inaccuracy in genome size estimation as well as inconsistency in the resulting values when parameters change.
Briefly:
- the estimations vary depending on whether exp_hom=NN is set or not (all 4 cases)
- correctly, if using exp_hom=NN at the mode of the homo peak, no estimation results (except for Cgil, perhaps because the het peak is buried?)
- if using exp_hom=NN larger than the mode of the homo peak, the estimation does not change
- at different exp_hom=NN values, some estimations vary by a lot, some by very little.
By species:
- Caus: findGSE seems to be working well, concordant with the HiFi assembly
- Lmul: varies by a lot, with the correct value resulting when using exp_hom=NN LOWER than the homo peak
- Cgig: always below 1 Gb (expected: 1.4 Gb), with some runs failing
- Cgil: a 4-fold size variation, though with 38 Gb raw HiFi data and a homo peak at 87, the genome could be at ~438 Mb.
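For reference, the rough Cgil figure comes from dividing the total sequenced bases by the depth of the homozygous peak. Below is a minimal sketch of that arithmetic. The function name is hypothetical and not part of findGSE, and it assumes that k-mer depth is close to base depth (roughly true for long reads, where a read of length L yields L-k+1 k-mers) and that sequencing errors are negligible.

```python
def naive_genome_size(total_bases: float, hom_peak_depth: float) -> float:
    """Back-of-the-envelope size: sequenced bases divided by homozygous peak depth.

    Hypothetical helper for illustration only; this is not how findGSE
    fits the k-mer histogram.
    """
    if hom_peak_depth <= 0:
        raise ValueError("hom_peak_depth must be positive")
    return total_bases / hom_peak_depth


# Values quoted for Cgil: 38 Gb of raw HiFi data, homozygous peak at 87.
cgil_size = naive_genome_size(38e9, 87)
print(f"Cgil naive estimate: {cgil_size / 1e6:.0f} Mb")
```

With the rounded inputs quoted here, this lands within about 1 Mb of the ~438 Mb figure, far from the 4-fold spread seen across the findGSE runs.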
The documentation says that exp_hom=NN should lie between the homo peak and its double (hom_peak < x < 2*hom_peak), and I do see that the estimations are consistent within that range. The only issue is that the values are sometimes correct (Caus) and sometimes off (Cgig, Lmul).
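To make the range condition concrete, here is a small sketch that checks whether a chosen exp_hom value falls strictly inside the documented interval and lists a few candidate values to scan. Both function names are hypothetical helpers, not findGSE functions, and the even spacing of candidates is only an assumption about how one might probe the range.

```python
def exp_hom_in_documented_range(exp_hom: float, hom_peak: float) -> bool:
    """True if hom_peak < exp_hom < 2*hom_peak (the documented condition)."""
    return hom_peak < exp_hom < 2 * hom_peak


def candidate_exp_hom_values(hom_peak: float, n: int = 3) -> list[int]:
    """Return up to n evenly spaced integer values strictly inside the range.

    Hypothetical helper: values that round onto a boundary are dropped.
    """
    step = hom_peak / (n + 1)
    values = [round(hom_peak + step * i) for i in range(1, n + 1)]
    return [v for v in values if exp_hom_in_documented_range(v, hom_peak)]


# Example with the Cgil homozygous peak at 87.
candidates = candidate_exp_hom_values(87)
print(candidates)
print(exp_hom_in_documented_range(87, 87))  # the peak mode itself is excluded
```

Running findGSE once per candidate and comparing the estimates would show whether the result is stable across the documented range for a given species.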
Could you please provide some guidelines on how to use the tool and get a reliable, consistent estimation when the flow cytometry value is not known?
Thanks!