seqr’s Global Variant Representation: Deprecation of Allele Number and Allele Frequency to Allele Counts #4828
lynnpais
announced in
Feature Updates
When seqr transitioned to a new search backend, we deprecated the previous callset-level allele frequencies in favor of a global seqr allele frequency. With time, we realized there were significant limitations to the way we computed these global statistics and we are now adjusting the way we surface them in seqr.
Why We’re Making This Change
Previously, whenever a joint-called VCF was loaded to seqr, we would increment our global Allele Count (AC) and our global Allele Number (AN) based on the corresponding counts in the VCF. However, reference-only sites are always filtered out of joint-called VCFs, so if a variant was fully absent from a callset we would not increment the AN at all. In some cases this behavior is correct: a site may not have been covered in the gVCFs at all, and the AN should remain unchanged. However, sites that were covered in the sequencing but were ref in all samples should have had their AN incremented. Because we had no way of confirming whether the site was called or covered, the entire callset was excluded from the AN. This led to a systematic underestimation of AN, which in turn inflated Allele Frequency (AF) values, since AF is computed as AC/AN. This artificial inflation of AF values is especially problematic for rare variants that are absent from most callsets. Consider, for example, a variant that is only present in one small joint-called VCF and absent from all others. Its actual prevalence in seqr might be ~0.000002 but the AF would be reported as more like 0.02, depending on the size of the callset.
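To make the inflation concrete, here is a minimal sketch in Python. It is not seqr's actual code: the callset sizes, the `ac` and `n_samples` field names, and the assumption that every site is diploid and covered in every callset are all hypothetical and chosen only to reproduce the orders of magnitude in the example above.

```python
# Hypothetical callsets: the variant is seen once in a small callset of
# 25 samples and is absent (AC = 0) from the rest of the data.
callsets = [
    {"n_samples": 25, "ac": 1},
    {"n_samples": 100_000, "ac": 0},
    {"n_samples": 149_975, "ac": 0},
]


def af_counting_only_callsets_with_variant(callsets):
    """AC/AN where AN is only incremented by callsets containing the variant.

    This mimics the old behavior described above: callsets in which the
    variant is absent contribute nothing to AN.
    """
    ac = sum(c["ac"] for c in callsets)
    an = sum(2 * c["n_samples"] for c in callsets if c["ac"] > 0)
    return ac / an


def af_counting_all_callsets(callsets):
    """AC/AN where every callset contributes to AN.

    Assumption: the site is covered and diploid in every sample, which is
    exactly what could not be confirmed from joint-called VCFs.
    """
    ac = sum(c["ac"] for c in callsets)
    an = sum(2 * c["n_samples"] for c in callsets)
    return ac / an


inflated_af = af_counting_only_callsets_with_variant(callsets)
covered_af = af_counting_all_callsets(callsets)
print(f"inflated AF: {inflated_af:g}, AF if all sites were covered: {covered_af:g}")
```

Under these assumed numbers the two estimates differ by four orders of magnitude, which is why a rare variant could slip past an AF filter or be excluded by it for the wrong reason.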
What Changed
We have fully deprecated Allele Number and Allele Frequency in seqr. The AF filter has been removed, and these values will no longer be shown in search results. We will continue to show the total Allele Count (AC) and Homozygote (Hom) count as before, and will continue to support filtering on these values. Please see below for recommendations on how to adjust your searches based on these changes to the available filters.
While not every site is represented in all callsets, one can estimate an upper bound for the allele number (AN) based on the total number of samples loaded in seqr. This information is now available when you hover over the seqr allele count for a given variant.
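As a rough illustration of that upper bound, the sketch below assumes two alleles per sample (diploid autosomal sites); the function names and the ploidy default are illustrative assumptions, not part of seqr. Because the true AN can only be smaller than this bound, the resulting AF is a lower bound on the frequency, not an estimate of it.

```python
def an_upper_bound(total_samples: int, ploidy: int = 2) -> int:
    """Largest possible allele number if every sample were called at the site.

    Assumption: each sample contributes `ploidy` alleles (2 for autosomes).
    """
    return ploidy * total_samples


def af_lower_bound(ac: int, total_samples: int, ploidy: int = 2) -> float:
    """Smallest AF consistent with the observed AC.

    The true AN is at most the upper bound, so AC / AN_max is a floor on AF.
    """
    return ac / an_upper_bound(total_samples, ploidy)


# Example with an instance holding 60,000 samples and a variant with AC = 12.
an_max = an_upper_bound(60_000)
af_min = af_lower_bound(12, 60_000)
print(f"AN <= {an_max}, AF >= {af_min:g}")
```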
Updating Searches That Include the seqr AF Filter
The standard saved searches in seqr have been updated to use the seqr AC filter in lieu of the seqr AF filter. These changes will be applied automatically to all subsequent searches. However, for ad-hoc searches or customized saved searches, any previously added AF filters will now be ignored, and we recommend updating these searches to use an AC cutoff as well.
In the Broad Institute’s instance of seqr on AnVIL, which currently includes over 60,000 samples, we observed that a reasonable substitution was to use an AC of roughly 1000 for every percent of AF. This means that for the standard dominant searches that had an AF cutoff of 1% we now use an AC cutoff of 1000, and for the standard recessive searches that had a 3% cutoff we use an AC of 3000. If you are using the Broad’s seqr instance or a similarly sized or larger local instance, we recommend you adjust your custom searches at roughly this rate, although users should always adjust the AC threshold to suit the needs of their specific cases.
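The rule of thumb above amounts to a simple linear conversion. The helper below is a sketch of that arithmetic only; the function name and its parameters are hypothetical, and the default rate of 1000 AC per percent of AF applies to instances of roughly the Broad's size.

```python
def ac_cutoff_from_af_percent(af_percent: float, ac_per_percent: int = 1000) -> int:
    """Convert an old AF cutoff (in percent) to an approximate AC cutoff.

    Uses the rough rate of `ac_per_percent` allele counts per 1% of AF.
    """
    return round(af_percent * ac_per_percent)


dominant_cutoff = ac_cutoff_from_af_percent(1)   # old 1% AF cutoff
recessive_cutoff = ac_cutoff_from_af_percent(3)  # old 3% AF cutoff
print(dominant_cutoff, recessive_cutoff)
```

A custom search that previously used, say, a 0.1% AF cutoff would map to an AC of about 100 at the same rate, though the threshold should still be tuned to the case at hand.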
For users running their own smaller instance of seqr, lower AC thresholds can be set to reflect the smaller size and characteristics of their datasets.
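One possible starting point for a smaller instance is to scale the rate above in proportion to the number of samples loaded. This proportional scaling is an assumption made here for illustration, not a seqr recommendation; the function name, the reference size of 60,000 samples, and the floor of 1 are all hypothetical choices.

```python
def scaled_ac_cutoff(
    af_percent: float,
    n_samples: int,
    reference_samples: int = 60_000,
    ac_per_percent: int = 1000,
) -> int:
    """Scale the AC-per-percent rate linearly with instance size.

    Assumption: the suitable AC cutoff shrinks in proportion to the number
    of samples relative to a reference instance of `reference_samples`.
    The result is floored at 1 so the filter never excludes every variant.
    """
    scaled = af_percent * ac_per_percent * n_samples / reference_samples
    return max(1, round(scaled))


# Example: a local instance with 6,000 samples.
small_dominant = scaled_ac_cutoff(1, 6_000)
small_recessive = scaled_ac_cutoff(3, 6_000)
print(small_dominant, small_recessive)
```

Treat the output as a first guess only, since very small instances and datasets enriched for a particular phenotype or population may call for different thresholds.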