
Stats issue to be reconsidered for the beacon integration #132

@wm75


Copied this over from #131.
Once the beacon integration goes live and sees some use, the limitations described here should be revisited.


The import method uses the following INFO fields:

  • AN maps to callCount in the beacon DB. Falls back to 2 * num_called (num_called is calculated from the VCF).
  • AF maps to frequency in the beacon DB. Falls back to AC / AN.
  • VT maps to varianttype in the beacon DB. The database field is nullable, so the import still works without it.
  • AC maps to alleleCount in the beacon DB. The import breaks when it is missing (for this dataset), so I added a line stating that AC is required.
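To make the fallback rules above concrete, here is a minimal sketch of how they could be resolved for a single ALT allele. This is not the beacon-python code: `resolve_stats` and the plain `info` dict are hypothetical, and the `2 * num_called` fallback assumes diploid calls.

```python
def resolve_stats(info: dict, num_called: int) -> dict:
    """Resolve beacon stats for one ALT allele from already-parsed INFO values.

    `info` is a hypothetical dict of scalar INFO values (AC, AN, AF, VT);
    `num_called` is the number of samples with a called genotype.
    """
    if "AC" not in info:
        # AC has no fallback, so fail early with a clear message.
        raise ValueError("INFO field AC is required")

    allele_count = info["AC"]
    # AN fallback: two alleles per called sample (diploid assumption).
    call_count = info.get("AN", 2 * num_called)

    frequency = info.get("AF")
    if frequency is None:
        # AF fallback: AC / AN, guarding against an empty call set.
        frequency = allele_count / call_count if call_count else None

    return {
        "alleleCount": allele_count,
        "callCount": call_count,
        "frequency": frequency,
        "varianttype": info.get("VT"),  # nullable in the DB, so None is fine
    }
```

With only AC present, `resolve_stats({"AC": 3}, num_called=5)` derives callCount and frequency from the fallbacks and leaves varianttype as `None`.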

There is an option min_ac for filtering out variants that were seen fewer than a minimum number of times (1 by default). I currently set this to 0. Setting it to anything higher than 0 will also break the import for any variant that does not contain VT (and maybe other fields too).

The import is a bit "Python-esque" 😅

It has an _unpack method that reads the INFO fields into nested lists.
While inserting variants, list entries are accessed by index only, which leads to "index out of bounds" exceptions whenever something is not set.
There is a try/except block around the whole for each variant in variants loop that catches these exceptions, so a single failure cancels the import of 1000 variants at a time.
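One way to avoid losing a whole batch is to catch the exception per variant rather than around the loop. The following is only a sketch of that idea, not the existing importer: `import_batch` and the `insert` callable are hypothetical names.

```python
def import_batch(variants, insert):
    """Insert each variant on its own so one bad record does not cancel the batch.

    `insert` is a hypothetical callable that writes a single variant and may
    raise IndexError/KeyError when an expected INFO value is not set.
    Returns the number of imported variants and a list of (position, error).
    """
    imported = 0
    skipped = []
    for position, variant in enumerate(variants):
        try:
            insert(variant)
        except (IndexError, KeyError) as exc:
            # Record the failure and carry on with the next variant.
            skipped.append((position, repr(exc)))
        else:
            imported += 1
    return imported, skipped
```

The returned `skipped` list makes it possible to report which variants were dropped and why, instead of silently losing the batch.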


There is also a whole separate block that requires the SVTYPE and MATEID INFO fields. I just never had any data with variant.is_sv == true.


On another note, the same variant is never added twice due to ON CONFLICT (datasetId, chromosome, start, reference, alternate) DO NOTHING.
In an ideal world we would increment the sample and allele counts and recalculate the allele frequency.

But I'd argue that it's not that big of a deal, since the datasets uploaded by users are arbitrary and allele frequency across this data therefore does not have much meaning anyway.
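For reference, the "ideal world" variant of the conflict handling could look roughly like the sketch below. It uses an in-memory SQLite database as a stand-in so the example is runnable; the `variants` table and its columns are a simplified assumption rather than the real beacon schema, and in PostgreSQL the existing-row columns on the right-hand side may need to be qualified with the table name.

```python
import sqlite3

# Hypothetical, simplified table; only the conflict key mirrors the issue text.
SCHEMA = """
CREATE TABLE variants (
    datasetId TEXT, chromosome TEXT, start INTEGER,
    reference TEXT, alternate TEXT,
    alleleCount INTEGER, callCount INTEGER, frequency REAL,
    UNIQUE (datasetId, chromosome, start, reference, alternate)
)
"""

# On conflict, add the new counts to the stored ones and recompute frequency.
# Right-hand sides see the values stored before the update.
UPSERT = """
INSERT INTO variants
    (datasetId, chromosome, start, reference, alternate,
     alleleCount, callCount, frequency)
VALUES (?, ?, ?, ?, ?, ?, ?, ?)
ON CONFLICT (datasetId, chromosome, start, reference, alternate) DO UPDATE SET
    alleleCount = alleleCount + excluded.alleleCount,
    callCount = callCount + excluded.callCount,
    frequency = CAST(alleleCount + excluded.alleleCount AS REAL)
                / (callCount + excluded.callCount)
"""


def add_variant(conn, key, allele_count, call_count):
    """Insert a variant or merge its counts into the existing row."""
    frequency = allele_count / call_count if call_count else None
    conn.execute(UPSERT, (*key, allele_count, call_count, frequency))


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
key = ("ds1", "1", 12345, "A", "G")
add_variant(conn, key, allele_count=1, call_count=4)
add_variant(conn, key, allele_count=3, call_count=6)  # same variant again
row = conn.execute(
    "SELECT alleleCount, callCount, frequency FROM variants"
).fetchone()
```

Whether the merged frequency is meaningful is a separate question, as argued above: it only makes sense if the uploads into one dataset come from distinct samples.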


TL;DR

Had to add AC info field as a requirement.

The import routine that comes with beacon-python was written for a specific kind of dataset. It does the job for now, but if the feature sees some use we will write our own importer to handle all kinds of data (as suggested in the docs).
