Stats issue to be reconsidered for the beacon integration

Copied this over from #131.
Once the beacon integration goes live and sees some use, the limitations described here should be revisited.

----

The import method uses the following INFO fields:

* **AN** maps to _callCount_ in the beacon DB. Falls back to `2 * num_called` (`num_called` is calculated from the VCF).
* **AF** maps to _frequency_ in the beacon DB. Falls back to `AC / AN`.
* **VT** maps to _varianttype_ in the beacon DB. The database field is nullable, so the import still works without it.
* **AC** maps to _alleleCount_ in the beacon DB. **The import breaks when it is missing** (for this dataset), so I added a line saying that AC is required.
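
A minimal sketch of the mapping and fallbacks listed above. This is not the beacon-python code: `resolve_stats` and the plain `info` dict are hypothetical, and it assumes a single ALT allele so that AC and AF are scalars.

```python
from typing import Any, Dict, Mapping


def resolve_stats(info: Mapping[str, Any], num_called: int) -> Dict[str, Any]:
    """Map VCF INFO fields to beacon DB columns, applying the fallbacks.

    Hypothetical helper: `info` is a plain dict of INFO fields and
    `num_called` is the number of called samples for this variant.
    """
    ac = info.get("AC")
    if ac is None:
        # AC has no fallback, so fail early with a readable message.
        raise ValueError("INFO field AC is required")

    an = info.get("AN")
    if an is None:
        an = 2 * num_called  # fallback for AN

    af = info.get("AF")
    if af is None:
        af = ac / an if an else None  # fallback for AF, avoiding division by zero

    return {
        "alleleCount": ac,
        "callCount": an,
        "frequency": af,
        "varianttype": info.get("VT"),  # nullable column, so None is fine
    }
```

Calling `resolve_stats({"AC": 3}, num_called=10)` fills in `callCount` and `frequency` from the fallbacks, while a dict without AC raises a `ValueError` before anything reaches the database.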

There is a `min_ac` option for filtering out variants that were seen fewer than a minimum number of times (1 by default). I currently set it to 0; setting it to anything higher than 0 will also break the import for anything that does not contain **VT** (and maybe other fields too).

The import is a bit "Python-esque" :sweat_smile:

It has an [_unpack](https://github.com/CSCfi/beacon-python/blob/master/beacon_api/utils/db_load.py#L136) method that reads the INFO fields into nested lists.
While [inserting variants](https://github.com/CSCfi/beacon-python/blob/master/beacon_api/utils/db_load.py#L297), list entries are simply accessed by index, which leads to "index out of range" exceptions (`IndexError`) whenever something is not set.
There is a `try/except` block around the whole `for each variant in variants` loop that catches these exceptions, so every failure cancels the import of 1000 variants at once.
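
To illustrate the alternative, here is a rough sketch that moves the error handling inside the loop so that one incomplete variant is skipped without taking the rest of its batch with it. Everything here is hypothetical (`to_db_values`, `import_batch`, and the dict-of-lists row shape only imitate the index-based access described above); it is not how beacon-python is structured.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

Row = Dict[str, List[Any]]


def to_db_values(row: Row) -> Tuple[Any, Any]:
    """Imitate index-based access: raises IndexError when a list is empty."""
    return row["AC"][0], row["VT"][0]


def import_batch(
    rows: Iterable[Row],
    insert: Callable[[Tuple[Any, Any]], None],
) -> Tuple[int, List[Tuple[int, str]]]:
    """Insert rows one by one, skipping only the rows that are incomplete.

    Returns the number of imported rows and a list of
    (position, exception name) pairs for the skipped ones.
    """
    imported = 0
    skipped: List[Tuple[int, str]] = []
    for position, row in enumerate(rows):
        try:
            insert(to_db_values(row))
        except (IndexError, KeyError) as exc:
            # Only this row is lost; the loop carries on with the next one.
            skipped.append((position, type(exc).__name__))
            continue
        imported += 1
    return imported, skipped
```

The returned `skipped` list also gives a place to report which variants were dropped and why, instead of losing that information in a batch-level handler.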

---

There is also [a whole other block](https://github.com/CSCfi/beacon-python/blob/master/beacon_api/utils/db_load.py#L158) that asks for the **SVTYPE** and **MATEID** INFO fields. I just never had any data with `variant.is_sv == True`.

---

On another note, [the same variant is never added twice](https://github.com/CSCfi/beacon-python/blob/master/beacon_api/utils/db_load.py#L365) due to `ON CONFLICT (datasetId, chromosome, start, reference, alternate) DO NOTHING`.
In an ideal world we would increment sample and allele counts and recalculate the allele frequency.

But I'd argue that it's not that big of a deal: the datasets uploaded by users are arbitrary, so allele frequency across this data does not mean much anyway.
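
For reference, the "ideal world" behaviour could look roughly like the upsert below, which adds the counts and recalculates the frequency instead of doing nothing on conflict. This is only a sketch under assumptions: it uses the standard-library `sqlite3` module so that it runs standalone (the beacon DB itself is PostgreSQL, where `EXCLUDED` works similarly but existing-row columns need to be qualified with the table name), the table name `variants` and the helper names are made up, and sample counts are left out.

```python
import sqlite3

# Hypothetical table; only the conflict key and the count columns are modelled.
CREATE_TABLE = """
CREATE TABLE variants (
    datasetId TEXT, chromosome TEXT, "start" INTEGER,
    reference TEXT, alternate TEXT,
    alleleCount INTEGER, callCount INTEGER, frequency REAL,
    UNIQUE (datasetId, chromosome, "start", reference, alternate)
)
"""

# On conflict, add the new counts to the stored ones and recompute AF.
# Unqualified column names on the right-hand side are the existing row's values.
UPSERT = """
INSERT INTO variants (datasetId, chromosome, "start", reference, alternate,
                      alleleCount, callCount, frequency)
VALUES (:datasetId, :chromosome, :start, :reference, :alternate,
        :alleleCount, :callCount, :frequency)
ON CONFLICT (datasetId, chromosome, "start", reference, alternate)
DO UPDATE SET
    alleleCount = alleleCount + excluded.alleleCount,
    callCount   = callCount + excluded.callCount,
    frequency   = CAST(alleleCount + excluded.alleleCount AS REAL)
                  / (callCount + excluded.callCount)
"""


def upsert_variant(conn: sqlite3.Connection, row: dict) -> None:
    """Insert a variant, or merge its counts into the existing row."""
    params = dict(row)
    call_count = params["callCount"]
    params["frequency"] = params["alleleCount"] / call_count if call_count else None
    conn.execute(UPSERT, params)
```

Importing the same variant twice then yields summed counts and a frequency derived from those sums, rather than silently keeping the first row.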

---

TL;DR

Had to add AC info field as a requirement.

The import routine that comes with beacon-python was written for a specific kind of dataset. It does the job for now, but if the feature sees some use we will write our own importer to handle all kinds of data ([as suggested in the docs](https://beacon-python.readthedocs.io/en/latest/db.html)).

Stats issue to be reconsidered for the beacon integration #132
