Feasibility of training using generated data

Hello team, nice work on this project. I appreciate all of the development that has gone into it over the last several years.

My group and I work primarily on generic hardware DNN accelerators, so we have much less capability for gathering sequencing data to train basecallers. I was therefore wondering about the feasibility of training a basecaller such as Dorado on synthetic signal data from Squigulator. I note that this isn't the focus of your paper, which concentrates on the downstream analysis, and that you also observe that the noise, particularly amplitude noise, affects basecalling accuracy, with the optimum lying around the experimental noise level the network was trained on. Conversely, this implies that a network trained on synthetic data with either zero noise or too much noise would give sub-optimal test accuracy on experimental data, which is fairly obvious.

Did you experiment at all with training new basecallers on synthetic data? If so, how did it go? If not, do you think it would be possible, perhaps even just using synthetic data to augment experimental training data, either to increase the training set size or to train on genomes one doesn't have experimental data for?
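
To make the augmentation idea concrete, here is a minimal sketch of the kind of mixing I have in mind. It is plain Python on toy data, and everything in it is an illustrative assumption rather than anything taken from Squigulator or Dorado: the function names `add_amplitude_noise` and `mix_training_set`, the zero-mean Gaussian noise model, the noise level of 2.0, and the 25% synthetic fraction are all hypothetical.

```python
import random


def add_amplitude_noise(signal, noise_std, rng):
    """Return a copy of `signal` with zero-mean Gaussian noise added per sample.

    Assumption: amplitude noise is modelled as independent Gaussian noise,
    which is a simplification chosen only for illustration.
    """
    return [x + rng.gauss(0.0, noise_std) for x in signal]


def mix_training_set(experimental, synthetic, synthetic_fraction, rng):
    """Keep all experimental reads and add synthetic reads so that they make up
    roughly `synthetic_fraction` of the combined set (capped by availability)."""
    if not 0.0 <= synthetic_fraction < 1.0:
        raise ValueError("synthetic_fraction must be in [0, 1)")
    # Solve n_syn / (n_exp + n_syn) = f for n_syn.
    n_syn = round(synthetic_fraction * len(experimental) / (1.0 - synthetic_fraction))
    n_syn = min(n_syn, len(synthetic))
    mixed = list(experimental) + rng.sample(synthetic, n_syn)
    rng.shuffle(mixed)
    return mixed


rng = random.Random(0)

# Toy stand-ins for (signal, sequence) training pairs.
experimental = [([100.0, 90.0, 110.0], "ACG") for _ in range(6)]
clean_synthetic = [([95.0, 85.0, 105.0], "TGA") for _ in range(10)]

# Add noise to the synthetic signals so they are closer to experimental noise levels.
noisy_synthetic = [
    (add_amplitude_noise(sig, 2.0, rng), seq) for sig, seq in clean_synthetic
]

training_set = mix_training_set(experimental, noisy_synthetic, 0.25, rng)
print(len(training_set))
```

In a real experiment the noise level would presumably be matched to the experimental data and the synthetic fraction swept, with accuracy always evaluated on held-out experimental reads.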

Feasibility of training using generated data #24

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Feasibility of training using generated data #24

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions