Feasibility of training using generated data #24

@william-simon

Hello team, nice work on this project. I appreciate all the development that has gone into it over the last several years.

My group and I work primarily on generic hardware DNN accelerators, so we have far less capacity to gather sequencing data for training basecallers. I was therefore wondering about the feasibility of training a basecaller such as Dorado on synthetic signal data from Squigulator. I note that this isn't the focus of your paper, which is rather the downstream analysis, and that you also observe that noise, particularly amplitude noise, affects basecalling accuracy, with the optimum lying around the experimental noise level the network was trained on. Conversely, this would imply that a network trained on synthetic data with either zero noise or too much noise would perform sub-optimally on experimental data, which is fairly obvious.

Did you experiment at all with training new basecallers on synthetic data? If so, how did it go? If not, do you think it would be possible, perhaps even just using synthetic data to augment experimental training data, either to increase the training set size or to train on genomes one doesn't have?
