lower base quality and more indels than actual data

moved from https://github.com/Psy-Fer/buttery-eel/issues/76#issue-3078704162


Hi, Hasindu! Here you go:

I generated synthetic data using squigulator and then basecalled it with buttery-eel. The synthetic data seem to have many more indels and lower base quality scores than the actual data. Below is an IGV screenshot (upper panel: squigulator + buttery-eel data; bottom panel: actual amplicon data sequenced on an R10 flow cell, library prep kit NBD114, basecalled with SUP).

![Image](https://github.com/user-attachments/assets/fccde318-aff7-4229-8bac-c9616eccb84e)

Here are the commands used to generate the synthetic data:

```
config=dna_r10.4.1_e8.2_400bps_sup.cfg

#create artificial datasets
time $squigulator -x dna-r10-prom -f ${d} -t 8 -r ${r} -q $outdir/${i}_${d}"x"_${r}ideal_${n}.fasta \
--bps 400 --ont-friendly=yes $ref/${i}.fasta -o $datadir/${i}_${d}"x"_${r}_${n}.blow5

#basecall
time buttery-eel -g $basecaller --config $config --device cuda:1 -i $datadir/${i}_${d}"x"_${r}_${n}.blow5 -o $outdir/${i}_${d}"x"_${r}_buttery-eel_${n}.fastq \
--port auto --use_tcp --dorado_download_path $dorado_download_path

```
The mean base qualities are 13.3 for the synthetic data and 34.8 for the actual data.

![Image](https://github.com/user-attachments/assets/ac830298-eb82-4a7b-be54-b460d9ecb165)
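For anyone who wants to check a comparable number on their own FASTQ, here is a minimal sketch of one way to compute a mean base quality. It is not necessarily how the figures above were produced. It assumes 4-line FASTQ records with Phred+33 quality strings, and it takes the plain arithmetic mean of the per-base Phred scores; some tools average error probabilities instead, which gives lower values. The function name `mean_baseq` is hypothetical.

```shell
# mean_baseq FILE
# Prints the arithmetic mean of per-base Phred scores across all reads.
# Assumes 4-line FASTQ records and Phred+33 encoding (assumption, not checked).
mean_baseq() {
  awk '
    BEGIN {
      # Lookup table: printable ASCII character -> Phred score
      for (i = 33; i < 127; i++) phred[sprintf("%c", i)] = i - 33
    }
    NR % 4 == 0 {
      # Every 4th line is the quality string of a record
      n = length($0)
      for (j = 1; j <= n; j++) {
        sum += phred[substr($0, j, 1)]
        total++
      }
    }
    END { if (total > 0) printf "%.1f\n", sum / total }
  ' "$1"
}
```

It would be called as `mean_baseq reads.fastq`, where `reads.fastq` stands in for the basecalled output file.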

Thank you in advance for your help.
