Using --isolate decreases Eukaryote assembly quality #741

xonq · 2021-04-10T13:29:49Z

xonq
Apr 10, 2021

I am assembling fungal genomes from 150 bp PE Illumina short reads. I noted that --isolate is recommended for "high-coverage multi-cell/isolate data"; however, when I compared runs with and without the flag, assembly quality was lower with --isolate by standard measurements (N50, contig count, largest contig). Furthermore, with --isolate a known gene cluster was not recovered on a single contig, whereas it was recovered on a single contig when I reran without the flag.

with --isolate (contigs > 1kb):

N50-1000BP:              3991
L50-1000BP:              2432
L50%-1000BP:             0.19828781084386465
LARGEST_CONTIG:          76404
CONTIGS-1000BP:          12265
ASSEMBLY_LEN-1000BP:     37223681 
GC-1000BP:               0.47354643098571875

without --isolate (contigs > 1kb):

N50-1000BP:              8551
L50-1000BP:              899
L50%-1000BP:             0.09553666312433581 
LARGEST_CONTIG:          202404
CONTIGS-1000BP:          9410
ASSEMBLY_LEN-1000BP:     40587032
GC-1000BP:               0.4733252936175136

I therefore have evidence from a biological standpoint (the gene cluster recovery) and the assembly statistics (which I understand could be falsely better) that --isolate was detrimental to my assembly quality. Why is it recommended then?
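
For readers comparing the two runs, the summary statistics above can be reproduced from a list of contig lengths. The following is a minimal Python sketch, not the script used in this thread: the function name `assembly_stats`, the cutoff semantics (contigs of at least `min_len` bp are kept), and the reading of `L50%` as L50 divided by the contig count are assumptions. The last one is at least consistent with the posted numbers (2432 / 12265 ≈ 0.1983 and 899 / 9410 ≈ 0.0955).

```python
from typing import Dict, List


def assembly_stats(contig_lengths: List[int], min_len: int = 1000) -> Dict[str, float]:
    """Compute N50, L50 and related summary statistics for an assembly.

    Assumption: contigs shorter than min_len are discarded before any
    statistic is computed, mirroring the "-1000BP" suffix in the report.
    """
    lengths = sorted((n for n in contig_lengths if n >= min_len), reverse=True)
    if not lengths:
        raise ValueError("no contigs at or above min_len")

    total = sum(lengths)
    running = 0
    n50, l50 = lengths[-1], len(lengths)
    # Walk from the longest contig down until half the assembly is covered.
    for rank, length in enumerate(lengths, start=1):
        running += length
        if 2 * running >= total:
            n50, l50 = length, rank
            break

    return {
        "N50": n50,                           # length of the contig that crosses 50%
        "L50": l50,                           # number of contigs needed to reach 50%
        "L50_fraction": l50 / len(lengths),   # assumed meaning of "L50%"
        "largest_contig": lengths[0],
        "contigs": len(lengths),
        "assembly_len": total,
    }


if __name__ == "__main__":
    # Toy input; the 500 bp contig is filtered out by min_len.
    print(assembly_stats([500, 1000, 2000, 3000, 4000]))
```

A higher N50 and lower L50 indicate a more contiguous assembly, but neither says anything about correctness, which is why the gene cluster check is the more convincing evidence.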

asl · 2021-04-10T14:46:26Z

asl
Apr 10, 2021
Maintainer

Judging from the assembly length, this does not look like a bacterial isolate dataset. At least, I'm unaware of any bacteria with a genome size of 40 Mbp.

Anyway, will you please post your spades.log files from these runs?

xonq · 2021-04-10T16:19:19Z

xonq
Apr 10, 2021
Author

@asl thank you for your reply. This is a fungal genome. If --isolate is intended specifically for bacteria, it would be helpful to make that clear in the software output.

I will post the spades.log files ASAP

xonq · 2021-04-11T21:50:32Z

xonq
Apr 11, 2021
Author

spades.log for --isolate
spades.log for no flag

cbird808 · 2021-04-22T17:41:46Z

cbird808
Apr 22, 2021

So are we or are we not supposed to use --isolate for 150bp PE data from large eukaryotic genomes with high depth of coverage from multicell DNA extractions?

Here are quotes from https://github.com/ablab/spades :
"If you have high-coverage data for bacterial/viral isolate or multi-cell organism, we highly recommend to use --isolate option."

"--isolate This flag is highly recommended for high-coverage isolate and multi-cell Illumina data; improves the assembly quality and running time. We also recommend to trim your reads prior to the assembly. More details can be found here. This option is not compatible with --only-error-correction or --careful options."

Judging from the assembly length, this does not look like a bacterial isolate dataset. At least, I'm unaware of any bacteria with a genome size of 40 Mbp.

Anyway, will you please post your spades.log files from these runs?

cbird808 · 2021-04-23T17:36:30Z

cbird808
Apr 23, 2021

I can confirm xonq's results. Removing the --isolate option yields assemblies with better summary statistics for fish with haploid (1n) genome sizes of 500-1000 Mb.

cbird808 · 2021-05-05T03:00:43Z

cbird808
May 5, 2021

I've run BUSCO on several assemblies of marine fishes, with and without the --isolate setting. The assemblies without --isolate score better.

asl · 2021-05-05T21:06:50Z

asl
May 5, 2021
Maintainer

Judging from @cbird808's datasets, the cause is low and uneven coverage combined with additional coverage filtering, which removes significant parts of the assembly. @xonq's case is similar: 140 bp reads, a custom maximum k-mer length of 121, and coverage filtering. This combination can easily create issues during assembly. The number of isolated reads that did not enter the assembly is enormous.
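
One way to see why 140 bp reads combined with k = 121 and coverage filtering can be fragile is the standard relation between base coverage and k-mer coverage, C_k = C * (L - k + 1) / L, where L is the read length. The sketch below is an illustration only: the function name `kmer_coverage` and the 50x base coverage are hypothetical (the thread does not report the actual depth), and nothing here claims that SPAdes applies its cutoff in exactly these terms.

```python
def kmer_coverage(base_coverage: float, read_len: int, k: int) -> float:
    """Expected k-mer coverage for error-free reads of a fixed length.

    Each read of length read_len contains read_len - k + 1 k-mers, so the
    k-mer coverage is the base coverage scaled by (read_len - k + 1) / read_len.
    """
    if not 0 < k <= read_len:
        raise ValueError("k must satisfy 0 < k <= read_len")
    return base_coverage * (read_len - k + 1) / read_len


if __name__ == "__main__":
    base = 50.0  # hypothetical base coverage, not taken from the thread
    for k in (21, 55, 99, 121):
        print(f"k={k:>3}: k-mer coverage ~ {kmer_coverage(base, 140, k):.1f}x")
```

With 140 bp reads and k = 121, only 20 of the 140 positions start a full k-mer, so k-mer coverage is about one seventh of base coverage. Regions that are already under-covered can then drop below a coverage threshold and be filtered out.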

cbird808 · 2021-05-05T22:37:12Z

cbird808
May 5, 2021

Thank you @asl for following up. My understanding is that for the type of genomes I'm working with (eukaryotic, Illumina PE 150, no genomic resources, non-model species) I should use neither the --isolate nor the --cov-cutoff option.

asl · 2021-05-05T22:55:58Z

asl
May 5, 2021
Maintainer

It's not that the euk genome is the problem, but rather the properties of input data: low and uneven coverage, etc. You may want to look into the possible problems during the sequencing / library preparation

xonq · 2021-05-07T20:29:53Z

xonq
May 7, 2021
Author

It's not that the euk genome is the problem, but rather the properties of input data: low and uneven coverage, etc. You may want to look into the possible problems during the sequencing / library preparation

If low and uneven coverage is the problem, then shouldn't the thresholds for the error output below be adjusted?

=== Error correction and assembling warnings:

0:34:06.592 4G / 6G WARN General (launcher.cpp : 172) Your data seems to have high uniform coverage depth. It is
strongly recommended to use --isolate option.

asl · 2021-05-07T21:35:16Z

asl
May 7, 2021
Maintainer

If low and uneven coverage is the problem, then shouldn't the thresholds for the error output below be adjusted?

Well, the problem is that there is no reliable way to assess post hoc whether the coverage is even. Moreover, the decisions made during the assembly might effectively "hide" the issues (at the expense of assembly quality, of course).
