Description
Hello, I was trying out the most recent version of the pipeline using bakta and compared the results to running with prokka on a set of CRAB sequences. Similar to a previous issue I brought up a few months ago, when the default aligner was changed from roary to panaroo, I found that the choice of gene annotator had a significant impact on the resulting SNP matrix and its interpretation.
Please find attached an Excel file that includes a comparison of the output matrices and core genome metrics.
Up to this point I have been using prokka and roary, so the first matrix is essentially the status quo from my point of view. To focus on one part of the matrix, S19-S23 are all within two SNPs of each other, fewer than 10 SNPs from a few other sequences included in the analysis, and no more than 51 SNPs from any other sequence.
In the second matrix (bakta/roary), S19-S23 now appear to be split into two subclusters and, more surprising to me, are now >1000 SNPs apart from all other sequences.
For the third matrix, I ran the same analysis with bakta/panaroo, since that is now the default annotator/aligner. This gives yet another slightly different interpretation. S19-S23 are no longer drastically different from the others, as they were with bakta/roary, but there are other differences, such as S22 no longer clustering with S19-S21 and S23.
The final matrix is generated by BugSeq’s refMLST method and appears to most closely resemble the prokka/roary matrix.
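In case it helps with reproducing the comparison, here is a minimal sketch of how clusters could be derived from a pairwise SNP matrix at a given threshold and then compared between two runs. This is not how the pipeline itself does it. The function name `snp_clusters`, the sample names, the distances, and the threshold of 50 are all illustrative placeholders, not values from the attached file.

```python
from itertools import combinations


def snp_clusters(samples, dist, threshold):
    """Single-linkage clustering: two samples end up in the same cluster
    if they are connected by a chain of pairs with distance <= threshold."""
    parent = {s: s for s in samples}

    def find(s):
        # Follow parent links to the cluster root, compressing the path.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for a, b in combinations(samples, 2):
        if dist[frozenset((a, b))] <= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for s in samples:
        groups.setdefault(find(s), set()).add(s)
    return {frozenset(g) for g in groups.values()}


# Placeholder data (hypothetical, NOT taken from the attached spreadsheet).
samples = ["A", "B", "C"]
matrix_1 = {
    frozenset(("A", "B")): 1,
    frozenset(("A", "C")): 40,
    frozenset(("B", "C")): 41,
}
matrix_2 = {
    frozenset(("A", "B")): 1,
    frozenset(("A", "C")): 1200,
    frozenset(("B", "C")): 1201,
}

threshold = 50  # arbitrary cutoff chosen for illustration only
clusters_1 = snp_clusters(samples, matrix_1, threshold)
clusters_2 = snp_clusters(samples, matrix_2, threshold)

# Clusters that appear in one run but not the other.
print("only in run 1:", [sorted(c) for c in clusters_1 - clusters_2])
print("only in run 2:", [sorted(c) for c in clusters_2 - clusters_1])
```

Running the same threshold over each annotator/aligner combination makes it easy to see which cluster assignments change between runs.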
I can share the FASTQ files if you are interested.
Thanks,
Wes