Skip to content

Placements sensitive to outlier sequences? #33

@gavinmdouglas

Description

@gavinmdouglas

I recently noticed that the placements of all query sequences can be affected by the presence of a small number of outlier sequences when placing onto a large tree. This problem appears to be especially affect the pendant length estimates and less so the edge placements. I noticed this issue when running different subsets of a dataset with PICRUSt2, which wraps EPA-NG.

I've reproduced this for query datasets of 323 sequences into a tree of 20,000 sequences. In this example only one query sequence differs between the test datasets (essentially in one case the query sequence doesn't align the reference).

In the original case the focal query sequence alignment looks like this:

>02905cfb87861c837dde629596d9272b
....-----GGTCTTGACATC--CCTCT-GACGAGTGAGTAATGT-CG--CTT--T-C--CC--T----T----C--GG---------G------G--C-A-G-A-GGAGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCTTTAGTAGCCAGCA----GTAAGA-----TGGGAACTCTAGAGAGACTGCCGGGGATAACCCGGAGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGACCAGGGCTACACACGTGCTACAATGGC-G-T-A-A-ACAGAGGGA-AGCGACCCTGTGAAGGTAAGCAAATCCCAAAAATAACGTCTCAGTTCGGATTGTAGTCTGCAACTCGACTACATGAAGCTGGAATCGCTAGTAATCGCGAATCAGA-ATGTCGCGGTGAATACGTTCCCGG----...

I swapped in random DNA for a different test file (with the same header) and the alignment looks like this (only a single base aligned):

>02905cfb87861c837dde629596d9272b
....--------------T---------------....

You can see in the jplace outputs of running EPA-NG that there are many differences in the placements, particularly in the pendant distances.

I think this issue might be related to issue #29. I didn't expect all placements to be affected by a single weird sequence. I wasn't able to reproduce this issue with the example datasets used in the EPA-NG paper and I'm thinking that maybe this issue only arises with large trees. Any insight would be greatly appreciated!

I ran EPA-NG with this command:

epa-ng --tree pro_ref.tre \
       --ref-msa ref_seqs_hmmalign.fasta \
       --query STUDY_SEQS \
       --chunk-size 5000 \
       -T 20 \
       -m pro_ref.model \
       -w epa_out \
       --filter-acc-lwr 0.99 \
       --filter-max 100

You can see the input and output files attached in this zipfile: placement_test.zip.

STUDY_SEQS corresponds to study_seqs_hmmalign_original.fasta and study_seqs_hmmalign_funky.fasta for the original dataset and the dataset with the outlier sequence, respectively. The output jplace files are named the same way.

Thanks,

Gavin

Metadata

Metadata

Assignees

Labels

No labels
No labels

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions