Placements sensitive to outlier sequences?

I recently noticed that the placements of all query sequences can be affected by the presence of a small number of outlier sequences when placing onto a large tree. This problem appears to be especially affect the pendant length estimates and less so the edge placements. I noticed this issue when running different subsets of a dataset with [PICRUSt2](https://github.com/picrust/picrust2/wiki), which wraps EPA-NG.

I've reproduced this for query datasets of 323 sequences into a tree of 20,000 sequences. In this example only one query sequence differs between the test datasets (essentially in one case the query sequence doesn't align the reference).

In the original case the focal query sequence alignment looks like this:
```
>02905cfb87861c837dde629596d9272b
....-----GGTCTTGACATC--CCTCT-GACGAGTGAGTAATGT-CG--CTT--T-C--CC--T----T----C--GG---------G------G--C-A-G-A-GGAGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCTTTAGTAGCCAGCA----GTAAGA-----TGGGAACTCTAGAGAGACTGCCGGGGATAACCCGGAGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGACCAGGGCTACACACGTGCTACAATGGC-G-T-A-A-ACAGAGGGA-AGCGACCCTGTGAAGGTAAGCAAATCCCAAAAATAACGTCTCAGTTCGGATTGTAGTCTGCAACTCGACTACATGAAGCTGGAATCGCTAGTAATCGCGAATCAGA-ATGTCGCGGTGAATACGTTCCCGG----...
```

I swapped in random DNA for a different test file (with the same header) and the alignment looks like this (only a single base aligned):
```
>02905cfb87861c837dde629596d9272b
....--------------T---------------....
```

You can see in the jplace outputs of running EPA-NG that there are many differences in the placements, particularly in the pendant distances.

I think this issue might be related to issue https://github.com/Pbdas/epa-ng/issues/29. I didn't expect all placements to be affected by a single weird sequence. I wasn't able to reproduce this issue with the example datasets used in the EPA-NG paper and I'm thinking that maybe this issue only arises with large trees. Any insight would be greatly appreciated!

I ran EPA-NG with this command:

```
epa-ng --tree pro_ref.tre \
       --ref-msa ref_seqs_hmmalign.fasta \
       --query STUDY_SEQS \
       --chunk-size 5000 \
       -T 20 \
       -m pro_ref.model \
       -w epa_out \
       --filter-acc-lwr 0.99 \
       --filter-max 100
```

You can see the input and output files attached in this zipfile: [placement_test.zip](https://github.com/Pbdas/epa-ng/files/3970075/placement_test.zip).

STUDY_SEQS corresponds to `study_seqs_hmmalign_original.fasta` and `study_seqs_hmmalign_funky.fasta` for the original dataset and the dataset with the outlier sequence, respectively. The output jplace files are named the same way.


Thanks,

Gavin



Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Placements sensitive to outlier sequences? #33

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Placements sensitive to outlier sequences? #33

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions