Simplify a tree sequence down so as to collapse clonal lineages #1220

hyanwong · 2021-02-24T14:29:14Z

hyanwong
Feb 24, 2021
Maintainer

@a-ignatieva, who has started to investigate using tree sequences to look at SARS-CoV-2, pointed out something useful that she (and others) might want to do: "collapse" all the descendants of an MRCA, as long as the genomes descending from that MRCA are all clonal (i.e. have undergone no detectable recombination).

This would be quite easy using simplify(identified_clonal_ancestor_nodes), but first we need to identify which nodes are the clonal ancestors. What's the easiest or quickest way to do this?

An alternative requirement might be to identify ancestral nodes for which all the descendant samples are the same (i.e. there might have been recombination between the descendant lineages but not outside, defining what we might call a "monophyletic" clade in a tree sequence). I wonder if there's a nice way to identify all these nodes?
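To make the "same descendant samples everywhere" idea concrete, here is a minimal plain-Python sketch. It does not use tskit: edges are assumed to be `(left, right, parent, child)` tuples, and the function name `constant_sample_clades` and the example data are hypothetical, chosen only for illustration.

```python
def constant_sample_clades(edges, samples, sequence_length):
    """Map each node to its descendant sample set, if that set is identical in every tree."""
    nodes = set(samples)
    breakpoints = {0.0, sequence_length}
    for left, right, parent, child in edges:
        nodes.update((parent, child))
        breakpoints.update((left, right))
    breakpoints = sorted(breakpoints)

    # For every node, collect the distinct descendant sample sets seen across trees
    seen = {u: set() for u in nodes}
    for left, right in zip(breakpoints[:-1], breakpoints[1:]):
        # Build the child lists for the tree covering [left, right)
        children = {u: [] for u in nodes}
        for e_left, e_right, parent, child in edges:
            if e_left <= left and right <= e_right:
                children[parent].append(child)

        def samples_below(u):
            found = {u} if u in samples else set()
            for c in children[u]:
                found |= samples_below(c)
            return found

        for u in nodes:
            seen[u].add(frozenset(samples_below(u)))

    # Keep nodes with exactly one, non-empty, descendant sample set
    return {u: next(iter(s)) for u, s in seen.items() if len(s) == 1 and next(iter(s))}


# Hypothetical example: sample 3 moves from parent 6 to parent 7 at position 0.5
EDGES = [
    (0.0, 1.0, 5, 0), (0.0, 1.0, 5, 1), (0.0, 1.0, 6, 5), (0.0, 1.0, 6, 2),
    (0.0, 0.5, 6, 3), (0.0, 1.0, 7, 6), (0.0, 1.0, 7, 4), (0.5, 1.0, 7, 3),
]
clades = constant_sample_clades(EDGES, samples={0, 1, 2, 3, 4}, sequence_length=1.0)
print(sorted(clades))
```

Under this looser definition, a node above a recombination event can still qualify as long as the set of samples beneath it never changes, which is what distinguishes it from the stricter clonal requirement.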

Answered by hyanwong

Feb 4, 2023


jeromekelleher · 2021-02-24T14:37:10Z

jeromekelleher
Feb 24, 2021
Maintainer

Good question! I wonder whether there's some incremental algorithm where we work from left to right? I.e., all nodes start as clonal, and then as we work through the edge_diffs we mark a node as non-clonal if it is a parent in an edge diff, or ancestral to such a parent?

Must be something like that, right?

4 replies

hyanwong Feb 24, 2021
Maintainer Author

Here's another way I thought we could do it. I'm not sure if the logic is sound, though: (1) we squash edges, (2) we identify all full-length edges, (3) we identify all nodes which have a non-full-length descendant edge, (4) we remove all edges that refer to these nodes, (5) we simplify. Then (I think) the roots of any remaining trees are the clonal ancestors. I'm not sure how efficient this is, but it's nice to use existing tools.
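As a rough illustration of why steps (1) and (2) go together, here is a plain-Python sketch of squashing followed by picking out the full-length edges. It does not use tskit: edges are assumed to be `(left, right, parent, child)` tuples, `squash_edges` is a hypothetical helper name, and steps (3) to (5) are not covered.

```python
from itertools import groupby


def squash_edges(edges):
    """Merge edges with the same (parent, child) whose intervals touch end to end."""
    squashed = []
    ordered = sorted(edges, key=lambda e: (e[2], e[3], e[0]))
    for (parent, child), group in groupby(ordered, key=lambda e: (e[2], e[3])):
        left, right = None, None
        for e_left, e_right, _, _ in group:
            if right is not None and e_left == right:
                # Adjacent to the interval built so far: extend it
                right = e_right
            else:
                # Gap (or first edge in the group): flush and start a new interval
                if right is not None:
                    squashed.append((left, right, parent, child))
                left, right = e_left, e_right
        squashed.append((left, right, parent, child))
    return squashed


# Hypothetical data: the 5->0 edge is stored in two adjacent pieces,
# while the 5->1 edge genuinely has a gap between 0.4 and 0.6
SEQUENCE_LENGTH = 1.0
edges = [
    (0.0, 0.5, 5, 0), (0.5, 1.0, 5, 0),
    (0.0, 0.4, 5, 1), (0.6, 1.0, 5, 1),
]
squashed = squash_edges(edges)
full_length = [
    (parent, child)
    for left, right, parent, child in squashed
    if left == 0.0 and right == SEQUENCE_LENGTH
]
print(squashed)
print(full_length)
```

Without the squash step, the 5->0 relationship would look like two partial edges and would be wrongly treated as non-full-length.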

jeromekelleher Feb 24, 2021
Maintainer

Yes, I think that would work. We're looking for the roots of the subtrees that don't change along the full length of the sequence.

jeromekelleher Feb 24, 2021
Maintainer

Could just delete all non-full edges, and simplify?

hyanwong Feb 24, 2021
Maintainer Author

I thought about that, but there might be a node which has one full edge descending from it, and another half-length edge descending from it too. I don't think we could consider that node as a (uniquely) clonal ancestor? Especially if such a node is nested within an otherwise clonal tree.

petrelharp · 2021-02-24T15:22:11Z

petrelharp
Feb 24, 2021
Maintainer

We just need to ask which nodes have only one parent in the edge table, right? So, something like:

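# collect the set of distinct parents recorded for each node
parents = {x: set() for x in range(tables.nodes.num_rows)}
for e in tables.edges:
    parents[e.child].add(e.parent)

# nodes with exactly one distinct parent anywhere in the edge table
clonal = np.where([len(parents[x]) == 1 for x in range(tables.nodes.num_rows)])[0]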

I happen to have a utility for that in pyslim:

np.where(  pyslim.util.unique_labels_by_group(tables.edges.child, tables.edges.parent) )[0]

3 replies

hyanwong Feb 24, 2021
Maintainer Author

Doh, of course! Silly me. Nice one @petrelharp

hyanwong Feb 24, 2021
Maintainer Author

Although what if a node has a single parent in the edge table, but that only spans (say) half the genome? That can happen, right?

petrelharp Feb 24, 2021
Maintainer

Oh right good point. I guess we also want those that are entirely represented in the edge table, so I think this does it:

np.where(np.logical_and(
  pyslim.util.unique_labels_by_group(tables.edges.child, tables.edges.parent),
  np.bincount(1+tables.edges.child, tables.edges.right - tables.edges.left, minlength=tables.nodes.num_rows+1)[1:]
     == tables.sequence_length
) )[0]
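For readers without pyslim to hand, the same two conditions (a single distinct parent, and edges to it covering the whole sequence) can be sketched in plain Python. Here edges are assumed to be `(left, right, parent, child)` tuples, and the function name `single_parent_full_span` and the example data are hypothetical.

```python
def single_parent_full_span(edges, num_nodes, sequence_length):
    """Nodes with one distinct parent whose edges to it cover the whole sequence."""
    parents = [set() for _ in range(num_nodes)]
    span = [0.0] * num_nodes
    for left, right, parent, child in edges:
        parents[child].add(parent)
        span[child] += right - left  # total length over which the node has a parent
    return [
        u for u in range(num_nodes)
        if len(parents[u]) == 1 and span[u] == sequence_length
    ]


# Hypothetical example: sample 3 moves from parent 6 to parent 7 at position 0.5
EDGES = [
    (0.0, 1.0, 5, 0), (0.0, 1.0, 5, 1), (0.0, 1.0, 6, 5), (0.0, 1.0, 6, 2),
    (0.0, 0.5, 6, 3), (0.0, 1.0, 7, 6), (0.0, 1.0, 7, 4), (0.5, 1.0, 7, 3),
]
clonal_candidates = single_parent_full_span(EDGES, num_nodes=8, sequence_length=1.0)
print(clonal_candidates)
```

Note that this is a statement about the edge above each node: a node can pass the test even if one of its children changes parent partway along the sequence, as node 6 does here.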

hyanwong · 2021-02-24T15:53:41Z

hyanwong
Feb 24, 2021
Maintainer Author

Is this the sort of thing we eventually want as a function in tskit, or is a (tested?) recipe here sufficient in this case?

2 replies

petrelharp Feb 24, 2021
Maintainer

I think we'd want to have a reasonably common use case and a good name for it to be worth the trouble. I'm not sure we have either yet.

hyanwong Feb 24, 2021
Maintainer Author

Agreed. These discussion forums are a good place to template code - the only thing is that I'm not sure how we would do or store any bits of code to test solutions suggested here.

jeromekelleher · 2021-02-24T18:19:07Z

jeromekelleher
Feb 24, 2021
Maintainer

I think this does the job:

import numpy as np
import msprime

ts = msprime.sim_ancestry(5,
    recombination_rate=0.002, sequence_length=100, random_seed=53)

print("ORIGINAL TS")
print(ts.draw_text())

# keep only the edges that span the entire sequence
tables = ts.dump_tables()
index = np.logical_and(
    tables.edges.left == 0, tables.edges.right == tables.sequence_length)
tables.edges.set_columns(
    left=tables.edges.left[index],
    right=tables.edges.right[index],
    parent=tables.edges.parent[index],
    child=tables.edges.child[index])

print("ONLY FULL SPAN EDGES")
ts2 = tables.tree_sequence()
print(ts2.draw_text())

print("SIMPLIFIED (but different IDs)")
ts3, node_map = ts2.simplify(map_nodes=True)
# map simplified node IDs back to the IDs used in the original tree sequence
reverse_map = {mapped: j for j, mapped in enumerate(node_map) if mapped != -1}
print(ts3.draw_text())

tree = ts3.first()
mrcas = [root for root in tree.roots]
original_nodes = [reverse_map[root] for root in mrcas]

print("TRIMMED")
ts_trimmed = ts.simplify(original_nodes)
print(ts_trimmed.draw_text())

Giving:

ORIGINAL TS
4.40┊            19       ┊            19       ┊                                              
    ┊         ┏━━━┻━━━┓   ┊         ┏━━━┻━━━━┓  ┊                                              
3.25┊         ┃      18   ┊         ┃        ┃  ┊                                              
    ┊         ┃      ┏┻━┓ ┊         ┃        ┃  ┊                                              
1.01┊        17      ┃  ┃ ┊        17        ┃  ┊                                              
    ┊     ┏━━━┻━━┓   ┃  ┃ ┊     ┏━━━┻━━━┓    ┃  ┊                                              
0.63┊    16      ┃   ┃  ┃ ┊    16       ┃    ┃  ┊                                              
    ┊   ┏━┻━━┓   ┃   ┃  ┃ ┊   ┏━┻━━┓    ┃    ┃  ┊                                              
0.54┊   ┃    ┃   ┃   ┃  ┃ ┊   ┃    ┃   15    ┃  ┊                                              
    ┊   ┃    ┃   ┃   ┃  ┃ ┊   ┃    ┃   ┏┻━┓  ┃  ┊                                              
0.39┊  14    ┃   ┃   ┃  ┃ ┊  14    ┃   ┃  ┃  ┃  ┊                                              
    ┊  ┏┻━┓  ┃   ┃   ┃  ┃ ┊  ┏┻━┓  ┃   ┃  ┃  ┃  ┊                                              
0.21┊ 13  ┃  ┃   ┃   ┃  ┃ ┊ 13  ┃  ┃   ┃  ┃  ┃  ┊                                              
    ┊ ┏┻┓ ┃  ┃   ┃   ┃  ┃ ┊ ┏┻┓ ┃  ┃   ┃  ┃  ┃  ┊                                              
0.14┊ ┃ ┃ ┃ 12   ┃   ┃  ┃ ┊ ┃ ┃ ┃ 12   ┃  ┃  ┃  ┊                                              
    ┊ ┃ ┃ ┃ ┏┻┓  ┃   ┃  ┃ ┊ ┃ ┃ ┃ ┏┻┓  ┃  ┃  ┃  ┊                                              
0.05┊ ┃ ┃ ┃ ┃ ┃  ┃  11  ┃ ┊ ┃ ┃ ┃ ┃ ┃  ┃  ┃ 11  ┊                                              
    ┊ ┃ ┃ ┃ ┃ ┃  ┃  ┏┻┓ ┃ ┊ ┃ ┃ ┃ ┃ ┃  ┃  ┃ ┏┻┓ ┊                                              
0.05┊ ┃ ┃ ┃ ┃ ┃ 10  ┃ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ 10  ┃ ┃ ┃ ┊                                              
    ┊ ┃ ┃ ┃ ┃ ┃ ┏┻┓ ┃ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┏┻┓ ┃ ┃ ┃ ┊                                              
0.00┊ 0 9 3 5 6 1 4 2 7 8 ┊ 0 9 3 5 6 1 4 8 2 7 ┊                                              
  0.00                  30.00                100.00                                            
                                               
ONLY FULL SPAN EDGES                           
4.40┊    19               ┊   
    ┊     ┃               ┊   
3.25┊     ┃               ┊   
    ┊     ┃               ┊   
1.01┊    17               ┊   
    ┊     ┃               ┊   
0.63┊    16               ┊   
    ┊   ┏━┻━━┓            ┊   
0.54┊   ┃    ┃            ┊   
    ┊   ┃    ┃            ┊   
0.39┊  14    ┃            ┊                    
    ┊  ┏┻━┓  ┃            ┊   
0.21┊ 13  ┃  ┃            ┊   
    ┊ ┏┻┓ ┃  ┃            ┊   
0.14┊ ┃ ┃ ┃ 12            ┊   
    ┊ ┃ ┃ ┃ ┏┻┓           ┊   
0.05┊ ┃ ┃ ┃ ┃ ┃     11    ┊   
    ┊ ┃ ┃ ┃ ┃ ┃     ┏┻┓   ┊   
0.05┊ ┃ ┃ ┃ ┃ ┃ 10  ┃ ┃   ┊   
    ┊ ┃ ┃ ┃ ┃ ┃ ┏┻┓ ┃ ┃   ┊   
0.00┊ 0 9 3 5 6 1 4 2 7 8 ┊   
  0.00                 100.00 
SIMPLIFIED                                                                                     
0.63┊    15               ┊                                                                    
    ┊   ┏━┻━━┓            ┊                                                                    
0.39┊  14    ┃            ┊                                                                    
    ┊  ┏┻━┓  ┃            ┊                    
0.21┊ 13  ┃  ┃            ┊                    
    ┊ ┏┻┓ ┃  ┃            ┊   
0.14┊ ┃ ┃ ┃ 12            ┊   
    ┊ ┃ ┃ ┃ ┏┻┓           ┊   
0.05┊ ┃ ┃ ┃ ┃ ┃     11    ┊   
    ┊ ┃ ┃ ┃ ┃ ┃     ┏┻┓   ┊   
0.05┊ ┃ ┃ ┃ ┃ ┃ 10  ┃ ┃   ┊   
    ┊ ┃ ┃ ┃ ┃ ┃ ┏┻┓ ┃ ┃   ┊   
0.00┊ 0 9 3 5 6 1 4 2 7 8 ┊   
  0.00                 100.00 
                                               
TRIMMED                                        
4.40┊    7    ┊     7   ┊     
    ┊  ┏━┻━┓  ┊   ┏━┻━┓ ┊     
3.25┊  ┃   6  ┊   ┃   ┃ ┊     
    ┊  ┃  ┏┻┓ ┊   ┃   ┃ ┊     
1.01┊  5  ┃ ┃ ┊   5   ┃ ┊     
    ┊ ┏┻┓ ┃ ┃ ┊  ┏┻━┓ ┃ ┊     
0.63┊ ┃ 2 ┃ ┃ ┊  ┃  2 ┃ ┊     
    ┊ ┃   ┃ ┃ ┊  ┃    ┃ ┊     
0.54┊ ┃   ┃ ┃ ┊  4    ┃ ┊     
    ┊ ┃   ┃ ┃ ┊ ┏┻┓   ┃ ┊     
0.05┊ ┃   1 ┃ ┊ ┃ ┃   1 ┊     
    ┊ ┃     ┃ ┊ ┃ ┃     ┊                      
0.05┊ 0     ┃ ┊ 0 ┃     ┊                      
    ┊       ┃ ┊   ┃     ┊     
0.00┊       3 ┊   3     ┊     
  0.00      30.00    100.00

Right? Surely removing all the non-full span edges is what we want, by definition. The rest is just fiddling to get rid of some annoying topology and mapping of node IDs.

3 replies

hyanwong Feb 24, 2021
Maintainer Author

I'm not sure removing all the non-full span edges quite does it (see my comment above)? It's true that the resulting trees will all represent clonal replication, but we may have removed some sample nodes from within what was previously a tree. I think for what @a-ignatieva needs we need to also check that the resulting trees contain all the same sample nodes as in the original TS. If they don't there's some fiddling to do to find potentially appropriate clonal subtrees. I'll see if I can come up with an example, if it's not clear.

hyanwong Feb 24, 2021
Maintainer Author

Here, for instance:

import io
import tskit

nodes = io.StringIO(
    """\
id  is_sample   time    population  individual  metadata
0   1   0.000000    0   -1
1   1   0.000000    0   -1
2   1   0.000000    0   -1
3   1   0.000000    0   -1
4   1   0.000000    0   -1
5   0   1.000000    0   -1
6   0   2.000000    0   -1
7   0   3.000000    0   -1
""")

edges = io.StringIO(
    """\
left    right   parent  child
0.000000    1.000000    5  0
0.000000    1.000000    5  1
0.000000    1.000000    6  5
0.000000    1.000000    6  2
0.000000    0.500000    6  3
0.000000    1.000000    7  6
0.000000    1.000000    7  4
0.500000    1.000000    7  3
"""
)
ts = tskit.load_text(
    nodes, edges, sequence_length=1, strict=False, base64_metadata=False
)

print(ts.draw_text())

giving:

3.00┊       7   ┊      7    ┊  
    ┊     ┏━┻━┓ ┊   ┏━━┻┳━┓ ┊  
2.00┊     6   ┃ ┊   6   ┃ ┃ ┊  
    ┊  ┏━━╋━┓ ┃ ┊  ┏┻━┓ ┃ ┃ ┊  
1.00┊  5  ┃ ┃ ┃ ┊  5  ┃ ┃ ┃ ┊  
    ┊ ┏┻┓ ┃ ┃ ┃ ┊ ┏┻┓ ┃ ┃ ┃ ┊  
0.00┊ 0 1 2 3 4 ┊ 0 1 2 3 4 ┊  
  0.00        0.50        1.00

for which the "clonal" tree is (surely?) everything below node 5. The "keep full edges" approach would create a "clonal tree" of nodes (0,1,2,4), excluding 3, and place the root at node 7.
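The effect described here can be checked with a small plain-Python sketch that mimics the "keep full edges" step on the edge table above. It does not call tskit; the helper names `full_span_forest` and `leaves_below` are hypothetical.

```python
def full_span_forest(edges, sequence_length):
    """Keep only full-span edges; return the child lists and the roots that remain."""
    children = {}
    has_parent = set()
    for left, right, parent, child in edges:
        if left == 0.0 and right == sequence_length:
            children.setdefault(parent, []).append(child)
            has_parent.add(child)
    # A root here is a node that still has children but no longer has a parent
    roots = [p for p in children if p not in has_parent]
    return children, roots


def leaves_below(children, u):
    """Tips reachable from u using only the retained edges."""
    if u not in children:
        return {u}
    found = set()
    for c in children[u]:
        found |= leaves_below(children, c)
    return found


# The edge table from the example above, as (left, right, parent, child)
EDGES = [
    (0.0, 1.0, 5, 0), (0.0, 1.0, 5, 1), (0.0, 1.0, 6, 5), (0.0, 1.0, 6, 2),
    (0.0, 0.5, 6, 3), (0.0, 1.0, 7, 6), (0.0, 1.0, 7, 4), (0.5, 1.0, 7, 3),
]
children, roots = full_span_forest(EDGES, sequence_length=1.0)
print(roots, sorted(leaves_below(children, roots[0])))
```

Both of sample 3's edges are partial, so dropping them silently removes 3 from the tree while leaving node 7 as an apparently clonal root.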

jeromekelleher Feb 25, 2021
Maintainer

Ah, I see, thanks @hyanwong. Nice example. Looks like we're basically in a position to give a full solution here though, putting together the various snippets above?

hyanwong · 2021-02-26T14:50:22Z

hyanwong
Feb 26, 2021
Maintainer Author

I think the following is the right logic for the left-to-right algorithm that @jeromekelleher suggested. Probably worth checking against @petrelharp 's suggestion?

import numpy as np

def clonal_mrcas(ts):
    # full length nodes without recombination will not be a parent in the edges in or out, apart from at the start
    for interval, edges_out, edges_in in ts.edge_diffs():
        if interval.left==0:
            is_full_length_clonal = np.ones(ts.num_nodes, dtype=bool)  # all nodes start as clonal
        else:
            for e in edges_in:
                is_full_length_clonal[e.parent] = False
        for e in edges_out:
            is_full_length_clonal[e.parent] = False
    clonal_nodes = np.where(is_full_length_clonal)[0]

    tables = ts.dump_tables()
    edges = tables.edges
    # only keep edges where both the child and the parent are full-length clonal nodes
    keep_edge = np.logical_and(np.isin(edges.child, clonal_nodes),  np.isin(edges.parent, clonal_nodes))
    tables.edges.set_columns(
            left = tables.edges.left[keep_edge],
            right=tables.edges.right[keep_edge],
            parent=tables.edges.parent[keep_edge],
            child=tables.edges.child[keep_edge],
        )
    ts = tables.tree_sequence()
    print("Debug: show the clonal subtrees", ts.draw_text(), sep="\n")
    assert ts.num_trees == 1
    return ts.first().roots

It works OK on my previous example. Here's another:

ts = msprime.simulate(8, recombination_rate=1, random_seed=321)
print("Original ts", ts.draw_text(), sep="\n")

clonal_nodes = clonal_mrcas(ts)
print(f"Subtrees under nodes {clonal_nodes} are clonal")
reduced_ts, node_map = ts.simplify(clonal_nodes, map_nodes=True)

print(
    "The tree seq with clonal subtrees removed",
    "(drawn using the original labels)",
    reduced_ts.draw_text(node_labels = {n:f"{i}" for i, n in enumerate(node_map)}),
    sep="\n",
)

Giving

Original ts
4.21┊                 ┊                 ┊                 ┊                 ┊          18     ┊  
    ┊                 ┊                 ┊                 ┊                 ┊       ┏━━━┻━━┓  ┊  
4.09┊                 ┊                 ┊          17     ┊          17     ┊       ┃      ┃  ┊  
    ┊                 ┊                 ┊       ┏━━━┻━━┓  ┊       ┏━━━┻━━┓  ┊       ┃      ┃  ┊  
2.59┊          16     ┊          16     ┊       ┃      ┃  ┊       ┃      ┃  ┊       ┃      ┃  ┊  
    ┊       ┏━━━┻━━┓  ┊       ┏━━━┻━━┓  ┊       ┃      ┃  ┊       ┃      ┃  ┊       ┃      ┃  ┊  
1.46┊       ┃      ┃  ┊       ┃      ┃  ┊       ┃      ┃  ┊      15      ┃  ┊      15      ┃  ┊  
    ┊       ┃      ┃  ┊       ┃      ┃  ┊       ┃      ┃  ┊    ┏━━┻━━┓   ┃  ┊    ┏━━┻━━┓   ┃  ┊  
1.27┊      14      ┃  ┊       ┃      ┃  ┊       ┃      ┃  ┊    ┃     ┃   ┃  ┊    ┃     ┃   ┃  ┊  
    ┊    ┏━━┻━━┓   ┃  ┊       ┃      ┃  ┊       ┃      ┃  ┊    ┃     ┃   ┃  ┊    ┃     ┃   ┃  ┊  
0.96┊    ┃     ┃   ┃  ┊      13      ┃  ┊      13      ┃  ┊    ┃     ┃   ┃  ┊    ┃     ┃   ┃  ┊  
    ┊    ┃     ┃   ┃  ┊    ┏━━┻━━┓   ┃  ┊    ┏━━┻━━┓   ┃  ┊    ┃     ┃   ┃  ┊    ┃     ┃   ┃  ┊  
0.64┊   12     ┃   ┃  ┊   12     ┃   ┃  ┊   12     ┃   ┃  ┊   12     ┃   ┃  ┊   12     ┃   ┃  ┊  
    ┊  ┏━┻━┓   ┃   ┃  ┊  ┏━┻━┓   ┃   ┃  ┊  ┏━┻━┓   ┃   ┃  ┊  ┏━┻━┓   ┃   ┃  ┊  ┏━┻━┓   ┃   ┃  ┊  
0.53┊ 11   ┃   ┃   ┃  ┊ 11   ┃   ┃   ┃  ┊ 11   ┃   ┃   ┃  ┊ 11   ┃   ┃   ┃  ┊ 11   ┃   ┃   ┃  ┊  
    ┊ ┏┻┓  ┃   ┃   ┃  ┊ ┏┻┓  ┃   ┃   ┃  ┊ ┏┻┓  ┃   ┃   ┃  ┊ ┏┻┓  ┃   ┃   ┃  ┊ ┏┻┓  ┃   ┃   ┃  ┊  
0.21┊ ┃ ┃  ┃  10   ┃  ┊ ┃ ┃  ┃  10   ┃  ┊ ┃ ┃  ┃  10   ┃  ┊ ┃ ┃  ┃   ┃  10  ┊ ┃ ┃  ┃   ┃  10  ┊  
    ┊ ┃ ┃  ┃  ┏┻┓  ┃  ┊ ┃ ┃  ┃  ┏┻┓  ┃  ┊ ┃ ┃  ┃  ┏┻┓  ┃  ┊ ┃ ┃  ┃   ┃  ┏┻┓ ┊ ┃ ┃  ┃   ┃  ┏┻┓ ┊  
0.18┊ ┃ ┃  ┃  ┃ ┃  9  ┊ ┃ ┃  ┃  ┃ ┃  9  ┊ ┃ ┃  ┃  ┃ ┃  9  ┊ ┃ ┃  ┃   9  ┃ ┃ ┊ ┃ ┃  ┃   9  ┃ ┃ ┊  
    ┊ ┃ ┃  ┃  ┃ ┃ ┏┻┓ ┊ ┃ ┃  ┃  ┃ ┃ ┏┻┓ ┊ ┃ ┃  ┃  ┃ ┃ ┏┻┓ ┊ ┃ ┃  ┃  ┏┻┓ ┃ ┃ ┊ ┃ ┃  ┃  ┏┻┓ ┃ ┃ ┊  
0.12┊ ┃ ┃  8  ┃ ┃ ┃ ┃ ┊ ┃ ┃  8  ┃ ┃ ┃ ┃ ┊ ┃ ┃  8  ┃ ┃ ┃ ┃ ┊ ┃ ┃  8  ┃ ┃ ┃ ┃ ┊ ┃ ┃  8  ┃ ┃ ┃ ┃ ┊  
    ┊ ┃ ┃ ┏┻┓ ┃ ┃ ┃ ┃ ┊ ┃ ┃ ┏┻┓ ┃ ┃ ┃ ┃ ┊ ┃ ┃ ┏┻┓ ┃ ┃ ┃ ┃ ┊ ┃ ┃ ┏┻┓ ┃ ┃ ┃ ┃ ┊ ┃ ┃ ┏┻┓ ┃ ┃ ┃ ┃ ┊  
0.00┊ 0 3 1 7 2 5 4 6 ┊ 0 3 1 7 2 5 4 6 ┊ 0 3 1 7 2 5 4 6 ┊ 0 3 1 7 4 6 2 5 ┊ 0 3 1 7 4 6 2 5 ┊  
  0.00              0.07              0.61              0.74              0.92              1.00 

Debug: show the clonal subtrees
4.21┊                 ┊  
    ┊                 ┊  
4.09┊                 ┊  
    ┊                 ┊  
2.59┊                 ┊  
    ┊                 ┊  
1.46┊                 ┊  
    ┊                 ┊  
1.27┊                 ┊  
    ┊                 ┊  
0.96┊                 ┊  
    ┊                 ┊  
0.64┊   12            ┊  
    ┊  ┏━┻━┓          ┊  
0.53┊ 11   ┃          ┊  
    ┊ ┏┻┓  ┃          ┊  
0.21┊ ┃ ┃  ┃  10      ┊  
    ┊ ┃ ┃  ┃  ┏┻┓     ┊  
0.18┊ ┃ ┃  ┃  ┃ ┃  9  ┊  
    ┊ ┃ ┃  ┃  ┃ ┃ ┏┻┓ ┊  
0.12┊ ┃ ┃  8  ┃ ┃ ┃ ┃ ┊  
    ┊ ┃ ┃ ┏┻┓ ┃ ┃ ┃ ┃ ┊  
0.00┊ 0 3 1 7 2 5 4 6 ┊  
  0.00              1.00 

Subtrees under nodes [12, 10, 9] are clonal
The tree seq with clonal subtrees removed
(drawn using the original labels)
4.21┊         ┊         ┊         ┊         ┊    18   ┊  
    ┊         ┊         ┊         ┊         ┊   ┏━┻━┓ ┊  
4.09┊         ┊         ┊    17   ┊    17   ┊   ┃   ┃ ┊  
    ┊         ┊         ┊   ┏━┻━┓ ┊   ┏━┻━┓ ┊   ┃   ┃ ┊  
2.59┊    16   ┊    16   ┊   ┃   ┃ ┊   ┃   ┃ ┊   ┃   ┃ ┊  
    ┊   ┏━┻━┓ ┊   ┏━┻━┓ ┊   ┃   ┃ ┊   ┃   ┃ ┊   ┃   ┃ ┊  
1.46┊   ┃   ┃ ┊   ┃   ┃ ┊   ┃   ┃ ┊  15   ┃ ┊  15   ┃ ┊  
    ┊   ┃   ┃ ┊   ┃   ┃ ┊   ┃   ┃ ┊  ┏┻┓  ┃ ┊  ┏┻┓  ┃ ┊  
1.27┊  14   ┃ ┊   ┃   ┃ ┊   ┃   ┃ ┊  ┃ ┃  ┃ ┊  ┃ ┃  ┃ ┊  
    ┊  ┏┻━┓ ┃ ┊   ┃   ┃ ┊   ┃   ┃ ┊  ┃ ┃  ┃ ┊  ┃ ┃  ┃ ┊  
0.96┊  ┃  ┃ ┃ ┊  13   ┃ ┊  13   ┃ ┊  ┃ ┃  ┃ ┊  ┃ ┃  ┃ ┊  
    ┊  ┃  ┃ ┃ ┊  ┏┻━┓ ┃ ┊  ┏┻━┓ ┃ ┊  ┃ ┃  ┃ ┊  ┃ ┃  ┃ ┊  
0.64┊ 12  ┃ ┃ ┊ 12  ┃ ┃ ┊ 12  ┃ ┃ ┊ 12 ┃  ┃ ┊ 12 ┃  ┃ ┊  
    ┊     ┃ ┃ ┊     ┃ ┃ ┊     ┃ ┃ ┊    ┃  ┃ ┊    ┃  ┃ ┊  
0.21┊    10 ┃ ┊    10 ┃ ┊    10 ┃ ┊    ┃ 10 ┊    ┃ 10 ┊  
    ┊       ┃ ┊       ┃ ┊       ┃ ┊    ┃    ┊    ┃    ┊  
0.18┊       9 ┊       9 ┊       9 ┊    9    ┊    9    ┊  
  0.00      0.07      0.61      0.74      0.92      1.00

5 replies

hyanwong Feb 26, 2021
Maintainer Author

Oh, and there's a really nice way to plot the clonal subtrees in a different colour using my SVG styling. E.g. in a Jupyter notebook

from IPython.display import SVG
SVG(ts.draw_svg(style="".join(f".n{n} .node .edge {{stroke:red}}" for n in clonal_nodes)))

jeromekelleher Feb 26, 2021
Maintainer

(I edited the examples above to put the "python" after the backticks in the code examples @hyanwong, so they render as Python)

jeromekelleher Feb 26, 2021
Maintainer

Looks great @hyanwong, but can you update to get rid of the monkey patch, please? That's going to really confuse people, and it's not clear #1221 is actually the right approach. Just put in the set_columns and let's not worry about edge metadata (which won't be a problem in the vast majority of cases).

jeromekelleher Feb 27, 2021
Maintainer

What do you think @a-ignatieva, does this do what you're looking for?

a-ignatieva Feb 27, 2021
Collaborator

Yes that's perfect! Thanks!

hyanwong · 2023-02-04T12:18:41Z

hyanwong
Feb 4, 2023
Maintainer Author

Gah, I'm not sure the selected answer here actually works. Here's a counterexample:

tables = tskit.Tree.generate_comb(5).tree_sequence.dump_tables()
e = np.where(tables.edges.child == 2)[0][0]
tables.edges[e] = tables.edges[e].replace(right=0.5)
tables.sort()
ts = tables.tree_sequence()

print(ts.draw_text())

4.00┊   8       ┊   8       ┊  
    ┊ ┏━┻━┓     ┊ ┏━┻━┓     ┊  
3.00┊ ┃   7     ┊ ┃   7     ┊  
    ┊ ┃ ┏━┻━┓   ┊ ┃ ┏━┻┓    ┊  
2.00┊ ┃ ┃   6   ┊ ┃ ┃  6    ┊  
    ┊ ┃ ┃ ┏━┻┓  ┊ ┃ ┃  ┃    ┊  
1.00┊ ┃ ┃ ┃  5  ┊ ┃ ┃  5    ┊  
    ┊ ┃ ┃ ┃ ┏┻┓ ┊ ┃ ┃ ┏┻┓   ┊  
0.00┊ 0 1 2 3 4 ┊ 0 1 3 4 2 ┊  
  0.00        0.50        1.00

If we run the code above, we erroneously identify the root (8) as the top of a clonal subtree:

clonal_nodes = clonal_mrcas(ts)
print(f"Subtrees under nodes {clonal_nodes} are clonal")
reduced_ts, node_map = ts.simplify(clonal_nodes, map_nodes=True)

print(
    "The tree seq with clonal subtrees removed",
    "(drawn using the original labels)",
    reduced_ts.first().draw_text(node_labels = {n:f"{i}" for i, n in enumerate(node_map)}),
    sep="\n",
)

Debug: show the clonal subtrees
4.00┊  8        ┊
    ┊ ┏┻┓       ┊
3.00┊ ┃ 7       ┊
    ┊ ┃ ┃       ┊
2.00┊ ┃ ┃       ┊
    ┊ ┃ ┃       ┊
1.00┊ ┃ ┃    5  ┊
    ┊ ┃ ┃   ┏┻┓ ┊
0.00┊ 0 1 2 3 4 ┊
    0           1

Subtrees under nodes [2, 5, 8] are clonal
The tree seq with clonal subtrees removed
(drawn using the original labels)
 8 
 ┃ 
 6 
┏┻┓
┃ 5
┃  
2

The only clonal subtree here, IMO, is below node 5 (although we should also be reporting nodes 0, 1, and 2 as clonal)

2 replies

hyanwong Feb 4, 2023
Maintainer Author

I think we simply need to check that the subtrees identified by this method have the same set of samples under them in the pruned TS as in the first tree of the original TS.
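A minimal plain-Python sketch of that check, applied to the comb counterexample: the edge lists are written out by hand as `(left, right, parent, child)` tuples, and the helper name `samples_below` is hypothetical. Whether comparing against the first tree alone is sufficient in general is an assumption carried over from the comment above.

```python
def samples_below(edges, root, samples, position):
    """Samples reachable from root in the tree covering the given position."""
    children = {}
    for left, right, parent, child in edges:
        if left <= position < right:
            children.setdefault(parent, []).append(child)
    stack, found = [root], set()
    while stack:
        u = stack.pop()
        if u in samples:
            found.add(u)
        stack.extend(children.get(u, []))
    return found


SAMPLES = {0, 1, 2, 3, 4}
# Comb tree in which the edge above sample 2 only covers [0, 0.5)
ORIGINAL = [
    (0.0, 1.0, 5, 3), (0.0, 1.0, 5, 4), (0.0, 0.5, 6, 2), (0.0, 1.0, 6, 5),
    (0.0, 1.0, 7, 1), (0.0, 1.0, 7, 6), (0.0, 1.0, 8, 0), (0.0, 1.0, 8, 7),
]
# Edges left after dropping every edge that touches the flagged node 6
PRUNED = [
    (0.0, 1.0, 5, 3), (0.0, 1.0, 5, 4),
    (0.0, 1.0, 7, 1), (0.0, 1.0, 8, 0), (0.0, 1.0, 8, 7),
]

checks = {
    root: samples_below(ORIGINAL, root, SAMPLES, 0.0)
    == samples_below(PRUNED, root, SAMPLES, 0.0)
    for root in (2, 5, 8)
}
print(checks)
```

Node 8 fails the comparison because pruning has cut samples 2, 3 and 4 out from under it, so it would be rejected as a clonal root.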

hyanwong Feb 4, 2023
Maintainer Author

Here's a more complex example, where I think the clonal subtree nodes should be defined as 8, 9, 5, and 6:

tables = tskit.Tree.generate_balanced(7).tree_sequence.dump_tables()
e = np.where(tables.edges.child == 5)[0][0]
tables.edges[e] = tables.edges[e].replace(right=0.5)
tables.sort()
ts = tables.tree_sequence()

print(ts.draw_text())

3.00┊      12       ┊     12        ┊  
    ┊   ┏━━━┻━━┓    ┊   ┏━━┻━━┓     ┊  
2.00┊   8     11    ┊   8    11     ┊  
    ┊ ┏━┻┓   ┏━┻━┓  ┊ ┏━┻┓   ┏┻━┓   ┊  
1.00┊ ┃  7   9  10  ┊ ┃  7   9 10   ┊  
    ┊ ┃ ┏┻┓ ┏┻┓ ┏┻┓ ┊ ┃ ┏┻┓ ┏┻┓ ┃   ┊  
0.00┊ 0 1 2 3 4 5 6 ┊ 0 1 2 3 4 6 5 ┊  
  0.00            0.50            1.00

hyanwong · 2023-02-04T17:36:25Z

hyanwong
Feb 4, 2023
Maintainer Author

An updated answer based on my previous comments. It's more complicated because we need to follow up the trail from existing non-clonal edges and mark all their ancestors as non-clonal too.

import numpy as np
import tskit


def clonal_mrcas(ts, most_recent=True):
    """
    Identify the nodes in a tree sequence which define clonal subtrees (i.e. in which the
    samples descending from that node are identical and show identical relationships
    to each other over the entire tree sequence). This includes, at its limit, nodes
    with only a single descendant sample.
    
    :param bool most_recent: If True, and the clonal node is a unary node, return IDs of the
        most recent node that defines the clonal subtree. In this case, the returned IDs represent
        cases where the node is either a tip or a coalescent point.
    :return: a list of nodes defining subtrees which are constant over the entire tree sequence
    :rtype: list
    """
    for interval, edges_out, edges_in in ts.edge_diffs():
        if interval.left==0:
            is_full_length_clonal = np.ones(ts.num_nodes, dtype=bool)  # all nodes start as clonal
        else:
            for e in edges_in:
                is_full_length_clonal[e.parent] = False
        for e in edges_out:
            is_full_length_clonal[e.parent] = False
    clonal_nodes = np.where(is_full_length_clonal)[0]

    tables = ts.dump_tables()
    edges = tables.edges
    # only keep edges where both the child and the parent are full-length clonal nodes
    keep_edge = np.logical_and(np.isin(edges.child, clonal_nodes),  np.isin(edges.parent, clonal_nodes))
    tables.edges.set_columns(
            left = tables.edges.left[keep_edge],
            right=tables.edges.right[keep_edge],
            parent=tables.edges.parent[keep_edge],
            child=tables.edges.child[keep_edge],
        )
    clonal_ts = tables.tree_sequence()
    assert clonal_ts.num_trees == 1

    # Also remove all the edges ascending from removed edges
    tree = clonal_ts.first()
    non_clonal_ancestors = set()
    deleted_edges = np.logical_not(keep_edge)
    for u in np.unique(ts.edges_parent[deleted_edges]):
        while u != tskit.NULL and u not in non_clonal_ancestors:
            non_clonal_ancestors.add(u)
            u = tree.parent(u)
    non_clonal_ancestors = np.array(list(non_clonal_ancestors))
    tables = clonal_ts.dump_tables()
    remove_edge = np.isin(tables.edges.parent, non_clonal_ancestors)
    tables.edges.replace_with(tables.edges[np.logical_not(remove_edge)])
    clonal_ts = tables.tree_sequence()   
    
    tree = ts.first(sample_lists=True)
    clonal_tree = clonal_ts.first(sample_lists=True)
    clonal_nodes = []
    for root in clonal_tree.roots:
        # Clonal trees should have the same set of samples underneath as in the original tree
        assert set(tree.samples(root)) == set(clonal_tree.samples(root))
        u = root
        if most_recent:
            # descend to the first coalescent node (i.e. MRCA)
            while clonal_tree.num_children(u) == 1:
                u = clonal_tree.children(u)[0]
        clonal_nodes.append(u)
    return clonal_nodes
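The "follow up the trail" step in the middle of the function can be looked at in isolation. Below is a plain-Python sketch, assuming the pruned tree is given as a simple child-to-parent dictionary; the function name `mark_non_clonal_ancestors` and the hand-written parent map (taken from the balanced 7-tip example in the comment above, after the edges touching node 10 have been dropped) are for illustration only.

```python
def mark_non_clonal_ancestors(start_nodes, parent_of):
    """Walk upwards from each start node, marking it and every ancestor."""
    marked = set()
    for u in start_nodes:
        # Stop at a root, or as soon as we reach a node already marked
        while u is not None and u not in marked:
            marked.add(u)
            u = parent_of.get(u)
    return marked


# Child -> parent links remaining once the edges 10->5, 10->6 and 11->10 are removed
parent_of = {
    0: 8, 7: 8, 1: 7, 2: 7,
    3: 9, 4: 9, 9: 11,
    8: 12, 11: 12,
}
# Parents of the removed edges are the starting points for the upward walk
marked = mark_non_clonal_ancestors({10, 11}, parent_of)
print(sorted(marked))
```

Removing every edge whose parent is in the marked set then leaves nodes 8 and 9 as roots of unchanged subtrees, with samples 5 and 6 on their own.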

1 reply

hyanwong Feb 4, 2023
Maintainer Author

Here's a demo on some of the previous examples:

tables = tskit.Tree.generate_balanced(7).tree_sequence.dump_tables()
e = np.where(tables.edges.child == 5)[0][0]
tables.edges[e] = tables.edges[e].replace(right=0.5)
tables.sort()
ts = tables.tree_sequence()

print(ts.draw_text())

clonal_nodes = clonal_mrcas(ts)
print(f"Subtrees under nodes {clonal_nodes} are clonal")
reduced_ts, node_map = ts.simplify(clonal_nodes, map_nodes=True)

print(
    "The tree seq with clonal subtrees removed",
    "(drawn using the original labels)",
    reduced_ts.draw_text(node_labels = {n:f"{i}" for i, n in enumerate(node_map)}),
    sep="\n",
)

3.00┊      12       ┊     12        ┊  
    ┊   ┏━━━┻━━┓    ┊   ┏━━┻━━┓     ┊  
2.00┊   8     11    ┊   8    11     ┊  
    ┊ ┏━┻┓   ┏━┻━┓  ┊ ┏━┻┓   ┏┻━┓   ┊  
1.00┊ ┃  7   9  10  ┊ ┃  7   9 10   ┊  
    ┊ ┃ ┏┻┓ ┏┻┓ ┏┻┓ ┊ ┃ ┏┻┓ ┏┻┓ ┃   ┊  
0.00┊ 0 1 2 3 4 5 6 ┊ 0 1 2 3 4 6 5 ┊  
  0.00            0.50            1.00 

Subtrees under nodes [5, 6, 9, 8] are clonal
The tree seq with clonal subtrees removed
(drawn using the original labels)
3.00┊    12   ┊    12   ┊  
    ┊   ┏━┻━┓ ┊    ┏┻━┓ ┊  
2.00┊  11   8 ┊   11  8 ┊  
    ┊  ┏┻━┓   ┊   ┏┻┓   ┊  
1.00┊ 10  9   ┊   ┃ 9   ┊  
    ┊ ┏┻┓     ┊   ┃     ┊  
0.00┊ 5 6     ┊ 5 6     ┊  
  0.00      0.50      1.00

Another small example

import io
import tskit

nodes = io.StringIO(
    """\
id  is_sample   time    population  individual  metadata
0   1   0.000000    0   -1
1   1   0.000000    0   -1
2   1   0.000000    0   -1
3   1   0.000000    0   -1
4   1   0.000000    0   -1
5   0   1.000000    0   -1
6   0   2.000000    0   -1
7   0   3.000000    0   -1
""")

edges = io.StringIO(
    """\
left    right   parent  child
0.000000    1.000000    5  0
0.000000    1.000000    5  1
0.000000    1.000000    6  5
0.000000    1.000000    6  2
0.000000    0.500000    6  3
0.000000    1.000000    7  6
0.000000    1.000000    7  4
0.500000    1.000000    7  3
"""
)
ts = tskit.load_text(
    nodes, edges, sequence_length=1, strict=False, base64_metadata=False
)

print(ts.draw_text())
clonal_nodes = clonal_mrcas(ts)
print(f"Subtrees under nodes {clonal_nodes} are clonal")
reduced_ts, node_map = ts.simplify(clonal_nodes, map_nodes=True)

print(
    "The tree seq with clonal subtrees removed",
    "(drawn using the original labels)",
    reduced_ts.draw_text(node_labels = {n:f"{i}" for i, n in enumerate(node_map)}),
    sep="\n",
)

3.00┊       7   ┊      7    ┊  
    ┊     ┏━┻━┓ ┊   ┏━━┻┳━┓ ┊  
2.00┊     6   ┃ ┊   6   ┃ ┃ ┊  
    ┊  ┏━━╋━┓ ┃ ┊  ┏┻━┓ ┃ ┃ ┊  
1.00┊  5  ┃ ┃ ┃ ┊  5  ┃ ┃ ┃ ┊  
    ┊ ┏┻┓ ┃ ┃ ┃ ┊ ┏┻┓ ┃ ┃ ┃ ┊  
0.00┊ 0 1 2 3 4 ┊ 0 1 2 3 4 ┊  
  0.00        0.50        1.00 

Subtrees under nodes [2, 3, 4, 5] are clonal
The tree seq with clonal subtrees removed
(drawn using the original labels)
3.00┊     7   ┊     7   ┊  
    ┊   ┏━┻━┓ ┊  ┏━━╋━┓ ┊  
2.00┊   6   ┃ ┊  6  ┃ ┃ ┊  
    ┊ ┏━╋━┓ ┃ ┊ ┏┻┓ ┃ ┃ ┊  
1.00┊ ┃ ┃ 5 ┃ ┊ ┃ 5 ┃ ┃ ┊  
    ┊ ┃ ┃   ┃ ┊ ┃   ┃ ┃ ┊  
0.00┊ 2 3   4 ┊ 2   3 4 ┊  
  0.00      0.50      1.00

And a larger one:

ts = msprime.sim_ancestry(5,
    recombination_rate=0.002, sequence_length=100, random_seed=53)
print("Original ts", ts.draw_text(), sep="\n")

clonal_nodes = clonal_mrcas(ts)
print(f"Subtrees under nodes {clonal_nodes} are clonal")
reduced_ts, node_map = ts.simplify(clonal_nodes, map_nodes=True)

print(
    "The tree seq with clonal subtrees removed",
    "(drawn using the original labels)",
    reduced_ts.draw_text(node_labels = {n:f"{i}" for i, n in enumerate(node_map)}),
    sep="\n",
)

Original ts
4.40┊            19       ┊            19       ┊ 
    ┊         ┏━━━┻━━━┓   ┊         ┏━━━┻━━━━┓  ┊ 
3.25┊         ┃      18   ┊         ┃        ┃  ┊ 
    ┊         ┃      ┏┻━┓ ┊         ┃        ┃  ┊ 
1.01┊        17      ┃  ┃ ┊        17        ┃  ┊ 
    ┊     ┏━━━┻━━┓   ┃  ┃ ┊     ┏━━━┻━━━┓    ┃  ┊ 
0.63┊    16      ┃   ┃  ┃ ┊    16       ┃    ┃  ┊ 
    ┊   ┏━┻━━┓   ┃   ┃  ┃ ┊   ┏━┻━━┓    ┃    ┃  ┊ 
0.54┊   ┃    ┃   ┃   ┃  ┃ ┊   ┃    ┃   15    ┃  ┊ 
    ┊   ┃    ┃   ┃   ┃  ┃ ┊   ┃    ┃   ┏┻━┓  ┃  ┊ 
0.39┊  14    ┃   ┃   ┃  ┃ ┊  14    ┃   ┃  ┃  ┃  ┊ 
    ┊  ┏┻━┓  ┃   ┃   ┃  ┃ ┊  ┏┻━┓  ┃   ┃  ┃  ┃  ┊ 
0.21┊ 13  ┃  ┃   ┃   ┃  ┃ ┊ 13  ┃  ┃   ┃  ┃  ┃  ┊ 
    ┊ ┏┻┓ ┃  ┃   ┃   ┃  ┃ ┊ ┏┻┓ ┃  ┃   ┃  ┃  ┃  ┊ 
0.14┊ ┃ ┃ ┃ 12   ┃   ┃  ┃ ┊ ┃ ┃ ┃ 12   ┃  ┃  ┃  ┊ 
    ┊ ┃ ┃ ┃ ┏┻┓  ┃   ┃  ┃ ┊ ┃ ┃ ┃ ┏┻┓  ┃  ┃  ┃  ┊ 
0.05┊ ┃ ┃ ┃ ┃ ┃  ┃  11  ┃ ┊ ┃ ┃ ┃ ┃ ┃  ┃  ┃ 11  ┊ 
    ┊ ┃ ┃ ┃ ┃ ┃  ┃  ┏┻┓ ┃ ┊ ┃ ┃ ┃ ┃ ┃  ┃  ┃ ┏┻┓ ┊ 
0.05┊ ┃ ┃ ┃ ┃ ┃ 10  ┃ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ 10  ┃ ┃ ┃ ┊ 
    ┊ ┃ ┃ ┃ ┃ ┃ ┏┻┓ ┃ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┃ ┏┻┓ ┃ ┃ ┃ ┊ 
0.00┊ 0 9 3 5 6 1 4 2 7 8 ┊ 0 9 3 5 6 1 4 8 2 7 ┊ 
    0                    30                    100

Subtrees under nodes [8, 10, 11, 16] are clonal
The tree seq with clonal subtrees removed
(drawn using the original labels)
4.40┊     19     ┊      19    ┊ 
    ┊   ┏━━┻━━┓  ┊     ┏━┻━━┓ ┊ 
3.25┊  18     ┃  ┊     ┃    ┃ ┊ 
    ┊ ┏━┻┓    ┃  ┊     ┃    ┃ ┊ 
1.01┊ ┃  ┃   17  ┊    17    ┃ ┊ 
    ┊ ┃  ┃  ┏━┻┓ ┊   ┏━┻━┓  ┃ ┊ 
0.63┊ ┃  ┃  ┃ 16 ┊   ┃  16  ┃ ┊ 
    ┊ ┃  ┃  ┃    ┊   ┃      ┃ ┊ 
0.54┊ ┃  ┃  ┃    ┊  15      ┃ ┊ 
    ┊ ┃  ┃  ┃    ┊ ┏━┻┓     ┃ ┊ 
0.05┊ ┃ 11  ┃    ┊ ┃  ┃    11 ┊ 
    ┊ ┃     ┃    ┊ ┃  ┃       ┊ 
0.05┊ ┃    10    ┊ ┃ 10       ┊ 
    ┊ ┃          ┊ ┃          ┊ 
0.00┊ 8          ┊ 8          ┊ 
    0           30           100
