-
Hi there! I am trying to detect, between any two samples, when a segment is truly clonally inherited verses when one of the two's segments is from a recombination/gene conversion event that just so happened to coalesce at the same time MRCA as the other sample. They would have the same ancestor node and Tmrca, but take different paths. My thought is that this could theoretically happen at high rates in a multiple merger coalescent. I've thought about using I'm thinking incorporating extra gene conversion nodes could differentiate between the two, but don't know how to integrate them with Any help is greatly appreciated. Best, |
Beta Was this translation helpful? Give feedback.
Replies: 2 comments 5 replies
-
Hi, CJ - to answer the question I think we need some more context? Is this a question about tree sequences produced by msprime? If so, and you want to be able to always have separate edges for segments that took separate paths through the pedigree, then - I could be wrong, but - I think you need to use (Mabye this should be an msprime discussion?) |
Beta Was this translation helpful? Give feedback.
-
I thought it didn't work and but after testing it again it does!! Thank you for your help! It was helpful to know I was on the right track. I will close the discussion now. Basically I noticed that only every second GC node is in the segment that was transferred, while all the other ones mark the complement segments of the genome (that don't come from gc). (Maybe it would be helpful to note this somewhere in the docs? The docs only talk about recombination nodes where it doesn't really matter which corresponds with left and right, while I had a hard time figuring out which GC nodes corresponded to the transferred segment, or if such a correspondance even existed). So I basically walked up the tree and looked for these nodes. GENE_CONVERSION_FLAG = 1 << 21
gc_nodes = [
node.id for node in ts.nodes() if node.flags == GENE_CONVERSION_FLAG
]
real_gc_nodes = set(gc_nodes[1::2]) # gc nodes from the transferred segment (only ever other node)
def are_clonal(i, j, real_gc_nodes, tree):
mrca = tree.mrca(i,j)
# no real gc node on path from i to mrca
while i != mrca:
if i in real_gc_nodes:
return False
i = tree.parent(i)
# no real gc node on path from j to mrca
while j != mrca:
if j in real_gc_nodes:
return False
j = tree.parent(j)
return True
frac_trueclonal = []
# how much of each pair of genomes is clonal?
for (i, j) in pairs:
clonal = sum(tree.span for tree in ts.trees() if are_clonal(i, j, real_gc_nodes, tree))
frac_trueclonal.append(clonal/ts.sequence_length) Walking up the tree for every pair and every tree turned out to be way too slow for large amonts of GC, so I instead converted each tree into a graph, chopped it up at these real GC nodes, and each pair within the components created by chopping up the tree must be clonal, so I'd add the tree's span to that pair's "clonal fraction". This is much faster and does the exact same. n = mts.num_samples
L = mts.sequence_length
GENE_CONVERSION_FLAG = 1 << 21
gc_nodes = [u.id for u in mts.nodes() if u.flags == GENE_CONVERSION_FLAG]
real_gc = set(gc_nodes[1::2])
num_pairs = math.comb(n, 2)
span = np.zeros(num_pairs, dtype=float)
# for each marginal tree, find components when you "remove" real_gc nodes (building a graph excluding them)
for tree in mts.trees():
g = rx.PyGraph()
node_map = {} # get graph idx for node idx
for u in tree.nodes():
if u not in real_gc:
node_index = g.add_node(u)
node_map[u] = node_index
for u in node_map:
p = tree.parent(u)
if p != tskit.NULL and p in node_map:
g.add_edge(node_map[u], node_map[p], None)
components = rx.connected_components(g)
# for all pairs in component, add tree span as clonal
for comp in components:
sample_nodes = [g[i] for i in comp if tree.is_sample(g[i])]
for i, j in combinations(sorted(sample_nodes), 2):
pair_index = int(j - i - 1 + n * i - (i * (i + 1)) // 2)
span[pair_index] += tree.span
frac_trueclonal = span/L |
Beta Was this translation helpful? Give feedback.
I thought it didn't work and but after testing it again it does!! Thank you for your help! It was helpful to know I was on the right track. I will close the discussion now.
Basically I noticed that only every second GC node is in the segment that was transferred, while all the other ones mark the complement segments of the genome (that don't come from gc). (Maybe it would be helpful to note this somewhere in the docs? The docs only talk about recombination nodes where it doesn't really matter which corresponds with left and right, while I had a hard time figuring out which GC nodes corresponded to the transferred segment, or if such a correspondance even existed).
So I basically walked up…