Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Returning the full ARG with ts.simplify() #2765

Open
kitchensjn opened this issue Jun 15, 2023 · 14 comments
Open

Returning the full ARG with ts.simplify() #2765

kitchensjn opened this issue Jun 15, 2023 · 14 comments

Comments

@kitchensjn
Copy link

kitchensjn commented Jun 15, 2023

Is there a method when simplifying a tree sequence to remove all unary nodes except the recombination nodes (a middle ground between ts.simplify(keep_unary=False) versus ts.simplify(keep_unary=True))? We are working with the tree sequence output from a SLiM simulation with initializeTreeSeq(retainCoalescentOnly=F), which contains lots of unary nodes, and we want to simplify it down to just the nodes that affect the ARG structure, the full ARG. As the output tree sequence from SLiM does not have marked recombination nodes, these would first need to be identified before simplifying. Copying @pderaje as he is working on this with me.

(See MesserLab/SLiM#376 for the initial post before determining it was better suited here.)

@jeromekelleher
Copy link
Member

This isn't very easy right now all right - @hyanwong I think we have some hacks for this in sc2ts? Something like, identify the recombinants as the nodes that have greater than 1 parent, and then mark feed them in as samples or something?

@hyanwong
Copy link
Member

hyanwong commented Jun 19, 2023

Yes, I think it would have to be a hack. You could either mark them as samples (and then unmark them afterwards, or maybe even use update_sample_flags=False), or associate them with an individual and use simplify(keep_unary_in_individuals=True) (and then remove those individuals afterwards).

The advantage to the "mark samples" method is that it is reasonably clean. The advantage to the individuals method is that it will correctly remove portions of edges that lead to the recombination node but do not lead on to other samples (e.g. "nonancestral" recombination events). My (untested) guess is that the samples method will not remove nonancestral recombination nodes.

I'm not aware of any hacks in sc2ts that keep these in (apart from here, where we don't reset the sample flags). Mostly we just use keep_unary=True.

Here's some (semi-untested) code for the two methods. The new "filter_nodes=False" option to simplify is very handy to check that the plots are sensible.

import msprime
import numpy as np

arg = msprime.sim_ancestry(2, sequence_length=1e3, recombination_rate=0.001, record_full_arg=True)
re_nodes = np.where(arg.nodes_flags & msprime.NODE_IS_RE_EVENT)[0]
style = "".join(f".n{u} > .sym {{fill: red}}" for u in re_nodes)
arg.draw_svg(style=style)

image

# "sample" method
s1_arg = arg.simplify(np.concatenate((arg.samples(), re_nodes)), update_sample_flags=False, filter_nodes=False)
s1_arg.draw_svg(style=style)

image

# "individual" method
tables = arg.dump_tables()
individual_arr = tables.nodes.individual
for u in re_nodes:
    individual_arr[u] = tables.individuals.add_row()
tables.nodes.individual = individual_arr
tables.simplify(keep_unary_in_individuals=True, filter_nodes=False)
tables.nodes.individual = arg.nodes_individual  # set the individuals back to the original
s2_arg = tables.tree_sequence()
s2_arg.draw_svg(style=style)

image

@hyanwong
Copy link
Member

A keep_multiple_parent_nodes option would be a reasonable idea IMO (note that these might not always be strict "recombination nodes": they could be summaries of an earlier Re event. @GertjanBisschop has been thinking about rationalising the various simplify options, so he might have an opinion.

@pderaje
Copy link

pderaje commented Jun 19, 2023

Thanks, @hyanwong! We were doing something similar to the "samples" method, except SLiM doesn't automatically flag the recombination nodes, so we identified nodes with multiple parents, which raised two questions

  1. Is there a way to automatically flag those nodes with multiple parents, either while recording the tree sequence in SLiM or while simplifying in tskit (something like keep_multiple_parent_nodes should do the trick)? Otherwise going through each nd takes a while in larger ARGs.

  2. While adding the recombination nodes to be sampled, we weren't sure what we should be adding to the list. In the following code, is it the two parents or the nd itself (or maybe it doesn't matter) that should be added to the sample list?

`def ts_to_ARG(ts):
ts_tables = ts.dump_tables()
node_table = ts_tables.nodes
flags = node_table.flags
recomb_nodes = []

for nd in ts_sim.nodes(): 
    parents = np.unique(ts_tables.edges.parent[np.where(ts_tables.edges.child == nd.id)[0]])
    children = np.unique(ts_tables.edges.child[np.where(ts_tables.edges.parent == nd.id)[0]])
    if len(parents) == 2: 
        recomb_nodes += list(parents)
        flags[parents] = [131072,131072]
        
    if len(parents) > 2:
        raise TypeError('Error',nd.id)
    
node_table.flags = flags
ts_tables.sort() 
ts_new = ts_tables.tree_sequence()

keep_nodes = list(ts_new.samples()) + list(np.unique(recomb_nodes))
ts_final, maps = ts_new.simplify(samples=keep_nodes, map_nodes = True, keep_input_roots=False, keep_unary=False, update_sample_flags = False)

return ts_final, maps, recomb_nodes`

@hyanwong
Copy link
Member

  1. Is there a way to automatically flag those nodes with multiple parents, either while recording the tree sequence in SLiM or while simplifying in tskit (something like keep_multiple_parent_nodes should do the trick)? Otherwise going through each nd takes a while in larger ARGs.

There isn't. But it should be possible from the edge table arrays, more-or-less in a single pass right? You could do an np.unique on the 2 x num_edges 2D array, keeping unique pairs of (child, parent), sorted by child, then look for duplicate child IDs, I think?

  1. While adding the recombination nodes to be sampled, we weren't sure what we should be adding to the list. In the following code, is it the two parents or the nd itself (or maybe it doesn't matter) that should be added to the sample list?

I guess you want the node itself

@hyanwong
Copy link
Member

hyanwong commented Jun 19, 2023

I guess you want the node itself

Ah, I was wrong. It depends how you represent the recombination event. In msprime, we create 2 nodes per recombination event (because it helps us to calculate the likelihood under the Hudson coalescent, see tskit-dev/msprime#1942). Quoting from there:

I can't see how storing the recombination as two nodes helps with this. Surely the position of the breakpoint is lost in this case too?

The exact position of the breakpoint is lost, but you retain information about which non-ancestral gap contains the breakpoint, even when there are several gaps. That's enough to work out how many links are available for an effective recombination, and hence the rate of effective recombinations, in the next time interval.

Ah, thanks for the explanation. So if (in the original msprime code) you have a pair of RE nodes (either of which can be a parent in multiple edges, you take the rightmost edge position from the first node, and the leftmost from the second node, and (because you only have one breakpoint per recombination event) you are guaranteed that the "hidden" breakpoint lies between these? Is that right? I can see how that might work, but I can't see on earth how you would generalise this to multiple breakpoints. Perhaps this is an indications that the system we are switching to, with the breakpoints for the RE node stored in metadata, is more correct, if a little more bespoke.

In this case you might want to keep the parents.

In the SLiM case, my guess is that you have one recombination node per event, so you want the children. This is all a bit messy!

@hyanwong
Copy link
Member

it should be possible from the edge table arrays, more-or-less in a single pass right? You could do an np.unique on the 2 x num_edges 2D array, keeping unique pairs of (child, parent), sorted by child, then look for duplicate child IDs,

@kitchensjn : I think this finds the nodes with multiple parents, doesn't it? Could you check my logic, and if it's correct, I can add it as a Q&A to the discussions forum.

uniq_child_parent = np.unique(np.column_stack((ts.edges_child, ts.edges_parent)), axis=0)
nd, count = np.unique(uniq_child_parent[:, 0], return_counts=True)
multiple_parents = nd[count > 1]
print(f"Nodes with multiple parents are {multiple_parents}")

@kitchensjn
Copy link
Author

kitchensjn commented Jun 20, 2023

Yup, this will return all of the nodes with more than one parent. Then checking that it identifies the parents with recombination flags should be something like:

import numpy as np
import msprime


ts = msprime.sim_ancestry(
    samples=4,
    recombination_rate=1e-8,
    sequence_length=2_000,
    population_size=10_000,
    record_full_arg=True
)

uniq_child_parent = np.unique(np.column_stack((ts.edges_child, ts.edges_parent)), axis=0)
nd, count = np.unique(uniq_child_parent[:, 0], return_counts=True)
multiple_parents = nd[count > 1]
recomb_nodes = ts.edges_parent[np.in1d(ts.edges_child, multiple_parents)]
print(f"Recombination nodes: {recomb_nodes}. Flags match properly: {np.all(ts.nodes_flags[recomb_nodes]==msprime.NODE_IS_RE_EVENT)}")

@hyanwong
Copy link
Member

hyanwong commented Jun 20, 2023

Yes, although the recombination nodes created by msprime.sim_ancestry(... record_full_arg=True) are not the ones with multiple parents, but (confusingly, and for technical reasons) the nodes above the node with multiple parents. That was the point of my digression above. It should probably work in SLiM though, assuming that SLiM doesn't use the 2-RE-node encoding used by msprime.

@kitchensjn
Copy link
Author

Just to clarify, so please correct me if I've misunderstood:

My code should work for tree sequences with the 2-RE-node encoding (msprime) as it returns the parents of the nodes with multiple parents. It is equivalent to np.where(ts.nodes_flags==msprime.NODE_IS_RE_EVENT)[0] for a tree sequence generated using msprime.sim_ancestry(... record_full_arg=True).

The final two lines of my code would not be needed for a tree sequence that uses a 1-RE-node encoding (SLiM, most likely). For those tree sequences, the array multiple_parents would contain all of the recombination nodes.

@jeromekelleher
Copy link
Member

jeromekelleher commented Jun 21, 2023

There may not be a good reason for doing things this way in msprime with the two re nodes, now that we can keep unary nodes more flexibly.

@GertjanBisschop can you comment? It would be good to make a decision here regarding how we record recombs before we release the new additional nodes API

(This is an msprime issue though - can someone open an issue on msprime to discuss potentially changing how we record re nodes for the new additional nodes API please?)

@hyanwong
Copy link
Member

Just to clarify, so please correct me if I've misunderstood:

No you are right - I didn't read your code fully, sorry!

@GertjanBisschop
Copy link
Member

GertjanBisschop commented Jun 21, 2023

(This is an msprime issue though - can someone open an issue on msprime to discuss potentially changing how we record re nodes for the new additional nodes API please?)

I opened an issue. As @hyanwong already mentioned, the 2-nodes vs 1-node recombination event encoding has been discussed before. Not entirely sure yet how the more flexible node recording would help resolve why we stuck with the 2-node encoding.

@hyanwong
Copy link
Member

Just to note in passing that a large number of the ARG nodes that are not in a tree sequence are not recombination nodes, but common-ancestor-non-coalescent nodes. You would probably want to keep these too.

I think a flexible thing would be to be able to pass a bit array of flags to simplify, and keep any nodes with those flags set. We could have a separate routine for flagging up nodes in the correct way, which would be different for a 1-RE node vs 2-RE node TS.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants