Returning the full ARG with ts.simplify() #2765
Comments
This isn't very easy right now, all right. @hyanwong, I think we have some hacks for this in sc2ts? Something like: identify the recombinants as the nodes that have more than one parent, and then feed them in as samples or something? |
Yes, I think it would have to be a hack. You could either mark them as samples (and then unmark them afterwards, or maybe even use The advantage to the "mark samples" method is that it is reasonably clean. The advantage to the I'm not aware of any hacks in sc2ts that keep these in (apart from here, where we don't reset the sample flags). Mostly we just use

Here's some (semi-untested) code for the two methods. The new "filter_nodes=False" option to simplify is very handy to check that the plots are sensible.

import msprime
import numpy as np

arg = msprime.sim_ancestry(2, sequence_length=1e3, recombination_rate=0.001, record_full_arg=True)
re_nodes = np.where(arg.nodes_flags & msprime.NODE_IS_RE_EVENT)[0]
style = "".join(f".n{u} > .sym {{fill: red}}" for u in re_nodes)
arg.draw_svg(style=style)

# "sample" method
s1_arg = arg.simplify(np.concatenate((arg.samples(), re_nodes)), update_sample_flags=False, filter_nodes=False)
s1_arg.draw_svg(style=style)

# "individual" method
tables = arg.dump_tables()
individual_arr = tables.nodes.individual
for u in re_nodes:
    individual_arr[u] = tables.individuals.add_row()
tables.nodes.individual = individual_arr
tables.simplify(keep_unary_in_individuals=True, filter_nodes=False)
tables.nodes.individual = arg.nodes_individual  # set the individuals back to the original
s2_arg = tables.tree_sequence()
s2_arg.draw_svg(style=style)
|
A |
Thanks, @hyanwong! We were doing something similar to the "samples" method, except SLiM doesn't automatically flag the recombination nodes, so we identified nodes with multiple parents, which raised two questions
`def ts_to_ARG(ts):
|
There isn't. But it should be possible from the edge table arrays, more-or-less in a single pass right? You could do an
I guess you want the node itself |
Ah, I was wrong. It depends how you represent the recombination event. In msprime, we create 2 nodes per recombination event (because it helps us to calculate the likelihood under the Hudson coalescent, see tskit-dev/msprime#1942). Quoting from there:
In this case you might want to keep the parents. In the SLiM case, my guess is that you have one recombination node per event, so you want the children. This is all a bit messy! |
@kitchensjn: I think this finds the nodes with multiple parents, doesn't it? Could you check my logic, and if it's correct, I can add it as a Q&A to the discussions forum.

uniq_child_parent = np.unique(np.column_stack((ts.edges_child, ts.edges_parent)), axis=0)
nd, count = np.unique(uniq_child_parent[:, 0], return_counts=True)
multiple_parents = nd[count > 1]
print(f"Nodes with multiple parents are {multiple_parents}")
|
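For anyone who wants to check the logic above without a tree sequence to hand, here is a minimal dependency-free sketch of the same idea. The function name `nodes_with_multiple_parents` and the toy edge lists are made up for illustration; the inputs are assumed to be plain sequences of node IDs such as `ts.edges_child` and `ts.edges_parent`.

```python
from collections import defaultdict


def nodes_with_multiple_parents(edges_child, edges_parent):
    """Return the sorted IDs of nodes that have more than one distinct parent.

    Hypothetical helper: repeated (child, parent) pairs, e.g. the same
    parent over two separate genomic intervals, are counted only once.
    """
    parents_of = defaultdict(set)
    for child, parent in zip(edges_child, edges_parent):
        parents_of[int(child)].add(int(parent))
    return sorted(c for c, parents in parents_of.items() if len(parents) > 1)


# Toy edges: node 0 has parents 4 and 5; node 1 has parent 4 on two edges;
# node 2 has the single parent 5.
edges_child = [0, 0, 1, 1, 2]
edges_parent = [4, 5, 4, 4, 5]
multiple_parents = nodes_with_multiple_parents(edges_child, edges_parent)
print(f"Nodes with multiple parents are {multiple_parents}")
```

Collecting parents into a set per child plays the same role as the `np.unique(..., axis=0)` step: it stops a child with one parent spread over several edges from being counted as having multiple parents.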
Yup, this will return all of the nodes with more than one parent. Then checking that it identifies the parents with recombination flags should be something like:
|
Yes, although the recombination nodes created by |
Just to clarify, so please correct me if I've misunderstood: My code should work for tree sequences with the 2-RE-node encoding (msprime) as it returns the parents of the nodes with multiple parents. It is equivalent to The final two lines of my code would not be needed for a tree sequence that uses a 1-RE-node encoding (SLiM, most likely). For those tree sequences, the array |
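To make the two encodings concrete, here is a rough sketch that returns either the multi-parent children or their parents from plain edge arrays. The name `recombination_nodes` and the `two_node_encoding` switch are hypothetical. It assumes, as guessed earlier in this thread, that a 1-RE-node encoding (probably SLiM) puts the recombination on the child with multiple parents, while the 2-RE-node encoding (msprime) puts it on that child's parents.

```python
from collections import defaultdict


def recombination_nodes(edges_child, edges_parent, two_node_encoding):
    """Guess the recombination node IDs from the edge table columns.

    Hypothetical helper. With two_node_encoding=True, return the parents of
    every node that has multiple parents; otherwise return those
    multi-parent nodes themselves.
    """
    parents_of = defaultdict(set)
    for child, parent in zip(edges_child, edges_parent):
        parents_of[int(child)].add(int(parent))
    multi = {c: ps for c, ps in parents_of.items() if len(ps) > 1}
    if two_node_encoding:
        # Union of all parents of multi-parent nodes (empty if there are none).
        return sorted(set().union(*multi.values()))
    return sorted(multi)


# Toy edges: node 0 has two parents (3 and 4); every other node has one.
edges_child = [0, 0, 1, 3, 4]
edges_parent = [3, 4, 5, 5, 6]
print(recombination_nodes(edges_child, edges_parent, two_node_encoding=True))
print(recombination_nodes(edges_child, edges_parent, two_node_encoding=False))
```

Whichever list is returned could then be passed to simplify alongside the real samples, as in the "sample" method earlier in the thread.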
There may not be a good reason for doing things this way in msprime with the two re nodes, now that we can keep unary nodes more flexibly. @GertjanBisschop can you comment? It would be good to make a decision here regarding how we record recombs before we release the new additional nodes API (This is an msprime issue though - can someone open an issue on msprime to discuss potentially changing how we record re nodes for the new additional nodes API please?) |
No you are right - I didn't read your code fully, sorry! |
I opened an issue. As @hyanwong already mentioned, the 2-nodes vs 1-node recombination event encoding has been discussed before. Not entirely sure yet how the more flexible node recording would help resolve why we stuck with the 2-node encoding. |
Just to note in passing that a large number of the ARG nodes that are not in a tree sequence are not recombination nodes, but common-ancestor-non-coalescent nodes. You would probably want to keep these too. I think a flexible thing would be to be able to pass a bit array of flags to |
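As a rough illustration of how such nodes might be picked out from the edges alone, the sketch below treats a "common-ancestor-non-coalescent" node as one that has two or more distinct children over the whole sequence but never more than one child in any single local tree. That definition is my reading of the comment above, not an official one, and the function name and `(left, right, parent, child)` tuple layout are made up. It also assumes valid tables, where a parent never has two overlapping edges to the same child.

```python
from collections import defaultdict


def non_coalescent_ca_nodes(edges):
    """Return sorted IDs of parents with >1 distinct child but no overlapping child edges.

    Hypothetical helper; `edges` is an iterable of (left, right, parent, child).
    """
    by_parent = defaultdict(list)
    for left, right, parent, child in edges:
        by_parent[parent].append((left, right, child))

    result = []
    for parent, child_edges in by_parent.items():
        if len({child for _, _, child in child_edges}) < 2:
            continue  # only ever one child: an ordinary unary node
        child_edges.sort()  # by left coordinate
        max_right = child_edges[0][1]
        coalesces = False
        for left, right, _ in child_edges[1:]:
            if left < max_right:
                # Two child edges share some interval, so the parent has
                # two children in at least one local tree.
                coalesces = True
                break
            max_right = max(max_right, right)
        if not coalesces:
            result.append(parent)
    return sorted(result)


# Toy edges: node 5 has child 0 on [0, 5) and child 1 on [5, 10);
# node 6 has children 2 and 3 over the same interval; node 7 has one child.
edges = [(0, 5, 5, 0), (5, 10, 5, 1), (0, 10, 6, 2), (0, 10, 6, 3), (0, 10, 7, 5)]
print(non_coalescent_ca_nodes(edges))
```

The rows of `ts.edges()` could be converted to such tuples, and the resulting IDs added to the set of nodes retained during simplification in the same way as the recombination nodes.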
Is there a method when simplifying a tree sequence to remove all unary nodes except the recombination nodes (a middle ground between ts.simplify(keep_unary=False) and ts.simplify(keep_unary=True))? We are working with the tree sequence output from a SLiM simulation with initializeTreeSeq(retainCoalescentOnly=F), which contains lots of unary nodes, and we want to simplify it down to just the nodes that affect the ARG structure: the full ARG. As the output tree sequence from SLiM does not have marked recombination nodes, these would first need to be identified before simplifying. Copying @pderaje as he is working on this with me. (See MesserLab/SLiM#376 for the initial post before determining it was better suited here.)