Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reference database followed by assigning - inconsistent clusters #50

Closed
johnlees opened this issue Aug 21, 2019 · 8 comments
Closed

Reference database followed by assigning - inconsistent clusters #50

johnlees opened this issue Aug 21, 2019 · 8 comments
Assignees
Labels
bug Something isn't working

Comments

@johnlees
Copy link
Member

Running with --use-model, followed by --assign-query gave the following errors:

Network loaded: 344 samples
WARNING: Old cluster 9 split across multiple new clusters
WARNING: Old cluster 28 split across multiple new clusters

etc etc

Looking at cluster 9, the definition in the pruned network and the original cluster file are not consistent:

set(G.nodes()).intersection(oldClusters['9'])
{'assemblies_r4/5386_1_12.fa', 'assemblies_r4/9262_5_53.fa', 'assemblies_r4/19944_7_53.fa', 'assemblies_r4/SP2LAU.fa', 'assemblies_r4/19944_7_117.fa', 'assemblies_r4/19944_6_170.fa'}

list(nx.connected_components(G))
[{'assemblies_r4/11791_7_71.fa'}, {'assemblies_r4/11791_7_74.fa'}, {'assemblies_r4/11791_7_78.fa'}, {'assemblies_r4/11791_7_79.fa'}, {'assemblies_r4/8898_3_73.fa', 'assemblies_r4/11791_7_81.fa'}, {'assemblies_r4/11791_7_83.fa'}, {'assemblies_r4/11791_7_86.fa'}, {'assemblies_r4/11791_7_89.fa'}, {'assemblies_r4/11791_7_90.fa'}, {'assemblies_r4/11791_7_93.fa'}, {'assemblies_r4/11791_7_94.fa'}, {'assemblies_r4/11791_7_95.fa'}, {'assemblies_r4/11822_8_16.fa'}, {'assemblies_r4/11822_8_25.fa'}, {'assemblies_r4/11822_8_27.fa'}, {'assemblies_r4/11822_8_4.fa'}, {'assemblies_r4/11822_8_9.fa'}, {'assemblies_r4/5731_7_7.fa', 'assemblies_r4/11826_4_63.fa'}, {'assemblies_r4/11826_4_72.fa'}, {'assemblies_r4/11826_4_73.fa'}, {'assemblies_r4/11826_4_85.fa'}, {'assemblies_r4/11826_4_87.fa'}, {'assemblies_r4/19944_6_139.fa'}, {'assemblies_r4/19944_6_148.fa'}, {'assemblies_r4/19944_6_150.fa'}, {'assemblies_r4/19944_6_161.fa'}, {'assemblies_r4/19944_6_162.fa'}, {'assemblies_r4/19944_6_164.fa'}, {'assemblies_r4/19944_6_169.fa'}, {'assemblies_r4/5386_1_12.fa', 'assemblies_r4/19944_6_170.fa', 'assemblies_r4/SP2LAU.fa'}, {'assemblies_r4/19944_6_174.fa'}, {'assemblies_r4/19944_7_116.fa'}, {'assemblies_r4/19944_7_117.fa', 'assemblies_r4/19944_7_53.fa'}, {'assemblies_r4/19944_7_11.fa'}, {'assemblies_r4/19944_7_122.fa'}, {'assemblies_r4/ERR1733243.fa', 'assemblies_r4/19944_7_123.fa'}, {'assemblies_r4/19944_7_130.fa'}, {'assemblies_r4/SRS2376552.fa', 'assemblies_r4/SRS2372224.fa', 'assemblies_r4/19944_7_134.fa', 'assemblies_r4/SRS2376192.fa'}, {'assemblies_r4/19944_7_15.fa'}, {'assemblies_r4/19944_7_22.fa'}, {'assemblies_r4/19944_7_23.fa'}, {'assemblies_r4/SRS2372700.fa', 'assemblies_r4/19944_7_24.fa'}, {'assemblies_r4/19944_7_27.fa'}, {'assemblies_r4/19944_7_28.fa'}, ...

I thought this was fixed in #28 / #29, but I seem to still get this kind of error with the current code. Using --full-db everything is fine.

Look into:

  • The network pruning step - is this working ok?
  • What do the clusters look like generated pre- and post- this step?
@johnlees johnlees added the bug Something isn't working label Aug 21, 2019
@johnlees johnlees self-assigned this Aug 21, 2019
@johnlees
Copy link
Member Author

Strangely if I use poppunk_references to do the pruning, I get the errors, but for different cluster numbers:

WARNING: Old cluster 11 split across multiple new clusters
WARNING: Old cluster 56 split across multiple new clusters
WARNING: Old cluster 78 split across multiple new clusters
WARNING: Old cluster 72 split across multiple new clusters
WARNING: Old cluster 48 split across multiple new clusters
WARNING: Old cluster 21 split across multiple new clusters
WARNING: Old cluster 35 split across multiple new clusters
WARNING: Old cluster 73 split across multiple new clusters
WARNING: Old cluster 5 split across multiple new clusters
WARNING: Old cluster 8 split across multiple new clusters
WARNING: Old cluster 8 split across multiple new clusters
WARNING: Old cluster 49 split across multiple new clusters
WARNING: Old cluster 64 split across multiple new clusters

@johnlees
Copy link
Member Author

(This is in GAS: https://www.nature.com/articles/s41588-019-0417-8)

@johnlees
Copy link
Member Author

Actual assigned clusters are identical however. Is the warning message wrong?

@johnlees
Copy link
Member Author

I think that it is possible, once cliques have been pruned, that connected components in the old graph to no longer be connected due to the edge removal. I will check with this example. A sensible fix would be to add in the minimal number of edges to reconnect them (just need to check this wouldn't interfere with clique selection if using --update-db)

@nickjcroucher
Copy link
Collaborator

nickjcroucher commented Aug 23, 2019 via email

@johnlees
Copy link
Member Author

Here is the offending cluster before pruning:
component_9

Seems likely to be split in this case because of that high stress edge.

We may not actually need to reconnect them, another approach would be to use another check that the clusters and network are consistent.
Or a processes which adds nodes and edges back in after pruning each time to ensure clusters remain connected.
I'll have a look at which might be preferable, but my feeling is keeping the information present in the full network would be preferable

@johnlees
Copy link
Member Author

Ok, I have a fix for this I am about to commit. By changing the reference pick function we don't have to worry about further updates or adding false cliques:

  • After picking the references, find any clusters that are represented for by more than one ref.
  • For each of these clusters, check (over all pairs of references in the cluster) whether they are in the same component in the pruned graph
  • If they are not, find the shortest path between them and add the nodes on that path to the reference list
  • After doing this for all clusters with >1 ref, re-prune the original graph using the now extended reference list

In the GAS example, this went from 340 to 360 references. Checked and --assign-query no longer gives the warning messages.

Cluster 9 (shown above), rather than being represented by three references in two components now looks like this:
cluster_9_fixref

@johnlees
Copy link
Member Author

(the top 6 nodes contain cliques, but if 19944_8_129 isn't picked as the representative it will split up into two components which I think is what happened before. There might be a way of doing this which is cleverer about clique picks and can prune further than this, but I think this is a reasonable solution that doesn't increase size too much more)

ccoulombe pushed a commit to ccoulombe/PopPUNK that referenced this issue Oct 28, 2022
Remove the boundary code from the package
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants