Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add calculation of minimum spanning trees #141

Merged
merged 75 commits into from
Jan 30, 2021
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
75 commits
Select commit Hold shift + click to select a range
62fbbc0
Simple function for converting an MST to newick format
nickjcroucher Jan 6, 2021
2a91dbd
Generates MST readable by Grapetree
nickjcroucher Jan 6, 2021
c7beb6f
Restructure generation of phylogenies
nickjcroucher Jan 7, 2021
22cbf80
Enable construction of network from sparse distance matrix
nickjcroucher Jan 7, 2021
c075a2d
Add weights to networks from sparse matrices
nickjcroucher Jan 7, 2021
62e8d0c
Construct network from sparse matrix
nickjcroucher Jan 7, 2021
41cd019
Prevent undefined variable being returned in ListDistInts
nickjcroucher Jan 8, 2021
9dc6700
Separate dense and sparse network processing
nickjcroucher Jan 8, 2021
5f9fb86
Edit progress messages
nickjcroucher Jan 8, 2021
58f17cd
Add extra progress messages
nickjcroucher Jan 8, 2021
fa5a2d1
Combine sparse and full distance matrices
nickjcroucher Jan 8, 2021
b4e9b36
Reformat MST output
nickjcroucher Jan 8, 2021
d0d920c
Add debug message
nickjcroucher Jan 10, 2021
0f2e0a5
Update debug message
nickjcroucher Jan 10, 2021
18d7eea
Save intermediate MST graph
nickjcroucher Jan 10, 2021
5867f07
Temporarily remove intermediate MST graph
nickjcroucher Jan 11, 2021
3e48fc8
Clean up debug messages before merge with master
nickjcroucher Jan 11, 2021
5a3fb7a
Resolve conflicts with master
nickjcroucher Jan 11, 2021
e8d8765
Remove conflict message
nickjcroucher Jan 11, 2021
046b067
Make function calls consistent
nickjcroucher Jan 11, 2021
fd86339
Fix whitespace in network.py
nickjcroucher Jan 11, 2021
58cbaee
Fix whitespace in network.py
nickjcroucher Jan 11, 2021
7ed6cef
Extend permitted range of ranks
nickjcroucher Jan 11, 2021
ddcd4b7
Replace floor with int
nickjcroucher Jan 11, 2021
9bc521f
Add explanation of phylogency construction code
nickjcroucher Jan 11, 2021
ed21378
Edit whitespace in network.py
nickjcroucher Jan 11, 2021
fb477ca
Command reformat in network.py
nickjcroucher Jan 11, 2021
ecdc251
Edit whitespace in network.py
nickjcroucher Jan 11, 2021
48bbcf8
Remove unused code from plot.py
nickjcroucher Jan 11, 2021
42b66b3
Edit whitespace in network.py
nickjcroucher Jan 11, 2021
682acc4
Reformat command in plot.py
nickjcroucher Jan 11, 2021
7f9a1cf
Remove unused variables from plot.py
nickjcroucher Jan 11, 2021
caa5af8
Reformat command in plot.py
nickjcroucher Jan 11, 2021
2ee963e
Change name of check_tree_exists function
nickjcroucher Jan 11, 2021
55faaed
Merge branch local and online changes to mst branch
nickjcroucher Jan 11, 2021
4e63aca
Fix write_tree docstring
nickjcroucher Jan 11, 2021
ae9548e
Add error message to write_tree
nickjcroucher Jan 11, 2021
76e27bd
Remove unused MST code
nickjcroucher Jan 11, 2021
7e5aa73
Change variable name in network.py
nickjcroucher Jan 11, 2021
1b3b5bc
Validate network distance type
nickjcroucher Jan 12, 2021
8fec15c
Propagate tree loading function name change
nickjcroucher Jan 15, 2021
a5f931a
Change tree output options
nickjcroucher Jan 15, 2021
8e50745
Separate phylogenetic methods into trees.py
nickjcroucher Jan 15, 2021
ce72382
Change file path for loading tree
nickjcroucher Jan 15, 2021
25db4e1
Remove unnecessary dendropy imports
nickjcroucher Jan 15, 2021
24e5791
Allow core or accessory distances to be used for MST
nickjcroucher Jan 15, 2021
9a82fe0
Fix typo in argument parsing
nickjcroucher Jan 16, 2021
465c466
Pin dendropy version
johnlees Jan 27, 2021
fa85b60
Remove MST with unweighted graph
johnlees Jan 27, 2021
71fda35
Remove sparse matrix mst
johnlees Jan 27, 2021
9ae1519
Unneeded arg
johnlees Jan 27, 2021
a584c7c
Add script to make MSTs from sparse matrix input
johnlees Jan 27, 2021
686267d
Note from issue
johnlees Jan 28, 2021
41d13a8
Code to make MST plots
johnlees Jan 28, 2021
6a19d57
Add drawMST docstring
johnlees Jan 28, 2021
53c5eba
Add cluster definitions to sparse_mst
johnlees Jan 28, 2021
be21d6b
Remove summary from output
johnlees Jan 28, 2021
c149dbf
Read sketchlib version from module
johnlees Jan 29, 2021
b1fc5f7
Make stress plot point size smaller
johnlees Jan 29, 2021
a92b184
Fix cugraph initialisation
johnlees Jan 29, 2021
9fa2084
Correct adjlist function call
johnlees Jan 29, 2021
b42a765
Use cudf explicitly
johnlees Jan 29, 2021
77c2218
Correct weight col name
johnlees Jan 29, 2021
89fcae7
Generate tuple list from df
johnlees Jan 29, 2021
791eaf1
Correct booleans
johnlees Jan 29, 2021
7733e1f
Copy values to host
johnlees Jan 29, 2021
39ae907
Correct graph name passed forwards
johnlees Jan 29, 2021
b62d2aa
Run MST modes in tests
johnlees Jan 29, 2021
a49beaa
Fix MST component joining
johnlees Jan 29, 2021
6185cb2
Save MST as graphml (in sparse_mst)
johnlees Jan 29, 2021
482f8cc
Begin MST docs
johnlees Jan 29, 2021
f1cc5c2
Avoid graph tools import in visualise
johnlees Jan 30, 2021
d8d1cc0
Update docs
johnlees Jan 30, 2021
03c1a18
Update visualisation usage
johnlees Jan 30, 2021
96038da
Add MST to autodoc
johnlees Jan 30, 2021
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
12 changes: 8 additions & 4 deletions PopPUNK/__main__.py
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,9 @@
# import poppunk package
from .__init__ import __version__

# globals
accepted_weights_types = ["core", "accessory", "euclidean"]

#******************************#
#* *#
#* Command line parsing *#
Expand Down Expand Up @@ -238,14 +241,15 @@ def main():
# check if working with lineages
if args.fit_model == 'lineage':
rank_list = sorted([int(x) for x in args.ranks.split(',')])
if min(rank_list) == 0 or max(rank_list) > 100:
sys.stderr.write('Ranks should be small non-zero integers for sensible results\n')
sys.exit(1)
if int(min(rank_list)) == 0:
sys.stderr.write("Ranks must be >= 1\n")
if max(rank_list) > 100:
sys.stderr.write("WARNING: Ranks should be small non-zero integers for sensible lineage results\n")

# run according to mode
sys.stderr.write("PopPUNK (POPulation Partitioning Using Nucleotide Kmers)\n")
sys.stderr.write("\t(with backend: " + dbFuncs['backend'] + " v" + dbFuncs['backend_version'] + "\n")
sys.stderr.write('\t sketchlib: ' + checkSketchlibLibrary() + ')\n')
sys.stderr.write("\t sketchlib: " + checkSketchlibLibrary() + ")\n")

# Check on parallelisation of graph-tools
setGtThreads(args.threads)
Expand Down
3 changes: 2 additions & 1 deletion PopPUNK/models.py
Original file line number Diff line number Diff line change
Expand Up @@ -90,12 +90,13 @@ def loadClusterFit(pkl_file, npz_file, outPrefix = "", max_samples = 100000):
sys.stderr.write("Loading previously refined model\n")
load_obj = RefineFit(outPrefix)
elif fit_type == "lineage":
sys.stderr.write("Loading previously lineage cluster model\n")
sys.stderr.write("Loading lineage cluster model\n")
load_obj = LineageFit(outPrefix, fit_object[0])
else:
raise RuntimeError("Undefined model type: " + str(fit_type))

load_obj.load(fit_data, fit_object)
sys.stderr.write("Completed model loading\n")
return load_obj

class ClusterFit:
Expand Down
109 changes: 103 additions & 6 deletions PopPUNK/network.py
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,9 @@
from functools import partial
from multiprocessing import Pool
import graph_tool.all as gt
import dendropy

from .__main__ import accepted_weights_types

from .sketchlib import addRandom

Expand All @@ -46,7 +49,6 @@ def fetchNetwork(network_dir, model, refList, ref_graph = False,
Names of references that should be in the network
ref_graph (bool)
Use ref only graph, if available

[default = False]
core_only (bool)
Return the network created using only core distances
Expand Down Expand Up @@ -263,7 +265,8 @@ def writeReferences(refList, outPrefix):
return refFileName

def constructNetwork(rlist, qlist, assignments, within_label,
summarise = True, edge_list = False, weights = None):
summarise = True, edge_list = False, weights = None,
weights_type = 'euclidean', sparse_input = None):
"""Construct an unweighted, undirected network without self-loops.
Nodes are samples and edges where samples are within the same cluster

Expand All @@ -281,13 +284,17 @@ def constructNetwork(rlist, qlist, assignments, within_label,
from :func:`~PopPUNK.bgmm.findWithinLabel`
summarise (bool)
Whether to calculate and print network summaries with :func:`~networkSummary`

(default = True)
edge_list (bool)
Whether input is edges, tuples of (v1, v2). Used with lineage assignment
weights (numpy.array)
If passed, the core,accessory distances for each assignment, which will
be annotated as an edge attribute
weights_type (str)
Specifies the type of weight to be annotated on the graph - options are core,
accessory or euclidean distance
sparse_input (numpy.array)
Sparse distance matrix from lineage fit

Returns:
G (graph)
Expand All @@ -303,21 +310,37 @@ def constructNetwork(rlist, qlist, assignments, within_label,
self_comparison = False
vertex_labels.append(qlist)

# Check weights type is valid
if weights_type not in accepted_weights_types:
sys.stderr.write("Unable to calculate distance type " + str(weights_type) + "; "
"accepted types are " + str(accepted_weights_types) + "\n")
sys.exit(1)
if edge_list and sparse_input:
raise RuntimeError("Cannot construct network from edge list and sparse matrix")

# identify edges
connections = []
if edge_list:
if weights is not None:
connections = []
for weight, (ref, query) in zip(weights, assignments):
connections.append((ref, query, weight))
else:
connections = assignments
elif sparse_input is not None:
for ref, query, weight in zip(sparse_input.row, sparse_input.col, sparse_input.data):
connections.append((ref, query, weight))
else:
for row_idx, (assignment, (ref, query)) in enumerate(zip(assignments,
listDistInts(rlist, qlist,
self = self_comparison))):
if assignment == within_label:
if weights is not None:
dist = np.linalg.norm(weights[row_idx, :])
if weights_type == 'euclidean':
dist = np.linalg.norm(weights[row_idx, :])
elif weights_type == 'core':
dist = weights[row_idx, 0]
elif weights_type == 'accessory':
dist = weights[row_idx, 1]
edge_tuple = (ref, query, dist)
else:
edge_tuple = (ref, query)
Expand All @@ -327,7 +350,7 @@ def constructNetwork(rlist, qlist, assignments, within_label,
G = gt.Graph(directed = False)
G.add_vertex(len(vertex_labels))

if weights is not None:
if weights is not None or sparse_input is not None:
eweight = G.new_ep("float")
G.add_edge_list(connections, eprops = [eweight])
G.edge_properties["weight"] = eweight
Expand Down Expand Up @@ -704,3 +727,77 @@ def printExternalClusters(newClusters, extClusterFile, outPrefix,
pd.DataFrame(data=d).to_csv(outPrefix + "_external_clusters.csv",
columns = ["sample"] + list(extClusters.keys()),
index = False)

def generate_minimum_spanning_tree(G, from_cugraph = False):
"""Generate a minimum spanning tree from a network

Args:
G (network)
Graph tool network
from_cugraph (bool)
If a pre-calculated MST from cugraph
[default = False]

nickjcroucher marked this conversation as resolved.
Show resolved Hide resolved
Returns:
mst_network (str)
Minimum spanning tree (as graph-tool graph)
"""
#
# Create MST
#
if not from_cugraph:
sys.stderr.write("Starting calculation of minimum-spanning tree\n")

# Test if weighted network and calculate minimum spanning tree
if "weight" in G.edge_properties:
mst_edge_prop_map = gt.min_spanning_tree(G, weights = G.ep["weight"])
mst_network = gt.GraphView(G, efilt = mst_edge_prop_map)
else:
sys.stderr.write("generate_minimum_spanning_tree requires a weighted graph\n")
raise RuntimeError("MST passed unweighted graph")
else:
mst_network = G

# Find seed nodes as those with greatest outdegree in each component
seed_vertices = set()
component_assignments, component_frequencies = gt.label_components(mst_network)
for component_index in range(len(component_frequencies)):
component_members = component_assignments.a == component_index
component = gt.GraphView(mst_network, vfilt = component_members)
component_vertices = component.get_vertices()
out_degrees = component.get_out_degrees(component_vertices)
seed_vertex = list(component_vertices[np.where(out_degrees == np.amax(out_degrees))])
seed_vertices.add(seed_vertex[0]) # Can only add one otherwise not MST

# If multiple components, calculate distances between seed nodes
if len(component_frequencies) > 1:
# Extract distances
connections = []
max_weight = float(np.max(G.edge_properties["weight"].a))
for ref in seed_vertices:
seed_edges = G.get_all_edges(ref, [G.ep['weight']])
found = False # Not all edges may be in graph
for seed_edge in seed_edges:
if seed_edge[1] in seed_vertices:
found = True
connections.append((seed_edge))
# TODO: alternative would be to requery the DB (likely quick)
if found == False:
for query in seed_vertices:
if query != ref:
connections.append((ref, query, max_weight))

# Construct graph
seed_G = gt.Graph(directed = False)
seed_G.add_vertex(len(seed_vertex))
eweight = seed_G.new_ep("float")
seed_G.add_edge_list(connections, eprops = [eweight])
seed_G.edge_properties["weight"] = eweight
seed_mst_edge_prop_map = gt.min_spanning_tree(seed_G, weights = seed_G.ep["weight"])
seed_mst_network = gt.GraphView(seed_G, efilt = seed_mst_edge_prop_map)
# Insert seed MST into original MST - may be possible to use graph_union with include=True & intersection
deep_edges = seed_mst_network.get_edges([seed_mst_network.ep["weight"]])
mst_network.add_edge_list(deep_edges)

sys.stderr.write("Completed calculation of minimum-spanning tree\n")
return mst_network
Loading