Graph tool #83

Merged
merged 62 commits, Jul 16, 2020
Commits (62)
7260e50
First addition of graph-tools code
nickjcroucher May 7, 2020
09ebf6d
Include listDistInts routine in utils.py
nickjcroucher May 7, 2020
9eb7d84
Functioning model refinement with graph-tools
nickjcroucher May 7, 2020
09aacaf
Update extraction of references from network
nickjcroucher May 8, 2020
e2476af
More efficient extraction of references
nickjcroucher May 8, 2020
027b36d
Remove redundant imports and fix output
nickjcroucher May 8, 2020
5623a67
Switch from multiprocessing to OpenMP parallelisation using graph-tools
nickjcroucher May 8, 2020
88714dd
Fix network loading message
nickjcroucher May 9, 2020
145eee6
Graph loading function updated
nickjcroucher May 9, 2020
e75541f
Update visualisation code
nickjcroucher May 9, 2020
0109e0b
Refactor lineage_clustering code to use graph-tool
nickjcroucher May 12, 2020
188ed40
Update docstrings
nickjcroucher May 12, 2020
a75cb4f
Enable visualisation of lineage networks using Cytoscape
nickjcroucher May 12, 2020
2fc779b
Add extra network printing features
nickjcroucher May 12, 2020
fc3a69a
Change to network-based definitions of lineages
nickjcroucher May 12, 2020
dc042b9
Enable visualisation of networks post-processing
nickjcroucher May 13, 2020
4a18c1e
Enable querying and pruning of networks
nickjcroucher May 13, 2020
a263367
Fix output of names and labels
nickjcroucher May 14, 2020
9eecdaa
Remove debugging message
nickjcroucher May 14, 2020
09e6564
Add new dependency of lineage clustering on ref-db to tests
nickjcroucher May 14, 2020
09db4dc
Add graph-tool to dependencies
nickjcroucher May 14, 2020
9520b91
Overwrite for local running of tests
nickjcroucher May 14, 2020
2235736
Change references to query sequences in network extension
nickjcroucher May 14, 2020
1df2387
Use hash for query sequence name retrieval
nickjcroucher May 14, 2020
a38e58d
Use list for query sequence retrieval
nickjcroucher May 14, 2020
92956f0
Correct maths of listDistInts
nickjcroucher May 14, 2020
14a465f
Merge branch 'sketchlib140' into graph-tool
nickjcroucher Jul 3, 2020
a6037c3
Remove legacy mash test
nickjcroucher Jul 3, 2020
97dffbd
Merge branch 'sketchlib140' into graph-tool
johnlees Jul 3, 2020
a661bd3
Merge remote-tracking branch 'origin/master' into graph-tool
johnlees Jul 3, 2020
957434c
Adjusting test file
nickjcroucher Jul 4, 2020
1a36336
Fix test file
nickjcroucher Jul 4, 2020
22f20d9
Change minimum k step
nickjcroucher Jul 13, 2020
e00707a
Restore generate-viz mode test
nickjcroucher Jul 15, 2020
f6b86f6
Specified graph-tool package as a dependency in documentation
nickjcroucher Jul 15, 2020
04df0df
Removed outdated parts from troubleshooting document
nickjcroucher Jul 15, 2020
651586f
Update docstrings for graph-tool
nickjcroucher Jul 15, 2020
1d5766b
Update PopPUNK/__main__.py
nickjcroucher Jul 15, 2020
ad04c72
Update PopPUNK/__main__.py
nickjcroucher Jul 15, 2020
82e245a
Update PopPUNK/__main__.py
nickjcroucher Jul 15, 2020
46cc362
Remove debug file printing
nickjcroucher Jul 15, 2020
c6167e8
Update PopPUNK/mash.py
nickjcroucher Jul 15, 2020
3fb0006
Whitespace removed
nickjcroucher Jul 15, 2020
8e29c1e
Update PopPUNK/lineage_clustering.py
nickjcroucher Jul 15, 2020
c040518
Update PopPUNK/lineage_clustering.py
nickjcroucher Jul 15, 2020
0f8d1f9
Change default lineage cluster
nickjcroucher Jul 15, 2020
3c495c0
Merge branch 'graph-tool' of https://github.com/johnlees/PopPUNK into…
nickjcroucher Jul 15, 2020
be09381
Update PopPUNK/mash.py
nickjcroucher Jul 15, 2020
169fe66
Tidying up network construction
nickjcroucher Jul 15, 2020
cdacc00
Merge branch 'graph-tool' of https://github.com/johnlees/PopPUNK into…
nickjcroucher Jul 15, 2020
0ed9bcc
Assign local variable more clearly
nickjcroucher Jul 15, 2020
ed2820f
Improve error message for isolates missing from network
nickjcroucher Jul 15, 2020
abf0bc3
Tidying of excess code
nickjcroucher Jul 15, 2020
06a8648
Update PopPUNK/__main__.py
nickjcroucher Jul 15, 2020
39dda3d
Update PopPUNK/__main__.py
nickjcroucher Jul 15, 2020
a5a58a8
Replace mashOrder with dbOrder
nickjcroucher Jul 15, 2020
0a299b7
Reinstate model.save()
nickjcroucher Jul 16, 2020
4ff2a49
Update network component extraction code
nickjcroucher Jul 16, 2020
0ac1fd4
Expand comment to explain network reuse
nickjcroucher Jul 16, 2020
8e8fefa
Expanded explanation in comments
nickjcroucher Jul 16, 2020
ecbdbe9
Convert listDistInts to a generator function
nickjcroucher Jul 16, 2020
17a3f0e
Remove redundant function
nickjcroucher Jul 16, 2020
230 changes: 130 additions & 100 deletions PopPUNK/__main__.py

Large diffs are not rendered by default.

126 changes: 88 additions & 38 deletions PopPUNK/lineage_clustering.py
@@ -12,7 +12,7 @@
 from collections import defaultdict
 import pickle
 import collections
-import networkx as nx
+import graph_tool.all as gt
 from multiprocessing import Pool, RawArray, shared_memory, managers
 try:
     from multiprocessing import Pool, shared_memory
@@ -91,17 +91,16 @@ def get_nearest_neighbours(rank, isolates = None, ranks = None):
         frozen set of nearest neighbours.
     """
     # data structure
-    nn = {}
+    nn = set()
     # load shared ranks
     ranks_shm = shared_memory.SharedMemory(name = ranks.name)
     ranks = np.ndarray(ranks.shape, dtype = ranks.dtype, buffer = ranks_shm.buf)
     # apply along axis
     for i in isolates:
-        nn[i] = defaultdict(frozenset)
         isolate_ranks = ranks[i,:]
         closest_ranked = np.ravel(np.where(isolate_ranks <= rank))
-        neighbours = frozenset(closest_ranked.tolist())
-        nn[i] = neighbours
+        for j in closest_ranked.tolist():
+            nn.add((i,j))
     # return dict
     return nn
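
As a quick illustration of the new return format (this sketch is not part of the diff): each rank's neighbours are now a set of (i, j) index tuples that can be passed straight to graph-tool. A minimal, self-contained example with an invented rank matrix and threshold:

import numpy as np
import graph_tool.all as gt

# toy data: ranks[i, j] is the rank of isolate j among isolate i's distances
ranks = np.array([[0, 1, 2],
                  [1, 0, 2],
                  [2, 1, 0]])
rank_threshold = 1

edges = set()
for i in range(ranks.shape[0]):
    # columns whose rank falls within the threshold, as in get_nearest_neighbours
    closest_ranked = np.ravel(np.where(ranks[i, :] <= rank_threshold))
    for j in closest_ranked.tolist():
        edges.add((i, j))

G = gt.Graph(directed = False)
G.add_vertex(ranks.shape[0])
G.add_edge_list(edges)    # accepts any iterable of (source, target) pairs
print(G.num_edges())      # 6, including self-loops since ranks[i, i] == 0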

@@ -213,12 +212,17 @@ def cluster_into_lineages(distMat, rank_list = None, output = None,
     """

     # data structures
-    lineage_clustering = defaultdict(dict)
+    lineage_assignation = defaultdict(dict)
     overall_lineage_seeds = defaultdict(dict)
     overall_lineages = defaultdict(dict)
+    max_existing_cluster = {rank:1 for rank in rank_list}

     # load existing scheme if supplied
     if existing_scheme is not None:
         with open(existing_scheme, 'rb') as pickle_file:
-            lineage_clustering, overall_lineage_seeds, rank_list = pickle.load(pickle_file)
+            lineage_assignation, overall_lineage_seeds, rank_list = pickle.load(pickle_file)
+        for rank in rank_list:
+            max_existing_cluster[rank] = max(lineage_assignation[rank].values()) + 1

     # generate square distance matrix
     seqLabels, coreMat, accMat = \
@@ -263,37 +267,87 @@ def cluster_into_lineages(distMat, rank_list = None, output = None,
     distance_ranks_shared_array = np.ndarray(distance_ranks.shape, dtype = distance_ranks.dtype, buffer = distance_ranks_raw.buf)
     distance_ranks_shared_array[:] = distance_ranks[:]
     distance_ranks_shared_array = NumpyShared(name = distance_ranks_raw.name, shape = distance_ranks.shape, dtype = distance_ranks.dtype)

+    # build a graph framework for network outputs
+    # create graph structure with an internal vertex property map storing
+    # lineage assignation; boost.python cannot be loaded within spawned
+    # processes, so the network analysis has to run separately
+    G = gt.Graph(directed = False)
+    G.add_vertex(len(isolate_list))
+    # add sequence labels for visualisation
+    vid = G.new_vertex_property('string',
+                                vals = isolate_list)
+    G.vp.id = vid

     # parallelise neighbour identification for each rank
     with Pool(processes = num_processes) as pool:
-        results = pool.map(partial(run_clustering_for_rank,
-                                   distances_input = distances_shared_array,
-                                   distance_ranks_input = distance_ranks_shared_array,
-                                   isolates = isolate_list_shared,
-                                   previous_seeds = overall_lineage_seeds),
+        results = pool.map(partial(get_nearest_neighbours,
+                                   ranks = distance_ranks_shared_array,
+                                   isolates = isolate_list_shared),
                            rank_list)

-    # extract results from multiprocessing pool
+    # extract results from multiprocessing pool and save output network
+    nn = defaultdict(dict)

     for n,result in enumerate(results):
         # get results per rank
         rank = rank_list[n]
-        lineage_clustering[rank], overall_lineage_seeds[rank] = result
+        # get neighbours
+        edges_to_add = result
+        # store results in network
+        G.add_edge_list(edges_to_add)
+        # calculate connectivity of each vertex
+        vertex_out_degrees = G.get_out_degrees(G.get_vertices())
+        # identify components and rank by frequency
+        components, component_frequencies = gt.label_components(G)
+        component_frequency_ranks = (len(component_frequencies) - rankdata(component_frequencies, method = 'ordinal').astype(int)).tolist()
+        # construct a name translation table,
+        # beginning with previously defined clusters
+        component_name = [None] * len(component_frequencies)
+        for seed in overall_lineage_seeds[rank]:
+            isolate_index = isolate_list.index(seed)
+            component_number = components[isolate_index]
+            if component_name[component_number] is None or component_name[component_number] > overall_lineage_seeds[rank][seed]:
+                component_name[component_number] = overall_lineage_seeds[rank][seed]
+        # name remaining components in rank order
+        for component_rank in range(len(component_frequency_ranks)):
+            component_number = component_frequency_ranks.index(component_rank)
+            if component_name[component_number] is None:
+                component_name[component_number] = max_existing_cluster[rank]
+                # find seed isolate as the highest-degree vertex in the component
+                component_max_degree = np.amax(vertex_out_degrees[np.where(components.a == component_number)])
+                seed_isolate_index = int(np.where((components.a == component_number) & (vertex_out_degrees == component_max_degree))[0][0])
+                seed_isolate = isolate_list[seed_isolate_index]
+                overall_lineage_seeds[rank][seed_isolate] = max_existing_cluster[rank]
+                # increment the next available cluster name
+                max_existing_cluster[rank] = max_existing_cluster[rank] + 1
+        # store assignments
+        for isolate_index,isolate_name in enumerate(isolate_list):
+            original_component = components.a[isolate_index]
+            renamed_component = component_name[original_component]
+            lineage_assignation[rank][isolate_name] = renamed_component
+        # save network
+        G.save(file_name = output + "/" + os.path.basename(output) + '_rank_' + str(rank) + '_lineages.gt', fmt = 'gt')
+        # clear edges - nodes in graph can be reused but edges differ between ranks
+        G.clear_edges()

     # store output
     with open(output + "/" + output + '_lineages.pkl', 'wb') as pickle_file:
-        pickle.dump([lineage_clustering, overall_lineage_seeds, rank_list], pickle_file)
+        pickle.dump([lineage_assignation, overall_lineage_seeds, rank_list], pickle_file)

     # process multirank lineages
-    overall_lineages = {}
+    overall_lineages = {'Rank_' + str(rank):{} for rank in rank_list}
+    overall_lineages['overall'] = {}
     for index,isolate in enumerate(isolate_list):
         overall_lineage = None
         for rank in rank_list:
-            overall_lineages['Rank_' + str(rank)][isolate] = lineage_clustering[rank][index]
+            overall_lineages['Rank_' + str(rank)][isolate] = lineage_assignation[rank][isolate]
             if overall_lineage is None:
-                overall_lineage = str(lineage_clustering[rank][index])
+                overall_lineage = str(lineage_assignation[rank][isolate])
             else:
-                overall_lineage = overall_lineage + '-' + str(lineage_clustering[rank][index])
+                overall_lineage = overall_lineage + '-' + str(lineage_assignation[rank][isolate])
         overall_lineages['overall'][isolate] = overall_lineage

     # print output as CSV
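
To make the component-naming step above easier to follow (this sketch is not part of the diff): gt.label_components returns a vertex property map of component labels plus a frequency histogram, and scipy's rankdata with method = 'ordinal' converts those frequencies into ranks, so rank 0 names the largest component first. A toy example:

import graph_tool.all as gt
from scipy.stats import rankdata

G = gt.Graph(directed = False)
G.add_vertex(6)
G.add_edge_list([(0, 1), (1, 2), (3, 4)])   # components {0,1,2}, {3,4}, {5}

components, component_frequencies = gt.label_components(G)
# ordinal ranks run from 1 (smallest) to n (largest); subtracting from n
# gives 0 for the most frequent component, exactly as in the loop above
component_frequency_ranks = (len(component_frequencies)
    - rankdata(component_frequencies, method = 'ordinal').astype(int)).tolist()
for component_rank in range(len(component_frequency_ranks)):
    component_number = component_frequency_ranks.index(component_rank)
    print('rank', component_rank, '-> component', component_number,
          'of size', int(component_frequencies[component_number]))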
@@ -326,13 +380,13 @@ def run_clustering_for_rank(rank, distances_input = None, distance_ranks_input =
         Whether to extend a previously generated analysis or not.

     Returns:
-        lineage_clustering (dict)
+        lineage_assignation (dict)
             Assignment of each isolate to a cluster.
         lineage_seed (dict)
             Seed isolate used to initiate each cluster.
-        neighbours (nested dict)
-            Neighbour relationships between isolates for R.
-    """
+        connections (set of tuples)
+            Edges to add to network describing lineages.
+    """

     # load shared memory objects
     distances_shm = shared_memory.SharedMemory(name = distances_input.name)
@@ -347,31 +401,27 @@ def run_clustering_for_rank(rank, distances_input = None, distance_ranks_input =
     if previous_seeds is not None:
         seeds = previous_seeds[rank]

-    # create graph structure
-    G = nx.Graph()
-    G.add_nodes_from(isolate_indices)
-    G.nodes.data('lineage', default = 0)
-
     # identify nearest neighbours
     nn = get_nearest_neighbours(rank,
                                 ranks = distance_ranks_input,
                                 isolates = isolate_list)

     # iteratively identify lineages
     lineage_index = 1
-    while nx.number_of_isolates(G) > 0:
+    connections = set()
+    lineage_assignation = {isolate:None for isolate in isolate_list}
+
+    while None in lineage_assignation.values():
         if lineage_index in seeds.keys():
             seed_isolate = seeds[lineage_index]
         else:
-            seed_isolate = pick_seed_isolate(G, distances = distances_input)
+            seed_isolate = pick_seed_isolate(lineage_assignation, distances = distances_input)
         # skip over previously-defined seeds if amalgamated into different lineage now
-        if nx.is_isolate(G, seed_isolate):
+        if lineage_assignation[seed_isolate] is None:
             seeds[lineage_index] = seed_isolate
-            G = get_lineage(G, nn, seed_isolate, lineage_index)
+            lineage_assignation, added_connections = get_lineage(lineage_assignation, nn, seed_isolate, lineage_index)
+            connections.update(added_connections)
         lineage_index = lineage_index + 1

-    # identify components and name lineages
-    lineage_clustering = {node:nodedata for (node, nodedata) in G.nodes(data='lineage')}
-
     # return clustering
-    return lineage_clustering, seeds
+    return lineage_assignation, seeds, nn, connections
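
In outline, the change above replaces the networkx graph that tracked unassigned isolates with a plain dict mapping each isolate to a lineage (or None until assigned). A toy version of that loop, not from the PR, with made-up neighbour sets and a deliberately simplified seed choice (the real pick_seed_isolate consults the distance matrix):

# made-up nearest-neighbour sets keyed by isolate index
nn = {0: {1}, 1: {0}, 2: {3}, 3: {2}, 4: set()}
lineage_assignation = {isolate: None for isolate in nn}

lineage_index = 1
while None in lineage_assignation.values():
    # simplified seed choice: the first unassigned isolate
    seed = next(i for i, lineage in lineage_assignation.items() if lineage is None)
    # spread the lineage over chains of nearest neighbours
    stack = [seed]
    while stack:
        isolate = stack.pop()
        if lineage_assignation[isolate] is None:
            lineage_assignation[isolate] = lineage_index
            stack.extend(j for j in nn[isolate] if lineage_assignation[j] is None)
    lineage_index += 1

print(lineage_assignation)   # {0: 1, 1: 1, 2: 2, 3: 2, 4: 3}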
15 changes: 9 additions & 6 deletions PopPUNK/mash.py
@@ -18,7 +18,6 @@
 from glob import glob
 from random import sample
 import numpy as np
-import networkx as nx
 from scipy import optimize
 try:
     from multiprocessing import Pool, shared_memory
@@ -542,10 +541,10 @@ def queryDatabase(rNames, qNames, dbPrefix, queryPrefix, klist, self = True, num
     # Check mash output is consistent with expected order
     # This is ok in all tests, but best to check and exit in case something changes between mash versions
     expected_names = iterDistRows(refList, qNames, self)
-
     prev_ref = ""
     skip = 0
     skipped = 0
+
     for line in mashOut:
         # Skip the first row with self and symmetric elements
         if skipped < skip:
@@ -602,17 +601,20 @@ def queryDatabase(rNames, qNames, dbPrefix, queryPrefix, klist, self = True, num

     # run pairwise analyses across kmer lengths, mutating distMat
     # Create range of rows that each thread will work with
+    # if there is only one pair, apply_along_axis will not work
+    if threads > number_pairs:
+        threads = number_pairs
     rows_per_thread = int(number_pairs / threads)
     big_threads = number_pairs % threads
     start = 0
     mat_chunks = []

     for thread in range(threads):
         end = start + rows_per_thread
         if thread < big_threads:
             end += 1
         mat_chunks.append((start, end))
         start = end

     # create empty distMat that can be shared with multiple processes
     distMat = np.zeros((number_pairs, 2), dtype=raw.dtype)
     with SharedMemoryManager() as smm:
Expand All @@ -624,7 +626,6 @@ def queryDatabase(rNames, qNames, dbPrefix, queryPrefix, klist, self = True, num

shm_distMat = smm.SharedMemory(size = distMat.nbytes)
distMat_shared = NumpyShared(name = shm_distMat.name, shape = (number_pairs, 2), dtype = raw.dtype)

# Run regressions
with Pool(processes = threads) as pool:
pool.map(partial(fitKmerBlock,
@@ -668,7 +669,10 @@ def fitKmerBlock(idxRanges, distMat, raw, klist, jacobian):

     # analyse
     (start, end) = idxRanges
-    distMat[start:end, :] = np.apply_along_axis(fitKmerCurve, 1, raw[start:end, :], klist, jacobian)
+    if raw.shape[0] == 1:
+        distMat[start:end, :] = fitKmerCurve(raw[0,:], klist, jacobian)
+    else:
+        distMat[start:end, :] = np.apply_along_axis(fitKmerCurve, 1, raw[start:end, :], klist, jacobian)


 def fitKmerCurve(pairwise, klist, jacobian):
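
The single-pair branch above dispatches the fit directly rather than mapping over a one-row matrix. A runnable sketch of the same dispatch, not from the PR; fit_stub is an invented, simplified stand-in for fitKmerCurve using a log-linear fit:

import numpy as np

def fit_stub(row, klist):
    # invented stand-in for fitKmerCurve: log-linear fit of matching
    # k-mer proportion against k, returning rough (core, accessory) estimates
    gradient, intercept = np.polyfit(klist, np.log(row), 1)
    return np.array([1 - np.exp(gradient), 1 - np.exp(intercept)])

klist = np.array([13, 17, 21, 25, 29])
raw = np.array([[0.5, 0.45, 0.4, 0.36, 0.32]])   # a single pair of sketches
distMat = np.zeros((raw.shape[0], 2))

start, end = 0, raw.shape[0]
if raw.shape[0] == 1:
    # one pair: call the fit directly instead of mapping over rows
    distMat[start:end, :] = fit_stub(raw[0, :], klist)
else:
    distMat[start:end, :] = np.apply_along_axis(fit_stub, 1, raw[start:end, :], klist)
print(distMat)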
@@ -707,4 +711,3 @@ def fitKmerCurve(pairwise, klist, jacobian):

     # Return core, accessory
     return(np.flipud(transformed_params))
-
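A usage note on the per-rank networks saved by cluster_into_lineages above: they can be reloaded with graph-tool for inspection. The path below is hypothetical, following the output + '_rank_' + rank + '_lineages.gt' naming pattern used in the diff:

import graph_tool.all as gt

# hypothetical file name following the pattern used in cluster_into_lineages
G = gt.load_graph('lineage_db/lineage_db_rank_1_lineages.gt', fmt = 'gt')
print(G.num_vertices(), 'isolates and', G.num_edges(), 'edges')
print(G.vp.id[G.vertex(0)])   # sequence label stored in the internal 'id' property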