Updates to MST #148

Merged: 338 commits, Mar 23, 2021
Commits
c64a2c5
Change cudf list conversion
nickjcroucher Mar 10, 2021
b84a269
Update printClusters flags
nickjcroucher Mar 10, 2021
cf1d85b
Allow for saving of cugraph objects
nickjcroucher Mar 10, 2021
81b7033
Fix missing function reference
nickjcroucher Mar 10, 2021
1f708d6
Change vertex list to set for difference
nickjcroucher Mar 10, 2021
79b754f
GPU graphs for non-lineage mode
nickjcroucher Mar 10, 2021
6566544
Change node index extraction
nickjcroucher Mar 10, 2021
bec01e5
Restore missing nodes to GPU graph
nickjcroucher Mar 11, 2021
d216576
Add missing nodes
nickjcroucher Mar 11, 2021
a369c9d
Use range list in place of integer
nickjcroucher Mar 11, 2021
d51d8fd
Fix range list
nickjcroucher Mar 11, 2021
a4c3210
Remove pandas intermediate for data frame
nickjcroucher Mar 11, 2021
ca28aa5
Fix data frame name
nickjcroucher Mar 11, 2021
702c6b9
Add in isolated vertices in GPU graph
nickjcroucher Mar 11, 2021
8c14b5e
Change max to int conversion
nickjcroucher Mar 11, 2021
414efff
Change max calculation
nickjcroucher Mar 11, 2021
6938b53
Change max format
nickjcroucher Mar 11, 2021
90ebda6
Add message checking on maximum
nickjcroucher Mar 11, 2021
4915c67
Change int to float
nickjcroucher Mar 11, 2021
7a0e404
Add warning for missing nodes
nickjcroucher Mar 11, 2021
19a4248
Test DF structure
nickjcroucher Mar 11, 2021
6de7509
Change warning message print format
nickjcroucher Mar 11, 2021
bf278b0
Change cudf definition
nickjcroucher Mar 11, 2021
2d08c72
Add reference extraction for GPU graphs
nickjcroucher Mar 11, 2021
21dc84b
Change ktruss command
nickjcroucher Mar 11, 2021
f64c422
Change ktruss processing
nickjcroucher Mar 11, 2021
f88a0c2
Change components processing
nickjcroucher Mar 11, 2021
77133cb
Change components options
nickjcroucher Mar 11, 2021
9dcaefd
Format Gtruss for graph input
nickjcroucher Mar 11, 2021
5eb7e69
Try option 1 for ktruss
nickjcroucher Mar 11, 2021
b2897b6
Raise ktruss k to 5
nickjcroucher Mar 11, 2021
9b58655
Change ktruss formats
nickjcroucher Mar 11, 2021
98dba61
Print network summaries
nickjcroucher Mar 11, 2021
24713ee
Fix grammar
nickjcroucher Mar 11, 2021
65d8cb4
Test Louvain
nickjcroucher Mar 11, 2021
2be9513
Test Leiden
nickjcroucher Mar 11, 2021
cf012fb
Process Leiden output
nickjcroucher Mar 11, 2021
74bb8ee
Process Leiden both outputs
nickjcroucher Mar 11, 2021
a694176
Change grouping variable
nickjcroucher Mar 11, 2021
0c0789d
Test grouping code
nickjcroucher Mar 11, 2021
e0d6f87
Fix grouping code
nickjcroucher Mar 12, 2021
b9319dc
Change iloc selection
nickjcroucher Mar 12, 2021
5b3d183
Change selection processing
nickjcroucher Mar 12, 2021
b29d82c
Remove column select
nickjcroucher Mar 12, 2021
1b5fd31
Change list conversion
nickjcroucher Mar 12, 2021
85ac6f5
Add reference graph construction
nickjcroucher Mar 12, 2021
17a0997
Add missing bracket
nickjcroucher Mar 12, 2021
2e1f3ee
Add edge list
nickjcroucher Mar 12, 2021
e5cb974
Change column names
nickjcroucher Mar 12, 2021
eecf239
Remove weights from reference graph
nickjcroucher Mar 12, 2021
7fe3277
Add self loops for reference graph
nickjcroucher Mar 12, 2021
d6a3442
Change column names
nickjcroucher Mar 12, 2021
fc9d0a5
Change df concatenation
nickjcroucher Mar 12, 2021
366630b
Print ref graph
nickjcroucher Mar 12, 2021
76671a3
Add resolution parameter to Leiden method
nickjcroucher Mar 12, 2021
1a646b6
Add GPU graph loading
nickjcroucher Mar 12, 2021
f22cf55
Change GPU graph writing
nickjcroucher Mar 12, 2021
accafd3
Change CSV compression
nickjcroucher Mar 12, 2021
a2bb284
Change output file name
nickjcroucher Mar 12, 2021
2729a11
Add suffix to output file
nickjcroucher Mar 12, 2021
9466ae6
Correct suffix to output file
nickjcroucher Mar 12, 2021
80533f8
Fix dist order with lineage mode
johnlees Mar 12, 2021
ce8135e
docstring typo
johnlees Mar 12, 2021
cca4a7c
Add GPU summaries
nickjcroucher Mar 12, 2021
4a53fd2
Remove surplus bracket
nickjcroucher Mar 12, 2021
3c7d1d2
Change sum of degree
nickjcroucher Mar 12, 2021
b75dcce
Load cugraph libraries
nickjcroucher Mar 12, 2021
8ccddf4
Print degree for debug
nickjcroucher Mar 12, 2021
81bfb5e
Change access to degree
nickjcroucher Mar 12, 2021
3368c6b
Add missing bracket
nickjcroucher Mar 12, 2021
ff07201
Change degree print statement
nickjcroucher Mar 12, 2021
628ad32
Convert to pandas
nickjcroucher Mar 12, 2021
8e36694
Change iteration over components
nickjcroucher Mar 12, 2021
f19adfc
Print details of components
nickjcroucher Mar 12, 2021
14a14f9
Convert series value to int
nickjcroucher Mar 12, 2021
38298f2
Extract single value for size
nickjcroucher Mar 12, 2021
fb6ba09
Change column name
nickjcroucher Mar 12, 2021
2973f6e
Print component betweenness
nickjcroucher Mar 12, 2021
49e4fc9
Find maximum betweenness
nickjcroucher Mar 12, 2021
2461855
Change column name
nickjcroucher Mar 12, 2021
69edec7
Betweenness access change
nickjcroucher Mar 12, 2021
00d84d5
Change summary stat recording
nickjcroucher Mar 12, 2021
438c269
Tidy up debug messages
nickjcroucher Mar 12, 2021
0382024
Transitivity calculation details
nickjcroucher Mar 12, 2021
583fa13
Change printing of debug
nickjcroucher Mar 12, 2021
e5bb57a
Print counts
nickjcroucher Mar 12, 2021
bdf8a83
Enable GPUs for refinement
nickjcroucher Mar 13, 2021
2e1c802
Change kwarg to arg in optimise
nickjcroucher Mar 13, 2021
4e68b7b
Change default arguments
nickjcroucher Mar 13, 2021
a2ce789
Cascade use_gpu argument through functions
nickjcroucher Mar 13, 2021
9748785
Change refine arguments
nickjcroucher Mar 13, 2021
e7ff375
Communicate GPU use
nickjcroucher Mar 13, 2021
9eba438
Improve graph reconstruction in refinement
nickjcroucher Mar 13, 2021
ca2992b
Load CUDA libraries
nickjcroucher Mar 13, 2021
64e6284
Add debug message
nickjcroucher Mar 13, 2021
181a328
Fix column names
nickjcroucher Mar 13, 2021
900ed24
Make column names consistent
nickjcroucher Mar 13, 2021
5601a2b
Updating networks with CUDA
nickjcroucher Mar 13, 2021
15c106a
Change betweenness processing
nickjcroucher Mar 13, 2021
f950cd8
Add GPU options to assign
nickjcroucher Mar 13, 2021
a6a037e
Change function argument order
nickjcroucher Mar 13, 2021
2b65e61
Change argument typo
nickjcroucher Mar 13, 2021
9cc9ad6
Add CUDA load for querying
nickjcroucher Mar 14, 2021
4035062
Fix graph suffix for GPU
nickjcroucher Mar 14, 2021
01dcf37
Quote column name
nickjcroucher Mar 14, 2021
cbeebaf
Define graph name
nickjcroucher Mar 14, 2021
aee03b1
Change cudf column names on loading
nickjcroucher Mar 14, 2021
a0fce56
Add weights to column names
nickjcroucher Mar 14, 2021
43d13a7
Print formatted DF
nickjcroucher Mar 14, 2021
2ea95f4
Remove Pandas index from CSV
nickjcroucher Mar 14, 2021
41868b9
Update graph loading message
nickjcroucher Mar 14, 2021
39e80a7
Update graph loading message again
nickjcroucher Mar 14, 2021
c12d08a
Change gpu option
nickjcroucher Mar 14, 2021
04ee820
Change name of tuples
nickjcroucher Mar 14, 2021
655025b
Change printClusters to use GPU
nickjcroucher Mar 14, 2021
50cc317
Edit component assignments
nickjcroucher Mar 14, 2021
dc9a702
Print node count
nickjcroucher Mar 14, 2021
067149c
Return updated graph from function
nickjcroucher Mar 14, 2021
bde7dd8
Update to be consistent with changes to network function
nickjcroucher Mar 14, 2021
0aa02c6
Remove debug messages
nickjcroucher Mar 14, 2021
20283b4
Merge remote-tracking branch 'origin/dist_order_redux' into mst_dev
nickjcroucher Mar 15, 2021
60306b8
Merge in master
nickjcroucher Mar 15, 2021
eca9b50
Ensure consistency across function arguments
nickjcroucher Mar 15, 2021
d7fb652
Add omitted sys exit when missing distance file
nickjcroucher Mar 15, 2021
593b26f
Update file name formatting
nickjcroucher Mar 15, 2021
69724c0
Edit whitespace
nickjcroucher Mar 15, 2021
470fec0
Edit whitespace
nickjcroucher Mar 15, 2021
9f600de
Edit whitespace
nickjcroucher Mar 15, 2021
9570244
Fix qList variable name
nickjcroucher Mar 15, 2021
95fb235
Remove debug message
nickjcroucher Mar 15, 2021
f323ccf
Edit whitespace
nickjcroucher Mar 15, 2021
5333e88
Remove debug message
nickjcroucher Mar 15, 2021
78a295f
Change comment wording
nickjcroucher Mar 15, 2021
6eacd4b
Reorder column indices
nickjcroucher Mar 15, 2021
083722b
Change column indices
nickjcroucher Mar 15, 2021
bbdf53b
Edit whitespace
nickjcroucher Mar 15, 2021
c7daea2
Update assign arguments
nickjcroucher Mar 15, 2021
e18ed35
Make CUDA library imports global
nickjcroucher Mar 15, 2021
fcf2858
Only import GPU libraries once
nickjcroucher Mar 15, 2021
090994f
Add reference isolate to assign command
nickjcroucher Mar 15, 2021
89f97ce
Changes to command line phrasing
nickjcroucher Mar 16, 2021
059d804
Change GPU library loading
nickjcroucher Mar 16, 2021
c9b07e8
Merge branch 'mst_dev' of https://github.com/johnlees/PopPUNK into ms…
nickjcroucher Mar 16, 2021
7f110c8
Update web test
nickjcroucher Mar 16, 2021
a87c94d
Change distance QC routine
nickjcroucher Mar 16, 2021
46a9a9a
Update distance QC functions
nickjcroucher Mar 16, 2021
40a6e4f
Select reference isolate where not supplied
nickjcroucher Mar 16, 2021
845acb0
Change missing nodes to error
nickjcroucher Mar 16, 2021
bb2f9e1
Use function for checking network vertex count
nickjcroucher Mar 16, 2021
dbcb4f6
Tidy up obsolete text
nickjcroucher Mar 16, 2021
f61ccd0
Add self_loop function
nickjcroucher Mar 16, 2021
55507b4
Remove condition on adding edges
nickjcroucher Mar 16, 2021
acbeeea
Add copy function for models
nickjcroucher Mar 16, 2021
90144c0
Fix file and cluster name processing
nickjcroucher Mar 16, 2021
1b40461
Change network loading functions
nickjcroucher Mar 16, 2021
ae777f4
Update import of old networks to use cugraph
nickjcroucher Mar 17, 2021
616c31e
Fix processing of distance matrix
nickjcroucher Mar 17, 2021
03ab3f8
Avoid overwrite on qcDistMat
nickjcroucher Mar 17, 2021
ef450c6
Start checking reference graph connectivity
nickjcroucher Mar 17, 2021
08f58ff
Enable qcDistMat to create output directory
nickjcroucher Mar 17, 2021
6094b2b
Change vertex count error message
nickjcroucher Mar 17, 2021
93c872d
Change column naming in cugraph
nickjcroucher Mar 17, 2021
0f77ade
Get column names
nickjcroucher Mar 17, 2021
90aee39
Rename in place
nickjcroucher Mar 17, 2021
b037593
Compare cudfs
nickjcroucher Mar 17, 2021
ce8fccb
View reference cudf
nickjcroucher Mar 17, 2021
fe24179
View overall cudf
nickjcroucher Mar 17, 2021
b8d03b2
Concat cudf
nickjcroucher Mar 17, 2021
a8e7e33
Merge cudf
nickjcroucher Mar 17, 2021
86b8d85
Filter merged cudf
nickjcroucher Mar 17, 2021
c34a51f
Summarise merged cudf
nickjcroucher Mar 17, 2021
b4d1fbd
Change bool in while loop
nickjcroucher Mar 17, 2021
8c93ac3
Rename bool variable
nickjcroucher Mar 17, 2021
af08e6a
Change cudf tallying
nickjcroucher Mar 17, 2021
3af73ca
identify column names
nickjcroucher Mar 17, 2021
2f7d1f5
Rename columns
nickjcroucher Mar 17, 2021
049aba5
Fix loop control
nickjcroucher Mar 17, 2021
ce67074
Remove some debug messages
nickjcroucher Mar 17, 2021
a80f206
Print counting information
nickjcroucher Mar 17, 2021
ab52160
Print counting information
nickjcroucher Mar 17, 2021
490a227
Print as list
nickjcroucher Mar 17, 2021
7683bf9
Print as unique list
nickjcroucher Mar 17, 2021
659ccff
Find overall max
nickjcroucher Mar 17, 2021
80bbd67
Test reference connectivity
nickjcroucher Mar 17, 2021
51e442e
use debug mode
nickjcroucher Mar 17, 2021
0e69d8c
Change group by variable
nickjcroucher Mar 17, 2021
c981792
Correct group variable selection
nickjcroucher Mar 17, 2021
906b534
Add extra debug print statement
nickjcroucher Mar 17, 2021
ee510c4
Change ref selection
nickjcroucher Mar 17, 2021
2b981e2
Extend debug mode
nickjcroucher Mar 17, 2021
23e6b60
Further debug
nickjcroucher Mar 17, 2021
2151ec8
Changes to debug message
nickjcroucher Mar 17, 2021
a57a1b8
Update reference indices
nickjcroucher Mar 17, 2021
920a66b
Change series to set conversion
nickjcroucher Mar 17, 2021
8c9cf4d
Change set processing
nickjcroucher Mar 17, 2021
4f646b3
Change filtering conditions
nickjcroucher Mar 17, 2021
4669728
Change definition of reference set
nickjcroucher Mar 17, 2021
25b8ccc
Change definition of reference set
nickjcroucher Mar 17, 2021
294f3fa
Change definition of reference set
nickjcroucher Mar 17, 2021
ba946b1
Debug series filtering
nickjcroucher Mar 17, 2021
cebdf3a
Extract series values
nickjcroucher Mar 17, 2021
f86e814
Change extraction of values from series
nickjcroucher Mar 17, 2021
26bbc0a
Comment code for impending review
nickjcroucher Mar 17, 2021
a9d8fb0
Remove unnecessary loop
nickjcroucher Mar 17, 2021
92d0f0c
Change vertex selection for SSSP
nickjcroucher Mar 17, 2021
01bbe77
Reconstruct reference graph where necessary
nickjcroucher Mar 17, 2021
ad66e9c
Debug for missing nodes
nickjcroucher Mar 17, 2021
7d49d86
Remove debug message
nickjcroucher Mar 17, 2021
512c127
Add missing nodes with cugraph
nickjcroucher Mar 17, 2021
c504f96
Add missing nodes with cugraph
nickjcroucher Mar 17, 2021
8b5ed90
Add missing nodes with cugraph
nickjcroucher Mar 17, 2021
624bce8
Add missing nodes with cugraph
nickjcroucher Mar 17, 2021
f032f26
Change cugraph node count retrieval
nickjcroucher Mar 17, 2021
5dc2b0e
Change int format
nickjcroucher Mar 17, 2021
7602dd3
Change save function definition
nickjcroucher Mar 17, 2021
bdb0141
Change GPU score calculation
nickjcroucher Mar 17, 2021
4b78c1e
Update cytoscape viz test
nickjcroucher Mar 17, 2021
35abac0
Changes to messages and function arguments
nickjcroucher Mar 18, 2021
6a7e806
Disambiguation of term 'reference'
nickjcroucher Mar 18, 2021
a94fe2d
Check type isolate is in QC filtered set
nickjcroucher Mar 22, 2021
348e502
Add type isolate to reference set
nickjcroucher Mar 22, 2021
5041a91
Fixes to conditional statements
nickjcroucher Mar 22, 2021
7901e42
Fixes to function arguments
nickjcroucher Mar 22, 2021
7742e83
Change cudf memory management
nickjcroucher Mar 22, 2021
77e9d64
Add no-plot mode for models
nickjcroucher Mar 22, 2021
2c7d014
Extend no-plot mode for models
nickjcroucher Mar 22, 2021
c450710
Move model processing flags to optimisation arg group
nickjcroucher Mar 22, 2021
f823964
Update new lines
nickjcroucher Mar 22, 2021
a5c4f02
Edit whitespace
nickjcroucher Mar 22, 2021
e6abfb4
Edit whitespace
nickjcroucher Mar 22, 2021
78ee9c2
Edit whitespace
nickjcroucher Mar 22, 2021
5b362fb
Reinsert library loading warning
nickjcroucher Mar 22, 2021
03f11a8
Change model plotting behaviour
nickjcroucher Mar 22, 2021
3e5160e
Add cudf and cugraph
nickjcroucher Mar 22, 2021
13e565f
Change list command to append
nickjcroucher Mar 22, 2021
ed7e549
Merge branch 'master' into mst_dev
johnlees Mar 22, 2021
841256d
Merge changes from VLKC PR
johnlees Mar 22, 2021
771ae57
Limit betweenness calculation with GPU
nickjcroucher Mar 22, 2021
c8093c8
Merge branch 'mst_dev' of https://github.com/johnlees/PopPUNK into ms…
nickjcroucher Mar 22, 2021
ca8c429
Set from numpy ndarray
nickjcroucher Mar 22, 2021
b526a2f
Convert ndarray to list
nickjcroucher Mar 22, 2021
414c27b
Convert ndarray to list
nickjcroucher Mar 22, 2021
0cef59d
Remove debug message
nickjcroucher Mar 22, 2021
d6667a1
Remove nvidia packages from CI
johnlees Mar 23, 2021
b5e03eb
Remove whitespace
johnlees Mar 23, 2021
62e69d2
Remove cudf/cugraph err message
johnlees Mar 23, 2021
a1c6692
trailing whitespace
johnlees Mar 23, 2021
1a11b21
Change cugraph betweenness calculation
nickjcroucher Mar 23, 2021
f98bd0f
Remove multiprocessing block from 2d network refine w/ GPU
johnlees Mar 23, 2021
e1879a8
Fix web test
johnlees Mar 23, 2021
2 changes: 1 addition & 1 deletion MANIFEST.in
@@ -1,2 +1,2 @@
recursive-include scripts *.py
recursive-include PopPUNK/data *.json *.gz *.txt
recursive-include PopPUNK/data *.gz
135 changes: 78 additions & 57 deletions PopPUNK/__main__.py
@@ -94,6 +94,10 @@ def get_options():
'separate database [default = False]', default=False, action='store_true')
qcGroup.add_argument('--max-a-dist', help='Maximum accessory distance to permit [default = 0.5]',
default = 0.5, type = float)
qcGroup.add_argument('--max-pi-dist', help='Maximum core distance to permit [default = 0.5]',
default = 0.5, type = float)
qcGroup.add_argument('--type-isolate', help='Isolate from which distances will be calculated for pruning [default = None]',
default = None, type = str)
qcGroup.add_argument('--length-sigma', help='Number of standard deviations of length distribution beyond '
'which sequences will be excluded [default = 5]', default = 5, type = int)
qcGroup.add_argument('--length-range', help='Allowed length range, outside of which sequences will be excluded '
@@ -121,8 +125,6 @@ def get_options():
type=float, default = None)
refinementGroup.add_argument('--manual-start', help='A file containing information for a start point. '
'See documentation for help.', default=None)
refinementGroup.add_argument('--no-local', help='Do not perform the local optimization step (speed up on very large datasets)',
default=False, action='store_true')
refinementGroup.add_argument('--model-dir', help='Directory containing model to use for assigning queries '
'to clusters [default = reference database directory]', type = str)
refinementGroup.add_argument('--score-idx',
@@ -150,7 +152,12 @@ def get_options():
other.add_argument('--threads', default=1, type=int, help='Number of threads to use [default = 1]')
other.add_argument('--gpu-sketch', default=False, action='store_true', help='Use a GPU when calculating sketches (read data only) [default = False]')
other.add_argument('--gpu-dist', default=False, action='store_true', help='Use a GPU when calculating distances [default = False]')
other.add_argument('--gpu-graph', default=False, action='store_true', help='Use a GPU when calculating networks [default = False]')
other.add_argument('--deviceid', default=0, type=int, help='CUDA device ID, if using GPU [default = 0]')
other.add_argument('--no-plot', help='Switch off model plotting, which can be slow for large datasets',
default=False, action='store_true')
other.add_argument('--no-local', help='Do not perform the local optimization step in model refinement (speed up on very large datasets)',
default=False, action='store_true')

other.add_argument('--version', action='version',
version='%(prog)s '+__version__)
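Review note: the option changes in the hunks above can be tried in isolation. The following is a minimal sketch, assuming a stand-alone parser named `build_parser` (hypothetical, not part of PopPUNK) that reproduces only the flags touched by this PR, with their defaults and help text copied from the diff. It also shows how the parsed values would feed the `qc_dict` entries added further down.

```python
import argparse


def build_parser():
    # Hypothetical stand-alone parser: only the options touched in this PR
    # are reproduced; the real get_options() defines many more.
    parser = argparse.ArgumentParser(prog="poppunk_options_sketch")

    qc_group = parser.add_argument_group("Quality control options")
    qc_group.add_argument("--max-a-dist", default=0.5, type=float,
                          help="Maximum accessory distance to permit [default = 0.5]")
    qc_group.add_argument("--max-pi-dist", default=0.5, type=float,
                          help="Maximum core distance to permit [default = 0.5]")
    qc_group.add_argument("--type-isolate", default=None, type=str,
                          help="Isolate from which distances will be calculated "
                               "for pruning [default = None]")

    other = parser.add_argument_group("Other options")
    other.add_argument("--gpu-graph", default=False, action="store_true",
                       help="Use a GPU when calculating networks [default = False]")
    other.add_argument("--no-plot", default=False, action="store_true",
                       help="Switch off model plotting, which can be slow for large datasets")
    other.add_argument("--no-local", default=False, action="store_true",
                       help="Do not perform the local optimization step in model refinement")
    return parser


# Example invocation with an illustrative isolate name ("sample_1" is made up).
args = build_parser().parse_args(
    ["--max-pi-dist", "0.1", "--type-isolate", "sample_1", "--gpu-graph"]
)

# argparse converts dashes in option names to underscores on the namespace.
qc_dict = {
    "max_pi_dist": args.max_pi_dist,
    "max_a_dist": args.max_a_dist,
    "type_isolate": args.type_isolate,
}
```

Because `--no-plot` and `--no-local` are `store_true` flags, they stay `False` unless given explicitly, so existing command lines keep their previous behaviour.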
@@ -200,6 +207,9 @@ def main():
from .network import constructNetwork
from .network import extractReferences
from .network import printClusters
from .network import get_vertex_list
from .network import save_network
from .network import checkNetworkVertexCount

from .plot import writeClusterCsv
from .plot import plot_scatter
@@ -233,7 +243,10 @@ def main():
'length_sigma': args.length_sigma,
'length_range': args.length_range,
'prop_n': args.prop_n,
'upper_n': args.upper_n
'upper_n': args.upper_n,
'max_pi_dist': args.max_pi_dist,
'max_a_dist': args.max_a_dist,
'type_isolate': args.type_isolate
}

# Dict of DB access functions
@@ -288,38 +301,42 @@ def main():
sys.stderr.write("--create-db requires --r-files and --output")
sys.exit(1)

# generate sketches and QC sequences
# generate sketches and QC sequences to identify sequences not matching specified criteria
createDatabaseDir(args.output, kmers)
seq_names = constructDatabase(
args.r_files,
kmers,
sketch_sizes,
args.output,
args.threads,
args.overwrite,
codon_phased = args.codon_phased,
calc_random = True)

rNames = seq_names
qNames = seq_names
refList, queryList, distMat = queryDatabase(rNames = rNames,
qNames = qNames,
dbPrefix = args.output,
queryPrefix = args.output,
klist = kmers,
self = True,
number_plot_fits = args.plot_fit,
threads = args.threads)
qcDistMat(distMat, refList, queryList, args.max_a_dist)

# Save results
dists_out = args.output + "/" + os.path.basename(args.output) + ".dists"
storePickle(refList, queryList, True, distMat, dists_out)
seq_names_passing = \
constructDatabase(
args.r_files,
kmers,
sketch_sizes,
args.output,
args.threads,
args.overwrite,
codon_phased = args.codon_phased,
calc_random = True)

# calculate distances between sequences
distMat = queryDatabase(rNames = seq_names_passing,
qNames = seq_names_passing,
dbPrefix = args.output,
queryPrefix = args.output,
klist = kmers,
self = True,
number_plot_fits = args.plot_fit,
threads = args.threads)

# QC pairwise distances to identify long distances indicative of anomalous sequences in the collection
seq_names_passing, distMat = qcDistMat(distMat,
seq_names_passing,
seq_names_passing,
args.output,
args.output,
qc_dict)

# Plot results
plot_scatter(distMat,
args.output + "/" + os.path.basename(args.output) + "_distanceDistribution",
args.output + " distances")
if not args.no_plot:
plot_scatter(distMat,
args.output + "/" + os.path.basename(args.output) + "_distanceDistribution",
args.output + " distances")

#******************************#
#* *#
@@ -340,7 +357,7 @@ def main():
sys.stderr.write("Need to provide --ref-db where .h5 and .dists from "
"--create-db mode were output")
if args.distances is None:
distances = os.path.basename(args.ref_db) + "/" + args.ref_db + ".dists"
distances = args.ref_db + "/" + os.path.basename(args.ref_db) + ".dists"
else:
distances = args.distances
if args.output is None:
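Review note: the one-line change to the default `.dists` path in this hunk swaps the directory and the base name. A minimal sketch of the two forms, assuming hypothetical helper names (`old_default_dists`, `new_default_dists`) and a made-up database path, shows why the earlier ordering pointed at the wrong location whenever `--ref-db` contained a directory component.

```python
import os


def old_default_dists(ref_db):
    # Ordering before this PR: base name first, then the full path.
    return os.path.basename(ref_db) + "/" + ref_db + ".dists"


def new_default_dists(ref_db):
    # Ordering after this PR: <ref_db>/<basename of ref_db>.dists
    return ref_db + "/" + os.path.basename(ref_db) + ".dists"


# "databases/strep" is an illustrative value, not taken from the PR.
old_path = old_default_dists("databases/strep")
new_path = new_default_dists("databases/strep")

# With a bare directory name both forms agree, which may be why the
# earlier ordering went unnoticed.
same_when_flat = old_default_dists("strep") == new_default_dists("strep")
```

One caveat that applies to both forms: `os.path.basename` returns an empty string for a path with a trailing slash, so `--ref-db databases/strep/` would still yield an unexpected file name.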
@@ -365,8 +382,9 @@ def main():

# Load the distances
refList, queryList, self, distMat = readPickle(distances, enforce_self=True)
if qcDistMat(distMat, refList, queryList, args.max_a_dist) == False \
and args.qc_filter == "stop":
seq_names = set(set(refList) | set(queryList))
seq_names_passing, distMat = qcDistMat(distMat, refList, queryList, args.ref_db, output, qc_dict)
if len(set(seq_names_passing).difference(seq_names)) > 0 and args.qc_filter == "stop":
sys.stderr.write("Distances failed quality control (change QC options to run anyway)\n")
sys.exit(1)
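Review note: the new QC stop condition compares two sets of names, and the direction of the set difference matters. The sketch below assumes that `qcDistMat` returns the names that passed QC, as a subset of the input names; that is an assumption, since the function body is not shown in this diff. The helper `removed_by_qc` and the sample names are hypothetical.

```python
def removed_by_qc(all_names, passing_names):
    # Names present before QC but absent afterwards.
    return set(all_names) - set(passing_names)


# Illustrative data only.
all_names = ["s1", "s2", "s3"]
passing_names = ["s1", "s3"]

# Full set minus passing set: the samples QC removed.
removed = removed_by_qc(all_names, passing_names)

# Passing set minus full set: empty whenever passing is a subset of all.
reverse = set(passing_names).difference(all_names)

should_stop = len(removed) > 0
```

If the subset assumption holds, the condition `len(set(seq_names_passing).difference(seq_names)) > 0` in the hunk above would never be true, so `--qc-filter stop` might not trigger; the `removed` form is the one that detects dropped samples. This may be worth checking against the actual return value of `qcDistMat`.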

@@ -382,13 +400,11 @@ def main():
model = DBSCANFit(output)
model.set_threads(args.threads)
assignments = model.fit(distMat, args.D, args.min_cluster_prop)
model.plot()
# Run Gaussian model
elif args.fit_model == "bgmm":
model = BGMMFit(output)
model.set_threads(args.threads)
assignments = model.fit(distMat, args.K)
model.plot(distMat, assignments)
elif args.fit_model == "refine":
new_model = RefineFit(output)
new_model.set_threads(args.threads)
@@ -398,23 +414,21 @@ def main():
args.indiv_refine,
args.unconstrained,
args.score_idx,
args.no_local)
new_model.plot(distMat)
args.no_local,
args.gpu_graph)
model = new_model
elif args.fit_model == "threshold":
new_model = RefineFit(output)
new_model.set_threads(args.threads)
assignments = new_model.apply_threshold(distMat,
args.threshold)
new_model.plot(distMat)
model = new_model
elif args.fit_model == "lineage":
# run lineage clustering. Sparsity & low rank should keep memory
# usage of dict reasonable
model = LineageFit(output, rank_list)
model.set_threads(args.threads)
model.fit(distMat, args.use_accessory)
model.plot(distMat)

assignments = {}
for rank in rank_list:
@@ -423,6 +437,10 @@ def main():

# save model
model.save()

# plot model
if not args.no_plot:
model.plot(distMat, assignments)

# use model
else:
@@ -443,7 +461,8 @@ def main():
queryList,
assignments,
model.within_label,
weights=weights)
weights = weights,
use_gpu = args.gpu_graph)
else:
# Lineage fit requires some iteration
indivNetworks = {}
@@ -459,13 +478,15 @@ def main():
refList,
assignments[rank],
0,
edge_list=True,
weights=weights
edge_list = True,
weights = weights,
use_gpu = args.gpu_graph
)
lineage_clusters[rank] = \
printClusters(indivNetworks[rank],
refList,
printCSV = False)
printCSV = False,
use_gpu = args.gpu_graph)

# print output of each rank as CSV
overall_lineage = createOverallLineage(rank_list, lineage_clusters)
@@ -480,16 +501,14 @@ def main():
genomeNetwork = indivNetworks[min(rank_list)]

# Ensure all in dists are in final network
networkMissing = set(map(str,set(range(len(refList))).difference(list(genomeNetwork.vertices()))))
if len(networkMissing) > 0:
missing_isolates = [refList[m] for m in networkMissing]
sys.stderr.write("WARNING: Samples " + ", ".join(missing_isolates) + " are missing from the final network\n")
checkNetworkVertexCount(refList, genomeNetwork, use_gpu = args.gpu_graph)

fit_type = model.type
isolateClustering = {fit_type: printClusters(genomeNetwork,
refList,
output + "/" + os.path.basename(output),
externalClusterCSV = args.external_clustering)}
externalClusterCSV = args.external_clustering,
use_gpu = args.gpu_graph)}

# Write core and accessory based clusters, if they worked
if model.indiv_fitted:
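Review note: this hunk replaces the inline missing-vertex check with a call to `checkNetworkVertexCount`, whose body is not part of this diff. The following is a minimal sketch of the underlying comparison only, assuming a hypothetical helper `find_missing_samples` that takes the vertex indices as plain integers; it does not reproduce the real function or any graph-tool or cugraph call.

```python
def find_missing_samples(ref_list, vertex_indices):
    # Indices expected from the sample list but absent from the network.
    expected = set(range(len(ref_list)))
    missing = expected.difference(int(v) for v in vertex_indices)
    # Keep the indices as integers so they can still index ref_list.
    return [ref_list[i] for i in sorted(missing)]


# Illustrative data only: three samples, the network lacks vertex 1.
missing_names = find_missing_samples(["a", "b", "c"], [0, 2])
none_missing = find_missing_samples(["a", "b", "c"], [0, 1, 2])

if missing_names:
    message = ("Samples " + ", ".join(missing_names)
               + " are missing from the final network")
```

The removed inline version mapped the missing indices through `str` before using them in `refList[m]`; indexing a list with a string raises `TypeError`, so keeping the indices as integers avoids that failure on the path that reports missing samples.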
@@ -517,9 +536,7 @@ def main():
fit_type = 'accessory'
genomeNetwork = indivNetworks['accessory']

genomeNetwork.save(output + "/" + \
os.path.basename(output) + '_graph.gt',
fmt = 'gt')
save_network(genomeNetwork, prefix = output, suffix = "_graph", use_gpu = args.gpu_graph)

#******************************#
#* *#
@@ -530,7 +547,12 @@ def main():
# (this no longer loses information and should generally be kept on)
if model.type != "lineage":
newReferencesIndices, newReferencesNames, newReferencesFile, genomeNetwork = \
extractReferences(genomeNetwork, refList, output, threads = args.threads)
extractReferences(genomeNetwork,
refList,
output,
type_isolate = qc_dict['type_isolate'],
threads = args.threads,
use_gpu = args.gpu_graph)
nodes_to_remove = set(range(len(refList))).difference(newReferencesIndices)
names_to_remove = [refList[n] for n in nodes_to_remove]

@@ -539,9 +561,8 @@ def main():
prune_distance_matrix(refList, names_to_remove, distMat,
output + "/" + os.path.basename(output) + ".refs.dists")
# Save reference network
genomeNetwork.save(output + "/" + \
os.path.basename(output) + '.refs_graph.gt',
fmt = 'gt')
save_network(genomeNetwork, prefix = output, suffix = ".refs_graph",
use_gpu = args.gpu_graph)
removeFromDB(args.ref_db, output, names_to_remove)
os.rename(output + "/" + os.path.basename(output) + ".tmp.h5",
output + "/" + os.path.basename(output) + ".refs.h5")