read_clustering error with own data, test data works #67

thierryjanssens · 2022-06-08T07:58:36Z

Hi all, when I run the command below (after adjusting the conda environment yml files to umap-learn=0.5.3 and blast+=2.12.0), the demo data works:

nextflow run main.nf --reads "test_datasets/mock4_run3bc08_5000.fastq" --db "./db/16S_ribosomal_RNA" --tax "db/taxdb/" -profile conda

When I try the same approach on my own data (5k reads of full-length 16S) with the following command, the run fails with the error below:

nextflow run main.nf --reads "/path/to/barcode01.fastq" --db "./db/16S_ribosomal_RNA" --tax "db/taxdb/" -profile conda

Error executing process > 'read_clustering (1)'

Caused by:
Process read_clustering (1) terminated with an error exit status (1)
Command executed [/home/minion/git/NanoCLUST/templates/umap_hdbscan.py]:
#!/usr/bin/env python

import numpy as np
import umap
import matplotlib.pyplot as plt
from sklearn import decomposition
import random
import pandas as pd
import hdbscan

df = pd.read_csv("freqs.txt", delimiter=" ")

#UMAP
motifs = [x for x in df.columns.values if x not in ["read", "length"]]
X = df.loc[:,motifs]
X_embedded = umap.UMAP(n_neighbors=15, min_dist=0.1, verbose=2).fit_transform(X)

df_umap = pd.DataFrame(X_embedded, columns=["D1", "D2"])
umap_out = pd.concat([df["read"], df["length"], df_umap], axis=1)

#HDBSCAN
X = umap_out.loc[:,["D1", "D2"]]
umap_out["bin_id"] = hdbscan.HDBSCAN(min_cluster_size=int(50), cluster_selection_epsilon=int(0.5)).fit_predict(X)

#PLOT
plt.figure(figsize=(20,20))
plt.scatter(X_embedded[:, 0], X_embedded[:, 1], c=umap_out["bin_id"], cmap='Spectral', s=1)
plt.xlabel("UMAP1", fontsize=18)
plt.ylabel("UMAP2", fontsize=18)
plt.gca().set_aspect('equal', 'datalim')
plt.title("Projecting " + str(len(umap_out['bin_id'])) + " reads. " + str(len(umap_out['bin_id'].unique())) + " clusters generated by HDBSCAN", fontsize=18)

for cluster in np.sort(umap_out['bin_id'].unique()):
    read = umap_out.loc[umap_out['bin_id'] == cluster].iloc[0]
    plt.annotate(str(cluster), (read['D1'], read['D2']), weight='bold', size=14)
plt.savefig('hdbscan.output.png')
umap_out.to_csv("hdbscan.output.tsv", sep=" ", index=False)

Command exit status:
1

Command output:
UMAP( verbose=2)
Tue Jun 7 23:36:49 2022 Construct fuzzy simplicial set
Tue Jun 7 23:36:50 2022 Finding Nearest Neighbors
Tue Jun 7 23:36:50 2022 Building RP forest with 21 trees
Tue Jun 7 23:36:55 2022 NN descent for 17 iterations
1 / 17
2 / 17
3 / 17
4 / 17
5 / 17
6 / 17
Stopping threshold met -- exiting after 6 iterations
Tue Jun 7 23:37:14 2022 Finished Nearest Neighbor Search
Tue Jun 7 23:37:17 2022 Construct embedding
Tue Jun 7 23:38:27 2022 Finished embedding

Command error:
Epochs completed: 91%| █████████ 182/200 [00:51]
Epochs completed: 92%| █████████▏ 183/200 [00:51]
Epochs completed: 92%| █████████▏ 184/200 [00:52]
Epochs completed: 92%| █████████▎ 185/200 [00:52]
Epochs completed: 93%| █████████▎ 186/200 [00:52]
Epochs completed: 94%| █████████▎ 187/200 [00:52]
Epochs completed: 94%| █████████▍ 188/200 [00:53]
Epochs completed: 94%| █████████▍ 189/200 [00:53]
Epochs completed: 95%| █████████▌ 190/200 [00:53]
Epochs completed: 96%| █████████▌ 191/200 [00:54]
Epochs completed: 96%| █████████▌ 192/200 [00:54]
Epochs completed: 96%| █████████▋ 193/200 [00:54]
Epochs completed: 97%| █████████▋ 194/200 [00:54]
Epochs completed: 98%| █████████▊ 195/200 [00:55]
Epochs completed: 98%| █████████▊ 196/200 [00:55]
Epochs completed: 98%| █████████▊ 197/200 [00:55]
Epochs completed: 99%| █████████▉ 198/200 [00:55]
Epochs completed: 100%| █████████▉ 199/200 [00:56]
Epochs completed: 100%| ██████████ 200/200 [00:56]
Epochs completed: 100%| ██████████ 200/200 [00:56]
Traceback (most recent call last):
File "/home/minion/git/NanoCLUST/work/conda/read_clustering-5ad1d823e66c1828058a33f36a6c51c6/lib/python3.8/site-packages/joblib/parallel.py", line 822, in dispatch_one_batch
tasks = self._ready_batches.get(block=False)
File "/home/minion/git/NanoCLUST/work/conda/read_clustering-5ad1d823e66c1828058a33f36a6c51c6/lib/python3.8/queue.py", line 167, in get
raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File ".command.sh", line 23, in <module>
umap_out["bin_id"] = hdbscan.HDBSCAN(min_cluster_size=int(50), cluster_selection_epsilon=int(0.5)).fit_predict(X)
File "/home/minion/git/NanoCLUST/work/conda/read_clustering-5ad1d823e66c1828058a33f36a6c51c6/lib/python3.8/site-packages/hdbscan/hdbscan_.py", line 941, in fit_predict
self.fit(X)
File "/home/minion/git/NanoCLUST/work/conda/read_clustering-5ad1d823e66c1828058a33f36a6c51c6/lib/python3.8/site-packages/hdbscan/hdbscan_.py", line 919, in fit
self.min_spanning_tree) = hdbscan(X, **kwargs)
File "/home/minion/git/NanoCLUST/work/conda/read_clustering-5ad1d823e66c1828058a33f36a6c51c6/lib/python3.8/site-packages/hdbscan/hdbscan_.py", line 610, in hdbscan
(single_linkage_tree, result_min_span_tree) = memory.cache(
File "/home/minion/git/NanoCLUST/work/conda/read_clustering-5ad1d823e66c1828058a33f36a6c51c6/lib/python3.8/site-packages/joblib/memory.py", line 349, in __call__
return self.func(*args, **kwargs)
File "/home/minion/git/NanoCLUST/work/conda/read_clustering-5ad1d823e66c1828058a33f36a6c51c6/lib/python3.8/site-packages/hdbscan/hdbscan_.py", line 275, in _hdbscan_boruvka_kdtree
alg = KDTreeBoruvkaAlgorithm(tree, min_samples, metric=metric,
File "hdbscan/_hdbscan_boruvka.pyx", line 375, in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm.__init__
File "hdbscan/_hdbscan_boruvka.pyx", line 411, in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm._compute_bounds
File "/home/minion/git/NanoCLUST/work/conda/read_clustering-5ad1d823e66c1828058a33f36a6c51c6/lib/python3.8/site-packages/joblib/parallel.py", line 1043, in __call__
if self.dispatch_one_batch(iterator):
File "/home/minion/git/NanoCLUST/work/conda/read_clustering-5ad1d823e66c1828058a33f36a6c51c6/lib/python3.8/site-packages/joblib/parallel.py", line 833, in dispatch_one_batch
islice = list(itertools.islice(iterator, big_batch_size))
File "hdbscan/_hdbscan_boruvka.pyx", line 412, in genexpr
TypeError: delayed() got an unexpected keyword argument 'check_pickle'

Work dir:
/home/minion/git/NanoCLUST/work/d5/c1956140ebe9ed2b034fdd72099a72

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named .command.sh
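
A possible way to narrow this down, offered as a sketch rather than a confirmed fix: the final TypeError suggests that the hdbscan build in the read_clustering environment passes a check_pickle keyword to joblib's delayed(), and that the installed joblib does not accept it. The script below checks this directly. The helper names accepts_keyword and report are hypothetical, and it is assumed the script is run with the Python interpreter of the conda environment shown in the traceback paths.

```python
import inspect


def accepts_keyword(func, name):
    """Return True if func can be called with the keyword argument `name`."""
    params = inspect.signature(func).parameters
    if name in params and params[name].kind in (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    ):
        return True
    # A **kwargs catch-all accepts any keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


def report():
    """Describe the installed joblib/hdbscan pair, or say what is missing."""
    try:
        import joblib
    except ImportError:
        return "joblib is not installed in this environment"
    from importlib import metadata  # available from Python 3.8
    try:
        hdbscan_version = metadata.version("hdbscan")
    except metadata.PackageNotFoundError:
        hdbscan_version = "not installed"
    # The traceback shows hdbscan calling delayed(..., check_pickle=...).
    ok = accepts_keyword(joblib.delayed, "check_pickle")
    return (f"joblib {joblib.__version__}, hdbscan {hdbscan_version}: "
            f"delayed() accepts check_pickle = {ok}")


if __name__ == "__main__":
    print(report())
```

If the report ends in False, the joblib and hdbscan versions in the environment are probably mismatched, and pinning a compatible pair in the conda environment yml would be the thing to try. Which versions pair correctly is not established by this log.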
