Error read_clustering (ValueError: could not convert string to float: 'TTTTG') #84

Open
SieglindeCoppens opened this issue Aug 2, 2023 · 0 comments

Hi!

I was getting the following error for both the test data and my own data:

executor >  local (5)
[98/2d4484] process > QC (1)                   [100%] 1 of 1 ✔
[87/209bff] process > fastqc (1)               [100%] 1 of 1 ✔
[9b/223d33] process > kmer_freqs (1)           [100%] 1 of 1 ✔
[e6/82bfaa] process > read_clustering (1)      [100%] 1 of 1, failed: 1 ✘
[-        ] process > split_by_cluster         -
[-        ] process > read_correction          -
[-        ] process > draft_selection          -
[-        ] process > racon_pass               -
[-        ] process > medaka_pass              -
[-        ] process > consensus_classification -
[-        ] process > join_results             -
[-        ] process > get_abundances           -
[-        ] process > plot_abundances          -
[80/872ff6] process > output_documentation     [100%] 1 of 1 ✔
Error executing process > 'read_clustering (1)'

Caused by:
  Process `read_clustering (1)` terminated with an error exit status (1)

Command executed [/home/idun/1_Software/NanoCLUST/templates/umap_hdbscan.py]:

  #!/usr/bin/env python
  
  import numpy as np
  import umap
  import matplotlib.pyplot as plt
  from sklearn import decomposition
  import random
  import pandas as pd
  import hdbscan
  
  df = pd.read_csv("freqs.txt", delimiter="	")
  
  #UMAP
  motifs = [x for x in df.columns.values if x not in ["read", "length"]]
  X = df.loc[:,motifs]
  X_embedded = umap.UMAP(n_neighbors=15, min_dist=0.1, verbose=2).fit_transform(X)
  
  df_umap = pd.DataFrame(X_embedded, columns=["D1", "D2"])
  umap_out = pd.concat([df["read"], df["length"], df_umap], axis=1)
  
  #HDBSCAN
  X = umap_out.loc[:,["D1", "D2"]]
  umap_out["bin_id"] = hdbscan.HDBSCAN(min_cluster_size=int(50), cluster_selection_epsilon=int(0.5)).fit_predict(X)
  
  #PLOT
  plt.figure(figsize=(20,20))
  plt.scatter(X_embedded[:, 0], X_embedded[:, 1], c=umap_out["bin_id"], cmap='Spectral', s=1)
  plt.xlabel("UMAP1", fontsize=18)
  plt.ylabel("UMAP2", fontsize=18)
  plt.gca().set_aspect('equal', 'datalim')
  plt.title("Projecting " + str(len(umap_out['bin_id'])) + " reads. " + str(len(umap_out['bin_id'].unique())) + " clusters generated by HDBSCAN", fontsize=18)
  
  for cluster in np.sort(umap_out['bin_id'].unique()):
      read = umap_out.loc[umap_out['bin_id'] == cluster].iloc[0]
      plt.annotate(str(cluster), (read['D1'], read['D2']), weight='bold', size=14)
  
  plt.savefig('hdbscan.output.png')
  umap_out.to_csv("hdbscan.output.tsv", sep="	", index=False)

Command exit status:
  1

Command output:
  (empty)

Command error:
  Matplotlib created a temporary config/cache directory at /tmp/matplotlib-dyrbsl_v because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
  sys:1: DtypeWarning: Columns (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255,256,257,258,259,260,261,262,263,264,265,266,267,268,269,270,271,272,273,274,275,276,277,278,279,280,281,282,283,284,285,286,287,288,289,290,291,292,293,294,295,296,297,298,299,300,301,302,303,304,305,306,307,308,309,310,311,312,313,314,315,316,317,318,319,320,321,322,323,324,325,326,327,328,329,330,331,332,333,334,335,336,337,338,339,340,341,342,343,344,345,346,347,348,349,350,351,352,353,354,355,356,357,358,359,360,361,362,363,364,365,366,367,368,369,370,371,372,373,374,375,376,377,378,379,380,381,382,383,384,385,386,387,388,389,390,391,392,393,394,395,396,397,398,399,400,401,402,403,404,405,406,407,408,409,410,411,412,413,414,415,416,417,418,419,420,421,422,423,424,425,426,427,428,429,430,431,432,433,434,435,436,437,438,439,440,441,442,443,444,445,446,447,448,449,450,451,452,453,454,455,456,457,458,459,460,461,462,463,464,465,466,467,468,469,470,471,472,473,474,475,476,477,478,479,480,481,482,483,484,485,486,487,488,489,490,491,492,493,494,495,496,497,498,499,500,501,502,503,504,505,506,507,508,509,510,511,512,513) have mixed 
types.Specify dtype option on import or set low_memory=False.
  Traceback (most recent call last):
    File ".command.sh", line 16, in <module>
      X_embedded = umap.UMAP(n_neighbors=15, min_dist=0.1, verbose=2).fit_transform(X)
    File "/opt/conda/envs/read_clustering/lib/python3.8/site-packages/umap/umap_.py", line 2014, in fit_transform
      self.fit(X, y)
    File "/opt/conda/envs/read_clustering/lib/python3.8/site-packages/umap/umap_.py", line 1613, in fit
      X = check_array(X, dtype=np.float32, accept_sparse="csr", order="C")
    File "/opt/conda/envs/read_clustering/lib/python3.8/site-packages/sklearn/utils/validation.py", line 72, in inner_f
      return f(**kwargs)
    File "/opt/conda/envs/read_clustering/lib/python3.8/site-packages/sklearn/utils/validation.py", line 598, in check_array
      array = np.asarray(array, order=order, dtype=dtype)
    File "/opt/conda/envs/read_clustering/lib/python3.8/site-packages/numpy/core/_asarray.py", line 83, in asarray
      return array(a, dtype, copy=False, order=order)
    File "/opt/conda/envs/read_clustering/lib/python3.8/site-packages/pandas/core/generic.py", line 1778, in __array__
      return np.asarray(self._values, dtype=dtype)
    File "/opt/conda/envs/read_clustering/lib/python3.8/site-packages/numpy/core/_asarray.py", line 83, in asarray
      return array(a, dtype, copy=False, order=order)
  ValueError: could not convert string to float: 'TTTTG'

Work dir:
  /home/idun/1_Software/NanoCLUST/work/e6/82bfaa94d00dc318b1037dc0f4851f

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

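The cause seems to be an extra first line in freqs.txt that is not skipped when the file is read (see the screenshot below). pandas then presumably takes that line as the header, so the real header row with the k-mer names (such as 'TTTTG') ends up in the data and the columns cannot be converted to float.
[image: screenshot of freqs.txt]
I fixed it by changing line 11 of umap_hdbscan.py so that the first line is skipped.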
From:
df = pd.read_csv("$kmer_freqs", delimiter="\t")
To:
df = pd.read_csv("$kmer_freqs", delimiter="\t", skiprows=[0])

And now it works fine for me.
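
For anyone who wants to see the effect in isolation, here is a minimal sketch of what I think is happening. The table content is made up (in particular the stray first line is only a stand-in, since I am not reproducing the real one here), and `load_motifs` and `to_float32` are hypothetical helper names. The float32 conversion imitates the `check_array(X, dtype=np.float32, ...)` call from the traceback; it is not the UMAP code itself.

```python
import io

import numpy as np
import pandas as pd

# Made-up stand-in for freqs.txt: a stray first line followed by the real
# header. The stray line's content is an assumption; only its presence matters.
RAW = (
    "x0\tx1\tx2\tx3\n"
    "read\tlength\tAAAAA\tTTTTG\n"
    "r1\t1500\t0.01\t0.02\n"
    "r2\t1480\t0.03\t0.04\n"
)


def load_motifs(text, **read_csv_kwargs):
    """Read the table and select the k-mer columns the way umap_hdbscan.py does."""
    df = pd.read_csv(io.StringIO(text), delimiter="\t", **read_csv_kwargs)
    motifs = [c for c in df.columns.values if c not in ["read", "length"]]
    return df.loc[:, motifs]


def to_float32(frame):
    """Imitate the float32 conversion performed on the UMAP input."""
    return np.asarray(frame, dtype=np.float32)


# Without skiprows the stray line becomes the header, the real header becomes
# a data row, and every column holds strings.
bad = load_motifs(RAW)
try:
    to_float32(bad)
    failed = False
except ValueError as err:
    failed = True
    print("without skiprows:", err)

# With skiprows=[0] the real header is used and the k-mer columns are numeric.
good = load_motifs(RAW, skiprows=[0])
X = to_float32(good)
print("with skiprows=[0]:", X.shape, list(good.columns))
```

Note that `skiprows=[0]` is unconditional: on a freqs.txt without the extra line it would drop the real header instead.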

I just wanted to note this here in case anyone else runs into it!
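
If you are not sure whether your freqs.txt has the extra line, a slightly more defensive variant is to look for the real header instead of always skipping one line. This is only a sketch under the assumption that the real header is the first line whose first tab-separated field is `read` (the column name used in umap_hdbscan.py). `read_freqs` is a hypothetical helper, not part of NanoCLUST.

```python
import io

import pandas as pd


def read_freqs(handle, first_column="read"):
    """Read a k-mer frequency table, ignoring any lines before the real header.

    Assumption: the real header is the first line whose first tab-separated
    field equals `first_column`.
    """
    lines = handle.read().splitlines()
    for i, line in enumerate(lines):
        if line.split("\t", 1)[0] == first_column:
            # Parse from the header line onwards.
            return pd.read_csv(io.StringIO("\n".join(lines[i:])), delimiter="\t")
    raise ValueError(f"no header line starting with {first_column!r} found")


# Made-up example tables, with and without a stray first line.
clean = "read\tlength\tAAAAA\nr1\t1500\t0.01\nr2\t1480\t0.03\n"
stray = "x0\tx1\tx2\n" + clean

df_clean = read_freqs(io.StringIO(clean))
df_stray = read_freqs(io.StringIO(stray))
print(df_clean.equals(df_stray))
```

In the script this would be called as `read_freqs(open("freqs.txt"))`. It reads the whole file into memory, which may not be ideal for very large tables.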
