Create a database (DB) from a custom database. #88

Open
miangher opened this issue Nov 3, 2023 · 0 comments
miangher commented Nov 3, 2023

Good morning! I need your help!
Starting from the leBIBI database "leBIBI IV SSU-rDNA (16S) Automated ProKaryotes Phylogeny," I have tried to generate the database files that NanoCLUST needs for its analysis. I used makeblastdb from BLAST+ 2.13.0 to try to obtain files with the following extensions: .ndb, .nhr, .nin, .nnd, .nni, .nog, .nos, .not, .nsq, .ntf, .nto. However, two of them are never produced: .nnd and .nni.
When I run the pipeline, I get the following error:

(Nextflow) cnr-strep@cnrstrep-Precision-3660:~/NanoCLUST$ nextflow run main.nf -profile docker --reads '/media/cnr-strep/ACC22AE1C22AB00E/FastqHAC-Lactobacillus/FastQ_Bichat/Fastq-HAC-16052022/barcode17/trimming/barcode17.filtered.fastq' --db 'db/16S_ribosomal_RNA' --tax 'db/taxdb/'
N E X T F L O W ~ version 22.10.6
Launching main.nf [determined_ampere] DSL1 - revision: 2a51687d92


NanoCLUST v1.0dev

Run Name : determined_ampere
Reads : /media/cnr-strep/ACC22AE1C22AB00E/FastqHAC-Lactobacillus/FastQ_Bichat/Fastq-HAC-16052022/barcode17/trimming/barcode17.filtered.fastq
Max Resources : 128 GB memory, 16 cpus, 10d time per job
Container : docker - [:]
Output dir : ./results
Launch dir : /home/cnr-strep/NanoCLUST
Working dir : /home/cnr-strep/NanoCLUST/work
Script dir : /home/cnr-strep/NanoCLUST
User : cnr-strep
Config Profile : docker

executor > local (23)
[8b/15691c] process > QC (1) [100%] 1 of 1 ✔
[5e/36346c] process > fastqc (1) [100%] 1 of 1 ✔
[3c/3e9715] process > kmer_freqs (1) [100%] 1 of 1 ✔
[26/7c4085] process > read_clustering (1) [100%] 1 of 1 ✔
[3c/d03ee2] process > split_by_cluster (1) [100%] 1 of 1 ✔
[96/2c6b76] process > read_correction (3) [100%] 3 of 3 ✔
[bb/f3035a] process > draft_selection (3) [100%] 3 of 3 ✔
[21/1e894c] process > racon_pass (3) [100%] 3 of 3 ✔
[bf/28314d] process > medaka_pass (3) [100%] 3 of 3 ✔
[90/ad3c87] process > consensus_classification (3) [100%] 3 of 3 ✔
[07/d23aa0] process > join_results (1) [100%] 1 of 1 ✔
[4f/f929af] process > get_abundances (1) [100%] 1 of 1, failed: 1 ✘
[- ] process > plot_abundances -
[fe/e2e60d] process > output_documentation [100%] 1 of 1 ✔
Execution cancelled -- Finishing pending tasks before exit
[nf-core/nanoclust] Pipeline completed with errors
WARN: Graphviz is required to render the execution DAG in the given format -- See http://www.graphviz.org for more info.
Error executing process > 'get_abundances (1)'

Caused by:
Process get_abundances (1) terminated with an error exit status (1)

Command executed [/home/cnr-strep/NanoCLUST/templates/get_abundance.py]:

#!/usr/bin/env python

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import rc
import pandas as pd
from functools import reduce
import requests
import json
#https://unipept.ugent.be/apidocs/taxonomy

def get_taxname(tax_id,tax_level):
    tags = {"S": "species_name","G": "genus_name","F": "family_name","O":'order_name', "C": "class_name"}
    tax_level_tag = tags[tax_level]
    #Avoids pipeline crash due to "nan" classification output. Thanks to Qi-Maria from Github
    if str(tax_id) == "nan":
        tax_id = 1

    path = 'http://api.unipept.ugent.be/api/v1/taxonomy.json?input[]=' + str(int(tax_id)) + '&extra=true&names=true'
    complete_tax = requests.get(path).text

    #Checks for API correct response (field containing the tax name). Thanks to devinbrown from Github
    try:
        name = json.loads(complete_tax)[0][tax_level_tag]
    except:
        name = str(int(tax_id))

    return json.loads(complete_tax)[0][tax_level_tag]

def get_abundance_values(names,paths):
    dfs = []
    for name,path in zip(names,paths):
        data = pd.read_csv(path, index_col=False, sep=';').iloc[:,1:]

        total = sum(data['reads_in_cluster'])
        rel_abundance=[]

        for index,row in data.iterrows():
            rel_abundance.append(row['reads_in_cluster'] / total)

        data['rel_abundance'] = rel_abundance
        dfs.append(pd.DataFrame({'taxid': data['taxid'], 'rel_abundance': rel_abundance}))
        data.to_csv("" + name + "_nanoclust_out.txt")

    return dfs

def merge_abundance(dfs,tax_level):
    df_final = reduce(lambda left,right: pd.merge(left,right,on='taxid',how='outer').fillna(0), dfs)
    df_final["taxid"] = [get_taxname(row["taxid"], tax_level) for index, row in df_final.iterrows()]
    df_final_grp = df_final.groupby(["taxid"], as_index=False).sum()
    return df_final_grp

def get_abundance(names,paths,tax_level):
    if(not isinstance(paths, list)):
        paths = [paths]
        names = [names]

    dfs = get_abundance_values(names,paths)
    df_final_grp = merge_abundance(dfs, tax_level)
    df_final_grp.to_csv("rel_abundance_"+ names[0] + "_" + tax_level + ".csv", index = False)

paths = "barcode17.filtered.nanoclust_out.txt"
names = "barcode17.filtered"

get_abundance(names,paths, "G")
get_abundance(names,paths, "S")
get_abundance(names,paths, "O")
get_abundance(names,paths, "F")

Command exit status:
1

Command output:
(empty)

Command error:
Traceback (most recent call last):
  File ".command.sh", line 65, in <module>
    get_abundance(names,paths, "G")
  File ".command.sh", line 59, in get_abundance
    df_final_grp = merge_abundance(dfs, tax_level)
  File ".command.sh", line 49, in merge_abundance
    df_final["taxid"] = [get_taxname(row["taxid"], tax_level) for index, row in df_final.iterrows()]
  File ".command.sh", line 49, in <listcomp>
    df_final["taxid"] = [get_taxname(row["taxid"], tax_level) for index, row in df_final.iterrows()]
  File ".command.sh", line 28, in get_taxname
    return json.loads(complete_tax)[0][tax_level_tag]
IndexError: list index out of range

Work dir:
/home/cnr-strep/NanoCLUST/work/4f/f929af73009d063bc5793e38804f62

Tip: view the complete command output by changing to the process work dir and entering the command cat .command.out
(Nextflow) cnr-strep@cnrstrep-Precision-3660:~/NanoCLUST$
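
The traceback above ends with `json.loads(complete_tax)[0]` raising `IndexError`, which means the taxonomy API answered with an empty list for at least one taxid. In the quoted script the `try`/`except` does compute a fallback `name`, but the final `return` repeats the unguarded lookup, so the fallback is never used. A minimal sketch of a guarded lookup could look like the following. Note that `extract_taxname` is a hypothetical helper name, not part of NanoCLUST, and it takes the response text as an argument so it can be tried without network access.

```python
import json

# Rank codes and field names copied from the script shown in the log.
TAGS = {"S": "species_name", "G": "genus_name", "F": "family_name",
        "O": "order_name", "C": "class_name"}


def extract_taxname(response_text, tax_id, tax_level):
    """Return the name at tax_level from a taxonomy JSON response.

    Hypothetical helper: falls back to the numeric taxid (as a string)
    when the response is not valid JSON, is an empty list, or lacks
    the requested field.
    """
    # Same "nan" guard as in the original script.
    if str(tax_id) == "nan":
        tax_id = 1
    tag = TAGS[tax_level]
    try:
        # Return the guarded value itself instead of repeating the lookup.
        return json.loads(response_text)[0][tag]
    except (ValueError, IndexError, KeyError, TypeError):
        return str(int(tax_id))
```

Inside `get_taxname`, the last line would then become `return extract_taxname(complete_tax, tax_id, tax_level)`. This only prevents the crash: clusters whose taxid cannot be resolved would appear under the raw taxid, which may in turn indicate that the custom database carries no usable taxids.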

Please, could you guide me on how to generate a database that can be interpreted by NanoCLUST from a FASTA file containing a list of selected 16S sequences?
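
For reference, here is a minimal sketch of one way to build a BLAST nucleotide database from a FASTA file and then check which of the extensions listed above were actually written. It assumes `makeblastdb` is on the `PATH` and that a sequence-ID-to-taxid mapping file is available. The function names and any file names passed to them are hypothetical. Whether NanoCLUST really requires the `.nnd` and `.nni` files is not something this sketch settles.

```python
import subprocess
from pathlib import Path

# Extensions listed in the issue text for the 16S_ribosomal_RNA database.
EXPECTED_EXTS = [".ndb", ".nhr", ".nin", ".nnd", ".nni", ".nog",
                 ".nos", ".not", ".nsq", ".ntf", ".nto"]


def makeblastdb_command(fasta, out_prefix, taxid_map=None):
    """Build the makeblastdb argument list (nothing is executed here)."""
    cmd = ["makeblastdb", "-in", str(fasta), "-dbtype", "nucl",
           "-parse_seqids", "-blastdb_version", "5",
           "-out", str(out_prefix)]
    if taxid_map is not None:
        # -taxid_map requires -parse_seqids, which is already set above.
        cmd += ["-taxid_map", str(taxid_map)]
    return cmd


def build_database(fasta, out_prefix, taxid_map=None):
    """Run makeblastdb; raises CalledProcessError if it fails."""
    subprocess.run(makeblastdb_command(fasta, out_prefix, taxid_map),
                   check=True)


def missing_extensions(out_prefix, expected=EXPECTED_EXTS):
    """Return the expected extensions with no file at out_prefix + ext."""
    return [ext for ext in expected
            if not Path(str(out_prefix) + ext).exists()]
```

A call such as `build_database("custom_16S.fasta", "db/custom_16S", "taxid_map.txt")` followed by `missing_extensions("db/custom_16S")` would report which files are absent. The three file names are placeholders.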

Thank you very much!

Miguel Angel Hernandez

miangher changed the title from "Base de datos db a partir de una base de datos personalizada." to "Create a database (DB) from a custom database." on Nov 3, 2023