singlem issue #159

Open
francesco-ricci opened this issue Jan 12, 2025 · 15 comments
Comments

@francesco-ricci

Hello

I am getting the following error when binchicken runs singlem. I think it is because /home/fricci/rp24/fra/database_files/S3.2.1.GTDB_r214.metapackage_20231006.smpkg.zb is being treated by smafa as a directory rather than as a single file. Do you have any suggestions on how to fix it?

This is the error I get:

01/10/2025 04:33:04 PM INFO: SingleM v0.18.1
01/10/2025 04:33:04 PM INFO: Acquiring SingleM metapackage from Zenodo backpack directory specified ..
01/10/2025 04:33:04 PM INFO: Retrieval successful. Location of backpack is: /home/fricci/rp24/fra/database_files/S3.2.1.GTDB_r214.metapackage_20231006.smpkg.zb
01/10/2025 04:33:04 PM INFO: Loaded 59 SingleM packages
01/10/2025 04:33:04 PM INFO: Using as input 1 different pairs of sequence files e.g. /home/fricci/rp24/fra/analyses/Tess/raw_reads/DML-31_L3.1.fq.gz & /home/fricci/r>
01/10/2025 04:33:04 PM INFO: Filtering sequence files through DIAMOND blastx
01/10/2025 05:38:04 PM INFO: Finished DIAMOND prefilter phase
01/10/2025 05:38:04 PM INFO: Assigning sequences to SingleM packages with DIAMOND ..
01/10/2025 05:42:25 PM INFO: Running taxonomic assignment ..
01/10/2025 05:42:25 PM INFO: Assigning taxonomy by singlem query ..
01/10/2025 05:42:25 PM INFO: Querying against species database with 430 sequences, using method smafa-naive and max divergence 2
01/10/2025 05:42:25 PM INFO: Searching with SMAFA NAIVE by nucleotide sequence ..
01/10/2025 05:42:25 PM INFO: Querying index for S3.1.ribosomal_protein_L2_rplB
Traceback (most recent call last):
File "/fs04/rp24/fra/analyses/Tess/output/binning/binchicken/path_to_conda_envs/8af436ae67d59fa6a12aa49a286a3324_/bin/singlem", line 709, in <module>
singlem.pipe.SearchPipe().run(
File "/fs04/rp24/fra/analyses/Tess/output/binning/binchicken/path_to_conda_envs/8af436ae67d59fa6a12aa49a286a3324_/lib/python3.12/site-packages/singlem/pipe.py", li>
otu_table_object = self.run_to_otu_table(**kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/fs04/rp24/fra/analyses/Tess/output/binning/binchicken/path_to_conda_envs/8af436ae67d59fa6a12aa49a286a3324_/lib/python3.12/site-packages/singlem/pipe.py", li>
otu_table_object = self.assign_taxonomy_and_process(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/fs04/rp24/fra/analyses/Tess/output/binning/binchicken/path_to_conda_envs/8af436ae67d59fa6a12aa49a286a3324_/lib/python3.12/site-packages/singlem/pipe.py", li>
assignment_result = self.assign_taxonomy(
^^^^^^^^^^^^^^^^^^^^^^
File "/fs04/rp24/fra/analyses/Tess/output/binning/binchicken/path_to_conda_envs/8af436ae67d59fa6a12aa49a286a3324_/lib/python3.12/site-packages/singlem/pipe.py", li>
query_based_assignment_result = PipeTaxonomyAssignerByQuery().assign_taxonomy(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/fs04/rp24/fra/analyses/Tess/output/binning/binchicken/path_to_conda_envs/8af436ae67d59fa6a12aa49a286a3324_/lib/python3.12/site-packages/singlem/pipe_taxonom>
query_single_set(queries[0], 0)
File "/fs04/rp24/fra/analyses/Tess/output/binning/binchicken/path_to_conda_envs/8af436ae67d59fa6a12aa49a286a3324_/lib/python3.12/site-packages/singlem/pipe_taxonom>
for hit in querier.query_with_queries(queries, sdb, max_species_divergence, method, SequenceDatabase.NUCLEOTIDE_TYPE, 1, None, False, None):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/fs04/rp24/fra/analyses/Tess/output/binning/binchicken/path_to_conda_envs/8af436ae67d59fa6a12aa49a286a3324_/lib/python3.12/site-packages/singlem/querier.py",>
smafa_stdout = extern.run(smafa_cmd, stdin='\n'.join([">{}\n{}".format(i, q.sequence) for i, q in enumerate(chunked_queries)]))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/fs04/rp24/fra/analyses/Tess/output/binning/binchicken/path_to_conda_envs/8af436ae67d59fa6a12aa49a286a3324_/lib/python3.12/site-packages/extern/__init__.py",>
raise ExternCalledProcessError(process, command)
extern.ExternCalledProcessError: Command smafa query --database '/home/fricci/rp24/fra/database_files/S3.2.1.GTDB_r214.metapackage_20231006.smpkg.zb/payload_director>
STDERR was: b'[2025-01-10T06:42:25Z INFO bird_tool_utils::clap_utils] Smafa version 0.8.0\n[2025-01-10T06:42:25Z INFO smafa] Decoding db file "/home/fricci/rp24/fr>

@AroneyS
Owner

AroneyS commented Jan 12, 2025

Hi Francesco,

In the future, can you put the error in a code block (e.g. https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet#code)? It's a bit hard to read as is. I've done this below. The output looks truncated. Is there anything after Decoding db file "/home/fricci/rp24/fr>?

Note that S3.2.1.GTDB_r214.metapackage_20231006 is a bit old. The latest version is S4.3.0.GTDB_r220.metapackage_20240523 (see https://zenodo.org/records/11323477).

01/10/2025 04:33:04 PM INFO: SingleM v0.18.1
01/10/2025 04:33:04 PM INFO: Acquiring SingleM metapackage from Zenodo backpack directory specified ..
01/10/2025 04:33:04 PM INFO: Retrieval successful. Location of backpack is: /home/fricci/rp24/fra/database_files/S3.2.1.GTDB_r214.metapackage_20231006.smpkg.zb
01/10/2025 04:33:04 PM INFO: Loaded 59 SingleM packages
01/10/2025 04:33:04 PM INFO: Using as input 1 different pairs of sequence files e.g. /home/fricci/rp24/fra/analyses/Tess/raw_reads/DML-31_L3.1.fq.gz & /home/fricci/r>
01/10/2025 04:33:04 PM INFO: Filtering sequence files through DIAMOND blastx
01/10/2025 05:38:04 PM INFO: Finished DIAMOND prefilter phase
01/10/2025 05:38:04 PM INFO: Assigning sequences to SingleM packages with DIAMOND ..
01/10/2025 05:42:25 PM INFO: Running taxonomic assignment ..
01/10/2025 05:42:25 PM INFO: Assigning taxonomy by singlem query ..
01/10/2025 05:42:25 PM INFO: Querying against species database with 430 sequences, using method smafa-naive and max divergence 2
01/10/2025 05:42:25 PM INFO: Searching with SMAFA NAIVE by nucleotide sequence ..
01/10/2025 05:42:25 PM INFO: Querying index for S3.1.ribosomal_protein_L2_rplB
Traceback (most recent call last):
File "/fs04/rp24/fra/analyses/Tess/output/binning/binchicken/path_to_conda_envs/8af436ae67d59fa6a12aa49a286a3324_/bin/singlem", line 709, in <module>
singlem.pipe.SearchPipe().run(
File "/fs04/rp24/fra/analyses/Tess/output/binning/binchicken/path_to_conda_envs/8af436ae67d59fa6a12aa49a286a3324_/lib/python3.12/site-packages/singlem/pipe.py", li>
otu_table_object = self.run_to_otu_table(**kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/fs04/rp24/fra/analyses/Tess/output/binning/binchicken/path_to_conda_envs/8af436ae67d59fa6a12aa49a286a3324_/lib/python3.12/site-packages/singlem/pipe.py", li>
otu_table_object = self.assign_taxonomy_and_process(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/fs04/rp24/fra/analyses/Tess/output/binning/binchicken/path_to_conda_envs/8af436ae67d59fa6a12aa49a286a3324_/lib/python3.12/site-packages/singlem/pipe.py", li>
assignment_result = self.assign_taxonomy(
^^^^^^^^^^^^^^^^^^^^^^
File "/fs04/rp24/fra/analyses/Tess/output/binning/binchicken/path_to_conda_envs/8af436ae67d59fa6a12aa49a286a3324_/lib/python3.12/site-packages/singlem/pipe.py", li>
query_based_assignment_result = PipeTaxonomyAssignerByQuery().assign_taxonomy(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/fs04/rp24/fra/analyses/Tess/output/binning/binchicken/path_to_conda_envs/8af436ae67d59fa6a12aa49a286a3324_/lib/python3.12/site-packages/singlem/pipe_taxonom>
query_single_set(queries[0], 0)
File "/fs04/rp24/fra/analyses/Tess/output/binning/binchicken/path_to_conda_envs/8af436ae67d59fa6a12aa49a286a3324_/lib/python3.12/site-packages/singlem/pipe_taxonom>
for hit in querier.query_with_queries(queries, sdb, max_species_divergence, method, SequenceDatabase.NUCLEOTIDE_TYPE, 1, None, False, None):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/fs04/rp24/fra/analyses/Tess/output/binning/binchicken/path_to_conda_envs/8af436ae67d59fa6a12aa49a286a3324_/lib/python3.12/site-packages/singlem/querier.py",>
smafa_stdout = extern.run(smafa_cmd, stdin='\n'.join([">{}\n{}".format(i, q.sequence) for i, q in enumerate(chunked_queries)]))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/fs04/rp24/fra/analyses/Tess/output/binning/binchicken/path_to_conda_envs/8af436ae67d59fa6a12aa49a286a3324_/lib/python3.12/site-packages/extern/__init__.py",>
raise ExternCalledProcessError(process, command)
extern.ExternCalledProcessError: Command smafa query --database '/home/fricci/rp24/fra/database_files/S3.2.1.GTDB_r214.metapackage_20231006.smpkg.zb/payload_director>
STDERR was: b'[2025-01-10T06:42:25Z INFO bird_tool_utils::clap_utils] Smafa version 0.8.0\n[2025-01-10T06:42:25Z INFO smafa] Decoding db file "/home/fricci/rp24/fr>

@francesco-ricci
Author

francesco-ricci commented Jan 12, 2025

Damn, sorry Sam. Unfortunately I don't have that file anymore; it's been overwritten. How can I change the database that singlem sources from within binchicken?

@AroneyS
Owner

AroneyS commented Jan 12, 2025

No problem. You can change it with: conda env config vars set SINGLEM_METAPACKAGE_PATH="/metapackage/dir" within the binchicken conda env.
Other env variables can be changed similarly (https://aroneys.github.io/binchicken/setup).

Otherwise, rerunning binchicken should give the same error message?

@francesco-ricci
Author

Thanks! Yes, I am rerunning binchicken atm, but I'll probably stop it and install the new singlem database. When I was trying to troubleshoot the issue before, I thought it could have been because the path to the singlem db is SINGLEM_METAPACKAGE_PATH='/home/fricci/rp24/fra/database_files/S3.2.1.GTDB_r214.metapackage_20231006.smpkg.zb'. Do you think this could be the issue? Should I point the path to the subfolders within the db?

@AroneyS
Owner

AroneyS commented Jan 12, 2025

Pointing to the .zb should work fine.
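
To double-check the database location before rerunning, a minimal Python sketch along these lines may help. It only assumes what the log above shows: that SINGLEM_METAPACKAGE_PATH holds the path to the .smpkg.zb, and that the .zb is a directory (the failing smafa command refers to a path inside it). The function name check_metapackage_path is hypothetical and not part of singlem or binchicken.

```python
import os
from pathlib import Path


def check_metapackage_path(env=None):
    """Return the path stored in SINGLEM_METAPACKAGE_PATH, or raise ValueError.

    Hypothetical helper for illustration; not part of singlem or binchicken.
    """
    env = os.environ if env is None else env
    value = env.get("SINGLEM_METAPACKAGE_PATH")
    if not value:
        raise ValueError("SINGLEM_METAPACKAGE_PATH is not set")
    path = Path(value)
    # The traceback refers to a path inside the .zb (".smpkg.zb/payload_director..."),
    # so the .zb is expected to be a directory rather than a single file.
    if not path.is_dir():
        raise ValueError(f"{path} is not an existing directory")
    return path


if __name__ == "__main__":
    try:
        print("Metapackage found at", check_metapackage_path())
    except ValueError as err:
        print("Problem:", err)
```

Run it inside the binchicken conda env so that it sees the same environment variable that singlem does.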

@wwood
Collaborator

wwood commented Jan 13, 2025

Smafa v0.8.0, which is what you were running, requires a singlem S4.x database; it won't work with S3.x ones. That is likely the source of the issue, so I suggest updating. Not sure if this is a general thing, @AroneyS?
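
As a rough illustration of that check, the sketch below reads the version prefix from a metapackage directory name such as S3.2.1.GTDB_r214.metapackage_20231006.smpkg.zb and flags anything older than S4. The function names and the required_major default are assumptions made for this example, based only on the comment above; neither singlem nor smafa is known to expose such a helper.

```python
import re
from pathlib import Path

# Matches the leading "S<major>.<minor>.<patch>." of names such as
# "S3.2.1.GTDB_r214.metapackage_20231006.smpkg.zb".
_VERSION_RE = re.compile(r"^S(\d+)\.(\d+)\.(\d+)\.")


def metapackage_version(path):
    """Return (major, minor, patch) parsed from a metapackage directory name."""
    name = Path(path).name
    match = _VERSION_RE.match(name)
    if match is None:
        raise ValueError(f"cannot parse a version from {name!r}")
    return tuple(int(part) for part in match.groups())


def needs_newer_metapackage(path, required_major=4):
    """True if the metapackage is older than the S4.x series mentioned above.

    required_major=4 is an assumption taken from the comment about Smafa v0.8.0.
    """
    return metapackage_version(path)[0] < required_major
```

With the two names mentioned in this thread, the S3.2.1 metapackage is flagged as too old and the S4.3.0 one is not.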

@francesco-ricci
Author

Thanks Ben and Sam, I'll try what you guys recommended and let you know how it went!

@francesco-ricci
Author

If the pipeline encounters an error, is it possible to restart it from where it stopped?

@AroneyS
Owner

AroneyS commented Jan 13, 2025

Just rerun the command as is. Should be fine, depending on the error and what you did to fix it.

@francesco-ricci
Author

Hello

Since I tried to fix the singlem issue, I am facing a new issue earlier in the pipeline, specifically:

Error in rule genome_transcripts:
jobid: 17
input: /home/fricci/rp24/fra/analyses/Tess/output/binning/binchicken/genomes.txt
output: /fs04/rp24/fra/analyses/Tess/output/binning/binchicken/coassemble/transcripts/genomes_protein.fna
log: /fs04/rp24/fra/analyses/Tess/output/binning/binchicken/coassemble/logs/transcripts/genomes_protein.log (check log file(s) for error details)
conda-env: /fs04/rp24/fra/analyses/Tess/output/binning/binchicken/path_to_conda_envs/811af9f0f5ab523b330dfd7114fe655c_
shell:
prodigal -i /home/fricci/rp24/fra/analyses/Tess/output/binning/binchicken/genomes.txt -d /fs04/rp24/fra/analyses/Tess/output/binning/binchicken/coassemble/transcripts/genomes_protein.fna &> /fs04/rp24/fra/analyses/Tess/output/binning/binchicken/coassemble/logs/transcripts/genomes_protein.log
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

When I check genomes_protein.log, this is the output:


PRODIGAL v2.6.3 [February, 2016]
Univ of Tenn / Oak Ridge National Lab
Doug Hyatt, Loren Hauser, et al.

Request: Single Genome, Phase: Training
Reading in the sequence(s) to train...

Sequence read failed (file must be Fasta, Genbank, or EMBL format).

These are the contents of genomes.txt, and the files listed are 100% fasta files:
/home/fricci/rp24/fra/analyses/Tess/output/binning/metawrap_bins_FINAL/metawrap_50_10_bins/bin.459.fa
/home/fricci/rp24/fra/analyses/Tess/output/binning/metawrap_bins_FINAL/metawrap_50_10_bins/bin.806.fa
/home/fricci/rp24/fra/analyses/Tess/output/binning/metawrap_bins_FINAL/metawrap_50_10_bins/bin.467.fa
/home/fricci/rp24/fra/analyses/Tess/output/binning/metawrap_bins_FINAL/metawrap_50_10_bins/bin.160.fa
/home/fricci/rp24/fra/analyses/Tess/output/binning/metawrap_bins_FINAL/metawrap_50_10_bins/bin.756.fa
/home/fricci/rp24/fra/analyses/Tess/output/binning/metawrap_bins_FINAL/metawrap_50_10_bins/bin.443.fa
/home/fricci/rp24/fra/analyses/Tess/output/binning/metawrap_bins_FINAL/metawrap_50_10_bins/bin.84.fa
/home/fricci/rp24/fra/analyses/Tess/output/binning/metawrap_bins_FINAL/metawrap_50_10_bins/bin.703.fa
/home/fricci/rp24/fra/analyses/Tess/output/binning/metawrap_bins_FINAL/metawrap_50_10_bins/bin.487.fa
/home/fricci/rp24/fra/analyses/Tess/output/binning/metawrap_bins_FINAL/metawrap_50_10_bins/bin.342.fa

e.g. head /home/fricci/rp24/fra/analyses/Tess/output/binning/metawrap_bins_FINAL/metawrap_50_10_bins/bin.459.fa:

>S1Ck127_9289356_flag_1_multi_3.0000_len_2618
CTCCTCGGCCCGGCGCTCGTCGTCGTCGCCGACCTCGCCGTTCACGACGAGCTCCTTCAG
GTTGTGCATCACGTCGCGGCGGACGTTGCGGATCGCGATCCGCCCCTCCTCGGCGACGGA
GCGGGCGACCTTCCCGTACTCCCTGCGGCGCTCCTCGGTGAGCTGCGGGATCGGCAGCCG
GATCACCTTGCCGTCGTTCGACGGGTTCAGGCCGAGATCGGACTCGGTCACGGCCTTCTC
GATCGCGCGGATCTGCGTCGGGTCATACGGCTGCACGGTCAGCAGACGCGCCTCGCTCGC
GCTGATCGTCGCCATCTGGTTGAGCGGCGTCGCCGAGCCGTAGTAGTCGATCTGGATCCG
GTCGAGCAGCGACGCCGAGGCGCGGCCCGTGCGGACGCTGTTGAACTCGCTGCGGGTCTG
CTCGACGGACTTGTCCATCCGCCGCCCCGCGTCCTGCAAGAGCTCGTCGATCGAGGCCAT
CTACCGCCCTCCTCCGGTCGAGATGATCGTGCCGACCCGCTCGCCCGAGACGACGCGGCG

Do you guys have any clue what's going on here?

@AroneyS
Owner

AroneyS commented Jan 14, 2025

What command are you using? It looks like you are providing a file listing genomes to --genomes, which takes space-separated genome paths. Instead you can use --genomes-list.
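
The Prodigal message above is consistent with this: the shell line passes genomes.txt itself to prodigal -i, and that file contains paths rather than sequences. A small Python sketch for telling the two kinds of file apart is shown below. The function name looks_like_fasta is hypothetical, and the check is deliberately crude: it only looks at the first non-blank line and does not handle gzipped input.

```python
def looks_like_fasta(path):
    """Return True if the first non-blank line of the file starts with '>'.

    Hypothetical helper for illustration. A plain-text list of genome paths
    (such as genomes.txt in this thread) returns False.
    """
    with open(path, "r", errors="replace") as handle:
        for line in handle:
            if line.strip():
                return line.startswith(">")
    # Empty file, or only blank lines.
    return False
```

A file for which this returns False but whose lines are paths to FASTA files is a candidate for --genomes-list rather than --genomes.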

@francesco-ricci
Author

This is my command:
binchicken coassemble --forward-list "$OUTPUT/clean_forward_reads.txt" \
--reverse-list "$OUTPUT/clean_reverse_reads.txt" \
--genomes "$OUTPUT/genomes.txt" \
--cores 128 --output "$OUTPUT" --run-aviary

I did not change it from before tho, and that step worked fine previously. So you are saying to modify it to:
binchicken coassemble --forward-list "$OUTPUT/clean_forward_reads.txt" \
--reverse-list "$OUTPUT/clean_reverse_reads.txt" \
--genomes-list "$OUTPUT/genomes.txt" \
--cores 128 --output "$OUTPUT" --run-aviary

right?

@francesco-ricci
Author

Thanks Sam, binchicken seems to work fine so far. I'll keep you posted!

@francesco-ricci
Author

Hi Sam

I had to restart binchicken a few times because of the node's 3-day walltime limit. Since the last restart, I keep getting the following error:

[Fri Jan 24 08:04:02 2025]
rule checkm2:
input: bins/final_bins, bins/checkm.out
output: bins/checkm2_output, bins/checkm2_output/quality_report.tsv
log: logs/checkm2.log
jobid: 22
benchmark: benchmarks/checkm2.benchmark.txt
reason: Missing output files: bins/checkm2_output/quality_report.tsv
threads: 32
resources: tmpdir=/tmp, mem_mb=131072, mem_mib=125000, runtime=480, gpus=0

Activating conda environment: path_to_conda_envs/61ed490c404ac70f052761cb9a62d3f6_
[Fri Jan 24 08:04:06 2025]
Error in rule checkm2:
jobid: 22
input: bins/final_bins, bins/checkm.out
output: bins/checkm2_output, bins/checkm2_output/quality_report.tsv
log: logs/checkm2.log (check log file(s) for error details)
conda-env: /fs04/rp24/fra/analyses/Tess/output/binning/binchicken/coassemble/coassemble/coassembly_1/recover/path_to_conda_envs/61ed490c404ac70f052761cb9a62d3f6_
shell:
export CHECKM2DB=/home/fricci/rp24/fra/database_files/checkm2/uniref100.KO.1.dmnd; echo "Using CheckM2 database $CHECKM2DB" > logs/checkm2.log; checkm2 predic>
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Removing output files of failed job checkm2 since they might be corrupted:
bins/checkm2_output
Trying to restart job 22.
Select jobs to execute...

Unfortunately, I can't locate the input, output and log folders reported at this step. I think that's where the problem is coming from. Do you have any advice?

Thanks
Francesco

@AroneyS
Owner

AroneyS commented Jan 28, 2025

What is in coassemble/coassemble? Where are you getting this output? Looks like an Aviary error, so should be in the Aviary logs? The CheckM2 log should be in something like coassemble/coassemble/coassembly_0/recover/logs/checkm2.log
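
One way to look for those logs is sketched below in Python. It assumes the layout suggested above (coassemble/coassemble/<coassembly>/recover/logs/checkm2.log under the binchicken output directory). The function names find_checkm2_logs and tail are made up for this example, and the real directory names may differ.

```python
from pathlib import Path


def find_checkm2_logs(output_dir):
    """List checkm2.log files under <output_dir>/coassemble/coassemble/*/recover/logs/.

    The layout is an assumption based on the path suggested in this thread.
    """
    root = Path(output_dir) / "coassemble" / "coassemble"
    return sorted(root.glob("*/recover/logs/checkm2.log"))


def tail(path, n=20):
    """Return the last n lines of a text file as a list of strings."""
    return Path(path).read_text(errors="replace").splitlines()[-n:]


if __name__ == "__main__":
    # "." stands in for the binchicken output directory.
    for log in find_checkm2_logs("."):
        print(f"== {log}")
        print("\n".join(tail(log)))
```

If nothing is printed, the paths in the Snakemake message (bins/, logs/) are probably relative to a different recover directory than expected, which is worth checking first.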
