Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error calling tRNAscan-SE #290

Closed
asierFernandezP opened this issue Sep 4, 2023 · 19 comments
Closed

Error calling tRNAscan-SE #290

asierFernandezP opened this issue Sep 4, 2023 · 19 comments
Labels
enhancement New feature or request

Comments

@asierFernandezP
Copy link

asierFernandezP commented Sep 4, 2023

phrokka version: v1.4.1
Python version: Python 3.10.12
Operating System: Computing cluster

Description

I have run prodigal-gv on my viral genomes to get the ORFs (it also outputs a genbank formatted CSV file). Using the predicted ORFs as input I want to run Pharroka to annotate them (without running prodigal or Phannotate again).

What I Did

I run the following command:

/home2/p304845/Conda_envs/Pharokka_conda/bin/pharokka.py \
	-i Genbank_formatted_CSV \
	 --genbank \
	-o $output_dir \
	-d /home2/p304845/Conda_envs/Pharokka_conda/databases \
	-t ${SLURM_CPUS_PER_TASK}

And I got the following error:

2023-09-04 17:06:08.757 | INFO     | __main__:main:84 - Starting Pharokka v1.4.1
2023-09-04 17:06:08.759 | INFO     | __main__:main:85 - Command executed: Namespace(infile='/scratch/p304845/Annotation/Pharokka_annotation/viruses_present_output.csv', outdir='/scratch/p304845/Annotation/Pharokka_annotation/Results', database='/home2/p304845/Conda_envs/Pharokka_conda/databases', threads='24', force=False, prefix='Default', locustag='Default', gene_predictor='phanotate', meta=False, split=False, coding_table='11', evalue='1E-05', fast=False, mmseqs2_only=False, meta_hmm=False, dnaapler=False, custom_hmm='', genbank=True, terminase=False, terminase_strand='nothing', terminase_start='nothing', citation=False)
2023-09-04 17:06:08.759 | INFO     | __main__:main:86 - Repository homepage is https://github.com/gbouras13/pharokka
2023-09-04 17:06:08.759 | INFO     | __main__:main:87 - Written by George Bouras: [email protected]
2023-09-04 17:06:08.759 | INFO     | __main__:main:89 - Checking database installation in /home2/p304845/Conda_envs/Pharokka_conda/databases.
2023-09-04 17:06:08.767 | INFO     | __main__:main:92 - All databases have been successfully checked.
2023-09-04 17:06:08.767 | INFO     | __main__:main:108 - Checking dependencies.
2023-09-04 17:06:09.540 | INFO     | input_commands:check_dependencies:322 - Phanotate version found is v1.5.1
2023-09-04 17:06:09.541 | INFO     | input_commands:check_dependencies:331 - Phanotate version is ok.
2023-09-04 17:06:09.695 | INFO     | input_commands:check_dependencies:353 - MMseqs2 version found is v13.45111
2023-09-04 17:06:09.695 | INFO     | input_commands:check_dependencies:362 - MMseqs2 version is ok.
2023-09-04 17:06:10.410 | INFO     | input_commands:check_dependencies:386 - tRNAscan-SE version found is v2.0.12
2023-09-04 17:06:10.411 | INFO     | input_commands:check_dependencies:397 - tRNAscan-SE version is ok.
2023-09-04 17:06:11.627 | INFO     | input_commands:check_dependencies:421 - MinCED version found is v0.4.2
2023-09-04 17:06:11.628 | INFO     | input_commands:check_dependencies:432 - MinCED version is ok.
2023-09-04 17:06:11.675 | INFO     | input_commands:check_dependencies:458 - ARAGORN version found is v1.2.41
2023-09-04 17:06:11.676 | INFO     | input_commands:check_dependencies:469 - ARAGORN version is ok.
2023-09-04 17:06:12.063 | INFO     | input_commands:check_dependencies:485 - mash version found is v2.3
2023-09-04 17:06:12.064 | INFO     | input_commands:check_dependencies:492 - mash version is ok.
2023-09-04 17:06:12.843 | INFO     | input_commands:check_dependencies:511 - Dnaapler version found is v0.3.1
2023-09-04 17:06:12.843 | INFO     | input_commands:check_dependencies:518 - Dnaapler version is ok.
2023-09-04 17:06:12.843 | INFO     | input_commands:check_dependencies:529 - Pyrodigal version is v2.3.0
2023-09-04 17:06:12.843 | INFO     | input_commands:check_dependencies:530 - Pyrodigal version is ok.
2023-09-04 17:06:12.844 | INFO     | __main__:main:113 - You have specified --genbank.
2023-09-04 17:06:12.844 | INFO     | __main__:main:114 - Therefore, /scratch/p304845/Annotation/Pharokka_annotation/viruses_present_output.csv is a genbank file instead of a FASTA file.
2023-09-04 17:06:12.844 | INFO     | __main__:main:117 - Your custom CDS calls in this genbank file will be preserved.
2023-09-04 17:06:14.652 | INFO     | __main__:main:280 - Extracting CDS information from your genbank file.
2023-09-04 17:06:17.836 | INFO     | __main__:main:291 - Starting tRNA-scanSE.
2023-09-04 17:06:17.840 | INFO     | external_tools:run:50 - Started running tRNAscan-SE --thread 24 -G -Q -j /scratch/p304845/Annotation/Pharokka_annotation/Results/trnascan_out.gff /scratch/p304845/Annotation/Pharokka_annotation/Results/genbank.fasta ...
2023-09-04 17:06:18.037 | ERROR    | external_tools:run_tool:94 - Error calling tRNAscan-SE --thread 24 -G -Q -j /scratch/p304845/Annotation/Pharokka_annotation/Results/trnascan_out.gff /scratch/p304845/Annotation/Pharokka_annotation/Results/genbank.fasta (return code 2)
2023-09-04 17:06:18.037 | ERROR    | processes:run_trna_scan:532 - Error: tRNAscan-SE not found

Thanks in advance!

@gbouras13
Copy link
Owner

Hi @asierFernandezP,

Thanks for trying out --genbank mate, glad someone found it useful. The intended function was for manually curated CDS calls, not prodigal-gv, but I should have guessed this would come up too.

First of all I clearly need to change the error messages from '__ not found' to '__ failed' I think so thanks for that.

I have had a crack at replicating this and I think the issue is that the GenBank output format of prodigal-gv isn't fully specified. In other words, it does not include the CDS translations or the FASTA file, so I can't actually get any sequence information from it (first step pharokka does with --genbank is to save all amino acid CDS under translation and to output the input nucleotide contigs reconstructed from the GenBank as a fasta).

Therefore I have 2 options to fix this:

  1. Work with @althonos to get prodigal-gv's models added to pyrodigal Add additional gene models to the metagenome mode althonos/pyrodigal#24. But this seems like a lot of work to do when prodigal-gv already exists, plus it feels like more of a fork-related feature I think than a core pyrodigal feature (it's up to Martin anyway after all).
  2. Implement prodigal-gv support into pharokka as a gene prediction option (-g), with some possible help from @apcamargo.

I am aware that the Genbank format is a nightmare, so I don't wish to spend a lot of time down the GenBank rabbit hole, but I do think prodigal-gv would be a pretty common option for users so I am happy to spend some time to include it (indeed I have some genomad output I would like pharokka to smash through and annotate myself soon).

I will need to change the gff to gbk conversion function of pharokka regardless of what I do to deal with cases where different contigs have different translation codes, so I think 2. would be the most preferable option.

George

@asierFernandezP
Copy link
Author

Hi George,

Thank you so much for your quick answer!! Indeed, I think the 2nd option would be ideal and more convenient. As you mentioned, the problem is that prodigal-gv outputs a Genbank formatted CSV file without the CDS sequences.

Let me know if it would be possible for you to implement it or if any further information/help from my side is needed! And thanks again for the great tool :))

@gbouras13 gbouras13 added the enhancement New feature or request label Sep 5, 2023
@gbouras13
Copy link
Owner

No problems @asierFernandezP!

I will aim to implement this when I get some time hopefully in the next week or 2 - and this will make pharokka v1.5.0.

I'll let you know regarding help - I might get you to do some testing before I release it. But otherwise I think I know what I need to implement to make this work.

George

@althonos
Copy link
Contributor

althonos commented Sep 5, 2023

Work with @althonos to get prodigal-gv's models added to pyrodigal althonos/pyrodigal#24. But this seems like a lot of work to do when prodigal-gv already exists, plus it feels like more of a fork-related feature I think than a core pyrodigal feature (it's up to Martin anyway after all).

I don't think I will integrate it to "stock" Pyrodigal, but i can probably figure out a way to get that into a different dedicated package that has the prodigal-gv models, especially if you'd find it useful right away 👍

@gbouras13
Copy link
Owner

I think it would be very useful quite soon for lots of users in the gut metagenome/giant virus etc space.

Also more generally speaking as sort of mentioned in althonos/pyrodigal#24, not sure what you have in mind but there's some scope to build upon your methods for alternative coding prediction and within-genome stop codon reassignment (see e.g. https://github.com/gatech-genemark/Mgcod ).

Happy to help out and contribute whatever I can to help make this happen (and I'm sure @asierFernandezP and @apcamargo would be keen too by the looks).

George

@althonos
Copy link
Contributor

althonos commented Sep 7, 2023

@gbouras13 : pip install -U --pre pyrodigal-gv if you wanna test 😉

Also more generally speaking as sort of mentioned in althonos/pyrodigal#24, not sure what you have in mind but there's some scope to build upon your methods for alternative coding prediction and within-genome stop codon reassignment (see e.g. https://github.com/gatech-genemark/Mgcod ).

On that note, I think within-genome stop codon reassignment is probably out of scope for Prodigal, because the dynamic programming needs to have the codons from the entire genome to start finding the highest scoring gene path. However, alternative coding prediction could be feasible, similarly to the meta mode, trying to nodes with several different genetic codes before extracting genes for the highest scoring one.

@gbouras13
Copy link
Owner

gbouras13 commented Sep 9, 2023

@althonos I've given it a test run (with the pyrodigal v3 alpha installed from source), it looks phenomenal mate.

The only thing lacking I can think that would be useful (in terms of recapitulating the pyrodigal-gv output and generally) is 'genetic_code' in the description (it's between rbs_spacer and gc_cont).

prodigal-gv v2.11 output


MZ130495.1      Prodigal_v2.11.0-gv     CDS     98636   101128  474.7   +       0       ID=1_100;partial=01;start_type=ATG;rbs_motif=TAA;rbs_spacer=12bp;genetic_code=15;gc_cont=0.361;conf=99.99;score=474.69;cscore=467.02;sscore=7.67;rscore=5.99;uscore=1.68;tscore=0.00;

pyrodigal-gv output


MZ130495.1	pyrodigal_v3.0.0-alpha1	CDS	98636	101128	474.7	+	0	ID=MZ130495.1_100;partial=01;start_type=ATG;rbs_motif=TAA;rbs_spacer=12bp;gc_cont=0.361;conf=99.99;score=474.69;cscore=467.02;sscore=7.67;rscore=5.99;uscore=1.68;tscore=0.00;

the code:

def run_pyrodigal_gv(filepath_in, out_dir, coding_table):

    orf_finder = pyrodigal_gv.ViralGeneFinder(meta=True)

    with open(os.path.join(out_dir, "prodigal_out.gff"), "w") as dst:
        for i, record in enumerate(SeqIO.parse(filepath_in, "fasta")):
            genes = orf_finder.find_genes(str(record.seq))
            genes.write_gff(dst, sequence_id=record.id)

George

@althonos
Copy link
Contributor

@gbouras13 : I have made a new Pyrodigal pre-release that supports that (pip install -U --pre pyrodigal to get 3.0.0-alpha2), this way you get a new include_translation_table keyword argument among other things that outputs the translation table. (I'm using translation_table to remain consistent with Prodigal).

@gbouras13
Copy link
Owner

MZ130495.1	pyrodigal_v3.0.0-alpha2	CDS	98636	101128	474.7	+	0	ID=MZ130495.1_100;partial=01;start_type=ATG;rbs_motif=TAA;rbs_spacer=12bp;gc_cont=0.361;transl_table=15;conf=99.99;score=474.69;cscore=467.02;sscore=7.67;rscore=5.99;uscore=1.68;tscore=0.00;

Beautiful, outputs look great to me.

More than happy to do any testing that you would find useful Martin.

George

@althonos
Copy link
Contributor

You're most likely the main user of this with @apcamargo at the moment so I'll keep on alpha for a bit while you try it around 👍

@apcamargo
Copy link

I'm still testing it out. But no problems so far!

@gbouras13
Copy link
Owner

I've tested it out today on a number of single phages + a 673 contig gut phage test set (from Yutin et al), around 50MB fasta file.

The results look great, I get exactly what is expected and the output is easily parsable for me.

With pyrodigal v3.0.0-alpha2 & pyrodigal v0.1.0a1 implemented in Pharokka v1.5.0 (dev branch):

Running on Intel Core i7-10700K CPU @ 3.80GHz on a machine running Ubuntu 20.04.6 LTS, it took 305 seconds to run the pyrodigal-gv step (I didn't do proper measurements with time, this is just coming from the Pharokka output log).

This compares to 184 seconds for regular pyrodigal on the same dataset.

For single phage contigs runtime is negligible for both (sub 1 second).

Overall, I am happy with the output for my purposes, when you release pyrodigal v3 and pyrodigal_gv as a stable release, I will soon after release Pharokka v1.5.0 with support for pyrodigal-gv.

Thanks @althonos and @apcamargo amazing stuff!

George

@althonos
Copy link
Contributor

Nice to hear! 🙏 I also don't know if you've seen but there's an option to run meta mode only using viral models, in case you're looking for speedup 👍 Maybe something to expose as a flag?

@apcamargo
Copy link

Thanks @gbouras13! I don't remember having a slowdown this dramatic when comparing Prodigal to prodigal-gv. Can you check if execution time decreases when you use the viral_only argument? I wouldn't recommend it for production because it won't cover all phages.

@gbouras13
Copy link
Owner

Hey @apcamargo and @althonos ,

This indeed seems to explain the difference - it took 178 seconds with viral_only=True.

For production I will follow your advice so I will leave it as was.

Either way, it is more than quick enough for Pharokka's purposes, a few minutes to do CDS prediction on nearly a thousand contigs is pretty efficient.

George

@asierFernandezP
Copy link
Author

Thank you guys @althonos @gbouras13 @apcamargo! That's called efficiency! Just tested it on a set of ~2000 metagenomic viral contigs and no problems observed!

@gbouras13
Copy link
Owner

@althonos @apcamargo I'm releasing v1.5.0 today. I've added a citation to the citations section for pyrodigal-gv as this:

Larradle M. and Camargo A., (2023) Pyrodigal-gv: A Pyrodigal extension to predict genes in giant viruses and viruses with alternative genetic code. https://github.com/althonos/pyrodigal-gv.

If you want to change it just let me know.

George

@althonos
Copy link
Contributor

Just spell my name right and we're good 😄

@gbouras13
Copy link
Owner

gbouras13 commented Sep 21, 2023

Easiest commit I have made, sorry about that @althonos but fixed now :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants