The hints file(s) for GeneMark-EX contain less than 150 introns with multiplicity >= 4! #60

Open
yzliu01 opened this issue Jun 23, 2024 · 2 comments

Comments

@yzliu01

yzliu01 commented Jun 23, 2024

Hi @tomasbruna,

I ran BRAKER to predict gene structures and hit the problem below at the GeneMark-EX step.
I used the reference genome and a custom amino acid sequence database with the following command. It worked well for all other species in the same genus, but it fails for the reference genome of one species. I am NOT using RNA-Seq data.
braker.pl --genome="$genome" --prot_seq="$Apodiea_gene_AA"

braker.log


#**********************************************************************************
#                              RUNNING GENEMARK-EX                                 
#**********************************************************************************
# Sat Jun 22 22:59:29 2024: Preparing genemark_evidence file hints from manual hints...
# Sat Jun 22 22:59:29 2024: Checking whether file /output/proj/data/bee_proj_data/gene_annotation/braker_BomMus_results_for_protseq/genemark_hintsfile.gff contains enough hints and sufficient multiplicity information...
#*********
# WARNING:
# The hints file(s) for GeneMark-EX contain less than 150 introns with multiplicity >= 4! (In total, 2658 unique introns are contained. 16 have a multiplicity >= 4.)
# Possibly, you are trying to run braker.pl on data that does not provide sufficient multiplicity information. This will e.g. happen if you try to use introns generated from assembled RNA-Seq transcripts; or if you try to run braker.pl in epmode with mappings from proteins without sufficient hits per locus. Or if you use the example data set.
# A low number of intron hints with sufficient multiplicity may result in a crash of GeneMark-EX (it should not crash with the example data set).
#*********
# Sat Jun 22 22:59:29 2024: Running GeneMark-EP
# Sat Jun 22 22:59:29 2024: changing into GeneMark-EP directory /output/proj/data/bee_proj_data/gene_annotation/braker_BomMus_results_for_protseq/GeneMark-EP
cd /output/proj/data/bee_proj_data/gene_annotation/braker_BomMus_results_for_protseq/GeneMark-EP
# Sat Jun 22 22:59:29 2024: Running gmes_petap.pl
/home/user/miniforge3/envs/braker3/bin/perl /home/user/proj/sofwtare/gmetp_linux_64/bin/gmes/gmes_petap.pl --verbose 
--seq /output/proj/data/bee_proj_data/gene_annotation/braker_BomMus_results_for_protseq/genome.fa 
--EP /output/proj/data/bee_proj_data/gene_annotation/braker_BomMus_results_for_protseq/genemark_hintsfile.gff 
--cores=8  --gc_donor 0.001 --evidence /output/proj/data/bee_proj_data/gene_annotation/braker_BomMus_results_for_protseq/genemark_evidence.gff  
--soft_mask auto 1>/output/proj/data/bee_proj_data/gene_annotation/braker_BomMus_results_for_protseq/GeneMark-EP.stdout 
2>/output/proj/data/bee_proj_data/gene_annotation/braker_BomMus_results_for_protseq/errors/GeneMark-EP.stderr

Output error file:
tail -30 gene_annotation_AndBic.40649391.e
The number of pairs aligned (8804/8804 (100%) pairs aligned) is much smaller than for the other reference genomes (448747/448747 (100%) pairs aligned). It seems that this reference genome is very distant from the protein database. Can you give any hints on how to address this issue?


[Sat Jun 22 22:58:48 2024] Enqueueing pair 8796/8804 (99.9%). Est. time left: 00:00:01 (hh:mm:ss)
[Sat Jun 22 22:59:27 2024] 8804/8804 (100%) pairs aligned
[Sat Jun 22 22:59:27 2024] Alignment of pairs finished
[Sat Jun 22 22:59:27 2024] Translating coordinates from local pair level to contig level
[Sat Jun 22 22:59:27 2024] Finished spliced alignment
[Sat Jun 22 22:59:27 2024] Flagging top chains
[Sat Jun 22 22:59:28 2024] Processing the output
[Sat Jun 22 22:59:29 2024] Output processed
[Sat Jun 22 22:59:29 2024] ProtHint finished.
ERROR in file /home/user/miniforge3/envs/braker3/bin/braker.pl at line 5414
Failed to execute: /home/user/miniforge3/envs/braker3/bin/perl /home/user/proj/sofwtare/gmetp_linux_64/bin/gmes/gmes_petap.pl --verbos ...
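
For anyone who wants to reproduce the counts quoted in the warning above (unique introns, and introns with multiplicity >= 4), here is a minimal Python sketch. It is not BRAKER's own check. It assumes the hints file is tab-separated GFF with `intron` in column 3 and an optional `mult=N` entry in the attribute column, that a missing `mult` means 1, and that repeated lines for the same intron should be summed. The function name `intron_multiplicity_summary` is hypothetical.

```python
import re
import sys
from collections import defaultdict


def intron_multiplicity_summary(gff_lines, min_mult=4):
    """Return (unique_introns, introns_with_multiplicity >= min_mult).

    Assumptions (not taken from braker.pl): column 3 holds the feature
    type, column 9 may hold "mult=N", and a missing "mult" counts as 1.
    """
    mult_by_intron = defaultdict(int)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2].lower() != "intron":
            continue  # only intron hints are counted
        # An intron is identified by sequence, start, end and strand.
        key = (cols[0], cols[3], cols[4], cols[6])
        match = re.search(r"(?:^|;)\s*mult=(\d+)", cols[8])
        mult_by_intron[key] += int(match.group(1)) if match else 1
    unique = len(mult_by_intron)
    supported = sum(1 for m in mult_by_intron.values() if m >= min_mult)
    return unique, supported


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python count_hints.py genemark_hintsfile.gff
    with open(sys.argv[1]) as handle:
        total, supported = intron_multiplicity_summary(handle)
    print(f"{total} unique introns, {supported} with multiplicity >= 4")
```

Running this on the `genemark_hintsfile.gff` of a working species and of the failing one should show whether the difference already exists in the hints themselves, under the assumptions stated above.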


@tomasbruna
Contributor

It's unlikely that the reference proteins would be close enough for some members of the genus but not for others.

This looks like some technical issue, possibly with the assembly of that one genome. You can send me the assembly of one of the genomes where the algorithm works well, the assembly of the problematic one, and the protein database. I'll take a look (please share by email, [email protected], if you don't want your data to appear here).

@yzliu01
Author

yzliu01 commented Jun 27, 2024

OK. The data are a bit too big to attach here, so I have just sent them to you by email. Please take a look. Much appreciated!
