petap.pl fails on one sample but not others, too few introns #50

Open
ToriEggers opened this issue Oct 31, 2022 · 3 comments

@ToriEggers

Hi,
I have four nematode genome samples and I'm running BRAKER with GeneMark in epmode to annotate with protein + genome and with RNA + genome, then combining the two with TSEBRA. Three of the genomes process without problems, but on the fourth I keep running into a problem at the gmes_petap.pl step, no matter the size or evolutionary distance of the protein file I use (I've tried many, from sister species to all metazoa). Although the protein + genome run fails on this sample, the RNA + genome run completes. When running in esmode, the protein annotation completes. I ran these samples close together in time, so there were no software updates or changes to my environment between samples.

Any idea as to why this one particular sample won't run like the others?

Error in braker.log:

                      RUNNING GENEMARK-EX

Preparing genemark_evidence file hints from manual hints...
Checking whether file /home/data/jfierst/veggers/DF5033_BRAKER_odb10/genemark_hintsfile.gff contains enough hints and sufficient multiplicity information...

WARNING:
The hints file(s) for GeneMark-EX contain less than 1000 introns. (In total, 6 unique introns are contained.)
Genemark-EX might fail due to the low number of hints.

WARNING:
The hints file(s) for GeneMark-EX contain less than 150 introns with multiplicity >= 4! (In total, 6 unique introns are contained. 0 have a multiplicity >= 4.)
Possibly, you are trying to run braker.pl on data that does not provide sufficient multiplicity information. This will e.g. happen if you try to use introns generated from assembled RNA-Seq transcripts; or if you try to run braker.pl in epmode with mappings from proteins without sufficient hits per locus. Or if you use the example data set.
A low number of intron hints with sufficient multiplicity may result in a crash of GeneMark-EX (it should not crash with the example data set).

Running GeneMark-EP
changing into GeneMark-EP directory /home/data/jfierst/veggers/DF5033_BRAKER_odb10/GeneMark-EP
cd /home/data/jfierst/veggers/DF5033_BRAKER_odb10/GeneMark-EP
Running gmes_petap.pl
perl /home/data/jfierst/veggers/gmes_linux_64/gmes_petap.pl --verbose --seq /home/data/jfierst/veggers/DF5033_BRAKER_odb10/genome.fa --EP /home/data/jfierst/veggers/DF5033_BRAKER_odb10/genemark_hintsfile.gff --cores=8 --gc_donor 0.001 --evidence /home/data/jfierst/veggers/DF5033_BRAKER_odb10/genemark_evidence.gff --soft_mask auto 1>/home/data/jfierst/veggers/DF5033_BRAKER_odb10/GeneMark-EP.stdout 2>/home/data/jfierst/veggers/DF5033_BRAKER_odb10/errors/GeneMark-EP.stderr

The GeneMark-EP.stderr file is empty
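
For anyone hitting the same warnings, the two numbers BRAKER reports (unique introns, and introns with multiplicity >= 4) can be checked by hand before GeneMark-EP is launched. The following is only a minimal sketch, not BRAKER's own check: it assumes the hints file is tab-separated GFF with `intron` in the third column and an optional `mult=N` entry in the ninth column, that a missing `mult` means 1, and that repeated lines for the same intron should have their multiplicities summed. The function name `count_intron_hints` and the example lines are made up for illustration.

```python
from collections import Counter


def count_intron_hints(gff_lines, min_mult=4):
    """Return (unique introns, introns whose multiplicity is >= min_mult).

    Assumptions (not taken from BRAKER's source): tab-separated GFF,
    feature type in column 3, optional "mult=N" in column 9, and a
    missing "mult" entry counts as multiplicity 1.
    """
    mult_by_intron = Counter()
    for line in gff_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.split("\t")
        if len(cols) < 9 or cols[2].lower() != "intron":
            continue  # only intron hints are counted
        # An intron is identified by sequence name, start, end and strand.
        key = (cols[0], int(cols[3]), int(cols[4]), cols[6])
        mult = 1
        for field in cols[8].split(";"):
            name, _, value = field.strip().partition("=")
            if name == "mult" and value.isdigit():
                mult = int(value)
        mult_by_intron[key] += mult
    unique = len(mult_by_intron)
    supported = sum(1 for m in mult_by_intron.values() if m >= min_mult)
    return unique, supported


# Made-up example lines, not taken from the failing sample.
example = [
    "# hypothetical hints",
    "chr1\tProtHint\tintron\t100\t200\t5\t+\t.\tsrc=P;mult=5;pri=4;",
    "chr1\tProtHint\tintron\t300\t400\t1\t+\t.\tsrc=P;pri=4;",
    "chr1\tProtHint\tstart\t50\t52\t1\t+\t.\tsrc=P;pri=4;",
]
print(count_intron_hints(example))  # (2, 1)
```

To run it on a real file, pass an open file handle instead of the list, e.g. `count_intron_hints(open("genemark_hintsfile.gff"))`. A result such as 6 unique introns, as in the log above, would suggest that the protein-to-genome mapping step produced almost no hints, so the problem lies upstream of gmes_petap.pl.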

@tomasbruna
Contributor

Sorry for the late reply. Is this still an issue or were you able to find a solution? Judging from these error messages, one problem could be that this genome is too small.

@vkeggers

I ran it in es mode for protein + genome, then paired that output with the RNA + genome run and ran both through TSEBRA. I don't know if this is necessarily 'correct' or best practice, but I got an output. ~14,000 genes were reported, compared with ~19,000 for the other species. The genome is ~74 Mb. Is this too small?

I was going to try BRAKER3, which was released recently, to see if that changed anything, but I haven't had the time.

Ultimately I have data but I still don't know why it isn't working when given a protein file.

@tomasbruna
Contributor

14,000 could be a bit low considering C. elegans (~100 Mbp) has ~20,000 genes in the annotation.
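
As a rough way to put the numbers quoted in this thread side by side, gene counts can be normalized by genome size. This is only back-of-the-envelope arithmetic on the approximate figures above (~14,000 genes in ~74 Mb, ~20,000 genes in ~100 Mbp); it says nothing about whether either annotation is complete, and the helper name `genes_per_mb` is made up for illustration.

```python
def genes_per_mb(gene_count, genome_size_mb):
    """Gene density in genes per megabase."""
    return gene_count / genome_size_mb


# Approximate figures quoted in this thread.
this_sample = genes_per_mb(14_000, 74)
c_elegans = genes_per_mb(20_000, 100)

print(f"this sample: {this_sample:.0f} genes/Mb")  # this sample: 189 genes/Mb
print(f"C. elegans:  {c_elegans:.0f} genes/Mb")    # C. elegans:  200 genes/Mb
```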

> Ultimately I have data but I still don't know why it isn't working when given a protein file.

Apart from BRAKER3, you can also try a new protein-based pipeline, GALBA (preprint available here). It employs miniprot to align the reference proteins and uses the alignments directly to train AUGUSTUS, so it can be helpful in cases when GeneMark-EP fails for whatever reason. I'd recommend extracting nematode proteins from the new OrthoDB v11 release and supplementing the protein set with additional nematodes from RefSeq to get better protein coverage.
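
Combining the two protein sources usually means concatenating FASTA files, and it can help to drop exact duplicate sequences along the way. The following is a minimal sketch of that step, assuming both sets have already been downloaded as plain protein FASTA; it is not part of BRAKER or GALBA, and the function names and example records are made up for illustration.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def merge_protein_sets(*sources):
    """Merge FASTA sources, keeping the first record of each exact sequence.

    Sequences are upper-cased and a trailing stop symbol (*) is removed
    before comparison, so those differences do not create duplicates.
    """
    seen = set()
    merged = []
    for source in sources:
        for header, seq in read_fasta(source):
            seq = seq.upper().rstrip("*")
            if seq and seq not in seen:
                seen.add(seq)
                merged.append((header, seq))
    return merged


# Made-up records standing in for the two downloads.
orthodb = [">a", "MKT", "LL", ">b", "MAA"]
refseq = [">c", "mktll*", ">d", "MGG"]

merged = merge_protein_sets(orthodb, refseq)
for header, seq in merged:
    print(f">{header}\n{seq}")
```

With real data, open file handles can be passed in place of the lists, and the printed records redirected to a single FASTA file that is then given to the pipeline as the protein input.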
