No stop codons in spaln.gff #13
Hello Christian, as you correctly found, the real problem here is that there are almost no hints in the My suspicion is that something went wrong with GeneMark-ES, the step of ProtHint that generates gene seeds for alignment. Let's check that first. Can you check how many genes are in the file Can you also share the full ProtHint log? Thanks!
Hi Tomas, Thanks for taking the time to help. In genemark.gtf there are 5696 unique genes. This seems quite low, right? Some basic stats about the genemark.gtf file: an average of ~15 exons per gene, a mean exon length of 117 bp, and a mean intron length of 103 bp. For your information, here are the first two genes of genemark.gtf: Besides the first line (a single stop codon following an intron), I don't see any outstanding issues.
Regarding the ProtHint log, are you referring to the stdout of the ProtHint command? I'm afraid I didn't redirect it to a file. If you need it, I can rerun ProtHint to generate it. Many thanks for your help!
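The gene and exon statistics quoted above can be recomputed with a short script. The following is a minimal sketch, not the method used in the thread: the function name `gtf_gene_stats` is hypothetical, and it assumes that each record carries a GTF-style `gene_id "..."` attribute in column 9 and that coding exons are labeled `CDS` in column 3.

```python
import re
from collections import defaultdict

# Assumption: each record carries a GTF-style attribute `gene_id "..."` in column 9.
GENE_ID = re.compile(r'gene_id "([^"]+)"')


def gtf_gene_stats(lines, feature="CDS"):
    """Return (gene count, mean features per gene, mean feature length).

    `feature` is the column-3 type counted as an exon; "CDS" is an assumption
    about how the file labels coding exons.
    """
    per_gene = defaultdict(int)
    lengths = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(fields) < 9:
            continue  # skip comments and malformed rows
        match = GENE_ID.search(fields[8])
        if match is None:
            continue
        gene = match.group(1)
        per_gene[gene] += 0  # register the gene even if this row is not an exon
        if fields[2] == feature:
            per_gene[gene] += 1
            # GTF coordinates are 1-based and inclusive.
            lengths.append(int(fields[4]) - int(fields[3]) + 1)
    n_genes = len(per_gene)
    mean_exons = sum(per_gene.values()) / n_genes if n_genes else 0.0
    mean_len = sum(lengths) / len(lengths) if lengths else 0.0
    return n_genes, mean_exons, mean_len
```

The function accepts any iterable of lines, so an open file handle for genemark.gtf can be passed directly.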
Hi Christian, this definitely seems like a problem with GeneMark-ES: the number of genes is too low and the genes have too many introns. Can you share more details about the genome itself? For example, how many contigs are longer than 50 Kb and how much of the sequence is soft-masked? Undermasking of repeats can cause issues with large genomes such as yours. To debug issues with GeneMark-ES, it will be helpful to run it separately and report what the log files and standard output say. You can run it like this:
Specifically, contents of the standard output and Best,
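Both genome questions asked here (how many contigs exceed 50 Kb, and how much of the sequence is soft-masked) can be answered from the FASTA file alone. The sketch below is one way to do it, assuming soft-masked bases are written in lowercase; the function name `fasta_mask_stats` and its `min_len` parameter are hypothetical.

```python
def fasta_mask_stats(lines, min_len=50_000):
    """Return (contig count, contigs longer than min_len, soft-masked fraction).

    Assumption: soft-masked bases are lowercase. Hard-masked bases (N) are
    uppercase and therefore not counted as soft-masked.
    """
    lengths = []
    current = None  # length of the contig being read, None before the first header
    total = 0
    lower = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
            total += len(line)
            lower += sum(1 for base in line if base.islower())
    if current is not None:
        lengths.append(current)
    long_contigs = sum(1 for n in lengths if n > min_len)
    fraction = lower / total if total else 0.0
    return len(lengths), long_contigs, fraction
```

Because it streams line by line, the function keeps only contig lengths in memory, which matters for an assembly of this size.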
Hi Tomas, The total assembly is ~220k contigs, with a total size of ~7 Gb. Due to limited computing resources and time, I selected the subset of contigs longer than 100 kb (~13k contigs, combined length ~1.9 Gb) for this primary analysis. For repeat masking, RepeatModeler was run on a subset of these contigs (to limit runtime) to create a custom repeat library. Using this library and the Dfam library, 70% of the bases of the selected contigs were masked (~42% LTR elements, ~27% unclassified). RepeatMasker was run with the -nolow option to avoid masking low-complexity regions. I have now run GeneMark-ES as you suggested.
Thanks,
Hello Christian, thanks for the useful report.
This could actually be the cause of this problem. In large genomes, low-complexity repeats can "confuse" GeneMark-ES and make it converge to repeat-rich regions instead of real genes. I have seen the same behavior before when low-complexity repeats were not properly masked (on a similar-sized genome of X. tropicalis), so I hope this could also be the solution to your issue. You do not have to worry about genes being missed or split due to masking when they contain a short low-complexity stretch. With the option Also, are you hard-masking or soft-masking the repeats when running RepeatMasker? Please make sure to use soft-masking; hard-masking would disrupt many coding genes, especially with the recommended low-complexity masking. Best,
Hi Tomas, I've had a chance at masking the genome anew, without the option It seems, however, that not many additional bases were masked with this option (Simple repeats: 0.80%, Low complexity: 0.18%). Am I right that this is quite low? With this newly masked genome I ran GeneMark-ES again. 5653 unique genes are predicted, with an average of 13 introns per gene, which is quite similar to the previous run. Could it be that I'm not correctly masking low-complexity regions? Here's
Thank you for your continued support,
Hello Christian, the amount of simple repeats indeed seems low. You could try to do simple repeat masking separately, outside of RepeatMasker, with Tandem Repeats Finder (TRF). RepeatMasker internally runs TRF as well, but you get more control over the results when you run it on your own. We described this procedure in this document https://www.biorxiv.org/content/10.1101/2020.08.10.245134v1.supplementary-material (Supplemental Materials) on page 15. I will also ask Alexandre Lomsadze, the original author of GeneMark-ES, to take a look at this issue; maybe he will have some other ideas about what's wrong. Best,
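Whatever tool reports the repeat coordinates, applying them as soft-masking amounts to lowercasing the covered bases. The sketch below illustrates only that final step and is not the procedure from the linked document: the function name `soft_mask` is hypothetical, and the intervals are assumed to be 0-based, half-open (start, end) pairs as in BED. Converting TRF output into such intervals is not shown.

```python
def soft_mask(sequence, intervals):
    """Lowercase the bases of `sequence` covered by `intervals`.

    Assumption: intervals are 0-based, half-open (start, end) pairs, as in BED.
    Bases that are already lowercase stay lowercase, so earlier masking
    (for example from RepeatMasker) is preserved.
    """
    bases = list(sequence)
    for start, end in intervals:
        # Clamp to the sequence bounds so out-of-range intervals are harmless.
        start = max(start, 0)
        end = min(end, len(bases))
        for i in range(start, end):
            bases[i] = bases[i].lower()
    return "".join(bases)
```

Lowercasing keeps the underlying sequence intact, which is the property that distinguishes soft-masking from hard-masking with N.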
Hi Tomas, Thanks for this document. I will try following the approach you linked, and then run GeneMark-ES again. I will let you know how it goes. Best,
Dear Thomas,
I am trying to run ProtHint on an assembly of a plant genome (~13k contigs, ~1.9Gbp). As a protein set, I used the OrthoDB Viridiplantae set (https://v100.orthodb.org/download/odb10_plants_fasta.tar.gz) supplemented by proteins from four more organisms of the same taxonomic family (Compositae).
My command looks like this:
~/software/ProtHint-2.5.0/bin/prothint.py --threads 4 ~/projects/ascaba/genomes/Ascaber/assembly/Ascaber_100k.fa.masked /home/christian/projects/ascaba/BRAKER_protein_sets/plant.proteins.fasta
ProtHint fails to complete the run and prints the following:
[Sun Aug 16 21:20:35 2020] error: ProtHint exited due to an error in command: grep stop_codon Spaln/spaln.gff > stops.gff || [[ $? == 1 ]]
So I checked spaln.gff and indeed there are no stop codons, only CDS and intron features. Moreover, the file seems surprisingly small (184 lines). I am not sure what this is caused by.
Best,
Christian
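A quick way to confirm which feature types a GFF file such as spaln.gff contains is to tally its third column. This is a minimal sketch with a hypothetical function name, `gff_feature_counts`, assuming tab-separated records with the feature type in column 3. Unlike the `grep stop_codon` call in the error message, which matches the string anywhere in a line, it looks only at the feature column.

```python
from collections import Counter


def gff_feature_counts(lines):
    """Count GFF records per feature type (column 3)."""
    counts = Counter()
    for line in lines:
        if line.startswith("#"):
            continue  # skip header and comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            counts[fields[2]] += 1
    return counts
```

A `Counter` returns 0 for missing keys, so `gff_feature_counts(handle)["stop_codon"]` is 0 when the file has no stop codon records.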