Can this be used as a replacement for MetaEuk's eukaryotic gene prediction capabilities? #64
Comments
I haven't tried miniprot on fragmented metagenomic contigs but I believe it should work with UniRef100 or UniRef90 protein sets. The most important parameter in your case is the max intron size (
I usually don't call this gene prediction, but MetaEuk and miniprot largely have the same applications.
I plan on doing a deep dive with this tool tomorrow morning. I read through the documentation and noticed that the index is built on the genome rather than the proteins. Is there any functionality that allows indexing the proteins instead? For example, if I were using this to identify gene candidates in metagenome-assembled genomes based on a predetermined protein database.
I thought MetaEuk also indexes the genome internally? I could be wrong. Anyway, in my opinion, indexing proteins makes more sense for aligning reads; indexing contigs is preferred for aligning contigs.
Thanks for the responses. I really appreciate your insight on this.
I think you're right based on the docs:
In the past I've prebuilt the protein database and provided the contigs as FASTA, but it probably creates a contigsDB in the backend. I'm really enjoying the functionality of miniprot, and it's as easy to use as minimap2 (I also appreciate the informative log output w/ peak RSS). Now that I've tested miniprot on a few test cases, I have a few more specific questions:
A CDS is effectively an exon here, since in theory only coding sequences are involved in the alignment. Nonetheless, miniprot may introduce frameshifts. I don't know of a good tool to postprocess miniprot alignments. In my understanding, people are using in-house scripts for different purposes.
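For anyone landing here looking for a starting point, below is a minimal sketch of the kind of in-house script mentioned above. It is not part of miniprot. It assumes the alignments were written as standard 9-column GFF3 with `mRNA` features carrying an `ID` attribute and `CDS` features carrying a `Parent` attribute. The `Frameshift` attribute name, the function names, and the sample records are assumptions made for illustration, so check them against your own output before relying on this.

```python
from collections import defaultdict


def parse_attributes(field):
    """Turn a GFF3 attribute column ('k1=v1;k2=v2') into a dict."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key] = value
    return attrs


def summarize_gff(lines):
    """Group CDS segments under their parent mRNA.

    Returns {mRNA ID: {"contig": str, "attrs": dict, "cds": [(start, end), ...]}}.
    """
    models = {}
    cds_by_parent = defaultdict(list)
    for line in lines:
        # Skip blank lines, directives, and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        contig, _source, feature, start, end = cols[:5]
        attrs = parse_attributes(cols[8])
        if feature == "mRNA" and "ID" in attrs:
            models[attrs["ID"]] = {"contig": contig, "attrs": attrs, "cds": []}
        elif feature == "CDS" and "Parent" in attrs:
            cds_by_parent[attrs["Parent"]].append((int(start), int(end)))
    # Attach CDS segments, sorted by genomic start, to the models we saw.
    for parent, segments in cds_by_parent.items():
        if parent in models:
            models[parent]["cds"] = sorted(segments)
    return models


def has_frameshift(model, key="Frameshift"):
    """True if the mRNA reports a nonzero frameshift count.

    The attribute name is an assumption; pass a different key if needed.
    """
    return int(model["attrs"].get(key, "0")) > 0


# Hypothetical records for illustration only, not real miniprot output.
SAMPLE_GFF = [
    "##gff-version 3",
    "contig_1\tminiprot\tmRNA\t100\t900\t500\t+\t.\tID=MP000001;Identity=0.91;Frameshift=1",
    "contig_1\tminiprot\tCDS\t100\t300\t200\t+\t0\tParent=MP000001",
    "contig_1\tminiprot\tCDS\t600\t900\t300\t+\t0\tParent=MP000001",
    "contig_2\tminiprot\tmRNA\t50\t350\t400\t-\t.\tID=MP000002;Identity=0.98",
    "contig_2\tminiprot\tCDS\t50\t350\t400\t-\t0\tParent=MP000002",
]

if __name__ == "__main__":
    for model_id, model in summarize_gff(SAMPLE_GFF).items():
        flag = "frameshift" if has_frameshift(model) else "clean"
        print(model_id, model["contig"], len(model["cds"]), "CDS", flag)
```

To run it on a real file, pass an open file handle instead of `SAMPLE_GFF`, since `summarize_gff` only needs an iterable of lines. Dropping or separately reviewing the models flagged by `has_frameshift` is one possible filtering policy, not a recommendation from the miniprot author.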
I've been developing a metagenomic/metatranscriptomics software suite called VEBA (https://github.com/jolespin/veba) that natively handles eukaryotic binning of metagenome-assembled genomes and exon-aware gene prediction.
Currently I'm using MetaEuk but encountering significant resource requirements for larger eukaryotic genomes (especially algae from targeted cultured assemblies).
I've seen the sensitivity and distant homology issues mentioned so thought this would be appropriate to ask in an issue.
I have a general microeukaryotic protein database that I've compiled from various source repositories and clustered at 100%, 90%, and 50% identity, similar to UniRef (explained in Table 2 of https://academic.oup.com/nar/article/52/14/e63/7697622).
In this database, there will be many proteins that are not related to the target genome.
My questions:
If so, are there any parameters I should adjust to help with any of those scenarios?