Can this be used as a replacement for MetaEuk's eukaryotic gene prediction capabilities? #64

Open
jolespin opened this issue Sep 29, 2024 · 5 comments
Labels
question Further information is requested

Comments

@jolespin

I've been developing a metagenomic/metatranscriptomics software suite called VEBA (https://github.com/jolespin/veba) that natively handles eukaryotic binning of metagenome-assembled genomes and exon-aware gene prediction.

Currently I'm using MetaEuk but am running into significant resource requirements for larger eukaryotic genomes (especially algae from targeted culture assemblies).

I've seen the sensitivity and distant homology issues mentioned so thought this would be appropriate to ask in an issue.

I have a general microeukaryotic protein database that I've compiled from various source repositories and clustered at 100%, 90%, and 50% identity, similar to UniRef (explained in Table 2 of https://academic.oup.com/nar/article/52/14/e63/7697622).

In this database, there will be many proteins that are not related to the target genome.

My questions:

  • Given a genome whose lineage I do not know a priori, can I use miniprot with this "general" microeukaryotic protein database?
  • Can I use miniprot for exon-aware gene predictions as I do with MetaEuk?
  • Can this be used with fragmented genomes?

If so, are there any parameters I should adjust to help with any of those scenarios?

@lh3
Owner

lh3 commented Sep 29, 2024

I haven't tried miniprot on fragmented metagenomic contigs, but I believe it should work with UniRef100 or UniRef90 protein sets. The most important parameters in your case are the max intron size (-G) and possibly the minimum query coverage (--outc). There will be many false hits, so you will need to filter the output heavily. For example, if a region is already covered by a good alignment, you will want to discard the worse alignments that overlap it.
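A minimal sketch of that overlap filtering, not an official miniprot workflow. It assumes the standard nine-column GFF3 layout, that column 6 of each mRNA line holds an alignment score where higher is better, and that a 50% overlap cutoff is reasonable; `drop_overlapped` and `max_overlap` are hypothetical names chosen for illustration.

```python
from collections import defaultdict


def parse_mrna(gff_text):
    """Yield (contig, start, end, score, strand, line) for each mRNA record."""
    for line in gff_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split("\t")
        if len(fields) < 9 or fields[2] != "mRNA":
            continue
        yield (fields[0], int(fields[3]), int(fields[4]),
               float(fields[5]), fields[6], line)


def drop_overlapped(gff_text, max_overlap=0.5):
    """Greedily keep the best-scoring mRNA records on each contig and strand.

    A record is dropped when it overlaps an already accepted, better-scoring
    record by more than max_overlap of the shorter of the two records.
    """
    by_locus = defaultdict(list)
    for rec in parse_mrna(gff_text):
        by_locus[(rec[0], rec[4])].append(rec)

    kept = []
    for recs in by_locus.values():
        accepted = []
        # Visit records from best to worst score (assumed: higher is better).
        for rec in sorted(recs, key=lambda r: -r[3]):
            start, end = rec[1], rec[2]
            length = end - start + 1
            clash = False
            for acc in accepted:
                overlap = min(end, acc[2]) - max(start, acc[1]) + 1
                shorter = min(length, acc[2] - acc[1] + 1)
                if overlap > 0 and overlap / shorter > max_overlap:
                    clash = True
                    break
            if not clash:
                accepted.append(rec)
        kept.extend(accepted)

    # Return the surviving mRNA lines ordered by contig and start position.
    kept.sort(key=lambda r: (r[0], r[1]))
    return [r[5] for r in kept]
```

The greedy pass is quadratic in the number of hits per contig and strand, which is fine for a sketch but may call for an interval index on very large outputs. Child CDS lines would then need to be filtered by the surviving mRNA IDs.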

> Can I use miniprot for exon-aware gene predictions as I do with MetaEuk?

I usually don't call this gene prediction, but MetaEuk and miniprot largely have the same applications.

@jolespin
Author

I plan on doing a deep dive with this tool tomorrow morning. I read through the documentation and noticed that the index is built from the genome rather than the proteins. Is there any functionality for indexing the proteins instead? This would be useful, for example, when identifying gene candidates in metagenome-assembled genomes against a predetermined protein database.

@lh3
Owner

lh3 commented Sep 30, 2024

I thought MetaEuk also indexes the genome internally? I could be wrong. Anyway, in my opinion, indexing proteins makes more sense for aligning reads; indexing contigs is preferred for aligning contigs.

@lh3 lh3 added the question Further information is requested label Sep 30, 2024
@jolespin
Author

Thanks for the responses. I really appreciate your insight on this.

> I thought MetaEuk also indexes the genome internally? I could be wrong. Anyway, in my opinion, indexing proteins makes more sense for aligning reads; indexing contigs is preferred for aligning contigs.

I think you're right based on the docs:

metaeuk easy-predict contigsFasta/contigsDB proteinsFasta/referenceDB predsResults tempFolder

In the past I've prebuilt the protein database and provided the contigs as FASTA, but it probably creates a contigsDB behind the scenes.

Really enjoying the functionality of miniprot; it is as easy to use as minimap2 (I also appreciate the informative log output with peak RSS).

Now that I've tested miniprot out with a few test cases, I had a few more specific questions:

  • How does miniprot handle genes with multiple exons? I see that the GFF file has mRNA and CDS but not exons.
  • Are there any tools you would recommend for post-processing the GFF records to enrich for higher-confidence candidates? I found the following tool, but I'm not sure whether you have used it yourself: https://github.com/tomasbruna/miniprot-boundary-scorer?tab=readme-ov-file
    I ask because I ran miniprot on a test genome and it produced >200k genes with the following command:
    miniprot --outc=0.25 -G 2000 -t 8 -P ${ID}_gene --gff ${ID}.mpi ${DB}

@lh3
Owner

lh3 commented Oct 1, 2024

> How does miniprot handle genes with multiple exons? I see that the GFF file has mRNA and CDS but not exons.

A CDS is effectively an exon because, in theory, only coding sequence takes part in the alignment. Note, however, that miniprot may introduce frameshifts.
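If a downstream tool insists on explicit exon features, one option is to mirror each CDS line as an exon line. This is a rough sketch under that "CDS is effectively an exon" assumption; `add_exon_features` is a hypothetical helper, not part of miniprot, and it makes no attempt to handle frameshifts or untranslated regions.

```python
def add_exon_features(gff_lines):
    """Return the GFF lines with an exon record added after each CDS record.

    Assumes nine tab-separated GFF3 columns and that every CDS line
    corresponds to exactly one exon with the same coordinates.
    """
    out = []
    for raw in gff_lines:
        line = raw.rstrip("\n")
        out.append(line)
        if line.startswith("#"):
            continue  # comment and header lines are passed through unchanged
        fields = line.split("\t")
        if len(fields) == 9 and fields[2] == "CDS":
            exon = list(fields)
            exon[2] = "exon"  # same coordinates, different feature type
            exon[7] = "."     # phase is only meaningful for CDS features
            out.append("\t".join(exon))
    return out
```

The exon lines reuse the attribute column of the CDS, so they keep the same Parent value and stay attached to the same mRNA record.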

I don't know of a good tool for post-processing miniprot alignments. As far as I know, people use in-house scripts for different purposes.
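As an illustration of what such an in-house script might look like, here is a minimal sketch that keeps only mRNA records passing a few attribute thresholds, plus their child features. It assumes the mRNA attribute column carries `ID`, `Rank`, `Identity`, and optionally `Frameshift` and `StopCodon` keys; check these against your own GFF output. The function names and thresholds are hypothetical and would need tuning per dataset.

```python
def parse_attrs(col9):
    """Parse a GFF3 attribute column such as 'ID=x;Rank=1' into a dict."""
    attrs = {}
    for item in col9.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key] = value
    return attrs


def high_confidence_ids(gff_lines, min_identity=0.5, max_rank=1):
    """Collect IDs of mRNA records that pass the (assumed) thresholds."""
    keep = set()
    for line in gff_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "mRNA":
            continue
        attrs = parse_attrs(fields[8])
        if "ID" not in attrs:
            continue
        if float(attrs.get("Identity", 0)) < min_identity:
            continue
        if int(attrs.get("Rank", 1)) > max_rank:
            continue
        # Assumed keys: treat any frameshift or in-frame stop as low confidence.
        if int(attrs.get("Frameshift", 0)) > 0:
            continue
        if int(attrs.get("StopCodon", 0)) > 0:
            continue
        keep.add(attrs["ID"])
    return keep


def filter_gff(gff_lines, keep_ids):
    """Keep records whose ID or Parent is in keep_ids; comment lines are dropped."""
    out = []
    for line in gff_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue
        attrs = parse_attrs(fields[8])
        if attrs.get("ID") in keep_ids or attrs.get("Parent") in keep_ids:
            out.append(line.rstrip("\n"))
    return out
```

A typical use would be `filter_gff(lines, high_confidence_ids(lines))`, followed by an overlap-based pass so that each locus ends up with a single best candidate.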
