Can this be used as a replacement for MetaEuk's eukaryotic gene prediction capabilities? #64

jolespin · 2024-09-29T16:34:44Z

I've been developing a metagenomic/metatranscriptomics software suite called VEBA (https://github.com/jolespin/veba) that natively handles eukaryotic binning of metagenome-assembled genomes and exon-aware gene prediction.

Currently I'm using MetaEuk, but I'm running into significant resource requirements for larger eukaryotic genomes (especially algae from targeted culture assemblies).

I've seen the sensitivity and distant homology issues mentioned, so I thought this would be appropriate to ask in an issue.

I have a general microeukaryotic protein database that I've compiled from various source repositories and clustered by 100%, 90%, and 50% identity similar to UniRef (explained in Table 2 https://academic.oup.com/nar/article/52/14/e63/7697622).

In this database, there will be many proteins that are not related to the target genome.

My questions:

Given a genome whose lineage I do not know a priori, can I use miniprot with this "general" microeukaryotic protein database?
Can I use miniprot for exon-aware gene predictions as I do with MetaEuk?
Can this be used with fragmented genomes?

If so, are there any parameters I should adjust to help with any of those scenarios?

lh3 · 2024-09-29T19:21:03Z

I haven't tried miniprot on fragmented metagenomic contigs, but I believe it should work with UniRef100 or UniRef90 protein sets. The most important parameters in your case are the max intron size (-G) and possibly the min percent coverage (--outc). There will be many false hits, so you will need to filter the output heavily. For example, if a region is covered by a good alignment, you will want to filter out the worse alignments that overlap it.
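The overlap filtering described above could be sketched roughly as follows. This is only an illustration, not part of miniprot: it assumes the GFF3 output has one mRNA line per alignment with the alignment score in column 6, and the function names, the greedy strategy, and the `max_overlap` threshold are all hypothetical choices.

```python
from collections import defaultdict


def parse_mrna(line):
    """Return a dict for a GFF3 mRNA line, or None for any other line."""
    if line.startswith("#"):
        return None
    f = line.rstrip("\n").split("\t")
    if len(f) != 9 or f[2] != "mRNA":
        return None
    return {
        "contig": f[0],
        "start": int(f[3]),
        "end": int(f[4]),
        # Assumption: column 6 holds the alignment score ("." means missing).
        "score": float(f[5]) if f[5] != "." else 0.0,
        "strand": f[6],
        "attrs": f[8],
    }


def keep_best_nonoverlapping(records, max_overlap=0.5):
    """Greedily keep the best-scoring alignments per contig and strand.

    An alignment is dropped when more than `max_overlap` of its own length
    is covered by a single better alignment that was already kept.
    """
    by_locus = defaultdict(list)
    for r in records:
        by_locus[(r["contig"], r["strand"])].append(r)

    kept = []
    for group in by_locus.values():
        accepted = []
        for r in sorted(group, key=lambda x: x["score"], reverse=True):
            length = r["end"] - r["start"] + 1
            clash = False
            for k in accepted:
                overlap = min(r["end"], k["end"]) - max(r["start"], k["start"]) + 1
                if overlap > max_overlap * length:
                    clash = True
                    break
            if not clash:
                accepted.append(r)
        kept.extend(accepted)
    return kept
```

Treating the two strands separately is a design choice made here for simplicity; whether overlapping hits on opposite strands should compete depends on the genomes involved.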

> Can I use miniprot for exon-aware gene predictions as I do with MetaEuk?

I usually don't call this gene prediction, but MetaEuk and miniprot largely have the same applications.

jolespin · 2024-09-30T02:27:55Z

I plan on doing a deep dive with this tool tomorrow morning. I read through the documentation and noticed that the index is built from the genome rather than the proteins. Is there any functionality that allows indexing the proteins instead? For example, if I were using this to identify gene candidates in metagenome-assembled genomes based on a predetermined protein database.

lh3 · 2024-09-30T12:34:47Z

I thought MetaEuk also indexes the genome internally? I could be wrong. Anyway, in my opinion, indexing proteins makes more sense for aligning reads; indexing contigs is preferred for aligning contigs.

jolespin · 2024-09-30T21:03:25Z

Thanks for the responses. I really appreciate your insight on this.

> I thought MetaEuk also indexes the genome internally? I could be wrong. Anyway, in my opinion, indexing proteins makes more sense for aligning reads; indexing contigs is preferred for aligning contigs.

I think you're right based on the docs:

metaeuk easy-predict contigsFasta/contigsDB proteinsFasta/referenceDB predsResults tempFolder

In the past I've prebuilt the protein database and provided the contigs as FASTA, but it probably creates a contigsDB in the backend.

I'm really enjoying miniprot's functionality, and it is as easy to use as minimap2 (I also appreciate the informative log output with peak RSS).

Now that I've tested miniprot out with a few test cases, I had a few more specific questions:

How does miniprot handle genes with multiple exons? I see that the GFF file has mRNA and CDS but not exons.
Are there any tools you would recommend for post-processing the output records in the GFF to enrich for higher-confidence candidates? I found the following tool, but I'm not sure whether you have personally used it: https://github.com/tomasbruna/miniprot-boundary-scorer?tab=readme-ov-file The reason I ask is that I ran miniprot on a test genome and it produced >200k genes using the following command: miniprot --outc=0.25 -G 2000 -t 8 -P ${ID}_gene --gff ${ID}.mpi ${DB}

lh3 · 2024-10-01T00:43:28Z

> How does miniprot handle genes with multiple exons? I see that the GFF file has mRNA and CDS but not exons.

A CDS is effectively an exon because, in theory, only coding sequences are involved in the alignment. Nonetheless, miniprot may introduce frameshifts.

I don't know of a good tool to postprocess miniprot alignments. In my understanding, people use in-house scripts for different purposes.
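If a downstream tool insists on explicit exon features, one option following the point above is to mirror each CDS line as an exon line. The sketch below is a hypothetical helper, not something miniprot provides; it assumes standard nine-column GFF3 lines and does nothing special for alignments that contain frameshifts.

```python
def add_exon_features(gff_lines):
    """Return the GFF3 lines with an 'exon' copy added after each CDS line."""
    out = []
    for line in gff_lines:
        out.append(line)
        if line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        if len(f) == 9 and f[2] == "CDS":
            f[2] = "exon"
            f[7] = "."  # phase is only meaningful for CDS features
            out.append("\t".join(f) + "\n")
    return out
```

Because only coding sequence is aligned, exons produced this way contain no UTRs, so they describe the coding part of each gene model only.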
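As an example of what such an in-house script might look like, the sketch below keeps only alignments that pass a rank and identity cutoff. It assumes that each mRNA line carries `ID`, `Rank` and `Identity` attributes in column 9 and that child lines point to it through `Parent`; check this against your own output first. The function names and thresholds are hypothetical.

```python
def parse_attrs(col9):
    """Split a GFF3 attribute column such as 'ID=x;Rank=1' into a dict."""
    return dict(kv.split("=", 1) for kv in col9.split(";") if "=" in kv)


def filter_by_rank_identity(gff_lines, max_rank=1, min_identity=0.5):
    """Keep mRNA lines passing both cutoffs, plus their child features.

    Assumes each mRNA line appears before its CDS children.
    """
    kept_ids = set()
    out = []
    for line in gff_lines:
        if line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        if len(f) != 9:
            continue
        attrs = parse_attrs(f[8])
        if f[2] == "mRNA":
            rank = int(attrs.get("Rank", 1))
            identity = float(attrs.get("Identity", 0.0))
            if "ID" in attrs and rank <= max_rank and identity >= min_identity:
                kept_ids.add(attrs["ID"])
                out.append(line)
        elif attrs.get("Parent") in kept_ids:
            out.append(line)
    return out
```

A filter like this reduces the number of candidates but does not resolve overlapping hits from unrelated proteins, so it would normally be combined with some form of overlap filtering.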

lh3 added the question Further information is requested label Sep 30, 2024
