Skip to content

Commit

Permalink
r218: let -G override -I, not the reverse
Browse files Browse the repository at this point in the history
This is more intuitive. Also updated README
  • Loading branch information
lh3 committed Mar 7, 2023
1 parent 485ba5e commit 87f11cb
Show file tree
Hide file tree
Showing 4 changed files with 27 additions and 57 deletions.
64 changes: 17 additions & 47 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -11,12 +11,12 @@ cd miniprot && make
./miniprot test/DPP3-hs.gen.fa.gz test/DPP3-mm.pep.fa.gz > aln.paf # PAF output
./miniprot --gff test/DPP3-hs.gen.fa.gz test/DPP3-mm.pep.fa.gz > aln.gff # GFF3+PAF output

# general command line: index and align in one go
./miniprot -ut16 --gff genome.fna protein.faa > aln.gff
# general command line: index and align in one go (-I sets max intron size based on genome size)
./miniprot -Iut16 --gff genome.fna protein.faa > aln.gff

# general command line: index first and then align (recommended)
./miniprot -t16 -d genome.mpi genome.fna
./miniprot -ut16 --gff genome.mpi protein.faa > aln.gff
./miniprot -Iut16 --gff genome.mpi protein.faa > aln.gff

# output format
man ./miniprot.1
Expand All @@ -29,7 +29,6 @@ man ./miniprot.1
- [Users' Guide](#uguide)
- [Installation](#install)
- [Usage](#usage)
- [Evaluation](#eval)
- [Algorithm overview](#algo)
- [Citing miniprot](#cite)
- [Limitations](#limit)
Expand Down Expand Up @@ -84,47 +83,12 @@ For convenience, miniprot can also output GFF3 with option `--gff`:
```sh
miniprot -t8 --gff -d ref.mpi ref.fna > out.gff
```
The detailed alignment is embedded in `##PAF` lines in the GFF3 output.

### <a name="eval"></a>Evaluation

We collected Ensembl canonical mouse proteins from Gencode vM30 and longest
proteins per gene for chicken and zebrafish (fish). We then aligned these proteins to
the human reference genome GRCh38. We say a junction is confirmed if it can be
found in the human Gencode annotation v41; a junction is non-overlapping if the
intron in the junction does not overlap with any introns in the Gencode
annotation.

We only evaluated miniprot-r173 and [spaln][spaln]-2.4.13a as these are the
only tools practical for whole genomes. Running other tools would require to
find approximate protein mapping first and then realign in each local region.
This procedure is complex and does not evaluate the mapping step. In addition,
[Iwata and Gotoh (2012)][spaln2] suggest that spaln2 consistently outperforms
exonerate, GeneWise, ProSplign and genBlastG.

In the evaluation, both miniprot (mp) and spaln (sp) were set to use 16 CPU
threads. We used option `-Q7 -O0 -Thomosapi -LS -yS` with spaln (local
alignment with the human-specific splicing model). It gives the best accuracy
on these dataset. Note spaln-2.4.13a crashed for a few zebrafish proteins. We
used 98% of zebrafish proteins in the evaluation. Miniprot uses a splice model
for mammals by default. 'mp-j1' applies a more general model and has lower
accuracy for aligning against the human genome.

|Metric |mouse/mp |mouse/sp |chicken/mp| fish/mp|fish/mp-j1| fish/sp|
|:---------------|--------:|--------:|--------:|--------:|---------:|--------:|
|Elapsed time (s)| 314 | 3,767 | 260 | 470 | 475 | 12703 |
|Peak RAM (Gb) | 15.3 | 5.6 | 14.6 | 18.7 | 18.9 | 5.5 |
|# proteins | 21,844 | 21,844 | 17,007 | 29,706 | 29,706 | 29,706 |
|# mapped | 19,303 | 18,840 | 13,421 | 19,594 | 19,594 | 17,491 |
|# single-exon | 2,810 | | 1,227 | 1,798 | 1,667 | |
|# predicted junc| 165,458 | 171,241 | 132,473 | 174,975 | 177,995 | 180,117 |
|# non-ovlp junc | 316 | 852 | 258 | 457 | 737 | 1,391 |
|# confirmed junc| 161,113 | 162,551 | 123,523 | 162,195 | 161,225 | 162,757 |
|% confirmed | 97.37 | 94.93 | 94.29 | 92.70 | 90.58 | 90.36 |

Generally, miniprot finds fewer novel splice junctions, implying higher
specificity, but spaln finds more confirmed junctions, implying higher
sensitivity.
The detailed alignment is embedded in `##PAF` lines in the GFF3 output. You can
also get detailed residue alignment with `--aln`.

If you are aligning proteins to a whole genome, it is recommended to add option
`-I` to let miniprot automatically set the maximum intron size. You can also
use `-G` to explicitly specify the max intron size.

### <a name="algo"></a>Algorithm overview

Expand All @@ -151,9 +115,14 @@ sensitivity.

### <a name="cite"></a>Citing miniprot

The miniprot algorithm is described in the following preprint:
If you use miniprot, please cite:

> Li, H. (2023) Protein-to-genome alignment with miniprot. *Bioinformatics*, **39**, btad014 [[PMID: 36648328]][mp-pmid].
> Li, H. (2022). Protein-to-genome alignment with miniprot. [arXiv:2210.08052](https://arxiv.org/abs/2210.08052).
The preprint is available at
[arXiv:2210.08052](https://arxiv.org/abs/2210.08052), which
additionally shows metrics on MetaEuk. Please note that the published paper
evaluated miniprot-0.7. The latest version may report different numbers.

## <a name="limit"></a>Limitations

Expand All @@ -167,6 +136,7 @@ The miniprot algorithm is described in the following preprint:

[exonerate]: https://pubmed.ncbi.nlm.nih.gov/15713233/
[genewise]: https://pubmed.ncbi.nlm.nih.gov/15123596/
[mp-pmid]: https://pubmed.ncbi.nlm.nih.gov/36648328/
[zlib]: https://zlib.net
[paftools]: https://github.com/lh3/minimap2/blob/master/misc/paftools.js
[minimap2]: https://github.com/lh3/minimap2
Expand Down
12 changes: 6 additions & 6 deletions main.c
Original file line number Diff line number Diff line change
Expand Up @@ -56,8 +56,8 @@ static void print_usage(FILE *fp, const mp_idxopt_t *io, const mp_mapopt_t *mo,
fprintf(fp, " Mapping:\n");
fprintf(fp, " -S no splicing (applying -G1k -J1k -e1k)\n");
fprintf(fp, " -c NUM max k-mer occurrence [%d]\n", mo->max_occ);
fprintf(fp, " -G NUM max intron size [200k]\n");
fprintf(fp, " -I set max intron size to 3.6*sqrt(refLen); override -G\n");
fprintf(fp, " -G NUM max intron size; override -I [200k]\n");
fprintf(fp, " -I set max intron size to 3.6*sqrt(refLen)\n");
fprintf(fp, " -w FLOAT weight of log gap penalty [%g]\n", mo->chn_coef_log);
fprintf(fp, " -n NUM minimum number of syncmers in a chain [%d]\n", mo->min_chn_cnt);
fprintf(fp, " -m NUM min chaining score [%d]\n", mo->min_chn_sc);
Expand Down Expand Up @@ -88,7 +88,7 @@ static void print_usage(FILE *fp, const mp_idxopt_t *io, const mp_mapopt_t *mo,

int main(int argc, char *argv[])
{
int32_t c, i, set_I = 0, n_threads = 4;
int32_t c, i, set_I = 0, set_G = 0, n_threads = 4;
ketopt_t o = KETOPT_INIT;
mp_mapopt_t mo;
mp_idxopt_t io;
Expand All @@ -107,14 +107,14 @@ int main(int argc, char *argv[])
else if (c == 't') n_threads = atoi(o.arg);
else if (c == 'l') mo.kmer2 = atoi(o.arg);
else if (c == 'c') mo.max_occ = mp_parse_num(o.arg);
else if (c == 'G') mo.bw = mo.max_intron = mp_parse_num(o.arg);
else if (c == 'G') mo.bw = mo.max_intron = mp_parse_num(o.arg), set_G = 1;
else if (c == 'I') set_I = 1;
else if (c == 'n') mo.min_chn_cnt = mp_parse_num(o.arg);
else if (c == 'm') mo.min_chn_sc = mp_parse_num(o.arg);
else if (c == 'K') mo.mini_batch_size = mp_parse_num(o.arg);
else if (c == 'p') mo.pri_ratio = atof(o.arg);
else if (c == 'N') mo.best_n = mp_parse_num(o.arg);
else if (c == 'S') mo.flag |= MP_F_NO_SPLICE, mo.bw = mo.max_intron = mo.max_ext = 1000, mo.io = mo.io_end = 10000;
else if (c == 'S') mo.flag |= MP_F_NO_SPLICE, mo.bw = mo.max_intron = mo.max_ext = 1000, mo.io = mo.io_end = 10000, set_G = 1;
else if (c == 'A') mo.flag |= MP_F_NO_ALIGN;
else if (c == 'O') mo.go = atoi(o.arg);
else if (c == 'E') mo.ge = atoi(o.arg);
Expand Down Expand Up @@ -167,7 +167,7 @@ int main(int argc, char *argv[])
fprintf(stderr, "[ERROR]\033[1;31m failed to open/build the index\033[0m\n");
return 1;
}
if (set_I) mp_mapopt_set_max_intron(&mo, mi->nt->l_seq);
if (set_I && !set_G) mp_mapopt_set_max_intron(&mo, mi->nt->l_seq);
if (mp_verbose >= 3) mp_idx_print_stat(mi, mo.max_occ);
if (fn_idx != 0) mp_idx_dump(fn_idx, mi);
for (i = o.ind + 1; i < argc; ++i)
Expand Down
6 changes: 3 additions & 3 deletions miniprot.1
Original file line number Diff line number Diff line change
Expand Up @@ -69,15 +69,15 @@ Ignore k-mers occurring
times or more [50k]
.TP
.BI -G \ NUM
Max intron size [200k]
Max intron size [200k]. This option overrides
.BR -I .
.TP
.BI -I
Set max intron size to
.RI min(max(3.6*sqrt( refLen ),10000),300000)
where
.I refLen
is the total length of the input genome. This option overrides
.BR -G .
is the total length of the input genome.
.TP
.BI -n \ NUM
Min number of syncmers in a chain [10]
Expand Down
2 changes: 1 addition & 1 deletion miniprot.h
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@

#include <stdint.h>

#define MP_VERSION "0.7-r217-dirty"
#define MP_VERSION "0.7-r218-dirty"

#define MP_F_NO_SPLICE 0x1
#define MP_F_NO_ALIGN 0x2
Expand Down

0 comments on commit 87f11cb

Please sign in to comment.