
Indexing is much slower on fragmented assemblies #10

Closed
twrightsman opened this issue Sep 22, 2022 · 9 comments
Labels: bug (Something isn't working)

Comments

@twrightsman

I have noticed that indexing takes much longer (>1 hr) on fragmented assemblies with a few million contigs. Assemblies of similar total length but with only hundreds of contigs index in minutes. The total assembly sizes are around one gigabase.

This may be expected behavior, and possibly related to #3.
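
For reference, the indexing runs compared here follow the usual miniprot invocation, timed with GNU time in verbose mode (the file names below are placeholders; the actual commands appear in the timings later in this thread):

/usr/bin/time -v miniprot -t8 -d fragmented-genome.mpi fragmented-genome.fa
/usr/bin/time -v miniprot -t8 -d contiguous-genome.mpi contiguous-genome.fa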

@lh3 (Owner) commented Sep 22, 2022

This is not expected. Do you happen to have a link to a fragmented assembly that I can play with?

@lh3 (Owner) commented Sep 22, 2022

I downloaded a human short-read assembly with 361,157 contigs. It took 2 minutes to index, comparable to indexing the human reference genome.

[M::[email protected]*1.00] read 2806031133 bases in 361157 contigs
[M::[email protected]*1.00] 22263202 blocks
[M::[email protected]*5.37] collected syncmers
[M::[email protected]*2.50] 847438891 kmer-block pairs
[M::mp_idx_print_stat] 1129657 distinct k-mers; mean occ of infrequent k-mers: 717.87; 313 frequent k-mers accounting for 36720278 occurrences
[M::main] Version: 0.3-r137

I guess something else is causing the slow indexing on your end. What sequences are you indexing?

@twrightsman (Author)

I am indexing short-read-based assemblies of grass genomes, which are unfortunately not public yet. I will re-run two of the assemblies, one fragmented and one not, to verify this wasn't due to something like transient storage latency on the compute cluster.

If you want me to run any diagnostic tools on the assemblies, like k-mer statistics, just let me know!
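
As an illustration of the kind of quick diagnostic that could be run, a couple of standard tools would cover contig and k-mer statistics (the tool choice here is only a suggestion, not something used in this thread):

seqkit stats -a short-read-genome.fa                                  # contig count, total length, N50
jellyfish count -m 21 -s 1G -t 8 -C -o kmers.jf short-read-genome.fa  # count canonical 21-mers
jellyfish histo kmers.jf > kmer-histogram.txt                         # k-mer occurrence histogram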

@twrightsman (Author) commented Sep 26, 2022

Here is the time output from indexing one of the "problematic" short-read assemblies:

[M::[email protected]*1.00] read 2526970524 bases in 5355222 contigs
[M::[email protected]*1.00] 25107504 blocks
[M::[email protected]*8.00] collected syncmers
[M::[email protected]*7.99] 772920789 kmer-block pairs
[M::mp_idx_print_stat] 1127614 distinct k-mers; mean occ of infrequent k-mers: 677.41; 120 frequent k-mers accounting for 9148986 occurrences
[M::main] Version: 0.3-r142-dirty
[M::main] CMD: ./miniprot/miniprot -t8 -d short-read-genome.mpi short-read-genome.fa
[M::main] Real time: 15842.331 sec; CPU: 126531.617 sec; Peak RSS: 19.614 GB
        Command being timed: "./miniprot/miniprot -t8 -d short-read-genome.mpi short-read-genome.fa"
        User time (seconds): 126512.43
        System time (seconds): 19.17
        Percent of CPU this job got: 798%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 4:24:02
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 20566688
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 997655
        Voluntary context switches: 29
        Involuntary context switches: 217125
        Swaps: 0
        File system inputs: 0
        File system outputs: 8986760
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

The long-read ones index just fine (<2 mins):

[M::[email protected]*1.00] read 4598175493 bases in 535 contigs
[M::[email protected]*1.00] 35923756 blocks
[M::[email protected]*6.15] collected syncmers
[M::[email protected]*4.79] 1479416713 kmer-block pairs
[M::mp_idx_print_stat] 1129322 distinct k-mers; mean occ of infrequent k-mers: 1288.04; 348 frequent k-mers accounting for 25252416 occurrences
[M::main] Version: 0.3-r142-dirty
[M::main] CMD: ./miniprot/miniprot -t8 -d long-read-genome.mpi long-read-genome.fa
[M::main] Real time: 90.484 sec; CPU: 249.604 sec; Peak RSS: 22.869 GB
        Command being timed: "./miniprot/miniprot -t8 -d long-read-genome.mpi long-read-genome.fa"
        User time (seconds): 238.54
        System time (seconds): 11.06
        Percent of CPU this job got: 275%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:30.48
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 23979400
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 2047523
        Voluntary context switches: 41
        Involuntary context switches: 728
        Swaps: 0
        File system inputs: 0
        File system outputs: 16310512
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

@lh3 (Owner) commented Sep 26, 2022

There may be something unique to your short-read assembly. In #13, someone indexed a 22.5 Gb short-read assembly in half an hour, which is about the expected time.

Could you run indexing under the perf command with:

perf record miniprot -t8 -d short-read-genome.mpi short-read-genome.fa
perf report

And then take a screenshot of the perf report? This will help me pinpoint the exact function causing the slowness. Thank you. I will also take a look at the source code on my end, but without the actual data it may be difficult to identify the problem.
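
If a screenshot is awkward, the same profile can also be dumped as text; adding -g to capture call graphs is optional (this is just a suggested variant of the commands above):

perf record -g miniprot -t8 -d short-read-genome.mpi short-read-genome.fa
perf report --stdio > perf-report.txt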

lh3 added the bug label on Sep 26, 2022
@twrightsman (Author)

Here are the perf report screenshots for the short-read indexing:
[screenshot: short-read perf report]
...and for the long-read indexing:
[screenshot: long-read perf report]

I let the short-read indexing run for about 15 minutes to collect plenty of samples, while the long-read indexing finished completely during sampling.

@lh3 (Owner) commented Sep 26, 2022

This is very helpful. I will have a look when I get time.

lh3 closed this as completed in commit 4137611 on Sep 27, 2022
@lh3 (Owner) commented Sep 27, 2022

I believe this has been fixed at the GitHub HEAD. Let me know if you still see the same issue. Thank you for the perf run; it was very informative.
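
One way to pick up the fix is to rebuild from the current HEAD (assuming a standard from-source build; miniprot needs only zlib and make) and re-run the indexing:

git clone https://github.com/lh3/miniprot
cd miniprot && make
./miniprot -t8 -d short-read-genome.mpi short-read-genome.fa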

@twrightsman (Author)

From 4 hours to 40 seconds! Thank you for your time, and I'm glad I could help in some way.
