
Indexing is much slower on fragmented assemblies #10

Closed
twrightsman opened this issue Sep 22, 2022 · 9 comments
Labels: bug (Something isn't working)

Comments

@twrightsman

I have noticed that indexing takes much longer (>1 hr) on fragmented assemblies with a few million contigs. Assemblies of similar total length but with only hundreds of contigs index in minutes. The total assembly sizes are around one gigabase.

This may be expected behavior, and possibly related to #3.
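
For reference, the indexing runs compared here follow the usual miniprot invocation, timed with GNU time in verbose mode (the file names below are placeholders; the actual commands appear in the timings later in this thread):

/usr/bin/time -v miniprot -t8 -d fragmented-genome.mpi fragmented-genome.fa
/usr/bin/time -v miniprot -t8 -d contiguous-genome.mpi contiguous-genome.fa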

@lh3 (Owner) commented Sep 22, 2022

This is not expected. Do you happen to have a link to a fragmented assembly that I can play with?

@lh3 (Owner) commented Sep 22, 2022

I downloaded a human short-read assembly with 361,157 contigs. It took 2 minutes to index, comparable to indexing the human reference genome.

[M::[email protected]*1.00] read 2806031133 bases in 361157 contigs
[M::[email protected]*1.00] 22263202 blocks
[M::[email protected]*5.37] collected syncmers
[M::[email protected]*2.50] 847438891 kmer-block pairs
[M::mp_idx_print_stat] 1129657 distinct k-mers; mean occ of infrequent k-mers: 717.87; 313 frequent k-mers accounting for 36720278 occurrences
[M::main] Version: 0.3-r137

I guess something else is causing the slow indexing on your end. What sequences are you indexing?

@twrightsman (Author)

I am indexing short-read-based assemblies of grass genomes, which are unfortunately not public yet. I will re-run two of the assemblies, one fragmented and one not, to verify this wasn't due to something like transient storage latency on the compute cluster.

If you want me to run any diagnostic tools on the assemblies, like k-mer statistics, just let me know!
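
As an illustration of the kind of quick diagnostic that could be run, a couple of standard tools would cover contig and k-mer statistics (the tool choice here is only a suggestion, not something used in this thread):

seqkit stats -a short-read-genome.fa                                  # contig count, total length, N50
jellyfish count -m 21 -s 1G -t 8 -C -o kmers.jf short-read-genome.fa  # count canonical 21-mers
jellyfish histo kmers.jf > kmer-histogram.txt                         # k-mer occurrence histogram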

@twrightsman (Author) commented Sep 26, 2022

Here is the time output from indexing one of the "problematic" short-read assemblies:

[M::[email protected]*1.00] read 2526970524 bases in 5355222 contigs
[M::[email protected]*1.00] 25107504 blocks
[M::[email protected]*8.00] collected syncmers
[M::[email protected]*7.99] 772920789 kmer-block pairs
[M::mp_idx_print_stat] 1127614 distinct k-mers; mean occ of infrequent k-mers: 677.41; 120 frequent k-mers accounting for 9148986 occurrences
[M::main] Version: 0.3-r142-dirty
[M::main] CMD: ./miniprot/miniprot -t8 -d short-read-genome.mpi short-read-genome.fa
[M::main] Real time: 15842.331 sec; CPU: 126531.617 sec; Peak RSS: 19.614 GB
        Command being timed: "./miniprot/miniprot -t8 -d short-read-genome.mpi short-read-genome.fa"
        User time (seconds): 126512.43
        System time (seconds): 19.17
        Percent of CPU this job got: 798%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 4:24:02
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 20566688
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 997655
        Voluntary context switches: 29
        Involuntary context switches: 217125
        Swaps: 0
        File system inputs: 0
        File system outputs: 8986760
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

The long-read ones index just fine (<2 mins):

[M::[email protected]*1.00] read 4598175493 bases in 535 contigs
[M::[email protected]*1.00] 35923756 blocks
[M::[email protected]*6.15] collected syncmers
[M::[email protected]*4.79] 1479416713 kmer-block pairs
[M::mp_idx_print_stat] 1129322 distinct k-mers; mean occ of infrequent k-mers: 1288.04; 348 frequent k-mers accounting for 25252416 occurrences
[M::main] Version: 0.3-r142-dirty
[M::main] CMD: ./miniprot/miniprot -t8 -d long-read-genome.mpi long-read-genome.fa
[M::main] Real time: 90.484 sec; CPU: 249.604 sec; Peak RSS: 22.869 GB
        Command being timed: "./miniprot/miniprot -t8 -d long-read-genome.mpi long-read-genome.fa"
        User time (seconds): 238.54
        System time (seconds): 11.06
        Percent of CPU this job got: 275%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:30.48
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 23979400
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 2047523
        Voluntary context switches: 41
        Involuntary context switches: 728
        Swaps: 0
        File system inputs: 0
        File system outputs: 16310512
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

@lh3 (Owner) commented Sep 26, 2022

There may be something unique to your short-read assembly. In #13, someone indexed a 22.5 Gb short-read assembly in half an hour, which is about the expected time.

Could you run indexing under the perf command with:

perf record miniprot -t8 -d short-read-genome.mpi short-read-genome.fa
perf report

And then take a screenshot of the perf report? This will help me pinpoint the exact function causing the slowness. Thank you. I will also take a look at the source code on my end, but without the actual data it may be difficult to identify the problem.
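
If a screenshot is awkward, the same profile can also be dumped as text; adding -g to capture call graphs is optional (this is just a suggested variant of the commands above):

perf record -g miniprot -t8 -d short-read-genome.mpi short-read-genome.fa
perf report --stdio > perf-report.txt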

lh3 added the bug label on Sep 26, 2022
@twrightsman (Author)

Here are the perf report screenshots for the short-read indexing:
[screenshot: short-read perf report]
...and for the long-read indexing:
[screenshot: long-read perf report]

I let the short-read indexing run for about 15 minutes to collect plenty of samples, while the long-read indexing finished completely during sampling.

@lh3 (Owner) commented Sep 26, 2022

This is very helpful. I will have a look when I get time.

lh3 closed this as completed in commit 4137611 on Sep 27, 2022
@lh3 (Owner) commented Sep 27, 2022

I believe this has been fixed at the GitHub HEAD. Let me know if you still see the same issue. Thank you for the perf run; it was very informative.
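
One way to pick up the fix is to rebuild from the current HEAD (assuming a standard from-source build; miniprot needs only zlib and make) and re-run the indexing:

git clone https://github.com/lh3/miniprot
cd miniprot && make
./miniprot -t8 -d short-read-genome.mpi short-read-genome.fa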

@twrightsman (Author)

From 4 hours to 40 seconds! Thank you for your time, and I'm glad I could help in some way.
