Indexing is much slower on fragmented assemblies #10
Comments
This is not expected. Do you happen to have a link to a fragmented assembly that I can experiment with?
I downloaded a human short-read assembly with 361,157 contigs. It took 2 minutes to index, comparable to indexing the human reference genome.
[M::[email protected]*1.00] read 2806031133 bases in 361157 contigs
[M::[email protected]*1.00] 22263202 blocks
[M::[email protected]*5.37] collected syncmers
[M::[email protected]*2.50] 847438891 kmer-block pairs
[M::mp_idx_print_stat] 1129657 distinct k-mers; mean occ of infrequent k-mers: 717.87; 313 frequent k-mers accounting for 36720278 occurrences
[M::main] Version: 0.3-r137
I guess something else is causing the slow indexing on your end. What sequences are you indexing?
I am indexing short-read-based assemblies of grass genomes, which are unfortunately not public yet. I will re-run two of the assemblies, one fragmented and one not, to verify this wasn't due to something like random storage latency on the compute cluster. If you want me to run any diagnostic tools on the assemblies, like k-mer statistics, just let me know!
Here is the
The long-read ones index just fine (<2 mins):
There may be something unique to your short-read assembly. In #13, someone indexed a 22.5 Gb short-read assembly in half an hour, which is about the right time. Could you run indexing under the perf command with:
perf record miniprot -t8 -d short-read-genome.mpi short-read-genome.fa
perf report
And then take a screenshot of
This is very helpful. I will have a look when I get time.
I believe this has been fixed on the GitHub HEAD. Let me know if you still have the same issue. Thank you for the perf run. Very informative.
From 4 hours to 40 seconds! Thank you for your time and glad I could help in some way. |
I have noticed that indexing takes much longer (>1 hr) on fragmented assemblies that have a few million contigs. Assemblies of similar total assembly length but with hundreds of contigs are much faster, being indexed in minutes. The total assembly sizes are around a gigabase.
This may be expected behavior, and possibly related to #3.
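For anyone comparing assemblies in a report like this, a minimal Python sketch for quantifying how fragmented an assembly is (contig count, total length, N50) may be useful. It is not part of miniprot; the function names `contig_lengths` and `assembly_stats` are hypothetical, and it assumes an uncompressed FASTA file or stream.

```python
import io


def contig_lengths(handle):
    """Yield the length of each record in a FASTA text stream."""
    length = None  # None until the first header line is seen
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif line and length is not None:
            length += len(line)
    if length is not None:
        yield length


def assembly_stats(lengths):
    """Return contig count, total bases, and N50 for a list of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    n50 = 0
    running = 0
    for n in ordered:
        running += n
        # N50: length of the contig at which half of the total is covered
        if 2 * running >= total:
            n50 = n
            break
    return {"contigs": len(ordered), "total_bases": total, "n50": n50}


# Tiny in-memory example with three contigs of lengths 6, 2 and 1
demo = io.StringIO(">a\nACGT\nAC\n>b\nAC\n>c\nA\n")
print(assembly_stats(contig_lengths(demo)))
# prints {'contigs': 3, 'total_bases': 9, 'n50': 6}
```

For a real file, pass an open handle instead of the in-memory stream, for example `assembly_stats(contig_lengths(open("short-read-genome.fa")))`. Two assemblies with similar `total_bases` but very different `contigs` and `n50` values would match the situation described above.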