Seg-fault during index formation #3

Closed
rwhetten opened this issue Sep 12, 2022 · 5 comments
Labels
bug Something isn't working

Comments

@rwhetten

I cloned the repo and compiled the code, but I get a segmentation fault when trying to index a fragmented genome with 1.75 million scaffolds. The executable works fine to make an index of GRCh38 (including all alternate scaffolds, so 63 Gb total), so it doesn't appear to be the software itself.
Is there a limit on the number of scaffolds in an assembly for indexing? Alternatively, are there characters that might cause problems if present in scaffold names?

@lh3
Owner

lh3 commented Sep 12, 2022

Please try the latest version from GitHub HEAD. There was a bug, though I am not sure if that would lead to a segfault.

@rwhetten
Author

I used git pull, make clean, and make; then tried the index building job again. It ran for longer this time, and wrote the following to stderr:
[M::[email protected]*0.99] read 22104357184 bases in 1755249 contigs
[M::[email protected]*0.99] 174414660 blocks
[M::[email protected]*14.65] collected syncmers
/var/spool/slurm/slurmd/job5292490/slurm_script: line 22: 777006 Segmentation fault
The command used was ~/miniprot -t16 -d $INDEX $GENOME; RAM use reached 100 GB and the runtime was 21 minutes.

@lh3
Owner

lh3 commented Sep 12, 2022

One potential cause is memory. The Ensembl version of GRCh38 has many ambiguous bases. Although the total contig length is 63 Gb, there is only ~3.2 Gb of actual sequence. Your assembly is 7 times larger. I guess it will take 120-150 GB of memory for indexing.
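As a rough check of the "7 times larger" figure, the arithmetic can be sketched as below. The base count is taken from the stderr output above and the ~3.2 Gb figure from this comment; the variable names are illustrative only, and nothing here models how miniprot actually allocates memory.

```python
# Back-of-the-envelope size comparison; not a model of miniprot's memory use.
assembly_bases = 22_104_357_184   # "read 22104357184 bases" from the stderr log
grch38_real_bases = 3.2e9         # approximate non-ambiguous sequence in GRCh38

ratio = assembly_bases / grch38_real_bases
print(f"assembly is about {ratio:.1f}x the non-ambiguous GRCh38 sequence")
```

The ratio comes out at about 6.9, consistent with the estimate of 7.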

@rwhetten
Author

The node that was running the job had 370 GB of RAM allocated, and the output doesn't indicate an out-of-memory error in any way I recognize. The exit code was 139, and RAM use peaked at 100.5 GB. Would non-alphanumeric, non-underscore characters (such as space or dot) in scaffold names be a problem?
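For reference, shells report a process killed by a signal as 128 plus the signal number, so an exit code of 139 corresponds to signal 11, which is SIGSEGV on Linux. A process killed by the kernel's out-of-memory killer would normally receive SIGKILL instead (exit code 137). A minimal sketch of that decoding follows; the helper name describe_exit_code is made up for illustration, and the signal numbering assumes a Linux host.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a shell-style exit code into a readable description."""
    if code > 128:
        signum = code - 128  # shells encode "killed by signal N" as 128 + N
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"unknown signal {signum}"
        return f"terminated by {name}"
    return f"exited with status {code}"


print(describe_exit_code(139))  # the code reported for this job
print(describe_exit_code(137))  # what a SIGKILL (e.g. from the OOM killer) looks like
```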
Thinking of work-arounds - is there any way to merge indexes of genome subsets into a single index after they are created? I could split the genome into 8 subsets and index them separately. If indexes can't be joined, I could align them separately, with the loss of some information.
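The splitting step of this work-around could be sketched as follows. This is only an illustration: the function names, the output naming scheme, and the greedy length balancing are assumptions, and nothing in this thread confirms that separately built miniprot indexes can be merged afterwards.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_fasta(in_path: str, out_prefix: str, n_parts: int = 8) -> List[str]:
    """Stream records into n_parts files, keeping total lengths roughly equal.

    Each record goes to whichever part currently holds the fewest bases.
    Output names (<out_prefix>.partN.fa) are a hypothetical convention.
    """
    paths = [f"{out_prefix}.part{i}.fa" for i in range(n_parts)]
    outs = [open(p, "w") for p in paths]
    heap = [(0, i) for i in range(n_parts)]  # (bases written so far, part index)
    try:
        with open(in_path) as fh:
            for header, seq in read_fasta(fh):
                total, i = heapq.heappop(heap)
                outs[i].write(f">{header}\n{seq}\n")
                heapq.heappush(heap, (total + len(seq), i))
    finally:
        for out in outs:
            out.close()
    return paths
```

Each resulting file could then be indexed on its own with the same command as above (~/miniprot -t16 -d $INDEX $GENOME, with $GENOME pointing at one part).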

@lh3 lh3 added the bug Something isn't working label Sep 22, 2022
@lh3
Owner

lh3 commented Sep 22, 2022

The segmentation fault should be caused by #4, which has been fixed. Let me know if you still have the problem. I am closing this issue for now.

@lh3 lh3 closed this as completed Sep 22, 2022