-
Notifications
You must be signed in to change notification settings - Fork 445
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixes involving CRAM and header API for long references. #976
Open
jkbonfield
wants to merge
3
commits into
samtools:develop
Choose a base branch
from
jkbonfield:long_cigar2
base: develop
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
We can now have many cigar ops (eg ref-skips) that combine to more than 512Mb. This isn't a full fix for long references as CRAM sequence positions are still 32-bit. However this is a starting point down that road and there are still sizes between 28-bit and 32-bit where the CIGAR failed to operate correctly.
The h->sdict hash is used to track references that are > 4Gb in size. The dup code didn't copy this. This manifested itself as CRAM SQ headers being truncated (read SAM hdr, dup, write as CRAM hdr). To fix this a function was written that creates or updates the sdict from the hrecs parsed header structs. It's possible this should be called directly from the sam_hdr_create function (part of the SAM format parser) instead of manually keeping track of sdict itself, however doing so would require initialising the new header structs so I haven't done this. This is a general utility, so perhaps should be made a public part of the header API. However IMO the new header API should hide this nuance away and just return the correct data, also ensuring that header updates work correctly and honour the text form. Since c83c9e2 the header API also was using the 32-bit capped target_len in preference to the parsed text from SQ LN fields when they differed. I am assuming this was a decision in what takes priority in BAM where the sequence names and lengths exist in both text and binary form. This commit reverses this and makes the text form always take priority. As this is at least required in some scenarios (long references) it seems easier to simply make it apply in all scenarios.
jkbonfield
force-pushed
the
long_cigar2
branch
from
November 15, 2019 12:33
a914023
to
3c079af
Compare
valeriuo
reviewed
Nov 18, 2019
sam.c
Outdated
@@ -179,6 +226,14 @@ sam_hdr_t *sam_hdr_dup(const sam_hdr_t *h0) | |||
h->text[h->l_text] = '\0'; | |||
} | |||
|
|||
if (h0->sdict) { | |||
// We have a reason to use new API, so build it so we |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jkbonfield As discussed, this would be better addressed inside sam_hdr_update_target_arrays
.
…h `target_name` and `target_len`. Rename `rebuild_target_arrays` to `sam_hdr_rebuild_target_arrays`. Remove unnecessary external condition on `hrecs->refs_changed`. Rearrange the position of `sam_hrecs_rebuild_text`, to reflect its internal usage.
Moving the code to |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This is obviously only partial because we need CRAM4 for long reads, but it's now better at reporting errors and also now copes correctly with the cases inbetween 28-bit and 32-bit sizes for long cigar ops.
The header API has also had some fixes, which were discovered in the process of updating CRAM. The main culprit was sam_hdr_dup which didn't honour h->sdict.