Minimal support for CRAM files with missing @RG headers. #1480

The SAMtags spec states that RG:Z: tags should match an RG ID
if RG headers are present, but doesn't explicitly require them to be
present. The SAM spec itself recommends that RG headers are present.
Sadly this means CRAM may need to cope with this semantically
inconsistent edge case.

CRAM stores RG as an integer data series that indexes into the
corresponding @RG header lines, in much the same way that BAM stores
chromosomes as numeric "tid" values, which makes this challenging.
However, CRAM can also store text tags, so it's possible to round-trip
records with missing headers by claiming RG is -1 (unspecified) and then
adding a verbatim RG:Z string tag. This is perhaps a bit of a CRAM spec
loophole, so it's questionable whether this is the correct solution.
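
As a rough illustration of that fallback, the per-record decision could look something like the sketch below. This is not the htslib implementation: `rg_encoding` and `encode_rg` are made-up names for this example, and the header is reduced to a plain array of @RG ID strings.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result of encoding one record's read group. */
typedef struct {
    int rg_index;      /* value for the integer RG data series; -1 = unspecified */
    int keep_text_tag; /* non-zero: also store RG:Z:<id> verbatim as a text tag */
} rg_encoding;

/*
 * header_ids: IDs of the @RG header lines, in header order (may be empty).
 * rg_value:   the record's RG:Z value, or NULL if the record has no RG tag.
 */
rg_encoding encode_rg(const char *const *header_ids, size_t n_ids,
                      const char *rg_value)
{
    rg_encoding enc = { -1, 0 };
    assert(header_ids != NULL || n_ids == 0);

    if (rg_value == NULL)
        return enc;                 /* no RG tag at all: plain "unspecified" */

    for (size_t i = 0; i < n_ids; i++) {
        if (strcmp(header_ids[i], rg_value) == 0) {
            enc.rg_index = (int)i;  /* normal case: index into the header */
            return enc;
        }
    }

    /* RG:Z is present but no @RG line matches: keep the tag verbatim. */
    enc.keep_text_tag = 1;
    return enc;
}
```

A decoder that sees RG of -1 together with an RG:Z text tag simply emits the tag as-is, which is why the record survives a round trip even though the header has nothing to index into.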

This works and is decodable by both htslib and htsjdk, but it'll break
things like cram_transcode_rg as used by samtools cat. I think missing
RG headers combined with samtools cat is a pretty unlikely combination
of events. Note Picard's SamFormatConverter also drops these RG fields.

This code also whinges, once for each and every problematic alignment
record, when its RG is absent from the SAM header. It's considerably
more work to track which ones we've warned about before and to track all
that metadata across threads in a robust manner, plus this really
could be considered to be a poor SAM file. Were it not for the SAM
spec explicitly permitting such things (even if recommending against
them) I'd reject it outright. Instead, brow-beating the SAM creators
into fixing the headers could be considered to be a positive outcome.

Fixes #1479