Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minimal support for CRAM files with missing @RG headers. #1480

Merged
merged 1 commit into from
Jul 27, 2022

Conversation

jkbonfield
Copy link
Contributor

The SAMtags spec states that RG:Z: lines should point match an RG ID
if RG headers are present, but doesn't explicitly require them to be
present. The SAM spec itself recommends that RG headers are present.
Sadly this means CRAM may need to cope with this semantically
inconsistent edge case.

Given CRAM stores RG as an integer data series as an index into the
corresponding header, in much the same way that BAM stores chromosomes
as numeric "tid" values, this makes things challenging. However CRAM
can also store text tags, so it's possible to round-trip with missing
headers by claiming RG is -1 (unspecified) and then adding a verbatim
RG:Z string tag. This is perhaps a bit of a CRAM spec loop hole so
it's questionable if this is the correct solution.

This works and is decodable by both htslib and htsjdk, but it'll break
things like cram_transcode_rg as used by samtools cat. I think this
is a pretty unlikely combination of events. Note picard's
SamFormatConverter also drops these RG fields.

This code also whinges, once for each and every problematic alignment
record
, when RG is absent in the SAM header. It's considerably more
work to track which ones we've warned about before and to track all
that meta-data across threads in a robust manner, plus this really
could be considered to be a poor SAM file. Were it not for the SAM
spec explicitly permitting such things (even if recommending against
it) I'd reject it outright. Instead brow-beating the SAM creators
into fixing the headers could be considered to be a positive outcome.

Fixes #1479

@jkbonfield jkbonfield marked this pull request as ready for review July 21, 2022 14:41
@jkbonfield
Copy link
Contributor Author

Note I'm also open to the simpler strategy of just dropping the RG:Z: tag (as we do now) when they miss a header, but with the addition of a spammy warning message. It makes it compatible with Picard.

It depends if we wish to be overly precise in accepting things in the specification, even when they're really a bad idea if not outright wrong, or whether we wish to put our own spin on things and enforce good behaviour.

The SAMtags spec states that RG:Z: lines should point match an RG ID
if RG headers are present, but doesn't explicitly *require* them to be
present.  The SAM spec itself recommends that RG headers are present.
Sadly this means CRAM may need to cope with this semantically
inconsistent edge case.

Given CRAM stores RG as an integer data series as an index into the
corresponding header, in much the same way that BAM stores chromosomes
as numeric "tid" values, this makes things challenging.  However CRAM
can also store text tags, so it's possible to round-trip with missing
headers by claiming RG is -1 (unspecified) and then adding a verbatim
RG:Z string tag.  This is perhaps a bit of a CRAM spec loop hole so
it's questionable if this is the correct solution.

This works and is decodable by both htslib and htsjdk, but it'll break
things like cram_transcode_rg as used by samtools cat.  I think this
is a pretty unlikely combination of events.  Note picard's
SamFormatConverter also drops these RG fields.

This code also whinges, *once for each and every problematic alignment
record*, when RG is absent in the SAM header.  It's considerably more
work to track which ones we've warned about before and to track all
that meta-data across threads in a robust manner, plus this really
could be considered to be a poor SAM file.  Were it not for the SAM
spec explicitly permitting such things (even if recommending against
it) I'd reject it outright.  Instead brow-beating the SAM creators
into fixing the headers could be considered to be a positive outcome.

Fixes samtools#1479
@daviesrob
Copy link
Member

Seems to work, and probably better than losing the tags completely. We'll see if anyone complains about the warning messages.

@daviesrob daviesrob merged commit c5508d5 into samtools:develop Jul 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

RG tag gets lost in cram format
2 participants