-
Notifications
You must be signed in to change notification settings - Fork 442
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cram embed ref2 #1442
Cram embed ref2 #1442
Conversation
Some stats based on 10 million NovaSeq alignments (without MD tags)
Conclusions:
|
We compute reference from seq + cigar + MD tag. This gets used immediately as it's part of the reference-based encoding in process_one_read. If we have inconsistent records where the inferred reference differs, then we would end up with an invalid embedded reference and decode errors. Instead we just accept the first one as we cannot go back and correct earlier mistakes. It's a little tricky when the sequence may contain "N" characters, as that's also the initial value of unassigned reference, so we have to be careful there. This also opens the way for handling CRAM files without any MD tags. In that scenario we just assume there are no SNPs and create a fake MD tag (just numbers merging match/mismatch and ^N* for deletions). So we can now do embedded reference on files with no reference and no MD. It's not quite consensus (which would be better), but that would require a two-pass encoding strategy in place of process_one_read. So it's a bit poor on high-error technologies. Note, when embedded auto-generated reference where MD tags may be absent, the auto-decoded MD may be incorrect as the reference we have embedded is now the sequence as-is rather than the real sequence.
This is used to indicate when MD and NM were not present and should not be regenerated during decode. Bit 1 is set when MD shouldn't be produced and bit 2 when NM shouldn't be. In both cases this tag is only created when embed_ref=2 and MD and/or NM is absent from the input data. In this scenario we cannot reproduce the reference from SEQ+MD and cannot therefore be certain that the value reproduced is correct. E.g. if the reference is produced by consensus alone, then MD is a diff of this read vs consensus and not this read vs the original reference used in by the aligner. The cF tag is automatically stripped out during decode, but only with this version of htslib and above. Older versions will just emit a private-space aux tag (which hopefully is harmless except for the unlikely event of it clashing with another private name-space tool).
59bff11
to
7039fc9
Compare
Some stats on the GIAB ONT data; HG002.GRCh38.UL-ONT.realign.bam chr1:10-20M. Total size, format, sequence-part and notes.
Here auto-generation of reference, even via the naive method, is much better than no reference at all. Better even than using slow LZMA to compress the no-ref full sequences. Once again this demonstrates that auto-reference generation works, but could be better if we built a proper consensus instead of the existing naive "first come first serve" generation. That would require a major change in code though, but it demonstrates (poor implementation aside) what's feasible within the CRAM format. Using a consensus is bad if we have MD/NM tags as we then have to store them verbatim when they differ, due to the consensus causing auto-generation to be incorrect. However if we have MD/NM then the existing embed_ref=2 using SEQ+MD is sufficient and marginally better anyway. |
cram/cram_encode.c
Outdated
// based on MD:Z tags. | ||
if ((c->ref_id = bam_ref(b)) >= 0) { | ||
c->ref_free = 1; | ||
if (c->ref) abort(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should there really be an abort here? It would just crash the calling program without any explanation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Err no! I'm not sure why that's there. Thanks for spotting it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this was added temporarily to convince myself that c->ref
always will infact be NULL at this point anyway. I've stressed tested it on a variety of scenarios, and convinced myself it's true, which means the neighbour c->ref = NULL
call is redundant. However just incase there is a coding error I'm leaving that there are ditching the check which shouldn't ever happen anyway.
In stress testing it though I did spot a couple other bugs, which have now been fixed also.
Bah, I see it's also failing some tests with multi-ref enabled. I'll need to do more work on this then. |
Also fixed a potential illegal memory access caused by the return value of cram_encode_aux, and added more belt and braces memory free requests for the new embed_ref=2 option.
aff0aaf
to
9da79aa
Compare
For discussion.
#1429 exposes more of the CRAM API to allow a
samtools reference
command that can extract the embedded reference blocks from a CRAM file, permitting us to create a fasta file with which to reencode the CRAM again. This is laborious, but a necessary step when dealing with our current implementations and with third party tools.Better however would be on-the-fly reference construction. That's this PR. However it's more debatable in how far it goes. There are two main solutions.
Use SEQ field + MD:Z: aux tag to regenerate the reference during encode. Because we don't have a "load 10,000 reads, encode 10,000 reads" model and instead it's "load read, encode read, repeat until slice full" we don't have a logical place to generate the reference just before the slice is written. However MD should be consistent from one record to the next (and we check it is) so with MD:Z: we can build our reference a little bit at a time and use what we have so far to delta-encode each reading. This works well and it's basically the same as using an external reference file.
Using SEQ field alone. This is done when MD is absent. Due to the aforementioned algorithm structure we can't compute the consensus. What we can do however is just make a wild assumption that the first time we see a sequence aligned against a new reference coordinate, that the sequence is correct. This isn't optimal, but it does still work well (particularly for low-error rate data) and means we never go back to change a reference and hence we can use the same on-the-fly construction approach.
Both of these methods are enabled when using "embed_ref=2", instead of the old "embed_ref" / "embed_ref=1" option. I think maybe this should be merged to a single value so if the external reference is specified then we use it and if not then we just generate it as needed.
The only slight fly in the ointment comes from the second strategy above for generating an embedded reference when MD is absent. In this scenario any auto-generated MD/NM tags have a significant chance of being incorrect, as we simply don't know what the original reference used really was. CRAM 3.x has no way to indicate that MD/NM should not be created on-the-fly. (CRAM 4.0 adds this.) So to avoid recreating invalid fields, I've also added a new implementation-specific aux tag
cF
(CRAM flags). It's stripped out on decode, but only by this version and above obviously. Third party tools or old htslib versions will leave this in place (as well as potentially auto-generating incorrect MD/NM, but there's little we can do about that I fear). If anyone has suggestions for how to handle this better I'm all ears. I did think of maybe generating invalid or blank MD (egMD:Z:*
orMD:Z:
) as a marker for it not being valid, but some tools may well just choke. This is clearly one of those hindsight moments, where MD should always have been an explicit flag requesting auto-rebuild rather than an automatic assumption.