Cram embed ref2 #1442

jkbonfield · 2022-05-30T14:30:04Z

For discussion.

#1429 exposes more of the CRAM API to allow a samtools reference command that can extract the embedded reference blocks from a CRAM file, permitting us to create a fasta file with which to reencode the CRAM again. This is laborious, but a necessary step when dealing with our current implementations and with third party tools.

Better however would be on-the-fly reference construction. That's this PR. However it's more debatable in how far it goes. There are two main solutions.

Use SEQ field + MD:Z: aux tag to regenerate the reference during encode. Because we don't have a "load 10,000 reads, encode 10,000 reads" model and instead it's "load read, encode read, repeat until slice full" we don't have a logical place to generate the reference just before the slice is written. However MD should be consistent from one record to the next (and we check it is) so with MD:Z: we can build our reference a little bit at a time and use what we have so far to delta-encode each reading. This works well and it's basically the same as using an external reference file.
Using SEQ field alone. This is done when MD is absent. Due to the aforementioned algorithm structure we can't compute the consensus. What we can do however is just make a wild assumption that the first time we see a sequence aligned against a new reference coordinate, that the sequence is correct. This isn't optimal, but it does still work well (particularly for low-error rate data) and means we never go back to change a reference and hence we can use the same on-the-fly construction approach.

Both of these methods are enabled when using "embed_ref=2", instead of the old "embed_ref" / "embed_ref=1" option. I think maybe this should be merged to a single value so if the external reference is specified then we use it and if not then we just generate it as needed.

The only slight fly in the ointment comes from the second strategy above for generating an embedded reference when MD is absent. In this scenario any auto-generated MD/NM tags have a significant chance of being incorrect, as we simply don't know what the original reference used really was. CRAM 3.x has no way to indicate that MD/NM should not be created on-the-fly. (CRAM 4.0 adds this.) So to avoid recreating invalid fields, I've also added a new implementation-specific aux tag cF (CRAM flags). It's stripped out on decode, but only by this version and above obviously. Third party tools or old htslib versions will leave this in place (as well as potentially auto-generating incorrect MD/NM, but there's little we can do about that I fear). If anyone has suggestions for how to handle this better I'm all ears. I did think of maybe generating invalid or blank MD (eg MD:Z:* or MD:Z:) as a marker for it not being valid, but some tools may well just choke. This is clearly one of those hindsight moments, where MD should always have been an explicit flag requesting auto-rebuild rather than an automatic assumption.

jkbonfield · 2022-05-30T15:34:16Z

Some stats based on 10 million NovaSeq alignments (without MD tags)

                All             Seq                                             
embed_ref=0     207671265       38616499                                        
embed_ref=1     219073497       49956891 (from external reference)              
embed_ref=1     219439104       50364923 (samtools consensus "simple" mode)     
embed_ref=1     224051270       54003084 (samtools consensus "gap5" mode)       
embed_ref=2     219074860       49883110 (with MD added via calmd)              
embed_ref=2     231005366       61852549 (No MD tags present)                   
no_ref          256286463       87114535                                        
BAM             571937666       ?

Conclusions:

I need to fix something as it broke the tests (oops: TODO!)
Constructing our own external reference based on samtools consensus works best in "simple" mode, but it's still marginally behind the actual reference. Maybe that's something to do with reference bias from the aligner? It's very similar though.
On-the-fly reference generation using SEQ + MD is basically the same as embedding the actual reference. This is as expected given correct MD.
The naive on-the-fly ref generation without MD isn't great. It just uses the previous read, which means errors can get baked in and become deltas for all subsequent alignments. However that's a more major code reorganisation so can be dealt with later.
Overall, faking up a rough on-the-fly reference even without MD tags is a sizeable improvement over no_ref mode. Figures may change by depth too - untested.

We compute reference from seq + cigar + MD tag. This gets used immediately as it's part of the reference-based encoding in process_one_read. If we have inconsistent records where the inferred reference differs, then we would end up with an invalid embedded reference and decode errors. Instead we just accept the first one as we cannot go back and correct earlier mistakes. It's a little tricky when the sequence may contain "N" characters, as that's also the initial value of unassigned reference, so we have to be careful there. This also opens the way for handling CRAM files without any MD tags. In that scenario we just assume there are no SNPs and create a fake MD tag (just numbers merging match/mismatch and ^N* for deletions). So we can now do embedded reference on files with no reference and no MD. It's not quite consensus (which would be better), but that would require a two-pass encoding strategy in place of process_one_read. So it's a bit poor on high-error technologies. Note, when embedded auto-generated reference where MD tags may be absent, the auto-decoded MD may be incorrect as the reference we have embedded is now the sequence as-is rather than the real sequence.

This is used to indicate when MD and NM were not present and should not be regenerated during decode. Bit 1 is set when MD shouldn't be produced and bit 2 when NM shouldn't be. In both cases this tag is only created when embed_ref=2 and MD and/or NM is absent from the input data. In this scenario we cannot reproduce the reference from SEQ+MD and cannot therefore be certain that the value reproduced is correct. E.g. if the reference is produced by consensus alone, then MD is a diff of this read vs consensus and not this read vs the original reference used in by the aligner. The cF tag is automatically stripped out during decode, but only with this version of htslib and above. Older versions will just emit a private-space aux tag (which hopefully is harmless except for the unlikely event of it clashing with another private name-space tool).

jkbonfield · 2022-05-31T10:34:53Z

Some stats on the GIAB ONT data; HG002.GRCh38.UL-ONT.realign.bam chr1:10-20M. Total size, format, sequence-part and notes.

237672974 BAM                                                                   
135303919 CRAM  27583376  external ref                                          
139709759 CRAM  32008185  embed_ref=1 (official reference)                                          
139710386 CRAM  32007405  embed_ref=2 (with MD)                                 
149371457 CRAM  41670856  embed_ref=2 (without MD)                              
205899946 CRAM  74416907  no_ref                                                
190610765 CRAM  59128442  no_ref + lzma                                         
159601412 CRAM  32395046  embed_ref=1 (ref via simple cons)
                51883407  " with MD:Z and NM:i tags included                    
159512509 CRAM  32307131  embed_ref=1 (ref via bayesian cons)

Here auto-generation of reference, even via the naive method, is much better than no reference at all. Better even than using slow LZMA to compress the no-ref full sequences.

Once again this demonstrates that auto-reference generation works, but could be better if we built a proper consensus instead of the existing naive "first come first serve" generation. That would require a major change in code though, but it demonstrates (poor implementation aside) what's feasible within the CRAM format.

Using a consensus is bad if we have MD/NM tags as we then have to store them verbatim when they differ, due to the consensus causing auto-generation to be incorrect. However if we have MD/NM then the existing embed_ref=2 using SEQ+MD is sufficient and marginally better anyway.

whitwham · 2022-06-01T10:53:19Z

cram/cram_encode.c

+            // based on MD:Z tags.
+            if ((c->ref_id = bam_ref(b)) >= 0) {
+                c->ref_free = 1;
+                if (c->ref) abort();


Should there really be an abort here? It would just crash the calling program without any explanation.

Err no! I'm not sure why that's there. Thanks for spotting it.

I think this was added temporarily to convince myself that c->ref always will infact be NULL at this point anyway. I've stressed tested it on a variety of scenarios, and convinced myself it's true, which means the neighbour c->ref = NULL call is redundant. However just incase there is a coding error I'm leaving that there are ditching the check which shouldn't ever happen anyway.

In stress testing it though I did spot a couple other bugs, which have now been fixed also.

jkbonfield · 2022-06-01T13:58:53Z

Bah, I see it's also failing some tests with multi-ref enabled. I'll need to do more work on this then.

Also fixed a potential illegal memory access caused by the return value of cram_encode_aux, and added more belt and braces memory free requests for the new embed_ref=2 option.

jkbonfield added 2 commits May 31, 2022 10:09

jkbonfield force-pushed the cram_embed_ref2 branch from 59bff11 to 7039fc9 Compare May 31, 2022 09:10

whitwham reviewed Jun 1, 2022

View reviewed changes

Remove unneeded abort check in cram encoding.

9da79aa

Also fixed a potential illegal memory access caused by the return value of cram_encode_aux, and added more belt and braces memory free requests for the new embed_ref=2 option.

jkbonfield force-pushed the cram_embed_ref2 branch from aff0aaf to 9da79aa Compare June 1, 2022 14:10

whitwham merged commit e978164 into samtools:develop Jun 1, 2022

jkbonfield mentioned this pull request Jun 6, 2022

Future direction of CRAM embed_ref #1445

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cram embed ref2 #1442

Cram embed ref2 #1442

jkbonfield commented May 30, 2022 •

edited

Loading

jkbonfield commented May 30, 2022

jkbonfield commented May 31, 2022

whitwham Jun 1, 2022

jkbonfield Jun 1, 2022

jkbonfield Jun 1, 2022

jkbonfield commented Jun 1, 2022

Cram embed ref2 #1442

Cram embed ref2 #1442

Conversation

jkbonfield commented May 30, 2022 • edited Loading

jkbonfield commented May 30, 2022

jkbonfield commented May 31, 2022

whitwham Jun 1, 2022

Choose a reason for hiding this comment

jkbonfield Jun 1, 2022

Choose a reason for hiding this comment

jkbonfield Jun 1, 2022

Choose a reason for hiding this comment

jkbonfield commented Jun 1, 2022

jkbonfield commented May 30, 2022 •

edited

Loading