Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cram embed ref2 #1442

Merged
merged 3 commits into from
Jun 1, 2022
Merged

Cram embed ref2 #1442

merged 3 commits into from
Jun 1, 2022

Conversation

jkbonfield
Copy link
Contributor

@jkbonfield jkbonfield commented May 30, 2022

For discussion.

#1429 exposes more of the CRAM API to allow a samtools reference command that can extract the embedded reference blocks from a CRAM file, permitting us to create a fasta file with which to reencode the CRAM again. This is laborious, but a necessary step when dealing with our current implementations and with third party tools.

Better however would be on-the-fly reference construction. That's this PR. However it's more debatable in how far it goes. There are two main solutions.

  1. Use SEQ field + MD:Z: aux tag to regenerate the reference during encode. Because we don't have a "load 10,000 reads, encode 10,000 reads" model and instead it's "load read, encode read, repeat until slice full" we don't have a logical place to generate the reference just before the slice is written. However MD should be consistent from one record to the next (and we check it is) so with MD:Z: we can build our reference a little bit at a time and use what we have so far to delta-encode each reading. This works well and it's basically the same as using an external reference file.

  2. Using SEQ field alone. This is done when MD is absent. Due to the aforementioned algorithm structure we can't compute the consensus. What we can do however is just make a wild assumption that the first time we see a sequence aligned against a new reference coordinate, that the sequence is correct. This isn't optimal, but it does still work well (particularly for low-error rate data) and means we never go back to change a reference and hence we can use the same on-the-fly construction approach.

Both of these methods are enabled when using "embed_ref=2", instead of the old "embed_ref" / "embed_ref=1" option. I think maybe this should be merged to a single value so if the external reference is specified then we use it and if not then we just generate it as needed.

The only slight fly in the ointment comes from the second strategy above for generating an embedded reference when MD is absent. In this scenario any auto-generated MD/NM tags have a significant chance of being incorrect, as we simply don't know what the original reference used really was. CRAM 3.x has no way to indicate that MD/NM should not be created on-the-fly. (CRAM 4.0 adds this.) So to avoid recreating invalid fields, I've also added a new implementation-specific aux tag cF (CRAM flags). It's stripped out on decode, but only by this version and above obviously. Third party tools or old htslib versions will leave this in place (as well as potentially auto-generating incorrect MD/NM, but there's little we can do about that I fear). If anyone has suggestions for how to handle this better I'm all ears. I did think of maybe generating invalid or blank MD (eg MD:Z:* or MD:Z:) as a marker for it not being valid, but some tools may well just choke. This is clearly one of those hindsight moments, where MD should always have been an explicit flag requesting auto-rebuild rather than an automatic assumption.

@jkbonfield
Copy link
Contributor Author

Some stats based on 10 million NovaSeq alignments (without MD tags)

                All             Seq                                             
embed_ref=0     207671265       38616499                                        
embed_ref=1     219073497       49956891 (from external reference)              
embed_ref=1     219439104       50364923 (samtools consensus "simple" mode)     
embed_ref=1     224051270       54003084 (samtools consensus "gap5" mode)       
embed_ref=2     219074860       49883110 (with MD added via calmd)              
embed_ref=2     231005366       61852549 (No MD tags present)                   
no_ref          256286463       87114535                                        
BAM             571937666       ?                                               

Conclusions:

  1. I need to fix something as it broke the tests (oops: TODO!)

  2. Constructing our own external reference based on samtools consensus works best in "simple" mode, but it's still marginally behind the actual reference. Maybe that's something to do with reference bias from the aligner? It's very similar though.

  3. On-the-fly reference generation using SEQ + MD is basically the same as embedding the actual reference. This is as expected given correct MD.

  4. The naive on-the-fly ref generation without MD isn't great. It just uses the previous read, which means errors can get baked in and become deltas for all subsequent alignments. However that's a more major code reorganisation so can be dealt with later.

  5. Overall, faking up a rough on-the-fly reference even without MD tags is a sizeable improvement over no_ref mode. Figures may change by depth too - untested.

We compute reference from seq + cigar + MD tag.  This gets used
immediately as it's part of the reference-based encoding in
process_one_read.

If we have inconsistent records where the inferred reference differs,
then we would end up with an invalid embedded reference and decode
errors.  Instead we just accept the first one as we cannot go back and
correct earlier mistakes.  It's a little tricky when the sequence may
contain "N" characters, as that's also the initial value of unassigned
reference, so we have to be careful there.

This also opens the way for handling CRAM files without any MD tags.
In that scenario we just assume there are no SNPs and create a fake MD
tag (just numbers merging match/mismatch and ^N* for deletions).  So
we can now do embedded reference on files with no reference and no
MD.  It's not quite consensus (which would be better), but that would
require a two-pass encoding strategy in place of process_one_read.  So
it's a bit poor on high-error technologies.

Note, when embedded auto-generated reference where MD tags may be
absent, the auto-decoded MD may be incorrect as the reference we have
embedded is now the sequence as-is rather than the real sequence.
This is used to indicate when MD and NM were not present and should
not be regenerated during decode.

Bit 1 is set when MD shouldn't be produced and bit 2 when NM shouldn't
be.  In both cases this tag is only created when embed_ref=2 and MD
and/or NM is absent from the input data.  In this scenario we cannot
reproduce the reference from SEQ+MD and cannot therefore be certain
that the value reproduced is correct.  E.g. if the reference is
produced by consensus alone, then MD is a diff of this read vs
consensus and not this read vs the original reference used in by the
aligner.

The cF tag is automatically stripped out during decode, but only with
this version of htslib and above.  Older versions will just emit a
private-space aux tag (which hopefully is harmless except for the
unlikely event of it clashing with another private name-space tool).
@jkbonfield
Copy link
Contributor Author

Some stats on the GIAB ONT data; HG002.GRCh38.UL-ONT.realign.bam chr1:10-20M. Total size, format, sequence-part and notes.

237672974 BAM                                                                   
135303919 CRAM  27583376  external ref                                          
139709759 CRAM  32008185  embed_ref=1 (official reference)                                          
139710386 CRAM  32007405  embed_ref=2 (with MD)                                 
149371457 CRAM  41670856  embed_ref=2 (without MD)                              
205899946 CRAM  74416907  no_ref                                                
190610765 CRAM  59128442  no_ref + lzma                                         
159601412 CRAM  32395046  embed_ref=1 (ref via simple cons)
                51883407  " with MD:Z and NM:i tags included                    
159512509 CRAM  32307131  embed_ref=1 (ref via bayesian cons)

Here auto-generation of reference, even via the naive method, is much better than no reference at all. Better even than using slow LZMA to compress the no-ref full sequences.

Once again this demonstrates that auto-reference generation works, but could be better if we built a proper consensus instead of the existing naive "first come first serve" generation. That would require a major change in code though, but it demonstrates (poor implementation aside) what's feasible within the CRAM format.

Using a consensus is bad if we have MD/NM tags as we then have to store them verbatim when they differ, due to the consensus causing auto-generation to be incorrect. However if we have MD/NM then the existing embed_ref=2 using SEQ+MD is sufficient and marginally better anyway.

// based on MD:Z tags.
if ((c->ref_id = bam_ref(b)) >= 0) {
c->ref_free = 1;
if (c->ref) abort();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there really be an abort here? It would just crash the calling program without any explanation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Err no! I'm not sure why that's there. Thanks for spotting it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this was added temporarily to convince myself that c->ref always will infact be NULL at this point anyway. I've stressed tested it on a variety of scenarios, and convinced myself it's true, which means the neighbour c->ref = NULL call is redundant. However just incase there is a coding error I'm leaving that there are ditching the check which shouldn't ever happen anyway.

In stress testing it though I did spot a couple other bugs, which have now been fixed also.

@jkbonfield
Copy link
Contributor Author

Bah, I see it's also failing some tests with multi-ref enabled. I'll need to do more work on this then.

Also fixed a potential illegal memory access caused by the return
value of cram_encode_aux, and added more belt and braces memory free
requests for the new embed_ref=2 option.
@whitwham whitwham merged commit e978164 into samtools:develop Jun 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants