Future direction of CRAM embed_ref #1445

jkbonfield · 2022-06-06T11:31:18Z

Following merging of #1442, there is now potential for further improvements.

The current situation is:

- By default, CRAM requires an external reference, and it's a hard failure if one isn't found.
- If the embed_ref option is set (it defaults to level 1), or embed_ref=1 is given, then an external reference is required during encode but not during decode, which uses the embedded one instead. However, decoded files need the external reference specified again if they are subsequently converted back into CRAM.
- If the embed_ref=2 option is set, an external reference is no longer required: either MD:Z tags are used to infer it, or a faked-up reference is created (with potential downstream MD:Z tag generation issues).
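
To make the three cases concrete, here is a minimal Python sketch that only models the behaviour described above. It is not htslib code: `RefNeeds` and `reference_needs` are hypothetical names, and treating the default as level 0 is an assumption made for the model.

```python
from typing import NamedTuple


class RefNeeds(NamedTuple):
    """Whether an external reference is needed at each stage (illustrative model only)."""
    encode_needs_external: bool
    decode_needs_external: bool


def reference_needs(embed_ref: int) -> RefNeeds:
    """Map an embed_ref level to the external-reference requirements described in this issue.

    Hypothetical helper: level 0 is assumed to mean "no embedding" (the default).
    """
    if embed_ref == 0:
        # Default: external reference needed both ways; missing reference is a hard failure.
        return RefNeeds(encode_needs_external=True, decode_needs_external=True)
    if embed_ref == 1:
        # Reference is embedded at encode time, so decode can use the embedded copy.
        return RefNeeds(encode_needs_external=True, decode_needs_external=False)
    if embed_ref == 2:
        # Reference is inferred from MD:Z tags or faked up, so no external one is needed.
        return RefNeeds(encode_needs_external=False, decode_needs_external=False)
    raise ValueError(f"unknown embed_ref level: {embed_ref}")
```

Note that level 1 still needs the external reference again for a CRAM->BAM->CRAM round trip, since the re-encode step is an encode like any other.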

This raises a number of questions, which I am opening up to discussion and feedback from the community.

1. If we use the embed_ref option, should it automatically use level 2 if no external reference is found? If so, should this only be for cases where MD:Z tags are found, or also for the more consensus-oriented approach? I.e. should level 2 become the default?
2. If an external reference is not found during encode and no embed_ref option has been set, should we automatically switch to embed_ref=2 mode? If so, should it be for both MD:Z-aware encoding and consensus-based encoding, or only for MD:Z encoding (which is directly equivalent to embed_ref=1 when present)? This is basically saying that unless we explicitly set embed_ref=0, CRAM is permitted to embed the reference to work around the lack of an external one. This will lead to slightly larger files, so it could be argued this isn't a good idea. The obvious middle ground is to amend the error message about the lack of a reference so that it gives a better hint for how to proceed if you cannot obtain one.
3. Should we be providing some hint mechanism in the SAM header, e.g. by adding @CO htslib_cram_options:embed_ref and similar? This opens up a lot of other option selection opportunities too. The idea is that if your input file used an embedded reference then you probably want to be using it again at the end of your pipeline, so CRAM->BAM->CRAM would always transparently work without needing explicit modification of the pipeline options. This is perhaps a more palatable alternative to question 2 above: only do this if we have previous evidence of embed_ref usage.
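
As a rough illustration of the header-hint idea, the Python sketch below scans SAM header text for `@CO htslib_cram_options:` lines and applies one possible policy for picking an embed_ref level. Everything here is hypothetical: the function names, the comma-separated option syntax, and the fallback policy are assumptions for discussion, not existing htslib behaviour.

```python
from typing import List, Optional

# Prefix proposed in this issue for a @CO hint line; the exact syntax is not settled.
HINT_PREFIX = "htslib_cram_options:"


def cram_option_hints(header_text: str) -> List[str]:
    """Collect option hints from @CO lines (assumes comma-separated options)."""
    hints: List[str] = []
    for line in header_text.splitlines():
        if not line.startswith("@CO\t"):
            continue
        comment = line[len("@CO\t"):].strip()
        if comment.startswith(HINT_PREFIX):
            options = comment[len(HINT_PREFIX):]
            hints.extend(opt.strip() for opt in options.split(",") if opt.strip())
    return hints


def choose_embed_ref(explicit: Optional[int],
                     external_ref_found: bool,
                     hints: List[str]) -> int:
    """One possible policy: an explicit option wins, otherwise follow the hint."""
    if explicit is not None:
        return explicit
    hinted = any(h.split("=", 1)[0] == "embed_ref" for h in hints)
    if hinted:
        # Earlier evidence of embed_ref usage: embed again, falling back to
        # level 2 only when no external reference is available.
        return 1 if external_ref_found else 2
    # No hint and no explicit option: keep today's default (no embedding).
    return 0
```

With this policy, a file whose header carries the hint would round-trip through CRAM->BAM->CRAM without extra options, while files with no hint keep the current hard-failure behaviour when the reference is missing.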

All comments welcomed.

jkbonfield mentioned this issue Jun 10, 2022

Cram embed ref cons #1449

Merged

jkbonfield closed this as completed Aug 2, 2022
