Future direction of CRAM embed_ref #1445

Closed
jkbonfield opened this issue Jun 6, 2022 · 0 comments

Following merging of #1442, there is now potential for further improvements.

The current situation is:

  1. By default, CRAM requires an external reference, and it is a hard failure if one isn't found.
  2. If the embed_ref option is set (it defaults to level 1), or embed_ref=1 is given, then an external reference is required during encode but not during decode, which uses the embedded one instead. However, decoded files need the external reference to be specified again if they are subsequently converted back into CRAM.
  3. If the embed_ref=2 option is set, an external reference is no longer required, and either MD:Z tags are used to infer it or a faked-up reference is created (with potential downstream MD:Z tag generation issues).
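
As an illustration only, the three cases above can be modelled as a small decision function. This is a Python sketch of the behaviour as described in this issue, not htslib code: the function name `reference_source` and its return strings are hypothetical. It also assumes that level 2 never consults an external reference, which the description above does not actually state.

```python
from typing import Optional


def reference_source(embed_ref: Optional[int], have_external: bool, have_md: bool) -> str:
    """Return where the encoder takes its reference from under the current rules.

    embed_ref is None when the option was not given, else 0, 1 or 2.
    All names and return values here are illustrative, not htslib API.
    """
    level = 0 if embed_ref is None else embed_ref
    if level == 2:
        # Case 3: no external reference needed. Infer it from MD:Z tags
        # when they exist, otherwise build a faked-up (consensus) reference.
        return "md_inferred" if have_md else "consensus"
    if not have_external:
        # Cases 1 and 2 both need an external reference at encode time.
        raise FileNotFoundError("external reference required for encoding")
    # Case 2 embeds the external reference; case 1 only points at it.
    return "external_embedded" if level == 1 else "external"
```

Under this model `reference_source(1, False, True)` fails even though MD:Z tags are available, which is the gap the questions below are about.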

This raises a number of questions, which I am opening up to discussion and feedback from the community.

  1. If we use the embed_ref option, should it automatically use level 2 if no external reference is found? If so, should this only be for cases where MD:Z tags are found, or also for the more consensus-oriented approach? I.e. should level 2 become the default?

  2. If an external reference is not found during encode and no embed_ref option has been set, should we automatically switch to embed_ref=2 mode? If so, should it be for both MD:Z-aware encoding and consensus-based encoding, or only for MD:Z encoding (which is directly equivalent to embed_ref=1 when present)? This is basically saying that unless we explicitly set embed_ref=0, CRAM is permitted to embed the reference to work around the lack of an external one. This will lead to slightly larger files, so it could be argued this isn't a good idea. The obvious middle ground is to amend the error message about the lack of a reference to provide a better hint for how to proceed if you cannot obtain one.

  3. Should we provide some hint mechanism in the SAM header, e.g. by adding @CO htslib_cram_options:embed_ref and similar? This opens up a lot of other option selection opportunities too. The idea is that if your input file used an embedded reference then you probably want to use it again at the end of your pipeline, so CRAM->BAM->CRAM would always work transparently without needing explicit modification of the pipeline options. This is perhaps a more palatable alternative to question 2 above: only do this if we have previous evidence of embed_ref usage.
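
To make the three options easier to compare, here is a rough Python sketch of how they might combine into one selection policy. Nothing here exists in htslib. The function names, the `auto_fallback` and `allow_consensus` switches, the tab after `@CO`, and the comma-separated form of the hint line are all assumptions made for illustration.

```python
from typing import Optional, Set

# Hypothetical header syntax; only the "htslib_cram_options:embed_ref"
# part comes from the proposal in question 3.
HINT_PREFIX = "@CO\thtslib_cram_options:"


def header_hints(sam_header: str) -> Set[str]:
    """Collect option names from hypothetical htslib_cram_options @CO lines."""
    hints: Set[str] = set()
    for line in sam_header.splitlines():
        if line.startswith(HINT_PREFIX):
            opts = line[len(HINT_PREFIX):]
            hints.update(o.strip() for o in opts.split(",") if o.strip())
    return hints


def choose_embed_level(requested: Optional[int], have_external: bool,
                       have_md: bool, sam_header: str = "",
                       auto_fallback: bool = False,
                       allow_consensus: bool = False) -> int:
    """Pick an embed_ref level, or raise with a hint on how to proceed.

    requested is None when no embed_ref option was given at all.
    auto_fallback models question 2; allow_consensus decides whether the
    fallback may also use the consensus approach when MD:Z is absent.
    """
    hinted = "embed_ref" in header_hints(sam_header)
    if requested is None:
        # Question 3: earlier embed_ref usage acts like an implicit request.
        level = 1 if hinted else 0
    else:
        level = requested
    if level == 2 or have_external:
        return level
    # From here on there is no external reference available.
    wants_embed = level == 1                   # question 1 (or 3 via hint)
    unset = requested is None and not hinted   # question 2
    can_build = have_md or allow_consensus
    if can_build and (wants_embed or (unset and auto_fallback)):
        return 2
    # The "middle ground": keep the hard failure but say what to try next.
    raise FileNotFoundError(
        "no external reference found; consider embed_ref=2 if one cannot be obtained")
```

In this sketch an explicit embed_ref=0 never falls back, and with `auto_fallback=False` an input with no hint and no reference still fails, now with a message pointing at embed_ref=2.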

All comments welcomed.
