Following the merge of #1442, there is now potential for further improvements.
The current situation is:
By default, CRAM requires an external reference, and it is a hard failure if one isn't found.
If the embed_ref option is set (it defaults to level 1), or embed_ref=1 is given explicitly, then an external reference is required during encode but not during decode, which uses the embedded one instead. However, decoded files need the external reference to be specified again if they are subsequently converted back to CRAM.
If the embed_ref=2 option is set, an external reference is no longer required: either MD:Z tags are used to infer it, or a faked-up reference is created (with potential downstream MD:Z tag generation issues).
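The three cases above can be summarised as a small decision function. This is only a sketch of the behaviour as described in this issue, not htslib code: the names `RefSource`, `MissingReferenceError` and `encode_reference_source` are hypothetical, and it assumes that at level 2 the encoder does not consult an external reference at all.

```python
from enum import Enum


class RefSource(Enum):
    """Where the encoder gets the reference from (hypothetical names)."""
    EXTERNAL = "external reference, not embedded"
    EMBEDDED_FROM_EXTERNAL = "embedded copy of the external reference"
    EMBEDDED_FROM_MD = "embedded reference inferred from MD:Z tags"
    EMBEDDED_FAKED = "embedded faked-up reference built from the reads"


class MissingReferenceError(Exception):
    """Models the hard failure when a required reference is not found."""


def encode_reference_source(embed_ref: int,
                            have_external_ref: bool,
                            have_md_tags: bool) -> RefSource:
    """Model of the current encode-time behaviour described in this issue."""
    if embed_ref in (0, 1):
        # Levels 0 and 1 both need the external reference while encoding.
        if not have_external_ref:
            raise MissingReferenceError("no external reference found")
        if embed_ref == 0:
            return RefSource.EXTERNAL
        return RefSource.EMBEDDED_FROM_EXTERNAL
    if embed_ref == 2:
        # Assumption: level 2 never looks at an external reference.
        if have_md_tags:
            return RefSource.EMBEDDED_FROM_MD
        return RefSource.EMBEDDED_FAKED
    raise ValueError(f"unsupported embed_ref level: {embed_ref}")
```

Questions about changing the defaults then amount to asking which of the two `MissingReferenceError` paths should fall through to the level 2 branch instead.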
This raises a number of questions, which I am opening up to discussion and feedback from the community.
1. If we use the embed_ref option, should it automatically use level 2 if no external reference is found? If so, should this only be for cases when MD:Z tags are found, or also for the more consensus-oriented approach? I.e. should level 2 become the default?
2. If an external reference is not found during encode and no embed_ref option has been set, should we automatically switch to embed_ref=2 mode? If so, should it be for both MD:Z-aware encoding and consensus-based encoding, or only for MD:Z encoding (which is directly equivalent to embed_ref=1 when present)? This is basically saying that, unless we explicitly set embed_ref=0, CRAM is permitted to embed the reference to work around the lack of an external one. This will lead to slightly larger files, so it could be argued this isn't a good idea. The obvious middle ground is to amend the error message about the missing reference to give a better hint on how to proceed if you cannot obtain one.
3. Should we be providing some hint mechanism in the SAM header, e.g. by adding @CO htslib_cram_options:embed_ref and similar? This opens up a lot of other option selection opportunities too. The idea is that if your input file used an embedded reference then you probably want to use it again at the end of your pipeline, so CRAM->BAM->CRAM would always work transparently without needing explicit modification of the pipeline options. This is perhaps a more palatable alternative to question 2 above: only do this if we have previous evidence of embed_ref usage.
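To make the header-hint idea concrete, here is a rough sketch of how such a hint might be read back and combined with an explicit option. Nothing here is an existing htslib API: the function names are hypothetical, and the comma-separated `key=value` layout after `htslib_cram_options:` is an assumed extension of the single example given above. A bare `embed_ref` hint is taken to mean level 1, matching the current default level.

```python
from typing import Dict, Optional

HINT_PREFIX = "htslib_cram_options:"  # prefix proposed in this issue


def parse_cram_option_hints(header_text: str) -> Dict[str, Optional[str]]:
    """Collect option hints from @CO lines of a SAM header (hypothetical)."""
    hints: Dict[str, Optional[str]] = {}
    for line in header_text.splitlines():
        if not line.startswith("@CO\t"):
            continue
        comment = line[len("@CO\t"):].strip()
        if not comment.startswith(HINT_PREFIX):
            continue
        # Assumed layout: comma-separated "key" or "key=value" items.
        for item in comment[len(HINT_PREFIX):].split(","):
            item = item.strip()
            if not item:
                continue
            key, sep, value = item.partition("=")
            hints[key] = value if sep else None
    return hints


def choose_embed_ref(explicit: Optional[int],
                     hints: Dict[str, Optional[str]]) -> int:
    """An explicit option always wins; otherwise fall back to the hint."""
    if explicit is not None:
        return explicit
    if "embed_ref" in hints:
        value = hints["embed_ref"]
        # Bare "embed_ref" is treated as level 1 (assumption).
        return int(value) if value else 1
    return 0  # no option and no hint: external reference only
```

With this shape, a CRAM->BAM->CRAM pipeline would pick up embedding again only when the intermediate header carries the hint, while `embed_ref=0` on the command line would still switch it off.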
All comments welcomed.