Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cram references STILL not working #50

Closed
agraubert opened this issue Nov 29, 2020 · 2 comments · Fixed by #56
Closed

Cram references STILL not working #50

agraubert opened this issue Nov 29, 2020 · 2 comments · Fixed by #56
Labels
bug Something isn't working right

Comments

@agraubert
Copy link
Collaborator

I was pretty sure we fixed the issue with cram references back in walaj/SeqLib#38, but apparently travis builds are still failing. Somehow, despite a reference being handed to SeqLib and SeqLib setting the reference on the HTSlib bam object we still can't read Crams unless the fasta is in the hts cache or at the location indicated in the bam header.

I unfortunately don't have time to work on this right now, so for the time being I'm adding a warning message to RNA-SeQC and disabling tests on crams

@agraubert agraubert added the bug Something isn't working right label Nov 29, 2020
@agraubert agraubert mentioned this issue Mar 4, 2021
@agraubert
Copy link
Collaborator Author

agraubert commented Mar 4, 2021

Cram references are handled in a very confusing way, so I'll try and document this here for future reference:

  1. The call to SeqLib::BamReader::SetCramReference just stores the given reference filename on the BamReader and any already opened bams (although doing that has no effect on the latter)
  2. When SeqLib opens the bam, if a reference is set, it calls hts_set_fai_filename
  3. For bams/sams, this just sets .fn_aux on the hts bam object, and I'm not sure if that's ever used
  4. However, for crams, it calls out to cram_set_option, which eventually gets to cram_load_reference
  5. This is where it gets complicated. cram_load_reference calls refs_load_fai which serves two tasks:
    • It checks the REF_PATH and REF_CACHE environment variables as well as the user's provided reference. I'm not exactly sure of the order those are checked. Don't ask; this function is a mess. If a reference file is found somewhere between those three locations, htslib parses it into a reference table object
    • The chromosomes present in the reference table are loaded into a hashmap, which maps chromosome MD5 -> reference entry object
  6. Interestingly, even though refs_load_fai just populated the reference table, cram_load_reference later calls refs2id which immediately empties the table and repopulates it from the hashmap

This is where the problem arises: a user provided reference will always make it into the reference table and even the hash map, but because the final version of the table is populated from a hashmap, it might not be used. Ultimately, the reference table contains a reference object that htslib uses to decode crams, but entries only make it in iff the MD5 of the chromosome in the fasta matches the MD5 recorded for that chromosome in the bam header

@agraubert
Copy link
Collaborator Author

The new fix has RNA-SeQC preload bams when opened. It has htslib populate the reference table exactly as above, then checks to see which chromosomes (present in the bam header) ended up using the provided reference as their reference. This allows for RNA-SeQC to spit out some error messages that can help reveal this problem when it occurs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working right
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant