-
Notifications
You must be signed in to change notification settings - Fork 244
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add an optimised version of CramContainerIterator #1129
Add an optimised version of CramContainerIterator #1129
Conversation
@cmnbroad Could you review this one when you get a chance? |
… header for each container.
90be8ce
to
dbca999
Compare
Made the test extend |
Codecov Report
@@ Coverage Diff @@
## master #1129 +/- ##
===============================================
+ Coverage 65.963% 65.967% +0.004%
- Complexity 7663 7670 +7
===============================================
Files 537 538 +1
Lines 32553 32574 +21
Branches 5528 5530 +2
===============================================
+ Hits 21473 21488 +15
+ Misses 8930 8929 -1
- Partials 2150 2157 +7
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Proposed an alternative implementation, and a couple of other minor comments. Also I'd like to avoid adding a test cram that has no companion reference if possible, since it can't be used for much else. There are a few existing crams that have multiple containers though - maybe you can use one of those. See src/test/resources/htsjdk/samtools/cram/NA12878.20.21.1-100.100-SeqsPerSlice.0-unMapped.cram
.
Assert.assertEquals(headerOnlyContainer.bases, fullContainer.bases); | ||
Assert.assertEquals(headerOnlyContainer.blockCount, fullContainer.blockCount); | ||
Assert.assertEquals(headerOnlyContainer.landmarks, fullContainer.landmarks); | ||
Assert.assertEquals(headerOnlyContainer.checksum, fullContainer.checksum); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add an equality assert here for container.offset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding the assert exposes the offset issue, and your fix resolves it.
|
||
public CramContainerHeaderIterator(final SeekableStream inputStream) throws IOException { | ||
cramHeader = CramIO.readCramHeader(inputStream); | ||
offset = inputStream.position(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In looking at this I noticed that the existing CramContainerIterator
doesn't correctly set the offset as you're ding here (it fails to account for the size of the Cram header, so the first real container has it's offset set to 0 - all of them are off after that). If we fix that, then this class can be a subclass of that one, and we won't have to duplicate the rest of the implementation. Plus we won't have to artificially constrain the stream to be a SeekableStream
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to explore this a bit since I'm a little surprised the existing CramContainerIterator
doesn't cause problems due to the bad offsets. Anyway, here is a commit with my proposed fix - see what you think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your fix is a great improvement - it simplifies the new code and fixes a previous issue with the code. Thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW I had a quick look at why this wasn't detected before. I changed the code in Squark that uses CramContainerHeaderIterator to use CramContainerIterator - and for small CRAM files it still works, most likely because a Chunk that starts at the start of the file, rather than the first offset, works in the same way. For large CRAM files the offsets don't work however and the code throws an exception.
So it may be that testing on small CRAM files has masked this bug. CRAM indexes written with the old code may be invalid though.
* @throws IOException as per java IO contract | ||
* @throws EOFException if there is less than length bytes in the stream | ||
*/ | ||
public static void skipFully(final InputStream in, final long length) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a positive test, plus one that fails to consume the entire length.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me now. Needs maintainer approval - @lbergelson.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@tomwhite Thank you. |
Description
An optimized version of CramContainerIterator that only reads the header for each container. The motivation for this change is to make it possible to find CRAM container offsets without having to decompress the blocks in the container. See #1112
Checklist