Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add an optimised version of CramContainerIterator #1129

Merged
merged 2 commits into from
Jun 5, 2018

Conversation

tomwhite
Copy link
Contributor

Description

An optimized version of CramContainerIterator that only reads the header for each container. The motivation for this change is to make it possible to find CRAM container offsets without having to decompress the blocks in the container. See #1112

Checklist

  • Code compiles correctly
  • New tests covering changes and new functionality
  • All tests passing
  • Extended the README / documentation, if necessary
  • Is not backward compatible (breaks binary or source compatibility)

@lbergelson
Copy link
Member

@cmnbroad Could you review this one when you get a chance?
@tomwhite The tests are failing because CramContainerHeaderIteratorTest doesn't extend HtsJdkTest. If it doesn't extend that then it may not be picked up by the test runner, so there is a test that checks for test files that don't extend it...

@lbergelson lbergelson requested a review from cmnbroad May 25, 2018 17:17
@tomwhite tomwhite force-pushed the cram_container_header_iterator branch from 90be8ce to dbca999 Compare May 29, 2018 07:56
@tomwhite
Copy link
Contributor Author

Made the test extend HtsJdkTest.

@codecov-io
Copy link

codecov-io commented May 29, 2018

Codecov Report

Merging #1129 into master will increase coverage by 0.004%.
The diff coverage is 91.304%.

@@               Coverage Diff               @@
##              master     #1129       +/-   ##
===============================================
+ Coverage     65.963%   65.967%   +0.004%     
- Complexity      7663      7670        +7     
===============================================
  Files            537       538        +1     
  Lines          32553     32574       +21     
  Branches        5528      5530        +2     
===============================================
+ Hits           21473     21488       +15     
+ Misses          8930      8929        -1     
- Partials        2150      2157        +7
Impacted Files Coverage Δ Complexity Δ
...jdk/samtools/cram/build/CramContainerIterator.java 84.848% <100%> (+0.473%) 10 <3> (+1) ⬆️
...mtools/cram/build/CramContainerHeaderIterator.java 100% <100%> (ø) 2 <2> (?)
...java/htsjdk/samtools/cram/io/InputStreamUtils.java 80.769% <81.818%> (+0.769%) 7 <3> (+3) ⬆️
...sjdk/samtools/util/Md5CalculatingOutputStream.java 69.231% <0%> (-7.692%) 8% <0%> (-1%)
src/main/java/htsjdk/samtools/CRAMFileWriter.java 60% <0%> (-4.103%) 10% <0%> (-1%)
src/main/java/htsjdk/samtools/SAMTextWriter.java 66.667% <0%> (-2.652%) 12% <0%> (ø)
...rc/main/java/htsjdk/samtools/SamInputResource.java 67.153% <0%> (ø) 16% <0%> (ø) ⬇️
...c/main/java/htsjdk/samtools/SAMFileWriterImpl.java 76.923% <0%> (+0.3%) 26% <0%> (+1%) ⬆️
...ava/htsjdk/samtools/cram/structure/CramHeader.java 44.444% <0%> (+3.704%) 7% <0%> (+1%) ⬆️
... and 1 more

Copy link
Collaborator

@cmnbroad cmnbroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Proposed an alternative implementation, and a couple of other minor comments. Also I'd like to avoid adding a test cram that has no companion reference if possible, since it can't be used for much else. There are a few existing crams that have multiple containers though - maybe you can use one of those. See src/test/resources/htsjdk/samtools/cram/NA12878.20.21.1-100.100-SeqsPerSlice.0-unMapped.cram.

Assert.assertEquals(headerOnlyContainer.bases, fullContainer.bases);
Assert.assertEquals(headerOnlyContainer.blockCount, fullContainer.blockCount);
Assert.assertEquals(headerOnlyContainer.landmarks, fullContainer.landmarks);
Assert.assertEquals(headerOnlyContainer.checksum, fullContainer.checksum);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add an equality assert here for container.offset.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding the assert exposes the offset issue, and your fix resolves it.


public CramContainerHeaderIterator(final SeekableStream inputStream) throws IOException {
cramHeader = CramIO.readCramHeader(inputStream);
offset = inputStream.position();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In looking at this I noticed that the existing CramContainerIterator doesn't correctly set the offset as you're ding here (it fails to account for the size of the Cram header, so the first real container has it's offset set to 0 - all of them are off after that). If we fix that, then this class can be a subclass of that one, and we won't have to duplicate the rest of the implementation. Plus we won't have to artificially constrain the stream to be a SeekableStream.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to explore this a bit since I'm a little surprised the existing CramContainerIterator doesn't cause problems due to the bad offsets. Anyway, here is a commit with my proposed fix - see what you think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your fix is a great improvement - it simplifies the new code and fixes a previous issue with the code. Thank you!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW I had a quick look at why this wasn't detected before. I changed the code in Squark that uses CramContainerHeaderIterator to use CramContainerIterator - and for small CRAM files it still works, most likely because a Chunk that starts at the start of the file, rather than the first offset, works in the same way. For large CRAM files the offsets don't work however and the code throws an exception.

So it may be that testing on small CRAM files has masked this bug. CRAM indexes written with the old code may be invalid though.

* @throws IOException as per java IO contract
* @throws EOFException if there is less than length bytes in the stream
*/
public static void skipFully(final InputStream in, final long length) throws IOException {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a positive test, plus one that fails to consume the entire length.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Collaborator

@cmnbroad cmnbroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me now. Needs maintainer approval - @lbergelson.

@yfarjoun yfarjoun added the cram label Jun 4, 2018
Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@lbergelson lbergelson merged commit f007ed3 into samtools:master Jun 5, 2018
@lbergelson
Copy link
Member

@tomwhite Thank you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants