Add support for reading and writing splitting BAM index files. #1138
Conversation
See #1112
Codecov Report

| | master | #1138 | +/- |
| --- | --- | --- | --- |
| Coverage | 68.38% | 68.488% | +0.108% |
| Complexity | 8013 | 8063 | +50 |
| Files | 541 | 545 | +4 |
| Lines | 32717 | 32902 | +185 |
| Branches | 5531 | 5554 | +23 |
| Hits | 22372 | 22534 | +162 |
| Misses | 8134 | 8144 | +10 |
| Partials | 2211 | 2224 | +13 |
@tomwhite - do you think it might be worth adding to the specs (https://github.com/samtools/hts-specs)? That would ensure that the format is well-defined and not specific to HTSJDK (and allow integration with other frameworks in other languages).
@magicDGS This was actually just discussed in the last GA4GH meeting. I think that @lbergelson will be proposing a change to the spec to document this type of indexing scheme.
Thanks @yfarjoun - nice to hear that!
@magicDGS @yfarjoun absolutely. I'm in the process of drafting something. I'll work with @lbergelson and others on this.
We should hold off on reviewing this until we work through the issues in the spec discussion.
@lbergelson I think this is ready for review now, as the spec discussion has converged. Also, I've successfully used the code in this PR to build SBI files (with granularity 1, i.e. an entry for every read start position) for benchmarking read counting on BAM for speed and accuracy (see https://github.com/tomwhite/disq-benchmarks).
Force-pushed from 1f56f0c to 8123579.
some comments
}

private static SBIIndex readIndex(final InputStream in) {
    BinaryCodec binaryCodec = new BinaryCodec(in);
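For orientation, here is a minimal, hedged sketch of how the pieces referenced in this review (magic bytes, file length, offset count, offset array) could be read back with `BinaryCodec`. The field order, the granularity field, and the helper name are assumptions drawn from the PR description, not the authoritative layout; see the spec discussion for that.

```java
import htsjdk.samtools.util.BinaryCodec;
import java.io.InputStream;

final class SbiReadSketch {
    // Hypothetical helper: not the PR's readIndex(), just an illustration of the BinaryCodec calls.
    static long[] readVirtualOffsets(final InputStream in) {
        final BinaryCodec binaryCodec = new BinaryCodec(in);
        final byte[] magic = new byte[4];
        binaryCodec.readBytes(magic);                    // magic number, for file discovery and versioning
        final long fileLength = binaryCodec.readLong();  // length of the BAM file being indexed
        final long granularity = binaryCodec.readLong(); // assumed: reads per offset entry
        final long numOffsets = binaryCodec.readLong();  // number of virtual offsets that follow
        final long[] virtualOffsets = new long[(int) numOffsets];
        for (int i = 0; i < virtualOffsets.length; i++) {
            virtualOffsets[i] = binaryCodec.readLong();  // BGZF virtual file offsets, little-endian
        }
        return virtualOffsets;
    }
}
```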
final here and below
Done
 * @return a list of contiguous, non-overlapping, sorted chunks that cover the whole data file
 * @see #getChunk(long, long)
 */
public List<Chunk> split(long splitSize) {
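A rough sketch of what `split(splitSize)` could do, assuming (per the `@see` tag above) a `getChunk(start, end)` lookup that maps a byte range of the data file to a `Chunk` of BGZF virtual offsets. The `ChunkLookup` interface is a placeholder for the index's own state, not part of the PR.

```java
import htsjdk.samtools.Chunk;
import java.util.ArrayList;
import java.util.List;

final class SplitSketch {
    interface ChunkLookup {
        Chunk getChunk(long start, long end); // placeholder for SBIIndex#getChunk(long, long)
    }

    static List<Chunk> split(final long splitSize, final long dataFileLength, final ChunkLookup lookup) {
        final List<Chunk> chunks = new ArrayList<>();
        // Walk the data file in splitSize steps; each step maps to one chunk of virtual offsets.
        for (long start = 0; start < dataFileLength; start += splitSize) {
            final long end = Math.min(start + splitSize, dataFileLength);
            final Chunk chunk = lookup.getChunk(start, end);
            if (chunk != null) { // a range containing no record starts may yield no chunk
                chunks.add(chunk);
            }
        }
        return chunks;
    }
}
```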
finals here and below
Done
Thanks for the reviews @lbergelson and @yfarjoun. I've addressed all the comments.
// the end, once we know the number of offsets. This is more efficient than using a List<Long> for very
// large numbers of offsets (e.g. 10^8, which is possible for low granularity), since the list resizing
// operation is slow.
this.tempOffsetsFile = File.createTempFile("offsets-", ".headerless.sbi");
It would be nice to use `Path` over `File` in new code where possible, e.g. use `Files.createTempFile()` here and `Files.newOutputStream()` below.
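A small sketch of the NIO calls being suggested (the variable and class names are illustrative):

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

final class TempFileSketch {
    static Path createTempOffsetsFile() throws IOException {
        // Files.createTempFile returns a Path, replacing File.createTempFile
        final Path tempOffsetsPath = Files.createTempFile("offsets-", ".headerless.sbi");
        try (final OutputStream out = Files.newOutputStream(tempOffsetsPath)) {
            // offsets would be streamed to the temporary file here
        }
        return tempOffsetsPath;
    }
}
```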
Thanks for the suggestion @pshapiro4broad. I've made the change in the latest version of this PR.
Thanks for this PR @tomwhite. Could you add `final` to the variables that can be? I commented next to some of the places, but not all.
blockIn.seek(recordStart);
// Create a buffer for reading the BAM record lengths. BAM is little-endian.
final ByteBuffer byteBuffer = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN);
SBIIndexWriter indexWriter = new SBIIndexWriter(out, granularity);
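For illustration, a sketch of how a 4-byte BAM record length can be read with a little-endian buffer like the one above; the stream handling here is an assumption about the surrounding indexer code, not the exact implementation.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class RecordLengthSketch {
    static int readRecordLength(final InputStream blockIn) throws IOException {
        // BAM stores block_size (the length of the rest of the record) as a little-endian int32.
        final ByteBuffer byteBuffer = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN);
        new DataInputStream(blockIn).readFully(byteBuffer.array());
        return byteBuffer.getInt();
    }
}
```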
final
public boolean equals(Object o) {
    if (this == o) return true;
    if (o == null || getClass() != o.getClass()) return false;
    Header header = (Header) o;
final
    throw new RuntimeException(String.format("Cannot read SBI with more than %s offsets.", Integer.MAX_VALUE));
}
final int numOffsets = (int) numOffsetsLong;
long[] virtualOffsets = new long[numOffsets];
final
if (!Arrays.equals(buffer, SBI_MAGIC)) {
    throw new RuntimeException("Invalid file header in SBI: " + new String(buffer) + " (" + Arrays.toString(buffer) + ")");
}
long fileLength = binaryCodec.readLong();
more finals
    finish(header, finalVirtualOffset);
}

void finish(SBIIndex.Header header, long finalVirtualOffset) {
finals
List<SAMRecord> allReads = Iterables.slurp(samReader);

List<SAMRecord> allReadsFromChunks = new ArrayList<>();
for (Chunk chunk : chunks) {
final chunk
}
Assert.assertEquals(allReadsFromChunks, allReads);

List<Chunk> optimizedChunks = Chunk.optimizeChunkList(chunks, 0);
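As a side note on the test above, `Chunk.optimizeChunkList` merges overlapping or adjacent chunks so a reader visits each region only once; a tiny standalone example (the virtual offset values are made up):

```java
import htsjdk.samtools.Chunk;
import java.util.Arrays;
import java.util.List;

final class OptimizeChunksExample {
    static List<Chunk> merged() {
        final List<Chunk> chunks = Arrays.asList(
                new Chunk(0L, 100L),
                new Chunk(100L, 250L),  // adjacent to the first chunk, so the two should merge
                new Chunk(400L, 500L));
        // The second argument is a minimum virtual offset below which chunks are ignored; 0 keeps them all.
        return Chunk.optimizeChunkList(chunks, 0);
    }
}
```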
final lists
public void testIndexersProduceSameIndexes() throws Exception {
    long bamFileSize = BAM_FILE.length();
    for (long g : new long[] { 1, 2, 10, SBIIndexWriter.DEFAULT_GRANULARITY }) {
        SBIIndex index1 = fromBAMFile(BAM_FILE, g);
final

@Test
public void testIndexersProduceSameIndexes() throws Exception {
    long bamFileSize = BAM_FILE.length();
finals
@yfarjoun thanks for taking a look. I've added the …
another few nits. thanks!
@yfarjoun @lbergelson this should be ready to go in. I've addressed all the feedback. Thanks!
This looks good to me. Tom responded to Yossi's comments. We should integrate this with SamFileWriterFactory so it's easy to write an SBI as part of writing a BAM file.
Thanks @lbergelson!
Move logic to read and write splitting-bai files to htsjdk from Hadoop-BAM, since they are useful for distributed processing in general (and perhaps other tools). The format here is different to (and incompatible with) the one in Hadoop-BAM: it adds a header with a magic number (for file discovery and versioning), as well as a field for granularity. Also, the offsets are written in little-endian format for consistency with the rest of the BAM spec.
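To make the description above concrete, here is a plausible sketch of writing the layout it describes: a header with a magic number, the length of the indexed file, and the granularity, followed by the virtual offsets, all little-endian (which is what htsjdk's `BinaryCodec` emits). The exact field order, field names, and magic value are assumptions; the hts-specs proposal is authoritative.

```java
import htsjdk.samtools.util.BinaryCodec;
import java.io.OutputStream;

final class SbiWriteSketch {
    private static final byte[] SBI_MAGIC = "SBI\1".getBytes(); // hypothetical magic value

    static void write(final OutputStream out, final long fileLength,
                      final long granularity, final long[] virtualOffsets) {
        final BinaryCodec binaryCodec = new BinaryCodec(out);
        binaryCodec.writeBytes(SBI_MAGIC);            // file discovery and versioning
        binaryCodec.writeLong(fileLength);            // length of the BAM file being indexed
        binaryCodec.writeLong(granularity);           // reads per index entry
        binaryCodec.writeLong(virtualOffsets.length); // number of offsets that follow
        for (final long virtualOffset : virtualOffsets) {
            binaryCodec.writeLong(virtualOffset);     // BGZF virtual offsets, little-endian
        }
        binaryCodec.close();                          // also closes the underlying stream
    }
}
```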