Add support for reading and writing splitting BAM index files. #1138
Conversation
See #1112
Codecov Report

| | master | #1138 | +/- |
| --- | --- | --- | --- |
| Coverage | 68.38% | 68.488% | +0.108% |
| Complexity | 8013 | 8063 | +50 |
| Files | 541 | 545 | +4 |
| Lines | 32717 | 32902 | +185 |
| Branches | 5531 | 5554 | +23 |
| Hits | 22372 | 22534 | +162 |
| Misses | 8134 | 8144 | +10 |
| Partials | 2211 | 2224 | +13 |
@tomwhite - do you think it might be worth adding to the specs (https://github.com/samtools/hts-specs)? That would ensure that the format is well-defined and not specific to HTSJDK (and allow integration with other frameworks in other languages).
@magicDGS This was actually just discussed in the last GA4GH meeting. I think that @lbergelson will be proposing a change to the spec to document this type of indexing scheme.
Thanks @yfarjoun - nice to hear that!
@magicDGS @yfarjoun absolutely. I'm in the process of drafting something. I'll work with @lbergelson and others on this.
We should hold off on reviewing this until we work through the issues in the spec discussion.
@lbergelson I think this is ready for review now, as the spec discussion has converged. Also, I've successfully used the code in this PR to build SBI files (with granularity 1, i.e. an entry for every read start position) for benchmarking read counting on BAM for speed and accuracy (see https://github.com/tomwhite/disq-benchmarks).
Force-pushed from 1f56f0c to 8123579.
some comments
}

private static SBIIndex readIndex(final InputStream in) {
    BinaryCodec binaryCodec = new BinaryCodec(in);
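For orientation, here is a minimal, hedged sketch of how the pieces referenced in this review (magic bytes, file length, offset count, offset array) could be read back with `BinaryCodec`. The field order, the granularity field, and the helper name are assumptions drawn from the PR description, not the authoritative layout; see the spec discussion for that.

```java
import htsjdk.samtools.util.BinaryCodec;
import java.io.InputStream;

final class SbiReadSketch {
    // Hypothetical helper: not the PR's readIndex(), just an illustration of the BinaryCodec calls.
    static long[] readVirtualOffsets(final InputStream in) {
        final BinaryCodec binaryCodec = new BinaryCodec(in);
        final byte[] magic = new byte[4];
        binaryCodec.readBytes(magic);                    // magic number, for file discovery and versioning
        final long fileLength = binaryCodec.readLong();  // length of the BAM file being indexed
        final long granularity = binaryCodec.readLong(); // assumed: reads per offset entry
        final long numOffsets = binaryCodec.readLong();  // number of virtual offsets that follow
        final long[] virtualOffsets = new long[(int) numOffsets];
        for (int i = 0; i < virtualOffsets.length; i++) {
            virtualOffsets[i] = binaryCodec.readLong();  // BGZF virtual file offsets, little-endian
        }
        return virtualOffsets;
    }
}
```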
final here and below
Done
 * @return a list of contiguous, non-overlapping, sorted chunks that cover the whole data file
 * @see #getChunk(long, long)
 */
public List<Chunk> split(long splitSize) {
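A rough sketch of what `split(splitSize)` could do, assuming (per the `@see` tag above) a `getChunk(start, end)` lookup that maps a byte range of the data file to a `Chunk` of BGZF virtual offsets. The `ChunkLookup` interface is a placeholder for the index's own state, not part of the PR.

```java
import htsjdk.samtools.Chunk;
import java.util.ArrayList;
import java.util.List;

final class SplitSketch {
    interface ChunkLookup {
        Chunk getChunk(long start, long end); // placeholder for SBIIndex#getChunk(long, long)
    }

    static List<Chunk> split(final long splitSize, final long dataFileLength, final ChunkLookup lookup) {
        final List<Chunk> chunks = new ArrayList<>();
        // Walk the data file in splitSize steps; each step maps to one chunk of virtual offsets.
        for (long start = 0; start < dataFileLength; start += splitSize) {
            final long end = Math.min(start + splitSize, dataFileLength);
            final Chunk chunk = lookup.getChunk(start, end);
            if (chunk != null) { // a range containing no record starts may yield no chunk
                chunks.add(chunk);
            }
        }
        return chunks;
    }
}
```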
finals here and below
Done
Thanks for the reviews @lbergelson and @yfarjoun. I've addressed all the comments.
// the end, once we know the number of offsets. This is more efficient than using a List<Long> for very
// large numbers of offsets (e.g. 10^8, which is possible for low granularity), since the list resizing
// operation is slow.
this.tempOffsetsFile = File.createTempFile("offsets-", ".headerless.sbi");
It would be nice to use `Path` over `File` in new code where possible, e.g. use `Files.createTempFile()` here and `Files.newOutputStream()` below.
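A small sketch of the NIO calls being suggested (the variable and class names are illustrative):

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

final class TempFileSketch {
    static Path createTempOffsetsFile() throws IOException {
        // Files.createTempFile returns a Path, replacing File.createTempFile
        final Path tempOffsetsPath = Files.createTempFile("offsets-", ".headerless.sbi");
        try (final OutputStream out = Files.newOutputStream(tempOffsetsPath)) {
            // offsets would be streamed to the temporary file here
        }
        return tempOffsetsPath;
    }
}
```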
Thanks for the suggestion @pshapiro4broad. I've made the change in the latest version of this PR.
Thanks for this PR @tomwhite. Could you add `final` to the variables that can be? I commented next to some of the places, but not all.
blockIn.seek(recordStart);
// Create a buffer for reading the BAM record lengths. BAM is little-endian.
final ByteBuffer byteBuffer = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN);
SBIIndexWriter indexWriter = new SBIIndexWriter(out, granularity);
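For illustration, a sketch of how a 4-byte BAM record length can be read with a little-endian buffer like the one above; the stream handling here is an assumption about the surrounding indexer code, not the exact implementation.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class RecordLengthSketch {
    static int readRecordLength(final InputStream blockIn) throws IOException {
        // BAM stores block_size (the length of the rest of the record) as a little-endian int32.
        final ByteBuffer byteBuffer = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN);
        new DataInputStream(blockIn).readFully(byteBuffer.array());
        return byteBuffer.getInt();
    }
}
```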
final
public boolean equals(Object o) {
    if (this == o) return true;
    if (o == null || getClass() != o.getClass()) return false;
    Header header = (Header) o;
final
    throw new RuntimeException(String.format("Cannot read SBI with more than %s offsets.", Integer.MAX_VALUE));
}
final int numOffsets = (int) numOffsetsLong;
long[] virtualOffsets = new long[numOffsets];
final
if (!Arrays.equals(buffer, SBI_MAGIC)) {
    throw new RuntimeException("Invalid file header in SBI: " + new String(buffer) + " (" + Arrays.toString(buffer) + ")");
}
long fileLength = binaryCodec.readLong();
more finals
    finish(header, finalVirtualOffset);
}

void finish(SBIIndex.Header header, long finalVirtualOffset) {
finals
List<SAMRecord> allReads = Iterables.slurp(samReader);

List<SAMRecord> allReadsFromChunks = new ArrayList<>();
for (Chunk chunk : chunks) {
final chunk
}
Assert.assertEquals(allReadsFromChunks, allReads);

List<Chunk> optimizedChunks = Chunk.optimizeChunkList(chunks, 0);
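As a side note on the test above, `Chunk.optimizeChunkList` merges overlapping or adjacent chunks so a reader visits each region only once; a tiny standalone example (the virtual offset values are made up):

```java
import htsjdk.samtools.Chunk;
import java.util.Arrays;
import java.util.List;

final class OptimizeChunksExample {
    static List<Chunk> merged() {
        final List<Chunk> chunks = Arrays.asList(
                new Chunk(0L, 100L),
                new Chunk(100L, 250L),  // adjacent to the first chunk, so the two should merge
                new Chunk(400L, 500L));
        // The second argument is a minimum virtual offset below which chunks are ignored; 0 keeps them all.
        return Chunk.optimizeChunkList(chunks, 0);
    }
}
```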
final lists
public void testIndexersProduceSameIndexes() throws Exception {
    long bamFileSize = BAM_FILE.length();
    for (long g : new long[] { 1, 2, 10, SBIIndexWriter.DEFAULT_GRANULARITY }) {
        SBIIndex index1 = fromBAMFile(BAM_FILE, g);
final

@Test
public void testIndexersProduceSameIndexes() throws Exception {
    long bamFileSize = BAM_FILE.length();
finals
@yfarjoun thanks for taking a look. I've added the …
another few nits. thanks!
@yfarjoun @lbergelson this should be ready to go in. I've addressed all the feedback. Thanks!
This looks good to me. Tom responded to Yossi's comments. We should integrate this with SamFileWriterFactory so it's easy to write an SBI as part of writing a BAM file.
Thanks @lbergelson!
Move logic to read and write splitting-bai files to htsjdk from Hadoop-BAM, since they are useful for distributed processing in general (and perhaps other tools). The format here is different to (and incompatible with) the one in Hadoop-BAM: it adds a header with a magic number (for file discovery and versioning), as well as a field for granularity. Also, the offsets are written in little-endian format for consistency with the rest of the BAM spec.
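To make the description above concrete, here is a plausible sketch of writing the layout it describes: a header with a magic number, the length of the indexed file, and the granularity, followed by the virtual offsets, all little-endian (which is what htsjdk's `BinaryCodec` emits). The exact field order, field names, and magic value are assumptions; the hts-specs proposal is authoritative.

```java
import htsjdk.samtools.util.BinaryCodec;
import java.io.OutputStream;

final class SbiWriteSketch {
    private static final byte[] SBI_MAGIC = "SBI\1".getBytes(); // hypothetical magic value

    static void write(final OutputStream out, final long fileLength,
                      final long granularity, final long[] virtualOffsets) {
        final BinaryCodec binaryCodec = new BinaryCodec(out);
        binaryCodec.writeBytes(SBI_MAGIC);            // file discovery and versioning
        binaryCodec.writeLong(fileLength);            // length of the BAM file being indexed
        binaryCodec.writeLong(granularity);           // reads per index entry
        binaryCodec.writeLong(virtualOffsets.length); // number of offsets that follow
        for (final long virtualOffset : virtualOffsets) {
            binaryCodec.writeLong(virtualOffset);     // BGZF virtual offsets, little-endian
        }
        binaryCodec.close();                          // also closes the underlying stream
    }
}
```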