Improvements to support Hadoop-BAM/Squark #1112

Open
tomwhite opened this issue May 3, 2018 · 0 comments

tomwhite commented May 3, 2018

This describes some of the improvements, changes and additions to htsjdk to better support Hadoop-BAM and its proposed new incarnation codenamed Squark.

Individual issues or PRs can be created later to implement the changes as needed.

Small changes

  • Code to write a BAM header to a stream. Currently resides in htsjdk's BAMFileWriter#writeHeader, which is protected (and the class is package private). Needed by BamSink in Squark.
  • More block-compressed file pointer utilities. BlockCompressedFilePointerUtil#makeFilePointer should be public. It would also be useful to have an overload that takes no offset (i.e. assumes an offset of 0). Squark has BgzfVirtualFilePointerUtil for this.
  • Expose CRAMIntervalIterator. Currently private, it's needed in Squark to get reads overlapping intervals, much like BAMFileReader#createIndexIterator.
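
For context on the file pointer request, here is a minimal sketch of how a BGZF virtual file pointer is laid out: the compressed block address sits in the upper 48 bits and the offset within the uncompressed block in the lower 16 bits. The class and method names are hypothetical and this is not htsjdk's code; the single-argument overload illustrates the proposed "no offset" variant.

```java
/** Hypothetical helper (not htsjdk's API) that packs and unpacks BGZF virtual file pointers. */
final class VirtualFilePointers {
    private static final int SHIFT = 16;                           // low 16 bits hold the in-block offset
    private static final long OFFSET_MASK = 0xFFFFL;
    private static final long MAX_BLOCK_ADDRESS = (1L << 48) - 1;  // upper 48 bits hold the block address

    private VirtualFilePointers() {}

    /** Packs a compressed block address and an offset into the uncompressed block. */
    static long make(long blockAddress, int blockOffset) {
        if (blockAddress < 0 || blockAddress > MAX_BLOCK_ADDRESS) {
            throw new IllegalArgumentException("block address out of range: " + blockAddress);
        }
        if (blockOffset < 0 || blockOffset > OFFSET_MASK) {
            throw new IllegalArgumentException("block offset out of range: " + blockOffset);
        }
        return (blockAddress << SHIFT) | blockOffset;
    }

    /** Overload for a pointer to the start of a block (offset 0). */
    static long make(long blockAddress) {
        return make(blockAddress, 0);
    }

    /** Extracts the compressed block address. */
    static long blockAddress(long virtualFilePointer) {
        return virtualFilePointer >>> SHIFT;
    }

    /** Extracts the offset within the uncompressed block. */
    static int blockOffset(long virtualFilePointer) {
        return (int) (virtualFilePointer & OFFSET_MASK);
    }
}
```

For example, `VirtualFilePointers.make(1000L, 42)` and `VirtualFilePointers.make(1000L)` differ only in the low 16 bits.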

Improvements

  • Seeks within SeekableBufferedStream's buffer should not create a new buffer. See Squark's ExtSeekableBufferedStream for the changes.
  • An optimized version of CramContainerIterator that only reads the header for each container. See Squark's CramContainerHeaderIterator.
  • A way to read a VCFHeader from a stream without knowing if the file is VCF or BCF, or compressed or not. Implementation in Hadoop-BAM's VCFHeaderReader.
  • A way to use ReferenceSequenceFileFactory with streams. It should be possible to open a reference sequence by passing input streams for a FASTA file and its index. This would allow reading from Hadoop filesystems without having to use the NIO file library. (One of the goals of Squark is to make it a user-controllable option as to whether to use NIO or Hadoop filesystems.)
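
To illustrate the "read a header without knowing the format" item, here is a minimal sketch of one way to sniff the format from the first few bytes of a stream. It is not Hadoop-BAM's VCFHeaderReader; the class and method names are hypothetical. It assumes that compressed input starts with the gzip magic bytes (0x1f 0x8b, which also holds for BGZF) and that BCF data starts with the ASCII bytes "BCF"; anything else is treated as text VCF.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;

/** Hypothetical sketch; this is not Hadoop-BAM's or htsjdk's VCFHeaderReader. */
final class VariantFormatSniffer {
    enum Format { VCF, BCF }

    /** What was detected, plus a stream positioned at the first (decompressed) byte. */
    static final class Result {
        final Format format;
        final boolean compressed;
        final InputStream stream;

        Result(Format format, boolean compressed, InputStream stream) {
            this.format = format;
            this.compressed = compressed;
            this.stream = stream;
        }
    }

    private static final byte[] GZIP_MAGIC = {(byte) 0x1f, (byte) 0x8b};
    private static final byte[] BCF_MAGIC = {'B', 'C', 'F'};

    private VariantFormatSniffer() {}

    static Result sniff(InputStream in) throws IOException {
        BufferedInputStream raw = new BufferedInputStream(in);
        boolean compressed = startsWith(raw, GZIP_MAGIC);
        // If compressed, look at the decompressed bytes to tell BCF from VCF.
        BufferedInputStream data =
                compressed ? new BufferedInputStream(new GZIPInputStream(raw)) : raw;
        Format format = startsWith(data, BCF_MAGIC) ? Format.BCF : Format.VCF;
        return new Result(format, compressed, data);
    }

    /** Peeks at the leading bytes using mark/reset, so nothing is consumed. */
    private static boolean startsWith(BufferedInputStream in, byte[] magic) throws IOException {
        in.mark(magic.length);
        byte[] head = new byte[magic.length];
        int n = 0;
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) {
                break; // stream is shorter than the magic
            }
            n += r;
        }
        in.reset();
        return n == magic.length && Arrays.equals(head, magic);
    }
}
```

The caller would then hand `Result.stream` to the appropriate VCF or BCF header parser; that step is omitted here because it depends on the htsjdk codec classes.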

New features

  • Splitting-bai. Hadoop-BAM introduced a simple index format to locate read boundaries after arbitrary offsets in a file, which helps read BAMs in parallel. It would be beneficial to have the logic to read and write splitting-bai files in htsjdk, since they are useful for distributed processing in general.
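
As a rough idea of what reading and writing such an index could look like, here is a minimal sketch. It assumes the index is a flat sequence of big-endian 64-bit BGZF virtual offsets in ascending order; that is my understanding of the Hadoop-BAM format rather than something specified in this issue, and the class and method names are hypothetical.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.TreeSet;

/** Hypothetical sketch of a splitting index; assumes a flat list of big-endian virtual offsets. */
final class SplittingIndexSketch {
    private final TreeSet<Long> virtualOffsets = new TreeSet<>();

    private SplittingIndexSketch() {}

    /** Writes each virtual offset as a big-endian 64-bit integer. */
    static void write(OutputStream out, long[] offsets) throws IOException {
        DataOutputStream data = new DataOutputStream(out);
        for (long offset : offsets) {
            data.writeLong(offset);
        }
        data.flush();
    }

    /** Reads virtual offsets until the stream ends (a truncated trailing entry is ignored). */
    static SplittingIndexSketch read(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        SplittingIndexSketch index = new SplittingIndexSketch();
        while (true) {
            try {
                index.virtualOffsets.add(data.readLong());
            } catch (EOFException e) {
                break;
            }
        }
        return index;
    }

    /**
     * Returns the first indexed read boundary at or after the given compressed
     * byte offset, or null if there is none. A split starting at an arbitrary
     * byte offset can begin reading records from this virtual offset.
     */
    Long firstBoundaryAtOrAfter(long fileOffset) {
        return virtualOffsets.ceiling(fileOffset << 16);
    }
}
```

A distributed reader would call `firstBoundaryAtOrAfter` with the start of each file split to find where its first complete read begins.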
lbergelson pushed a commit that referenced this issue May 15, 2018
* make a previously package protected method public and add a new overload
* partially addresses #1112
lbergelson pushed a commit that referenced this issue May 25, 2018
…ffer (#1121)

* Previously seeking on a SeekableBufferedStream would always perform a seek on the underlying stream and create a new buffer afterwards.
* Now, any seek that falls within the existing buffer moves within the buffer without performing an expensive seek on the underlying stream. This improves performance when making many repeated seek calls that land close to each other.
* part of #1112
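
The buffering behaviour described in this commit can be sketched roughly as follows. This is not the htsjdk implementation: `SeekableSource`, `BufferedSeekableReader` and `CountingByteArraySource` are hypothetical names, and the sketch only shows the idea that a seek landing inside the buffered range does not touch the underlying stream.

```java
import java.io.IOException;

/** Hypothetical minimal seekable source; htsjdk's SeekableStream has a richer API. */
interface SeekableSource {
    void seek(long position) throws IOException;
    int read(byte[] dst, int off, int len) throws IOException;
}

/** Buffered reader whose seek() is free when the target is already buffered. */
final class BufferedSeekableReader {
    private final SeekableSource source;
    private final byte[] buffer;
    private long bufferStart = 0; // position in the source of buffer[0]
    private int bufferLength = 0; // number of valid bytes in the buffer
    private long position = 0;    // logical read position

    BufferedSeekableReader(SeekableSource source, int bufferSize) {
        this.source = source;
        this.buffer = new byte[bufferSize];
    }

    /** Only records the new position; the source is touched lazily in read(). */
    void seek(long newPosition) {
        position = newPosition;
    }

    /** Returns the next byte (0-255), or -1 at end of stream. */
    int read() throws IOException {
        boolean buffered = position >= bufferStart && position < bufferStart + bufferLength;
        if (!buffered) {
            // Outside the buffered range: seek the source and refill the buffer.
            source.seek(position);
            int n = source.read(buffer, 0, buffer.length);
            bufferStart = position;
            bufferLength = Math.max(n, 0);
            if (bufferLength == 0) {
                return -1; // no data available: treat as end of stream
            }
        }
        return buffer[(int) (position++ - bufferStart)] & 0xFF;
    }
}

/** In-memory source that counts seeks, for demonstration only. */
final class CountingByteArraySource implements SeekableSource {
    private final byte[] data;
    private long pos = 0;
    int seekCount = 0;

    CountingByteArraySource(byte[] data) {
        this.data = data;
    }

    @Override
    public void seek(long position) {
        seekCount++;
        pos = position;
    }

    @Override
    public int read(byte[] dst, int off, int len) {
        if (pos >= data.length) {
            return -1;
        }
        int n = (int) Math.min(len, data.length - pos);
        System.arraycopy(data, (int) pos, dst, off, n);
        pos += n;
        return n;
    }
}
```

With a 16-byte buffer, seeking to 10, then 20, then 12 costs a single seek on the source, because the second and third targets fall inside the range buffered by the first read. For simplicity the sketch also re-seeks the source on a sequential refill, which a real implementation would likely avoid.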
lbergelson pushed a commit that referenced this issue Jun 5, 2018
* Add an optimised version of CramContainerIterator that only reads the header for each container.
* addresses part of #1112
lbergelson added a commit that referenced this issue Jun 14, 2018
* adding a new class VCFHeaderReader to read a VCFHeader from a stream without knowing in advance if the file is VCF or BCF, or compressed or not.
* part of #1112
lbergelson pushed a commit that referenced this issue Jun 15, 2018
* added a new method CRAMFileReader.createIndexIterator to get reads overlapping intervals in a CRAM file, analogous to BAMFileReader#createIndexIterator
* part of #1112
lbergelson pushed a commit that referenced this issue Jun 18, 2018
* Add a method to open a ReferenceSequence by passing a FASTA and its index as SeekableStreams
* This is useful for clients that are using filesystems that don't have an nio.Path provider available but can produce a stream
* part of #1112
tomwhite added a commit to tomwhite/disq-original that referenced this issue Jun 25, 2018