bgzip text compression mode #1369
I like this change. Although I expect it was accidental, being forced down hopen vs open in order to get kgetline functionality, it's worth noting in the commit message that this also has the side effect of supporting URIs as the input filename to bgzip. A worthy addition in its own right. There is a bug however when given data that doesn't end on a newline. A newline is added. Eg:
I think the Would you mind also updating the man page? (I can do this instead if you prefer as it's a pretty minor addition.) It would need documenting the heuristics handling special header symbols used too, as this surprised me when reading the code. I can see why you want to include it though. Similarly a simple test adding to test/test.pl. Edit: also maybe a warning about binary data. If it has a nul character then it'll be truncated, due to the way kgetline works. So -n on a binary file will silently trash the data. That's not ideal, albeit inappropriate usage. At least the man page should be explicit that -n behaviour is undefined on non-text documents. Ideally it'd work in all cases, but that's probably a challenging operation unless we replace kgetline with something different. |
@jkbonfield Thanks, great feedback and great point about the last line. I worry that It looks to me like Acknowledge the need for man & tests -- I'm first going to noodle on these technical issues though. |
If something claiming to be a text file does not end in a newline, is it really a bug to emit one? The two representations are surely equivalent for a text file. (This is certainly the attitude of C compilers and Unix, which don't really believe in text files without a final newline character — cf POSIX 3.403 Text File and 3.206 Line. Windows users may have a different view.) What is the mnemonic for using |
That's one possible solution. I am a bit uneasy given I could imagine us getting bug reports for files not round-tripping correctly. Given as I understand it the intention is simply to permit BGZF block boundaries to occur after newline characters, IMO it would be preferable if that's the only impact it had. However we could just document it as only working on valid text files, ending in a newline and containing no binary data (UTF8 permitted, but not nuls). |
WDYT @jkbonfield @jmarshall? (I've not forgotten about man & tests.)
Upon further reflection, James is of course correct that not preserving the characters exactly would be a recipe for bug reports and trouble. So the better approach for this is to aim to flush blocks after newlines, and always produce identical contents even given non-text binary data. At that point, this is just a compression knob (the extra flushes reduce the compression slightly) rather than a major mode of operation. So instead of having a Applying this text mode automatically would mean that many more .vcf.gz etc files will have this technique used than would if it needed to be specified by users and scripts on the command line — which is a good thing. (We might want to have a |
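To make the "flush after newlines, but keep the bytes identical" idea concrete, here is a minimal Python sketch. It is not the htslib implementation: `split_at_newlines` is a hypothetical helper, and the `0xff00` payload limit is an assumption for illustration. Each block is cut at the last newline that fits, a block with no newline is simply emitted at full size, and nothing is ever added or dropped, so NUL bytes and a missing final newline survive.

```python
BLOCK_SIZE = 0xff00  # assumed per-block payload limit, for illustration only


def split_at_newlines(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Cut data into blocks of at most block_size bytes, ending after a newline when possible."""
    blocks = []
    pos = 0
    while pos < len(data):
        end = min(pos + block_size, len(data))
        if end < len(data):
            # Backward scan for the last newline inside the candidate block.
            nl = data.rfind(b"\n", pos, end)
            if nl != -1:
                end = nl + 1  # break just after that newline
            # No newline found: the line overflows the block, so cut at block_size.
        blocks.append(data[pos:end])
        pos = end
    return blocks


if __name__ == "__main__":
    sample = b"a\nbb\nccc\n"
    print(split_at_newlines(sample, 4))
```

Because the blocks are plain slices of the input, joining them always reproduces the input exactly, which is the property that avoids round-trip bug reports.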
Related to this, I do wonder what the implementation would look like if, instead of using any form of |
I've been experimenting with adding explicit flush calls, as the simplest option (to-do: headers). It's trivial and seems to have minimal CPU cost.
I have a hack of bgzf and a bgzip debug binary that uses it to report the first 5 and last 5 chars in each block. Sure enough it's very noticeable:
We could likely do the same with fasta/fastq too. Next up though is seeing whether we know on output files whether they're text or not. In htslib's usage via an htsFile this is doable. However it won't solve this particular PR which I think is a separate issue. The bgzip utility doesn't use htsFile. It's driving bgzf natively. Of course it could be wrapped up to use htsFile, but firstly I don't know if it then handles unrecognised formats (eg any random data stream), and even if so there aren't any format-agnostic hts_write functions which can do the necessary flush logic. So I think we need two changes here. I'll look at the format specific side of things. (That said, bgzip should probably still use htsFile for input, even if it's then ignoring the higher level structure components and is driving |
This makes bgzipped SAM, VCF, FASTA and FASTQ start blocks on a new record (except for the case of a single record being too large to fit in a single block). It is a companion PR to samtools#1369
Looking at this again, I see one minor glitch due to the option rename:
With that change it appears to work, although I want to do more stress testing to look for corner cases. Listening to @jmarshall's comments above I did try replacing the kstring code with something to use This replacement does work, and it's much less overhead:
At -l0 the old code needed a full 92% more instructions; the replacement reduces that overhead to just +2%. However, there are other factors to consider:
That said, I suspect there are basic optimisations that can be done to the PR as-is to improve the overhead which may be worth doing, such as the |
Thanks @jkbonfield, I've not had a chance to revisit this in some time, but really appreciate your looking into it.
Compressing with -a/--text promotes alignment of BGZF blocks with the uncompressed text lines. BGZF blocks start at the beginning of an input line and end after some subsequent newline (except when the block's first line overflows the BGZF block size). This ensures it's possible to specify byte ranges of a BGZF file that decompress into complete text records -- useful for parallel processing and "slicing" from remote servers.
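As a rough illustration of why line-aligned blocks help with slicing, the Python sketch below compresses each line-aligned chunk as an independent gzip member and then decompresses a single compressed byte range on its own. Plain gzip members stand in for real BGZF blocks here (an assumption; BGZF adds an extra header field), and `line_aligned_members`, `member_ranges` and the `0xff00` limit are hypothetical names and values, not htslib API.

```python
import gzip


def line_aligned_members(data: bytes, block_size: int = 0xff00) -> list[bytes]:
    """Compress data as independent gzip members whose payloads end after a newline when possible."""
    members = []
    pos = 0
    while pos < len(data):
        end = min(pos + block_size, len(data))
        if end < len(data):
            nl = data.rfind(b"\n", pos, end)
            if nl != -1:
                end = nl + 1
        members.append(gzip.compress(data[pos:end]))
        pos = end
    return members


def member_ranges(members: list[bytes]) -> list[tuple[int, int]]:
    """Return the (start, end) compressed byte range of each member in the concatenated file."""
    ranges = []
    start = 0
    for m in members:
        ranges.append((start, start + len(m)))
        start += len(m)
    return ranges


if __name__ == "__main__":
    text = b"".join(b"record %d\n" % i for i in range(20))
    members = line_aligned_members(text, 64)
    blob = b"".join(members)
    start, end = member_ranges(members)[1]
    # A single byte range decompresses to whole records only.
    print(gzip.decompress(blob[start:end]))
```

A reader that knows the compressed ranges can hand each one to a separate worker or fetch it with an HTTP range request, and every worker sees complete records.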
We don't validate it does break at newlines as that needs low level Deflate stream processing, but do a basic round-trip test.
I've squashed and rebased it off current develop (including the one char fix for -a vs -n), added a minimal test (just duplicated the round-trip but with -a) and updated the man page. This can be seen in my copy of your PR at https://github.com/jkbonfield/htslib/tree/bgzip-text-mode-1 If you're happy with this squashing and the extra commit then I can force to your branch so this PR updates. (It'll likely miss the upcoming release, but we should be able to then finish reviewing and merge early in next release cycle.) |
@jkbonfield thanks, I reset my branch to yours.
Re option letters, see also the comments in #1369 (comment) re detecting text/binary and using this text mode automatically as appropriate (probably with --binary to override it explicitly). Using text mode automatically would mean more text files would get this treatment.
The other advantage of the strrchr-equivalent version is that you don't have to invent yet another getline variant. Headers can be incorporated by a mode that scans forward for newlines while lines start with the header character. Then the mode switches to the strrchr mode when it sees a non-header character at the start of a line.
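The forward-scanning header mode described above might look roughly like the Python sketch below. It only locates where the header ends; everything from that offset onwards would then go through the backward (strrchr-style) scan. `header_end` is a hypothetical helper, and `#` as the header character is an assumption chosen for illustration.

```python
def header_end(data: bytes, header_char: bytes = b"#") -> int:
    """Return the offset of the first line that does not start with header_char."""
    pos = 0
    # Forward mode: walk line by line while each line starts with the header character.
    while pos < len(data) and data.startswith(header_char, pos):
        nl = data.find(b"\n", pos)
        if nl == -1:
            return len(data)  # unterminated final header line: all of the data is header
        pos = nl + 1
    # First non-header line found (or end of data): switch to the backward-scan mode here.
    return pos


if __name__ == "__main__":
    sample = b"##meta\n#cols\nrec1\nrec2\n"
    cut = header_end(sample)
    print(sample[:cut], sample[cut:])
```

This also shows the cost mentioned in the thread: the header region is scanned forwards line by line, while the body needs only one backward scan per block.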
Good point on automatic detection. As for strrchr, that itself is a bad function as it needs a null terminated string and it scans forward first before scanning backwards, as it's just a plain dumb API! However rolling your own is easy (as I did). I'm well aware of how we could do headers, but as I mentioned it's basically doubling up the work as we have a scan-forward variant followed by a scan-backward variant, and some trickery in the middle to switch. Doable and I have something part way already, but debatable if it's really worth the effort and extra code to maintain. I'll explore some more though to see just how messy it ends up looking.
Compressing with -n/--text promotes alignment of BGZF blocks with the uncompressed text lines. BGZF blocks start at the beginning of an input line and end after some subsequent newline (except when the block's first line overflows the BGZF block size). This ensures it's possible to specify byte ranges of a BGZF file that decompress into complete text records -- useful for parallel processing and "slicing" from remote servers.
Inspiration: bam_write1() uses the same trick to align BGZF blocks with BAM records.