Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bgzip text compression mode #1369

Closed
wants to merge 2 commits into from
Closed

Conversation

mlin
Copy link
Member

@mlin mlin commented Jan 2, 2022

Compressing with -n/--text promotes alignment of BGZF blocks with the uncompressed text lines. BGZF blocks start at the beginning of an input line and end after some subsequent newline (except when the block's first line overflows the BGZF block size).

This ensures it's possible to specify byte ranges of a BGZF file that decompress into complete text records -- useful for parallel processing and "slicing" from remote servers.

Inspiration: bam_write1() uses the same trick to align BGZF blocks with BAM records.

@jkbonfield
Copy link
Contributor

jkbonfield commented Jan 5, 2022

I like this change. Although I expect it was accidental, being forced down hopen vs open in order to get kgetline functionality, it's worth noting in the commit message that this also has the side effect of supporting URIs as the input filename to bgzip. A worthy addition in its own right.

There is a bug however when given data that doesn't end on a newline. A newline is added. Eg:

$ echo -n foo | ./bgzip | zcat |hd
00000000  66 6f 6f                                          |foo|
00000003

$ echo -n foo | ./bgzip -n | zcat |hd
00000000  66 6f 6f 0a                                       |foo.|
00000004

I think the kputc('\n', &line); statement needs an eof check (eg if (!f_src->at_eof)).

Would you mind also updating the man page? (I can do this instead if you prefer as it's a pretty minor addition.) It would need documenting the heuristics handling special header symbols used too, as this surprised me when reading the code. I can see why you want to include it though.

Similarly a simple test adding to test/test.pl.

Edit: also maybe a warning about binary data. If it has a nul character then it'll be truncated, due to the way kgetline works. So -n on a binary file will silently trash the data. That's not ideal, albeit inappropriate usage. At least the man page should be explicit that -n behaviour is undefined on non-text documents. Ideally it'd work in all cases, but that's probably a challenging operation unless we replace kgetline with something different.

@mlin
Copy link
Member Author

mlin commented Jan 6, 2022

@jkbonfield Thanks, great feedback and great point about the last line. I worry that if (!f_src->at_eof) does not distinguish the two cases though. kgetline() consumes any newline and trims it afterwards, so I think at_eof will indicate regardless. Maybe at EOF I need to htell() and test if it's off-by-one from the total bytes I got from kgetline() plus number of lines. Yikes...

It looks to me like kgetline2() handles the nul character case, but otherwise it has the same issue at EOF. (Its purpose had been mysterious to me until your comment -- thank you!) And the memory usage could become erratic if the binary data happens not to use newline characters for long stretches.

Acknowledge the need for man & tests -- I'm first going to noodle on these technical issues though.

@jmarshall
Copy link
Member

jmarshall commented Jan 6, 2022

If something claiming to be a text file does not end in a newline, is it really a bug to emit one? The two representations are surely equivalent for a text file. (This is certainly the attitude of C compilers and Unix, which don't really believe in text files without a final newline character — cf POSIX 3.403 Text File and 3.206 Line. Windows users may have a different view.)

What is the mnemonic for using -n for this? Did you consider using -T?

@jkbonfield
Copy link
Contributor

That's one possible solution.

I am a bit uneasy given I could imagine us getting bug reports for files not round-tripping correctly. Given as I understand it the intention is simply to permit BGZF block boundaries to occur after newline characters, IMO it would be preferable if that's the only impact it had.

However we could just document it as only working on valid text files, ending in a newline and containing no binary data (UTF8 permitted, but not nuls).

@mlin
Copy link
Member Author

mlin commented Jan 8, 2022

  1. I've factored a kgetline3() out of kgetline2() where the former includes the line terminator, if present. kgetline2() now just calls this kgetline3(), then trims the terminator as before.
  2. Using kgetline3() instead of kgetline() should address both the end-of-file and inline-null problems.
  3. Added a comment to kgetline() mentioning the inline-null caveat.
  4. -T would be okay but I'm not a fan of unrelated case-sensitive short flags (nightmares about vg's CLI!). -n was meant to evoke \n but I admit that's cheesy. I've now tried -a seeing that GNU gzip has an -a flag, which does something different but at least vaguely related. But then maybe it should understand the corresponding long form --ascii too?

WDYT @jkbonfield @jmarshall ? (I've not forgotten about man & tests.)

@jmarshall
Copy link
Member

Upon further reflection, James is of course correct that not preserving the characters exactly would be a recipe for bug reports and trouble. So the better approach for this is to aim to flush blocks after newlines, and always produce identical contents even given non-text binary data.

At that point, this is just a compression knob (the extra flushes reduce the compression slightly) rather than a major mode of operation. So instead of having a --text command line option at all, it could take advantage of using hFILE by using hts_detect_format to select this textual approach or the pure binary approach automatically as appropriate.

Applying this text mode automatically would mean that many more .vcf.gz etc files will have this technique used than would if it needed to be specified by users and scripts on the command line — which is a good thing. (We might want to have a --binary option to always just use the pure binary approach. Sadly the good short option letters for that are also already taken…)

@jmarshall
Copy link
Member

And the memory usage could become erratic if the binary data happens not to use newline characters for long stretches.

Related to this, I do wonder what the implementation would look like if, instead of using any form of getline, it used the same 64K buffer as the pure binary approach and used strrchr() to search backwards from the end for \n characters. There's no need to process every line separately… (though setting off the header lines would be a bit harder…)

@jkbonfield
Copy link
Contributor

jkbonfield commented Jan 20, 2022

I've been experimenting with adding explicit flush calls, as the simplest option (to-do: headers). It's trivial and seems minimal of CPU cost.

diff --git a/sam.c b/sam.c
index 393a3b2..23da760 100644
--- a/sam.c
+++ b/sam.c
@@ -4348,6 +4348,8 @@ int sam_write1(htsFile *fp, const sam_hdr_t *h, const bam1_t *b)
             if (sam_format1(h, b, &fp->line) < 0) return -1;
             kputc('\n', &fp->line);
             if (fp->is_bgzf) {
+                if (bgzf_flush_try(fp->fp.bgzf, fp->line.l) < 0)
+                    return -1;
                 if ( bgzf_write(fp->fp.bgzf, fp->line.s, fp->line.l) != fp->line.l ) return -1;
             } else {
                 if ( hwrite(fp->fp.hfile, fp->line.s, fp->line.l) != fp->line.l ) return -1;
diff --git a/vcf.c b/vcf.c
index 0d59be6..3d0643e 100644
--- a/vcf.c
+++ b/vcf.c
@@ -3275,10 +3275,13 @@ int vcf_write(htsFile *fp, const bcf_hdr_t *h, bcf1_t *v)
     fp->line.l = 0;
     if (vcf_format1(h, v, &fp->line) != 0)
         return -1;
-    if ( fp->format.compression!=no_compression )
+    if ( fp->format.compression!=no_compression ) {
+        if (bgzf_flush_try(fp->fp.bgzf, fp->line.l) < 0)
+            return -1;
         ret = bgzf_write(fp->fp.bgzf, fp->line.s, fp->line.l);
-    else
+    } else {
         ret = hwrite(fp->fp.hfile, fp->line.s, fp->line.l);
+    }
 
     if (fp->idx) {
         int tid;

I have a hack of bgzf and a bgzip debug binary that uses it to report the first 5 and last 5 chars in each block. Sure enough it's very noticable:

@ [samtools.../htslib]; bgzip -d < /tmp/a0.vcf.gz 2>&1 > /dev/null  | head
Block ##fil ... 	CCAC
Block A	222 ... .	DP=
Block 32;VD ... ;AN=2
Block ;DP4= ... ,18,2
Block 8;MQ= ... =0.25
Block 1315; ... S=0;M
Block Q0F=0 ... QBZ=0
Block ;MQSB ... :193,
Block 0,255 ... 18;FS
Block =0;MQ ... 3,16;

@ [samtools.../htslib]; bgzip -d < /tmp/a1.vcf.gz 2>&1 > /dev/null  | head
Block ##fil ... ,255

Block 1	568 ... 23,0

Block 1	852 ... ,255

Block 1	985 ... 96,0

Block 1	129 ... ,215


-rw-r--r-- 1 jkb team117 2958781 Jan 20 15:54 /tmp/a0.vcf.gz
-rw-r--r-- 1 jkb team117 2958560 Jan 20 16:15 /tmp/a1.vcf.gz

-rw-r--r-- 1 jkb team117 11148963 Jan 20 15:53 /tmp/a0.sam.gz
-rw-r--r-- 1 jkb team117 11152924 Jan 20 15:51 /tmp/a1.sam.gz

We could likely do the same with fasta/fastq too. Next up though is seeing whether we know on output files whether they're text or not. In htslib's usage via an htsFile this is doable. fp->format indicates it's a vcf, although it also supports generics such as text_format. So we could detect there.

However it won't solve this particular PR which I think is a separate issue. The bgzip utility doesn't use htsFile. It's driving bgzf natively. Of course it could be wrapped up to use htsFile, but firstly I don't know if it then handles unrecognised formats (eg any random data stream), and even if so there aren't any format-agnostic hts_write functions which can do the necessary flush logic. So I think we need two changes here. I'll look at the format specific side of things.

(That said, bgzip should probably still use htsFile for input, even if it's then ignoring the higher level structure components and is driving fp->fp.bgzf at the low level again, as it's a convenient way of querying the input format.)

jkbonfield added a commit to jkbonfield/htslib that referenced this pull request Jan 24, 2022
This makes bgzipped SAM, VCF, FASTA and FASTQ start blocks on a new
record (except for the case of a single record being too large to fit
in a single block).

It is a companion PR to samtools#1369
@jkbonfield jkbonfield mentioned this pull request Jan 24, 2022
daviesrob pushed a commit to daviesrob/htslib that referenced this pull request Feb 25, 2022
This makes bgzipped SAM, VCF, FASTA and FASTQ start blocks on a new
record (except for the case of a single record being too large to fit
in a single block).

It is a companion PR to samtools#1369
daviesrob pushed a commit that referenced this pull request Feb 25, 2022
This makes bgzipped SAM, VCF, FASTA and FASTQ start blocks on a new
record (except for the case of a single record being too large to fit
in a single block).

It is a companion PR to #1369
@jkbonfield
Copy link
Contributor

jkbonfield commented Aug 4, 2022

Looking at this again, I see one minor glitch due to the option rename:

diff --git a/bgzip.c b/bgzip.c
index c7d61b5..0d51c4a 100644
--- a/bgzip.c
+++ b/bgzip.c
@@ -173,7 +173,7 @@ int main(int argc, char **argv)
         case '@': threads = atoi(optarg); break;
         case 't': test = 1; compress = 0; reindex = 0; break;
         case 'k': keep = 1; break;
-        case 'n': text = 1; break;
+        case 'a': text = 1; break;
         case 1:

With that change it appears to work, although I want to do more stress testing to look for corner cases.

Listening to @jmarshall's comments above I did try replacing the kstring code with something to use hread direct and a strrchr equivalent: strrchr doesn't have an end limit and it scans forward all the way so it's inefficient, so I use a proper reverse mechanism.

This replacement does work, and it's much less overhead:

bgzip -l1 -c ecoli60.sam

          cycles        inst            bytes                                                       
no -a     14781612438   24837545274     152298241                                                   
with -a   15327935908   26137789859     152270394                                                   
          +3.7%         +5.2%                                                                       
-a new    14882182612   24875449321     152267214                                                   
          +0.7%         +0.2%                                                                       

At -l0 the old code was a full 92% more instructions, reduced to just +2%.

However, there are other factors to consider:

  1. My replacement doesn't yet support headers. As previously noted It'd need to do line-by-line reading forward to find the first non-header line and then reverse searching for the last newline for data components, so it's basically doubling the complexity with two different implementations present.
  2. I only have \n and not \n\r detection. It gets a little messy when the \n and \r may span a block boundary, as then maybe we need to find the previous \n\r instead or treat it as not-fitting). Lots of boundary cases which kgetline already solves. (Edit: I'm being dumb - it's \r\n, so works fine as is)
  3. When we use the default compression level, the cycle difference was under 0.1% with Mike's code, so trying to optimise this is probably overkill.

That said, I suspect there are basic optimisations that can be done to the PR as-is to improve the overhead which may be worth doing, such as the !strchr("@#", *line.s).

@mlin
Copy link
Member Author

mlin commented Aug 5, 2022

Thanks @jkbonfield, I've not had a chance to revisit this in some time, but really appreciate your looking into it

mlin and others added 2 commits August 5, 2022 14:48
Compressing with -a/--text promotes alignment of BGZF blocks with the uncompressed text lines. BGZF blocks start at the beginning of an input line and end after some subsequent newline (except when the block's first line overflows the BGZF block size).

This ensures it's possible to specify byte ranges of a BGZF file that decompress into complete text records -- useful for parallel processing and "slicing" from remote servers.
We don't validate it does break at newlines as that needs low level
Deflate stream processing, but do a basic round-trip test.
@jkbonfield
Copy link
Contributor

I've squashed and rebased it off current develop (including the one char fix for -a vs -n), added a minimal test (just duplicated the round-trip but with -a) and updated the man page.

This can be seen in my copy of your PR at https://github.com/jkbonfield/htslib/tree/bgzip-text-mode-1

If you're happy with this squashing and the extra commit then I can force to your branch so this PR updates. (It'll likely miss the upcoming release, but we should be able to then finish reviewing and merge early in next release cycle.)

@mlin mlin force-pushed the bgzip-text-mode branch from 8395d89 to a01393c Compare August 6, 2022 08:44
@mlin
Copy link
Member Author

mlin commented Aug 6, 2022

@jkbonfield thanks, I reset my branch to yours

@jmarshall
Copy link
Member

Re option letters, see also the comments in #1369 (comment) re detecting text/binary and using this text mode automatically as appropriate (probably with --binary to override it explicitly). Using text mode automatically would mean more text files would get this treatment.

@jmarshall
Copy link
Member

The other advantage of the strrchr-equivalent version is that you don't have to invent yet another getline variant. Headers can be incorporated by a mode that scans forward for newlines while lines start with the header character. Then the mode switches to the strrchr mode when it sees a non-header character at the start of a line.

@jkbonfield
Copy link
Contributor

Good point on automatic detection.

As for strrchr, that itself is a bad function as it needs a null terminated string and it scans forward first before scanning backwards, as it's just a plain dumb API! However rolling your own is easy (as I did). I'm well aware of how we could do headers, but as I mentioned it's basically doubling up the work as we have a scan-forward variant followed by a scan-backward variant, and some trickery in the middle to switch. Doable and I have something part way already, but debateable if it's really worth the effort and extra code to maintain. I'll explore some more though to see just how messy it ends up looking.

@mlin mlin closed this Sep 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants