Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using bgzf_read_small() breaks subsequent bgzf_useek() calls #1798

Closed
jmarshall opened this issue Jun 28, 2024 · 4 comments · Fixed by #1800
Closed

Using bgzf_read_small() breaks subsequent bgzf_useek() calls #1798

jmarshall opened this issue Jun 28, 2024 · 4 comments · Fixed by #1800

Comments

@jmarshall
Copy link
Member

jmarshall commented Jun 28, 2024

PR #1772 added a simplified inline version of bgzf_read() that uses inline code when the request can be satisfied directly from the buffer, otherwise punts to the real bgzf_read(). Very laudable. However I encountered seek problems when using it.

I rederived the inline function by starting with the code of the full bgzf_read(), assuming the invariant that length < fp‑>block_length - fp‑>block_offset, and simplifying the code accordingly. I ended up with a function that is fairly similar to bgzf_read_small() as added to htslib/bgzf.h but with some additions:

static inline ssize_t bgzf_read_small(BGZF *fp, void *data, size_t length) {
     // A block length of 0 implies current block isn't loaded (see
     // bgzf_seek_common).  That gives negative available so careful on types
     if ((ssize_t)length < fp->block_length - fp->block_offset) {
         // Short cut the common and easy mode
         memcpy((uint8_t *)data,
                (uint8_t *)fp->uncompressed_block + fp->block_offset,
                length);
         fp->block_offset += length;
+
+        // For raw gzip streams this avoids short reads.
+        if (fp->block_offset == fp->block_length) {
+            fp->block_address = bgzf_htell(fp);
+            fp->block_offset = fp->block_length = 0;
+        }
+
+        fp->uncompressed_address += length;
+
         return length;
     } else {
         return bgzf_read(fp, data, length);
     }
 }

Because bgzf_read_small() does not currently update fp‑>uncompressed_address, subsequent bgzf_useek() calls may jump to the wrong location. And probably other functions use fp‑>uncompressed_address too and are affected.

The if block looks harder to deal with, because it calls bgzf_htell() which is private within bgzf.c. However it turns out that the invariant implies that this if will never be true, so in fact I should have simplified it away to nothing too. Phew.

So bgzf_read_small() just needs fp->uncompressed_address += length; added to it to make it equivalent to bgzf_read().

(I have not analysed bgzf_write_small() to see if it has any similar infelicities. I checked and believe there are no similar problems in bgzf_write_small().)

@daviesrob
Copy link
Member

I'd checked that the fp->block_address update wasn't needed while going over the PR, but evidently missed fp->uncompressed_address. What did you do to spot this? As the test harness currently passes with the incorrect change, it could do with extending to cover this problem.

@jkbonfield
Copy link
Contributor

Oops. I was sure I'd got both of those incremented, but I confess I had many alternative versions while benchmarking and I guess I lost something along the way. Sorry.

I'll try and get a test that fails before applying the fix. Thanks for reporting it.

@jmarshall
Copy link
Member Author

jmarshall commented Jul 1, 2024

I spotted this while changing fai_retrieve() from bgzf_useek+bgzf_getcxLOTS to bgzf_useek + bgzf_read_smallxSOME — see PR #1799. As that PR mixes bgzf_useek and bgzf_read_small, with it applied there are already faidx test cases that fail indirectly due to this problem. But it would be good to test it directly too, rather than indirectly. The sequence of events is basically:

bgzf_read_small(…, buf1, 5);  // read five characters
bgzf_useek(…, 10); // seek to offset 10
bgzf_read_small(…, buf2, 3); // read another few characters; this one could be bgzf_read()
assert(buf2 is the three characters at offset 1012);

This direct test case will fail because it will have read three characters from a different offset.

jkbonfield added a commit to jkbonfield/htslib that referenced this issue Jul 1, 2024
This bug crept in with samtools#1772 which was added since last release, so
there is no regression.

Fixes samtools#1798 with thanks to John Marshall
@jkbonfield
Copy link
Contributor

I was already working on something similar in test_bgzf.c which I've now added to verify I could trigger the bug, and then also applied the trivial one line fix. Thanks.

jkbonfield added a commit to jkbonfield/htslib that referenced this issue Jul 1, 2024
This bug crept in with samtools#1772 which was added since last release, so
there is no regression.

Fixes samtools#1798 with thanks to John Marshall
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants