Reading self-terminating streams #28

joshuawarner32 · 2015-10-30T04:07:04Z

First off, awesome library, thanks!

My use-case: reading git pack files, which contain multiple concatenated, yet undelimited, zlib/deflate streams of data. In other words, the pack files don't contain any information about the length of the streams - so to detect where the start of the next object in the pack is, I have to know exactly how many bytes the underlying deflate implementation consumed, so I can know where in the underlying file to start reading the next segment of data from.

This poses two separate problems:

1. Constructing a ZlibDecoder takes ownership of the underlying file stream, which means I have to re-open the pack file to read the next object from it. This is an annoyance, but it works fine for my scenario.
2. I need to get the number of bytes the ZlibDecoder consumed (which is likely less than the number of bytes it read off the underlying stream, due to buffering).

In an ideal world, I think ZlibDecoder would only take a &mut reference to the underlying Read, and by some magic, when it's destroyed, it leaves that underlying stream positioned at the exact end of the zlib data. This will likely involve requiring the underlying stream to also implement Seek, which in turn either requires code duplication in the API (so far as I am aware), or imposes unwanted restrictions on everyone else. I don't think that's realistic, so let's move on to option 2:

Make ZlibDecoder take a &mut reference to the underlying Read. It does nothing special when it's destroyed, but it exposes an extra zlibDecoder.consumed_bytes() method (or field, or whatever), calculated from the total bytes it's read, less what's remaining in miniz's input buffer.
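To make the bookkeeping in option 2 concrete, here is a minimal sketch using only the standard library. Everything in it is hypothetical: `ToyDecoder` is a stand-in whose "stream" simply ends at the first zero byte, and it is not the flate2 or miniz API. It over-reads in 8-byte chunks the way a buffered decoder would, then reports consumed bytes as total bytes read minus the unused leftover, which the caller uses to seek to the next object.

```rust
use std::io::{self, Cursor, Read, Seek, SeekFrom};

/// Toy stand-in for a decoder (hypothetical, not flate2): a "stream"
/// ends at the first 0 byte, so it is self-terminating like deflate.
struct ToyDecoder<R: Read> {
    inner: R,
    total_read: u64, // bytes pulled from the underlying reader
    leftover: u64,   // bytes pulled but not part of the decoded stream
}

impl<R: Read> ToyDecoder<R> {
    fn new(inner: R) -> Self {
        ToyDecoder { inner, total_read: 0, leftover: 0 }
    }

    /// Reads in fixed-size chunks, deliberately over-reading past the
    /// terminator, and returns the payload before it.
    fn decode(&mut self) -> io::Result<Vec<u8>> {
        let mut out = Vec::new();
        let mut chunk = [0u8; 8];
        loop {
            let n = self.inner.read(&mut chunk)?;
            if n == 0 {
                return Err(io::ErrorKind::UnexpectedEof.into());
            }
            self.total_read += n as u64;
            if let Some(end) = chunk[..n].iter().position(|&b| b == 0) {
                out.extend_from_slice(&chunk[..end]);
                // Everything after the terminator was read but not consumed.
                self.leftover = (n - end - 1) as u64;
                return Ok(out);
            }
            out.extend_from_slice(&chunk[..n]);
        }
    }

    /// Total bytes read, less what is still sitting unused in the buffer.
    fn consumed_bytes(&self) -> u64 {
        self.total_read - self.leftover
    }
}

/// Decodes every concatenated stream in `data`, seeking the underlying
/// reader to the exact start of the next stream each time.
fn read_all(data: &[u8]) -> io::Result<Vec<Vec<u8>>> {
    let mut file = Cursor::new(data);
    let mut objects = Vec::new();
    let mut start = 0u64;
    while start < data.len() as u64 {
        file.seek(SeekFrom::Start(start))?;
        let mut dec = ToyDecoder::new(&mut file);
        objects.push(dec.decode()?);
        start += dec.consumed_bytes();
    }
    Ok(objects)
}
```

The point of the sketch is only the arithmetic: the decoder may read past the end of its stream, so the caller needs the consumed count, not the read count, to find the next object.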

The last option for me is to scrap the higher-level API and directly use your miniz-sys bindings, which is ugly for me, but less ugly for everyone else.

Thoughts?


alexcrichton · 2015-10-30T05:06:24Z

Thanks for the report! This came up in the past with #14 (I think with even the same use case!), which eventually prompted the creation of the Decompress struct for dealing with raw in-memory decompression (i.e. no extra buffering). That being said, I can see where it's much nicer to use the stream API, so I'd be totally down for beefing it up!

First, although all streams and such have R: Read as a type parameter, &mut R also satisfies this which means you don't actually have to pass ownership of the file into the deflate streams. Instead you can pass a mutable reference and then once the deflate stream is destroyed you can continue to use the underlying stream (e.g. seek it to the right position and whatnot).
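A small standard-library sketch of why this works: `std::io` provides `impl<R: Read + ?Sized> Read for &mut R`, so any type generic over `R: Read` accepts a mutable borrow. `Prefix` below is a hypothetical wrapper standing in for a deflate stream type, and `demo` is an illustrative helper; neither is part of this library.

```rust
use std::io::{self, Cursor, Read, Seek, SeekFrom};

/// Hypothetical wrapper that is generic over `R: Read`, the same shape
/// as a stream type that would normally take the reader by value.
struct Prefix<R: Read> {
    inner: R,
}

impl<R: Read> Prefix<R> {
    fn new(inner: R) -> Self {
        Prefix { inner }
    }

    /// Reads exactly `n` bytes from the wrapped reader.
    fn take_exact(&mut self, n: usize) -> io::Result<Vec<u8>> {
        let mut buf = vec![0u8; n];
        self.inner.read_exact(&mut buf)?;
        Ok(buf)
    }
}

fn demo() -> io::Result<(Vec<u8>, Vec<u8>)> {
    // A Cursor stands in for the opened pack file.
    let mut file = Cursor::new(b"abcdef".to_vec());

    let first = {
        // Pass `&mut file`, not `file`: the wrapper borrows the reader.
        let mut wrapper = Prefix::new(&mut file);
        wrapper.take_exact(2)?
    }; // wrapper is dropped here, ending the borrow

    // The original reader is still ours: seek it and keep reading.
    file.seek(SeekFrom::Start(4))?;
    let mut rest = Vec::new();
    file.read_to_end(&mut rest)?;
    Ok((first, rest))
}
```

Because the wrapper only borrows the reader, there is no need to re-open the file between objects.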

I agree that dealing with Seek directly would be a little unfortunate, so I think a good way to move forward here would be to expose a method like consumed_bytes you mentioned. Does that sound ok for your use case?

joshuawarner32 · 2015-10-31T00:44:25Z

> I think with even the same use case.

Yep, indeed. I'm glad other people are thinking along the same lines; it means I'm not totally crazy :)

> First, although all streams and such have R: Read as a type parameter, &mut R also satisfies this which means you don't actually have to pass ownership of the file into the deflate streams.

Ah, perfect! Today I learned...

> Does that sound ok for your use case?

Yep, that'll do nicely. I'll take a stab at putting that together.

alexcrichton closed this as completed in 3147e02 Oct 31, 2015
