Reading Multiple Blocks #41

Open
iamlemec opened this issue Mar 14, 2019 · 10 comments
Comments

@iamlemec

I'm encountering a slight issue when using the library: I'm only getting the first block of bz2 files. In my case, the block size is 900k, and when reading off BzDecoder, I get a total of 900k then EOF (size zero reads thereafter).

This happens on both the crates.io release and master here on github. Am I confused here? Would appreciate any suggestions. Thanks!

@alexcrichton
Collaborator

This seems like it's likely a bug in this library, but unfortunately I wouldn't know where to start. Do you have sample files that I can help poke around at?

@iamlemec
Author

Ok, new clue. When preparing the example file, I realized that I had compressed my files with pbzip2 (parallel bzip2). If I compress things with regular bzip2, everything works fine and bzip2-rs yields the entire file. I wonder if it's that pbzip2 is using a different blocking scheme or some unusual options? The pbzip2 files decompress as expected with regular bunzip2.

This is probably kind of an edge case, so I would understand if it doesn't have high priority, but here is a sample file compressed with pbzip2: https://send.firefox.com/download/51320afe37/#ECAJud4iGH-7UMjJ7M8Pjw

@alexcrichton
Collaborator

Ah an interesting observation! I wonder if this has to do with a format that the bunzip2 tool specifically allows?

One thing we ran into with flate2-rs is that you can literally concatenate two compressed gzip files to create a new one, and then when decoding you'll decode both back-to-back. I wonder if that's what's happening here? Is pbzip2 creating concatenated compressed streams?

@iamlemec
Author

Yup, looks like it! Just grepping through the file, I'm seeing multiple BZ headers (which, appropriately for today, is BZh9 + 0x314159265359). If I change the logic in BzDecoder::read to only stop on EOF and keep going after StreamEnd, it works like a charm.
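The header check described above can be sketched in Rust using only the standard library. This is a rough heuristic, not part of bzip2-rs: the name `find_stream_headers` is made up for illustration, and the same ten bytes could in principle appear by chance inside compressed data. It looks for the 4-byte stream header (`BZh` plus a level digit `1` to `9`) immediately followed by the 6-byte block magic `0x314159265359`.

```rust
/// Magic bytes that open a bzip2 compressed block (0x314159265359).
const BLOCK_MAGIC: [u8; 6] = [0x31, 0x41, 0x59, 0x26, 0x53, 0x59];

/// Hypothetical helper: returns every byte offset where a stream header
/// ("BZh" + level digit '1'..='9') is directly followed by the block magic.
/// More than one hit suggests the file holds concatenated streams.
fn find_stream_headers(data: &[u8]) -> Vec<usize> {
    data.windows(10) // 4-byte stream header + 6-byte block magic
        .enumerate()
        .filter(|(_, w)| {
            w.starts_with(b"BZh")
                && (b'1'..=b'9').contains(&w[3])
                && w[4..] == BLOCK_MAGIC[..]
        })
        .map(|(offset, _)| offset)
        .collect()
}
```

Running this over the bytes of a file (for example, the result of `std::fs::read`) and getting more than one offset would be consistent with pbzip2 having written several streams back to back.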

@iamlemec
Author

Spoke too soon there. You need to call bzlib restart too. Unfortunately, this results in a large memory leak for me, and I know almost nothing about rust memory management, especially with ffi. Here's what I have so far: iamlemec@c0c501d

@alexcrichton
Collaborator

@iamlemec oh I think you can solve that by replacing self.data.restart(false) with self.data = Decompress::new(false)
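The control flow discussed in this thread (keep reading after a stream ends, and start each new stream with fresh decoder state) can be illustrated without bzip2 at all. The sketch below is a toy under stated assumptions: `ToyDecoder`, its length-prefixed "stream" format, and `decode_streams` are all hypothetical and are not the bzip2-rs API. Constructing a new `ToyDecoder` after each stream end stands in for the `Decompress::new(false)` replacement suggested above.

```rust
/// Result of one decode call, mirroring the Ok / StreamEnd distinction.
#[derive(Debug, PartialEq)]
enum Status {
    Ok,
    StreamEnd,
}

/// Hypothetical toy decoder: one "stream" is a length byte N followed by
/// N payload bytes. It only stands in for a real decompressor's state.
struct ToyDecoder {
    remaining: Option<usize>, // None until the length byte has been read
}

impl ToyDecoder {
    fn new() -> Self {
        ToyDecoder { remaining: None }
    }

    /// Consumes the part of `input` that belongs to the current stream,
    /// appends the payload to `out`, and returns (bytes consumed, status).
    fn decode(&mut self, input: &[u8], out: &mut Vec<u8>) -> (usize, Status) {
        let mut used = 0;
        if self.remaining.is_none() {
            match input.first() {
                Some(&n) => {
                    self.remaining = Some(n as usize);
                    used = 1;
                }
                None => return (0, Status::Ok),
            }
        }
        let left = self.remaining.unwrap();
        let take = left.min(input.len() - used);
        out.extend_from_slice(&input[used..used + take]);
        self.remaining = Some(left - take);
        used += take;
        let status = if take == left { Status::StreamEnd } else { Status::Ok };
        (used, status)
    }
}

/// With `multi == false`, stops at the first stream end (the behaviour
/// reported in this issue). With `multi == true`, keeps going until the
/// input is exhausted, replacing the decoder after every stream.
fn decode_streams(mut input: &[u8], multi: bool) -> Vec<u8> {
    let mut out = Vec::new();
    let mut decoder = ToyDecoder::new();
    while !input.is_empty() {
        let (used, status) = decoder.decode(input, &mut out);
        input = &input[used..];
        if status == Status::StreamEnd {
            if !multi {
                break;
            }
            // Fresh state for the next concatenated stream.
            decoder = ToyDecoder::new();
        }
    }
    out
}
```

For the two concatenated toy streams `[2, b'h', b'i', 3, b'y', b'o', b'u']`, `decode_streams(input, false)` yields only `hi`, while `decode_streams(input, true)` yields `hiyou`. Replacing the decoder wholesale, rather than resetting it in place, lets the old state be dropped normally when it goes out of scope.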

@iamlemec
Author

Works great now. Thanks!

@alexcrichton
Collaborator

@iamlemec were you thinking of sending a PR akin to MultiGzDecoder for this crate perhaps?

@iamlemec
Author

Definitely. I'll check out MultiGzDecoder and submit something analogous relatively soon.

@andreycizov
Contributor

Thanks @iamlemec for the initial work!

I have to admit, though, that I wasn't happy with the results: single-threaded decompression tops out at about 20MB/s for me, so it's still faster to transcode to gzip with pbzip2 first and then process the results with rust.

#44
