Explain parameters #111

Open
martindurant opened this issue Jan 20, 2023 · 3 comments

Comments

@martindurant

Thanks for putting this together! The kerchunk project will make great use of it.

I am still trying to get my head around how it works, given that "gzip/zlib streams are unsplittable" has been a mantra for a long time.

In this issue, however, I'd like to ask for more documentation around the arguments to IndexedGzipFile, and the tradeoffs they entail:

  • I understand spacing: the more points in the file you index, the better random seeks will tend to be (needing less scrolling), but the bigger the index file will get. I expect this can be any number up to the size of the target file, at which point seeking is equivalent to not using indexed_gzip at all
  • window_size: something to do with how much data is stored with each point? Can it be made small to keep the index file small, and what would be the downside of this? I don't seem to be able to pick just any number without getting a ZranError; is 2**15 the minimum, or is this file-dependent?
  • readbuf_size: if I know that I will always be reading an exact byte range every time or I implement buffering elsewhere, can this be zero?
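
To put rough numbers on the spacing/window_size tradeoff described in the list above, here is a back-of-the-envelope sketch. It assumes that spacing is measured in uncompressed bytes and that each index point stores one full window of uncompressed data; per-point metadata and any compression of the index file are ignored. The function name estimate_index_size is hypothetical and not part of indexed_gzip.

```python
import math

MIN_WINDOW_SIZE = 32768  # minimum window size quoted later in this thread


def estimate_index_size(uncompressed_size: int, spacing: int, window_size: int) -> int:
    """Rough estimate of index size in bytes (hypothetical helper).

    Assumes one window of `window_size` bytes is stored per index point,
    and one index point every `spacing` bytes of uncompressed data.
    """
    if window_size < MIN_WINDOW_SIZE:
        raise ValueError("window_size must be at least 32768 bytes")
    if spacing <= 0:
        raise ValueError("spacing must be positive")
    n_points = math.ceil(uncompressed_size / spacing)
    return n_points * window_size
```

Under these assumptions, halving the spacing doubles the estimated index size, while the amount of data that has to be decompressed to reach an arbitrary offset is bounded by roughly one spacing interval.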
@martindurant
Author

Ah, I found this comment:

    /*
     * Number of bytes of uncompressed data to store
     * for each index point. This must be a minimum
     * of 32768 bytes.
     */

@pauldmccarthy
Owner

Hi @martindurant, the readbuf_size is the size of the memory used by the zran module to read in from the compressed stream, so it can't be set to 0. Setting it to a small value will result in more back-and-forth between reading data and passing it to the zlib inflate function. Setting it to a larger value will result in fewer calls to inflate, at the cost of increased memory usage (the readbuf is allocated/destroyed on each decompression cycle).
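
The effect of the read buffer size on the number of inflate calls can be illustrated with the standard library alone. This is only a sketch of the general pattern using Python's zlib module, not the actual zran read loop; inflate_in_chunks is a hypothetical helper name.

```python
import zlib


def inflate_in_chunks(compressed: bytes, readbuf_size: int) -> tuple[bytes, int]:
    """Decompress a gzip stream, feeding the decompressor `readbuf_size`
    compressed bytes at a time. Returns (data, number_of_decompress_calls)."""
    if readbuf_size <= 0:
        raise ValueError("readbuf_size must be positive")
    decomp = zlib.decompressobj(wbits=31)  # 16 + 15: expect a gzip header
    out = bytearray()
    calls = 0
    for start in range(0, len(compressed), readbuf_size):
        # Each slice stands in for one fill of the read buffer.
        out += decomp.decompress(compressed[start:start + readbuf_size])
        calls += 1
    out += decomp.flush()
    return bytes(out), calls
```

With a small readbuf_size the loop makes many short calls into the decompressor; with a large one it makes few calls but holds more compressed data in memory at once. A size of zero leaves nothing to feed the decompressor, which is why the sketch rejects it.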

The IndexedGzipFile class inherits from io.BufferedReader, which implements an additional buffering layer (the size of which can be controlled via the buffer_size argument).

The window_size controls the number of bytes of uncompressed data stored alongside each index point - it is used to initialise zlib decompression. I've not come across any scenario where it would make sense to set this to anything but 32KiB.
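
As a rough illustration of why a stored window lets decompression start mid-stream: deflate back-references can reach at most 32768 bytes into previously emitted output, so the last 32 KiB of uncompressed data is enough history to prime a decompressor. The sketch below mimics this with Python's zlib preset-dictionary support on a raw deflate stream. It is an analogy under that assumption, not the zran implementation, and both function names are hypothetical.

```python
import zlib

WINDOW_SIZE = 32768  # 32 KiB, the largest back-reference distance in deflate


def compress_with_history(history: bytes, data: bytes) -> bytes:
    """Compress `data` as raw deflate, allowing back-references into the
    last WINDOW_SIZE bytes of `history`."""
    window = history[-WINDOW_SIZE:]
    comp = zlib.compressobj(wbits=-15, zdict=window)
    return comp.compress(data) + comp.flush()


def inflate_with_window(window: bytes, compressed: bytes) -> bytes:
    """Decompress raw deflate data after priming the decompressor with a
    stored window of preceding uncompressed bytes."""
    decomp = zlib.decompressobj(wbits=-15, zdict=window)
    return decomp.decompress(compressed) + decomp.flush()
```

If the stored window were shorter than the history the compressed data refers back to, the decompressor could not resolve those back-references, which is consistent with 32 KiB being the required minimum.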

@martindurant
Author

Thanks for the explanation!
