Explain parameters #111

martindurant · 2023-01-20T21:00:52Z

Thanks for putting this together! kerchunk will make great use of it.

I am still trying to get my head around how it works, given that "gzip/zlib streams are unsplittable" has been a mantra for a long time.

In this issue, however, I'd like to ask for more documentation around the arguments to IndexedGzipFile, and the tradeoffs they entail:

I understand spacing: the more points in the file you index, the better random seeks will tend to be (needing less scrolling), but the bigger the index file will get. I expect this can be any number up to the size of the target file, at which point seeking is equivalent to not using indexed_gzip at all.
window_size: something to do with how much data is stored with each point? Can it be made small to keep the index file small, and what would be the downside of this? I don't seem to be able to pick just any number without getting a ZranError. Is 2**15 the minimum, or is this file dependent?
readbuf_size: if I know that I will always be reading an exact byte range every time, or I implement buffering elsewhere, can this be zero?

martindurant · 2023-01-20T21:09:59Z

Ah, I found this comment:

    /*
     * Number of bytes of uncompressed data to store
     * for each index point. This must be a minimum
     * of 32768 bytes.
     */
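
Putting that comment together with the spacing tradeoff described above, the cost of an index can be estimated roughly. This is only a back-of-the-envelope sketch: `estimate_index` is a hypothetical helper, it assumes `spacing` is measured in uncompressed bytes, and it ignores the small per-point bookkeeping (offsets) that a real index also stores.

```python
def estimate_index(uncompressed_size, spacing, window_size=32768):
    """Rough estimate of index cost versus seek cost (illustrative only)."""
    if spacing <= 0:
        raise ValueError("spacing must be positive")
    # One index point per `spacing` bytes of uncompressed data (rounded up).
    points = -(-uncompressed_size // spacing)
    return {
        "points": points,
        # Each point carries `window_size` bytes of uncompressed data.
        "window_bytes": points * window_size,
        # A seek may need to decompress up to one full spacing interval.
        "worst_case_scan": min(spacing, uncompressed_size),
    }


# 1 GiB of uncompressed data, one index point every 4 MiB.
estimate = estimate_index(2**30, 4 * 2**20)
```

Under these assumptions, halving `spacing` doubles the window data held in the index and halves the worst-case amount of data that has to be decompressed and discarded on a seek.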

pauldmccarthy · 2023-01-23T09:54:34Z

Hi @martindurant, readbuf_size is the size of the memory buffer that the zran module uses to read data in from the compressed stream, so it can't be set to 0. Setting it to a small value will result in more back-and-forth between reading data and passing it to the zlib inflate function. Setting it to a larger value will result in fewer calls to inflate, at the cost of increased memory usage (the readbuf is allocated/destroyed on each decompression cycle).
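
The read-buffer tradeoff can be illustrated with the standard library alone. This is a sketch, not the zran code: `inflate_in_chunks` is a hypothetical helper, and its `readbuf_size` argument only mimics the role of the real parameter.

```python
import gzip
import io
import zlib


def inflate_in_chunks(compressed, readbuf_size):
    """Decompress a gzip stream, reading `readbuf_size` compressed bytes at a time."""
    if readbuf_size <= 0:
        # With an empty read buffer no compressed input could ever be consumed.
        raise ValueError("readbuf_size must be positive")
    inflater = zlib.decompressobj(wbits=31)  # 16 + 15: expect a gzip header
    source = io.BytesIO(compressed)
    output = bytearray()
    calls = 0
    while True:
        chunk = source.read(readbuf_size)
        if not chunk:
            break
        output += inflater.decompress(chunk)  # one inflate round per buffer fill
        calls += 1
    return bytes(output), calls


original = b"indexed_gzip " * 50000
payload = gzip.compress(original)
small_out, small_calls = inflate_in_chunks(payload, 64)
large_out, large_calls = inflate_in_chunks(payload, 65536)
```

Both calls return the same data; the smaller buffer simply needs more read-then-inflate rounds to get there, while the larger one holds more memory at once.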

The IndexedGzipFile class inherits from io.BufferedReader, which implements an additional buffering layer (the size of which can be controlled via the buffer_size argument).

The window_size controls the number of bytes of uncompressed data stored alongside each index point - it is used to initialise zlib decompression. I've not come across any scenario where it would make sense to set this to anything but 32KiB.
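
To see why 32 KiB is the natural size: DEFLATE back-references can reach up to 32768 bytes into previously decompressed data, so resuming mid-stream needs that much history. The sketch below uses only the standard zlib module and is not indexed_gzip's own code. It sidesteps bit-level offsets by forcing a byte-aligned resume point with `Z_SYNC_FLUSH`, and the names `before`, `after` and `inflate_tail` are made up for the illustration.

```python
import random
import zlib

WINDOW = 32768  # maximum DEFLATE back-reference distance (32 KiB)

# Deterministic random filler, so the only redundancy is the repeat below.
rng = random.Random(0)
before = bytes(rng.getrandbits(8) for _ in range(40000))
after = before[-20000:]  # repeats data seen 20000 bytes earlier

# Raw DEFLATE stream with a byte-aligned resume point between the two parts.
comp = zlib.compressobj(wbits=-15)
head = comp.compress(before) + comp.flush(zlib.Z_SYNC_FLUSH)
tail = comp.compress(after) + comp.flush(zlib.Z_FINISH)


def inflate_tail(window):
    """Resume decompression at `tail`, primed with `window` bytes of history."""
    inflater = zlib.decompressobj(wbits=-15, zdict=window)
    try:
        return inflater.decompress(tail)
    except zlib.error:
        return None  # a back-reference pointed outside the supplied history


full = inflate_tail(before[-WINDOW:])  # full 32 KiB window
short = inflate_tail(before[-1024:])   # only 1 KiB of history
```

Here `full` recovers the second part of the data, while `short` fails because the stream refers back further than 1 KiB. Whether a smaller window would happen to work depends on the back-references in a particular file, which is consistent with requiring the full 32768 bytes.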

martindurant · 2023-01-23T14:13:33Z

Thanks for the explanation!
