
feat: BitPackedCompressor allows signed arrays #1699

Merged · 7 commits · Dec 17, 2024

Conversation

a10y (Contributor) commented Dec 16, 2024

Most of the work to support signed integers has been done in BitPackedArray.

This PR removes some assertions and branches in the compressor to make it possible to bit-pack an array of signed ints.
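A minimal sketch of the relaxed check (illustrative names, not the actual vortex API): after this change the compressor no longer bails out just because the dtype is signed; it only needs the values themselves to fit in the packed width, which still rules out negatives when packing directly.

```rust
/// Hedged sketch: bit width needed to pack a slice of signed ints directly.
/// Signedness of the *type* is fine; negative *values* are what cannot be
/// bit-packed without a FoR shift.
fn packed_bit_width(values: &[i64]) -> Option<u32> {
    if values.iter().any(|&v| v < 0) {
        return None; // negative values still cannot be bit-packed directly
    }
    let max = *values.iter().max()?;
    // 64 minus the leading zeros of the max gives the minimal width
    // (0 for an all-zero array).
    Some(64 - (max as u64).leading_zeros())
}

fn main() {
    assert_eq!(packed_bit_width(&[1i64, 5, 7]), Some(3));
    assert_eq!(packed_bit_width(&[0i64, 0]), Some(0));
    assert_eq!(packed_bit_width(&[-1i64, 3]), None);
}
```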

a10y added 2 commits December 16, 2024 18:13
gatesn (Contributor) commented Dec 16, 2024

Is it worth lumping in the change to disallow a no-op FoR array too?

Or better yet, disallow FoR when it has minimal impact on bitpacking width.

a10y (Contributor, Author) commented Dec 17, 2024

I've changed the condition in FoRCompressor::can_compress to always reject when min=0 and shift=0. Previously, it would not bail if the type was signed.

I did not change FoRCompressor to determine how many bits of savings are enough to make up for the extra unfor step; I don't have a good intuition for what that threshold should be.
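The new short-circuit described above can be sketched as follows (struct and field names are assumptions for illustration, not the actual vortex FoRCompressor API):

```rust
/// Hypothetical summary statistics the compressor consults.
struct Stats {
    min: i64,
    shift: u8,
}

/// FoR is only worth attempting when it would actually change the values:
/// a zero minimum and zero shift make FoR a no-op regardless of signedness.
fn for_can_compress(stats: &Stats) -> bool {
    !(stats.min == 0 && stats.shift == 0)
}

fn main() {
    assert!(!for_can_compress(&Stats { min: 0, shift: 0 })); // no-op: reject
    assert!(for_can_compress(&Stats { min: -4, shift: 0 })); // shifts negatives up
    assert!(for_can_compress(&Stats { min: 0, shift: 3 })); // shift alone is useful
}
```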

a10y marked this pull request as ready for review December 17, 2024 00:43
lwwmanning (Member) commented:

> I've changed the condition in FoRCompressor::can_compress to now always reject if min=0 and shift=0. Previously if the type was signed it would not bail.
>
> I did not change FoRCompressor to determine how many bits of savings is enough to make up for the extra unfor step. I don't have a good intuition on what that would be.

I can live with excluding if min and shift are zero (as a way to short-circuit), but it's a bit unnecessary. The sampling compressor already penalizes extra array depth. So FoR(0,0) + Bitpacking on signed ints will be strictly worse than just Bitpacking and won't be chosen anyway
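The depth-penalty argument above can be illustrated with a toy cost model (all numbers invented; the real sampling compressor's scoring is more involved): if each level of array nesting carries a fixed overhead, FoR(0,0) + BitPacking can never beat plain BitPacking, because the no-op FoR wrapper adds a level without shrinking the packed payload.

```rust
// Hypothetical per-wrapper overhead, in bytes, charged per nesting level.
const DEPTH_PENALTY: usize = 16;

/// Toy score: smaller is better.
fn score(payload_bytes: usize, depth: usize) -> usize {
    payload_bytes + depth * DEPTH_PENALTY
}

fn main() {
    // FoR(0, 0) changes no values, so the bit-packed payload is identical.
    let packed = 1000;
    let plain = score(packed, 1); // BitPacked only
    let wrapped = score(packed, 2); // FoR(0,0) wrapping BitPacked
    assert!(plain < wrapped); // plain bitpacking always wins
}
```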

gatesn (Contributor) commented Dec 17, 2024

>> I've changed the condition in FoRCompressor::can_compress to now always reject if min=0 and shift=0. Previously if the type was signed it would not bail.
>> I did not change FoRCompressor to determine how many bits of savings is enough to make up for the extra unfor step. I don't have a good intuition on what that would be.
>
> I can live with excluding if min and shift are zero (as a way to short-circuit), but it's a bit unnecessary. The sampling compressor already penalizes extra array depth. So FoR(0,0) + Bitpacking on signed ints will be strictly worse than just Bitpacking and won't be chosen anyway

By this logic we shouldn't exclude any compressors based on metrics, e.g. run-end encoding should ignore the avg_run_length.

lwwmanning (Member) commented Dec 17, 2024

>>> I've changed the condition in FoRCompressor::can_compress to now always reject if min=0 and shift=0. Previously if the type was signed it would not bail.
>>> I did not change FoRCompressor to determine how many bits of savings is enough to make up for the extra unfor step. I don't have a good intuition on what that would be.
>
>> I can live with excluding if min and shift are zero (as a way to short-circuit), but it's a bit unnecessary. The sampling compressor already penalizes extra array depth. So FoR(0,0) + Bitpacking on signed ints will be strictly worse than just Bitpacking and won't be chosen anyway
>
> By this logic we shouldn't exclude any compressors based on metrics, e.g. run-end encoding should ignore the avg_run_length.

Yeah, I think that's correct. More importantly, we should also filter codecs based on whether they can_compress the full array, THEN do the sampling compression, then compress the full array like the best scheme found on the sample. Right now, we check can_compress on the sample, which may pass on the sample yet fail on the full array, causing a resample... which could fail arbitrarily many times.

E.g., if we theoretically implemented BitPackedArray::can_compress for signed ints by checking min >= 0 on an array with 1 million integers, only 1 of which was negative, we'd expect it to have to resample ~1000 times on average before successfully compressing.
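A back-of-envelope check of that "~1000 resamples" estimate, under the assumption of a uniformly drawn sample of about 1024 rows from 1 million values containing a single negative (both numbers are illustrative, not the actual sampling parameters):

```rust
fn main() {
    let n = 1_000_000.0_f64; // total values
    let sample = 1024.0_f64; // assumed sample size
    // Probability a given sample misses the single negative value, in which
    // case can_compress passes on the sample but compressing the full array
    // fails, forcing a resample.
    let p_miss = 1.0 - sample / n;
    // Resampling until the sample *catches* the negative value is a geometric
    // process: expected tries = 1 / (1 - p_miss) = n / sample.
    let expected_resamples = 1.0 / (1.0 - p_miss);
    assert!((expected_resamples - n / sample).abs() < 1e-6);
    println!("expected resamples ~ {expected_resamples:.0}"); // on the order of 1000
}
```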

lwwmanning enabled auto-merge (squash) December 17, 2024 16:41
lwwmanning merged commit 8e0e25c into develop Dec 17, 2024
19 checks passed
lwwmanning deleted the aduffy/bitpacked-pos-signed branch December 17, 2024 16:57
a10y added a commit that referenced this pull request Dec 18, 2024
…es (#1705)

Following up from #1699.

In the previous PR we allowed signed arrays to be bit-packed directly.
However, we did not explicitly reject arrays with negative values. We
**need** to do this because it is critical for fast `search_sorted` over
BitPacked data with patches: only when the patches sort to the right-most
side of the array can we do efficient binary search.

I've added explicit preconditions that values are non-negative, and made
the BitPackedArray constructor unsafe to make it clear to callers that
they must explicitly check this themselves (the recommended safe way to
create a BPA is via the `BPA::encode()` method, which returns an Error
if there are negative values).
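The constructor pattern described above can be sketched as follows (a simplified illustration of the safety contract; the real BitPackedArray carries buffers, patches, and dtype metadata, and its signatures differ):

```rust
#[derive(Debug)]
struct NegativeValueError;

#[allow(dead_code)]
struct BitPackedArray {
    values: Vec<i64>,
    bit_width: u32,
}

impl BitPackedArray {
    /// Safe entry point: validates the non-negativity precondition and
    /// returns an error instead of constructing an invalid array.
    fn encode(values: Vec<i64>, bit_width: u32) -> Result<Self, NegativeValueError> {
        if values.iter().any(|&v| v < 0) {
            return Err(NegativeValueError);
        }
        // SAFETY: we just checked that all values are non-negative.
        Ok(unsafe { Self::new_unchecked(values, bit_width) })
    }

    /// Unsafe constructor: the caller must guarantee all values are >= 0 so
    /// that patches sort to the right and binary search stays correct.
    unsafe fn new_unchecked(values: Vec<i64>, bit_width: u32) -> Self {
        Self { values, bit_width }
    }
}

fn main() {
    assert!(BitPackedArray::encode(vec![1, 2, 3], 2).is_ok());
    assert!(BitPackedArray::encode(vec![-1, 2], 2).is_err());
}
```

Marking the raw constructor `unsafe` pushes the precondition into the type system: any caller bypassing `encode()` must write an explicit `unsafe` block and justify why the invariant holds.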