Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: correct floating point ChunkHasher #5778

Merged
merged 7 commits into from
Jul 18, 2024

Conversation

devinrsmith
Copy link
Member

This adds documentation and a new hashCode method to the io.deephaven.util.compare package that is consistent with Deephaven equality. Specifically, this ensures that the floating points values -0.0 and 0.0 hash to the same value.

Testing was added around aggregation keys, join keys, aggregations, min/max formulas, and sort results. Of note is that unique and distinct aggregations currently rely on io.deephaven.chunk.WritableChunk#sort(), which treats -0.0 < 0.0. This effects the encounter order returned by these aggregations, and could be seen as inconsistent with our stated goals of treating -0.0 == 0.0. This does not affect the consistency of the sort operation. It is hard to be confident that the testing coverage around this issue is fully complete; that said, this does make us much more consistent in the tested operations wrt Deephaven equality.

This is a prerequisite for #5605

Fixes #3768

This adds documentation and a new hashCode method to the `io.deephaven.util.compare` package that is consistent with Deephaven equality. Specifically, this ensures that the floating points values -0.0 and 0.0 hash to the same value.

Testing was added around aggregation keys, join keys, aggregations, min/max formulas, and sort results. Of note is that unique and distinct aggregations currently rely on `io.deephaven.chunk.WritableChunk#sort()`, which treats -0.0 < 0.0. This effects the encounter order returned by these aggregations, and could be seen as inconsistent with our stated goals of treating -0.0 == 0.0. This does not affect the consistency of the sort operation. It is hard to be confident that the testing coverage around this issue is fully complete; that said, this does make us much more consistent in the tested operations wrt Deephaven equality.

This is a prerequisite for deephaven#5605

Fixes deephaven#3768
@devinrsmith devinrsmith added this to the 0.36.0 milestone Jul 16, 2024
@devinrsmith devinrsmith requested a review from rcaudy July 16, 2024 20:41
@devinrsmith devinrsmith self-assigned this Jul 16, 2024
//
package io.deephaven.util.compare;

public class BooleanComparisons {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this? We don't ever allow a column to contain primitive booleans.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This allows us to not have to special case some boolean code generation cases; namely, BooleanChunkEquals and BooleanChunkHasher will now call this class (just like the other ChunkEquals / ChunkHasher impls). Maybe we can make the argument that BooleanComparisons should only have eq and hashCode, and not compare, gt, etc...?

@@ -250,7 +251,7 @@ public void andEqualPairs(IntChunk<ChunkPositions> chunkPositionsToCheckForEqual

// region eq
static private boolean eq(Object lhs, Object rhs) {
return lhs == rhs;
return ObjectComparisons.eq(lhs, rhs);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is wrong.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Fixed. Surprised we didn't have any failing test.

}

@Test
public void testDhc3768_double_join_key() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rangeJoin and multiJoin?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added multijoin. Spent a little time on rangeJoin, didn't quite grok the correct incantations.

}

@Test
public void testDhc3768_min_max_result() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

List of aggs to test:

  • min (looking to preserve encounter order)
  • max (looking to preserve encounter order)
  • percentile
  • median
  • count distinct
  • distinct
  • unique
  • sorted first
  • sorted last

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I've got these bases covered now.

@devinrsmith devinrsmith requested a review from rcaudy July 18, 2024 18:36
@devinrsmith devinrsmith requested a review from rcaudy July 18, 2024 19:05

@Test
@Ignore("Unstable sort?")
public void testFloatMedianAggResult() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably less well-defined than we'd hope. I bet that the SSM bucket gets created for -0.0 or 0.0, whichever is encountered first, and we'll keep using that "forever".

@devinrsmith devinrsmith requested a review from rcaudy July 18, 2024 20:09
@devinrsmith devinrsmith enabled auto-merge (squash) July 18, 2024 20:20
@devinrsmith devinrsmith merged commit 2ea2363 into deephaven:main Jul 18, 2024
15 checks passed
@devinrsmith devinrsmith deleted the floating-point-hasher branch July 18, 2024 20:45
@github-actions github-actions bot locked and limited conversation to collaborators Jul 18, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Floating point comparisons are inconsistent with hashCode
2 participants