-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Minor]: Add data based sort expression test #12992
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @akurmustafa for providing these utils and how they are used to visualize complex ordering cases. I have just one question about the test
rng: &mut StdRng, | ||
) -> ArrayRef { | ||
let values: Vec<f64> = (0..n_elems) | ||
.map(|_| rng.gen_range(0..n_distinct) as f64 / 2.0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why / 2.0
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rng.gen_range(0..n_distinct)
generates number as integer. This is done to convert them to the float (as f64
would also work but when we divide by 2.0 floating point is more visible).
(col_d, option_asc), | ||
]; | ||
let ordering = convert_to_orderings(&[ordering])[0].clone(); | ||
assert!(!is_table_same_after_sort(ordering, batch.clone())?); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, if there are two identical tables, and one with sorted on [a ASC, b ASC, c ASC, d ASC]
and [a ASC, c ASC, b ASC, d ASC]
, and the other one sorted on [a ASC, c ASC, d ASC], the resulting tables can be again identical (moreover, it could be also possible for [a ASC, b ASC, d ASC] at the same time). If they come up so, the test will give error. Can we avoid it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This depends on the test batch generation parameters. For sufficiently large table sizes, and with enough cardinality it is really hard to hit this case. Hence, I can say that statistically this is very low possibility. However, we can totally encounter this for other use cases. Hence, if the expected result is counter intuitive, we should test the hypothesis with multiple different runs with various parameters.
@@ -1065,4 +1066,63 @@ mod tests { | |||
|
|||
Ok(()) | |||
} | |||
|
|||
#[test] | |||
fn test_ordering_satisfy_on_data() -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks to me like a fuzz test -- perhaps we could move it to https://github.com/apache/datafusion/tree/main/datafusion/core/tests/fuzz_cases/equivalence
ordering.rs perhaps
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense to me. I moved this test to under .../fuzz_cases/equivalence/ordering.rs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me -- thank you for the follow up @akurmustafa
@@ -158,3 +159,62 @@ fn test_ordering_satisfy_with_equivalence_complex_random() -> Result<()> { | |||
|
|||
Ok(()) | |||
} | |||
|
|||
#[test] | |||
fn test_ordering_satisfy_on_data() -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it would help to add some comments about the rationale for this test -- specifically I think it is showing that data sorted on [a,b,c,d]
or [a,c,b,d]
is not also sorted on [a ASC, b ASC, d ASC]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added a comment to explain rationale. Also, I put a link to original discussion for background. Thanks @alamb
Which issue does this PR close?
Closes #.
Rationale for this change
As in the discussion, sometimes theoretically deriving the sort expressions that can be deduced for given sort expressions is not trivial.
In this PR, I added a new util function to create a test data that satisfy given ordering expressions. After constructing the data, we can test our hypothesis on the constructed data.
I added the example in the comment as a test case.
What changes are included in this PR?
Are these changes tested?
Yes,
Are there any user-facing changes?