Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Minor]: Add data based sort expression test #12992

Merged
merged 9 commits into from
Oct 21, 2024

Conversation

akurmustafa
Copy link
Contributor

Which issue does this PR close?

Closes #.

Rationale for this change

As in the discussion, sometimes theoretically deriving the sort expressions that can be deduced for given sort expressions is not trivial.

In this PR, I added a new util function to create a test data that satisfy given ordering expressions. After constructing the data, we can test our hypothesis on the constructed data.

I added the example in the comment as a test case.

What changes are included in this PR?

Are these changes tested?

Yes,

Are there any user-facing changes?

@github-actions github-actions bot added the physical-expr Physical Expressions label Oct 18, 2024
Copy link
Contributor

@berkaysynnada berkaysynnada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @akurmustafa for providing these utils and how they are used to visualize complex ordering cases. I have just one question about the test

rng: &mut StdRng,
) -> ArrayRef {
let values: Vec<f64> = (0..n_elems)
.map(|_| rng.gen_range(0..n_distinct) as f64 / 2.0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why / 2.0 ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rng.gen_range(0..n_distinct) generates number as integer. This is done to convert them to the float (as f64 would also work but when we divide by 2.0 floating point is more visible).

(col_d, option_asc),
];
let ordering = convert_to_orderings(&[ordering])[0].clone();
assert!(!is_table_same_after_sort(ordering, batch.clone())?);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, if there are two identical tables, and one with sorted on [a ASC, b ASC, c ASC, d ASC] and [a ASC, c ASC, b ASC, d ASC], and the other one sorted on [a ASC, c ASC, d ASC], the resulting tables can be again identical (moreover, it could be also possible for [a ASC, b ASC, d ASC] at the same time). If they come up so, the test will give error. Can we avoid it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This depends on the test batch generation parameters. For sufficiently large table sizes, and with enough cardinality it is really hard to hit this case. Hence, I can say that statistically this is very low possibility. However, we can totally encounter this for other use cases. Hence, if the expected result is counter intuitive, we should test the hypothesis with multiple different runs with various parameters.

@@ -1065,4 +1066,63 @@ mod tests {

Ok(())
}

#[test]
fn test_ordering_satisfy_on_data() -> Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks to me like a fuzz test -- perhaps we could move it to https://github.com/apache/datafusion/tree/main/datafusion/core/tests/fuzz_cases/equivalence

ordering.rs perhaps

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me. I moved this test to under .../fuzz_cases/equivalence/ordering.rs

@github-actions github-actions bot added the core Core DataFusion crate label Oct 18, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me -- thank you for the follow up @akurmustafa

@@ -158,3 +159,62 @@ fn test_ordering_satisfy_with_equivalence_complex_random() -> Result<()> {

Ok(())
}

#[test]
fn test_ordering_satisfy_on_data() -> Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it would help to add some comments about the rationale for this test -- specifically I think it is showing that data sorted on [a,b,c,d] or [a,c,b,d] is not also sorted on [a ASC, b ASC, d ASC]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added a comment to explain rationale. Also, I put a link to original discussion for background. Thanks @alamb

@berkaysynnada berkaysynnada merged commit 69a4648 into apache:main Oct 21, 2024
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate physical-expr Physical Expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants