[Minor]: Add data based sort expression test #12992

akurmustafa · 2024-10-18T03:54:42Z

Which issue does this PR close?

Closes #.

Rationale for this change

As noted in the discussion, it is sometimes not trivial to derive theoretically which sort expressions can be deduced from a given set of sort expressions.

In this PR, I added a new util function that creates test data satisfying the given ordering expressions. After constructing the data, we can test our hypothesis against it.

I added the example from the comment as a test case.
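The construct-then-check workflow described above can be sketched in plain Rust, without Arrow or DataFusion. This is only an illustration, not the utility added in the PR: the names `cmp_on`, `sort_on`, and `satisfies` are hypothetical, rows are modeled as `Vec<i64>`, and every sort expression is assumed to be an ascending column reference.

```rust
use std::cmp::Ordering;

/// Compares two rows lexicographically on the given column indices (ascending).
fn cmp_on(left: &[i64], right: &[i64], cols: &[usize]) -> Ordering {
    cols.iter()
        .map(|&c| left[c].cmp(&right[c]))
        .find(|ord| *ord != Ordering::Equal)
        .unwrap_or(Ordering::Equal)
}

/// Rearranges `rows` so that they satisfy the ordering given by `cols`.
fn sort_on(rows: &mut [Vec<i64>], cols: &[usize]) {
    rows.sort_by(|l, r| cmp_on(l, r, cols));
}

/// Returns true if no adjacent pair of rows is out of order on `cols`.
fn satisfies(rows: &[Vec<i64>], cols: &[usize]) -> bool {
    rows.windows(2)
        .all(|w| cmp_on(&w[0], &w[1], cols) != Ordering::Greater)
}
```

A caller would first run `sort_on` with the ordering that is known to hold, and then call `satisfies` with each candidate ordering to see whether the hypothesis survives on that particular data set.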

What changes are included in this PR?

Are these changes tested?

Yes.

Are there any user-facing changes?

berkaysynnada

Thank you @akurmustafa for providing these utils and showing how they can be used to visualize complex ordering cases. I have just one question about the test.

berkaysynnada · 2024-10-18T10:35:18Z

datafusion/physical-expr/src/equivalence/mod.rs

+        rng: &mut StdRng,
+    ) -> ArrayRef {
+        let values: Vec<f64> = (0..n_elems)
+            .map(|_| rng.gen_range(0..n_distinct) as f64 / 2.0)


Why / 2.0 ?

akurmustafa · 2024-10-18

rng.gen_range(0..n_distinct) generates integers. The division converts them to floats (as f64 alone would also work, but dividing by 2.0 makes it more visible that the values are floating point).
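The effect of the division can be seen in a small standalone sketch. It does not use the `rand` crate: the `Lcg` type below is a hypothetical stand-in for `StdRng`, and `generate_floats` is an illustrative name, not the helper from the PR. Because `as` binds tighter than `/`, the integer is cast to `f64` first and then halved, so the results fall on a grid of half steps.

```rust
/// Tiny linear congruential generator, used here only as a stand-in for `StdRng`
/// so that the sketch needs no external crate.
struct Lcg(u64);

impl Lcg {
    /// Returns a pseudo-random integer in `0..n` (n must be non-zero).
    fn next_below(&mut self, n: u64) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        (self.0 >> 33) % n
    }
}

/// Draws `n_elems` integers in `0..n_distinct` and maps each one to a float
/// by casting and halving, e.g. 3 becomes 1.5.
fn generate_floats(n_elems: usize, n_distinct: u64, rng: &mut Lcg) -> Vec<f64> {
    (0..n_elems)
        .map(|_| rng.next_below(n_distinct) as f64 / 2.0)
        .collect()
}
```

With `n_distinct = 5`, every generated value is one of 0.0, 0.5, 1.0, 1.5, or 2.0, so the column still has at most five distinct values while clearly being a float column.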

berkaysynnada · 2024-10-18T10:53:17Z

datafusion/physical-expr/src/equivalence/ordering.rs

+            (col_d, option_asc),
+        ];
+        let ordering = convert_to_orderings(&[ordering])[0].clone();
+        assert!(!is_table_same_after_sort(ordering, batch.clone())?);


IMO, if there are two identical tables, one sorted on [a ASC, b ASC, c ASC, d ASC] and [a ASC, c ASC, b ASC, d ASC] and the other sorted on [a ASC, c ASC, d ASC], the resulting tables can still be identical (the same is also possible for [a ASC, b ASC, d ASC] at the same time). If that happens, the test will fail. Can we avoid it?

akurmustafa · 2024-10-18

This depends on the test batch generation parameters. For sufficiently large tables with high enough cardinality, it is very hard to hit this case, so statistically the probability is very low. However, we can certainly encounter it in other use cases. Hence, if the expected result is counterintuitive, we should test the hypothesis with multiple runs using various parameters.
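The dependence on the generation parameters can be explored with a rough sketch like the one below. It is a simplification under stated assumptions: only the single ordering [a, b, c, d] is imposed (the PR's data satisfies two orderings), the `Lcg` generator is a hypothetical stand-in for a seeded RNG, and `sorted_table`, `sorted_on_a_b_d`, and `coincidences` are illustrative names.

```rust
/// Small linear congruential generator, a stand-in for a seeded RNG.
struct Lcg(u64);

impl Lcg {
    /// Returns a pseudo-random integer in `0..n` (n must be non-zero).
    fn next_below(&mut self, n: u64) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        (self.0 >> 33) % n
    }
}

/// Builds `n_rows` random rows (a, b, c, d) and sorts them on [a, b, c, d].
fn sorted_table(n_rows: usize, n_distinct: u64, seed: u64) -> Vec<[u64; 4]> {
    let mut rng = Lcg(seed);
    let mut rows: Vec<[u64; 4]> = (0..n_rows)
        .map(|_| {
            let mut row = [0u64; 4];
            for value in row.iter_mut() {
                *value = rng.next_below(n_distinct);
            }
            row
        })
        .collect();
    rows.sort(); // arrays compare lexicographically, i.e. on [a, b, c, d]
    rows
}

/// True if the rows are also non-decreasing on [a, b, d] (column c is skipped).
fn sorted_on_a_b_d(rows: &[[u64; 4]]) -> bool {
    rows.windows(2)
        .all(|w| (w[0][0], w[0][1], w[0][3]) <= (w[1][0], w[1][1], w[1][3]))
}

/// Counts the seeds in `0..n_seeds` for which a table sorted on [a, b, c, d]
/// happens to be sorted on [a, b, d] as well.
fn coincidences(n_rows: usize, n_distinct: u64, n_seeds: u64) -> u64 {
    (0..n_seeds)
        .filter(|&seed| sorted_on_a_b_d(&sorted_table(n_rows, n_distinct, seed)))
        .count() as u64
}
```

A single-row table trivially satisfies every ordering, so every seed counts as a coincidence there. Comparing, say, `coincidences(2, 5, 100)` with `coincidences(100, 5, 100)` gives a rough feel for how the accidental matches change as the table grows, which is the reason for running the hypothesis over several parameter combinations.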

alamb · 2024-10-18T19:10:10Z

datafusion/physical-expr/src/equivalence/ordering.rs

@@ -1065,4 +1066,63 @@ mod tests {

        Ok(())
    }
+
+    #[test]
+    fn test_ordering_satisfy_on_data() -> Result<()> {


This looks to me like a fuzz test -- perhaps we could move it to https://github.com/apache/datafusion/tree/main/datafusion/core/tests/fuzz_cases/equivalence (ordering.rs perhaps)

akurmustafa · 2024-10-18

This makes sense to me. I moved this test to .../fuzz_cases/equivalence/ordering.rs

alamb

Looks good to me -- thank you for the follow up @akurmustafa

alamb · 2024-10-20T13:00:39Z

datafusion/core/tests/fuzz_cases/equivalence/ordering.rs

@@ -158,3 +159,62 @@ fn test_ordering_satisfy_with_equivalence_complex_random() -> Result<()> {

    Ok(())
 }
+
+#[test]
+fn test_ordering_satisfy_on_data() -> Result<()> {


Maybe it would help to add some comments about the rationale for this test -- specifically I think it is showing that data sorted on [a,b,c,d] or [a,c,b,d] is not also sorted on [a ASC, b ASC, d ASC]

akurmustafa · 2024-10-20

I have added a comment to explain the rationale, and included a link to the original discussion for background. Thanks @alamb
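The rationale can also be illustrated with a small hand-built table rather than random data. The sketch below is not the test from the PR: `is_sorted_on` and `example_rows` are hypothetical names, columns a, b, c, d are mapped to indices 0 to 3, and all orderings are assumed ascending.

```rust
/// Returns true if the rows are non-decreasing on the given column indices.
fn is_sorted_on(rows: &[[i32; 4]], cols: &[usize]) -> bool {
    rows.windows(2).all(|w| {
        let left: Vec<i32> = cols.iter().map(|&c| w[0][c]).collect();
        let right: Vec<i32> = cols.iter().map(|&c| w[1][c]).collect();
        left <= right // Vec compares lexicographically
    })
}

/// Hand-picked rows (a, b, c, d) that satisfy both [a, b, c, d] and [a, c, b, d].
fn example_rows() -> [[i32; 4]; 3] {
    [
        [1, 1, 1, 2], // a=1, b=1, c=1, d=2
        [1, 2, 1, 1], // a=1, b=2, c=1, d=1
        [1, 2, 2, 0], // a=1, b=2, c=2, d=0
    ]
}
```

On these rows, `is_sorted_on` holds for the indices `[0, 1, 2, 3]` and `[0, 2, 1, 3]` and for the prefix `[0, 1]`, but fails for `[0, 1, 3]` (the last two rows tie on a and b while d decreases) and for `[0, 2, 3]` (the first two rows tie on a and c while d decreases).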

akurmustafa and others added 3 commits October 17, 2024 14:25

Initial commit

216ba29

Fix formatting, minor changes

e7b3481

Minor changes

6a6d7b1

github-actions bot added the physical-expr Physical Expressions label Oct 18, 2024

Merge

380a5b9

berkaysynnada reviewed Oct 18, 2024

alamb reviewed Oct 18, 2024

Move test to fuzz tests

9ad83a1

github-actions bot added the core Core DataFusion crate label Oct 18, 2024

akurmustafa and others added 2 commits October 18, 2024 16:22

Merge branch 'main' into feature/lex_sort_fuzz

a54e4b5

Merge branch 'main' into feature/lex_sort_fuzz

70fad66

alamb approved these changes Oct 20, 2024

akurmustafa added 2 commits October 20, 2024 15:04

Add comment to test

26353bb

Merge branch 'main' into feature/lex_sort_fuzz

7bda4e4

akurmustafa mentioned this pull request Oct 21, 2024

[minor]: remove same util functions from the code base. #13026

Merged

berkaysynnada merged commit 69a4648 into apache:main Oct 21, 2024
24 checks passed
