Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tests: enable fuzz for filtered anti-semi NLJoin #12360

Merged
merged 3 commits into from
Sep 9, 2024

Conversation

korowa
Copy link
Contributor

@korowa korowa commented Sep 6, 2024

Which issue does this PR close?

Closes #11537.

Rationale for this change

It turned out to be not a NLJoin issue, but fuzz tests -- during NLjoin construction, join filter was completely ignored, and all NLjoin tests were running with only equijoin conditions in filter.

In the same time filtered inner/left/right/full join fuzz tests were passing even without the filter -- the reason is that they've been using less_than_100_join_filter, which always returns True, as generation of input data for column, used in this filter, always returns values less than 100 -- make_staggered_batches comment:

/// two sorted int32 columns 'a', 'b' ranged from 0..99 as join columns

so it seems to be more useful to place col_lt_col_filter in all tests, since it has non 100% selectivity.

Finally, after replacing filter in all test cases, some of HjSmj tests turned out to be flaky, so they are temporarily disabled, until #12359 fix.

What changes are included in this PR?

  • NLJoin construction in fuzz tests now takes into account join_filter, and constructs composite filter as a union of join_filter and equijoin conditions, all fuzz tests are enabled for NljHj case
  • all _filtered test cases now use col_lt_col_filter
  • some _filtered testcases disabled for SortMergeJoin due to their flakiness

Are these changes tested?

Yes

Are there any user-facing changes?

No, it's related exclusively to the tests

@github-actions github-actions bot added the core Core DataFusion crate label Sep 6, 2024
.await
}

fn less_than_100_join_filter(schema1: Arc<Schema>, _schema2: Arc<Schema>) -> JoinFilter {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

during NLjoin construction, join filter was completely ignored, and all NLjoin tests were running with only equijoin conditions in filter.

Just to double check my understanding, your analysis is that this test is invalid because NestedLoopJoinExec doesn't appy the less_than_100_join_filter correctly?

I reread the docs for NestedLoopJoinExec and it seems to imply the point is to run joins without a equality predicate 🤔

https://docs.rs/datafusion/latest/datafusion/physical_plan/joins/struct.NestedLoopJoinExec.html

So is the difference that less_than_100_join_filter only referred to one join input (aka only refers to a) where NLJ needs filters that refer to both sides ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems to imply the point is to run joins without a equality predicate

That's true, but in order to somehow perform fuzz testing against other join implementations, NLJoin runs (intended to run) in fuzz tests with the following filter <on / equijoin conditions> and <optional join filter>

Just to double check my understanding, your analysis is that this test is invalid because NestedLoopJoinExec doesn't appy the less_than_100_join_filter correctly?

The test is invalid because of how NLJoin was built in these tests -- if optional join filter was passed as a test case argument, it was not included into NLJoin filter expression, and the test compared join on <equijoin conditions> and <filter> for HashJoin vs join on <equijoin condition> for NLJoin -- this problem is fixed by the first commit.

So is the difference that less_than_100_join_filter only referred to one join input (aka only refers to a) where NLJ needs filters that refer to both sides ?

less_than_100_join_filter is removed because it's expression (t.a < 100) doesn't make much sense, since a produced by make_staggered_batches is always less than 100 -- that's why only semi/anti_filtered joins were failing (they used different filter expression -- col_lt_col_filter, which is not always true), and all other filtered test cases (inner/left/right/full) were passing even with missing filter for NLJoin (from p.1).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I think I introduced the col_lt_col_filter when testing flaky semi join.
We probably should have used this filter for other filtered use cases.

But if the tests becoming flaky, it would be dangerous to merge the PR without ignoring tests

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the responses @korowa -- this PR makes sense to me

Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To find if test is flaky I followed the approach of running the test 1000 times in the loop. Like for AntiJoin

async fn test_anti_join_1k_filtered() {
    // NLJ vs HJ gives wrong result
    // Tracked in https://github.com/apache/datafusion/issues/11537
   for i in 0..1000 {
        println!("Iteration {i}");
        JoinFuzzTestCase::new(
            make_staggered_batches(1000),
            make_staggered_batches(1000),
            JoinType::LeftAnti,
            Some(Box::new(col_lt_col_filter)),
        )
        .run_test(&[JoinTestType::HjSmj], false)
        .await;
    }
}

But again if the test becomes flaky with col_lt_col_filter we should probably ignore the test to avoid random broken CI

@korowa
Copy link
Contributor Author

korowa commented Sep 9, 2024

To find if test is flaky I followed the approach of running the test 1000 times in the loop. Like for AntiJoin

I've run all them in a loop with hundred iterations, and all currently present tests are passing. The only failed tests after filter change were SMJ related -- they are disabled in f9d2a90

Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm thanks @korowa

@comphead comphead merged commit 4569cbb into apache:main Sep 9, 2024
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

The HashJoin and NestedLoopJoin gives different results for filtered joins fuzz tests.
3 participants