Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extending join fuzz tests to support join filtering #10728

Merged
merged 14 commits into from
Jun 12, 2024

Conversation

edmondop
Copy link
Contributor

Which issue does this PR close?

Closes
#10659

@comphead
Copy link
Contributor

Thanks @edmondop I'm planning to read it through soon

@edmondop edmondop marked this pull request as ready for review June 4, 2024 02:40
@comphead
Copy link
Contributor

comphead commented Jun 6, 2024

Hi @edmondop are you planning to proceed on this PR?

@edmondop
Copy link
Contributor Author

edmondop commented Jun 6, 2024

Yes I was stuck because one test doesn't pass, will need to investigate unless you have some hints to share

@comphead
Copy link
Contributor

comphead commented Jun 6, 2024

Yes I was stuck because one test doesn't pass, will need to investigate unless you have some hints to share

If you referring to CI test failed, it is

  left: (0, "+----+----+-------------+------------+")
 right: (0, "+----+----+-------------+-------------+")

which should probably be straightforward to fix, we can check row counts first?

@edmondop
Copy link
Contributor Author

edmondop commented Jun 6, 2024

Yes I was stuck because one test doesn't pass, will need to investigate unless you have some hints to share

If you referring to CI test failed, it is

  left: (0, "+----+----+-------------+------------+")
 right: (0, "+----+----+-------------+-------------+")

which should probably be straightforward to fix, we can check row counts first?

Thanks for the tip, checking for the row count effectively show that the right_join filtered return a different number of rows when using SMJ and HashJoin. What could be the root cause?

I have tried to modify the column used for the filter un-commenting the section commented below, but this breaks all the other test cases too

fn less_than_10_join_filter(schema1: Arc<Schema>, _schema2: Arc<Schema>) -> JoinFilter {
    let less_than_100 = Arc::new(BinaryExpr::new(
        Arc::new(Column::new("a", 0)),
        Operator::Lt,
        Arc::new(Literal::new(ScalarValue::from(100))),
    )) as _;
    let column_indices = vec![
        ColumnIndex {
            index: 0,
            side: JoinSide::Left,
        },
        // ColumnIndex {
        //     index: 0,
        //     side: JoinSide::Right,
        // },
    ];
    let intermediate_schema =
        Schema::new(vec![schema1.field_with_name("a").unwrap().to_owned()]);

    JoinFilter::new(less_than_100, column_indices, intermediate_schema)
}

@comphead
Copy link
Contributor

Thanks @edmondop, would you be able to fetch the exact example where HJ and SMJ mismatches?

@edmondop
Copy link
Contributor Author

Details

These are the two tests that fails

failures:
    fuzz_cases::join_fuzz::test_anti_join_1k_filtered
    fuzz_cases::join_fuzz::test_right_join_1k_filtered
---- fuzz_cases::join_fuzz::test_anti_join_1k_filtered stdout ----
thread 'fuzz_cases::join_fuzz::test_anti_join_1k_filtered' panicked at datafusion/core/tests/fuzz_cases/join_fuzz.rs:404:13:
assertion `left == right` failed: SortMergeJoinExec and HashJoinExec produced different row counts
  left: 5
 right: 902

right join passes on my machine, I think I need to rebase, but also it seems easier to solve

---- fuzz_cases::join_fuzz::test_right_join_1k_filtered stdout ----
thread 'fuzz_cases::join_fuzz::test_right_join_1k_filtered' panicked at datafusion/core/tests/fuzz_cases/join_fuzz.rs:377:70:
called `Result::unwrap()` on an `Err` value: ArrowError(InvalidArgumentError("Column 'a' is declared as non-nullable but contains null values"), None)
stack backtrace:

@comphead
Copy link
Contributor

@edmondop I think it is a solid start already, lets comment this tests for now and address them separately

@comphead
Copy link
Contributor

I took the liberty to ignore tests

@comphead
Copy link
Contributor

Right filtered join fuzz tests failures probably related to #10882

@edmondop
Copy link
Contributor Author

Fixed formatting @comphead all the checks are passing now

Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm well done @edmondop
We plan to address anti and right filtered joins separately and then uncomment the test

@comphead comphead merged commit 97ea05c into apache:main Jun 12, 2024
23 checks passed
@ozankabak
Copy link
Contributor

FYI, CI is failing on main after merge.

@edmondop
Copy link
Contributor Author

FYI, CI is failing on main after merge.

This should fix it #10887. I also filed #10886 cc @comphead

jayzhan211 pushed a commit to jayzhan211/datafusion that referenced this pull request Jun 12, 2024
* Extending join fuzz tests to support join filtering


---------

Co-authored-by: Oleks V <[email protected]>
@comphead
Copy link
Contributor

Thanks @ozankabak and @edmondop
Not sure why it failed, the build was green. I'm checking tests

findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
* Extending join fuzz tests to support join filtering


---------

Co-authored-by: Oleks V <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants