[BUG] join_test failed in integration tests #6003

Closed
pxLi opened this issue Jul 15, 2022 · 7 comments
Labels: bug (Something isn't working), P0 (Must have for release)


@pxLi
Collaborator

pxLi commented Jul 15, 2022

Describe the bug
Failed tests:

[2022-07-15T04:36:09.326Z] FAILED ../../src/main/python/join_test.py::test_broadcast_join_right_struct_as_key[LeftAnti-Struct(['child0', String(not_null)],['child1', Byte(not_null)],['child2', Short(not_null)],['child3', Integer(not_null)],['child4', Long(not_null)],['child5', Boolean(not_null)],['child6', Date(not_null)],['child7', Timestamp(not_null)])][IGNORE_ORDER({'local': True})]
[2022-07-15T04:36:09.326Z] FAILED ../../src/main/python/join_test.py::test_broadcast_join_right_struct_mixed_key[LeftAnti-Struct(['child0', String(not_null)],['child1', Byte(not_null)],['child2', Short(not_null)],['child3', Integer(not_null)],['child4', Long(not_null)],['child5', Boolean(not_null)],['child6', Date(not_null)],['child7', Timestamp(not_null)])][IGNORE_ORDER({'local': True})]
[2022-07-15T04:36:09.326Z] FAILED ../../src/main/python/join_test.py::test_sortmerge_join_struct_as_key[LeftAnti-Struct(['child0', String(not_null)],['child1', Byte(not_null)],['child2', Short(not_null)],['child3', Integer(not_null)],['child4', Long(not_null)],['child5', Boolean(not_null)],['child6', Date(not_null)],['child7', Timestamp(not_null)])][IGNORE_ORDER({'local': True})]
[2022-07-15T04:36:09.326Z] FAILED ../../src/main/python/join_test.py::test_sortmerge_join_struct_mixed_key[LeftAnti-Struct(['child0', String(not_null)],['child1', Byte(not_null)],['child2', Short(not_null)],['child3', Integer(not_null)],['child4', Long(not_null)],['child5', Boolean(not_null)],['child6', Date(not_null)],['child7', Timestamp(not_null)])][IGNORE_ORDER({'local': True})]

Detailed logs: failed.log
CPU and GPU output mismatched:

AssertionError: CPU and GPU list have different lengths at [] CPU: 260 GPU: 500

This could be related to newly merged cudf commits (the current JNI jar targets rapidsai/cudf@ec761da).

To reproduce:

integration_tests/run_pyspark_from_build.sh -k test_broadcast_join_right_struct_mixed_key

pxLi added the bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) labels on Jul 15, 2022
pxLi changed the title from "[BUG] join_test failed in databricks runtimes" to "[BUG] join_test failed in integration tests" on Jul 15, 2022
pxLi added the P0 (Must have for release) label and removed the ? - Needs Triage (Need team to review and classify) label on Jul 15, 2022

@res-life
Collaborator

This seems to be related to cuDF PR rapidsai/cudf#11100.
After I reverted that PR, the cases passed.
The cases fail when doing a left anti join on a struct column that contains a non-nullable sub-column.
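
For reference, the left anti join semantics these tests rely on can be sketched in plain Python. This is only an illustration under stated assumptions, not the plugin or cuDF implementation: struct keys are modeled as tuples, a null struct as `None`, and `left_anti_join` is a hypothetical helper. A left row is kept when its key is null or when no right row has an equal key.

```python
from typing import List, Optional, Tuple

# Assumption: a struct key is modeled as a tuple of its child values,
# and a null struct is modeled as None.
StructKey = Optional[Tuple[int, ...]]
Row = Tuple[StructKey, int]  # (key column, payload column)


def left_anti_join(left: List[Row], right_keys: List[StructKey]) -> List[Row]:
    """Hypothetical helper: keep left rows whose key has no match on the right.

    Null keys never compare equal in an equi-join, so a left row with a
    null key is always kept, and null keys on the right match nothing.
    """
    matchable = {k for k in right_keys if k is not None}
    return [row for row in left if row[0] is None or row[0] not in matchable]


left = [((1,), 10), ((2,), 20), (None, 30)]
right_keys = [(1,), None, (3,)]

result = left_anti_join(left, right_keys)
print(result)  # [((2,), 20), (None, 30)]
```

Note that the child values inside the struct are never null in this sketch, which mirrors the non-nullable sub-column case described above; only the struct itself may be null.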

@ttnghia
Collaborator

ttnghia commented Jul 15, 2022

I'm still debugging it. I've captured the failing input data but can't reproduce the failure in a cudf test yet.

One of the failing tests:

@ignore_order(local=True)
@pytest.mark.parametrize('data_gen', struct_gens, ids=idfn)
@pytest.mark.parametrize('join_type', ['LeftAnti'], ids=idfn)
def test_broadcast_join_right_struct_as_key(data_gen, join_type):
    def do_join(spark):
        left, right = create_df(spark, data_gen, 10, 20)
        # print("CPU OUTPUT: %s" % cpu)
        print("LEFT %s" % left.collect())
        print("RIGHT %s" % right.collect())
        return left.join(broadcast(right), left.a == right.r_a, join_type)
    assert_gpu_and_cpu_are_equal_collect(do_join)

Output:

LEFT 
[Row(a=Row(child0=-341142443), b=Row(child0=-100515324)),
Row(a=Row(child0=20129910), b=Row(child0=1727289611)),
Row(a=Row(child0=790204970), b=Row(child0=-282729822)),
Row(a=Row(child0=-98742266), b=Row(child0=-2087175000)),
Row(a=Row(child0=1578841898), b=Row(child0=1591161832)),
Row(a=Row(child0=2008070098), b=Row(child0=1859007115)),
Row(a=Row(child0=887174525), b=Row(child0=173016769)),
Row(a=None, b=Row(child0=-236270331)),
Row(a=Row(child0=324918642), b=Row(child0=-1345486411)),
Row(a=Row(child0=-100798596), b=Row(child0=1618216427))]

RIGHT [Row(r_a=Row(child0=-341142443), r_b=Row(child0=-100515324)),
Row(r_a=Row(child0=20129910), r_b=Row(child0=1727289611)),
Row(r_a=Row(child0=790204970), r_b=Row(child0=-282729822)),
Row(r_a=Row(child0=-98742266), r_b=Row(child0=-2087175000)),
Row(r_a=Row(child0=1578841898), r_b=Row(child0=1591161832)),
Row(r_a=Row(child0=2008070098), r_b=Row(child0=1859007115)),
Row(r_a=Row(child0=887174525), r_b=Row(child0=173016769)),
Row(r_a=None, r_b=Row(child0=-236270331)),
Row(r_a=Row(child0=324918642), r_b=Row(child0=-1345486411)),
Row(r_a=Row(child0=-100798596), r_b=Row(child0=1618216427)),
Row(r_a=Row(child0=1338435109), r_b=Row(child0=1931725428)),
Row(r_a=Row(child0=1259821658), r_b=Row(child0=558978567)),
Row(r_a=Row(child0=994238642), r_b=Row(child0=1265224248)),
Row(r_a=Row(child0=848988934), r_b=Row(child0=775624428)),
Row(r_a=Row(child0=326215455), r_b=Row(child0=2063803297)),
Row(r_a=Row(child0=-1615357958), r_b=Row(child0=-1031136222)),
Row(r_a=Row(child0=-642982504), r_b=Row(child0=2096953985)),
Row(r_a=Row(child0=994707147), r_b=Row(child0=741485586)),
Row(r_a=None, r_b=Row(child0=1483587345)),
Row(r_a=Row(child0=-1396639655), r_b=Row(child0=-613528857))]

CPU OUTPUT: [Row(a=None, b=Row(child0=-236270331))]
GPU OUTPUT: [Row(a=None, b=Row(child0=-236270331)), 
Row(a=Row(child0=-100798596), b=Row(child0=1618216427)), 
Row(a=Row(child0=324918642), b=Row(child0=-1345486411)), 
Row(a=Row(child0=887174525), b=Row(child0=173016769))]
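
As a sanity check on the output above, the extra GPU rows can be compared against the RIGHT keys in plain Python. This is a sketch, not part of the test suite: the values are copied from the printed output, each single-field struct is reduced to its `child0` integer, and the variable names are illustrative.

```python
# Non-null r_a.child0 values copied from the RIGHT output above.
right_keys = {
    -341142443, 20129910, 790204970, -98742266, 1578841898,
    2008070098, 887174525, 324918642, -100798596, 1338435109,
    1259821658, 994238642, 848988934, 326215455, -1615357958,
    -642982504, 994707147, -1396639655,
}

# Key column (a.child0, or None for a null struct) of each output row.
cpu_keys = [None]
gpu_keys = [None, -100798596, 324918642, 887174525]

# Rows the GPU returned that the CPU did not.
extra = [k for k in gpu_keys if k not in cpu_keys]

# An anti join must drop any left row whose key exists on the right,
# so every extra key that appears in right_keys was wrongly kept.
wrongly_kept = [k for k in extra if k in right_keys]
print(wrongly_kept)  # [-100798596, 324918642, 887174525]
```

All three extra GPU rows have a matching key in RIGHT, so the CPU result (only the null-key row) is the expected one for this input.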

@ttnghia
Collaborator

ttnghia commented Jul 16, 2022

Update: I have successfully reproduced the bug in cudf. The cause is still unclear. It is not a corner case and is harder to pin down than expected, so I'm continuing to trace it.

@ttnghia
Collaborator

ttnghia commented Jul 16, 2022

Tracked down to a bug in the cudf experimental row hasher: rapidsai/cudf#11283
I'm still chasing down the root cause and will post updates here.

@ttnghia
Collaborator

ttnghia commented Jul 16, 2022

Done. Should be fixed by this: rapidsai/cudf#11284

@pxLi
Collaborator Author

pxLi commented Jul 18, 2022

Thanks! I just triggered the new plugin JNI build (about 3-4 hours). Premerge CI should be back to normal once the new JNI artifact is online.

@pxLi
Collaborator Author

pxLi commented Jul 18, 2022

Premerge CI passed.

@pxLi pxLi closed this as completed Jul 18, 2022