
[BUG] Incorrect output from averages with filters in partial only mode #155

Closed
kuhushukla opened this issue Jun 11, 2020 · 3 comments · Fixed by #612
Assignees
Labels
bug — Something isn't working
P0 — Must have for release
SQL — part of the SQL/Dataframe plugin

Comments

@kuhushukla
Collaborator

Describe the bug
When filters are applied to averages with the configuration set to run only the partial aggregation mode on the GPU, the result can differ from the CPU result.

Steps/Code to reproduce bug
Adding the following test to hash_aggregate_test.py demonstrates the failure:

@ignore_order
@allow_non_gpu(
    'HashAggregateExec', 'AggregateExpression',
    'AttributeReference', 'Alias', 'Sum', 'Count', 'Max', 'Min', 'Average', 'Cast',
    'KnownFloatingPointNormalized', 'NormalizeNaNAndZero', 'GreaterThan', 'Literal', 'If',
    'EqualTo', 'First', 'SortAggregateExec', 'Coalesce')
@pytest.mark.parametrize('data_gen', get_params(_init_list_no_nans,
    marked_params=params_markers_for_avg_sum), ids=idfn)
def test_hash_multiple_filters_fail(data_gen):
    df = with_cpu_session(
        lambda spark : gen_df(spark, data_gen, length=100))
    df.createOrReplaceTempView("hash_agg_table")
    assert_gpu_and_cpu_are_equal_collect(
        lambda spark: spark.sql(
            'select avg(b) filter (where b > 20) ' +
            'from hash_agg_table group by a'),
        conf=_no_nans_float_conf_partial)
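For context, a "partial only on GPU" conf like the `_no_nans_float_conf_partial` used above can be sketched roughly as below. The exact keys and values here are assumptions based on the plugin's configuration scheme, not a copy of the test fixture:

```python
# Hypothetical sketch of a partial-only conf: spark-rapids'
# spark.rapids.sql.hashAgg.replaceMode controls which aggregation
# stages run on the GPU; "partial" leaves the final stage on the CPU,
# which is the handoff this bug is about.
conf = {
    'spark.rapids.sql.enabled': 'true',
    'spark.rapids.sql.hashAgg.replaceMode': 'partial',
}
print(conf['spark.rapids.sql.hashAgg.replaceMode'])  # partial
```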

Expected behavior
The CPU and GPU results should match. Instead:
cpu result = 266969243.5
gpu result = None

@kuhushukla kuhushukla added bug Something isn't working SQL part of the SQL/Dataframe plugin ? - Needs Triage Need team to review and classify labels Jun 11, 2020
@kuhushukla kuhushukla assigned revans2 and kuhushukla and unassigned revans2 Jun 12, 2020
@kuhushukla
Collaborator Author

This is a tough one given the current design. Pushing the filter down to the aggregate function, e.g. avg(), doesn't work, and if we want to special-case averages in the GpuAggregateExpression wrapper, we would have to assume a few things or reach into the original aggregate function to check whether it is an Average. I'd appreciate any comments on whether disabling is the way forward, or whether special-casing would be OK.

@abellina
Collaborator

Could you add a little more info on what is hard about it? And for the special-case solution, what is the issue with that approach?

I am a little confused about how having a partial CPU hash agg would impact the filter.

@kuhushukla
Collaborator Author

Sure.
The issue pertains to filters, but only because nulls are passed down to the CPU. Let me elaborate and hopefully we can find a way.

Input:
column names = (a, b)
values:
(1, 15)
(1, null)

Say we are calculating avg(b) and the number of partitions is 2, so one partition gets (1, 15) and the other gets (1, null).

In partial mode, averages output (sum, count) columns. In this case, for the second columnar batch we will have (0.0, 0) where both values have their validity bits set to false, meaning they are nulls. The CPU, however, expects that nulls in partial mode come out as (0.0, 0), i.e. valid zero values. If you remember, this is what we had even with defaultValues for GpuAverage. Overall, the answer computed becomes null instead of 15 in this case.

The reason the filter fails is that we use GpuCaseWhen, which emits nulls when the filter doesn't match, and we need to convert those nulls to valid zeros before we send them down to the CPU. Since WrappedAggFunction is the only place we have access to the filter, we would need to detect that the origAggFunction is an Average and then make sure we don't pass down nulls. That is what I mean by special-casing.
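The null-vs-valid-zero mismatch described above can be modeled in plain Python (this is a toy simulation of the GPU-partial/CPU-final handoff, not plugin code; `None` stands for a value whose validity bit is unset):

```python
def gpu_partial_avg(rows, pred):
    """GPU partial stage for avg(b) FILTER (WHERE pred) -> (sum, count).
    When no row passes the filter, the case-when leaves nulls, so the
    partial buffer comes out as (None, None) instead of (0.0, 0)."""
    vals = [b for b in rows if b is not None and pred(b)]
    if not vals:
        return (None, None)  # nulls: validity bits unset
    return (float(sum(vals)), len(vals))

def cpu_final_avg(partials):
    """CPU final stage: merges partial buffers, where '+' propagates
    nulls the way Spark's Add does on a null input."""
    s, c = 0.0, 0
    for ps, pc in partials:
        s = None if (s is None or ps is None) else s + ps
        c = None if (c is None or pc is None) else c + pc
    return None if (c is None or c == 0) else s / c

# Two partitions of (a, b) = (1, 15) and (1, null), filter b > 0:
p1 = gpu_partial_avg([15], lambda b: b > 0)    # (15.0, 1)
p2 = gpu_partial_avg([None], lambda b: b > 0)  # (None, None) -- the bug
print(cpu_final_avg([p1, p2]))                 # None instead of 15.0

# If the second partition instead emitted valid zeros, the merge works:
print(cpu_final_avg([p1, (0.0, 0)]))           # 15.0
```

One null partial buffer poisons the whole merge, which matches the observed "gpu result = None" in the report.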

I tried an approach where I forced the knowledge that we have a filter (or anything else that can produce nulls) down to GpuAverage via GpuDeclarativeAggregate, since that seems like the right place anyway to detect filters and apply the GpuCaseWhen. However, I had issues with the count aggregate not working correctly: the filter was set to None during convertToGpu, and by the time I set it, it was too late. Average cannot access the filter directly, so detecting in the wrapped class that we are applying a filter to an Average specifically is the only idea I have right now.
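The special-case fix under discussion amounts to coalescing a null partial buffer to valid zeros, but only for Average. A minimal sketch, with hypothetical names (this is not the plugin's API, just an illustration of the transformation the wrapper would apply before handing partials to the CPU):

```python
def patch_partial_for_average(partial, is_average):
    """Coalesce null (sum, count) partial values to valid zeros, but
    only when the wrapped aggregate is an Average, whose CPU-side merge
    expects (0.0, 0) rather than nulls for filtered-out partitions."""
    if not is_average:
        return partial  # leave other aggregates untouched
    s, c = partial
    return (0.0 if s is None else s, 0 if c is None else c)

print(patch_partial_for_average((None, None), is_average=True))  # (0.0, 0)
print(patch_partial_for_average((12.5, 3), is_average=True))     # (12.5, 3)
```

The awkward part, as noted above, is that only the wrapper around the aggregate sees the filter, so it has to inspect the wrapped function to decide whether this coalescing applies.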

@kuhushukla kuhushukla removed their assignment Jun 24, 2020
@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Jun 29, 2020
@sameerz sameerz added the P0 Must have for release label Jul 22, 2020
@sameerz sameerz added this to the Aug 17 - Aug 28 milestone Aug 19, 2020
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023
Signed-off-by: spark-rapids automation <[email protected]>