[BUG] window_function_test FAILED on PASCAL GPU #4980

Closed
NvTimLiu opened this issue Mar 18, 2022 · 5 comments
Assignees
Labels
bug Something isn't working P0 Must have for release

Comments

@NvTimLiu
Collaborator

NvTimLiu commented Mar 18, 2022

Describe the bug

This failure seems to occur only on Pascal GPUs.

The window_function_test() output differs between the CPU run and the GPU run on Pascal.

Job: #rapids_it-PASCAL/2

=========================== short test summary info ============================

FAILED ../../src/main/python/window_function_test.py::test_window_running[Byte-Long-1000][IGNORE_ORDER({'local': True})]

FAILED ../../src/main/python/window_function_test.py::test_window_running[Byte-Long-1g][IGNORE_ORDER({'local': True})]

FAILED ../../src/main/python/window_function_test.py::test_window_running[Short-Long-1000][IGNORE_ORDER({'local': True})]

FAILED ../../src/main/python/window_function_test.py::test_window_running[Short-Long-1g][IGNORE_ORDER({'local': True})]

FAILED ../../src/main/python/window_function_test.py::test_window_running[Boolean-Long-1g][IGNORE_ORDER({'local': True})]


_____________________ test_window_running[Boolean-Long-1g] _____________________

b_gen = Boolean, c_gen = Long, batch_size = '1g'

    @ignore_order(local=True)
    @pytest.mark.parametrize('batch_size', ['1000', '1g'], ids=idfn) # set the batch size so we can test multiple stream batches
    @pytest.mark.parametrize('b_gen, c_gen', [(long_gen, x) for x in running_part_and_order_gens] +
            [(x, long_gen) for x in all_basic_gens_no_nans + [decimal_gen_32bit]], ids=idfn)
    def test_window_running(b_gen, c_gen, batch_size):
        conf = {'spark.rapids.sql.batchSizeBytes': batch_size,
                'spark.rapids.sql.hasNans': False,
                'spark.rapids.sql.variableFloatAgg.enabled': True,
                'spark.rapids.sql.castFloatToDecimal.enabled': True}
        query_parts = ['b', 'a', 'row_number() over (partition by b order by a rows between UNBOUNDED PRECEDING AND CURRENT ROW) as row_num',
                'rank() over (partition by b order by a rows between UNBOUNDED PRECEDING AND CURRENT ROW) as rank_val',
                'dense_rank() over (partition by b order by a rows between UNBOUNDED PRECEDING AND CURRENT ROW) as dense_rank_val',
                'count(c) over (partition by b order by a rows between UNBOUNDED PRECEDING AND CURRENT ROW) as count_col',
                'min(c) over (partition by b order by a rows between UNBOUNDED PRECEDING AND CURRENT ROW) as min_col',
                'max(c) over (partition by b order by a rows between UNBOUNDED PRECEDING AND CURRENT ROW) as max_col']

        # Decimal precision can grow too large. Float and Double can get odd results for Inf/-Inf because of ordering
        if isinstance(c_gen.data_type, NumericType) and (not isinstance(c_gen, FloatGen)) and (not isinstance(c_gen, DoubleGen)) and (not isinstance(c_gen, DecimalGen)):
            query_parts.append('sum(c) over (partition by b order by a rows between UNBOUNDED PRECEDING AND CURRENT ROW) as sum_col')

>       assert_gpu_and_cpu_are_equal_sql(
            lambda spark : three_col_df(spark, LongRangeGen(), RepeatSeqGen(b_gen, length=100), c_gen, length=1024 * 14),
            "window_agg_table",
            'select ' +
            ', '.join(query_parts) +
            ' from window_agg_table ',
            validate_execs_in_gpu_plan = ['GpuRunningWindowExec'],
            conf = conf)

../../src/main/python/window_function_test.py:501:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../src/main/python/asserts.py:550: in assert_gpu_and_cpu_are_equal_sql
    assert_gpu_and_cpu_are_equal_collect(do_it_all, conf, is_cpu_first=is_cpu_first)
../../src/main/python/asserts.py:508: in assert_gpu_and_cpu_are_equal_collect
    _assert_gpu_and_cpu_are_equal(func, 'COLLECT', conf=conf, is_cpu_first=is_cpu_first)
../../src/main/python/asserts.py:439: in _assert_gpu_and_cpu_are_equal
    assert_equal(from_cpu, from_gpu)
../../src/main/python/asserts.py:106: in assert_equal
    _assert_equal(cpu, gpu, float_check=get_float_check(), path=[])
../../src/main/python/asserts.py:42: in _assert_equal
    _assert_equal(cpu[index], gpu[index], float_check, path + [index])
../../src/main/python/asserts.py:35: in _assert_equal
    _assert_equal(cpu[field], gpu[field], float_check, path + [field])
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

cpu = 5365, gpu = 5363
float_check = <function get_float_check.<locals>.<lambda> at 0x7fd990f5f280>
path = [2720, 'a']

    def _assert_equal(cpu, gpu, float_check, path):
        t = type(cpu)
        if (t is Row):
            assert len(cpu) == len(gpu), "CPU and GPU row have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))
            if hasattr(cpu, "__fields__") and hasattr(gpu, "__fields__"):
                assert cpu.__fields__ == gpu.__fields__, "CPU and GPU row have different fields at {} CPU: {} GPU: {}".format(path, cpu.__fields__, gpu.__fields__)
                for field in cpu.__fields__:
                    _assert_equal(cpu[field], gpu[field], float_check, path + [field])
            else:
                for index in range(len(cpu)):
                    _assert_equal(cpu[index], gpu[index], float_check, path + [index])
        elif (t is list):
            assert len(cpu) == len(gpu), "CPU and GPU list have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))
            for index in range(len(cpu)):
                _assert_equal(cpu[index], gpu[index], float_check, path + [index])
        elif (t is tuple):
            assert len(cpu) == len(gpu), "CPU and GPU list have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))
            for index in range(len(cpu)):
                _assert_equal(cpu[index], gpu[index], float_check, path + [index])
        elif (t is pytypes.GeneratorType):
            index = 0
            # generator has no zip :( so we have to do this the hard way
            done = False
            while not done:
                sub_cpu = None
                sub_gpu = None
                try:
                    sub_cpu = next(cpu)
                except StopIteration:
                    done = True

                try:
                    sub_gpu = next(gpu)
                except StopIteration:
                    done = True

                if done:
                    assert sub_cpu == sub_gpu and sub_cpu == None, "CPU and GPU generators have different lengths at {}".format(path)
                else:
                    _assert_equal(sub_cpu, sub_gpu, float_check, path + [index])

                index = index + 1
        elif (t is dict):
            # The order of key/values is not guaranteed in python dicts, nor are they guaranteed by Spark
            # so sort the items to do our best with ignoring the order of dicts
            cpu_items = list(cpu.items()).sort(key=_RowCmp)
            gpu_items = list(gpu.items()).sort(key=_RowCmp)
            _assert_equal(cpu_items, gpu_items, float_check, path + ["map"])
        elif (t is int):
>           assert cpu == gpu, "GPU and CPU int values are different at {}".format(path)
E           AssertionError: GPU and CPU int values are different at [2720, 'a']

../../src/main/python/asserts.py:77: AssertionError
----------------------------- Captured stdout call -----------------------------
### CPU RUN ###
### GPU RUN ###
### COLLECT: GPU TOOK 0.4507310390472412 CPU TOOK 0.4415421485900879 ###

Environment details (please complete the following information)

  • Environment location: Standalone, on a Pascal GPU
@NvTimLiu NvTimLiu added bug Something isn't working ? - Needs Triage Need team to review and classify labels Mar 18, 2022
@mythrocks
Collaborator

        query_parts = ['b', 'a', 'row_number() over (partition by b order by a rows between UNBOUNDED PRECEDING AND CURRENT ROW) as row_num',
                'rank() over (partition by b order by a rows between UNBOUNDED PRECEDING AND CURRENT ROW) as rank_val',
                'dense_rank() over (partition by b order by a rows between UNBOUNDED PRECEDING AND CURRENT ROW) as dense_rank_val',
                'count(c) over (partition by b order by a rows between UNBOUNDED PRECEDING AND CURRENT ROW) as count_col',
                'min(c) over (partition by b order by a rows between UNBOUNDED PRECEDING AND CURRENT ROW) as min_col',
                'max(c) over (partition by b order by a rows between UNBOUNDED PRECEDING AND CURRENT ROW) as max_col']

I suspect that this is yet another case of non-deterministic sorting: when values of (b, a) repeat, the window rows may end up in a different order on Pascal.

After we have a repro on Pascal, we could try one of two things:

  1. Break ties deterministically by also ordering on c, e.g.:

     count(c) over (partition by b order by a, c rows between UNBOUNDED PRECEDING AND CURRENT ROW) as count_col

  2. Change the generator for a so that it produces no repeats at all.

The former might be easiest.
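A minimal sketch of the first option (an illustration, not a tested change), assuming the test's query_parts above; the only edit is the added c in the ORDER BY of the aggregations over c:

    # Hedged sketch: add c to the window ordering of the aggregations over c, so that
    # ties on (b, a) are broken the same way on the CPU and the GPU.
    query_parts = ['b', 'a', 'row_number() over (partition by b order by a rows between UNBOUNDED PRECEDING AND CURRENT ROW) as row_num',
            'rank() over (partition by b order by a rows between UNBOUNDED PRECEDING AND CURRENT ROW) as rank_val',
            'dense_rank() over (partition by b order by a rows between UNBOUNDED PRECEDING AND CURRENT ROW) as dense_rank_val',
            'count(c) over (partition by b order by a, c rows between UNBOUNDED PRECEDING AND CURRENT ROW) as count_col',
            'min(c) over (partition by b order by a, c rows between UNBOUNDED PRECEDING AND CURRENT ROW) as min_col',
            'max(c) over (partition by b order by a, c rows between UNBOUNDED PRECEDING AND CURRENT ROW) as max_col']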

@NVnavkumar NVnavkumar self-assigned this Mar 21, 2022
@sameerz sameerz added P0 Must have for release and removed ? - Needs Triage Need team to review and classify labels Mar 22, 2022
@sameerz sameerz added this to the Mar 21 - Apr 1 milestone Mar 22, 2022
@mythrocks
Collaborator

Observations:

  1. Failures are for types narrower than 32 bits: BOOL8, BYTE, SHORT. (Thanks, @NVnavkumar.)
  2. Chunks of 32 rows have bad values. :]
  3. It is often the input data that is reported incorrectly, even though the aggregations have produced the right results. That points to a bad post-processing step.
  4. Repro is very difficult; consistency is elusive.

This points to a possible problem in convert_to_rows() or convert_to_rows_fixed_width().
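As a possible aid for the repro, a sketch (not part of the original report) of re-running only the failing narrow-type cases with pytest's programmatic entry point; the -k expression is an assumption based on the parametrized ids in the failure summary above:

    # Hedged sketch: select only the Byte/Short/Boolean partition-column variants of
    # test_window_running from the summary above. Assumes the usual spark-rapids
    # integration-test environment and working directory.
    import pytest

    pytest.main([
        "../../src/main/python/window_function_test.py",
        "-k", "test_window_running and (Byte or Short or Boolean) and Long",
    ])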

@mythrocks
Collaborator

mythrocks commented Mar 30, 2022

Follow-up: Looks like a problem in row/column transposition on Pascal.

Consider GpuColumnarToRowExec. If this is changed to use ColumnarToRowIterator instead of AcceleratedColumnarToRowIterator (which uses convert_to_rows_fixed_width()), the problem does not manifest. (Edit: Thanks, @revans2!)

A quick fix for 22.04 might involve disabling accelerated row/column transposition on Pascal. We can re-enable it once the synchronization problem is solved.

I will have to check how feasible it is to introspect the GPU's compute capability from the JNI layer.
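For illustration only (an assumption on my part, since the real check would live in the JNI/Scala layer): Pascal devices report CUDA compute capability 6.x, so the introspection itself is a one-line check.

    # Hedged illustration: read the device's compute capability (here via numba, purely
    # to show the check); the plugin would do the equivalent from the JNI layer.
    from numba import cuda

    major, minor = cuda.get_current_device().compute_capability
    is_pascal = (major == 6)  # Pascal GPUs report compute capability 6.x
    print("Pascal" if is_pascal else "compute capability {}.{}".format(major, minor))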

@sameerz sameerz assigned mythrocks and unassigned NVnavkumar Mar 31, 2022
@mythrocks
Collaborator

Some more data for consideration:

  1. The repro case passes when falling back from AcceleratedColumnarToRowIterator to ColumnarToRowIterator, i.e. Spark's own UnsafeRow conversion.
  2. The repro case also passes when we use convert_to_rows() instead of fixed_width_convert_to_rows(). The problem appears to be in fixed_width_convert_to_rows().
  3. The repro case continues to fail even after adding __syncthreads() calls at all divergence points in fixed_width_convert_to_rows().

I wonder at what point we should consider postponing the correct fix till 22.06, and adding a Pascal bypass for the 22.04 hotfix.

mythrocks added a commit to mythrocks/spark-rapids that referenced this issue Apr 1, 2022
Works around the failures described in NVIDIA#4980.
This commit bypasses GPU accelerated row-column conversion for Pascal GPUs. Other
architectures should remain unaffected by this shunt.
Further investigation might be required to find the actual problem in
CUDF's `fixed_width_convert_to_rows()` implementation.

Signed-off-by: MithunR <[email protected]>
ajschmidt8 pushed a commit to rapidsai/cudf that referenced this issue Apr 1, 2022
This commit introduces JNI bindings to retrieve the major and minor CUDA compute capability versions for the current CUDA device.

This feature enables introspection from `spark-rapids` to detect the GPU architecture, for model-specific behaviour.
This is required by NVIDIA/spark-rapids/pull/5122, to work around the erroneous behaviour of JNI `fixed_width_convert_to_rows()` on Pascal GPUs (#10569), which in turn produces failures like NVIDIA/spark-rapids/issues/4980.

Authors:
   - MithunR (https://github.com/mythrocks)

Approvers:
   - https://github.com/nvdbaranec
   - Jason Lowe (https://github.com/jlowe)
   - Nghia Truong (https://github.com/ttnghia)
@sameerz sameerz modified the milestones: Mar 21 - Apr 1, Apr 4 - Apr 15 Apr 4, 2022
@mythrocks
Collaborator

This failure should now be resolved. We have a follow-up issue (rapidsai/cudf/issues/10569) for fixing the actual corruption in row/column conversions for Pascal hardware.
