[BUG] window_function_test FAILED on PASCAL GPU #4980

Closed
NvTimLiu opened this issue Mar 18, 2022 · 5 comments
Assignees
Labels
bug Something isn't working P0 Must have for release

Comments

@NvTimLiu
Collaborator

NvTimLiu commented Mar 18, 2022

Describe the bug

This failure seems to occur only on Pascal GPUs.

The window_function_test() output differs between the CPU run and the GPU run on Pascal.

Job: #rapids_it-PASCAL/2

=========================== short test summary info ============================

FAILED ../../src/main/python/window_function_test.py::test_window_running[Byte-Long-1000][IGNORE_ORDER({'local': True})]

FAILED ../../src/main/python/window_function_test.py::test_window_running[Byte-Long-1g][IGNORE_ORDER({'local': True})]

FAILED ../../src/main/python/window_function_test.py::test_window_running[Short-Long-1000][IGNORE_ORDER({'local': True})]

FAILED ../../src/main/python/window_function_test.py::test_window_running[Short-Long-1g][IGNORE_ORDER({'local': True})]

FAILED ../../src/main/python/window_function_test.py::test_window_running[Boolean-Long-1g][IGNORE_ORDER({'local': True})]


_____________________ test_window_running[Boolean-Long-1g] _____________________

b_gen = Boolean, c_gen = Long, batch_size = '1g'

    @ignore_order(local=True)
    @pytest.mark.parametrize('batch_size', ['1000', '1g'], ids=idfn) # set the batch size so we can test multiple stream batches
    @pytest.mark.parametrize('b_gen, c_gen', [(long_gen, x) for x in running_part_and_order_gens] +
            [(x, long_gen) for x in all_basic_gens_no_nans + [decimal_gen_32bit]], ids=idfn)
    def test_window_running(b_gen, c_gen, batch_size):
        conf = {'spark.rapids.sql.batchSizeBytes': batch_size,
                'spark.rapids.sql.hasNans': False,
                'spark.rapids.sql.variableFloatAgg.enabled': True,
                'spark.rapids.sql.castFloatToDecimal.enabled': True}
        query_parts = ['b', 'a', 'row_number() over (partition by b order by a rows between UNBOUNDED PRECEDING AND CURRENT ROW) as row_num',
                'rank() over (partition by b order by a rows between UNBOUNDED PRECEDING AND CURRENT ROW) as rank_val',
                'dense_rank() over (partition by b order by a rows between UNBOUNDED PRECEDING AND CURRENT ROW) as dense_rank_val',
                'count(c) over (partition by b order by a rows between UNBOUNDED PRECEDING AND CURRENT ROW) as count_col',
                'min(c) over (partition by b order by a rows between UNBOUNDED PRECEDING AND CURRENT ROW) as min_col',
                'max(c) over (partition by b order by a rows between UNBOUNDED PRECEDING AND CURRENT ROW) as max_col']

        # Decimal precision can grow too large. Float and Double can get odd results for Inf/-Inf because of ordering
        if isinstance(c_gen.data_type, NumericType) and (not isinstance(c_gen, FloatGen)) and (not isinstance(c_gen, DoubleGen)) and (not isinstance(c_gen, DecimalGen)):
            query_parts.append('sum(c) over (partition by b order by a rows between UNBOUNDED PRECEDING AND CURRENT ROW) as sum_col')

>       assert_gpu_and_cpu_are_equal_sql(
            lambda spark : three_col_df(spark, LongRangeGen(), RepeatSeqGen(b_gen, length=100), c_gen, length=1024 * 14),
            "window_agg_table",
            'select ' +
            ', '.join(query_parts) +
            ' from window_agg_table ',
            validate_execs_in_gpu_plan = ['GpuRunningWindowExec'],
            conf = conf)

../../src/main/python/window_function_test.py:501:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../src/main/python/asserts.py:550: in assert_gpu_and_cpu_are_equal_sql
    assert_gpu_and_cpu_are_equal_collect(do_it_all, conf, is_cpu_first=is_cpu_first)
../../src/main/python/asserts.py:508: in assert_gpu_and_cpu_are_equal_collect
    _assert_gpu_and_cpu_are_equal(func, 'COLLECT', conf=conf, is_cpu_first=is_cpu_first)
../../src/main/python/asserts.py:439: in _assert_gpu_and_cpu_are_equal
    assert_equal(from_cpu, from_gpu)
../../src/main/python/asserts.py:106: in assert_equal
    _assert_equal(cpu, gpu, float_check=get_float_check(), path=[])
../../src/main/python/asserts.py:42: in _assert_equal
    _assert_equal(cpu[index], gpu[index], float_check, path + [index])
../../src/main/python/asserts.py:35: in _assert_equal
    _assert_equal(cpu[field], gpu[field], float_check, path + [field])
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

cpu = 5365, gpu = 5363
float_check = <function get_float_check.<locals>.<lambda> at 0x7fd990f5f280>
path = [2720, 'a']

    def _assert_equal(cpu, gpu, float_check, path):
        t = type(cpu)
        if (t is Row):
            assert len(cpu) == len(gpu), "CPU and GPU row have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))
            if hasattr(cpu, "__fields__") and hasattr(gpu, "__fields__"):
                assert cpu.__fields__ == gpu.__fields__, "CPU and GPU row have different fields at {} CPU: {} GPU: {}".format(path, cpu.__fields__, gpu.__fields__)
                for field in cpu.__fields__:
                    _assert_equal(cpu[field], gpu[field], float_check, path + [field])
            else:
                for index in range(len(cpu)):
                    _assert_equal(cpu[index], gpu[index], float_check, path + [index])
        elif (t is list):
            assert len(cpu) == len(gpu), "CPU and GPU list have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))
            for index in range(len(cpu)):
                _assert_equal(cpu[index], gpu[index], float_check, path + [index])
        elif (t is tuple):
            assert len(cpu) == len(gpu), "CPU and GPU list have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))
            for index in range(len(cpu)):
                _assert_equal(cpu[index], gpu[index], float_check, path + [index])
        elif (t is pytypes.GeneratorType):
            index = 0
            # generator has no zip :( so we have to do this the hard way
            done = False
            while not done:
                sub_cpu = None
                sub_gpu = None
                try:
                    sub_cpu = next(cpu)
                except StopIteration:
                    done = True

                try:
                    sub_gpu = next(gpu)
                except StopIteration:
                    done = True

                if done:
                    assert sub_cpu == sub_gpu and sub_cpu == None, "CPU and GPU generators have different lengths at {}".format(path)
                else:
                    _assert_equal(sub_cpu, sub_gpu, float_check, path + [index])

                index = index + 1
        elif (t is dict):
            # The order of key/values is not guaranteed in python dicts, nor are they guaranteed by Spark
            # so sort the items to do our best with ignoring the order of dicts
            cpu_items = list(cpu.items()).sort(key=_RowCmp)
            gpu_items = list(gpu.items()).sort(key=_RowCmp)
            _assert_equal(cpu_items, gpu_items, float_check, path + ["map"])
        elif (t is int):
>           assert cpu == gpu, "GPU and CPU int values are different at {}".format(path)
E           AssertionError: GPU and CPU int values are different at [2720, 'a']

../../src/main/python/asserts.py:77: AssertionError
----------------------------- Captured stdout call -----------------------------
### CPU RUN ###
### GPU RUN ###
### COLLECT: GPU TOOK 0.4507310390472412 CPU TOOK 0.4415421485900879 ###

Environment details (please complete the following information)

  • Environment location: Standalone, on a Pascal GPU
@NvTimLiu NvTimLiu added bug Something isn't working ? - Needs Triage Need team to review and classify labels Mar 18, 2022
@mythrocks
Collaborator

        query_parts = ['b', 'a', 'row_number() over (partition by b order by a rows between UNBOUNDED PRECEDING AND CURRENT ROW) as row_num',
                'rank() over (partition by b order by a rows between UNBOUNDED PRECEDING AND CURRENT ROW) as rank_val',
                'dense_rank() over (partition by b order by a rows between UNBOUNDED PRECEDING AND CURRENT ROW) as dense_rank_val',
                'count(c) over (partition by b order by a rows between UNBOUNDED PRECEDING AND CURRENT ROW) as count_col',
                'min(c) over (partition by b order by a rows between UNBOUNDED PRECEDING AND CURRENT ROW) as min_col',
                'max(c) over (partition by b order by a rows between UNBOUNDED PRECEDING AND CURRENT ROW) as max_col']

I suspect that this is yet another case of non-deterministic sorting: when values of (b, a) repeat, the window rows may end up in a different order on Pascal.

After we have a repro on Pascal, we could try one of two things:

  1. Break ties deterministically by also ordering on c, e.g.:

     count(c) over (partition by b order by a, c rows between UNBOUNDED PRECEDING AND CURRENT ROW) as count_col

  2. Change the generator for a so that it produces no repeats at all.

The former might be easiest.
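A minimal sketch of the first option (an illustration, not a tested change), assuming the test's query_parts above; the only edit is the added c in the ORDER BY of the aggregations over c:

    # Hedged sketch: add c to the window ordering of the aggregations over c, so that
    # ties on (b, a) are broken the same way on the CPU and the GPU.
    query_parts = ['b', 'a', 'row_number() over (partition by b order by a rows between UNBOUNDED PRECEDING AND CURRENT ROW) as row_num',
            'rank() over (partition by b order by a rows between UNBOUNDED PRECEDING AND CURRENT ROW) as rank_val',
            'dense_rank() over (partition by b order by a rows between UNBOUNDED PRECEDING AND CURRENT ROW) as dense_rank_val',
            'count(c) over (partition by b order by a, c rows between UNBOUNDED PRECEDING AND CURRENT ROW) as count_col',
            'min(c) over (partition by b order by a, c rows between UNBOUNDED PRECEDING AND CURRENT ROW) as min_col',
            'max(c) over (partition by b order by a, c rows between UNBOUNDED PRECEDING AND CURRENT ROW) as max_col']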

@NVnavkumar NVnavkumar self-assigned this Mar 21, 2022
@sameerz sameerz added P0 Must have for release and removed ? - Needs Triage Need team to review and classify labels Mar 22, 2022
@sameerz sameerz added this to the Mar 21 - Apr 1 milestone Mar 22, 2022
@mythrocks
Collaborator

Observations:

  1. Failures are for types narrower than 32 bits: BOOL8, BYTE, SHORT. (Thanks, @NVnavkumar.)
  2. Chunks of 32 rows have bad values. :]
  3. It is often the input data that is reported incorrectly, even though the aggregations have produced the right results. That points to a bad post-processing step.
  4. Repro is very difficult; consistency is elusive.

This points to a possible problem in convert_to_rows() or convert_to_rows_fixed_width().
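As a possible aid for the repro, a sketch (not part of the original report) of re-running only the failing narrow-type cases with pytest's programmatic entry point; the -k expression is an assumption based on the parametrized ids in the failure summary above:

    # Hedged sketch: select only the Byte/Short/Boolean partition-column variants of
    # test_window_running from the summary above. Assumes the usual spark-rapids
    # integration-test environment and working directory.
    import pytest

    pytest.main([
        "../../src/main/python/window_function_test.py",
        "-k", "test_window_running and (Byte or Short or Boolean) and Long",
    ])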

@mythrocks
Collaborator

mythrocks commented Mar 30, 2022

Follow-up: Looks like a problem in row/column transposition on Pascal.

Consider GpuColumnarToRowExec. If this is changed to use ColumnarToRowIterator instead of AcceleratedColumnarToRowIterator (which uses convert_to_rows_fixed_width()), the problem does not manifest. (Edit: Thanks, @revans2!)

A quick fix for 22.04 might involve disabling accelerated row/column transposition on Pascal. We can re-enable it once the synchronization problem is solved.

I will have to check how feasible it is to introspect the GPU's compute capability from the JNI layer.
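For illustration only (an assumption on my part, since the real check would live in the JNI/Scala layer): Pascal devices report CUDA compute capability 6.x, so the introspection itself is a one-line check.

    # Hedged illustration: read the device's compute capability (here via numba, purely
    # to show the check); the plugin would do the equivalent from the JNI layer.
    from numba import cuda

    major, minor = cuda.get_current_device().compute_capability
    is_pascal = (major == 6)  # Pascal GPUs report compute capability 6.x
    print("Pascal" if is_pascal else "compute capability {}.{}".format(major, minor))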

@sameerz sameerz assigned mythrocks and unassigned NVnavkumar Mar 31, 2022
@mythrocks
Collaborator

Some more data for consideration:

  1. The repro case passes when falling back from AcceleratedColumnarToRowIterator to ColumnarToRowIterator, i.e. Spark's own UnsafeRow conversion.
  2. The repro case also passes when we use convert_to_rows() instead of fixed_width_convert_to_rows(). The problem appears to be in fixed_width_convert_to_rows().
  3. The repro case continues to fail even after adding __syncthreads() calls at all divergence points in fixed_width_convert_to_rows().

I wonder at what point we should consider postponing the correct fix till 22.06, and adding a Pascal bypass for the 22.04 hotfix.

mythrocks added a commit to mythrocks/spark-rapids that referenced this issue Apr 1, 2022
Works around the failures described in NVIDIA#4980.
This commit bypasses GPU accelerated row-column conversion for Pascal GPUs. Other
architectures should remain unaffected by this shunt.
Further investigation might be required to find the actual problem in
CUDF's `fixed_width_convert_to_rows()` implementation.

Signed-off-by: MithunR <[email protected]>
ajschmidt8 pushed a commit to rapidsai/cudf that referenced this issue Apr 1, 2022
This commit introduces JNI bindings to retrieve the major and minor CUDA compute capability versions for the current CUDA device.

This feature enables introspection from `spark-rapids` to detect the GPU architecture, for model-specific behaviour.
This is required by NVIDIA/spark-rapids/pull/5122, to work around the erroneous behaviour of JNI `fixed_width_convert_to_rows()` on Pascal GPUs (#10569), which in turn produces failures like NVIDIA/spark-rapids/issues/4980.

Authors:
   - MithunR (https://github.com/mythrocks)

Approvers:
   - https://github.com/nvdbaranec
   - Jason Lowe (https://github.com/jlowe)
   - Nghia Truong (https://github.com/ttnghia)
@sameerz sameerz modified the milestones: Mar 21 - Apr 1, Apr 4 - Apr 15 Apr 4, 2022
@mythrocks
Collaborator

This failure should now be resolved. We have a follow-up issue (rapidsai/cudf/issues/10569) for fixing the actual corruption in row/column conversions for Pascal hardware.
