[BUG] window_function_test FAILED on PASCAL GPU #4980
Comments
query_parts = ['b', 'a', 'row_number() over (partition by b order by a rows between UNBOUNDED PRECEDING AND CURRENT ROW) as row_num',
'rank() over (partition by b order by a rows between UNBOUNDED PRECEDING AND CURRENT ROW) as rank_val',
'dense_rank() over (partition by b order by a rows between UNBOUNDED PRECEDING AND CURRENT ROW) as dense_rank_val',
'count(c) over (partition by b order by a rows between UNBOUNDED PRECEDING AND CURRENT ROW) as count_col',
'min(c) over (partition by b order by a rows between UNBOUNDED PRECEDING AND CURRENT ROW) as min_col',
'max(c) over (partition by b order by a rows between UNBOUNDED PRECEDING AND CURRENT ROW) as max_col']
I suspect that this is yet another case of non-deterministic sorting. In the event of repeated values of
count(c) over (partition by b order by a,c rows between UNBOUNDED PRECEDING AND CURRENT ROW) as count_col
Or, we could change the generator for
The former might be easiest.
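To make the suspicion above concrete, here is a minimal pure-Python sketch of how ties in the window ordering could change a running `count(c)`. It does not use Spark; `running_count`, `by_a`, `by_a_c` and the sample rows are hypothetical names invented for illustration, and the two input orders merely stand in for two engines that break ties differently.

```python
from itertools import groupby

def running_count(rows, order_key):
    """Emulate count(c) over (partition by b order by <order_key>
    rows between unbounded preceding and current row)."""
    out = []
    # Python's sort is stable, so rows that tie on the key keep their input order.
    ordered = sorted(rows, key=lambda r: (r["b"],) + order_key(r))
    for _, partition in groupby(ordered, key=lambda r: r["b"]):
        count = 0
        for r in partition:
            if r["c"] is not None:  # count(c) skips nulls
                count += 1
            out.append((r["b"], r["a"], r["c"], count))
    return out

def by_a(r):
    # order by a: rows with equal a are tied
    return (r["a"],)

def by_a_c(r):
    # order by a, c: nulls sort first, so the tie on a is broken by c
    return (r["a"], r["c"] is not None, r["c"] or 0)

# Same data, two arrival orders (hypothetical stand-ins for two engines).
rows_one = [{"b": 1, "a": 1, "c": None}, {"b": 1, "a": 1, "c": 5}]
rows_two = list(reversed(rows_one))

# With ties on a, the running count attached to the null row differs.
print(set(running_count(rows_one, by_a)) == set(running_count(rows_two, by_a)))
# Adding c to the ordering removes the ambiguity.
print(set(running_count(rows_one, by_a_c)) == set(running_count(rows_two, by_a_c)))
```

In this sketch, ordering by `a` alone lets the null row receive a count of either 0 or 1 depending on which tied row comes first, while ordering by `a,c` gives one answer regardless of input order.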
Observations:
This points to a possible problem in
Follow-up: Looks like a problem in row/column transposition on Pascal. Consider
A quick fix for 22.04 might involve disabling acceleration for row/column transposition for Pascal. We can re-enable this once the synchronization problem is solved. I will have to check how feasible it is to introspect for GPU capability, from the JNI layer.
Some more data for consideration:
I wonder at what point we should consider postponing the correct fix till 22.06, and adding a Pascal bypass for the 22.04 hotfix.
Works around the failures described in NVIDIA#4980.
This commit bypasses GPU-accelerated row-column conversion for Pascal GPUs. Other architectures should remain unaffected by this shunt.
Further investigation might be required to find the actual problem in CUDF's `fixed_width_convert_to_rows()` implementation.
Signed-off-by: MithunR <[email protected]>
This commit introduces JNI bindings to retrieve the major and minor CUDA compute capability versions for the current CUDA device. This feature enables introspection from `spark-rapids` to detect the GPU architecture, for model-specific behaviour.
This is required by NVIDIA/spark-rapids/pull/5122, to work around the erroneous behaviour of JNI `fixed_width_convert_to_rows()` on Pascal GPUs (#10569), which in turn produces failures like NVIDIA/spark-rapids/issues/4980.
Authors:
- MithunR (https://github.com/mythrocks)
Approvers:
- https://github.com/nvdbaranec
- Jason Lowe (https://github.com/jlowe)
- Nghia Truong (https://github.com/ttnghia)
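For readers following along, the bypass described in these two commits can be sketched roughly as below. This is an illustration only, not the code from either PR: `is_pascal`, `convert_to_rows`, `accelerated_convert` and `fallback_convert` are hypothetical names, and the only outside fact assumed is that Pascal GPUs report CUDA compute capability 6.x.

```python
# Pascal-generation GPUs report CUDA compute capability major version 6.
PASCAL_MAJOR = 6

def is_pascal(major: int, minor: int) -> bool:
    """Return True if the (major, minor) compute capability is Pascal (6.x).

    The minor version is accepted for completeness but not needed for this check.
    """
    return major == PASCAL_MAJOR

def convert_to_rows(table, major, minor, accelerated_convert, fallback_convert):
    """Pick the row-conversion path based on the GPU architecture.

    Both converters are hypothetical callables supplied by the caller.
    """
    if is_pascal(major, minor):
        # Shunt: skip the accelerated path on Pascal until the real bug is fixed.
        return fallback_convert(table)
    # Other architectures keep using the accelerated path.
    return accelerated_convert(table)

# Example with stand-in converters that just label which path ran.
path = convert_to_rows([1, 2, 3], 6, 1,
                       accelerated_convert=lambda t: ("accelerated", t),
                       fallback_convert=lambda t: ("fallback", t))
print(path[0])
```

Keying the check on the major version alone keeps other architectures on the accelerated path, which matches the stated intent that they remain unaffected.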
This failure should now be resolved. We have a follow-up issue (rapidsai/cudf/issues/10569) for fixing the actual corruption in row/column conversions for Pascal hardware.
Describe the bug
This failure seems to occur only on Pascal GPUs.
The data output of window_function_test() differs between the CPU and the Pascal GPU.
Job: #rapids_it-PASCAL/2
Environment details (please complete the following information)