[BUG] JNI fixed_width_convert_to_rows() produces incorrect output on Pascal GPUs #10569

Closed
mythrocks opened this issue Apr 1, 2022 · 2 comments
Labels
bug (Something isn't working), Java (Affects Java cuDF API), Spark (Functionality that helps Spark RAPIDS), wontfix (This will not be worked on)

Comments

@mythrocks
Contributor

fixed_width_convert_to_rows() seems to be producing bad results on Pascal GPUs. The errors manifest downstream as described in NVIDIA/spark-rapids/issues/4980.
Salient observations:

  1. This only seems to happen when some of the transposed columns have a type narrower than 32 bits, e.g. BOOL8, BYTE, SHORT. This might indicate a problem with how the alignment of each type is calculated.
  2. When the error occurs, groups/multiples of 32 rows seem to go wrong together. Since 32 is the warp size, this might indicate a synchronization problem on Pascal. A hasty/naive attempt at adding __syncthreads() at points of divergence in fixed_width_convert_to_rows() did not fix the problem.
  3. convert_to_rows() does not seem to exhibit this behaviour.
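
To make observation 1 concrete, here is a minimal host-side sketch of the kind of offset calculation that is under suspicion. It is not the cuDF implementation. It assumes a row layout where each fixed-width column is aligned to its own size, one validity bit per column follows the data, and the row is padded to 8 bytes. The names `align_up`, `RowLayout` and `compute_row_layout` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Round `offset` up to the next multiple of `alignment` (must be a power of two).
inline std::size_t align_up(std::size_t offset, std::size_t alignment) {
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
  return (offset + alignment - 1) & ~(alignment - 1);
}

// Hypothetical description of one packed row.
struct RowLayout {
  std::vector<std::size_t> column_offsets;  // byte offset of each column in the row
  std::size_t validity_offset;              // byte offset of the validity bits
  std::size_t row_size;                     // total row size, padded to 8 bytes
};

// Assumed layout rules (not confirmed against cuDF):
//   - each column starts at an offset aligned to its own size in bytes,
//   - validity bits (one per column) follow the last column, byte-aligned,
//   - the whole row is padded to a multiple of 8 bytes.
inline RowLayout compute_row_layout(const std::vector<std::size_t>& column_sizes) {
  RowLayout layout{};
  std::size_t offset = 0;
  for (std::size_t size : column_sizes) {
    offset = align_up(offset, size);
    layout.column_offsets.push_back(offset);
    offset += size;
  }
  layout.validity_offset = offset;
  offset += (column_sizes.size() + 7) / 8;  // one validity bit per column
  layout.row_size = align_up(offset, 8);
  return layout;
}
```

With column sizes {1, 4, 2, 8} (say BOOL8, INT32, SHORT, INT64), a narrow column forces padding before the next wider one, which is exactly where a miscalculated alignment would shift every later field in the row.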

A reduced C++ repro case has proved elusive, but the problem does seem reproducible from the NVIDIA/spark-rapids window-function tests.

@mythrocks mythrocks added bug Something isn't working Needs Triage Need team to review and classify labels Apr 1, 2022
@mythrocks mythrocks changed the title [BUG] fixed_width_convert_to_rows() produces incorrect output on Pascal GPUs [BUG] JNI fixed_width_convert_to_rows() produces incorrect output on Pascal GPUs Apr 1, 2022
@mythrocks mythrocks added the Java Affects Java cuDF API. label Apr 1, 2022
@jlowe jlowe added the Spark Functionality that helps Spark RAPIDS label Apr 1, 2022
ajschmidt8 pushed a commit that referenced this issue Apr 1, 2022
This commit introduces JNI bindings to retrieve the major and minor CUDA compute capability versions for the current CUDA device.

This feature enables introspection from `spark-rapids` to detect the GPU architecture, for model-specific behaviour.
This is required by NVIDIA/spark-rapids/pull/5122, to work around the erroneous behaviour of JNI `fixed_width_convert_to_rows()` on Pascal GPUs (#10569), which in turn produces failures like NVIDIA/spark-rapids/issues/4980.

Authors:
   - MithunR (https://github.com/mythrocks)

Approvers:
   - https://github.com/nvdbaranec
   - Jason Lowe (https://github.com/jlowe)
   - Nghia Truong (https://github.com/ttnghia)
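
For readers following along, the architecture check that such a workaround relies on can be sketched as below. This is only an illustration of one plausible shape, not the code from either PR. The one hard fact used is that Pascal GPUs report compute capability 6.x. The names `ComputeCapability`, `is_pascal`, `RowConversionPath` and `choose_path` are hypothetical, and falling back to the general conversion path on Pascal is an assumption based on the observation that convert_to_rows() is unaffected.

```cpp
#include <cassert>

// Hypothetical holder for the values the new JNI bindings expose
// (major and minor CUDA compute capability of the current device).
struct ComputeCapability {
  int major_version;
  int minor_version;
};

// Pascal devices report compute capability 6.x (6.0, 6.1, 6.2).
inline bool is_pascal(ComputeCapability cc) {
  assert(cc.major_version >= 0 && cc.minor_version >= 0);
  return cc.major_version == 6;
}

// Hypothetical choice between the two row-conversion paths.
enum class RowConversionPath { FixedWidthOptimized, General };

// Assumption: avoid the fixed-width optimized path on Pascal and use the
// general path there, since only the former shows the bad output.
inline RowConversionPath choose_path(ComputeCapability cc) {
  return is_pascal(cc) ? RowConversionPath::General
                       : RowConversionPath::FixedWidthOptimized;
}
```

In a real caller the two integers would come from the device query rather than being hard-coded; they are plain inputs here so the logic can be exercised without a GPU.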
@github-actions

github-actions bot commented May 1, 2022

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@GregoryKimball GregoryKimball removed the Needs Triage Need team to review and classify label Jun 28, 2022
@mythrocks
Contributor Author

Having consulted the team, I think it's best we close this bug for now.
The CUDF JNI code has a workaround for this failure on Pascal. Let's revisit this if we see this on another arch.

@mythrocks mythrocks closed this as not planned Jun 29, 2022
@mythrocks mythrocks added the wontfix This will not be worked on label Jun 29, 2022