Pandas UDF: Sort the data before computing the sum. #1810

Merged
merged 1 commit into from
Feb 25, 2021

Collaborator

@firestarman firestarman commented Feb 25, 2021

This PR fixes issues #757 and #740 by sorting the data before computing the sum, so that the UDF always returns the same result on CPU and GPU.

The UDF input to_process (originally an integral type) contains the same numbers on CPU and GPU, but in a different order. The sums then differ when both of the following hold:

  1. to_process contains null values (Pandas then casts to_process to float64), and
  2. some of the numbers are large enough to lose precision when cast to float64.
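
To illustrate why the order matters once the values are float64, here is a minimal sketch in plain Python. The names `naive_sum`, `cpu_order` and `gpu_order` are made up for this illustration and are not part of the test suite; the explicit loop is used so the additions happen strictly left to right.

```python
def naive_sum(values):
    """Add floats strictly left to right, with no compensation."""
    total = 0.0
    for v in values:
        total += v
    return total


# 2**53 is the point where float64 can no longer represent every integer:
# adding 1.0 to it is rounded away, while adding 2.0 is exact.
BIG = float(2**53)

# Same numbers, different order (hypothetical stand-ins for the CPU and GPU inputs).
cpu_order = [BIG, 1.0, 1.0]
gpu_order = [1.0, 1.0, BIG]

unsorted_cpu = naive_sum(cpu_order)  # each 1.0 is lost when added to BIG
unsorted_gpu = naive_sum(gpu_order)  # 1.0 + 1.0 = 2.0 survives the final add

# Sorting first gives both sides the same sequence, hence the same sum.
sorted_cpu = naive_sum(sorted(cpu_order))
sorted_gpu = naive_sum(sorted(gpu_order))

print(unsorted_cpu == unsorted_gpu)  # False
print(sorted_cpu == sorted_gpu)      # True
```

Sorting does not make the float64 sum exact; it only makes it the same for both inputs, which is all the CPU/GPU comparison needs.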

Sorting the data within each group in Spark does not appear to be easy for the test test_group_aggregate_udf, so this PR sorts the input data inside the UDF instead.
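
A rough sketch of the idea, using pandas only (no Spark session). The function name `sorted_sum` and the sample Series are assumptions for illustration; the actual UDF in the test may be written differently. In a real test this body would sit inside a pandas UDF.

```python
import pandas as pd


def sorted_sum(to_process: pd.Series):
    """Sort the group's values before summing so the result does not
    depend on the order in which rows arrive."""
    # sort_values() places NaN last by default; sum() skips NaN by default.
    return to_process.sort_values().sum()


# Same numbers in a different order; the null forces a cast to float64.
cpu_input = pd.Series([2**53, 1, None, 1])
gpu_input = pd.Series([1, None, 1, 2**53])

print(cpu_input.dtype)                                 # float64
print(sorted_sum(cpu_input) == sorted_sum(gpu_input))  # True
```

The extra sort costs some time per group, which is acceptable here because the change only affects test code.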

For more details, see #740 (comment).

closes #740
closes #757

Signed-off-by: Firestarman [email protected]

To make sure the results returned from the UDF for CPU and GPU are the same.

The input 'to_process' (original type is integral) of the UDF has the same numbers
but in different orders for CPU and GPU, so the sum will differ when

1) 'to_process' contains null values (Pandas then casts 'to_process' to float64), and
2) some of the numbers are large enough to lose precision when cast to float64.

Signed-off-by: Firestarman <[email protected]>
@firestarman
Collaborator Author

build

Collaborator

@revans2 revans2 left a comment

Looks like a good change

@revans2 revans2 merged commit 8ea7e77 into NVIDIA:branch-0.4 Feb 25, 2021
@firestarman firestarman deleted the fix-flaky-tests branch February 26, 2021 01:36
@sameerz sameerz added the test Only impacts tests label Mar 1, 2021
@sameerz sameerz added this to the Feb 16 - Feb 26 milestone Mar 1, 2021
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
Labels
test Only impacts tests
3 participants