Pandas UDF: Sort the data before computing the sum. #1810

firestarman · 2021-02-25T04:17:53Z

This PR fixes issues #757 and #740 by sorting the data before computing the sum, so that the UDF always returns the same results on CPU and GPU.

The UDF input `to_process` (originally an integral type) contains the same numbers on CPU and GPU, but in a different order. The sums then differ when both of the following hold:

1. `to_process` contains null values, so Pandas casts it to float64, and
2. some of the numbers are large enough to lose precision when cast to float64.

Sorting the data within one group in Spark does not seem easy for the test `test_group_aggregate_udf`, so this PR sorts the input data inside the UDF instead.
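To illustrate the effect, here is a minimal sketch in plain Python. It is not the code from this PR: `naive_float_sum`, `cpu_order` and `gpu_order` are hypothetical names, and the explicit left-to-right loop only stands in for a float64 sum (Pandas and NumPy may accumulate in a different internal order). It shows that the same integers, once treated as float64 because of a null, can sum to different values depending on order, and that sorting first removes the difference.

```python
def naive_float_sum(values):
    """Accumulate left to right in float64, skipping nulls (None).

    Hypothetical stand-in for a float64 sum; real libraries may use
    a different accumulation order internally.
    """
    total = 0.0
    for v in values:
        if v is not None:
            total += float(v)
    return total


# Above 2**53, float64 can no longer represent every integer exactly.
BIG = 2 ** 53

# Same numbers (plus one null), different orders, as assumed for CPU vs GPU.
cpu_order = [BIG, 1, 1, None]
gpu_order = [1, 1, None, BIG]

# Unsorted: adding 1 to BIG is rounded away each time, so order matters.
unsorted_cpu = naive_float_sum(cpu_order)
unsorted_gpu = naive_float_sum(gpu_order)


def sorted_non_null(values):
    """Drop nulls and sort ascending so the accumulation order is fixed."""
    return sorted(v for v in values if v is not None)


# Sorted: both sides accumulate in the same order and agree.
sorted_cpu = naive_float_sum(sorted_non_null(cpu_order))
sorted_gpu = naive_float_sum(sorted_non_null(gpu_order))

print(unsorted_cpu == unsorted_gpu)  # False
print(sorted_cpu == sorted_gpu)      # True
```

Sorting does not make the float64 sum exact; it only makes the accumulation order, and therefore the rounding, identical on both sides, which is all the comparison in the test needs.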

For more details please go to #740 (comment)

closes #740
closes #757

Signed-off-by: Firestarman [email protected]

Ensures the results returned from the UDF are the same for CPU and GPU. The input 'to_process' (original type is integral) of the UDF has the same numbers but in different orders for CPU and GPU, so the sum will differ when 1) 'to_process' contains null values (Pandas then casts 'to_process' to float64), and 2) some of the numbers are big enough to lose precision after being cast to float64. Signed-off-by: Firestarman <[email protected]>

firestarman · 2021-02-25T04:19:29Z

build

revans2

Looks like a good change


revans2 approved these changes Feb 25, 2021


revans2 merged commit 8ea7e77 into NVIDIA:branch-0.4 Feb 25, 2021

firestarman deleted the fix-flaky-tests branch February 26, 2021 01:36

sameerz added the test Only impacts tests label Mar 1, 2021

sameerz added this to the Feb 16 - Feb 26 milestone Mar 1, 2021

nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021

Sort the floating point test data before computing the sum. (NVIDIA#1810)

92fbc4e

Signed-off-by: Firestarman <[email protected]>

nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021

Sort the floating point test data before computing the sum. (NVIDIA#1810)

9318fbf

Signed-off-by: Firestarman <[email protected]>
