Pandas UDF: Sort the data before computing the sum. #1810
Merged
This PR fixes issues #757 and #740 by sorting the data before computing the `sum`, so that the results returned from the UDF are always the same on CPU and GPU.

The UDF input `to_process` (originally an integral type) contains the same numbers on CPU and GPU, but in different orders. The sum can then differ when `to_process` contains null values, because pandas casts `to_process` to `float64` and the sum is computed in `float64`.

It does not seem easy to have Spark sort the data within a group for the test `test_group_aggregate_udf`, so this PR sorts the input data inside the UDF instead. For more details, see #740 (comment).
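
The idea can be illustrated with plain pandas, outside Spark. This is only a minimal sketch and not the code changed by this PR: the helper name `sorted_sum` and the sample values are hypothetical, and only the input name `to_process` comes from the description above. It shows that integral data containing a null is held as `float64`, and that sorting first makes the additions happen in the same order whatever order the rows arrived in.

```python
import pandas as pd


def sorted_sum(to_process: pd.Series) -> float:
    """Hypothetical helper: sum a column after sorting it.

    Sorting fixes the order of the additions, so two inputs holding
    the same values in different row orders are summed identically.
    NaN values are placed last by sort_values() and skipped by sum().
    """
    return to_process.sort_values().sum()


# Same integral values plus one null, arriving in two different orders
# (standing in for the CPU and GPU row orders; the data is made up).
cpu_order = pd.Series([3, None, 1, 2])
gpu_order = pd.Series([2, 1, None, 3])

# The null forces pandas to store the integers as float64.
print(cpu_order.dtype)

# Both orders give the same result once the data is sorted.
print(sorted_sum(cpu_order), sorted_sum(gpu_order))
```

Sorting inside the function keeps the fix local to the UDF, which matches the choice described above of not relying on Spark to order rows within a group.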
closes #740
closes #757
Signed-off-by: Firestarman [email protected]