Fix a Pandas UDF slowness issue (#11395)
Close #10770

In CombiningIterator, calling hasNext on pythonOutputIter may trigger a read before the target row count has been set. The default row count is Int.MaxValue, so when the partition data is large enough, GpuArrowReader tries to read a very large batch. That leads to excessive data copying by DirectByteBufferOutputStream on the writer side, which is the source of the slowness.

This PR changes the default number of rows to read to arrowMaxRecordsPerBatch, to align with the Arrow batching behavior in Spark, and also sets the target number of rows to read in the hasNext function.

---------

Signed-off-by: Firestarman <[email protected]>
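The effect described in the commit message can be illustrated with a small, self-contained model. This is only a sketch in Python, not the plugin's Scala code: `BatchReader`, `set_target_rows`, and `drain` are hypothetical names invented for illustration, `sys.maxsize` stands in for `Int.MaxValue`, and a target of 4 rows stands in for `arrowMaxRecordsPerBatch`. It assumes the reader fetches its next batch lazily inside `has_next`, as the message implies.

```python
import sys


class BatchReader:
    """Hypothetical stand-in for a reader that fetches batches lazily."""

    def __init__(self, rows, default_target_rows):
        self._rows = rows
        self._pos = 0
        self._target = default_target_rows  # used until a caller overrides it
        self._pending = None

    def set_target_rows(self, n):
        self._target = n

    def has_next(self):
        # has_next itself performs the read, using whatever target is set now.
        if self._pending is None and self._pos < len(self._rows):
            end = min(self._pos + self._target, len(self._rows))
            self._pending = self._rows[self._pos:end]
            self._pos = end
        return self._pending is not None

    def next(self):
        if not self.has_next():
            raise StopIteration
        batch, self._pending = self._pending, None
        return batch


def drain(reader, target_rows, set_target_before_has_next):
    """Consume the reader and return the size of each batch it produced."""
    sizes = []
    while True:
        if set_target_before_has_next:
            reader.set_target_rows(target_rows)
        if not reader.has_next():
            break
        # Too late for the batch that has_next has already read.
        reader.set_target_rows(target_rows)
        sizes.append(len(reader.next()))
    return sizes


rows = list(range(10))
# Unbounded default, target set only after has_next: one oversized batch.
before = drain(BatchReader(rows, sys.maxsize), 4, False)
# Bounded default and target set before has_next: batches stay small.
after = drain(BatchReader(rows, 4), 4, True)
```

In this model the first call yields a single batch holding the whole partition, while the second yields batches capped at the target size. Either half of the change (a bounded default, or setting the target before `has_next`) is enough to bound the batches here; the commit applies both.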
1 parent dbd92d2, commit db1d580
Showing 2 changed files with 28 additions and 12 deletions.