Support Ascii function for ascii and latin-1 [databricks] #10054
Signed-off-by: Haoyang Li <[email protected]>
I ran some perf tests, and the GPU is slightly faster than the CPU (about 12%). I plan to refine my implementation to reduce the number of cuDF API calls. The results for inputs that are already undefined will become even less predictable, but that doesn't matter for now.
OK, I refined the implementation and it now runs faster: about a 69% speedup over the CPU.
build
The test failed in CI because Latin-1 support in Spark starts from 3.3.1. I will add a shim for it, so ascii on versions below 3.3.1 will be faster and (maybe) fully supported.
build
Just need to be sure that we test this on databricks
build
build
Looks like we need to fix this for 3.5.1 too
build
perf test results on the byte-based solution (in lower version shims):
The previous tests timed out trying to run on databricks. Trying again.
build
Closes #9585
This PR adds partial support for the ascii function, covering strings that start with an ASCII or Latin-1 Supplement character and returning results from 0 to 255. The function is disabled by default.
This PR uses `cudf::code_points` to get the UTF-8 code points of the first character in a string, and converts them to ASCII values using the following rules:
This is good enough for the customer; we can file another issue to fully support it when needed.
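The conversion from a character's UTF-8 bytes to a value in the 0 to 255 range can be sketched in plain Python. This is only an illustration under stated assumptions, not the PR's GPU code: the helper names `utf8_packed_value` and `ascii_of_first_char` are hypothetical, the packed-integer form is assumed to mirror what `cudf::code_points` reports for each character, and returning `None` for characters outside ASCII and Latin-1 Supplement is a placeholder for behavior the PR leaves undefined.

```python
from typing import Optional


def utf8_packed_value(ch: str) -> int:
    """Pack the UTF-8 bytes of one character into a big-endian integer.

    Assumption: this mirrors the per-character integer that
    cudf::code_points produces (e.g. 0xC3A9 for 'é').
    """
    return int.from_bytes(ch.encode("utf-8"), "big")


def ascii_of_first_char(s: Optional[str]) -> Optional[int]:
    """Return a 0..255 value for the first character of s (hypothetical helper)."""
    if s is None:
        return None
    if s == "":
        # Assumption: an empty string maps to 0, as Spark's ascii does.
        return 0
    packed = utf8_packed_value(s[0])
    if packed < 0x80:
        # Single-byte UTF-8: the byte is the ASCII value itself.
        return packed
    lead, cont = packed >> 8, packed & 0xFF
    if lead == 0xC2:
        # U+0080..U+00BF: the continuation byte equals the code point.
        return cont
    if lead == 0xC3:
        # U+00C0..U+00FF: the continuation byte is offset by 0x40.
        return cont + 0x40
    # Outside ASCII and Latin-1 Supplement: not defined by this sketch.
    return None


if __name__ == "__main__":
    for sample in ["A", "¡", "é", "ÿ", "€", ""]:
        print(repr(sample), ascii_of_first_char(sample))
```

Working on the packed UTF-8 bytes rather than decoding full code points keeps the logic to a couple of comparisons per string, which is consistent with the stated goal of reducing the number of cuDF API calls.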
perf test results
data: 50000000 strings from big datagen: