Add JNI for strings::code_points #14533

thirtiseven · 2023-11-30T09:34:57Z

Description

This adds the JNI bindings for strings::code_points so that the API can be called from Java.

It will be useful for NVIDIA/spark-rapids#9585

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

Signed-off-by: Haoyang Li <[email protected]>

copy-pr-bot · 2023-11-30T09:35:01Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

davidwendt · 2023-11-30T12:49:59Z

/ok to test

java/src/main/native/src/ColumnViewJni.cpp

jlowe

Not very excited to add this API since it seems problematic to use in general. However, it can be useful for a very specific case (once we pre-process the input accordingly).

jlowe · 2023-11-30T14:41:49Z

java/src/main/java/ai/rapids/cudf/ColumnView.java

@@ -373,6 +373,16 @@ public final ColumnVector getByteCount() {
    return new ColumnVector(byteCount(getNativeView()));
  }

+  /**
+   * Get the code point values (integers) for each character of each string.


This API seems very problematic in light of the effort to move to large strings. A strings column will soon support more than 2^31 characters. Calling this API on such a column will crash since it cannot manifest an INT32 column with more than 2^31 entries.

It also seems problematic from a usability point of view. Since it returns only a column of INT32 instead of LIST(INT32), it's not straightforward to figure out where the code points of one string stop and those of the next begin. We can't use the offsets column of the original strings column, since those are byte offsets rather than character offsets. I guess one would need to get the character lengths of the original strings (converting nulls to zeroes) and then do a prefix scan to compute the code point offsets, to know where each string's code points sit in the result.
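To make the offset computation described above concrete, here is a minimal host-side sketch in plain Java. It does not use the cuDF API; the class and method names (`CodePointOffsets`, `charLengths`, `offsets`) are hypothetical and exist only for illustration. It assumes one code point per character, with null rows contributing zero characters.

```java
import java.util.Arrays;

public class CodePointOffsets {
  // Character (code point) count per row; null rows are treated as zero.
  public static int[] charLengths(String[] rows) {
    int[] lengths = new int[rows.length];
    for (int i = 0; i < rows.length; i++) {
      lengths[i] = rows[i] == null ? 0 : rows[i].codePointCount(0, rows[i].length());
    }
    return lengths;
  }

  // Exclusive prefix sum: offsets[i] is where row i starts in the flat
  // code point column, and offsets[n] is the total number of entries.
  // Math.addExact throws if the total no longer fits in an int.
  public static int[] offsets(int[] lengths) {
    int[] offsets = new int[lengths.length + 1];
    for (int i = 0; i < lengths.length; i++) {
      offsets[i + 1] = Math.addExact(offsets[i], lengths[i]);
    }
    return offsets;
  }

  public static void main(String[] args) {
    String[] rows = {"h\u00e9llo", null, "", "ab"};
    // prints [0, 5, 5, 5, 7]
    System.out.println(Arrays.toString(offsets(charLengths(rows))));
  }
}
```

Row i's code points would then occupy the half-open range [offsets[i], offsets[i + 1]) of the flat result. The overflow check mirrors the large-strings concern: the total entry count has to fit in a 32-bit size.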

It also seems very wasteful for what NVIDIA/spark-rapids#9585 needs if called directly, since it will explode the memory of many string columns by 4X. We should first slice the original string column to only select the first character of each string. That would work around the large strings issue, the "where does a string start" issue, as well as the waste, since we only need the codepoint of the first character for that Spark feature.
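As a rough host-side illustration of the "first character only" idea, the sketch below trims each row to its first character and compares the size of the resulting INT32 code point data. It is plain Java rather than the cuDF API, and the names (`FirstCharOnly`, `firstChars`, `codePointBytes`) are hypothetical.

```java
public class FirstCharOnly {
  // Keep only the first character (code point) of each row.
  // Null and empty rows pass through unchanged.
  public static String[] firstChars(String[] rows) {
    String[] out = new String[rows.length];
    for (int i = 0; i < rows.length; i++) {
      String s = rows[i];
      if (s == null || s.isEmpty()) {
        out[i] = s;
      } else {
        out[i] = s.substring(0, s.offsetByCodePoints(0, 1));
      }
    }
    return out;
  }

  // Bytes needed to hold one INT32 per character across all rows.
  public static long codePointBytes(String[] rows) {
    long chars = 0;
    for (String s : rows) {
      if (s != null) {
        chars += s.codePointCount(0, s.length());
      }
    }
    return 4L * chars;
  }

  public static void main(String[] args) {
    String[] rows = {"hello", null, "", "\u00e9t\u00e9"};
    System.out.println(codePointBytes(rows));             // 8 characters * 4 bytes
    System.out.println(codePointBytes(firstChars(rows))); // 2 characters * 4 bytes
  }
}
```

After trimming, every non-empty row contributes exactly one entry, so the result can never have more entries than the input has rows, and row boundaries become trivial to recover.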

Thanks for the review and analysis! I'm trying the 'only select the first character' way in the plugin.

Another problem with the Spark issue is that the results for Latin-1 Supplement characters differ between Spark and code_points. For example, é is 50089 for code_points (its UTF-8 encoding) but 233 for Spark (its Unicode code point, which is also its Latin-1 and UTF-16 value). I'm trying to work around it, but we may need a custom kernel for ascii.
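The mismatch can be reproduced on the host with plain Java. The sketch below assumes the 50089 value is the character's UTF-8 bytes packed big-endian into one integer (0xC3A9 for é), which is consistent with the numbers quoted above; the class and method names are hypothetical and this is not the cuDF implementation.

```java
import java.nio.charset.StandardCharsets;

public class Utf8VsCodePoint {
  // Pack the UTF-8 bytes of the first character of s into one int, big-endian.
  public static int utf8Packed(String s) {
    int cp = s.codePointAt(0);
    byte[] bytes = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
    int packed = 0;
    for (byte b : bytes) {
      packed = (packed << 8) | (b & 0xFF);
    }
    return packed;
  }

  // Recover the Unicode code point from a packed UTF-8 value
  // by unpacking the bytes and decoding them.
  public static int unicodeFromPacked(int packed) {
    int n = 1;
    if ((packed >>> 8) != 0) n = 2;
    if ((packed >>> 16) != 0) n = 3;
    if ((packed >>> 24) != 0) n = 4;
    byte[] bytes = new byte[n];
    for (int i = n - 1; i >= 0; i--) {
      bytes[i] = (byte) packed;
      packed >>>= 8;
    }
    return new String(bytes, StandardCharsets.UTF_8).codePointAt(0);
  }

  public static void main(String[] args) {
    System.out.println(utf8Packed("\u00e9"));     // 50089
    System.out.println(unicodeFromPacked(50089)); // 233
    System.out.println(utf8Packed("A"));          // 65
  }
}
```

For ASCII input the two values coincide, so a workaround is only needed for characters outside the 7-bit range.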

Co-authored-by: Nghia Truong <[email protected]>

res-life · 2023-12-12T04:37:32Z

/merge

res-life · 2023-12-12T04:39:00Z

/ok to test

Signed-off-by: Haoyang Li <[email protected]>

res-life · 2023-12-12T04:48:45Z

/ok to test

Add code_points jni

1413851

Signed-off-by: Haoyang Li <[email protected]>

thirtiseven requested a review from a team as a code owner November 30, 2023 09:34

github-actions bot added the Java label Nov 30, 2023

davidwendt added the 3 - Ready for Review, improvement, and non-breaking labels Nov 30, 2023

ttnghia reviewed Nov 30, 2023

java/src/main/native/src/ColumnViewJni.cpp

jlowe approved these changes Nov 30, 2023

thirtiseven and others added 2 commits December 1, 2023 17:51

Update java/src/main/native/src/ColumnViewJni.cpp

981510b

Co-authored-by: Nghia Truong <[email protected]>

Merge branch 'rapidsai:branch-24.02' into code_points_jni

00e965d

ttnghia approved these changes Dec 6, 2023

Merge branch 'branch-24.02' into code_points_jni

fa0bb0b

format

378b5d1

Signed-off-by: Haoyang Li <[email protected]>

rapids-bot bot merged commit f8e891f into rapidsai:branch-24.02 Dec 12, 2023
67 checks passed
