Add JNI for strings::split_re and strings::split_record_re #10139

Merged
merged 54 commits into from
Feb 14, 2022
eaba42e
Add libcudf strings split API that accepts regex pattern
davidwendt Jan 26, 2022
a832436
add error-checking gtests
davidwendt Jan 26, 2022
baccf10
Add JNI
ttnghia Jan 26, 2022
d4e5746
Merge branch 'branch-22.04' into fea-split-with-regex
davidwendt Jan 27, 2022
d33f79b
use count_matches utility
davidwendt Jan 27, 2022
9c74fdf
add split_re declaration
davidwendt Jan 27, 2022
1a89db5
split_re implementation and tests
davidwendt Jan 27, 2022
8599d0c
rename split_record_re.cu to split_re.cu
davidwendt Jan 28, 2022
b6d7453
refactored split_re/rsplit_re functions
davidwendt Jan 31, 2022
9556fc1
Merge branch 'branch-22.04' into fea-split-with-regex
davidwendt Jan 31, 2022
7bc451b
remove unneeded if-check
davidwendt Jan 31, 2022
93887b1
add all empty and all null test cases
davidwendt Jan 31, 2022
0930513
Merge branch 'branch-22.04' into fea-split-with-regex
davidwendt Feb 1, 2022
c88eeae
add more maxsplit gtests
davidwendt Feb 1, 2022
7d9d30d
Merge branch 'branch-22.04' into fea-split-with-regex
davidwendt Feb 1, 2022
c76456d
Merge branch 'branch-22.04' into fea-split-with-regex
davidwendt Feb 3, 2022
22be900
Merge branch 'branch-22.04' into fea-split-with-regex
davidwendt Feb 3, 2022
3609f2b
Merge branch 'branch-22.04' into fea-split-with-regex
davidwendt Feb 3, 2022
e959cd0
Merge branch 'branch-22.04' into jni_for_strings_split_re
ttnghia Feb 4, 2022
773047d
Merge branch 'branch-22.04' into fea-split-with-regex
davidwendt Feb 4, 2022
79887d8
Change JNI to add a new boolean flag for regex split
ttnghia Feb 4, 2022
28524b3
Implement all possible overloads for stringSplit binding
ttnghia Feb 4, 2022
59dbd9d
Merge remote-tracking branch 'david/fea-split-with-regex' into jni_fo…
ttnghia Feb 4, 2022
75ffaf8
Change JNI for stringSplit and stringSplitRecord
ttnghia Feb 4, 2022
b3604c9
Rename tests
ttnghia Feb 4, 2022
61605ef
Remove test
ttnghia Feb 4, 2022
8307bfe
Add assert
ttnghia Feb 4, 2022
75dc621
Add assert
ttnghia Feb 4, 2022
4a20662
Fix assert
ttnghia Feb 4, 2022
f915f7e
Fix string construction from jstring
ttnghia Feb 5, 2022
4176563
Rename variable and rewrite javadoc
ttnghia Feb 7, 2022
1e51736
Merge branch 'branch-22.04' into fea-split-with-regex
davidwendt Feb 7, 2022
6d8bcc9
Convert java limit to cudf max_split
ttnghia Feb 7, 2022
2e6450f
Fix Java test
ttnghia Feb 7, 2022
eb8c326
fix doxygen typo in @throw line
davidwendt Feb 8, 2022
d6ee883
refactor max-tokens calculation into helper function
davidwendt Feb 8, 2022
f528107
Fix typo
ttnghia Feb 8, 2022
b9707ec
Merge branch 'fea-split-with-regex' into jni_for_strings_split_re
ttnghia Feb 8, 2022
0cd385b
Merge branch 'branch-22.04' into jni_for_strings_split_re
ttnghia Feb 10, 2022
70a4e34
Remove support for empty delimiter
ttnghia Feb 10, 2022
af16edd
Update Java tests
ttnghia Feb 10, 2022
50792dd
Merge branch 'branch-22.04' into jni_for_strings_split_re
ttnghia Feb 10, 2022
cac2637
Fix Java tests
ttnghia Feb 10, 2022
d207e7d
Merge branch 'branch-22.04' into jni_for_strings_split_re
ttnghia Feb 11, 2022
2fade8d
Reverse change
ttnghia Feb 11, 2022
a65d358
Rewrite Javadoc
ttnghia Feb 11, 2022
7f0fee7
Fix Javadoc for the native methods
ttnghia Feb 11, 2022
1b4cd51
Rewrite JNI
ttnghia Feb 11, 2022
00d7d8b
Add a function to construct std::string from native_jstring
ttnghia Feb 11, 2022
69bb7a0
Update JNI
ttnghia Feb 11, 2022
a1b0e37
Revert "Add a function to construct std::string from native_jstring"
ttnghia Feb 11, 2022
3d55e34
Update JNI
ttnghia Feb 11, 2022
2ba4039
Update comments to clarify why we don't support limit==0 and limit==1
ttnghia Feb 11, 2022
bad6e67
Remove unused header that was added by accident
ttnghia Feb 11, 2022
215 changes: 138 additions & 77 deletions java/src/main/java/ai/rapids/cudf/ColumnView.java
@@ -826,18 +826,18 @@ public final ColumnVector mergeAndSetValidity(BinaryOp mergeOp, ColumnView... co
/**
* Creates a deep copy of a column while replacing the validity mask. The validity mask is the
* device_vector equivalent of the boolean column given as argument.
*
*
* The boolColumn must have the same number of rows as the current column.
* The result column will have the same number of rows as the current column.
* The result column will have the same number of rows as the current column.
* For all indices `i` where the boolColumn is `true`, the result column will have a valid value at index i.
* For all other values (i.e. `false` or `null`), the result column will have nulls.
*
*
* If the current column has a null at a given index `i`, and the new validity mask is `true` at index `i`,
* then the row value is undefined.
*
*
* @param boolColumn bool column whose value is to be used as the validity mask.
* @return Deep copy of the column with replaced validity mask.
*/
*/
public final ColumnVector copyWithBooleanColumnAsValidity(ColumnView boolColumn) {
return new ColumnVector(copyWithBooleanColumnAsValidity(getNativeView(), boolColumn.getNativeView()));
}
@@ -2345,88 +2345,128 @@ public final ColumnVector stringLocate(Scalar substring, int start, int end) {
}

/**
* Returns a list of columns by splitting each string using the specified delimiter.
* The number of rows in the output columns will be the same as the input column.
* Null entries are added for a row where split results have been exhausted.
* Null string entries return corresponding null output columns.
* @param delimiter UTF-8 encoded string identifying the split points in each string.
* An empty string indicates split on whitespace.
* @param maxSplit the maximum number of splits to perform, or -1 for all possible splits.
* @return New table of strings columns.
* Returns a list of columns by splitting each string using the specified pattern. The number of
* rows in the output columns will be the same as the input column. Null entries are added for a
* row where split results have been exhausted. Null input entries result in all nulls in the
* corresponding rows of the output columns.
*
* @param pattern UTF-8 encoded string identifying the split pattern for each input string.
* @param limit the maximum size of the list resulting from splitting each input string,
* or -1 for all possible splits. Note that limit = 0 (all possible splits without
* trailing empty strings) and limit = 1 (no split at all) are not supported.
* @param splitByRegex a boolean flag indicating whether the input strings will be split by a
* regular expression pattern or just by a string literal delimiter.
* @return list of strings columns as a table.
*/
public final Table stringSplit(Scalar delimiter, int maxSplit) {
public final Table stringSplit(String pattern, int limit, boolean splitByRegex) {
assert type.equals(DType.STRING) : "column type must be a String";
assert delimiter != null : "delimiter may not be null";
assert delimiter.getType().equals(DType.STRING) : "delimiter must be a string scalar";
return new Table(stringSplit(this.getNativeView(), delimiter.getScalarHandle(), maxSplit));
assert pattern != null : "pattern is null";
assert pattern.length() > 0 : "empty pattern is not supported";
assert limit != 0 && limit != 1 : "split limit == 0 and limit == 1 are not supported";
return new Table(stringSplit(this.getNativeView(), pattern, limit, splitByRegex));
}
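
For reviewers, a minimal host-side sketch of the result shape documented above: one output column per token position, nulls where a row runs out of tokens, and all nulls for a null input row. It uses plain `java.util.regex` rather than cudf, so the class and method names (`SplitTableSketch`, `splitToColumns`) are hypothetical, the regex dialect may differ from libcudf's, and edge cases such as all-null input are not modeled. `Pattern.quote` stands in for the `splitByRegex == false` path.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

final class SplitTableSketch {
  // Hypothetical host-side model of the documented behavior; NOT the cudf implementation.
  // Returns columns[c][r]: the c-th token of row r, or null if row r has fewer tokens.
  static String[][] splitToColumns(String[] rows, String pattern, int limit,
                                   boolean splitByRegex) {
    if (pattern == null || pattern.isEmpty()) {
      throw new IllegalArgumentException("empty pattern is not supported");
    }
    if (limit == 0 || limit == 1) {
      throw new IllegalArgumentException("limit == 0 and limit == 1 are not supported");
    }
    // A literal delimiter is modeled by quoting the pattern so metacharacters lose meaning.
    Pattern p = Pattern.compile(splitByRegex ? pattern : Pattern.quote(pattern));

    String[][] tokens = new String[rows.length][];
    int numColumns = 0;
    for (int r = 0; r < rows.length; r++) {
      // A null input row contributes no tokens, so every output column is null at r.
      tokens[r] = rows[r] == null ? new String[0] : p.split(rows[r], limit);
      numColumns = Math.max(numColumns, tokens[r].length);
    }

    String[][] columns = new String[numColumns][rows.length];
    for (int c = 0; c < numColumns; c++) {
      for (int r = 0; r < rows.length; r++) {
        columns[c][r] = c < tokens[r].length ? tokens[r][c] : null;
      }
    }
    return columns;
  }

  public static void main(String[] args) {
    String[] rows = {"a_b_c", "d_e", null};
    System.out.println(Arrays.deepToString(splitToColumns(rows, "_", -1, false)));
    System.out.println(
        Arrays.deepToString(splitToColumns(new String[] {"a1b22c"}, "\\d+", 2, true)));
  }
}
```

With `limit = 2` each row yields at most two tokens, which matches the "maximum size of the list" wording in the Javadoc rather than a count of splits.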

/**
* Returns a list of columns by splitting each string using the specified delimiter.
* The number of rows in the output columns will be the same as the input column.
* Null entries are added for a row where split results have been exhausted.
* Null string entries return corresponding null output columns.
* @param delimiter UTF-8 encoded string identifying the split points in each string.
* An empty string indicates split on whitespace.
* @return New table of strings columns.
* Returns a list of columns by splitting each string using the specified pattern. The number of
* rows in the output columns will be the same as the input column. Null entries are added for a
* row where split results have been exhausted. Null input entries result in all nulls in the
* corresponding rows of the output columns.
*
* @param pattern UTF-8 encoded string identifying the split pattern for each input string.
* @param splitByRegex a boolean flag indicating whether the input strings will be split by a
* regular expression pattern or just by a string literal delimiter.
* @return list of strings columns as a table.
*/
public final Table stringSplit(Scalar delimiter) {
return stringSplit(delimiter, -1);
public final Table stringSplit(String pattern, boolean splitByRegex) {
return stringSplit(pattern, -1, splitByRegex);
}

/**
* Returns a list of columns by splitting each string using whitespace as the delimiter.
* The number of rows in the output columns will be the same as the input column.
* Null entries are added for a row where split results have been exhausted.
* Null string entries return corresponding null output columns.
* @return New table of strings columns.
* Returns a list of columns by splitting each string using the specified string literal
* delimiter. The number of rows in the output columns will be the same as the input column.
* Null entries are added for a row where split results have been exhausted. Null input entries
* result in all nulls in the corresponding rows of the output columns.
*
* @param delimiter UTF-8 encoded string identifying the split delimiter for each input string.
* @param limit the maximum size of the list resulting from splitting each input string,
* or -1 for all possible splits. Note that limit = 0 (all possible splits without
* trailing empty strings) and limit = 1 (no split at all) are not supported.
* @return list of strings columns as a table.
*/
public final Table stringSplit() {
try (Scalar emptyString = Scalar.fromString("")) {
return stringSplit(emptyString, -1);
}
public final Table stringSplit(String delimiter, int limit) {
return stringSplit(delimiter, limit, false);
}

/**
* Returns a column of lists of strings by splitting each string using whitespace as the delimiter.
* Returns a list of columns by splitting each string using the specified string literal
* delimiter. The number of rows in the output columns will be the same as the input column.
* Null entries are added for a row where split results have been exhausted. Null input entries
* result in all nulls in the corresponding rows of the output columns.
*
* @param delimiter UTF-8 encoded string identifying the split delimiter for each input string.
* @return list of strings columns as a table.
*/
public final ColumnVector stringSplitRecord() {
return stringSplitRecord(-1);
public final Table stringSplit(String delimiter) {
return stringSplit(delimiter, -1, false);
}

/**
* Returns a column of lists of strings by splitting each string using whitespace as the delimiter.
* @param maxSplit the maximum number of splits to perform, or -1 for all possible splits.
* Returns a column of lists of strings in which each list is made by splitting the
* corresponding input string using the specified pattern.
*
* @param pattern UTF-8 encoded string identifying the split pattern for each input string.
* @param limit the maximum size of the list resulting from splitting each input string,
* or -1 for all possible splits. Note that limit = 0 (all possible splits without
* trailing empty strings) and limit = 1 (no split at all) are not supported.
* @param splitByRegex a boolean flag indicating whether the input strings will be split by a
* regular expression pattern or just by a string literal delimiter.
* @return a LIST column of string elements.
*/
public final ColumnVector stringSplitRecord(int maxSplit) {
try (Scalar emptyString = Scalar.fromString("")) {
return stringSplitRecord(emptyString, maxSplit);
}
public final ColumnVector stringSplitRecord(String pattern, int limit, boolean splitByRegex) {
assert type.equals(DType.STRING) : "column type must be String";
assert pattern != null : "pattern is null";
assert pattern.length() > 0 : "empty pattern is not supported";
assert limit != 0 && limit != 1 : "split limit == 0 and limit == 1 are not supported";
return new ColumnVector(
stringSplitRecord(this.getNativeView(), pattern, limit, splitByRegex));
}
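
By contrast with the table form, the record form keeps each row's tokens together in one list and needs no padding. The sketch below models that on the host with plain Java collections. `SplitRecordSketch` and `splitToLists` are hypothetical names, and mapping a null input row to a null list is an assumption, since the Javadoc here does not state the null behavior for the record variant.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class SplitRecordSketch {
  // Hypothetical host-side model of a LIST-of-strings result; NOT the cudf implementation.
  static List<List<String>> splitToLists(String[] rows, String pattern, int limit,
                                         boolean splitByRegex) {
    if (pattern == null || pattern.isEmpty()) {
      throw new IllegalArgumentException("empty pattern is not supported");
    }
    if (limit == 0 || limit == 1) {
      throw new IllegalArgumentException("limit == 0 and limit == 1 are not supported");
    }
    Pattern p = Pattern.compile(splitByRegex ? pattern : Pattern.quote(pattern));

    List<List<String>> out = new ArrayList<>();
    for (String row : rows) {
      // Assumption: a null input row becomes a null list entry.
      out.add(row == null ? null : Arrays.asList(p.split(row, limit)));
    }
    return out;
  }

  public static void main(String[] args) {
    String[] rows = {"a b  c", null, "xyz"};
    System.out.println(splitToLists(rows, "\\s+", -1, true));
  }
}
```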

/**
* Returns a column of lists of strings in which each list is made by splitting the
* corresponding input string using the specified pattern.
*
* @param pattern UTF-8 encoded string identifying the split pattern for each input string.
* @param splitByRegex a boolean flag indicating whether the input strings will be split by a
* regular expression pattern or just by a string literal delimiter.
* @return a LIST column of string elements.
*/
public final ColumnVector stringSplitRecord(String pattern, boolean splitByRegex) {
return stringSplitRecord(pattern, -1, splitByRegex);
}

/**
* Returns a column of lists of strings by splitting each string using the specified delimiter.
* @param delimiter UTF-8 encoded string identifying the split points in each string.
* An empty string indicates split on whitespace.
* Returns a column of lists of strings in which each list is made by splitting the
* corresponding input string using the specified string literal delimiter.
*
* @param delimiter UTF-8 encoded string identifying the split delimiter for each input string.
* @param limit the maximum size of the list resulting from splitting each input string,
* or -1 for all possible splits. Note that limit = 0 (all possible splits without
* trailing empty strings) and limit = 1 (no split at all) are not supported.
* @return a LIST column of string elements.
*/
public final ColumnVector stringSplitRecord(Scalar delimiter) {
return stringSplitRecord(delimiter, -1);
public final ColumnVector stringSplitRecord(String delimiter, int limit) {
return stringSplitRecord(delimiter, limit, false);
}

/**
* Returns a column that is a list of strings. Each string list is made by splitting each input
* string using the specified delimiter.
* @param delimiter UTF-8 encoded string identifying the split points in each string.
* An empty string indicates split on whitespace.
* @param maxSplit the maximum number of splits to perform, or -1 for all possible splits.
* @return New table of strings columns.
* Returns a column of lists of strings in which each list is made by splitting the
* corresponding input string using the specified string literal delimiter.
*
* @param delimiter UTF-8 encoded string identifying the split delimiter for each input string.
* @return a LIST column of string elements.
*/
public final ColumnVector stringSplitRecord(Scalar delimiter, int maxSplit) {
assert type.equals(DType.STRING) : "column type must be a String";
assert delimiter != null : "delimiter may not be null";
assert delimiter.getType().equals(DType.STRING) : "delimiter must be a string scalar";
return new ColumnVector(
stringSplitRecord(this.getNativeView(), delimiter.getScalarHandle(), maxSplit));
public final ColumnVector stringSplitRecord(String delimiter) {
return stringSplitRecord(delimiter, -1, false);
}

/**
@@ -3248,7 +3288,7 @@ public enum FindOptions {FIND_FIRST, FIND_LAST};
* Create a column of int32 indices, indicating the position of the scalar search key
* in each list row.
* All indices are 0-based. If a search key is not found, the index is set to -1.
* The index is set to null if one of the following is true:
* The index is set to null if one of the following is true:
* 1. The search key is null.
* 2. The list row is null.
* @param key The scalar search key
@@ -3265,7 +3305,7 @@ public final ColumnVector listIndexOf(Scalar key, FindOptions findOption) {
* Create a column of int32 indices, indicating the position of each row in the
* search key column in the corresponding row of the lists column.
* All indices are 0-based. If a search key is not found, the index is set to -1.
* The index is set to null if one of the following is true:
* The index is set to null if one of the following is true:
* 1. The search key row is null.
* 2. The list row is null.
* @param keys ColumnView of search keys.
@@ -3531,15 +3571,36 @@ private static native long repeatStringsWithColumnRepeatTimes(long stringsHandle
private static native long substringLocate(long columnView, long substringScalar, int start, int end);

/**
* Native method which returns array of columns by splitting each string using the specified
* delimiter.
* @param columnView native handle of the cudf::column_view being operated on.
* @param delimiter UTF-8 encoded string identifying the split points in each string.
* @param maxSplit the maximum number of splits to perform, or -1 for all possible splits.
* Returns a list of columns by splitting each string using the specified pattern. The number of
* rows in the output columns will be the same as the input column. Null entries are added for a
* row where split results have been exhausted. Null input entries result in all nulls in the
* corresponding rows of the output columns.
*
* @param nativeHandle native handle of the input strings column that is being operated on.
* @param pattern UTF-8 encoded string identifying the split pattern for each input string.
* @param limit the maximum size of the list resulting from splitting each input string,
* or -1 for all possible splits. Note that limit = 0 (all possible splits without
* trailing empty strings) and limit = 1 (no split at all) are not supported.
* @param splitByRegex a boolean flag indicating whether the input strings will be split by a
* regular expression pattern or just by a string literal delimiter.
*/
private static native long[] stringSplit(long columnView, long delimiter, int maxSplit);
private static native long[] stringSplit(long nativeHandle, String pattern, int limit,
boolean splitByRegex);

private static native long stringSplitRecord(long nativeView, long scalarHandle, int maxSplit);
/**
* Returns a column of lists of strings in which each list is made by splitting the
* corresponding input string using the specified pattern.
*
* @param nativeHandle native handle of the input strings column that is being operated on.
* @param pattern UTF-8 encoded string identifying the split pattern for each input string.
* @param limit the maximum size of the list resulting from splitting each input string,
* or -1 for all possible splits. Note that limit = 0 (all possible splits without
* trailing empty strings) and limit = 1 (no split at all) are not supported.
* @param splitByRegex a boolean flag indicating whether the input strings will be split by a
* regular expression pattern or just by a string literal delimiter.
*/
private static native long stringSplitRecord(long nativeHandle, String pattern, int limit,
boolean splitByRegex);
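
The commit list mentions converting the Java `limit` to a cudf `max_split`. The old Javadoc describes `maxSplit` as the maximum number of splits, while the new `limit` is the maximum number of resulting pieces, so the two differ by one. The sketch below shows one way that mapping could look. The helper name `toMaxSplit` is hypothetical, and treating every negative `limit` as "all possible splits" is an assumption rather than something this diff shows.

```java
final class LimitSketch {
  // Hypothetical helper: maps a Java-style limit (max number of pieces) to a
  // max-split count (max number of cuts). Not taken from the JNI source.
  static int toMaxSplit(int limit) {
    if (limit == 0 || limit == 1) {
      throw new IllegalArgumentException("limit == 0 and limit == 1 are not supported");
    }
    // N pieces require N - 1 cuts; any negative limit is assumed to mean "no bound".
    return limit > 1 ? limit - 1 : -1;
  }

  public static void main(String[] args) {
    // Cross-check against String.split: limit 3 yields 3 pieces, i.e. 2 cuts.
    String[] pieces = "a,b,c,d".split(",", 3);
    System.out.println(pieces.length + " pieces, maxSplit = " + toMaxSplit(3));
  }
}
```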

/**
* Native method to calculate substring from a given string column. 0 indexing.
@@ -3714,7 +3775,7 @@ private static native long stringReplaceWithBackrefs(long columnView, String pat
/**
* Native method to search list rows for null elements.
* @param nativeView the column view handle of the list
* @return column handle of the resultant boolean column
* @return column handle of the resultant boolean column
*/
private static native long listContainsNulls(long nativeView);

@@ -3896,20 +3957,20 @@ private static native long bitwiseMergeAndSetValidity(long baseHandle, long[] vi
/**
* Native method to deep copy a column while replacing the null mask. The null mask is the
* device_vector equivalent of the boolean column given as argument.
*
*
* The boolColumn must have the same number of rows as the exemplar column.
* The result column will have the same number of rows as the exemplar.
* For all indices `i` where the boolean column is `true`, the result column will have a valid value at index i.
* For all other values (i.e. `false` or `null`), the result column will have nulls.
*
*
* If the exemplar column has a null at a given index `i`, and the new validity mask is `true` at index `i`,
* then the resultant row value is undefined.
*
*
* @param exemplarViewHandle column view of the column that is deep copied.
* @param boolColumnViewHandle bool column whose value is to be used as the null mask.
* @return Deep copy of the column with replaced null mask.
*/
private static native long copyWithBooleanColumnAsValidity(long exemplarViewHandle,
*/
private static native long copyWithBooleanColumnAsValidity(long exemplarViewHandle,
long boolColumnViewHandle) throws CudfException;

////////