Add Java API for Concatenate strings with separator [skip ci] #8289

Merged
merged 22 commits into from
May 26, 2021
Changes from 14 commits
Commits
df9eac9
concat_ws initial version without array support, include tests
tgravescs May 13, 2021
701c801
Add documentation and asserts
tgravescs May 13, 2021
ab9349c
documentation and more tests
tgravescs May 13, 2021
06d2cb3
Initial jni for list column view concatenate with separator
tgravescs May 13, 2021
c0fd1a3
Add more tests for arrays
tgravescs May 14, 2021
3d613e3
another test
tgravescs May 19, 2021
44438cc
Update to changes to concate api
tgravescs May 19, 2021
c43b0e0
Update api for new separator parameter
tgravescs May 19, 2021
61f56c8
Update to new parameter is concatenate list api
tgravescs May 19, 2021
38ae371
Add in the extra paramters for concatenate with scalar separator and …
tgravescs May 19, 2021
963cb39
Add more tests
tgravescs May 19, 2021
93fba2a
Update spacing
tgravescs May 19, 2021
f3c18a8
Revert "Update spacing"
tgravescs May 19, 2021
27b9267
Update spacing
tgravescs May 19, 2021
6c4c2ef
Update to match null handling for array of all nulls
tgravescs May 20, 2021
10ffddf
Fix descriptions of parameters and typos
tgravescs May 20, 2021
cedd6b1
remove extra include
tgravescs May 20, 2021
e9d7b22
Add test for one column
tgravescs May 20, 2021
91713ad
Move string concatenate list element functions to ColumnView and update
tgravescs May 20, 2021
6e1c2be
Update to use camel case and fix doc
tgravescs May 20, 2021
f8612d7
Fix camel case in column view
tgravescs May 20, 2021
0bdf70c
PR 8282 changed behavior of passing single column to stringConcatenate,
tgravescs May 21, 2021
227 changes: 224 additions & 3 deletions java/src/main/java/ai/rapids/cudf/ColumnVector.java
@@ -500,7 +500,8 @@ public static ColumnVector stringConcatenate(ColumnView[] columns) {

/**
* Concatenate columns of strings together, combining a corresponding row from each column into
* a single string row of a new column.
* a single string row of a new column. This version includes the separator for null rows
* if 'narep' is valid.
* @param separator string scalar inserted between each string being merged.
* @param narep string scalar indicating null behavior. If set to null and any string in the row
* is null the resulting string will be null. If not null, null values in any column
@@ -509,6 +510,23 @@ public static ColumnVector stringConcatenate(ColumnView[] columns) {
* @return A new java column vector containing the concatenated strings.
*/
public static ColumnVector stringConcatenate(Scalar separator, Scalar narep, ColumnView[] columns) {
return stringConcatenate(separator, narep, columns, true);
}

/**
* Concatenate columns of strings together, combining a corresponding row from each column into
* a single string row of a new column.
* @param separator string scalar inserted between each string being merged.
* @param narep string scalar indicating null behavior. If set to null and any string in the row
* is null the resulting string will be null. If not null, null values in any column
* will be replaced by the specified string.
* @param columns array of columns containing strings, must be non-empty
* @param separate_nulls if true, then the separator is included for null rows if
* `narep` is valid.
* @return A new java column vector containing the concatenated strings.
*/
public static ColumnVector stringConcatenate(Scalar separator, Scalar narep, ColumnView[] columns,
boolean separate_nulls) {
assert columns != null : "input columns should not be null";
assert columns.length > 0 : "input columns should not be empty";
assert separator != null : "separator scalar provided may not be null";
@@ -522,7 +540,137 @@ public static ColumnVector stringConcatenate(Scalar separator, Scalar narep, Col
column_views[i] = columns[i].getNativeView();
}

return new ColumnVector(stringConcatenation(column_views, separator.getScalarHandle(), narep.getScalarHandle()));
return new ColumnVector(stringConcatenation(column_views, separator.getScalarHandle(),
narep.getScalarHandle(), separate_nulls));
}
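
The row semantics described in the Javadoc above can be sketched on the CPU. This is only a plain-Java model written for illustration, not the cudf implementation: the class `ScalarSeparatorModel` and method `concatRow` are hypothetical names, a Java `null` for `narep` stands in for a string scalar whose underlying value is null, and the model covers only the `separate_nulls = true` case that the three-argument overload selects.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical CPU-side model of the documented row semantics (not cudf code).
final class ScalarSeparatorModel {
  // Joins the values of one row. A null narep models a string scalar holding null.
  static String concatRow(String separator, String narep, List<String> row) {
    StringBuilder out = new StringBuilder();
    for (int i = 0; i < row.size(); i++) {
      String value = row.get(i);
      if (value == null) {
        if (narep == null) {
          return null; // any null entry makes the whole output row null
        }
        value = narep; // otherwise the null entry is replaced by narep
      }
      if (i > 0) {
        out.append(separator); // separator goes between every pair of entries
      }
      out.append(value);
    }
    return out.toString();
  }

  public static void main(String[] args) {
    System.out.println(concatRow(":", "_", Arrays.asList("aa", null)));  // aa:_
    System.out.println(concatRow(":", null, Arrays.asList("aa", null))); // null
  }
}
```

The model keeps the separator next to a replaced null, which is what the "includes the separator for null rows" wording suggests for this overload.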

/**
* Concatenate columns of strings together using a separator specified for each row
* and returns the result as a string column. If the row separator for a given row is null,
* output column for that row is null. Null column values for a given row are skipped.
* @param columns array of columns containing strings, must contain at least 1 column
* @param sep_col strings column that provides the separator for a given row
* @return A new java column vector containing the concatenated strings with separator between.
*/
public static ColumnVector stringConcatenate(ColumnView[] columns, ColumnView sep_col) {
try (Scalar nullString = Scalar.fromString(null);
Scalar emptyString = Scalar.fromString("")) {
return stringConcatenate(columns, sep_col, nullString, emptyString, false);
}
}
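
As a rough illustration of the two-argument overload's documented behavior (a null row separator gives a null output row, and null column values are skipped), here is a plain-Java sketch. The names `RowSeparatorModel` and `concatRow` are hypothetical, and the result for a row whose values are all null is an assumption of this model rather than something the Javadoc states.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical CPU-side model of the per-row separator overload (not cudf code).
final class RowSeparatorModel {
  static String concatRow(List<String> row, String rowSeparator) {
    if (rowSeparator == null) {
      return null; // a null separator for the row yields a null output row
    }
    List<String> kept = new ArrayList<>();
    for (String value : row) {
      if (value != null) {
        kept.add(value); // null column values are skipped, with no separator emitted
      }
    }
    // Assumption: a row with no non-null values becomes "" in this model.
    return String.join(rowSeparator, kept);
  }

  public static void main(String[] args) {
    System.out.println(concatRow(Arrays.asList("a", null, "b"), "-")); // a-b
    System.out.println(concatRow(Arrays.asList("a", "b"), null));      // null
  }
}
```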

/**
* Concatenate columns of strings together using a separator specified for each row
* and returns the result as a string column. If the row separator for a given row is null,
* output column for that row is null unless separator_narep is provided.
* The separator is applied between two output row values if separate_nulls
* is true, or only between valid rows if separate_nulls is false.
* @param columns array of columns containing strings, must contain at least 1 column
* @param sep_col strings column that provides the separator for a given row
* @param separator_narep String that should be used in place of a null separator for a given
* row.
* @param col_narep string that should be used in place of any null strings
* found in any column.
* @param separate_nulls if true, then the separator is included for null rows if
* `col_narep` is valid.
* @return A new java column vector containing the concatenated strings with separator between.
*/
public static ColumnVector stringConcatenate(ColumnView[] columns,
ColumnView sep_col, Scalar separator_narep, Scalar col_narep, boolean separate_nulls) {
assert columns.length >= 1 : ".stringConcatenate() operation requires at least 1 column";
assert separator_narep != null : "separator narep scalar provided may not be null";
assert col_narep != null : "column narep scalar provided may not be null";
assert separator_narep.getType().equals(DType.STRING) : "separator narep scalar must be a string scalar";
assert col_narep.getType().equals(DType.STRING) : "column narep scalar must be a string scalar";

long[] column_views = new long[columns.length];
for(int i = 0; i < columns.length; i++) {
assert columns[i] != null : "Column vectors passed may not be null";
column_views[i] = columns[i].getNativeView();
}

return new ColumnVector(stringConcatenationSepCol(column_views, sep_col.getNativeView(),
separator_narep.getScalarHandle(), col_narep.getScalarHandle(), separate_nulls));
}
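
The role of `separator_narep` in the overload above can be shown with a small plain-Java sketch that resolves the separator each row would use. `SeparatorNarepModel` and `resolveSeparators` are hypothetical names, and a Java `null` again stands in for a string scalar whose value is null; the sketch covers only the separator fallback, not the concatenation itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical CPU-side model of the separator fallback (not cudf code).
final class SeparatorNarepModel {
  // Returns the effective separator per row; a null entry means the output row is null.
  static List<String> resolveSeparators(List<String> sepCol, String separatorNarep) {
    List<String> resolved = new ArrayList<>(sepCol.size());
    for (String sep : sepCol) {
      // A null row separator falls back to separatorNarep, which may itself be null.
      resolved.add(sep != null ? sep : separatorNarep);
    }
    return resolved;
  }

  public static void main(String[] args) {
    System.out.println(resolveSeparators(Arrays.asList("-", null), "+"));  // [-, +]
    System.out.println(resolveSeparators(Arrays.asList("-", null), null)); // [-, null]
  }
}
```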

/**
* Given a lists column of strings (each row is a list of strings), concatenates the strings
* within each row and returns a single strings column result. Each new string is created by
* concatenating the strings from the same row (same list element) delimited by the separator
* provided. This version of the function replaces nulls with an empty string and returns null
* for an empty list.
* @param list_column column containing lists of strings to concatenate.
* @param sep_col strings column that provides separators for concatenation.
* @return A new java column vector containing the concatenated strings with separator between.
*/
public static ColumnVector stringConcatenateListElements(ColumnView list_column,
ColumnView sep_col) {
try (Scalar nullString = Scalar.fromString(null);
Scalar emptyString = Scalar.fromString("")) {
return stringConcatenateListElements(list_column, sep_col, nullString, emptyString,
false, false);
}
}

/**
* Given a lists column of strings (each row is a list of strings), concatenates the strings
* within each row and returns a single strings column result.
* Each new string is created by concatenating the strings from the same row (same list element)
* delimited by the row separator provided in the `sep_col` strings column.
* @param list_column column containing lists of strings to concatenate.
* @param sep_col strings column that provides separators for concatenation.
* @param separator_narep string that should be used to replace null separator, default is an
* invalid-scalar denoting that rows containing null separator will
* result in null string in the corresponding output rows.
* @param col_narep string that should be used to replace null strings in any non-null list
* row, default is an invalid-scalar denoting that list rows containing null
* strings will result in null string in the corresponding output rows.
* @param separate_nulls if true, then the separator is included for null rows if
* `col_narep` is valid.
* @param empty_string_output_if_empty_list if set to true, any input row that is an empty list
* will result in an empty string. Otherwise, it will result in a null.
* @return A new java column vector containing the concatenated strings with separator between.
*/
public static ColumnVector stringConcatenateListElements(ColumnView list_column,
ColumnView sep_col, Scalar separator_narep, Scalar col_narep, boolean separate_nulls,
boolean empty_string_output_if_empty_list) {
assert separator_narep != null : "separator narep scalar provided may not be null";
assert col_narep != null : "column narep scalar provided may not be null";
assert separator_narep.getType().equals(DType.STRING) : "separator narep scalar must be a string scalar";
assert col_narep.getType().equals(DType.STRING) : "column narep scalar must be a string scalar";

return new ColumnVector(stringConcatenationListElementsSepCol(list_column.getNativeView(),
sep_col.getNativeView(), separator_narep.getScalarHandle(), col_narep.getScalarHandle(),
separate_nulls, empty_string_output_if_empty_list));
}

/**
* Given a lists column of strings (each row is a list of strings), concatenates the strings
* within each row and returns a single strings column result. Each new string is created by
* concatenating the strings from the same row (same list element) delimited by the
* separator provided.
* @param list_column column containing lists of strings to concatenate.
* @param separator string scalar inserted between each string being merged.
* @param narep string scalar indicating null behavior. If set to null and any string in the row
* is null the resulting string will be null. If not null, null values in any
* column will be replaced by the specified string. The underlying value in the
* string scalar may be null, but the object passed in may not.
* @param separate_nulls if true, then the separator is included for null rows if
* `narep` is valid.
* @param empty_string_output_if_empty_list if set to true, any input row that is an empty list
* will result in an empty string. Otherwise, it will result in a null.
* @return A new java column vector containing the concatenated strings with separator between.
*/
public static ColumnVector stringConcatenateListElements(ColumnView list_column,
Scalar separator, Scalar narep, boolean separate_nulls,
boolean empty_string_output_if_empty_list) {
assert separator != null : "separator scalar provided may not be null";
assert narep != null : "column narep scalar provided may not be null";
assert narep.getType().equals(DType.STRING) : "narep scalar must be a string scalar";

return new ColumnVector(stringConcatenationListElements(list_column.getNativeView(),
separator.getScalarHandle(), narep.getScalarHandle(), separate_nulls,
empty_string_output_if_empty_list));
}
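
The effect of `empty_string_output_if_empty_list` is easiest to see in a CPU-side sketch of one list row. This is a plain-Java model for illustration only: `ListElementsModel` and `concatList` are hypothetical names, the handling of a null list row is an assumption, and the model covers only the `separate_nulls = true` case.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical CPU-side model of concatenating one list row (not cudf code).
final class ListElementsModel {
  static String concatList(List<String> list, String separator, String narep,
                           boolean emptyStringOutputIfEmptyList) {
    if (list == null) {
      return null; // assumption: a null list row stays null
    }
    if (list.isEmpty()) {
      // Documented flag: an empty list becomes "" when true, null otherwise.
      return emptyStringOutputIfEmptyList ? "" : null;
    }
    List<String> parts = new ArrayList<>(list.size());
    for (String value : list) {
      if (value == null && narep == null) {
        return null; // a null string with a null narep nulls the output row
      }
      parts.add(value == null ? narep : value);
    }
    return String.join(separator, parts);
  }

  public static void main(String[] args) {
    List<String> empty = Collections.<String>emptyList();
    System.out.println(concatList(empty, "-", "x", true));                         // prints an empty line
    System.out.println(concatList(empty, "-", "x", false));                        // null
    System.out.println(concatList(Arrays.asList("a", null, "b"), "-", "x", true)); // a-x-b
  }
}
```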

/**
@@ -721,10 +869,83 @@ private static native long makeList(long[] handles, long typeHandle, int scale,
* the resulting string will be null. If not null, null values in any column will be
* replaced by the specified string. The underlying value in the string scalar may be null,
* but the object passed in may not.
* @param separate_nulls boolean if true, then the separator is included for null rows if
* `narep` is valid.
* @return native handle of the resulting cudf column, used to construct the Java column
* by the stringConcatenate method.
*/
private static native long stringConcatenation(long[] columnViews, long separator, long narep);
private static native long stringConcatenation(long[] columnViews, long separator, long narep,
boolean separate_nulls);

/**
* Native method to concatenate columns of strings together using a separator specified for each row
* and returns the result as a string column.
* @param columnViews array of longs holding the native handles of the column_views to combine.
* @param sep_column long holding the native handle of the strings_column_view used as separators.
* @param separator_narep String scalar that should be used in place of a null separator for a given
* row.
* @param col_narep String scalar that should be used in place of any null strings
* found in any column.
* @param separate_nulls boolean if true, then the separator is included for null rows if
* `col_narep` is valid.
* @return native handle of the resulting cudf column, used to construct the Java column.
*/
private static native long stringConcatenationSepCol(long[] columnViews,
long sep_column,
long separator_narep,
long col_narep,
boolean separate_nulls);

/**
* Native method to concatenate a list column of strings (each row is a list of strings),
* concatenates the strings within each row and returns a single strings column result.
* Each new string is created by concatenating the strings from the same row (same list element)
* delimited by the row separator provided in the `sep_column` strings column.
* @param list_column long holding the native handle of the column containing lists of strings
* to concatenate.
* @param sep_column long holding the native handle of the strings column that provides separators
* for concatenation.
* @param separator_narep String scalar that should be used in place of a null separator for a given
* row.
* @param col_narep String scalar that should be used in place of any null strings
* found in any column.
* @param separate_nulls boolean if true, then the separator is included for null rows if
* `col_narep` is valid.
* @param empty_string_output_if_empty_list boolean if true, any input row that is an empty list
* will result in an empty string. Otherwise, it will result in a null.
* @return native handle of the resulting cudf column, used to construct the Java column.
*/
private static native long stringConcatenationListElementsSepCol(long list_column,
long sep_column,
long separator_narep,
long col_narep,
boolean separate_nulls,
boolean empty_string_output_if_empty_list);

/**
* Native method to concatenate a list column of strings (each row is a list of strings),
* concatenates the strings within each row and returns a single strings column result.
* Each new string is created by concatenating the strings from the same row (same list element)
* delimited by the separator provided.
* @param list_column long holding the native handle of the column containing lists of strings
* to concatenate.
* @param separator string scalar inserted between each string being merged, may not be null.
* @param narep string scalar indicating null behavior. If set to null and any string in the row
* is null the resulting string will be null. If not null, null values in any
* column will be replaced by the specified string. The underlying value in the
* string scalar may be null, but the object passed in may not.
* @param separate_nulls boolean if true, then the separator is included for null rows if
* `narep` is valid.
* @param empty_string_output_if_empty_list boolean if true, any input row that is an empty list
* will result in an empty string. Otherwise, it will
* result in a null.
* @return native handle of the resulting cudf column, used to construct the Java column.
*/
private static native long stringConcatenationListElements(long list_column,
long separator,
long narep,
boolean separate_nulls,
boolean empty_string_output_if_empty_list);

/**
* Native method to hash each row of the given table. Hashing function dispatched on the