Add Java API for Concatenate strings with separator [skip ci] #8289

Merged
merged 22 commits into from
May 26, 2021
df9eac9
concat_ws initial version without array support, include tests
tgravescs May 13, 2021
701c801
Add documentation and asserts
tgravescs May 13, 2021
ab9349c
documentation and more tests
tgravescs May 13, 2021
06d2cb3
Initial jni for list column view concatenate with separator
tgravescs May 13, 2021
c0fd1a3
Add more tests for arrays
tgravescs May 14, 2021
3d613e3
another test
tgravescs May 19, 2021
44438cc
Update to changes to concate api
tgravescs May 19, 2021
c43b0e0
Update api for new separator parameter
tgravescs May 19, 2021
61f56c8
Update to new parameter is concatenate list api
tgravescs May 19, 2021
38ae371
Add in the extra paramters for concatenate with scalar separator and …
tgravescs May 19, 2021
963cb39
Add more tests
tgravescs May 19, 2021
93fba2a
Update spacing
tgravescs May 19, 2021
f3c18a8
Revert "Update spacing"
tgravescs May 19, 2021
27b9267
Update spacing
tgravescs May 19, 2021
6c4c2ef
Update to match null handling for array of all nulls
tgravescs May 20, 2021
10ffddf
Fix descriptions of parameters and typos
tgravescs May 20, 2021
cedd6b1
remove extra include
tgravescs May 20, 2021
e9d7b22
Add test for one column
tgravescs May 20, 2021
91713ad
Move string concatenate list element functions to ColumnView and update
tgravescs May 20, 2021
6e1c2be
Update to use camel case and fix doc
tgravescs May 20, 2021
f8612d7
Fix camel case in column view
tgravescs May 20, 2021
0bdf70c
PR 8282 changed behavior of passing single column to stringConcatenate,
tgravescs May 21, 2021
113 changes: 104 additions & 9 deletions java/src/main/java/ai/rapids/cudf/ColumnVector.java
@@ -500,7 +500,8 @@ public static ColumnVector stringConcatenate(ColumnView[] columns) {

/**
* Concatenate columns of strings together, combining a corresponding row from each column into
* a single string row of a new column.
* a single string row of a new column. This version includes the separator for null rows
* if 'narep' is valid.
* @param separator string scalar inserted between each string being merged.
* @param narep string scalar indicating null behavior. If set to null and any string in the row
* is null the resulting string will be null. If not null, null values in any column
@@ -509,20 +510,89 @@ public static ColumnVector stringConcatenate(ColumnView[] columns) {
* @return A new java column vector containing the concatenated strings.
*/
public static ColumnVector stringConcatenate(Scalar separator, Scalar narep, ColumnView[] columns) {
return stringConcatenate(separator, narep, columns, true);
}

/**
* Concatenate columns of strings together, combining a corresponding row from each column into
* a single string row of a new column.
* @param separator string scalar inserted between each string being merged.
* @param narep string scalar indicating null behavior. If set to null and any string in the row
* is null the resulting string will be null. If not null, null values in any column
* will be replaced by the specified string.
* @param columns array of columns containing strings, must be non-empty
* @param separateNulls if true, then the separator is included for null rows if
* `narep` is valid.
* @return A new java column vector containing the concatenated strings.
*/
public static ColumnVector stringConcatenate(Scalar separator, Scalar narep, ColumnView[] columns,
boolean separateNulls) {
assert columns != null : "input columns should not be null";
assert columns.length > 0 : "input columns should not be empty";
assert separator != null : "separator scalar provided may not be null";
assert separator.getType().equals(DType.STRING) : "separator scalar must be a string scalar";
assert narep != null : "narep scalar provided may not be null";
assert narep.getType().equals(DType.STRING) : "narep scalar must be a string scalar";

long[] column_views = new long[columns.length];
long[] columnViews = new long[columns.length];
for(int i = 0; i < columns.length; i++) {
assert columns[i] != null : "Column vectors passed may not be null";
columnViews[i] = columns[i].getNativeView();
}

return new ColumnVector(stringConcatenation(columnViews, separator.getScalarHandle(),
narep.getScalarHandle(), separateNulls));
}

/**
* Concatenate columns of strings together using a separator specified for each row
* and returns the result as a string column. If the row separator for a given row is null,
* output column for that row is null. Null column values for a given row are skipped.
* @param columns array of columns containing strings
* @param sepCol strings column that provides the separator for a given row
* @return A new java column vector containing the concatenated strings with separator between.
*/
public static ColumnVector stringConcatenate(ColumnView[] columns, ColumnView sepCol) {
try (Scalar nullString = Scalar.fromString(null);
Scalar emptyString = Scalar.fromString("")) {
return stringConcatenate(columns, sepCol, nullString, emptyString, false);
}
}

/**
* Concatenate columns of strings together using a separator specified for each row
* and returns the result as a string column. If the row separator for a given row is null,
* output column for that row is null unless separatorNarep is provided.
* The separator is applied between two output row values if separateNulls
* is true, or only between valid rows if separateNulls is false.
* @param columns array of columns containing strings
* @param sepCol strings column that provides the separator for a given row
* @param separatorNarep string scalar indicating null behavior when a separator is null.
* If set to null and the separator is null the resulting string will
* be null. If not null, this string will be used in place of a null
* separator.
* @param colNarep string that should be used in place of any null strings
* found in any column.
* @param separateNulls if true, then the separator is included for null rows if
* `colNarep` is valid.
* @return A new java column vector containing the concatenated strings with separator between.
*/
public static ColumnVector stringConcatenate(ColumnView[] columns,
ColumnView sepCol, Scalar separatorNarep, Scalar colNarep, boolean separateNulls) {
assert columns.length >= 1 : ".stringConcatenate() operation requires at least 1 column";
assert separatorNarep != null : "separator narep scalar provided may not be null";
assert colNarep != null : "column narep scalar provided may not be null";
assert separatorNarep.getType().equals(DType.STRING) : "separator narep scalar must be a string scalar";
assert colNarep.getType().equals(DType.STRING) : "column narep scalar must be a string scalar";

long[] columnViews = new long[columns.length];
for(int i = 0; i < columns.length; i++) {
assert columns[i] != null : "Column vectors passed may not be null";
column_views[i] = columns[i].getNativeView();
columnViews[i] = columns[i].getNativeView();
}

return new ColumnVector(stringConcatenation(column_views, separator.getScalarHandle(), narep.getScalarHandle()));
return new ColumnVector(stringConcatenationSepCol(columnViews, sepCol.getNativeView(),
separatorNarep.getScalarHandle(), colNarep.getScalarHandle(), separateNulls));
}

/**
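Note on the three-argument `stringConcatenate(separator, narep, columns)` overload above: its null handling is easier to follow with a small host-side model of one output row. This is a sketch of the behavior the Javadoc describes, with `separateNulls` fixed to true as that overload passes it. It is not the cudf implementation. The class `ScalarSeparatorModel` and the method `concatRow` are hypothetical names, and plain Java `String` values stand in for columns and scalars (a Java `null` models a null entry or a scalar whose underlying value is null).

```java
final class ScalarSeparatorModel {
    /**
     * Models one output row: row[i] is the value taken from column i.
     * Hypothetical helper for illustration only, not part of cudf.
     */
    static String concatRow(String separator, String narep, String[] row) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < row.length; i++) {
            String value = row[i];
            if (value == null) {
                if (narep == null) {
                    // Null narep: any null entry makes the whole output row null.
                    return null;
                }
                // Valid narep: the null entry is replaced by the narep string.
                value = narep;
            }
            if (i > 0) {
                // separateNulls == true: the separator is written even next to
                // an entry that was replaced by narep.
                out.append(separator);
            }
            out.append(value);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String[] row = {"a", null, "c"};
        System.out.println(concatRow("-", null, row)); // null
        System.out.println(concatRow("-", "NA", row)); // a-NA-c
    }
}
```

The model deliberately leaves out `separateNulls == false`, since the Javadoc only states that the separator is included for null rows when the flag is true.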
@@ -717,14 +787,39 @@ private static native long makeList(long[] handles, long typeHandle, int scale,
*
* @param columnViews array of longs holding the native handles of the column_views to combine.
* @param separator string scalar inserted between each string being merged, may not be null.
* @param narep string scalar indicating null behavior. If set to null and any string in the row is null
* the resulting string will be null. If not null, null values in any column will be
* replaced by the specified string. The underlying value in the string scalar may be null,
* but the object passed in may not.
* @param narep string scalar indicating null behavior. If set to null and any string in
* the row is null the resulting string will be null. If not null, null
* values in any column will be replaced by the specified string. The
* underlying value in the string scalar may be null, but the object passed
* in may not.
* @param separate_nulls boolean if true, then the separator is included for null rows if
* `narep` is valid.
* @return native handle of the resulting cudf column, used to construct the Java column
* by the stringConcatenate method.
*/
private static native long stringConcatenation(long[] columnViews, long separator, long narep);
private static native long stringConcatenation(long[] columnViews, long separator, long narep,
boolean separate_nulls);

/**
* Native method to concatenate columns of strings together using a separator specified for each row
* and returns the result as a string column.
* @param columnViews array of longs holding the native handles of the column_views to combine.
* @param sep_column long holding the native handle of the strings_column_view used as separators.
* @param separator_narep string scalar indicating null behavior when a separator is null.
* If set to null and the separator is null the resulting string will
* be null. If not null, this string will be used in place of a null
* separator.
* @param col_narep string scalar that should be used in place of any null strings
* found in any column.
* @param separate_nulls boolean if true, then the separator is included for null rows if
* `col_narep` is valid.
* @return native handle of the resulting cudf column, used to construct the Java column.
*/
private static native long stringConcatenationSepCol(long[] columnViews,
long sep_column,
long separator_narep,
long col_narep,
boolean separate_nulls);

/**
* Native method to hash each row of the given table. Hashing function dispatched on the
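Note on the per-row separator variant: the two-argument `stringConcatenate(columns, sepCol)` overload passes a null separator narep, an empty-string column narep, and `separateNulls` set to false. Per its Javadoc, a null separator yields a null output row and null column values are skipped. The following host-side model sketches that documented behavior only. `RowSeparatorModel` and `concatWs` are hypothetical names, arrays of `String` stand in for columns, and the result for a row whose column values are all null (an empty string here) is an assumption, not something the Javadoc states.

```java
import java.util.Arrays;

final class RowSeparatorModel {
    /**
     * columns[c][r] is the string in column c at row r; seps[r] is the
     * separator for row r. Hypothetical helper, not part of cudf.
     */
    static String[] concatWs(String[][] columns, String[] seps) {
        String[] out = new String[seps.length];
        for (int r = 0; r < seps.length; r++) {
            if (seps[r] == null) {
                // Null row separator and no separator narep: output row is null.
                out[r] = null;
                continue;
            }
            StringBuilder sb = new StringBuilder();
            boolean wroteValue = false;
            for (String[] column : columns) {
                String value = column[r];
                if (value == null) {
                    // Null column values are skipped, along with their separator.
                    continue;
                }
                if (wroteValue) {
                    sb.append(seps[r]);
                }
                sb.append(value);
                wroteValue = true;
            }
            // Assumption: a row with no valid values produces an empty string.
            out[r] = sb.toString();
        }
        return out;
    }

    public static void main(String[] args) {
        String[][] columns = {
            {"a", null, "e"},
            {"b", "d", null},
        };
        String[] seps = {"-", "+", null};
        System.out.println(Arrays.toString(concatWs(columns, seps))); // [a-b, d, null]
    }
}
```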
132 changes: 132 additions & 0 deletions java/src/main/java/ai/rapids/cudf/ColumnView.java
@@ -2116,6 +2116,84 @@ public final ColumnVector substring(ColumnView start, ColumnView end) {
return new ColumnVector(substringColumn(getNativeView(), start.getNativeView(), end.getNativeView()));
}

/**
* Given a lists column of strings (each row is a list of strings), concatenates the strings
* within each row and returns a single strings column result. Each new string is created by
* concatenating the strings from the same row (same list element) delimited by the separator
* provided. This version of the function replaces nulls with an empty string and returns null
* for an empty list.
* @param sepCol strings column that provides separators for concatenation.
* @return A new java column vector containing the concatenated strings with separator between.
*/
public final ColumnVector stringConcatenateListElements(ColumnView sepCol) {
try (Scalar nullString = Scalar.fromString(null);
Scalar emptyString = Scalar.fromString("")) {
return stringConcatenateListElements(sepCol, nullString, emptyString,
false, false);
}
}

/**
* Given a lists column of strings (each row is a list of strings), concatenates the strings
* within each row and returns a single strings column result.
* Each new string is created by concatenating the strings from the same row (same list element)
* delimited by the row separator provided in the sepCol strings column.
* @param sepCol strings column that provides separators for concatenation.
* @param separatorNarep string scalar indicating null behavior when a separator is null.
* If set to null and the separator is null the resulting string will
* be null. If not null, this string will be used in place of a null
* separator.
* @param stringNarep string that should be used to replace null strings in any non-null list
* row. If set to null and the string is null the resulting string will
* be null. If not null, this string will be used in place of a null value.
* @param separateNulls if true, then the separator is included for null rows if
* `stringNarep` is valid.
* @param emptyStringOutputIfEmptyList if set to true, any input row that is an empty list
* will result in an empty string. Otherwise, it will result in a null.
* @return A new java column vector containing the concatenated strings with separator between.
*/
public final ColumnVector stringConcatenateListElements(ColumnView sepCol,
Scalar separatorNarep, Scalar stringNarep, boolean separateNulls,
boolean emptyStringOutputIfEmptyList) {
assert type.equals(DType.LIST) : "column type must be a list";
assert separatorNarep != null : "separator narep scalar provided may not be null";
assert stringNarep != null : "string narep scalar provided may not be null";
assert separatorNarep.getType().equals(DType.STRING) : "separator narep scalar must be a string scalar";
assert stringNarep.getType().equals(DType.STRING) : "string narep scalar must be a string scalar";

return new ColumnVector(stringConcatenationListElementsSepCol(getNativeView(),
sepCol.getNativeView(), separatorNarep.getScalarHandle(), stringNarep.getScalarHandle(),
separateNulls, emptyStringOutputIfEmptyList));
}

/**
* Given a lists column of strings (each row is a list of strings), concatenates the strings
* within each row and returns a single strings column result. Each new string is created by
* concatenating the strings from the same row (same list element) delimited by the
* separator provided.
* @param separator string scalar inserted between each string being merged.
* @param narep string scalar indicating null behavior. If set to null and any string in the row
* is null the resulting string will be null. If not null, null values in any
* column will be replaced by the specified string. The underlying value in the
* string scalar may be null, but the object passed in may not.
* @param separateNulls if true, then the separator is included for null rows if
* `narep` is valid.
* @param emptyStringOutputIfEmptyList if set to true, any input row that is an empty list
* will result in an empty string. Otherwise, it will result in a null.
* @return A new java column vector containing the concatenated strings with separator between.
*/
public final ColumnVector stringConcatenateListElements(Scalar separator,
Scalar narep, boolean separateNulls, boolean emptyStringOutputIfEmptyList) {
assert type.equals(DType.LIST) : "column type must be a list";
assert separator != null : "separator scalar provided may not be null";
assert narep != null : "column narep scalar provided may not be null";
assert narep.getType().equals(DType.STRING) : "narep scalar must be a string scalar";

return new ColumnVector(stringConcatenationListElements(getNativeView(),
separator.getScalarHandle(), narep.getScalarHandle(), separateNulls,
emptyStringOutputIfEmptyList));
}

/**
* Apply a JSONPath string to all rows in an input strings column.
*
@@ -2712,6 +2790,60 @@ static DeviceMemoryBufferView getOffsetsBuffer(long viewHandle) {
*/
private static native long stringTimestampToTimestamp(long viewHandle, int unit, String format);

/**
* Native method that takes a lists column of strings (each row is a list of strings),
* concatenates the strings within each row and returns a single strings column result.
* Each new string is created by concatenating the strings from the same row (same list element)
* delimited by the row separator provided in the `sepColumn` strings column.
* @param listColumnHandle long holding the native handle of the column containing lists of strings
* to concatenate.
* @param sepColumn long holding the native handle of the strings column that provides separators
* for concatenation.
* @param separatorNarep string scalar indicating null behavior when a separator is null.
* If set to null and the separator is null the resulting string will
* be null. If not null, this string will be used in place of a null
* separator.
* @param colNarep string scalar that should be used in place of any null strings
* found in any column.
* @param separateNulls boolean if true, then the separator is included for null rows if
* `colNarep` is valid.
* @param emptyStringOutputIfEmptyList boolean if true, any input row that is an empty list
* will result in an empty string. Otherwise, it will
* result in a null.
* @return native handle of the resulting cudf column, used to construct the Java column.
*/
private static native long stringConcatenationListElementsSepCol(long listColumnHandle,
long sepColumn,
long separatorNarep,
long colNarep,
boolean separateNulls,
boolean emptyStringOutputIfEmptyList);

/**
* Native method that takes a lists column of strings (each row is a list of strings),
* concatenates the strings within each row and returns a single strings column result.
* Each new string is created by concatenating the strings from the same row (same list element)
* delimited by the separator provided.
* @param listColumnHandle long holding the native handle of the column containing lists of strings
* to concatenate.
* @param separator string scalar inserted between each string being merged, may not be null.
* @param narep string scalar indicating null behavior. If set to null and any string in the row
* is null the resulting string will be null. If not null, null values in any
* column will be replaced by the specified string. The underlying value in the
* string scalar may be null, but the object passed in may not.
* @param separateNulls boolean if true, then the separator is included for null rows if
* `narep` is valid.
* @param emptyStringOutputIfEmptyList boolean if true, any input row that is an empty list
* will result in an empty string. Otherwise, it will
* result in a null.
* @return native handle of the resulting cudf column, used to construct the Java column.
*/
private static native long stringConcatenationListElements(long listColumnHandle,
long separator,
long narep,
boolean separateNulls,
boolean emptyStringOutputIfEmptyList);

private static native long getJSONObject(long viewHandle, long scalarHandle) throws CudfException;

/**
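Note on `stringConcatenateListElements(separator, narep, separateNulls, emptyStringOutputIfEmptyList)`: the `emptyStringOutputIfEmptyList` flag and the `narep` rule interact, and a host-side model of one list row may help. This is a sketch of the documented behavior with `separateNulls` fixed to true, not the cudf implementation. `ListJoinModel` and `joinRow` are hypothetical names, a `List<String>` stands in for one row of the lists column, and returning null for a null list row is an assumption that the Javadoc in this diff does not spell out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class ListJoinModel {
    /** Models one row of a lists column of strings. Hypothetical helper, not part of cudf. */
    static String joinRow(List<String> row, String separator, String narep,
                          boolean emptyStringOutputIfEmptyList) {
        if (row == null) {
            // Assumption: a null list row produces a null output row.
            return null;
        }
        if (row.isEmpty()) {
            // Documented: an empty list gives "" when the flag is set, else null.
            return emptyStringOutputIfEmptyList ? "" : null;
        }
        List<String> parts = new ArrayList<>();
        for (String s : row) {
            if (s == null) {
                if (narep == null) {
                    // Null narep: any null string makes the output row null.
                    return null;
                }
                parts.add(narep); // valid narep replaces the null string
            } else {
                parts.add(s);
            }
        }
        // separateNulls == true: separators are kept around replaced nulls.
        return String.join(separator, parts);
    }

    public static void main(String[] args) {
        System.out.println(joinRow(Arrays.asList("a", "b"), ",", null, false));      // a,b
        System.out.println(joinRow(Arrays.asList("a", null), ",", "_", false));      // a,_
        System.out.println(joinRow(Collections.<String>emptyList(), ",", null, true).isEmpty()); // true
        System.out.println(joinRow(Collections.<String>emptyList(), ",", null, false));          // null
    }
}
```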