[BUG] Concatenate with separator, ignore separator for null replacement Inconsistent #4728
so I think this is actually covered in #3726, as the description above is described in the new API for:
It looks like that might be inconsistent with the previous API, where the separator is not a column. The thing that is different between cudf and Spark concat_ws is the null handling when all values are null, and also the null handling in #7727 for arrays. For #3726, Spark expects that concatenating all-null values with a separator yields an empty string (basically it just skips null values, including the insertion of the separator). With `SEP = "-"`, the CUDF result is `(null)`. The other case, #7727 dealing with arrays: Spark expects the null to be skipped (meaning the separator is left off as well), and if all elements are null it returns an empty string, so `concat_ws(SEP, Array[null, null, null])` and `concat_ws(SEP, Array[])` both return `""`.
Also note, we can't use narep with arrays here to say nulls are mapped to the empty string "" because cudf still puts in the separator, but Spark leaves off both the value and the separator.

Example with arrays:

Similarly, for the concat with separator without arrays, it actually does the right thing and skips most nulls and separators, except in the case where all of them are null. If we were to use col_narep there and map null to the empty string, we would get separators as well.
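The narep mismatch can be sketched in plain Python (a hypothetical model of the semantics, not the real cudf or Spark APIs; `cudf_like_concat` and `spark_like_concat_ws` are made-up helper names):

```python
def cudf_like_concat(sep, vals, narep):
    # cudf-style: nulls are replaced by narep, but the separator stays.
    return sep.join(narep if v is None else v for v in vals)

def spark_like_concat_ws(sep, vals):
    # Spark-style concat_ws: a null value is dropped along with its separator;
    # an all-null input therefore collapses to the empty string.
    return sep.join(v for v in vals if v is not None)

vals = ["a", None, "b"]
print(cudf_like_concat("-", vals, ""))   # "a--b" (separator still inserted)
print(spark_like_concat_ws("-", vals))   # "a-b"  (value and separator both skipped)
```

This is why mapping null to "" via narep cannot reproduce the Spark result: the separator insertion happens regardless of the replacement value.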
Would it make sense to have some kind of filter on a list to remove the null values? This would allow us to do other operations in Spark generically too, like
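As a sketch of the filter idea in plain Python (the `drop_nulls` helper is hypothetical, standing in for the proposed list-filter operation):

```python
def drop_nulls(list_row):
    # Hypothetical list filter: remove null (None) elements from one list row.
    return [v for v in list_row if v is not None]

# After filtering, a plain join gives the Spark concat_ws semantics:
print("-".join(drop_nulls(["x", None, "y"])))   # "x-y"
print("-".join(drop_nulls([None, None])))       # ""  (all-null row -> empty string)
```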
so a filter on a list to remove null values I think would work, except that cudf doesn't seem to handle an empty array in concatenate_list_elements (I'll do some more testing on this). The other issue, in my opinion, is that the array API (concatenate_list_elements) seems inconsistent with the concatenate-with-separator API, which does skip null values. One could argue that is a bit different, since concatenate handles separate columns and here we have just one column. Either way, I think we need the all-null case to be handled, because it returns null instead of an empty string. The only way I think we could handle the all-null case with the concatenate API would be to see if there are any nulls in the result: if the separator was null that's OK, but if the separator wasn't null then we have to replace those nulls with an empty string.
I was testing the empty-array case more with concatenate_list_elements: it doesn't return null, and I get an error when we try to access whatever it returns, so I think it's not handled properly.
I'm changing this to a bug based on the APIs being inconsistent; if others disagree we can change it back. I'm going to update the description with details.
@jrhemstad @davidwendt it would be great to get your thoughts on making these consistent. concatenate_list_elements is a new API we just added, so hopefully there are no issues with changing that, but the concatenate APIs are older.
Not sure if I'm following this. Just for clarification: is there an inconsistency between the behavior of the scalar version and the column version of these APIs? If so, is the inconsistency in how nulls within a list element are handled? I tried a quick test in Python cudf and it segfaulted, so there is definitely something wrong.
Hi David. There is a bug that causes that segfault; I have fixed it locally but plan to combine it with more code for a final PR. What we are concerned with here is how to handle the case of all-null string elements in a list for row-wise concatenation. In particular, if we have

For fixing such "inconsistency", I propose to add an option to all the existing
so when we specify |
Did you mean |
Sorry, I meant |
so the inconsistency I'm referring to is in handling nulls across multiple of these APIs. Really the one that is different from the others is:
The behavior of this one is to skip nulls when they are values, except if they are all null, in which case it puts in a null. See https://github.com/rapidsai/cudf/blob/branch-0.20/cpp/include/cudf/strings/combine.hpp#L75. All the other APIs for concatenate and concatenate_list_elements put a null for the value whenever they find any null value. For Spark concat_ws we want the behavior of skipping nulls, including when they are all null: it should return the empty string. Ideally I would like to see something like a parameter to all the concatenate and concatenate_list_elements APIs that defines the null-handling behavior: one mode skips the nulls, and in the other, if any null value is found the result is null.
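The proposed parameter might look like this in a Python sketch (the `null_policy` name and its values are made up for illustration; this is not the real cudf API):

```python
def concatenate(vals, sep, null_policy="null_on_any"):
    # "null_on_any": any null value makes the whole result null (current
    # behavior of most of the APIs discussed above).
    # "skip": nulls are skipped along with their separators, and an all-null
    # input yields "" (the behavior Spark concat_ws needs).
    if null_policy == "null_on_any":
        if any(v is None for v in vals):
            return None
        return sep.join(vals)
    return sep.join(v for v in vals if v is not None)

print(concatenate(["a", None, "c"], "+"))                      # None
print(concatenate(["a", None, "c"], "+", null_policy="skip"))  # "a+c"
print(concatenate([None, None], "+", null_policy="skip"))      # ""
```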
so I discovered another corner case here: when Spark concatenates an empty array with another value, it leaves off the array and any separators. If you just have an empty array, it returns an empty string. Since we are doing two parts here, where if we see an array column we call into concatenate_list_elements, with the proposed changes an empty array would return an empty string. If we then pass that into the cudf concatenate API, it would put in separators, because that is what an empty string outside an array does. We don't want it to put in separators, so I think we would have to replace that with null and have it skip the separator. The problem is then telling the difference between an empty array and an array with an empty string. So I think we need concatenate_list_elements of an empty array to return null. The flow:
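A Python sketch of that proposed two-step flow (hypothetical helper names, not the real APIs):

```python
def join_list_elements(row, sep):
    # Step 1 (array columns): an empty array maps to None (null) so the later
    # row-wise concatenate can skip it and its separator entirely; nulls
    # inside a non-empty array are skipped.
    if len(row) == 0:
        return None
    return sep.join(v for v in row if v is not None)

def concat_ws_rows(sep, *row_vals):
    # Step 2 (row-wise): skip null values along with their separators.
    return sep.join(v for v in row_vals if v is not None)

print(concat_ws_rows("-", join_list_elements([], "-"), "x"))  # "x": no stray separator
print(concat_ws_rows("-", join_list_elements([], "-")))       # "": lone empty array
```

Mapping the empty array to null (rather than "") is what keeps the separator out of the final row, while still letting a lone empty array end up as an empty string.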
Then: Result: |
It seems that we may want different results when calling
Maybe adding an option in cudf is better; it is very easy and would be just 1-2 LOC. I'll do it on top of David's PR #8282.
Closes #4728

This PR adds a new parameter to the `cudf::strings::concatenate` APIs to specify whether separators should be added between null entries when the null-replacement (narep) parameter is valid. If the narep scalar is invalid (i.e. null itself), then the entire output row becomes null. If not, separators are added between each element. Examples:

```
s1 = ['a', 'b', null, 'dd', null]
s2 = ['A', null, 'CC', 'D', null]
concatenate( {s1, s2}, sep='+', narep=invalid ) -> ['a+A', null, null, 'dd+D', null]
concatenate( {s1, s2}, sep='+', narep='@' )     -> ['a+A', 'b+@', '@+CC', 'dd+D', '@+@']
concatenate( {s1, s2}, sep='+', narep='' )      -> ['a+A', 'b+', '+CC', 'dd+D', '+']
```

The new parameter is an enum `separator_on_nulls` with `YES` or `NO` settings. The default value is `YES`, to keep the current behavior as expected by Python cudf and for consistency with Pandas behavior. Specifying `NO` suppresses the separator for null elements (when narep is valid).

```
concatenate( {s1, s2}, sep='+', narep='', NO ) -> ['a+A', 'b', 'CC', 'dd+D', '']
```

This PR also renames the `cudf::strings::concatenate_list_elements` API to `cudf::strings::join_list_elements`. Its pattern and behavior mimic `cudf::strings::join_strings` more than the concatenate functions. Also, these are called by the Python `join` functions, so the rename makes it more consistent with cudf. This is a breaking change in order to make these APIs more consistent.

Previously, the separators-column version returned null only for an all-null row. This has been changed to honor the `separator_on_nulls` parameter instead. Currently no Python cudf API calls this version; only the rename required minor changes to the Cython layer. The gtests were updated to reflect the new behavior. None of the pytests required any changes, since the default parameter value matches the original behavior for those APIs that cudf actually calls.
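The null handling described above can be emulated row-wise in plain Python (a sketch only, not the real C++ API; the `NO` branch here models just the `narep=''` example listed in the PR):

```python
def concatenate(cols, sep, narep=None, separator_on_nulls=True):
    # Row-wise emulation of the cudf::strings::concatenate null handling
    # described in the PR text (Python sketch, not the real API).
    out = []
    for row in zip(*cols):
        if narep is None and any(v is None for v in row):
            out.append(None)  # invalid narep: any null makes the row null
        elif separator_on_nulls:
            out.append(sep.join(narep if v is None else v for v in row))
        else:
            out.append(sep.join(v for v in row if v is not None))
    return out

s1 = ['a', 'b', None, 'dd', None]
s2 = ['A', None, 'CC', 'D', None]
print(concatenate([s1, s2], '+'))             # ['a+A', None, None, 'dd+D', None]
print(concatenate([s1, s2], '+', narep='@'))  # ['a+A', 'b+@', '@+CC', 'dd+D', '@+@']
print(concatenate([s1, s2], '+', narep='', separator_on_nulls=False))
# ['a+A', 'b', 'CC', 'dd+D', '']
```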
Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Robert Maynard (https://github.com/robertmaynard)
- Nghia Truong (https://github.com/ttnghia)
- Keith Kraus (https://github.com/kkraus14)
- Thomas Graves (https://github.com/tgravescs)
- Christopher Harris (https://github.com/cwharris)

URL: #8282
CUDF has two concatenate-with-separator APIs: one that takes a string scalar as the separator and one that takes a strings_column_view of separators. They are inconsistent in their null handling, and neither matches exactly how Spark acts for concat_ws, which is itself different from the Spark concat behavior.
CUDF concat with string scalar separator: any null entry causes the corresponding output row to be null unless a narep string is specified to be used in its place.
CUDF concat with string column view separator: a null value in a row is skipped unless there is a valid @p col_narep.
The main point is that the concat with string column view separator skips null values rather than having them cause the entire output row to be null. That version is consistent with the Spark behavior except when all values are null: CUDF returns null, while Spark returns an empty string.
Now if we look at the CUDF concatenate API added for arrays (concatenate_list_elements), we also added two APIs: one for a scalar string separator and one for a strings_column_view separator. Both of those APIs handle nulls like the concatenate with the string scalar separator:
A null list element causes the output row to be null unless a string_narep scalar is provided to be used in its place.

Note there is also a bug in concatenate_list_elements not handling an empty array. On the Spark side, an empty array would return an empty string.
So overall I think the APIs should be made consistent, and ideally we would be able to choose how nulls are handled. Perhaps we could add a parameter to all of these to choose the null handling?
For Spark concat_ws(), the null handling should be such that nulls are skipped (including their separators), all the way to the case where all the columns are null. If all the columns, or all the entries in an array, are null, Spark expects an empty string to be returned.
The Spark concat() function returns null if any of the values are null, which extends to the all-null case: if all the values are null, you also get null.
example Spark concat_ws behavior:
Spark concat behavior:
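The two Spark behaviors can be sketched side by side in Python (hypothetical helper names, not the real Spark API):

```python
def spark_concat_ws(sep, *vals):
    # concat_ws: nulls skipped along with their separators; all-null -> "".
    return sep.join(v for v in vals if v is not None)

def spark_concat(*vals):
    # concat: any null value makes the whole result null.
    if any(v is None for v in vals):
        return None
    return "".join(vals)

print(spark_concat_ws("-", "a", None, "c"))  # "a-c"
print(spark_concat_ws("-", None, None))      # ""
print(spark_concat("a", None, "c"))          # None
print(spark_concat("a", "b"))                # "ab"
```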