[BUG] Concatenate with separator, ignore separator for null replacement Inconsistent #4728
so I think this is actually covered in #3726, as the description above is described in the new API for:
It looks like that might be inconsistent with the previous API, where the separator is not a column. The thing that is different between cudf and Spark concat_ws is the null handling when all values are null, and also the null handling in #7727 for arrays. For #3726, Spark expects that concatenating all-null values with a separator yields an empty string (basically it just skips null values, including the insertion of the separator). With `SEP = "-"`, the CUDF result is `(null)`. The other case, #7727 dealing with arrays: Spark expects the null to be skipped (meaning the separator is left off as well), and if all elements are null it returns an empty string, so `concat_ws(SEP, Array[null, null, null])` and `concat_ws(SEP, Array[])` both return `""`.
Also note, we can't use narep with arrays here to say nulls are mapped to the empty string "" because cudf still puts in the separator, but Spark leaves off both the value and the separator.

Example with arrays:

Similarly, for the concat with separator without arrays, it actually does the right thing and skips most nulls and separators, except in the case where all of them are null. If we were to use col_narep there and map null to the empty string, we would get separators as well.
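The narep mismatch can be sketched in plain Python (a hypothetical model of the semantics, not the real cudf or Spark APIs; `cudf_like_concat` and `spark_like_concat_ws` are made-up helper names):

```python
def cudf_like_concat(sep, vals, narep):
    # cudf-style: nulls are replaced by narep, but the separator stays.
    return sep.join(narep if v is None else v for v in vals)

def spark_like_concat_ws(sep, vals):
    # Spark-style concat_ws: a null value is dropped along with its separator;
    # an all-null input therefore collapses to the empty string.
    return sep.join(v for v in vals if v is not None)

vals = ["a", None, "b"]
print(cudf_like_concat("-", vals, ""))   # "a--b" (separator still inserted)
print(spark_like_concat_ws("-", vals))   # "a-b"  (value and separator both skipped)
```

This is why mapping null to "" via narep cannot reproduce the Spark result: the separator insertion happens regardless of the replacement value.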
Would it make sense to have some kind of filter on a list to remove the null values? This would allow us to do other operations in Spark generically too, like
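As a sketch of the filter idea in plain Python (the `drop_nulls` helper is hypothetical, standing in for the proposed list-filter operation):

```python
def drop_nulls(list_row):
    # Hypothetical list filter: remove null (None) elements from one list row.
    return [v for v in list_row if v is not None]

# After filtering, a plain join gives the Spark concat_ws semantics:
print("-".join(drop_nulls(["x", None, "y"])))   # "x-y"
print("-".join(drop_nulls([None, None])))       # ""  (all-null row -> empty string)
```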
so a filter on a list to remove null values I think would work, except that cudf doesn't seem to handle an empty array in concatenate_list_elements (I'll do some more testing on this). The other issue, in my opinion, is that the array API (concatenate_list_elements) seems inconsistent with the concatenate-with-separator API, which does skip null values. One could argue that is a bit different, since concatenate handles separate columns and here we have just one column. Either way, I think we need the all-null case to be handled, because it returns null instead of an empty string. The only way I think we could handle the all-null case with the concatenate API would be to see if there are any nulls in the result: if the separator was null that's OK, but if the separator wasn't null then we have to replace those nulls with an empty string.
I was testing the empty-array case more with concatenate_list_elements: it doesn't return null, and I get an error when we try to access whatever it returns, so I think it's not handled properly.
I'm changing this to a bug based on the APIs being inconsistent; if others disagree we can change it back. I'm going to update the description with details.
@jrhemstad @davidwendt it would be great to get your thoughts on making these consistent. concatenate_list_elements is a new API we just added, so hopefully there are no issues with changing that, but the concatenate APIs are older.
Not sure if I'm following this. Just for clarification: is there an inconsistency between the behavior of the scalar version and the column version of these APIs? If so, is the inconsistency in how nulls within a list element are handled? I tried a quick test in Python cudf and it segfaulted, so there is definitely something wrong.
Hi David. There is a bug that causes that segfault; I have fixed it locally but plan to combine it with more code for a final PR. What we are concerned with here is how to handle the case of all-null string elements in a list for row-wise concatenation. In particular, if we have

For fixing such "inconsistency", I propose to add an option to all the existing
so when we specify |
Did you mean |
Sorry, I meant |
so the inconsistency I'm referring to is in handling nulls across multiple of these APIs. Really the one that is different from the others is:
The behavior of this one is to skip nulls when they are values, except if they are all null, in which case it puts in a null. See https://github.com/rapidsai/cudf/blob/branch-0.20/cpp/include/cudf/strings/combine.hpp#L75. All the other APIs for concatenate and concatenate_list_elements put a null for the value whenever they find any null value. For Spark concat_ws we want the behavior of skipping nulls, including when they are all null: it should return the empty string. Ideally I would like to see something like a parameter to all the concatenate and concatenate_list_elements APIs that defines the null-handling behavior: one mode skips the nulls, and in the other, if any null value is found the result is null.
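The proposed parameter might look like this in a Python sketch (the `null_policy` name and its values are made up for illustration; this is not the real cudf API):

```python
def concatenate(vals, sep, null_policy="null_on_any"):
    # "null_on_any": any null value makes the whole result null (current
    # behavior of most of the APIs discussed above).
    # "skip": nulls are skipped along with their separators, and an all-null
    # input yields "" (the behavior Spark concat_ws needs).
    if null_policy == "null_on_any":
        if any(v is None for v in vals):
            return None
        return sep.join(vals)
    return sep.join(v for v in vals if v is not None)

print(concatenate(["a", None, "c"], "+"))                      # None
print(concatenate(["a", None, "c"], "+", null_policy="skip"))  # "a+c"
print(concatenate([None, None], "+", null_policy="skip"))      # ""
```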
so I discovered another corner case here: when Spark concatenates an empty array with another value, it leaves off the array and any separators. If you just have an empty array, it returns an empty string. Since we are doing two parts here, where if we see an array column we call into concatenate_list_elements, with the proposed changes an empty array would return an empty string. If we then pass that into the cudf concatenate API, it would put in separators, because that is what an empty string outside an array does. We don't want it to put in separators, so I think we would have to replace that with null and have it skip the separator. The problem is then telling the difference between an empty array and an array with an empty string. So I think we need concatenate_list_elements of an empty array to return null. The flow:
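A Python sketch of that proposed two-step flow (hypothetical helper names, not the real APIs):

```python
def join_list_elements(row, sep):
    # Step 1 (array columns): an empty array maps to None (null) so the later
    # row-wise concatenate can skip it and its separator entirely; nulls
    # inside a non-empty array are skipped.
    if len(row) == 0:
        return None
    return sep.join(v for v in row if v is not None)

def concat_ws_rows(sep, *row_vals):
    # Step 2 (row-wise): skip null values along with their separators.
    return sep.join(v for v in row_vals if v is not None)

print(concat_ws_rows("-", join_list_elements([], "-"), "x"))  # "x": no stray separator
print(concat_ws_rows("-", join_list_elements([], "-")))       # "": lone empty array
```

Mapping the empty array to null (rather than "") is what keeps the separator out of the final row, while still letting a lone empty array end up as an empty string.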
Then: Result: |
It seems that we may want different results when calling
Maybe adding an option in cudf is better; it is very easy and would be just 1-2 LOC. I'll do it on top of David's PR #8282.
Closes #4728

This PR adds a new parameter to the `cudf::strings::concatenate` APIs to specify whether separators should be added between null entries when the null-replacement (narep) parameter is valid. If the narep scalar is invalid (i.e. null itself), then the entire output row becomes null. If not, separators are added between each element. Examples:

```
s1 = ['a', 'b', null, 'dd', null]
s2 = ['A', null, 'CC', 'D', null]
concatenate( {s1, s2}, sep='+', narep=invalid ) -> ['a+A', null, null, 'dd+D', null]
concatenate( {s1, s2}, sep='+', narep='@' )     -> ['a+A', 'b+@', '@+CC', 'dd+D', '@+@']
concatenate( {s1, s2}, sep='+', narep='' )      -> ['a+A', 'b+', '+CC', 'dd+D', '+']
```

The new parameter is an enum `separator_on_nulls` with `YES` or `NO` settings. The default value is `YES`, to keep the current behavior as expected by Python cudf and for consistency with Pandas behavior. Specifying `NO` suppresses the separator for null elements (when narep is valid).

```
concatenate( {s1, s2}, sep='+', narep='', NO ) -> ['a+A', 'b', 'CC', 'dd+D', '']
```

This PR also renames the `cudf::strings::concatenate_list_elements` API to `cudf::strings::join_list_elements`. Its pattern and behavior mimic `cudf::strings::join_strings` more than the concatenate functions. Also, these are called by the Python `join` functions, so the rename makes it more consistent with cudf. This is a breaking change in order to make these APIs more consistent.

Previously, the separators-column version returned null only for an all-null row. This has been changed to honor the `separator_on_nulls` parameter instead. Currently no Python cudf API calls this version; only the rename required minor changes to the Cython layer. The gtests were updated to reflect the new behavior. None of the pytests required any changes, since the default parameter value matches the original behavior for those APIs that cudf actually calls.
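The null handling described above can be emulated row-wise in plain Python (a sketch only, not the real C++ API; the `NO` branch here models just the `narep=''` example listed in the PR):

```python
def concatenate(cols, sep, narep=None, separator_on_nulls=True):
    # Row-wise emulation of the cudf::strings::concatenate null handling
    # described in the PR text (Python sketch, not the real API).
    out = []
    for row in zip(*cols):
        if narep is None and any(v is None for v in row):
            out.append(None)  # invalid narep: any null makes the row null
        elif separator_on_nulls:
            out.append(sep.join(narep if v is None else v for v in row))
        else:
            out.append(sep.join(v for v in row if v is not None))
    return out

s1 = ['a', 'b', None, 'dd', None]
s2 = ['A', None, 'CC', 'D', None]
print(concatenate([s1, s2], '+'))             # ['a+A', None, None, 'dd+D', None]
print(concatenate([s1, s2], '+', narep='@'))  # ['a+A', 'b+@', '@+CC', 'dd+D', '@+@']
print(concatenate([s1, s2], '+', narep='', separator_on_nulls=False))
# ['a+A', 'b', 'CC', 'dd+D', '']
```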
Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Robert Maynard (https://github.com/robertmaynard)
- Nghia Truong (https://github.com/ttnghia)
- Keith Kraus (https://github.com/kkraus14)
- Thomas Graves (https://github.com/tgravescs)
- Christopher Harris (https://github.com/cwharris)

URL: #8282
CUDF has two concatenate-with-separator APIs: one that takes a string scalar as the separator and one that takes a strings_column_view of separators. They are inconsistent in their null handling, and neither matches exactly how Spark acts for concat_ws, which is itself different from the Spark concat behavior.
CUDF concat with string scalar separator: any null entry causes the corresponding output row to be null unless a narep string is specified to be used in its place.
CUDF concat with string column view separator: a null value in a row is skipped unless there is a valid @p col_narep.
The main point is that the concat with string column view separator skips null values rather than having them cause the entire output row to be null. That version is consistent with the Spark behavior except when all values are null: CUDF returns null, while Spark returns an empty string.
Now if we look at the CUDF concatenate API added for arrays (concatenate_list_elements), we also added two APIs: one for a scalar string separator and one for a strings_column_view separator. Both of those APIs handle nulls like the concatenate with the string scalar separator:
A null list element causes the output row to be null unless a string_narep scalar is provided to be used in its place.

Note there is also a bug in concatenate_list_elements not handling an empty array. On the Spark side, an empty array would return an empty string.
So overall I think the APIs should be made consistent, and ideally we would be able to choose how nulls are handled. Perhaps we could add a parameter to all of these to choose the null handling?
For Spark concat_ws(), the null handling should be such that nulls are skipped (including their separators), all the way to the case where all the columns are null. If all the columns, or all the entries in an array, are null, Spark expects an empty string to be returned.
The Spark concat() function returns null if any of the values are null, which extends to the all-null case: if all the values are null, you also get null.
example Spark concat_ws behavior:
Spark concat behavior:
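The two Spark behaviors can be sketched side by side in Python (hypothetical helper names, not the real Spark API):

```python
def spark_concat_ws(sep, *vals):
    # concat_ws: nulls skipped along with their separators; all-null -> "".
    return sep.join(v for v in vals if v is not None)

def spark_concat(*vals):
    # concat: any null value makes the whole result null.
    if any(v is None for v in vals):
        return None
    return "".join(vals)

print(spark_concat_ws("-", "a", None, "c"))  # "a-c"
print(spark_concat_ws("-", None, None))      # ""
print(spark_concat("a", None, "c"))          # None
print(spark_concat("a", "b"))                # "ab"
```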