what should stringConcatenateListElements return when list only contains nulls? #9745
Comments
Hi @ttnghia, would you take a look when you have time?
Yes, the result is correct. It's the desired behavior of Spark. When the input is all nulls, Spark will output null.
You will find that behavior almost everywhere: nulls in => null out 😃
FYI: here is the cudf doc for the case of all nulls in a list (cudf/cpp/include/cudf/strings/combine.hpp, line 222 in 31f92d7).
So the output for all-nulls input will depend on the parameter
That's interesting. @revans2 should know more about this; could you comment, please? If necessary, we can change that in the cudf implementation.
This is the result I get using Spark 3.2.0. When casting to string,
I just realized that the example above casts from struct to string rather than concatenating strings. Did you try concatenating strings?
The behavior for all-nulls input was requested to match Python's output. We can further modify the cudf API if we must support non-null, non-empty output for such cases.
I reviewed #4728 and found the difference between
@ttnghia before proceeding with anything along these lines, can someone please summarize what libcudf operation we are talking about and the missing behavior? I've seen both casting and concatenating discussed, and those seem like very different operations, so I'm not clear what behavior we are discussing.
The behavior we're talking about here is directly related to the API at cudf/cpp/include/cudf/strings/combine.hpp, line 258 in 31f92d7.
Previously, in order to match Spark and Python behavior, it was designed with a parameter that specifies the output for an all-nulls input list: either a null or an empty string (cudf/cpp/include/cudf/strings/combine.hpp, line 222 in 31f92d7).
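To make the policy described above concrete, here is a minimal plain-Java model of it. This is only a sketch for discussion, not the cudf implementation: the class name JoinListModel, the enum EmptyPolicy, and the method joinRow are hypothetical, the separateNulls flag is left out, and the rule that a null narep nulls out any row containing a null element is an assumption. It follows this thread's description that an all-nulls list produces either a null or an empty string, regardless of narep.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class JoinListModel {
    /** Hypothetical stand-in for the parameter that picks the all-nulls/empty-list output. */
    public enum EmptyPolicy { NULL_ELEMENT, EMPTY_STRING }

    /**
     * Joins one list row into a single string.
     * Assumption: an all-nulls (or empty) row ignores narep and follows the policy.
     * Assumption: with a null narep, any row containing a null element becomes null.
     */
    public static String joinRow(List<String> row, String separator,
                                 String narep, EmptyPolicy policy) {
        if (row == null) {
            return null; // a null list row stays null
        }
        boolean allNull = true; // also true for an empty list
        for (String s : row) {
            if (s != null) {
                allNull = false;
                break;
            }
        }
        if (allNull) {
            return policy == EmptyPolicy.EMPTY_STRING ? "" : null;
        }
        List<String> parts = new ArrayList<>();
        for (String s : row) {
            if (s == null) {
                if (narep == null) {
                    return null; // no replacement available for the null element
                }
                parts.add(narep);
            } else {
                parts.add(s);
            }
        }
        return String.join(separator, parts);
    }

    public static void main(String[] args) {
        List<String> allNulls = Arrays.asList((String) null, (String) null);
        // narep is "null", yet the all-nulls row still follows the policy.
        System.out.println(joinRow(allNulls, ", ", "null", EmptyPolicy.EMPTY_STRING).isEmpty());
        System.out.println(joinRow(allNulls, ", ", "null", EmptyPolicy.NULL_ELEMENT) == null);
        System.out.println(joinRow(Arrays.asList("a", null), ", ", "null", EmptyPolicy.NULL_ELEMENT));
    }
}
```

Under this model neither policy value can yield "null, null" for an all-nulls row, which is the gap the thread is discussing.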
Now it seems that we need to have a third option to allow using
We're talking about casting struct to string (in Spark). That can be done by casting each child into a string and then concatenating these string results. Yes, since those are different ops, we may get undesired behavior when using an API designed for one thing to do the other. Using I'm not very familiar with the implementation details of such a casting op. Maybe we need to modify
I strongly dislike that idea. This API is already a nightmare of option flags with a complex cross product of behavior depending on the values of the various parameters. This API is trying to do too many things. For example, why is the Likewise, The relationship between
Why? This is very non-intuitive. If I have
It is clear this API was overfit to corner cases of Spark and/or Python. We should take a step back and ask "What does it make sense for this function to do?" then work backwards from there and decide how corner cases can be satisfied by pre/post-processing as I described above.
I'm a bit confused why the casting op from struct to string can't produce the desired result (like
Currently, I am developing the function for casting an ArrayType column to a StringType column (NVIDIA/spark-rapids#4028). First, I cast all elements in the array to string type. Second, I use the function
To be consistent with Spark, this function should return "[null, null]" when the input array is Array(null, null). However, this is what I get when I run the code on my desktop:
When the input is an all-nulls array, the output is
This is the result when I run the same code on Spark. You can see the difference between Spark and Spark-rapids when casting
A null element will result in a null after casting. You can then use
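One way to read this suggestion, in line with the pre-processing approach raised earlier in the thread, is to replace null elements with the literal string "null" before joining, so that the join never sees an all-nulls list. The plain-Java sketch below only illustrates that idea and uses no cudf or spark-rapids calls. The class name NullReplaceThenJoin and the method castArrayToString are hypothetical, and the ", " separator and surrounding brackets are assumptions chosen to reproduce the "[null, null]" output mentioned above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class NullReplaceThenJoin {
    /**
     * Step 1: replace each null element with the literal string "null".
     * Step 2: join the elements with ", " and wrap the result in brackets.
     * A null list (as opposed to a list of nulls) stays null.
     */
    public static String castArrayToString(List<String> elements) {
        if (elements == null) {
            return null;
        }
        return elements.stream()
                .map(e -> e == null ? "null" : e)              // pre-processing step
                .collect(Collectors.joining(", ", "[", "]"));  // join plus brackets
    }

    public static void main(String[] args) {
        System.out.println(castArrayToString(Arrays.asList((String) null, (String) null)));
        System.out.println(castArrayToString(Arrays.asList("a", null, "b")));
    }
}
```

Because the replacement happens before the join, the join step needs no special option for all-nulls input.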
Thank you, I will try it.
I currently use GpuNvl to cast all
I am using the method
stringConcatenateListElements(Scalar separator, Scalar narep, boolean separateNulls, boolean emptyStringOutputIfEmptyList)
in Spark-Rapids to cast an Array-typed DataFrame column to string type. I find that when the input is Array(null, null, null), the output is always an empty string (I set narep="null" and separateNulls=true). However, the expected output is "null, null, null".
Describe the solution you'd like
I added a test in ColumnVectorTest.java:
This is what I expected.