[BUG] Null equality of structs doesn't match to spark #7934
This was discussed here #7226 (comment). libcudf will not support specifying a per-member null equality or null ordering. To do so in Spark, you can flatten to columns yourself and pass in the resulting table with whatever null equality/order you want.
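A minimal sketch of the "flatten it yourself" idea described above: represent a struct key as its validity flag plus its child values, so the caller controls how nested nulls participate in a join. The function name and the fixed field names `"a"`/`"b"` are invented for illustration; this is not a cudf API.

```python
# Hypothetical illustration (not a cudf API) of flattening a struct key
# into (is_valid, child values) so per-member null handling is possible.

def flatten_struct(value):
    """Flatten a struct (dict with fields "a", "b", or None) into a tuple."""
    if value is None:
        # A null parent contributes a False validity flag and null children.
        return (False, None, None)
    return (True, value.get("a"), value.get("b"))

# Two null structs flatten to identical tuples, so a join that treats
# nulls as equal on the flattened columns will match them...
assert flatten_struct(None) == flatten_struct(None)
# ...while the validity column still distinguishes a null parent from
# a non-null struct whose children happen to be null.
assert flatten_struct(None) != flatten_struct({"a": None, "b": None})
```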
@jrhemstad Is it possible for the flattening code to have Spark-friendly defaults, e.g. for null ordering?
Edit: Thinking about this more, I'm not sure it would be that easy, as we don't know the origin of a column when building the hash map.
As I understand it, the null equality/ordering specified for the top-level struct parent will be applied to all children. Based on #7226, my understanding was that this was the default behavior of Spark.
Spark joins skip null records during key matching, unless it is a FullOuterJoin, and Spark only checks the validity of the root keys. For instance, key …
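The rule described above can be sketched in plain Python. This is an illustration of the semantics as stated in this thread, not Spark or cudf code; all names are invented, and structs are modeled as dicts with `None` for null.

```python
# Hypothetical sketch of Spark's join-key matching as described above:
# rows with a null ROOT key are skipped (for non-full-outer joins), but
# nulls nested inside struct keys compare as equal.

def null_safe_equal(a, b):
    """Inside structs, null == null is True (null_equality = EQUAL)."""
    if isinstance(a, dict) and isinstance(b, dict):
        return a.keys() == b.keys() and all(
            null_safe_equal(a[k], b[k]) for k in a)
    return a == b  # None == None is True in Python, matching nested nulls

def spark_keys_match(left, right):
    """left/right are tuples of root keys; struct keys are dicts."""
    # Only root keys are checked for validity; a null root key never matches.
    if any(k is None for k in left) or any(k is None for k in right):
        return False
    return all(null_safe_equal(l, r) for l, r in zip(left, right))

# Root-level null: the row is skipped, so no match even against itself.
assert not spark_keys_match((None,), (None,))
# A null nested inside a struct key compares as equal.
assert spark_keys_match(({"a": None},), ({"a": None},))
# Atomic root keys use plain equality.
assert spark_keys_match((1, {"a": 2}), (1, {"a": 2}))
```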
I see two approaches to fixing this problem. The first is to pass in per-column nullability information, which seems a non-starter. The second would be for cudf to support different null equality for the flattened columns, possibly as an option to match Spark. I'm unsure if other frameworks expect the same; I am asking around to see if this is unique to Spark.
cc @felipeblazing @williamBlazing what's the behavior you'd want for Blazing in this case? cc @shwina @brandon-b-miller any thoughts from the Python/Pandas perspective?
edit: please ignore this; see my next comment.

```python
>>> s1 = pd.Series({"a": [1, 2, pd.NA], "b": pd.NA})
>>> s1 == s1
a     True
b    False
dtype: bool
```
I don't understand the syntax of this example.
Here is a Pandas series of dictionaries ("structs"). The top-level parent is nullable, and so is the (single) child column. The top-level parent has a non-null element at position 0:

```python
>>> s = cudf.Series([{"a": pd.NA}, pd.NA])
>>> print(s)
0    {'a': <NA>}
1           <NA>
dtype: object
```

Doing an elementwise compare of this Series with itself:

```python
>>> s == s
0     True
1    False
dtype: bool
```

We see that null equality at the top level is `False`.
Do the rules for elementwise comparison apply to merge as well? What happens when you do an inner join of …
@shwina do we want to follow the Pandas behavior here? It's going to defer to Python dictionary comparison here.
@shwina I also think your example above is potentially broken, …
I agree we may not want the same semantics. It's somewhat serendipitous that Pandas and Spark exhibit the same behaviour here.
Spoke with Keith and Jake about this. The feeling is that this is custom Spark behavior more than something that would be expected by an end-user of cudf. As such, cudf will not implement Spark-specific changes like this, but will certainly expose what is needed to allow the Spark plugin to build what it needs. In this case, the plugin doesn't need any extra support: removing top-level null entries and then passing the null equality that the internal struct comparisons expect will work.
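The plugin-side approach described above can be sketched as: drop rows whose top-level struct key is null, then match the remainder with nulls treated as equal (so nested nulls inside structs match). All names here are illustrative Python, not plugin or cudf APIs, and the join is a naive nested loop purely for demonstration.

```python
# A minimal sketch (not plugin/cudf code) of the proposed approach:
# filter out top-level null keys, then join with nulls-equal semantics.

def spark_style_inner_join(left_keys, right_keys):
    """Return index pairs (i, j) of matching keys; None is a null struct."""
    # Step 1: drop top-level nulls, as Spark skips them for inner joins.
    left = [(i, k) for i, k in enumerate(left_keys) if k is not None]
    right = [(j, k) for j, k in enumerate(right_keys) if k is not None]
    # Step 2: match remaining keys; dict equality treats nested None as
    # equal, mirroring null_equality = EQUAL for struct children.
    return [(i, j) for i, lk in left for j, rk in right if lk == rk]

left = [{"a": None}, None, {"a": 1}]
right = [{"a": None}, {"a": 1}, None]
pairs = spark_style_inner_join(left, right)
# Nested nulls match (left[0] with right[0]); null parents never match.
assert (0, 0) in pairs and (2, 1) in pairs
assert all(i != 1 for i, _ in pairs)
```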
Describe the bug
For structs with nullable children, these children share the same `null_equality` with other join keys, since cuDF flattens all columns. In Spark, the `null_equality` of atomic types is `false`, but the `null_equality` of structs' children is `true`. Here is the ordering comparison strategy of Spark. So, the problem is that there exists no universal `null_equality` option for join keys composed of both structs with nullable children and nullable atomic types.

P.S. This issue is a corrected version of #7911.
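The conflict can be shown with a small sketch, assuming keys are flattened into a tuple of columns and one global `null_equality` flag applies to every column. Neither setting of the flag reproduces the Spark behavior described above; names and the comparison helper are invented for illustration.

```python
# Hypothetical sketch of why one global null_equality flag cannot express
# Spark semantics once struct children are flattened alongside atomic keys.

def keys_match(left, right, nulls_equal):
    """Compare flattened key tuples with a single null_equality setting."""
    for a, b in zip(left, right):
        if a is None or b is None:
            # A null matches only another null, and only if nulls_equal.
            if not (nulls_equal and a is None and b is None):
                return False
        elif a != b:
            return False
    return True

# Flattened key layout: (atomic_key, struct_child_key).
# Spark wants: null atomic key => no match; null struct child => match.

# nulls_equal=True wrongly matches rows whose atomic key is null:
assert keys_match((None, 1), (None, 1), nulls_equal=True)
# nulls_equal=False wrongly rejects matching null struct children:
assert not keys_match((1, None), (1, None), nulls_equal=False)
```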