[FEA] Add nested struct support for comparison operations #8964
Comments
This issue has been labeled.
This is still required.
This code snippet demonstrates some behavior with NaNs that I investigated with @rwlee. tl;dr: Spark treats NaN the same in binary operators.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{DoubleType, StructType}
import org.apache.spark.sql.Row
val schema = new StructType()
  .add("struct1", new StructType()
    .add("x", DoubleType)
    .add("y", DoubleType))
  .add("struct2", new StructType()
    .add("x", DoubleType)
    .add("y", DoubleType))
val v1 = 1.0
val v2 = Double.NaN
val structData = Seq(
  Row(Row(v1, v1), Row(v1, v1)),
  Row(Row(v1, v1), Row(v1, v2)),
  Row(Row(v1, v1), Row(v2, v1)),
  Row(Row(v1, v1), Row(v2, v2)),
  Row(Row(v1, v2), Row(v1, v1)),
  Row(Row(v1, v2), Row(v1, v2)),
  Row(Row(v1, v2), Row(v2, v1)),
  Row(Row(v1, v2), Row(v2, v2)),
  Row(Row(v2, v1), Row(v1, v1)),
  Row(Row(v2, v1), Row(v1, v2)),
  Row(Row(v2, v1), Row(v2, v1)),
  Row(Row(v2, v1), Row(v2, v2)),
  Row(Row(v2, v2), Row(v1, v1)),
  Row(Row(v2, v2), Row(v1, v2)),
  Row(Row(v2, v2), Row(v2, v1)),
  Row(Row(v2, v2), Row(v2, v2)))
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(structData), schema)
df.printSchema()
df.show(false)
val df2 = df.selectExpr("struct1", "struct2", "struct1 < struct2", "struct1 <= struct2", "struct1 == struct2")
df2.printSchema()
df2.show(false)
The df2 output is the relevant part for understanding the NaN behavior.
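If I read Spark's documented NaN semantics correctly (NaN equals NaN, and NaN is larger than any other value), java.lang.Double.compare follows the same total ordering, so it is a handy proxy for what the comparisons above do with NaN. A purely illustrative check, not the Spark or cudf implementation:
// Illustration only: the same total ordering Spark documents for NaN.
println(java.lang.Double.compare(Double.NaN, Double.NaN)) // 0: NaN "equals" NaN
println(java.lang.Double.compare(1.0, Double.NaN))        // negative: 1.0 < NaN
println(java.lang.Double.compare(Double.NaN, 1.0))        // positive: NaN > 1.0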
Adds support for Spark's null-aware equality binop and expands/improves Java testing for struct binops. Adds proper tests for null structs and full operator coverage. Utilizes existing Spark struct binop support with JNI changes to force the full null-aware comparison. Expands on #11153. Partial solution to #8964 -- `NULL_MAX` and `NULL_MIN` still outstanding.
Authors:
- Ryan Lee (https://github.com/rwlee)
Approvers:
- Tobias Ribizel (https://github.com/upsj)
- Vukasin Milovanovic (https://github.com/vuule)
- Jason Lowe (https://github.com/jlowe)
URL: #11520
@revans2 Is this issue solved?
@GregoryKimball Yes, I would say that it is fixed now. We don't support comparisons of ARRAYs, but we do support structs and structs of structs.
Is your feature request related to a problem? Please describe.
For Spark we are pushing to get more support for structs in a number of operators. We already have some support for sorting structs, so we should be able to come up with a way to do comparisons of nested structs too. NOTE: this does not include lists as children of the structs, just structs that contain basic types (including strings) and other structs.
The operations we would like to support include the BINARY ops EQUAL, NOT_EQUAL, LESS, GREATER, LESS_EQUAL, GREATER_EQUAL, NULL_EQUALS, and if possible NULL_MAX and NULL_MIN.
This should follow the same pattern we already support for sorting, with the order of precedence for the children in a struct going from first to last. In this case we would like nulls within the struct columns to be less than other values, but equal to each other, meaning Struct(null) is less than Struct(5) and Struct(null) == Struct(null). Nulls at the top level still depend on the operator being performed. For NULL_EQUALS, nulls are equal to each other.
Describe the solution you'd like
It would be great if we could do this as regular binary ops, but if we need them to be separate APIs, that works too. If null equality and similar options need to be configurable for the Python APIs, a separate API is fine.
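To make the requested ordering concrete, here is a minimal, purely illustrative Scala sketch (not a cudf API): struct children are modeled as Option[Double], precedence runs from the first child to the last, a null child is equal to another null child and less than any value, and doubles fall back to java.lang.Double.compare, which also matches the NaN ordering discussed above.
// Illustrative only: the requested struct ordering, not the cudf implementation.
def compareStruct(a: Seq[Option[Double]], b: Seq[Option[Double]]): Int =
  a.zip(b).foldLeft(0) { case (acc, (x, y)) =>
    if (acc != 0) acc                 // an earlier child already decided the order
    else (x, y) match {
      case (None, None)         => 0  // null children are equal to each other
      case (None, Some(_))      => -1 // a null child is less than any value
      case (Some(_), None)      => 1
      case (Some(xv), Some(yv)) => java.lang.Double.compare(xv, yv)
    }
  }

compareStruct(Seq(None), Seq(Some(5.0)))  // negative: Struct(null) < Struct(5)
compareStruct(Seq(None), Seq(None))       // 0: Struct(null) == Struct(null)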
Describe alternatives you've considered
We could flatten the struct columns ourselves and do a number of different operations to combine the results back together to get the right answer, but cudf already has a flatten method behind the scenes, so why replicate that when others could benefit from it too?
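For illustration, reusing df from the NaN snippet above (and ignoring the null and NaN corner cases, which are exactly the bookkeeping we would rather not replicate), a flattened struct1 < struct2 could be assembled by hand:
// Sketch only: child-wise expansion of struct1 < struct2, first child breaks ties.
// Null and NaN handling are deliberately omitted.
val flattened = df.selectExpr(
  "struct1", "struct2",
  "(struct1.x < struct2.x) OR (struct1.x = struct2.x AND struct1.y < struct2.y) AS lt_flattened")
flattened.show(false)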