[BUG] Extra GpuColumnarToRow when using ParquetCachedBatchSerializer on databricks #2896
Comments
Hi @viadea, with some investigation, I infer that the existence of the extra GpuColumnarToRow is related to how Spark inserts columnar transitions. According to insertTransitions:

```scala
private def insertTransitions(plan: SparkPlan): SparkPlan = {
  if (plan.supportsColumnar) {
    // The tree feels kind of backwards
    // This is the end of the columnar processing so go back to rows
    ColumnarToRowExec(insertRowToColumnar(plan))
  } else if (!plan.isInstanceOf[ColumnarToRowTransition]) {
    plan.withNewChildren(plan.children.map(insertTransitions))
  } else {
    plan
  }
}
```

In the current case, the cached plan then goes through InMemoryRelation's convertToColumnarIfPossible:

```scala
def convertToColumnarIfPossible(plan: SparkPlan): SparkPlan = plan match {
  case gen: WholeStageCodegenExec => gen.child match {
    case c2r: ColumnarToRowTransition => c2r.child match {
      case ia: InputAdapter => ia.child
      case _ => plan
    }
    case _ => plan
  }
  case c2r: ColumnarToRowTransition => // This matches when whole stage code gen is disabled.
    c2r.child
  case _ => plan
}
```

Therefore, the existence of the extra GpuColumnarToRow appears to follow from the interaction of these two code paths.
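As a side note not from the original comment: one way to check whether a ColumnarToRow-style transition actually ends up inside the cached relation is to walk the executed plan and look at the cachedPlan of each InMemoryTableScanExec. The helper below is only an illustrative sketch; findColumnarToRowUnderCache is a name introduced here, it assumes Spark 3.1.x class names, and with the RAPIDS plugin the in-memory scan node itself may be replaced by a GPU variant, in which case the first match would also need to key off the node name.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.execution.columnar.InMemoryTableScanExec

// Illustrative sketch: collect any ColumnarToRow-style nodes that sit inside the
// cached plan of an InMemoryRelation referenced by the query. Matching on nodeName
// also catches GpuColumnarToRow without compiling against the plugin classes.
def findColumnarToRowUnderCache(df: DataFrame): Seq[SparkPlan] = {
  df.queryExecution.executedPlan.collect {
    case scan: InMemoryTableScanExec => scan.relation.cachedPlan
  }.flatMap { cached =>
    cached.collect {
      case node if node.nodeName.contains("ColumnarToRow") => node
    }
  }
}
```

Calling this on the query from the repro script later in this thread should return an empty sequence when the extra transition is absent.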
Hi @viadea @jlowe, I ran the same query on a Databricks cluster with Spark 3.1.1 and RAPIDS 21.08, but I didn't see the extra GpuColumnarToRow beneath the InMemoryRelation.

Here is the driver log: https://dbc-9ff9942e-a9c4.cloud.databricks.com/?o=8721196619973675#setting/sparkui/0924-064308-stem303/driver-logs

```python
from pyspark.sql import Row
from pyspark.sql.types import *
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("debug123")\
    .config("spark.sql.cache.serializer", "com.nvidia.spark.rapids.shims.spark311.ParquetCachedBatchSerializer")\
    .getOrCreate()

data = [
    Row(Row("Adam ", "", "Green"), "1", "M", 1000),
    Row(Row("Bob ", "Middle", "Green"), "2", "M", 2000),
    Row(Row("Cathy ", "", "Green"), "3", "F", 3000)]

schema = StructType() \
    .add("name", StructType()
        .add("firstname", StringType())
        .add("middlename", StringType())
        .add("lastname", StringType())) \
    .add("id", StringType()) \
    .add("gender", StringType()) \
    .add("salary", IntegerType())

df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
df.write.format("parquet").mode("overwrite").save("/tmp/testparquet")
df2 = spark.read.parquet("/tmp/testparquet")
df2.createOrReplaceTempView("df2")
df3 = spark.sql("select struct(name, struct(name.firstname, name.lastname) as newname) as col from df2").cache()
df3.createOrReplaceTempView("df3")
spark.sql("select count(distinct col.name.firstname) from df3").show()
spark.sql("select count(distinct col.name.firstname) from df3").explain()
```
I also tested the 21.10 snapshot and found the same thing: the extra GpuColumnarToRow is gone.
I don't know of anything that would have explicitly fixed this. Pinging @razajafri in case he knows of a change that could be related. Closing this as it's not reproducible on recent builds. We can reopen if it appears again.
Describe the bug
This is a follow-up issue related to #2880.
When using ParquetCachedBatchSerializer on Databricks 8.2 ML GPU, I found there is an extra GpuColumnarToRow right before InMemoryRelation.
Steps/Code to reproduce bug
Same reproduction code as #2856.
Expected behavior
The expectation is that the Databricks plan should not have the extra GpuColumnarToRow.
Environment details
Databricks 8.2 ML GPU with Spark 3.1.1
Using the 21.08 snapshot jar with the fix from #2880.