-
Notifications
You must be signed in to change notification settings - Fork 242
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support bucketing write for GPU #10957
Conversation
Signed-off-by: Firestarman <[email protected]>
Make it draft for early reviews and running tests on DB by ci. |
build |
Signed-off-by: Firestarman <[email protected]>
build |
Signed-off-by: Firestarman <[email protected]>
build |
Signed-off-by: Firestarman <[email protected]>
build |
build |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I honestly didn't make it though the entire patch. It is very large. I'll try to find time to finish it soon.
What performance and scale testing have we done with this?
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOverrides.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuFileFormatDataWriter.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuFileFormatDataWriter.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuFileFormatDataWriter.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuFileFormatDataWriter.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuFileFormatDataWriter.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuFileFormatDataWriter.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuFileFormatDataWriter.scala
Outdated
Show resolved
Hide resolved
Yeah, it is big, thx a lot for the review.
not yet, since I thought it is not necessary for feature PRs. But I am happy to run NDS with this if you prefer. |
Signed-off-by: Firestarman <[email protected]>
Signed-off-by: Firestarman <[email protected]>
Signed-off-by: Firestarman <[email protected]>
Signed-off-by: Firestarman <[email protected]>
Signed-off-by: Firestarman <[email protected]>
build |
I am not concerned much about NDS. We are not doing any bucketed writes/reads in NDS. If you think we should change that we can, but it would take some analysis to see how we wanted to do that. I am more concerned about how our performance compares to the CPU for similar situations. Especially for writes. We already know that reads should be much faster because we already are really good at joins and this should reduce the shuffle ahead of a join. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still didn't get through everything, but I am getting closer.
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOverrides.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuFileFormatDataWriter.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuFileFormatDataWriter.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuFileFormatDataWriter.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuFileFormatDataWriter.scala
Outdated
Show resolved
Hide resolved
Signed-off-by: Firestarman <[email protected]>
Signed-off-by: Firestarman <[email protected]>
Signed-off-by: Firestarman <[email protected]>
build |
Signed-off-by: Firestarman <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I have made my way through all of the code now. But this is large enough I don't trust myself and would like at least one other person to review this too.
build |
Looks like you had a few failures in databricks.
|
It can be fixed by changing the This is probably not a Rapids bug. I checked the error stack and it even did not go into any GPU code. It seems to be related to the |
Signed-off-by: Firestarman <[email protected]>
build |
The lastest failure is due to the known issue #11070 |
build |
Signed-off-by: Firestarman <[email protected]>
build |
Signed-off-by: Firestarman <[email protected]>
build |
Signed-off-by: Firestarman <[email protected]>
build |
This PR adds the GPU support for the bucketing write. - React the code of the dynamic partition single writer and concurrent writer to try to reuse the code as much as possible, and then add in the bucketing write logic for both of them. - Update the bucket check during the plan overriding for the write commands, including InsertIntoHadoopFsRelationCommand, CreateDataSourceTableAsSelectCommand, InsertIntoHiveTable, CreateHiveTableAsSelectCommand. - From 330, Spark also supports HiveHash to generate the bucket IDs, in addition to Murmur3Hash. So the shim object GpuBucketingUtils is introduced to handle the shim things. - This change also adds two functions (tagForHiveBucketingWrite and tagForBucketing) to do the overriding check for the two hashing functions separately. And the Hive write nodes will fall back to CPU when HiveHash is chosen, because HiveHash is not supported on GPU. --------- Signed-off-by: Firestarman <[email protected]>
close #22
This PR adds the GPU support for the bucketing write.
InsertIntoHadoopFsRelationCommand, CreateDataSourceTableAsSelectCommand, InsertIntoHiveTable, CreateHiveTableAsSelectCommand.
HiveHash
to generate the bucket IDs, in addition toMurmur3Hash
. So the shim objectGpuBucketingUtils
is introduced to handle the shim things.tagForHiveBucketingWrite and tagForBucketing
) to do the overriding check for the two hashing functions separately. And the Hive write nodes will fall back to CPU whenHiveHash
is chosen, because HiveHash is not supported on GPU.