-
Notifications
You must be signed in to change notification settings - Fork 461
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[GLUTEN-1594][CH] support native parquet writer for clickhouse backend #1595
Conversation
Run Gluten Clickhouse CI |
Run Gluten Clickhouse CI |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
||
package org.apache.spark.sql.execution.datasources; | ||
|
||
public class CHDatasourceJniWrapper { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe CHDatasourceWriterWrapper
is better ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's named after this. https://github.com/oap-project/gluten/pull/1344/files#diff-379c35236c5f4cfe33d76e4ed3c9fc90ee447794dfc25b5674b05402c767af3f . I guess this class is also planed for future APIs to support a complete native spark-datasouce
@zhztheplayer @rui-mo please help to review, thanks |
please add the ut for this new feature. |
i'm working on it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. And also verified in 500gb TPC-DS parquet write test in velox backend. And performance data unchanged.
Just curious what benchmark we used in our measurements? |
Run Gluten Clickhouse CI |
customer table in tpch 100. |
copy(child = newChild) | ||
} | ||
|
||
object VeloxColumnarRules { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Verified on internal Jenkins. Just a tiny question, can this file be put in backends-velox
instead of gluten-data
if it is only for Velox backend?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi morui, I confirmed with @zzcclp , we think:
- currently all classes in gluten-data seems to be "for velox" . Classes "for velox+ch" are all located in gluten-core.
- Due to 1, some classes named "GlutenXXX" in gluten-data is misleading
- This is an unreasonable state, but we don't have the bandwidth to correct it right now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some classes named "GlutenXXX" in gluten-data is misleading
The classes GlutenXXX
were added for backends that rely on Gluten to manage data conversions between Java and C++. So we put this layer (gluten-data) as an individual module out from backends-velox. But indeed we don't have another backend to use this part of code yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. In that case, gluten-data is designed to be a common module, right? Anyway, per previous discussion with Binwei, I will do a clean up for classes named as GlutenXXX
and VeloxXXX
because the prefixes are not necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gluten-data is designed to be a common module, right?
Yes that was the original design.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @zhztheplayer for the explanation!
@rui-mo , If a class is already backend-agnostic, then I'll agree with you that the Gluten prefix is unnecessary.
However, if a class is velox-only , it's perferrable to keep the Velox prefix, so that we can easily distinguish velox-specific classes from the common classes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will do a clean up for classes named as GlutenXXX and VeloxXXX because the prefixes are not necessary.
@binmahone Got your point. The motivation for above action is we assume Velox related code is put in backends-velox
module, so the Velox
prefix becomes unnecessary. If some velox-only code exists in a common module, that's a different case which shall be rechecked. I will take a look further, thanks for pointing this out!
Run Gluten Clickhouse CI |
1 similar comment
Run Gluten Clickhouse CI |
gluten-core/src/main/scala/org/apache/spark/sql/execution/datasources/GlutenColumnarRules.scala
Show resolved
Hide resolved
Run Gluten Clickhouse CI |
gluten-core/src/main/scala/org/apache/spark/sql/execution/datasources/GlutenColumnarRules.scala
Show resolved
Hide resolved
gluten-core/src/main/scala/org/apache/spark/sql/execution/datasources/GlutenColumnarRules.scala
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
What changes were proposed in this pull request?
a basic working version is provided by this PR. (so that we're unblocked to perform experiments)
however, as there're many TODO notes in the PR:
so, this ability is still incomplete.
experiments show that native parquet writer is about 20%-25% better than previous (Gluten used to convert columnar to row first, and use vanilla Spark's ParquetOutputFormat for writing):
How was this patch tested?
I'll write a simple test case to ensure basic write routine works