-
Notifications
You must be signed in to change notification settings - Fork 242
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
abstract the parquet coalescing reading #2181
abstract the parquet coalescing reading #2181
Conversation
please add description |
@wbo4958 sorry I didn't get to review this and forgot about it, can you bring it up to date and I'll review? |
@tgravescs I am refining this PR, will bring it up when it can be reviewable. thx Tom. |
This PR is to abstract the common coalescing reading into a common class and apply it to Parquet file format Signed-off-by: Bobby Wang <[email protected]>
c5ac829
to
ba08039
Compare
build |
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuMultiFileReader.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuMultiFileReader.scala
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuParquetScan.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuParquetScan.scala
Outdated
Show resolved
Hide resolved
build |
@tgravescs, Could you help to review this PR |
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuMultiFileReader.scala
Outdated
Show resolved
Hide resolved
isCorrectRebaseMode: Boolean, clippedSchema: SchemaBase): Table = { | ||
|
||
// Dump parquet data into a file | ||
if (debugDumpPrefix != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it would be nice to support this similar dump for all formats. We need specific for actually dumping the data but can there be common interface all must have?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thx, Done
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuMultiFileReader.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuMultiFileReader.scala
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuParquetScan.scala
Outdated
Show resolved
Hide resolved
Also what testing have you done here? I would like to run a few performance tests to make sure no regressions, if we can run on a bunch of small files and compares current version vs this version that would be great. |
build |
fyi - still 2 questions/comments open |
build |
Hi @tgravescs I've tested the below scenarios for
total 4000 small parquet files, total 16M
total 100 parquet files, each file is 14M, total 1.3G
Seems this PR does not bring performance regression. |
This PR is to abstract the common coalescing reading logic and apply it to Parquet.