abstract the parquet coalescing reading #2181

Merged

tgravescs merged 5 commits into NVIDIA:branch-21.08 from wbo4958:parquet-abstract-coalescing

Jun 21, 2021

Collaborator

wbo4958 commented Apr 19, 2021

This PR abstracts the common coalescing reading logic into a common class and applies it to Parquet.
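As a rough illustration of the idea, the split between shared coalescing logic and a format-specific reader might look like the sketch below. Every name in it (`Block`, `CoalescingReaderSketch`, `ParquetCoalescingSketch`, `maxBatchBytes`) is hypothetical and chosen for this example only; it is not the plugin's actual class hierarchy, and the greedy size-based grouping is an assumption about what "coalescing" involves.

```scala
// Hypothetical sketch: none of these names come from the spark-rapids code base.

// A chunk of data from one file, described only by its size.
final case class Block(file: String, sizeBytes: Long)

// Format-independent part: decides which blocks are read together.
abstract class CoalescingReaderSketch(maxBatchBytes: Long) {
  // Format-specific hook: turn one group of blocks into a single result.
  protected def readCoalesced(blocks: Seq[Block]): String

  // Common logic: greedily pack consecutive blocks until the size limit
  // would be exceeded, then start a new batch.
  final def planBatches(blocks: Seq[Block]): Seq[Seq[Block]] = {
    val batches = scala.collection.mutable.ArrayBuffer.empty[Vector[Block]]
    var current = Vector.empty[Block]
    var currentBytes = 0L
    for (b <- blocks) {
      if (current.nonEmpty && currentBytes + b.sizeBytes > maxBatchBytes) {
        batches += current
        current = Vector.empty[Block]
        currentBytes = 0L
      }
      current = current :+ b
      currentBytes += b.sizeBytes
    }
    if (current.nonEmpty) batches += current
    batches.toSeq
  }

  final def readAll(blocks: Seq[Block]): Seq[String] =
    planBatches(blocks).map(readCoalesced)
}

// Format-specific subclass. A real reader would do format-specific work here;
// this placeholder only describes the batch it was given.
class ParquetCoalescingSketch(maxBatchBytes: Long)
    extends CoalescingReaderSketch(maxBatchBytes) {
  override protected def readCoalesced(blocks: Seq[Block]): String = {
    val files = blocks.map(_.file).mkString(",")
    s"parquet batch: $files (${blocks.map(_.sizeBytes).sum} bytes)"
  }
}
```

With this shape, supporting another file format would only require a new subclass that overrides `readCoalesced`, while the batching logic stays in one place.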

Collaborator

tgravescs commented Apr 19, 2021

please add description

sameerz added the performance label

sameerz added this to the Apr 12 - Apr 23 milestone

sameerz assigned wbo4958

sameerz modified the milestones: Apr 12 - Apr 23, Apr 26 - May 7

sameerz removed this from the Apr 26 - May 7 milestone

pxLi changed the base branch from branch-0.6 to branch-21.06

May 19, 2021 01:12

wbo4958 changed the base branch from branch-21.06 to branch-21.08

June 8, 2021 03:46

tgravescs force-pushed the branch-21.08 branch from 618b4ed to eddc523 Compare

June 9, 2021 20:36

wbo4958 requested review from GaryShen2008, jlowe, NvTimLiu, revans2 and tgravescs as code owners

June 9, 2021 20:36

wbo4958 marked this pull request as draft

June 10, 2021 02:03

Collaborator

tgravescs commented Jun 15, 2021

@wbo4958 sorry I didn't get to review this and forgot about it, can you bring it up to date and I'll review?

Collaborator Author

wbo4958 commented Jun 15, 2021

@tgravescs I am refining this PR and will bring it up to date once it is ready for review. Thx, Tom.


          abstract the parquet coalescing reading

ba08039

This PR is to abstract the common coalescing reading into a common class
and apply it to Parquet file format

Signed-off-by: Bobby Wang <[email protected]>

wbo4958 force-pushed the parquet-abstract-coalescing branch from c5ac829 to ba08039 Compare

June 16, 2021 13:39

Collaborator Author

wbo4958 commented Jun 16, 2021

build

firestarman reviewed

sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuMultiFileReader.scala Outdated

sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuParquetScan.scala

sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuParquetScan.scala

sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuMultiFileReader.scala

sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuParquetScan.scala Outdated

sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuParquetScan.scala Outdated

wbo4958 added 2 commits

June 17, 2021 10:34


          comment refine

86b3d76


          remove unused class and imports

7f40f44

wbo4958 marked this pull request as ready for review

June 17, 2021 03:04

Collaborator Author

wbo4958 commented Jun 17, 2021

build

wbo4958 changed the title ~~[WIP] Parquet abstract coalescing~~ abstract the parquet coalescing reading

Collaborator Author

wbo4958 commented Jun 17, 2021

@tgravescs, could you help review this PR?

tgravescs reviewed

sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuParquetScan.scala

sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuMultiFileReader.scala Outdated

sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuParquetScan.scala Outdated

+                    isCorrectRebaseMode: Boolean, clippedSchema: SchemaBase): Table = {
+                  // Dump parquet data into a file
+                  if (debugDumpPrefix != null) {

Collaborator

tgravescs Jun 17, 2021

It would be nice to support a similar dump for all formats. The code that actually dumps the data has to be format-specific, but can there be a common interface that all formats must implement?

Collaborator Author

wbo4958 Jun 18, 2021

Thx, Done
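One possible shape for such a common dump interface is sketched below. The trait and method names (`DebugDumpSketch`, `dumpExtension`, `dumpIfEnabled`) are hypothetical and are not taken from the change that resolved this comment; only the `debugDumpPrefix != null` check mirrors the snippet under review. The sketch also assumes the prefix is a path with a parent directory, such as `/tmp/dump-`.

```scala
import java.nio.file.{Files, Path, Paths}

// Hypothetical trait; the names are illustrative, not the plugin's actual API.
trait DebugDumpSketch {
  // Format-specific part: the extension of dumped files, e.g. "parquet".
  protected def dumpExtension: String

  // Common part: write the raw bytes to a new file whose name starts with the
  // given prefix. Returns the path written, or None when debugDumpPrefix is
  // null (dumping disabled).
  // Assumption: the prefix has a parent directory and a file-name part.
  def dumpIfEnabled(debugDumpPrefix: String, data: Array[Byte]): Option[Path] = {
    Option(debugDumpPrefix).map { prefix =>
      val prefixPath = Paths.get(prefix).toAbsolutePath
      val dir = prefixPath.getParent
      Files.createDirectories(dir)
      val out = Files.createTempFile(
        dir, prefixPath.getFileName.toString, "." + dumpExtension)
      Files.write(out, data)
      out
    }
  }
}

// Each format would only supply its own extension (and, in a real reader,
// whatever format-specific serialization is needed before dumping).
object ParquetDumpSketch extends DebugDumpSketch {
  override protected def dumpExtension: String = "parquet"
}
```

Keeping the null check and file creation in the trait means each format only declares what differs, which is the kind of shared contract the comment asks about.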

sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuMultiFileReader.scala Outdated

sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuMultiFileReader.scala

sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuParquetScan.scala Outdated

sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuParquetScan.scala

Collaborator

tgravescs commented Jun 17, 2021

Also, what testing have you done here? I would like to run a few performance tests to make sure there are no regressions. If we can run on a bunch of small files and compare the current version vs. this version, that would be great.


          resolve comments

ed57e2c

Collaborator

tgravescs commented Jun 18, 2021

build

Collaborator

tgravescs commented Jun 18, 2021

fyi - still 2 questions/comments open


          resolve comments

7886e53

Collaborator Author

wbo4958 commented Jun 18, 2021

build

Collaborator Author

wbo4958 commented Jun 21, 2021

I've tested the scenarios below for upstream and for upstream + this PR on the Databricks 8.2 runtime with 2 g4dn.2xlarge workers.

Small file testing

4000 small Parquet files, 16M in total

| Run | upstream + this PR time (s) | upstream time (s) |
| --- | --- | --- |
| 1st | 35.736 | 33.976 |
| 2nd | 27.406 | 27.346 |
| 3rd | 23.98 | 23.442 |
| 4th | 26.217 | 26.331 |

Slightly bigger file testing

100 Parquet files, 14M each, 1.3G in total

| Run | upstream + this PR time (s) | upstream time (s) |
| --- | --- | --- |
| 1st | 13.836 | 13.421 |
| 2nd | 12.348 | 12.289 |
| 3rd | 11.676 | 11.175 |
| 4th | 10.966 | 10.842 |

It seems this PR does not introduce a performance regression.
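For readers who want to quantify the comparison, a small sketch like the following computes the relative change in mean run time for each scenario. The timings are copied from the two tables above; the object and method names (`TimingComparisonSketch`, `meanDeltaPercent`) are made up for this example, and comparing means over four runs is only one of several reasonable ways to summarize the data.

```scala
// Hypothetical helper for summarizing the timings reported in this comment.
object TimingComparisonSketch {
  // Each pair is (upstream + this PR, upstream), in seconds.
  val smallFiles: Seq[(Double, Double)] =
    Seq((35.736, 33.976), (27.406, 27.346), (23.98, 23.442), (26.217, 26.331))
  val biggerFiles: Seq[(Double, Double)] =
    Seq((13.836, 13.421), (12.348, 12.289), (11.676, 11.175), (10.966, 10.842))

  private def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // Relative change of the mean run time in percent
  // (positive means upstream + this PR was slower on average).
  def meanDeltaPercent(runs: Seq[(Double, Double)]): Double = {
    val withPr = mean(runs.map(_._1))
    val upstream = mean(runs.map(_._2))
    (withPr - upstream) / upstream * 100.0
  }

  def main(args: Array[String]): Unit = {
    println(f"small files:  ${meanDeltaPercent(smallFiles)}%.2f%%")
    println(f"bigger files: ${meanDeltaPercent(biggerFiles)}%.2f%%")
  }
}
```

On these numbers the mean difference works out to roughly 2% in both scenarios, which is small next to the spread between individual runs (for example 23.442 s to 33.976 s for upstream on the small files), so with four runs each it is hard to separate from noise.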

tgravescs approved these changes

tgravescs added this to the June 21 - July 2 milestone

tgravescs merged commit c03a926 into NVIDIA:branch-21.08

wbo4958 deleted the parquet-abstract-coalescing branch

June 21, 2021 21:09

Reviewers

firestarman left review comments

tgravescs approved these changes

GaryShen2008: awaiting requested review

jlowe: awaiting requested review

NvTimLiu: awaiting requested review

revans2: awaiting requested review