abstract the parquet coalescing reading #2181

Merged (5 commits) on Jun 21, 2021

@wbo4958 wbo4958 commented Apr 19, 2021

This PR abstracts the common coalescing-read logic and applies it to Parquet.
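As a rough illustration of what abstracting the coalescing read can mean, here is a minimal sketch. It is not the PR's code: `CoalescingSketch`, `Block`, `CoalescingReader`, `FakeParquetReader`, and `maxBatchBytes` are all hypothetical names, and the greedy size-based packing is an assumption about how blocks might be grouped. The shared base class decides which blocks are read together, while a format-specific subclass supplies only the read step.

```scala
// Hypothetical sketch; none of these names come from the PR itself.
object CoalescingSketch {
  // A block of data inside some file, described only by its origin and size.
  final case class Block(file: String, sizeBytes: Long)

  // Format-independent part: decides which blocks are read together.
  abstract class CoalescingReader(maxBatchBytes: Long) {
    // Format-specific part: turn one coalesced group of blocks into a result.
    protected def readBatch(blocks: Seq[Block]): String

    // Greedily pack consecutive blocks until adding one would exceed the limit.
    // A single block larger than the limit still gets a batch of its own.
    def coalesce(blocks: Seq[Block]): List[Vector[Block]] = {
      val batches = scala.collection.mutable.ListBuffer.empty[Vector[Block]]
      var current = Vector.empty[Block]
      var currentBytes = 0L
      for (b <- blocks) {
        if (current.nonEmpty && currentBytes + b.sizeBytes > maxBatchBytes) {
          batches += current
          current = Vector.empty[Block]
          currentBytes = 0L
        }
        current = current :+ b
        currentBytes += b.sizeBytes
      }
      if (current.nonEmpty) batches += current
      batches.toList
    }

    // Read every coalesced batch using the format-specific step.
    def readAll(blocks: Seq[Block]): List[String] = coalesce(blocks).map(readBatch)
  }

  // Stand-in for a Parquet-specific subclass; it only reports what it would read.
  class FakeParquetReader(maxBatchBytes: Long) extends CoalescingReader(maxBatchBytes) {
    protected def readBatch(blocks: Seq[Block]): String =
      s"parquet batch: ${blocks.size} blocks, ${blocks.map(_.sizeBytes).sum} bytes"
  }
}
```

With this shape, supporting another file format would only require a new subclass that implements `readBatch`; the batching logic stays in one place.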

@tgravescs
Collaborator

please add description

@sameerz sameerz added the performance label (A performance related task/issue) Apr 19, 2021
@sameerz sameerz added this to the Apr 12 - Apr 23 milestone Apr 19, 2021
@sameerz sameerz removed this from the Apr 26 - May 7 milestone May 10, 2021
@pxLi pxLi changed the base branch from branch-0.6 to branch-21.06 May 19, 2021 01:12
@wbo4958 wbo4958 changed the base branch from branch-21.06 to branch-21.08 June 8, 2021 03:46
@wbo4958 wbo4958 marked this pull request as draft June 10, 2021 02:03
@tgravescs
Collaborator

@wbo4958 sorry I didn't get to review this and forgot about it, can you bring it up to date and I'll review?

@wbo4958
Collaborator Author

wbo4958 commented Jun 15, 2021

@tgravescs I am refining this PR and will bring it up to date once it is reviewable. Thanks, Tom.

This PR is to abstract the common coalescing reading into a common class
and apply it to Parquet file format

Signed-off-by: Bobby Wang <[email protected]>
@wbo4958 wbo4958 force-pushed the parquet-abstract-coalescing branch from c5ac829 to ba08039 Compare June 16, 2021 13:39
@wbo4958
Collaborator Author

wbo4958 commented Jun 16, 2021

build

@wbo4958 wbo4958 marked this pull request as ready for review June 17, 2021 03:04
@wbo4958
Collaborator Author

wbo4958 commented Jun 17, 2021

build

@wbo4958 wbo4958 changed the title from "[WIP] Parquet abstract coalescing" to "abstract the parquet coalescing reading" Jun 17, 2021
@wbo4958
Collaborator Author

wbo4958 commented Jun 17, 2021

@tgravescs, could you help review this PR?

isCorrectRebaseMode: Boolean, clippedSchema: SchemaBase): Table = {

// Dump parquet data into a file
if (debugDumpPrefix != null) {
Collaborator

It would be nice to support a similar dump for all formats. Each format needs its own logic for actually dumping the data, but could there be a common interface that all of them must implement?

Collaborator Author

Thx, Done
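The common dump interface asked for in this thread could be shaped roughly as follows. This is a minimal sketch under stated assumptions, not the change that was actually made: `DumpSketch`, `DebugDumper`, `dumpExtension`, `writeDump`, `dumpIfEnabled`, and `RawDumper` are hypothetical names, and only `debugDumpPrefix` and its null check are taken from the reviewed snippet. The trait owns the shared file handling, and each format implements just the write step.

```scala
import java.io.OutputStream
import java.nio.file.{Files, Path}

// Hypothetical sketch; the trait and its members are illustrative names only.
object DumpSketch {
  trait DebugDumper {
    // Prefix for dump files; null disables dumping, as in the reviewed snippet.
    def debugDumpPrefix: String

    // Each format chooses its own file extension.
    def dumpExtension: String

    // Format-specific: write the raw buffer in whatever layout the format needs.
    protected def writeDump(buffer: Array[Byte], out: OutputStream): Unit

    // Shared logic: create the file, delegate the write, always close the stream.
    // Assumption: the prefix is a plain file-name prefix inside the temp directory.
    def dumpIfEnabled(buffer: Array[Byte]): Option[Path] =
      Option(debugDumpPrefix).map { prefix =>
        val path = Files.createTempFile(prefix, "." + dumpExtension)
        val out = Files.newOutputStream(path)
        try writeDump(buffer, out) finally out.close()
        path
      }
  }

  // Simplest possible implementation: dump the bytes unchanged.
  class RawDumper(val debugDumpPrefix: String, val dumpExtension: String)
      extends DebugDumper {
    protected def writeDump(buffer: Array[Byte], out: OutputStream): Unit =
      out.write(buffer)
  }
}
```

A caller would invoke `dumpIfEnabled(bytes)` and log the returned path when it is defined; with a null prefix the call does nothing and returns `None`.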

@tgravescs
Collaborator

Also, what testing have you done here? I would like to run a few performance tests to make sure there are no regressions. If we can run on a bunch of small files and compare the current version against this version, that would be great.

@tgravescs
Collaborator

build

@tgravescs
Collaborator

fyi - still 2 questions/comments open

@wbo4958
Collaborator Author

wbo4958 commented Jun 18, 2021

build

@wbo4958
Collaborator Author

wbo4958 commented Jun 21, 2021

Hi @tgravescs

I tested the scenarios below on upstream and on upstream + this PR, using the Databricks 8.2 runtime with 2 g4dn.2xlarge workers.

  • small file testing

4000 small Parquet files, 16 MB in total

| Run | upstream + this PR, time (s) | upstream, time (s) |
| --- | --- | --- |
| 1st | 35.736 | 33.976 |
| 2nd | 27.406 | 27.346 |
| 3rd | 23.98 | 23.442 |
| 4th | 26.217 | 26.331 |

  • slightly larger file testing

100 Parquet files, 14 MB each, 1.3 GB in total

| Run | upstream + this PR, time (s) | upstream, time (s) |
| --- | --- | --- |
| 1st | 13.836 | 13.421 |
| 2nd | 12.348 | 12.289 |
| 3rd | 11.676 | 11.175 |
| 4th | 10.966 | 10.842 |

It seems this PR does not introduce a performance regression.
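To put a number on the comparison, the reported timings can be averaged per scenario. The sketch below uses only the eight pairs of measurements listed in this comment; the object and method names (`TimingSketch`, `mean`, `relativeChangePct`) are hypothetical, and a mean over four runs is a rough summary, not a statistical test.

```scala
// Hypothetical helper; the numbers are the timings reported in this comment.
object TimingSketch {
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // Relative change of the PR's mean time against the upstream mean, in percent.
  def relativeChangePct(pr: Seq[Double], upstream: Seq[Double]): Double =
    (mean(pr) - mean(upstream)) / mean(upstream) * 100.0

  // 4000 small files, times in seconds.
  val smallPr = Seq(35.736, 27.406, 23.98, 26.217)
  val smallUpstream = Seq(33.976, 27.346, 23.442, 26.331)

  // 100 larger files, times in seconds.
  val largerPr = Seq(13.836, 12.348, 11.676, 10.966)
  val largerUpstream = Seq(13.421, 12.289, 11.175, 10.842)

  def main(args: Array[String]): Unit = {
    println(f"small files:  ${relativeChangePct(smallPr, smallUpstream)}%.2f%%")
    println(f"larger files: ${relativeChangePct(largerPr, largerUpstream)}%.2f%%")
  }
}
```

Both scenarios come out at roughly a 2% difference in mean time, which is small next to the run-to-run spread within each column (for example 23.4 s to 34.0 s across the upstream small-file runs).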

@tgravescs tgravescs added this to the June 21 - July 2 milestone Jun 21, 2021
@tgravescs tgravescs merged commit c03a926 into NVIDIA:branch-21.08 Jun 21, 2021
@wbo4958 wbo4958 deleted the parquet-abstract-coalescing branch June 21, 2021 21:09
Labels
performance A performance related task/issue
4 participants