Enable Hive optimized Parquet writer by default #17393

Merged
merged 2 commits into from
May 12, 2023

electrum
Member

@electrum electrum commented May 8, 2023

Release notes

(x) Release notes are required, with the following suggested text:

# Hive
* Enable the optimized Parquet writer by default. This can be disabled using
  the `parquet.optimized-writer.enabled` configuration property or
  the `parquet_optimized_writer_enabled` session property. ({issue}`17393`)

@electrum electrum requested a review from raunaqmorarka May 8, 2023 17:25
@cla-bot cla-bot bot added the cla-signed label May 8, 2023
@github-actions github-actions bot added hive Hive connector tests:hive labels May 8, 2023
@github-actions github-actions bot added the delta-lake Delta Lake connector label May 9, 2023
@raunaqmorarka
Member

We also need to update the default value for `parquet.optimized-writer.enabled` in hive.rst.

@electrum
Member Author

electrum commented May 9, 2023

@raunaqmorarka Thanks for the fast review. I updated the PR.

@github-actions github-actions bot added the docs label May 9, 2023
@trinodb trinodb deleted a comment from github-actions bot May 9, 2023
@trinodb trinodb deleted a comment from github-actions bot May 9, 2023
@github-actions
github-actions bot commented May 9, 2023

The CI workflow run with tests that require additional secrets finished as failure: https://github.com/trinodb/trino/actions/runs/4922568047

@sopel39 sopel39 requested a review from gaurav8297 May 9, 2023 11:11
@sopel39
Member

sopel39 commented May 9, 2023

@electrum should we increase Parquet validation? Right now it's 5% in Hive.

@electrum
Member Author

electrum commented May 9, 2023

@sopel39 That's a sampling percentage per file, so it should trigger for most queries. We could increase it if you think it's valuable.

@sopel39
Member

sopel39 commented May 9, 2023

@electrum I will defer to @raunaqmorarka on the validation percentage, as he did most of the writer improvements recently. I was thinking about something more conservative, like 100% for a few releases, as the potential implications of a bad writer would be considerable.

@raunaqmorarka
Member

The performance and cost implications (S3 requests cost money) would be significant enough that, if we set the validation percentage too high, users would notice the difference and likely turn off the optimized writer or set the validation to 0. That would defeat the purpose of validation, so the number should be low enough that most users are not significantly hindered by it.
I wouldn't set it higher than 10; the current default of 5 is also reasonable. If a bug is going to be found by validation, we would have to be very unlucky to miss it completely even at 5% validation.
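
To put a rough number on "very unlucky", the sketch below computes the chance that at least one written file gets validated. It assumes each file is selected independently with the configured per-file percentage; that independence, and the class and method names, are assumptions for illustration and are not taken from the Trino code.

```java
// Sketch only: assumes every written file is independently chosen for
// validation with the same per-file percentage. Names are hypothetical.
final class ValidationOdds
{
    private ValidationOdds() {}

    // Probability that at least one of fileCount written files is validated.
    static double atLeastOneValidated(double percentage, int fileCount)
    {
        if (percentage < 0 || percentage > 100) {
            throw new IllegalArgumentException("percentage must be in [0, 100]");
        }
        if (fileCount < 0) {
            throw new IllegalArgumentException("fileCount must not be negative");
        }
        // Chance that one particular file is NOT validated.
        double missOne = 1.0 - percentage / 100.0;
        // All files are missed with probability missOne^fileCount.
        return 1.0 - Math.pow(missOne, fileCount);
    }

    public static void main(String[] args)
    {
        for (int files : new int[] {1, 10, 50, 100}) {
            System.out.printf("%d files at 5%%: %.3f%n", files, atLeastOneValidated(5.0, files));
        }
    }
}
```

Under that assumption, a 5% setting validates at least one file with roughly 40% probability after 10 files and roughly 99% after 90 files, so a systematic writer bug is unlikely to go unnoticed across many writes.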

@@ -1637,10 +1637,10 @@ with Parquet files performed by the Hive connector.
     - ``true``
   * - ``parquet.optimized-writer.enabled``
     - Whether the optimized writer is used when writing Parquet files.
-      Set this property to ``true`` to use the optimized parquet writer by
+      Set this property to ``false`` to disable the optimized parquet writer by
Member

I think we can leave "by default" out

Member Author

I agree, though we have that for several other properties as well. I assume that's because the config is setting the default value for the session property, but I agree that documenting it like that is confusing. Let's follow up in a separate PR to remove that for all of them.
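
The relationship described here, where the configuration property only supplies the default for the session property, can be sketched as follows. The class is hypothetical and uses a plain map instead of Trino's session property machinery; only the property name `parquet_optimized_writer_enabled` comes from this PR.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical illustration, not Trino's SPI: the catalog configuration
// provides a default, and a per-session value overrides it when present.
final class WriterSettings
{
    // Default as it would be read from parquet.optimized-writer.enabled.
    private final boolean configDefault;

    WriterSettings(boolean configDefault)
    {
        this.configDefault = configDefault;
    }

    boolean isOptimizedWriterEnabled(Map<String, String> sessionProperties)
    {
        return Optional.ofNullable(sessionProperties.get("parquet_optimized_writer_enabled"))
                .map(Boolean::parseBoolean)
                .orElse(configDefault);
    }

    public static void main(String[] args)
    {
        WriterSettings settings = new WriterSettings(true);
        // No session override: the configured default applies.
        System.out.println(settings.isOptimizedWriterEnabled(Map.of()));
        // Session override: the optimized writer is disabled for this session only.
        System.out.println(settings.isOptimizedWriterEnabled(
                Map.of("parquet_optimized_writer_enabled", "false")));
    }
}
```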

@electrum electrum merged commit 17086d9 into trinodb:master May 12, 2023
@electrum electrum deleted the parquet-writer branch May 12, 2023 19:25
@github-actions github-actions bot added this to the 418 milestone May 12, 2023
Member

@findepi findepi left a comment

"Add constraints for Parquet writer block and pages sizes"

                        false),
                dataSizeProperty(
                        PARQUET_WRITER_PAGE_SIZE,
                        "Parquet: Writer page size",
                        parquetWriterConfig.getPageSize(),
                        value -> {
                            validateMinDataSize(PARQUET_WRITER_PAGE_SIZE, value, DataSize.valueOf(PARQUET_WRITER_MIN_PAGE_SIZE));
Member

Same in DeltaLakeSessionProperties and IcebergSessionProperties
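
The fragment above rejects a session value for the page size when it falls below a minimum. A minimal sketch of that kind of check is shown below. It works on plain byte counts instead of `DataSize`, and the helper name, the error message, and the 8 KiB minimum are all assumptions for illustration; the actual value of `PARQUET_WRITER_MIN_PAGE_SIZE` is not shown in this thread.

```java
// Hypothetical stand-in for a minimum-size validator; not the Trino implementation.
final class PageSizeValidation
{
    // Assumed minimum, for illustration only.
    static final long ASSUMED_MIN_PAGE_SIZE_BYTES = 8 * 1024;

    private PageSizeValidation() {}

    // Returns the value unchanged when it is large enough, otherwise fails fast.
    static long validateMinBytes(String propertyName, long valueBytes, long minBytes)
    {
        if (valueBytes < minBytes) {
            throw new IllegalArgumentException(String.format(
                    "%s must be at least %d bytes, but was %d", propertyName, minBytes, valueBytes));
        }
        return valueBytes;
    }

    public static void main(String[] args)
    {
        // Accepted: 1 MiB is above the assumed minimum.
        System.out.println(validateMinBytes("parquet_writer_page_size", 1024 * 1024, ASSUMED_MIN_PAGE_SIZE_BYTES));
        try {
            // Rejected: 1 KiB is below the assumed minimum.
            validateMinBytes("parquet_writer_page_size", 1024, ASSUMED_MIN_PAGE_SIZE_BYTES);
        }
        catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Running the check when the session property is set means a bad value fails the `SET SESSION` step rather than a later write.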

Member

@findepi findepi left a comment

lgtm

Labels
cla-signed delta-lake Delta Lake connector docs hive Hive connector
5 participants