-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable Hive optimized Parquet writer by default #17393
Conversation
lib/trino-parquet/src/main/java/io/trino/parquet/writer/ParquetWriterOptions.java
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/writer/ParquetWriterOptions.java
Show resolved
Hide resolved
...rino-hive/src/test/java/io/trino/plugin/hive/TestParquetPageSkippingWithOptimizedReader.java
Outdated
Show resolved
Hide resolved
plugin/trino-delta-lake/src/test/java/io/trino/plugin/deltalake/TestDeltaLakeDelete.java
Outdated
Show resolved
Hide resolved
We also need an update to |
@raunaqmorarka Thanks for the fast review. I updated the PR. |
The CI workflow run with tests that require additional secrets finished as failure: https://github.com/trinodb/trino/actions/runs/4922568047 |
@electrum should we increase Parquet validation? Right now it's 5% in Hive. |
@sopel39 That's a sampling percentage per file, so it should trigger for most queries. We could increase if you think it's valuable. |
@electrum I will defer to @raunaqmorarka for validation % as he did most of writer improvements recently. I was thinking about something more conservative like 100% for few releases as potential implications of bad writers would be considerable. |
The performance and cost (S3 requests cost $) implications would be significant enough that if we set the validation percentage too high, users would notice the difference and likely turn off the optimized writer or set the validation to 0. That would defeat the purpose of validation. So the number should be low enough that most users are not significantly hindered by it. |
@@ -1637,10 +1637,10 @@ with Parquet files performed by the Hive connector. | |||
- ``true`` | |||
* - ``parquet.optimized-writer.enabled`` | |||
- Whether the optimized writer is used when writing Parquet files. | |||
Set this property to ``true`` to use the optimized parquet writer by | |||
Set this property to ``false`` to disable the optimized parquet writer by |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can leave "by default" out
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, though we have that for several other properties as well. I assume that's because the config is setting the default value for the session property, but I agree that documenting it like that is confusing. Let's follow up in a separate PR to remove that for all of them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Add constraints for Parquet writer block and pages sizes"
false), | ||
dataSizeProperty( | ||
PARQUET_WRITER_PAGE_SIZE, | ||
"Parquet: Writer page size", | ||
parquetWriterConfig.getPageSize(), | ||
value -> { | ||
validateMinDataSize(PARQUET_WRITER_PAGE_SIZE, value, DataSize.valueOf(PARQUET_WRITER_MIN_PAGE_SIZE)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same in DeltaLakeSessionProperties and IcebergSessionProperties
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Release notes
(x) Release notes are required, with the following suggested text: