-
Notifications
You must be signed in to change notification settings - Fork 3.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use ChunkedSliceOutput to chunk and buffer writes in parquet #18564
Conversation
Allows simplifying some of the code working with pages sizes to use ints instead of longs
846c8da
to
0720356
Compare
lib/trino-parquet/src/main/java/io/trino/parquet/writer/PrimitiveColumnWriter.java
Show resolved
Hide resolved
not sure i understand the goal here. can you expand a bit the commit rationale so that it's more obvious what's the objective? |
Yes, there is varying levels of buffering in different output stream implementations. It's not clear to me that every implementation is doing that or that we should rely on that always being the case. In any case, that's not the main benefit of this change. The main objective here is
Also, ORC writer uses |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm % comments
lib/trino-parquet/src/main/java/io/trino/parquet/writer/PrimitiveColumnWriter.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/writer/ParquetWriter.java
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/writer/PrimitiveColumnWriter.java
Show resolved
Hide resolved
0720356
to
47d6c15
Compare
lib/trino-parquet/src/main/java/io/trino/parquet/writer/ParquetDataOutput.java
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/writer/PrimitiveColumnWriter.java
Outdated
Show resolved
Hide resolved
@@ -297,6 +297,7 @@ public long getBufferedBytes() | |||
public long getRetainedBytes() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it called after a flush
? (is there something like a flush here)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no "flush" here, there is a "close" followed by a getBuffer
to extract the buffered pages.
This is called by connector page sink for every writer after writing each page.
lib/trino-parquet/src/main/java/io/trino/parquet/writer/ParquetDataOutput.java
Show resolved
Hide resolved
f1df74a
to
dc324d9
Compare
Avoids separate calls to output stream for writing page headers and page data Avoids holding on to extra memory for each buffered page due to over-allocation by compressors
dc324d9
to
f01d196
Compare
Description
Avoids separate calls to output stream for writing page headers and page data.
Avoids holding on to extra memory for each buffered page due to over-allocation by compressors.
Adds memory usage accounting for buffered pages.
Additional context and related issues
Should help with the memory usage accounting problems described in #18557
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text: