Optimize writing RLE runs in parquet column descriptors #22089

Merged: 4 commits merged into trinodb:master from raunaq/pqw-opt on May 28, 2024

raunaqmorarka (Member) commented May 23, 2024

Description

Optimize writing RLE runs in parquet column descriptors

Use the nullability information of Blocks to write RLE runs
for repetition and definition levels more efficiently in the Parquet writer
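
To illustrate the idea, here is a minimal sketch of how nullability information can shorten the definition-level write path. It is not the code from this PR: `SimpleBlock`, `RunCollector`, and `writeDefinitionLevels` are hypothetical names invented for the example, and the sketch assumes a top-level optional primitive column, where a null maps to `maxDefinitionLevel - 1`. When the block reports that it cannot contain nulls, all positions share one level, so a single run covers the whole block instead of one write per position.

```java
import java.util.ArrayList;
import java.util.List;

public class DefinitionLevelSketch
{
    // Hypothetical stand-in for a columnar block; this is NOT Trino's Block API.
    public interface SimpleBlock
    {
        int getPositionCount();

        boolean mayHaveNull();

        boolean isNull(int position);
    }

    // Hypothetical run collector: stores {level, count} pairs and counts write calls.
    public static final class RunCollector
    {
        public final List<int[]> runs = new ArrayList<>();
        public int calls;

        public void writeRepeated(int level, int count)
        {
            calls++;
            if (count == 0) {
                return;
            }
            int last = runs.size() - 1;
            if (last >= 0 && runs.get(last)[0] == level) {
                // Same level as the previous run: extend it
                runs.get(last)[1] += count;
            }
            else {
                runs.add(new int[] {level, count});
            }
        }
    }

    // Assumes a top-level optional primitive column: null -> maxDefinitionLevel - 1.
    public static void writeDefinitionLevels(SimpleBlock block, int maxDefinitionLevel, RunCollector out)
    {
        int positions = block.getPositionCount();
        if (!block.mayHaveNull()) {
            // Fast path: no nulls, so one run covers every position
            out.writeRepeated(maxDefinitionLevel, positions);
            return;
        }
        // Slow path: inspect each position individually
        for (int i = 0; i < positions; i++) {
            out.writeRepeated(block.isNull(i) ? maxDefinitionLevel - 1 : maxDefinitionLevel, 1);
        }
    }

    // Builds a block from a null mask; mayHaveNull is supplied by the caller.
    public static SimpleBlock blockOf(boolean mayHaveNull, boolean[] nulls)
    {
        return new SimpleBlock()
        {
            @Override
            public int getPositionCount()
            {
                return nulls.length;
            }

            @Override
            public boolean mayHaveNull()
            {
                return mayHaveNull;
            }

            @Override
            public boolean isNull(int position)
            {
                return nulls[position];
            }
        };
    }

    public static void main(String[] args)
    {
        RunCollector fast = new RunCollector();
        writeDefinitionLevels(blockOf(false, new boolean[1000]), 1, fast);
        System.out.println("non-null block: calls=" + fast.calls + ", runs=" + fast.runs.size());

        RunCollector slow = new RunCollector();
        writeDefinitionLevels(blockOf(true, new boolean[] {false, true, true, false}), 1, slow);
        System.out.println("nullable block: calls=" + slow.calls + ", runs=" + slow.runs.size());
    }
}
```

In this sketch the non-null block of 1000 positions needs one call and produces one run, while the nullable block still pays one call per position. The saving in the real writer depends on how often blocks report no nulls.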

BenchmarkParquetFormat#write UNCOMPRESSED
                          Before                          After
LINEITEM                  293.0MB/s ± 2869.6kB/s (0.96%)  312.9MB/s ± 2869.2kB/s (0.90%) (N = 10, α = 99.9%)
MAP_VARCHAR_DOUBLE        345.4MB/s ± 3275.7kB/s (0.93%)  359.6MB/s ± 5555.4kB/s (1.51%) (N = 10, α = 99.9%)
LARGE_MAP_VARCHAR_DOUBLE  402.0MB/s ± 6815.6kB/s (1.66%)  448.6MB/s ± 4808.3kB/s (1.05%) (N = 10, α = 99.9%)
MAP_INT_DOUBLE            606.2MB/s ± 2136.1kB/s (0.34%)  676.1MB/s ± 5620.8kB/s (0.81%) (N = 10, α = 99.9%)
LARGE_ARRAY_VARCHAR       257.8MB/s ± 9303.4kB/s (3.52%)  275.7MB/s ± 2583.1kB/s (0.91%) (N = 10, α = 99.9%)
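
For readers who want the relative gains rather than raw throughput, the small sketch below derives them from the mean values in the table above. The class and method names (`WriteSpeedup`, `speedupPercent`) are made up for this example; only the before/after numbers come from the benchmark, and the error bars are ignored.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class WriteSpeedup
{
    // Relative throughput gain in percent: (after / before - 1) * 100
    public static double speedupPercent(double before, double after)
    {
        return (after / before - 1.0) * 100.0;
    }

    public static void main(String[] args)
    {
        // Mean throughput in MB/s, copied from the benchmark table: {before, after}
        Map<String, double[]> results = new LinkedHashMap<>();
        results.put("LINEITEM", new double[] {293.0, 312.9});
        results.put("MAP_VARCHAR_DOUBLE", new double[] {345.4, 359.6});
        results.put("LARGE_MAP_VARCHAR_DOUBLE", new double[] {402.0, 448.6});
        results.put("MAP_INT_DOUBLE", new double[] {606.2, 676.1});
        results.put("LARGE_ARRAY_VARCHAR", new double[] {257.8, 275.7});

        for (Map.Entry<String, double[]> entry : results.entrySet()) {
            double[] values = entry.getValue();
            System.out.printf("%-25s %+.1f%%%n", entry.getKey(), speedupPercent(values[0], values[1]));
        }
    }
}
```

By this calculation the gains are roughly 6.8% (LINEITEM), 4.1% (MAP_VARCHAR_DOUBLE), 11.6% (LARGE_MAP_VARCHAR_DOUBLE), 11.5% (MAP_INT_DOUBLE), and 6.9% (LARGE_ARRAY_VARCHAR).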

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Hive, Hudi, Delta Lake, Iceberg
* Improve performance of writing parquet files. ({issue}`22089`)

@cla-bot cla-bot bot added the cla-signed label May 23, 2024
@raunaqmorarka raunaqmorarka changed the title from "Raunaq/pqw opt" to "Optimize writing RLE runs in parquet column descriptors" May 23, 2024
@raunaqmorarka raunaqmorarka marked this pull request as draft May 23, 2024 09:08
@raunaqmorarka raunaqmorarka force-pushed the raunaq/pqw-opt branch 3 times, most recently from 1a3a317 to 5780e73 Compare May 27, 2024 16:36
@raunaqmorarka raunaqmorarka marked this pull request as ready for review May 27, 2024 16:36
raunaqmorarka (Member, Author) commented:

[Screenshot, 2024-05-28 10:42 AM] ~4.2% CPU time improvement in insert benchmarks

Currently we rely on this code from parquet-mr.
Moving it into Trino enables optimization in subsequent work.
@raunaqmorarka raunaqmorarka merged commit db64b88 into trinodb:master May 28, 2024
60 checks passed
@raunaqmorarka raunaqmorarka deleted the raunaq/pqw-opt branch May 28, 2024 20:26
@github-actions github-actions bot added this to the 449 milestone May 28, 2024