Add compression codec configurations for data in exchanges and spills #20274
I'm concerned that zstd uses JNI, which could affect GC. @hackeryang
I don't think we should switch by default. We could make it configurable.
Force-pushed from 4cf6a5f to f0ee5f4.
This uses the airlift implementation, which is pure Java code rather than JNI, unless I'm mistaken about that.
That's a comparison between ZSTD and GZIP, both of which prioritise achieving higher compression over CPU efficiency. ZSTD is the better of the two, but that does not automatically make it the better choice in every scenario compared to lighter compression schemes like SNAPPY and LZ4.
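The ratio-versus-CPU trade-off mentioned here can be observed with a small experiment. The sketch below is only a stand-in: it uses the JDK's `java.util.zip.Deflater` at two levels (`BEST_SPEED` and `BEST_COMPRESSION`) rather than the LZ4 or ZSTD codecs discussed in this PR, and the class name `CompressionLevelTradeoff` and the sample data are made up for illustration. Timings depend on the machine and JIT warm-up, so no expected output is given.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper, not part of Trino: compares a fast and a thorough
// Deflater level as a stand-in for "light" and "heavy" codecs.
final class CompressionLevelTradeoff
{
    private CompressionLevelTradeoff() {}

    // Deterministic, repetitive sample data (real exchange pages look different).
    static byte[] sampleData(int rows)
    {
        StringBuilder builder = new StringBuilder();
        for (int i = 0; i < rows; i++) {
            builder.append("row-").append(i % 100).append(",status=OK\n");
        }
        return builder.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Compress the whole input at the given Deflater level.
    static byte[] compress(byte[] input, int level)
    {
        Deflater deflater = new Deflater(level);
        try {
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            while (!deflater.finished()) {
                int written = deflater.deflate(buffer);
                out.write(buffer, 0, written);
            }
            return out.toByteArray();
        }
        finally {
            deflater.end();
        }
    }

    // Decompress into a buffer of the known original length.
    static byte[] decompress(byte[] compressed, int originalLength)
    {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            byte[] result = new byte[originalLength];
            int offset = 0;
            while (offset < originalLength) {
                int read = inflater.inflate(result, offset, originalLength - offset);
                if (read == 0) {
                    throw new IllegalStateException("Compressed data ended early");
                }
                offset += read;
            }
            return result;
        }
        catch (DataFormatException e) {
            throw new IllegalStateException("Corrupt compressed data", e);
        }
        finally {
            inflater.end();
        }
    }

    public static void main(String[] args)
    {
        byte[] data = sampleData(200_000);
        for (int level : new int[] {Deflater.BEST_SPEED, Deflater.BEST_COMPRESSION}) {
            long start = System.nanoTime();
            byte[] compressed = compress(data, level);
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.printf("level=%d in=%d out=%d time=%dus%n", level, data.length, compressed.length, micros);
        }
    }
}
```

A single timed run like this is only indicative; a proper comparison of exchange codecs would need a benchmark harness and representative page data.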
@raunaqmorarka I assumed that the zstd codec was using zstd-jni underneath, my mistake.
@wendigo @raunaqmorarka you are right, zstd may not be a silver bullet for all situations~ I have modified the PR description and added a configuration with a session property to switch between different codecs. Please take another look; the CI errors do not seem to be related to our PR.
Force-pushed from f0ee5f4 to 3a271b5.
Nice
Force-pushed from 3a271b5 to 1052860.
@@ -81,6 +83,7 @@ public class FeaturesConfig
     * default value is overwritten for fault tolerant execution in {@link #applyFaultTolerantExecutionDefaults()}}
     */
    private boolean exchangeCompressionEnabled;
    private CompressionCodec exchangeCompressionCodec = LZ4;
I ran the TPCDS sf1000 benchmark on unpartitioned hive tables with ZSTD exchange compression in pipelined execution mode. The setup is a 6-node cluster of r6g.8xlarge machines on AWS.
TPCDS unpartitioned parquet zstd exchange.pdf
Even though there is a big reduction in the internal network data exchanged, there are still latency regressions due to the high CPU cost of ZSTD compression.
I think this trade-off of more aggressive compression at a high CPU cost doesn't make much sense for pipelined execution mode, where we exchange data directly over the internal network.
This trade-off might make sense for filesystem-based exchange in fault tolerant execution mode, but that also needs to be assessed with a benchmark.
@losipiuk can we make these changes in a way where this config becomes specific to filesystem exchange in FTE?
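The trade-off described in this comment can be illustrated with a back-of-envelope model. The sketch below is an assumption-laden simplification, not a description of how Trino schedules exchanges: it treats compression and transfer as strictly sequential, ignores decompression cost, and every number in `main` (ratios, throughputs, bandwidths) is hypothetical rather than taken from the attached benchmark. The class and method names are invented for the example.

```java
// Hypothetical cost model, not Trino code: time to ship data is the time
// to compress it plus the time to send the compressed bytes.
final class ExchangeCostModel
{
    private ExchangeCostModel() {}

    static double transferSeconds(
            double uncompressedMegabytes,
            double compressionRatio,           // uncompressed size / compressed size
            double compressMegabytesPerSecond, // codec throughput on one core
            double networkMegabytesPerSecond)  // available bandwidth
    {
        double compressSeconds = uncompressedMegabytes / compressMegabytesPerSecond;
        double networkSeconds = (uncompressedMegabytes / compressionRatio) / networkMegabytesPerSecond;
        return compressSeconds + networkSeconds;
    }

    public static void main(String[] args)
    {
        double dataMegabytes = 1000;

        // Hypothetical codec characteristics: "light" is fast with a modest ratio,
        // "heavy" is slower with a better ratio.
        double lightRatio = 2.0;
        double lightThroughput = 500;
        double heavyRatio = 3.0;
        double heavyThroughput = 100;

        // Two hypothetical networks: roughly 10 Gbit/s and roughly 100 Mbit/s.
        for (double network : new double[] {1250, 12.5}) {
            double light = transferSeconds(dataMegabytes, lightRatio, lightThroughput, network);
            double heavy = transferSeconds(dataMegabytes, heavyRatio, heavyThroughput, network);
            System.out.printf("network=%.1f MB/s light=%.2fs heavy=%.2fs%n", network, light, heavy);
        }
    }
}
```

With these made-up inputs the lighter codec wins on the fast network and the heavier codec wins on the slow one, which is the shape of the argument above. Real clusters overlap CPU and network work, so only a benchmark can settle the question for a given deployment.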
Benchmark results with FTE
FTE unpartitioned hive parquet sf1k zstd compression.pdf
FTE is using LZ4 exchange compression by default already. Switching to ZSTD improves some queries but it's significantly worse overall. So I'm not seeing a compelling reason to add this option even for FTE.
You are right. According to the benchmark results, ZSTD should only be used in clusters with limited network bandwidth.
We found that some of our company's gaming analytics customers have clusters with strong CPUs and poor network cards; maybe under these conditions we can use it~
I saw that Huawei openLooKeng (originated from Trino 350) uses ZSTD in exchanges, but it was for their datacenter connector.
Force-pushed from cb42cb6 to 83c07ef.
Force-pushed from 83c07ef to 0df8a6b.
It seems CI has passed (except for the unrelated errors). @raunaqmorarka, could you please help merge? Thanks~
Please fix the compile error on the first commit.
Force-pushed from 0df8a6b to 09c1c4f.
@raunaqmorarka I have fixed the error, please take another look~
We still need a fix for the failing test TestJdbcConnection.testSession.
Force-pushed from 09c1c4f to 52b50a7.
Force-pushed from 52b50a7 to 3b10b19.
@raunaqmorarka Thanks for pointing this out~ I have fixed this test failure, please review again~ The error this time is about Kudu and does not seem to be related to our PR.
LGTM % two comments
Force-pushed from 3b10b19 to 5938b7a.
Description
ZSTD is known in our community to offer better speed and compression ratio in many scenarios, for example: #10058
Many blogs and test reports also state that ZSTD is better than ZLIB in most cases: https://databento.com/blog/zstd-vs-zlib
We found that page serialization and deserialization also perform better with ZSTD, and Fault Tolerant Execution also uses compressed pages: #14957
The benchmark results are shown below. With compression enabled, Zstd appears to be about 10X faster than Lz4.
Because ZSTD may not be the best choice for all scenarios, and we want to accommodate the needs of different users, we added a session property and a configuration option to switch between several compression codecs. Lz4 remains the default when compression is enabled.
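As a rough illustration of the switch described above, the sketch below resolves a codec from a per-session override and falls back to a configured default. It is not the code from this PR: the enum `SketchCodec`, the class `CodecSelectionSketch`, and the property key `exchange_compression_codec` are placeholders chosen for the example, and the real session property and config names may differ.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch, not the PR's implementation.
final class CodecSelectionSketch
{
    private CodecSelectionSketch() {}

    // Placeholder codec list; the set of codecs in the PR may differ.
    enum SketchCodec { NONE, LZ4, ZSTD }

    // Placeholder property key, assumed for illustration only.
    static final String PROPERTY = "exchange_compression_codec";

    // Session value wins when present; otherwise the configured default applies.
    static SketchCodec resolve(Map<String, String> sessionProperties, SketchCodec configuredDefault)
    {
        String raw = sessionProperties.get(PROPERTY);
        if (raw == null || raw.isBlank()) {
            return configuredDefault;
        }
        try {
            return SketchCodec.valueOf(raw.trim().toUpperCase(Locale.ROOT));
        }
        catch (IllegalArgumentException e) {
            throw new IllegalArgumentException("Unknown compression codec: " + raw, e);
        }
    }

    public static void main(String[] args)
    {
        // No override: the default (LZ4 here, matching the PR's stated default) is used.
        System.out.println(resolve(Map.of(), SketchCodec.LZ4));
        // Override supplied for this session.
        System.out.println(resolve(Map.of(PROPERTY, "zstd"), SketchCodec.LZ4));
    }
}
```

Keeping the default in one place and validating the override eagerly means a mistyped codec name fails at query start rather than during page serialization.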
Additional context and related issues
Lz4 Compression
Zstd Compression
Benchmarks of exchanges between nodes in a real cluster on AWS, thanks to Mr Raunaq: #20274 (comment)
Our discussion about when to use ZSTD: https://trinodb.slack.com/archives/CP1MUNEUX/p1704978337669429
Snappy VS ZSTD: https://www.percona.com/blog/compression-methods-in-mongodb-snappy-vs-zstd/
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text: