Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add compression codec configurations for data in exchanges and spills #20274

Merged
merged 2 commits into from
Jan 19, 2024

Conversation

hackeryang
Copy link
Member

@hackeryang hackeryang commented Jan 4, 2024

Description

Our community knows that ZSTD compression is better about speed and compression ratio in many scenarios, for example: #10058
Many blogs and test reports on the internet also mentioned that ZSTD is better than ZLIB in most cases: https://databento.com/blog/zstd-vs-zlib

We found that if page serialization and deserialization also use ZSTD, the compression and decompression performance is also better, and Fault Tolerant Execution also uses compressed pages: #14957

The benchmark result is shown as below, It seems that if compression is enabled, Zstd is about 10X faster than Lz4.

Because ZSTD may not be a best choice for all scenarios, and we want to take into account the demands of different users, we added a session property and configuration to switch between several compresion codecs, and still use Lz4 by default when compression is enabled.

Additional context and related issues

Lz4 Compression

Benchmark                        (compressed)  (encrypted)  (randomSeed)   Mode  Cnt     Score     Error  Units
BenchmarkPagesSerde.deserialize          true         true          1000  thrpt   10   104.194 ±   4.508  ops/s
BenchmarkPagesSerde.deserialize          true        false          1000  thrpt   10  4770.274 ± 106.577  ops/s
BenchmarkPagesSerde.deserialize         false         true          1000  thrpt   10    83.668 ±   3.172  ops/s
BenchmarkPagesSerde.deserialize         false        false          1000  thrpt   10  8175.870 ± 423.283  ops/s
BenchmarkPagesSerde.serialize            true         true          1000  thrpt   10    90.786 ±  18.355  ops/s
BenchmarkPagesSerde.serialize            true        false          1000  thrpt   10   952.958 ±  31.424  ops/s
BenchmarkPagesSerde.serialize           false         true          1000  thrpt   10    85.247 ±   1.277  ops/s
BenchmarkPagesSerde.serialize           false        false          1000  thrpt   10  8060.584 ± 119.043  ops/s

Zstd Compression

Benchmark                        (compressed)  (encrypted)  (randomSeed)   Mode  Cnt      Score      Error  Units
BenchmarkPagesSerde.deserialize          true         true          1000  thrpt   10    102.974 ±    1.746  ops/s
BenchmarkPagesSerde.deserialize          true        false          1000  thrpt   10    373.737 ±   55.635  ops/s
BenchmarkPagesSerde.deserialize         false         true          1000  thrpt   10     85.315 ±    0.897  ops/s
BenchmarkPagesSerde.deserialize         false        false          1000  thrpt   10  10393.729 ± 1296.839  ops/s
BenchmarkPagesSerde.serialize            true         true          1000  thrpt   10     44.524 ±    0.103  ops/s
BenchmarkPagesSerde.serialize            true        false          1000  thrpt   10     65.594 ±    0.373  ops/s
BenchmarkPagesSerde.serialize           false         true          1000  thrpt   10     83.790 ±    2.491  ops/s
BenchmarkPagesSerde.serialize           false        false          1000  thrpt   10   7675.937 ±  347.427  ops/s

Benchmarks of exchanges between nodes in a real cluster on AWS, thanks to Mr Raunaq: #20274 (comment)
Our discussion about when to use ZSTD: https://trinodb.slack.com/archives/CP1MUNEUX/p1704978337669429
Snappy VS ZSTD: https://www.percona.com/blog/compression-methods-in-mongodb-snappy-vs-zstd/

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# General
* The configuration property `exchange.compression-enabled` is now replaced by `exchange.compression-codec` which allows specifying `NONE`, `LZ4` or `ZSTD` as the compression codec for data exchanged between workers. The corresponding system session property `exchange_compression` is replaced by `exchange_compression_codec`. ({issue}`20274`)
* The configuration property `spill-compression-enabled` is now replaced by `spill-compression-codec` which allows specifying `NONE`, `LZ4` or `ZSTD` as the compression codec for data spilled to disk. The already deprecated configuration property `experimental.spill-compression-enabled` is now defunct and must be removed from server configurations. ({issue}`20274`)

@cla-bot cla-bot bot added the cla-signed label Jan 4, 2024
@hackeryang hackeryang self-assigned this Jan 4, 2024
@hackeryang
Copy link
Member Author

Hello @findepi @wendigo @losipiuk , can you please take a review when you have time, thank you~

@wendigo
Copy link
Contributor

wendigo commented Jan 4, 2024

I'm concerned that zstd is using jni which could affect gc @hackeryang

@wendigo
Copy link
Contributor

wendigo commented Jan 4, 2024

I don't think that we should switch by default. We could make it configurable

@findepi
Copy link
Member

findepi commented Jan 4, 2024

cc @raunaqmorarka

@hackeryang hackeryang force-pushed the use_zstd_in_serialization branch from 4cf6a5f to f0ee5f4 Compare January 4, 2024 09:37
@hackeryang hackeryang changed the title Use ZSTD in page serialization and deserialization to improve performance Add compression kind configuration for page serializations, exchanges and spills Jan 4, 2024
@raunaqmorarka
Copy link
Member

I'm concerned that zstd is using jni which could affect gc @hackeryang

This is using the airlift implementation which is pure java code rather than jni unless I'm mistaken about that

@raunaqmorarka
Copy link
Member

raunaqmorarka commented Jan 4, 2024

Our community knows that ZSTD compression is better about speed and compression ratio in most scenarios, for example: #10058

That's a comparison between ZSTD and GZIP, both these prioritise achieving higher compression over CPU efficiency. ZSTD is better between these two, that does not automatically make it the better choice for every scenario compared to lighter compression schemes like SNAPPY and LZ4.
ZSTD makes sense for data files or exchange data stored on S3. But it's not clear that it would be the best choice for data exchanged over the network. With a high speed and bandwidth network we're better off using less CPU heavy compression. The existing LZ4 exchange compression wasn't turned on by default for pipelined execution because its impact is workload and setup dependent, it makes some queries faster and some slower.
This needs benchmarking in an actual Trino cluster rather than JMH or generic compression/decompression speed tests to assess properly.

@wendigo
Copy link
Contributor

wendigo commented Jan 4, 2024

@raunaqmorarka I've assumed that zstd codec is using zstd-jni underneath, my mistake

@hackeryang
Copy link
Member Author

hackeryang commented Jan 4, 2024

Our community knows that ZSTD compression is better about speed and compression ratio in most scenarios, for example: #10058

That's a comparison between ZSTD and GZIP, both these prioritise achieving higher compression over CPU efficiency. ZSTD is better between these two, that does not automatically make it the better choice for every scenario compared to lighter compression schemes like SNAPPY and LZ4. ZSTD makes sense for data files or exchange data stored on S3. But it's not clear that it would be the best choice for data exchanged over the network. With a high speed and bandwidth network we're better off using less CPU heavy compression. The existing LZ4 exchange compression wasn't turned on by default for pipelined execution because its impact is workload and setup dependent, it makes some queries faster and some slower. This needs benchmarking in an actual Trino cluster rather than JMH or generic compression/decompression speed tests to assess properly.

@wendigo @raunaqmorarka you are right, zstd may not be a silver bullet for all situations~

I have modified the pr description, and added a configuration with a session property to switch between different codecs, please cc again, it seems that the CICD errors are not related to our PR

@hackeryang hackeryang changed the title Add compression kind configuration for page serializations, exchanges and spills Add compression codec configurations for data in exchanges and spills Jan 4, 2024
@hackeryang hackeryang force-pushed the use_zstd_in_serialization branch from f0ee5f4 to 3a271b5 Compare January 4, 2024 12:21
@hackeryang hackeryang requested a review from wendigo January 4, 2024 15:52
Copy link
Member

@losipiuk losipiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice

@hackeryang hackeryang force-pushed the use_zstd_in_serialization branch from 3a271b5 to 1052860 Compare January 5, 2024 07:19
@hackeryang hackeryang requested review from losipiuk and wendigo January 5, 2024 08:30
@hackeryang
Copy link
Member Author

hackeryang commented Jan 5, 2024

@wendigo @losipiuk Thank you for your advice~ I have modified some codes, and the 2 test errors seemed not related to our PR, please cc again

core/trino-main/src/main/java/io/trino/FeaturesConfig.java Outdated Show resolved Hide resolved
@@ -81,6 +83,7 @@ public class FeaturesConfig
* default value is overwritten for fault tolerant execution in {@link #applyFaultTolerantExecutionDefaults()}}
*/
private boolean exchangeCompressionEnabled;
private CompressionCodec exchangeCompressionCodec = LZ4;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran TPCDS sf1000 benchmark on unpartitioned hive tables with ZSTD exchange compression in pipelined execution mode. The setup is a 6 node cluster of r6g.8xlarge machines on AWS.

TPCDS unpartitioned parquet zstd exchange.pdf

Screenshot 2024-01-05 at 2 18 59 PM

Even though there is a big reduction in internal network data exchanged, there are still regressions in latency due to the high CPU cost of ZSTD compression.
I think this trade-off of more aggressive compression at high CPU cost doesn't make much sense for pipelined execution mode where we exchange data over internal network directly.
This trade off might make sense for filesystem based exchange in fault tolerant execution mode, but that also needs to be assessed with a benchmark.
@losipiuk can we do these changes in way where this config becomes specific to filesystem exchange in FTE ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Benchmark results with FTE
FTE unpartitioned hive parquet sf1k zstd compression.pdf

Screenshot 2024-01-09 at 1 26 54 PM

FTE is using LZ4 exchange compression by default already. Switching to ZSTD improves some queries but it's significantly worse overall. So I'm not seeing a compelling reason to add this option even for FTE.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Benchmark results with FTE FTE unpartitioned hive parquet sf1k zstd compression.pdf

Screenshot 2024-01-09 at 1 26 54 PM FTE is using LZ4 exchange compression by default already. Switching to ZSTD improves some queries but it's significantly worse overall. So I'm not seeing a compelling reason to add this option even for FTE.

You are right, according to the benchmark results, ZSTD should only be used in clusters with little network bandwidth.

We found that some gaming analytics customers of our company have clusters with strong CPU and poor network cards, maybe in this condition we can use it~

I saw that Huawei OpenLookeng(originated from Trino 350) uses ZSTD in exchanges, but it was for their datacenter connector:
image

@hackeryang hackeryang requested a review from findinpath January 12, 2024 07:38
@hackeryang hackeryang force-pushed the use_zstd_in_serialization branch 4 times, most recently from cb42cb6 to 83c07ef Compare January 16, 2024 10:39
@hackeryang hackeryang force-pushed the use_zstd_in_serialization branch from 83c07ef to 0df8a6b Compare January 17, 2024 06:05
@hackeryang
Copy link
Member Author

It seems CI has been passed(except the irrelevant errors), @raunaqmorarka can you please help merge, thanks~

@raunaqmorarka
Copy link
Member

It seems CI has been passed(except the irrelevant errors), @raunaqmorarka can you please help merge, thanks~

Please fix the compile error on the first commit

@hackeryang hackeryang force-pushed the use_zstd_in_serialization branch from 0df8a6b to 09c1c4f Compare January 18, 2024 06:43
@hackeryang
Copy link
Member Author

It seems CI has been passed(except the irrelevant errors), @raunaqmorarka can you please help merge, thanks~

Please fix the compile error on the first commit

@raunaqmorarka I have fixed the error, please cc again~

@raunaqmorarka
Copy link
Member

It seems CI has been passed(except the irrelevant errors), @raunaqmorarka can you please help merge, thanks~

Please fix the compile error on the first commit

@raunaqmorarka I have fixed the error, please cc again~

We still need a fix to the failing test TestJdbcConnection.testSession

@hackeryang hackeryang force-pushed the use_zstd_in_serialization branch from 09c1c4f to 52b50a7 Compare January 18, 2024 08:41
@github-actions github-actions bot added the jdbc Relates to Trino JDBC driver label Jan 18, 2024
@hackeryang hackeryang force-pushed the use_zstd_in_serialization branch from 52b50a7 to 3b10b19 Compare January 18, 2024 09:13
@hackeryang
Copy link
Member Author

It seems CI has been passed(except the irrelevant errors), @raunaqmorarka can you please help merge, thanks~

Please fix the compile error on the first commit

@raunaqmorarka I have fixed the error, please cc again~

We still need a fix to the failing test TestJdbcConnection.testSession

@raunaqmorarka Thanks for pointing this out~ I have fixed this test failure, please review again~

The error this time is about Kudu, seems not related to our PR

Copy link
Contributor

@wendigo wendigo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM % two comments

@hackeryang hackeryang force-pushed the use_zstd_in_serialization branch from 3b10b19 to 5938b7a Compare January 19, 2024 02:18
@raunaqmorarka raunaqmorarka merged commit 1bb096b into trinodb:master Jan 19, 2024
91 checks passed
@github-actions github-actions bot added this to the 437 milestone Jan 19, 2024
@hackeryang hackeryang deleted the use_zstd_in_serialization branch January 19, 2024 04:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed docs jdbc Relates to Trino JDBC driver performance
Development

Successfully merging this pull request may close these issues.

7 participants