Use ZSTD by default in Iceberg #10045

Merged

Conversation

findepi (Member) commented Nov 23, 2021

fixes #10058
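For context on what "use ZSTD by default" amounts to: the connector's default compression codec flips from GZIP to ZSTD. A minimal sketch in Trino's usual Airlift config-binding style (the class shape and the inlined enum are simplifications for illustration, not the verbatim diff):

```java
import io.airlift.configuration.Config;

public class IcebergConfig
{
    // Simplified stand-in; the real enum lives elsewhere in the Trino codebase.
    public enum HiveCompressionCodec { NONE, SNAPPY, LZ4, ZSTD, GZIP }

    // The default flips from GZIP to ZSTD; everything else stays the same.
    private HiveCompressionCodec compressionCodec = HiveCompressionCodec.ZSTD;

    public HiveCompressionCodec getCompressionCodec()
    {
        return compressionCodec;
    }

    @Config("iceberg.compression-codec")
    public IcebergConfig setCompressionCodec(HiveCompressionCodec compressionCodec)
    {
        this.compressionCodec = compressionCodec;
        return this;
    }
}
```

Deployments that need the previous behavior can still pin the old codec via the iceberg.compression-codec catalog property.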

findepi requested review from losipiuk and phd3 on November 23, 2021
phd3 (Member) commented Nov 24, 2021

The failure seems related. Also, it would be good to update https://trino.io/docs/current/connector/iceberg.html#configuration. Are we also planning to do the same in the Hive connector, given that ZSTD is superior? Or could that throw off old readers?

findepi (Member, Author) commented Nov 24, 2021

> Are we also planning to do the same in the Hive connector, given that ZSTD is superior?

Yes, except Hive has more compatibility cruft; see #9773.

> Or could that throw off old readers?

Exactly.

> Also, it would be good to update https://trino.io/docs/current/connector/iceberg.html#configuration.

Thanks, done.

> The failure seems related.

Indeed:

2021-11-23T14:01:58.4024515Z spark               | java.lang.NoSuchMethodError: 'void com.github.luben.zstd.ZstdInputStream.<init>(java.io.InputStream, com.github.luben.zstd.BufferPool)'
2021-11-23T14:01:58.4027915Z spark               | 	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.codec.ZstdDecompressorStream.<init>(ZstdDecompressorStream.java:39)
2021-11-23T14:01:58.4032354Z spark               | 	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.codec.ZstandardCodec.createInputStream(ZstandardCodec.java:94)
2021-11-23T14:01:58.4039042Z spark               | 	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.codec.ZstandardCodec.createInputStream(ZstandardCodec.java:83)
2021-11-23T14:01:58.4042989Z spark               | 	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.CodecFactory$HeapBytesDecompressor.decompress(CodecFactory.java:111)
2021-11-23T14:01:58.4047031Z spark               | 	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ColumnChunkPageReadStore$ColumnChunkPageReader$1.visit(ColumnChunkPageReadStore.java:139)
2021-11-23T14:01:58.4051200Z spark               | 	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ColumnChunkPageReadStore$ColumnChunkPageReader$1.visit(ColumnChunkPageReadStore.java:131)
2021-11-23T14:01:58.4055699Z spark               | 	at org.apache.iceberg.shaded.org.apache.parquet.column.page.DataPageV1.accept(DataPageV1.java:120)
2021-11-23T14:01:58.4060786Z spark               | 	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ColumnChunkPageReadStore$ColumnChunkPageReader.readPage(ColumnChunkPageReadStore.java:131)
2021-11-23T14:01:58.4064576Z spark               | 	at org.apache.iceberg.parquet.BaseColumnIterator.advance(BaseColumnIterator.java:60)
2021-11-23T14:01:58.4066718Z spark               | 	at org.apache.iceberg.parquet.BaseColumnIterator.setPageSource(BaseColumnIterator.java:51)
2021-11-23T14:01:58.4069002Z spark               | 	at org.apache.iceberg.parquet.ParquetValueReaders$PrimitiveReader.setPageSource(ParquetValueReaders.java:185)
2021-11-23T14:01:58.4071208Z spark               | 	at org.apache.iceberg.parquet.ParquetValueReaders$OptionReader.setPageSource(ParquetValueReaders.java:369)
2021-11-23T14:01:58.4073605Z spark               | 	at org.apache.iceberg.parquet.ParquetValueReaders$StructReader.setPageSource(ParquetValueReaders.java:685)
2021-11-23T14:01:58.4075739Z spark               | 	at org.apache.iceberg.parquet.ParquetReader$FileIterator.advance(ParquetReader.java:142)
2021-11-23T14:01:58.4077522Z spark               | 	at org.apache.iceberg.parquet.ParquetReader$FileIterator.next(ParquetReader.java:112)
2021-11-23T14:01:58.4079028Z spark               | 	at org.apache.iceberg.io.FilterIterator.advance(FilterIterator.java:66)
2021-11-23T14:01:58.4080487Z spark               | 	at org.apache.iceberg.io.FilterIterator.hasNext(FilterIterator.java:50)
2021-11-23T14:01:58.4082076Z spark               | 	at org.apache.iceberg.spark.source.BaseDataReader.next(BaseDataReader.java:87)
2021-11-23T14:01:58.4084375Z spark               | 	at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:79)
2021-11-23T14:01:58.4087382Z spark               | 	at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:112)
2021-11-23T14:01:58.4091246Z spark               | 	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
2021-11-23T14:01:58.4092615Z spark               | 	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
2021-11-23T14:01:58.4120219Z spark               | 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
2021-11-23T14:01:58.4122908Z spark               | 	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
2021-11-23T14:01:58.4125576Z spark               | 	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
2021-11-23T14:01:58.4127464Z spark               | 	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
2021-11-23T14:01:58.4128849Z spark               | 	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872)
2021-11-23T14:01:58.4130030Z spark               | 	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872)
2021-11-23T14:01:58.4131443Z spark               | 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
2021-11-23T14:01:58.4133051Z spark               | 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
2021-11-23T14:01:58.4136031Z spark               | 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
2021-11-23T14:01:58.4137546Z spark               | 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
2021-11-23T14:01:58.4139490Z spark               | 	at org.apache.spark.scheduler.Task.run(Task.scala:127)
2021-11-23T14:01:58.4141318Z spark               | 	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:444)
2021-11-23T14:01:58.4142916Z spark               | 	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
2021-11-23T14:01:58.4144102Z spark               | 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:447)
2021-11-23T14:01:58.4145708Z spark               | 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
2021-11-23T14:01:58.4147516Z spark               | 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
2021-11-23T14:01:58.4148740Z spark               | 	at java.base/java.lang.Thread.run(Thread.java:829)
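The NoSuchMethodError is a link-time mismatch: the shaded Parquet bundled in iceberg-spark was compiled against a zstd-jni that provides the ZstdInputStream(InputStream, BufferPool) constructor, while the zstd-jni actually on Spark's classpath predates it. A minimal diagnostic sketch (hypothetical; not part of this PR) that checks which flavor is being linked:

```java
import java.io.InputStream;

import com.github.luben.zstd.ZstdInputStream;

// Hypothetical classpath diagnostic, not part of this PR: reflectively look for the
// two-argument constructor that the shaded Parquet code calls. If it is absent, the
// zstd-jni on the classpath is too old and ZSTD Parquet reads fail exactly as above.
public class ZstdJniLinkCheck
{
    public static void main(String[] args)
    {
        try {
            Class<?> bufferPool = Class.forName("com.github.luben.zstd.BufferPool");
            ZstdInputStream.class.getConstructor(InputStream.class, bufferPool);
            System.out.println("zstd-jni provides ZstdInputStream(InputStream, BufferPool): new enough");
        }
        catch (ClassNotFoundException | NoSuchMethodException e) {
            System.out.println("zstd-jni is too old for the shaded Parquet reader: " + e);
        }
    }
}
```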

cla-bot added the cla-signed label on Nov 24, 2021
findepi force-pushed the findepi/use-zstd-by-default-in-iceberg-c091ac branch from a15e6b7 to 54cc00d on November 29, 2021
findepi (Member, Author) commented Nov 29, 2021

(just rebased)

findepi (Member, Author) commented Nov 29, 2021

Reported the problem as apache/iceberg#3621, but it might be as simple as a version bump in our test setup.

pom.xml (outdated)

@@ -70,7 +70,7 @@
     <!-- TODO(https://github.com/airlift/airbase/pull/281): Required by testcontainers, remove when pulled from Airbase -->
     <dep.slf4j.version>1.7.32</dep.slf4j.version>
 
-    <dep.docker.images.version>52</dep.docker.images.version>
+    <dep.docker.images.version>03aecb7</dep.docker.images.version>
findepi (Member, Author) commented Nov 30, 2021

39 successful and 2 failing checks:

ci / pt (hdp3, suite-2, 11, false)
ci / pt (default, suite-7-non-generic, 11)

The suite-7-non-generic failure is an Iceberg test failure: an expected exception message mismatch, probably due to the Spark version change.
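For illustration, such failures usually have this shape: the test pins exact exception wording, so a Spark upgrade that rephrases the message fails the assertion even though the behavior under test is unchanged. Everything below (class, helper, message text) is hypothetical, not the actual failing test:

```java
import static org.assertj.core.api.Assertions.assertThatThrownBy;

// Hypothetical sketch of the failure mode, not the actual failing test.
class SparkExceptionMessageSketch
{
    void unsupportedStatementFails()
    {
        assertThatThrownBy(() -> runOnSpark("some unsupported statement"))
                // Pinned wording from the previous Spark release; must be updated after the bump.
                .hasMessageContaining("wording from the old Spark release");
    }

    // Hypothetical helper standing in for the product-test harness; simulates the
    // new Spark release rewording the exception, which makes the assertion fail.
    private void runOnSpark(String sql)
    {
        throw new UnsupportedOperationException("reworded message from the new Spark release");
    }
}
```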

findepi force-pushed the findepi/use-zstd-by-default-in-iceberg-c091ac branch from d34c631 to 163be63 on November 30, 2021
This fixes Iceberg-on-Spark reads of ZSTD-compressed Parquet files and
updates the Spark version used in the *-spark-iceberg environment.
findepi force-pushed the findepi/use-zstd-by-default-in-iceberg-c091ac branch from 8540fdc to f3b5358 on November 30, 2021
findepi force-pushed the findepi/use-zstd-by-default-in-iceberg-c091ac branch from f3b5358 to 62c0417 on November 30, 2021
findepi merged commit 31d4758 into trinodb:master on Dec 1, 2021
findepi deleted the findepi/use-zstd-by-default-in-iceberg-c091ac branch on December 1, 2021
github-actions bot added this to the 365 milestone on Dec 1, 2021
findepi mentioned this pull request on Dec 1, 2021