Use ZSTD by default in Iceberg #10045

Merged

Conversation

findepi (Member) commented Nov 23, 2021

fixes #10058
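For context on what "use ZSTD by default" amounts to: the connector's default compression codec flips from GZIP to ZSTD. A minimal sketch in Trino's usual Airlift config-binding style (the class shape and the inlined enum are simplifications for illustration, not the verbatim diff):

```java
import io.airlift.configuration.Config;

public class IcebergConfig
{
    // Simplified stand-in; the real enum lives elsewhere in the Trino codebase.
    public enum HiveCompressionCodec { NONE, SNAPPY, LZ4, ZSTD, GZIP }

    // The default flips from GZIP to ZSTD; everything else stays the same.
    private HiveCompressionCodec compressionCodec = HiveCompressionCodec.ZSTD;

    public HiveCompressionCodec getCompressionCodec()
    {
        return compressionCodec;
    }

    @Config("iceberg.compression-codec")
    public IcebergConfig setCompressionCodec(HiveCompressionCodec compressionCodec)
    {
        this.compressionCodec = compressionCodec;
        return this;
    }
}
```

Deployments that need the previous behavior can still pin the old codec via the iceberg.compression-codec catalog property.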

findepi requested review from losipiuk and phd3 on November 23, 2021
phd3 (Member) commented Nov 24, 2021

The failure seems related. Also, it would be good to update https://trino.io/docs/current/connector/iceberg.html#configuration. Are we also planning to do the same in the Hive connector, given that ZSTD is superior? Or could that throw off old readers?

findepi (Member, Author) commented Nov 24, 2021

> Are we also planning to do the same in the Hive connector, given that ZSTD is superior?

Yes, except Hive has more compatibility cruft; see #9773.

> Or could that throw off old readers?

Exactly.

> Also, it would be good to update https://trino.io/docs/current/connector/iceberg.html#configuration.

Thanks, done.

> The failure seems related.

Indeed:

2021-11-23T14:01:58.4024515Z spark               | java.lang.NoSuchMethodError: 'void com.github.luben.zstd.ZstdInputStream.<init>(java.io.InputStream, com.github.luben.zstd.BufferPool)'
2021-11-23T14:01:58.4027915Z spark               | 	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.codec.ZstdDecompressorStream.<init>(ZstdDecompressorStream.java:39)
2021-11-23T14:01:58.4032354Z spark               | 	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.codec.ZstandardCodec.createInputStream(ZstandardCodec.java:94)
2021-11-23T14:01:58.4039042Z spark               | 	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.codec.ZstandardCodec.createInputStream(ZstandardCodec.java:83)
2021-11-23T14:01:58.4042989Z spark               | 	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.CodecFactory$HeapBytesDecompressor.decompress(CodecFactory.java:111)
2021-11-23T14:01:58.4047031Z spark               | 	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ColumnChunkPageReadStore$ColumnChunkPageReader$1.visit(ColumnChunkPageReadStore.java:139)
2021-11-23T14:01:58.4051200Z spark               | 	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ColumnChunkPageReadStore$ColumnChunkPageReader$1.visit(ColumnChunkPageReadStore.java:131)
2021-11-23T14:01:58.4055699Z spark               | 	at org.apache.iceberg.shaded.org.apache.parquet.column.page.DataPageV1.accept(DataPageV1.java:120)
2021-11-23T14:01:58.4060786Z spark               | 	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ColumnChunkPageReadStore$ColumnChunkPageReader.readPage(ColumnChunkPageReadStore.java:131)
2021-11-23T14:01:58.4064576Z spark               | 	at org.apache.iceberg.parquet.BaseColumnIterator.advance(BaseColumnIterator.java:60)
2021-11-23T14:01:58.4066718Z spark               | 	at org.apache.iceberg.parquet.BaseColumnIterator.setPageSource(BaseColumnIterator.java:51)
2021-11-23T14:01:58.4069002Z spark               | 	at org.apache.iceberg.parquet.ParquetValueReaders$PrimitiveReader.setPageSource(ParquetValueReaders.java:185)
2021-11-23T14:01:58.4071208Z spark               | 	at org.apache.iceberg.parquet.ParquetValueReaders$OptionReader.setPageSource(ParquetValueReaders.java:369)
2021-11-23T14:01:58.4073605Z spark               | 	at org.apache.iceberg.parquet.ParquetValueReaders$StructReader.setPageSource(ParquetValueReaders.java:685)
2021-11-23T14:01:58.4075739Z spark               | 	at org.apache.iceberg.parquet.ParquetReader$FileIterator.advance(ParquetReader.java:142)
2021-11-23T14:01:58.4077522Z spark               | 	at org.apache.iceberg.parquet.ParquetReader$FileIterator.next(ParquetReader.java:112)
2021-11-23T14:01:58.4079028Z spark               | 	at org.apache.iceberg.io.FilterIterator.advance(FilterIterator.java:66)
2021-11-23T14:01:58.4080487Z spark               | 	at org.apache.iceberg.io.FilterIterator.hasNext(FilterIterator.java:50)
2021-11-23T14:01:58.4082076Z spark               | 	at org.apache.iceberg.spark.source.BaseDataReader.next(BaseDataReader.java:87)
2021-11-23T14:01:58.4084375Z spark               | 	at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:79)
2021-11-23T14:01:58.4087382Z spark               | 	at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:112)
2021-11-23T14:01:58.4091246Z spark               | 	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
2021-11-23T14:01:58.4092615Z spark               | 	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
2021-11-23T14:01:58.4120219Z spark               | 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
2021-11-23T14:01:58.4122908Z spark               | 	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
2021-11-23T14:01:58.4125576Z spark               | 	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
2021-11-23T14:01:58.4127464Z spark               | 	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
2021-11-23T14:01:58.4128849Z spark               | 	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872)
2021-11-23T14:01:58.4130030Z spark               | 	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872)
2021-11-23T14:01:58.4131443Z spark               | 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
2021-11-23T14:01:58.4133051Z spark               | 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
2021-11-23T14:01:58.4136031Z spark               | 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
2021-11-23T14:01:58.4137546Z spark               | 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
2021-11-23T14:01:58.4139490Z spark               | 	at org.apache.spark.scheduler.Task.run(Task.scala:127)
2021-11-23T14:01:58.4141318Z spark               | 	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:444)
2021-11-23T14:01:58.4142916Z spark               | 	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
2021-11-23T14:01:58.4144102Z spark               | 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:447)
2021-11-23T14:01:58.4145708Z spark               | 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
2021-11-23T14:01:58.4147516Z spark               | 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
2021-11-23T14:01:58.4148740Z spark               | 	at java.base/java.lang.Thread.run(Thread.java:829)
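The NoSuchMethodError is a link-time mismatch: the shaded Parquet bundled in iceberg-spark was compiled against a zstd-jni that provides the ZstdInputStream(InputStream, BufferPool) constructor, while the zstd-jni actually on Spark's classpath predates it. A minimal diagnostic sketch (hypothetical; not part of this PR) that checks which flavor is being linked:

```java
import java.io.InputStream;

import com.github.luben.zstd.ZstdInputStream;

// Hypothetical classpath diagnostic, not part of this PR: reflectively look for the
// two-argument constructor that the shaded Parquet code calls. If it is absent, the
// zstd-jni on the classpath is too old and ZSTD Parquet reads fail exactly as above.
public class ZstdJniLinkCheck
{
    public static void main(String[] args)
    {
        try {
            Class<?> bufferPool = Class.forName("com.github.luben.zstd.BufferPool");
            ZstdInputStream.class.getConstructor(InputStream.class, bufferPool);
            System.out.println("zstd-jni provides ZstdInputStream(InputStream, BufferPool): new enough");
        }
        catch (ClassNotFoundException | NoSuchMethodException e) {
            System.out.println("zstd-jni is too old for the shaded Parquet reader: " + e);
        }
    }
}
```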

cla-bot added the cla-signed label on Nov 24, 2021
findepi force-pushed the findepi/use-zstd-by-default-in-iceberg-c091ac branch from a15e6b7 to 54cc00d on November 29, 2021
findepi (Member, Author) commented Nov 29, 2021

(just rebased)

findepi (Member, Author) commented Nov 29, 2021

Reported the problem as apache/iceberg#3621, but it might be as simple as a version bump in our test setup.

pom.xml (outdated)

@@ -70,7 +70,7 @@
     <!-- TODO(https://github.com/airlift/airbase/pull/281): Required by testcontainers, remove when pulled from Airbase -->
     <dep.slf4j.version>1.7.32</dep.slf4j.version>
 
-    <dep.docker.images.version>52</dep.docker.images.version>
+    <dep.docker.images.version>03aecb7</dep.docker.images.version>
findepi (Member, Author) commented Nov 30, 2021

39 successful and 2 failing checks:

ci / pt (hdp3, suite-2, 11, false)
ci / pt (default, suite-7-non-generic, 11)

The suite-7-non-generic failure is an Iceberg test failure: an expected exception message mismatch, probably due to the Spark version change.
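For illustration, such failures usually have this shape: the test pins exact exception wording, so a Spark upgrade that rephrases the message fails the assertion even though the behavior under test is unchanged. Everything below (class, helper, message text) is hypothetical, not the actual failing test:

```java
import static org.assertj.core.api.Assertions.assertThatThrownBy;

// Hypothetical sketch of the failure mode, not the actual failing test.
class SparkExceptionMessageSketch
{
    void unsupportedStatementFails()
    {
        assertThatThrownBy(() -> runOnSpark("some unsupported statement"))
                // Pinned wording from the previous Spark release; must be updated after the bump.
                .hasMessageContaining("wording from the old Spark release");
    }

    // Hypothetical helper standing in for the product-test harness; simulates the
    // new Spark release rewording the exception, which makes the assertion fail.
    private void runOnSpark(String sql)
    {
        throw new UnsupportedOperationException("reworded message from the new Spark release");
    }
}
```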

findepi force-pushed the findepi/use-zstd-by-default-in-iceberg-c091ac branch from d34c631 to 163be63 on November 30, 2021
This fixes Iceberg-on-Spark reads of ZSTD-compressed Parquet files and
updates the Spark version used in the *-spark-iceberg environment.
findepi force-pushed the findepi/use-zstd-by-default-in-iceberg-c091ac branch from 8540fdc to f3b5358 on November 30, 2021
findepi force-pushed the findepi/use-zstd-by-default-in-iceberg-c091ac branch from f3b5358 to 62c0417 on November 30, 2021
findepi merged commit 31d4758 into trinodb:master on Dec 1, 2021
findepi deleted the findepi/use-zstd-by-default-in-iceberg-c091ac branch on December 1, 2021
github-actions bot added this to the 365 milestone on Dec 1, 2021
findepi mentioned this pull request on Dec 1, 2021