Failure when reading ZSTD-compressed Parquet file: NoSuchMethodError: 'void com.github.luben.zstd.ZstdInputStream.<init>(java.io.InputStream, com.github.luben.zstd.BufferPool)' #3621

Closed
findepi opened this issue Nov 29, 2021 · 10 comments

@findepi
Member

findepi commented Nov 29, 2021

I am using iceberg-spark3-runtime-0.12.jar (https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark3-runtime/0.12.0/iceberg-spark3-runtime-0.12.0.jar) with Spark 3.0.0 (https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop2.7.tgz)

When attempting to read a ZSTD-compressed Parquet file, the query fails:

java.lang.NoSuchMethodError: 'void com.github.luben.zstd.ZstdInputStream.<init>(java.io.InputStream, com.github.luben.zstd.BufferPool)'
	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.codec.ZstdDecompressorStream.<init>(ZstdDecompressorStream.java:39)
	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.codec.ZstandardCodec.createInputStream(ZstandardCodec.java:94)
	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.codec.ZstandardCodec.createInputStream(ZstandardCodec.java:83)
	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.CodecFactory$HeapBytesDecompressor.decompress(CodecFactory.java:111)
	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ColumnChunkPageReadStore$ColumnChunkPageReader$1.visit(ColumnChunkPageReadStore.java:139)
	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ColumnChunkPageReadStore$ColumnChunkPageReader$1.visit(ColumnChunkPageReadStore.java:131)
	at org.apache.iceberg.shaded.org.apache.parquet.column.page.DataPageV1.accept(DataPageV1.java:120)
	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ColumnChunkPageReadStore$ColumnChunkPageReader.readPage(ColumnChunkPageReadStore.java:131)
	at org.apache.iceberg.parquet.BaseColumnIterator.advance(BaseColumnIterator.java:60)
	at org.apache.iceberg.parquet.BaseColumnIterator.setPageSource(BaseColumnIterator.java:51)
	at org.apache.iceberg.parquet.ParquetValueReaders$PrimitiveReader.setPageSource(ParquetValueReaders.java:185)
	at org.apache.iceberg.parquet.ParquetValueReaders$OptionReader.setPageSource(ParquetValueReaders.java:369)
	at org.apache.iceberg.parquet.ParquetValueReaders$StructReader.setPageSource(ParquetValueReaders.java:685)
	at org.apache.iceberg.parquet.ParquetReader$FileIterator.advance(ParquetReader.java:142)
	at org.apache.iceberg.parquet.ParquetReader$FileIterator.next(ParquetReader.java:112)
	at org.apache.iceberg.io.FilterIterator.advance(FilterIterator.java:66)
	at org.apache.iceberg.io.FilterIterator.hasNext(FilterIterator.java:50)
	at org.apache.iceberg.spark.source.BaseDataReader.next(BaseDataReader.java:87)
	at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:79)
	at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:112)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:127)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:444)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:447)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
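
For reference, a minimal reproduction sketch of the kind of read that hits this code path; the catalog, warehouse path, and table names below are placeholders, not taken from the actual failing setup:

```java
// Hypothetical reproduction sketch; catalog, warehouse, and table names are placeholders.
import org.apache.spark.sql.SparkSession;

public class ZstdParquetReadRepro {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("iceberg-zstd-read")
        .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.demo.type", "hadoop")
        .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
        .getOrCreate();

    // Reading any Iceberg table whose Parquet data files were written with ZSTD
    // compression fails in the decompressor with the NoSuchMethodError above.
    spark.sql("SELECT * FROM demo.db.zstd_table").show();
  }
}
```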
@findepi
Member Author

findepi commented Nov 29, 2021

Iceberg 0.12 (iceberg-spark3-runtime-0.12.jar) bundles a shaded version of
Parquet 1.12.

Since apache/parquet-java@279255d (apache-parquet-1.12.0-rc2 and newer),
Parquet requires com.github.luben.zstd.ZstdInputStream to have a 2-arg
constructor (InputStream, BufferPool). This constructor has been available
since luben/zstd-jni@dd2588e, i.e. zstd-jni v1.4.5-8 and newer.
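
A quick way to check which zstd-jni actually ends up on the classpath is to probe for that constructor via reflection. This is just a diagnostic sketch (not part of Iceberg or Parquet), run with the same classpath as the failing job:

```java
// Diagnostic sketch: probe the zstd-jni on the classpath for the constructor Parquet 1.12 expects.
import java.io.InputStream;

public class ZstdConstructorCheck {
  public static void main(String[] args) {
    try {
      Class<?> zstdInputStream = Class.forName("com.github.luben.zstd.ZstdInputStream");
      Class<?> bufferPool = Class.forName("com.github.luben.zstd.BufferPool");
      zstdInputStream.getConstructor(InputStream.class, bufferPool);
      System.out.println("(InputStream, BufferPool) constructor present: zstd-jni is 1.4.5-8 or newer");
    } catch (ReflectiveOperationException e) {
      // Either the BufferPool class or the 2-arg constructor is missing: zstd-jni is too old.
      System.out.println("zstd-jni predates 1.4.5-8: " + e);
    }
  }
}
```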

I tried using newer Spark versions, 3.1.2 and 3.2.0 (both seem to bundle a
sufficiently new zstd-jni), but was getting

java.lang.NoSuchMethodError: 'void org.apache.spark.sql.internal.VariableSubstitution.<init>(org.apache.spark.sql.internal.SQLConf)'

(I am not sure whether I was testing this with Iceberg 0.12 or 0.11, though.)

@findepi
Member Author

findepi commented Nov 29, 2021

Iceberg 0.12.1 + Spark 3.1.1 seem to work fine.

@RussellSpitzer
Member

Ah yeah we hit this and changed the Spark jni version. @kbendick may remember more?

@findepi
Member Author

findepi commented Nov 29, 2021

@RussellSpitzer thanks for looking into this.

Ah yeah we hit this and changed the Spark jni version.

You mean replacing zstd-jni-1.4.4-3.jar with a newer version?
That could work too, as long as the version shipping with Iceberg isn't picked (#3058 doesn't seem to be on the 0.12.x branch).

@kbendick
Contributor

kbendick commented Nov 29, 2021

I believe @RussellSpitzer is referring to upgrading zstd-jni to a later version, so your understanding is correct, @findepi.

Ideally, it should come from Spark (at least for spark3-runtime), which was the point of #3058.

this could work too, as long as version shipping with Iceberg isn't picked (#3058 doesn't seem to be on 0.12.x branch)

You are right, it does appear that #3058 was never included in 0.12.1 or the 0.12.x branch in general.

0.12.0 was released before that PR was merged, and then, with the repo layout changes and only cherry-picking bug fixes, it seems we missed it when preparing 0.12.1.

I'll be sure to add that to the upcoming 0.13.0 release. I'm not sure if #3058 itself merits a patch release. Would it be possible to exclude the dependency from the Trino side for now?

What do others think?

@kbendick
Contributor

It might be possible to disable the buffer pool (which would not be great from a performance standpoint, but might be helpful in working around the original issue): apache/parquet-java#903

The relevant parquet config is parquet.compression.codec.zstd.bufferPool.enabled. That might stop it from using that particular constructor.
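
A minimal sketch of how that flag could be set from a Spark job, assuming it is picked up from the Hadoop Configuration like other parquet.* options; whether this actually avoids the failing constructor is discussed in the comments below:

```java
// Hedged sketch: set the Parquet zstd buffer-pool flag through Spark's Hadoop configuration.
import org.apache.spark.sql.SparkSession;

public class DisableZstdBufferPool {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("zstd-bufferpool-workaround")
        .getOrCreate();

    // Parquet codecs read their settings from the Hadoop Configuration.
    spark.sparkContext().hadoopConfiguration()
        .set("parquet.compression.codec.zstd.bufferPool.enabled", "false");

    // ... run the previously failing read here ...
  }
}
```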

@jackye1995
Contributor

Not sure if #3058 itself merits a patch release

I wonder if Trino can hold off on trinodb/trino#10045 and wait for 0.13.0 for the fix, given it's just around the corner.

But from a correctness perspective, I think it's good to have a 0.12.2 for this later.

@findepi
Member Author

findepi commented Nov 29, 2021

Would it be possible to exclude the dependency from the trino side momentarily?

Trino doesn't use the Parquet reader bundled with Iceberg, so it is not affected.

I faced this problem in Trino's compatibility tests against Spark, where the Iceberg+Spark setup we have was failing. I think I solved this (#3621 (comment), trinodb/docker-images#119). Will close this issue once I can confirm with the CI.

The relevant parquet config is parquet.compression.codec.zstd.bufferPool.enabled. That might stop it from using that particular constructor.

For posterity -- my cursory reading of the Parquet code suggests that this flag controls what kind of BufferPool is passed to the problematic constructor, not whether the constructor is used. Thus, it does not seem to help when running with an older zstd-jni version.
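
An illustrative sketch of that reading (not the actual Parquet source): the flag only selects which BufferPool implementation is handed over, so the two-argument constructor is invoked either way:

```java
// Illustrative sketch of the behavior described above -- not the actual Parquet source.
import java.io.IOException;
import java.io.InputStream;

import com.github.luben.zstd.BufferPool;
import com.github.luben.zstd.NoPool;
import com.github.luben.zstd.RecyclingBufferPool;
import com.github.luben.zstd.ZstdInputStream;

class ZstdDecompressorSketch {
  static InputStream open(InputStream compressed, boolean bufferPoolEnabled) throws IOException {
    // The flag only chooses which BufferPool is passed along; the two-argument
    // constructor is called in both branches, so an old zstd-jni without it
    // fails with NoSuchMethodError regardless of the setting.
    BufferPool pool = bufferPoolEnabled ? RecyclingBufferPool.INSTANCE : NoPool.INSTANCE;
    return new ZstdInputStream(compressed, pool);
  }
}
```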

@jackye1995
Contributor

I faced this problem in Trino's compatibility tests against Spark

Yes, the compatibility test is what I am talking about on the Trino side. If that's fixed, then it should be fine.

@findepi findepi closed this as completed Nov 30, 2021
@findepi
Member Author

findepi commented Nov 30, 2021

Thanks for all the comments!
