Failure when reading ZSTD-compressed Parquet file: NoSuchMethodError: 'void com.github.luben.zstd.ZstdInputStream.<init>(java.io.InputStream, com.github.luben.zstd.BufferPool)' #3621

Closed
findepi opened this issue Nov 29, 2021 · 10 comments

@findepi
Member

findepi commented Nov 29, 2021

I am using iceberg-spark3-runtime-0.12.jar (https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark3-runtime/0.12.0/iceberg-spark3-runtime-0.12.0.jar) with Spark 3.0.0 (https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop2.7.tgz)

When attempting to read a ZSTD-compressed Parquet file, the query fails:

java.lang.NoSuchMethodError: 'void com.github.luben.zstd.ZstdInputStream.<init>(java.io.InputStream, com.github.luben.zstd.BufferPool)'
	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.codec.ZstdDecompressorStream.<init>(ZstdDecompressorStream.java:39)
	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.codec.ZstandardCodec.createInputStream(ZstandardCodec.java:94)
	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.codec.ZstandardCodec.createInputStream(ZstandardCodec.java:83)
	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.CodecFactory$HeapBytesDecompressor.decompress(CodecFactory.java:111)
	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ColumnChunkPageReadStore$ColumnChunkPageReader$1.visit(ColumnChunkPageReadStore.java:139)
	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ColumnChunkPageReadStore$ColumnChunkPageReader$1.visit(ColumnChunkPageReadStore.java:131)
	at org.apache.iceberg.shaded.org.apache.parquet.column.page.DataPageV1.accept(DataPageV1.java:120)
	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ColumnChunkPageReadStore$ColumnChunkPageReader.readPage(ColumnChunkPageReadStore.java:131)
	at org.apache.iceberg.parquet.BaseColumnIterator.advance(BaseColumnIterator.java:60)
	at org.apache.iceberg.parquet.BaseColumnIterator.setPageSource(BaseColumnIterator.java:51)
	at org.apache.iceberg.parquet.ParquetValueReaders$PrimitiveReader.setPageSource(ParquetValueReaders.java:185)
	at org.apache.iceberg.parquet.ParquetValueReaders$OptionReader.setPageSource(ParquetValueReaders.java:369)
	at org.apache.iceberg.parquet.ParquetValueReaders$StructReader.setPageSource(ParquetValueReaders.java:685)
	at org.apache.iceberg.parquet.ParquetReader$FileIterator.advance(ParquetReader.java:142)
	at org.apache.iceberg.parquet.ParquetReader$FileIterator.next(ParquetReader.java:112)
	at org.apache.iceberg.io.FilterIterator.advance(FilterIterator.java:66)
	at org.apache.iceberg.io.FilterIterator.hasNext(FilterIterator.java:50)
	at org.apache.iceberg.spark.source.BaseDataReader.next(BaseDataReader.java:87)
	at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:79)
	at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:112)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:127)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:444)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:447)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
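
For reference, a minimal reproduction sketch of the kind of read that hits this code path; the catalog, warehouse path, and table names below are placeholders, not taken from the actual failing setup:

```java
// Hypothetical reproduction sketch; catalog, warehouse, and table names are placeholders.
import org.apache.spark.sql.SparkSession;

public class ZstdParquetReadRepro {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("iceberg-zstd-read")
        .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.demo.type", "hadoop")
        .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
        .getOrCreate();

    // Reading any Iceberg table whose Parquet data files were written with ZSTD
    // compression fails in the decompressor with the NoSuchMethodError above.
    spark.sql("SELECT * FROM demo.db.zstd_table").show();
  }
}
```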
@findepi
Member Author

findepi commented Nov 29, 2021

Iceberg 0.12 (iceberg-spark3-runtime-0.12.jar) bundles a shaded version of
Parquet 1.12.

Since apache/parquet-java@279255d (apache-parquet-1.12.0-rc2 and newer),
Parquet requires com.github.luben.zstd.ZstdInputStream to have a 2-arg
constructor (InputStream, BufferPool). This constructor has been available
since luben/zstd-jni@dd2588e, i.e. zstd-jni v1.4.5-8 and newer.
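
A quick way to check which zstd-jni actually ends up on the classpath is to probe for that constructor via reflection. This is just a diagnostic sketch (not part of Iceberg or Parquet), run with the same classpath as the failing job:

```java
// Diagnostic sketch: probe the zstd-jni on the classpath for the constructor Parquet 1.12 expects.
import java.io.InputStream;

public class ZstdConstructorCheck {
  public static void main(String[] args) {
    try {
      Class<?> zstdInputStream = Class.forName("com.github.luben.zstd.ZstdInputStream");
      Class<?> bufferPool = Class.forName("com.github.luben.zstd.BufferPool");
      zstdInputStream.getConstructor(InputStream.class, bufferPool);
      System.out.println("(InputStream, BufferPool) constructor present: zstd-jni is 1.4.5-8 or newer");
    } catch (ReflectiveOperationException e) {
      // Either the BufferPool class or the 2-arg constructor is missing: zstd-jni is too old.
      System.out.println("zstd-jni predates 1.4.5-8: " + e);
    }
  }
}
```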

I tried using newer Spark versions, 3.1.2 and 3.2.0 (both seem to bundle a
sufficiently new zstd-jni), but was getting

java.lang.NoSuchMethodError: 'void org.apache.spark.sql.internal.VariableSubstitution.<init>(org.apache.spark.sql.internal.SQLConf)'

(I am not sure whether I was testing this with Iceberg 0.12 or 0.11, though.)

@findepi
Member Author

findepi commented Nov 29, 2021

Iceberg 0.12.1 + Spark 3.1.1 seem to work fine.

@RussellSpitzer
Member

Ah yeah we hit this and changed the Spark jni version. @kbendick may remember more?

@findepi
Member Author

findepi commented Nov 29, 2021

@RussellSpitzer thanks for looking into this.

Ah yeah we hit this and changed the Spark jni version.

You mean replacing zstd-jni-1.4.4-3.jar with a newer version?
That could work too, as long as the version shipping with Iceberg isn't picked (#3058 doesn't seem to be on the 0.12.x branch).

@kbendick
Contributor

kbendick commented Nov 29, 2021

I believe @RussellSpitzer is referring to upgrading zstd-jni to a later version, so your understanding is correct, @findepi.

Ideally, it should come from Spark (at least for spark3-runtime), which was the point of #3058.

this could work too, as long as version shipping with Iceberg isn't picked (#3058 doesn't seem to be on 0.12.x branch)

You are right, it does appear that #3058 was never included in 0.12.1 or the 0.12.x branch in general.

0.12.0 was released before that PR was merged, and then, with the repo layout changes and only cherry-picking bug fixes, it seems we missed it when preparing 0.12.1.

I'll be sure to add that to the upcoming 0.13.0 release. I'm not sure if #3058 itself merits a patch release. Would it be possible to exclude the dependency from the Trino side for now?

What do others think?

@kbendick
Contributor

It might be possible to disable the buffer pool (which would not be great from a performance standpoint, but might be helpful in working around the original issue): apache/parquet-java#903

The relevant parquet config is parquet.compression.codec.zstd.bufferPool.enabled. That might stop it from using that particular constructor.
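
A minimal sketch of how that flag could be set from a Spark job, assuming it is picked up from the Hadoop Configuration like other parquet.* options; whether this actually avoids the failing constructor is discussed in the comments below:

```java
// Hedged sketch: set the Parquet zstd buffer-pool flag through Spark's Hadoop configuration.
import org.apache.spark.sql.SparkSession;

public class DisableZstdBufferPool {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("zstd-bufferpool-workaround")
        .getOrCreate();

    // Parquet codecs read their settings from the Hadoop Configuration.
    spark.sparkContext().hadoopConfiguration()
        .set("parquet.compression.codec.zstd.bufferPool.enabled", "false");

    // ... run the previously failing read here ...
  }
}
```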

@jackye1995
Contributor

Not sure if #3058 itself merits a patch release

I wonder if Trino can hold off on trinodb/trino#10045 and wait for 0.13.0 for the fix, given it's just around the corner.

But from a correctness perspective, I think it's good to have a 0.12.2 for this later.

@findepi
Member Author

findepi commented Nov 29, 2021

Would it be possible to exclude the dependency from the trino side momentarily?

Trino doesn't use the Parquet reader bundled with Iceberg, so it is not affected.

I faced this problem in Trino's compatibility tests against Spark, where the Iceberg+Spark setup we have was failing. I think I solved this (#3621 (comment), trinodb/docker-images#119). Will close this issue once I can confirm with the CI.

The relevant parquet config is parquet.compression.codec.zstd.bufferPool.enabled. That might stop it from using that particular constructor.

For posterity -- my cursory reading of the Parquet code suggests that this flag controls what kind of BufferPool is passed to the problematic constructor, not whether the constructor is used. Thus, it does not seem to help when running with an older zstd-jni version.
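
An illustrative sketch of that reading (not the actual Parquet source): the flag only selects which BufferPool implementation is handed over, so the two-argument constructor is invoked either way:

```java
// Illustrative sketch of the behavior described above -- not the actual Parquet source.
import java.io.IOException;
import java.io.InputStream;

import com.github.luben.zstd.BufferPool;
import com.github.luben.zstd.NoPool;
import com.github.luben.zstd.RecyclingBufferPool;
import com.github.luben.zstd.ZstdInputStream;

class ZstdDecompressorSketch {
  static InputStream open(InputStream compressed, boolean bufferPoolEnabled) throws IOException {
    // The flag only chooses which BufferPool is passed along; the two-argument
    // constructor is called in both branches, so an old zstd-jni without it
    // fails with NoSuchMethodError regardless of the setting.
    BufferPool pool = bufferPoolEnabled ? RecyclingBufferPool.INSTANCE : NoPool.INSTANCE;
    return new ZstdInputStream(compressed, pool);
  }
}
```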

@jackye1995
Contributor

I faced this problem in Trino's compatibility tests against Spark

Yes, the compatibility test is what I am talking about on the Trino side. If that's fixed, then it should be fine.

@findepi findepi closed this as completed Nov 30, 2021
@findepi
Member Author

findepi commented Nov 30, 2021

Thanks for all the comments!
