Iceberg query failure because of predicate pushdown with iceberg column id #22347

Closed
findinpath opened this issue Jun 10, 2024 · 0 comments · Fixed by #22367
Labels
bug Something isn't working iceberg Iceberg connector

Comments

@findinpath
Contributor

findinpath commented Jun 10, 2024

Context

We have some Iceberg tables for which we generate the Parquet data files and the Iceberg metadata ourselves.

When a type in the Parquet file schema has no id, the query can fail on this call:

TupleDomain<ColumnDescriptor> parquetTupleDomain = options.isIgnoreStatistics() ? TupleDomain.all() : getParquetTupleDomain(descriptorsByPath, effectivePredicate);

Relevant stack trace

Query 20240609_161836_02951_zpmtj failed: Error opening Iceberg split /path/data/file.parquet (offset=0, length=1660): Cannot invoke "org.apache.parquet.schema.Type$ID.intValue()" because the return value of "org.apache.parquet.schema.PrimitiveType.getId()" is null
io.trino.spi.TrinoException: Error opening Iceberg split /mnt/shavast01_datalake/iceberg/temp/_managedtmp/6p7pl0cwmpdw/iceberg_caches/iceberg-temp-ghjkloyvmfircvttdsyufowgjqtmgzwh/data/file.parquet (offset=0, length=1660): Cannot invoke "org.apache.parquet.schema.Type$ID.intValue()" because the return value of "org.apache.parquet.schema.PrimitiveType.getId()" is null
        at io.trino.plugin.iceberg.IcebergPageSourceProvider.createParquetPageSource(IcebergPageSourceProvider.java:1132)
        at io.trino.plugin.iceberg.IcebergPageSourceProvider.createDataPageSource(IcebergPageSourceProvider.java:633)
        at io.trino.plugin.iceberg.IcebergPageSourceProvider.createPageSource(IcebergPageSourceProvider.java:373)
        at io.trino.plugin.iceberg.IcebergPageSourceProvider.createPageSource(IcebergPageSourceProvider.java:265)
        at io.trino.plugin.base.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:48)
        at io.trino.split.PageSourceManager.createPageSource(PageSourceManager.java:61)
        at io.trino.operator.TableScanOperator.getOutput(TableScanOperator.java:264)
        at io.trino.operator.Driver.processInternal(Driver.java:403)
        at io.trino.operator.Driver.lambda$process$8(Driver.java:306)
        at io.trino.operator.Driver.tryWithLock(Driver.java:709)
        at io.trino.operator.Driver.process(Driver.java:298)
        at io.trino.operator.Driver.processForDuration(Driver.java:269)
        at io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:890)
        at io.trino.execution.executor.dedicated.SplitProcessor.run(SplitProcessor.java:77)
        at io.trino.execution.executor.dedicated.TaskEntry$VersionEmbedderBridge.lambda$run$0(TaskEntry.java:191)
        at io.trino.$gen.Trino_448____20240607_201000_2.run(Unknown Source)
        at io.trino.execution.executor.dedicated.TaskEntry$VersionEmbedderBridge.run(TaskEntry.java:192)
        at io.trino.execution.executor.scheduler.FairScheduler.runTask(FairScheduler.java:168)
        at io.trino.execution.executor.scheduler.FairScheduler.lambda$submit$0(FairScheduler.java:155)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
        at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131)
        at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:76)
        at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        at java.base/java.lang.Thread.run(Thread.java:1570)
Caused by: java.lang.NullPointerException: Cannot invoke "org.apache.parquet.schema.Type$ID.intValue()" because the return value of "org.apache.parquet.schema.PrimitiveType.getId()" is null
        at io.trino.plugin.iceberg.IcebergPageSourceProvider.lambda$getParquetTupleDomain$35(IcebergPageSourceProvider.java:1504)
        at com.google.common.collect.CollectCollectors.lambda$toImmutableMap$7(CollectCollectors.java:195)
        at java.base/java.util.stream.ReduceOps$3ReducingSink.accept(ReduceOps.java:169)
        at java.base/java.util.HashMap$ValueSpliterator.forEachRemaining(HashMap.java:1787)
        at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:556)
        at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:546)
        at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921)
        at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:265)
        at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:702)
        at io.trino.plugin.iceberg.IcebergPageSourceProvider.getParquetTupleDomain(IcebergPageSourceProvider.java:1504)
        at io.trino.plugin.iceberg.IcebergPageSourceProvider.createParquetPageSource(IcebergPageSourceProvider.java:1017)
        ... 25 more

Slack discussion

https://trinodb.slack.com/archives/CJ6UC075E/p1717777005901119

Technical notes

The issue reported here is likely related to #19066.

Proposed fix: skip predicate pushdown when any descriptor in the Parquet schema is missing its id value.
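
The guard described above could look roughly like the sketch below. It is self-contained and does not use the Trino or Parquet classes: `Column`, `PushdownGuard`, and `indexByFieldId` are hypothetical names made up for illustration, with a nullable `fieldId` standing in for the result of `PrimitiveType.getId()`. It is not the change merged in #22367, only an illustration of checking for missing ids before building an id-keyed map.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Collectors;

public class PushdownGuard {
    // Hypothetical stand-in for a Parquet column descriptor.
    // fieldId is null when the file was written without field ids.
    record Column(String path, Integer fieldId) {}

    // Returns the columns keyed by field id, or Optional.empty() when any
    // column lacks an id. An empty result means "skip predicate pushdown"
    // instead of dereferencing a null id.
    static Optional<Map<Integer, Column>> indexByFieldId(List<Column> columns) {
        boolean anyMissing = columns.stream().anyMatch(c -> c.fieldId() == null);
        if (anyMissing) {
            return Optional.empty();
        }
        return Optional.of(columns.stream()
                .collect(Collectors.toMap(Column::fieldId, c -> c)));
    }

    public static void main(String[] args) {
        List<Column> withIds = List.of(new Column("a", 1), new Column("b", 2));
        List<Column> missingId = List.of(new Column("a", 1), new Column("b", null));

        // All ids present: the id-keyed map can be built and used for pushdown.
        System.out.println(indexByFieldId(withIds).isPresent());
        // One id missing: fall back to reading without pushdown.
        System.out.println(indexByFieldId(missingId).isPresent());
    }
}
```

In this sketch the caller would treat an empty result the same way as the `TupleDomain.all()` branch in the call quoted under Context, that is, read the file without filtering on statistics.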

The corresponding PR should add an integration test to guard against further regressions (either through the Hive migrate procedure or potentially through add_files, https://iceberg.apache.org/docs/latest/spark-procedures/#add_files).
