
Querying partitioned S3 data source error #24840

Closed
chippyash opened this issue Jan 29, 2025 · 5 comments
Labels: hive (Hive connector)

Comments


chippyash commented Jan 29, 2025

I have some Parquet files in S3, partitioned by year=, month= and day=, e.g.
bucket1/year=2025/month=01/day=01/metrics.parquet
bucket1/year=2025/month=01/day=02/metrics.parquet

I can retrieve the whole data set with 'select * from <table_name>', but as soon as I try to add a where clause, either on the partition value or on some value in the data, I get errors.

Here's my setup:

  • Dockerised Trino + Hive
  • AWS RDS MySQL 8
  • S3 buckets for non relational data

NB: Trino operations against RDS work without any problem.

Hive has an IAM user that is able to operate on the target S3 bucket. The IAM user has the following policy (some details changed for security):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "s3:Put*",
                "s3:Get*",
                "s3:List*",
                "S3:Head*",
                "s3:DeleteObject*"
            ],
            "Resource": [
                "arn:aws:s3:::bucket1",
                "arn:aws:s3:::bucket1/*",
            ],
            "Effect": "Allow"
        }
    ]
}

The user that logs into Trino has a set of catalogs they can see, determined by a file-based ACL. The user also has an associated IAM profile, and their IAM keys etc. are stored as part of the catalog description. The IAM user has an associated role that allows the following:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutBucketTagging",
                "s3:GetObjectAttributes",
                "s3:GetObjectTagging",
                "s3:ListBucket",
                "s3:PutObjectTagging"
            ],
            "Resource": [
                "arn:aws:s3:::bucket1",
                "arn:aws:s3:::bucket1/*"
            ]
        }
    ]
}

My catalog is defined as follows (mycat.properties):

connector.name=hive
fs.native-s3.enabled=true
hive.metastore.uri=thrift\://hive-metastore\:9083
s3.aws-access-key=<key>
s3.aws-secret-key=<secret>
s3.endpoint=https\://s3.eu-west-2.amazonaws.com
s3.iam-role=<role_arn>
s3.path-style-access=true
s3.region=eu-west-2
hive.recursive-directories=true

My schema is defined as:

CREATE SCHEMA mycat.bucket1
WITH (
   location = 's3://bucket1/'
);

The table creation statement is:

create table mycat.bucket1.daily (
    namespace varchar,
    operation varchar,
    dimensions varchar,
    orgId integer,
    contextId varchar,
    datetime timestamp,
    artifact varchar,
    size double,
    year integer,
    month integer,
    day integer
) with (
    external_location = 's3://bucket1/daily/',
    format = 'PARQUET',
    partitioned_by = ARRAY['year','month','day']
);

Finally, I call:

use mycat.bucket1;
call system.sync_partition_metadata('bucket1', 'daily', 'ADD');
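
As a sanity check (not in the original thread), the Hive connector exposes a hidden $partitions table that lists the partitions Trino knows about; a minimal sketch using the catalog and schema names from above:

select * from mycat.bucket1."daily$partitions";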

If I run a simple select * from daily limit 10; I get a result.

If I run select * from daily where namespace = 'da-incident-etl.load' and day=28 limit 10; I get an error:

  • In IntelliJ, connected to Trino:
Query failed (#20250129_171206_01078_mkuzn): Error opening Hive split s3://bucket1/daily/year=2025/month=01/day=28/metrics.parquet (offset=0, length=88709): Index 10 out of bounds for length 10
[2025-01-29 17:12:07] java.lang.ArrayIndexOutOfBoundsException: Index 10 out of bounds for length 10
  • In DBeaver, connected to Trino:
SQL Error [16777217]: Query failed (#20250129_163346_01075_mkuzn): Failed to read Parquet file: s3://bucket1/daily/year=2025/month=01/day=28/metrics.parquet

I'd really appreciate some help here. It's taken months to get this set up, and this is the final stumbling block to full adoption of Trino. NB: I have tried this with the Parquet file both compressed and uncompressed.

Trino v449

Any help appreciated.
Ashley

ebyhr (Member) commented Jan 30, 2025

Could you share the full stack trace? You can get it with the CLI debug option or from the Web UI.
cc: @raunaqmorarka
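
For reference, a minimal sketch of capturing the stack trace via the CLI's --debug flag (the server URL and query are illustrative):

trino --server http://localhost:8080 --debug --catalog mycat --schema bucket1
trino> select * from daily where day = 28 limit 10;
-- on failure, the CLI now prints the full Java stack trace instead of just the message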

ebyhr added the hive (Hive connector) label Jan 30, 2025
chippyash (Author) commented Jan 30, 2025

@ebyhr as requested:

io.trino.spi.TrinoException: Error opening Hive split s3://bucket1/daily/year=2025/month=01/day=28/metrics.parquet (offset=0, length=88709): Index 10 out of bounds for length 10
	at io.trino.plugin.hive.parquet.ParquetPageSourceFactory.createPageSource(ParquetPageSourceFactory.java:306)
	at io.trino.plugin.hive.parquet.ParquetPageSourceFactory.createPageSource(ParquetPageSourceFactory.java:180)
	at io.trino.plugin.hive.HivePageSourceProvider.createHivePageSource(HivePageSourceProvider.java:200)
	at io.trino.plugin.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:136)
	at io.trino.plugin.base.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:48)
	at io.trino.split.PageSourceManager$PageSourceProviderInstance.createPageSource(PageSourceManager.java:79)
	at io.trino.operator.ScanFilterAndProjectOperator$SplitToPages.process(ScanFilterAndProjectOperator.java:260)
	at io.trino.operator.ScanFilterAndProjectOperator$SplitToPages.process(ScanFilterAndProjectOperator.java:191)
	at io.trino.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:359)
	at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:412)
	at io.trino.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:346)
	at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:412)
	at io.trino.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:346)
	at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:412)
	at io.trino.operator.WorkProcessorUtils.getNextState(WorkProcessorUtils.java:261)
	at io.trino.operator.WorkProcessorUtils.lambda$processStateMonitor$2(WorkProcessorUtils.java:240)
	at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:412)
	at io.trino.operator.WorkProcessorUtils.getNextState(WorkProcessorUtils.java:261)
	at io.trino.operator.WorkProcessorUtils.lambda$finishWhen$3(WorkProcessorUtils.java:255)
	at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:412)
	at io.trino.operator.WorkProcessorSourceOperatorAdapter.getOutput(WorkProcessorSourceOperatorAdapter.java:133)
	at io.trino.operator.Driver.processInternal(Driver.java:403)
	at io.trino.operator.Driver.lambda$process$8(Driver.java:306)
	at io.trino.operator.Driver.tryWithLock(Driver.java:709)
	at io.trino.operator.Driver.process(Driver.java:298)
	at io.trino.operator.Driver.processForDuration(Driver.java:269)
	at io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:890)
	at io.trino.execution.executor.dedicated.SplitProcessor.run(SplitProcessor.java:77)
	at io.trino.execution.executor.dedicated.TaskEntry$VersionEmbedderBridge.lambda$run$0(TaskEntry.java:191)
	at io.trino.$gen.Trino_449____20250125_111737_2.run(Unknown Source)
	at io.trino.execution.executor.dedicated.TaskEntry$VersionEmbedderBridge.run(TaskEntry.java:192)
	at io.trino.execution.executor.scheduler.FairScheduler.runTask(FairScheduler.java:168)
	at io.trino.execution.executor.scheduler.FairScheduler.lambda$submit$0(FairScheduler.java:155)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
	at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131)
	at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:76)
	at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1570)
Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 10 out of bounds for length 10
	at org.apache.parquet.internal.column.columnindex.OffsetIndexBuilder$OffsetIndexImpl.getFirstRowIndex(OffsetIndexBuilder.java:68)
	at org.apache.parquet.internal.filter2.columnindex.RowRanges.create(RowRanges.java:144)
	at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.applyPredicate(ColumnIndexFilter.java:189)
	at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:160)
	at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:56)
	at org.apache.parquet.filter2.predicate.Operators$UserDefined.accept(Operators.java:450)
	at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter$1.visit(ColumnIndexFilter.java:83)
	at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter$1.visit(ColumnIndexFilter.java:77)
	at org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:151)
	at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.calculateRowRanges(ColumnIndexFilter.java:77)
	at io.trino.parquet.reader.ParquetReader.calculateFilteredRowRanges(ParquetReader.java:588)
	at io.trino.parquet.reader.ParquetReader.<init>(ParquetReader.java:175)
	at io.trino.plugin.hive.parquet.ParquetPageSourceFactory.lambda$createPageSource$2(ParquetPageSourceFactory.java:286)
	at io.trino.plugin.hive.parquet.ParquetPageSourceFactory.createParquetPageSource(ParquetPageSourceFactory.java:469)
	at io.trino.plugin.hive.parquet.ParquetPageSourceFactory.createPageSource(ParquetPageSourceFactory.java:288)
	... 39 more

raunaqmorarka (Member) commented:
I think setting parquet.use-column-index to false will allow you to work around this.
You should check the output of parquet pages and parquet column-index and see whether the page count lines up with the number of entries in the offset and column indexes.
There is a known bug in the parquet-go library's writer, where it messes up writing page indexes after xitongsys/parquet-go@1027acc.
I suggest migrating to arrow-go and fixing your data.
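
For reference, the page/index check described above can be done with the Apache parquet-cli tool (a sketch; the launcher name varies by install, and the file path is illustrative):

# print per-page metadata for each column chunk
parquet pages metrics.parquet
# print the column and offset indexes; the number of index entries
# should line up with the page count reported above
parquet column-index metrics.parquet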

chippyash (Author) commented:
@raunaqmorarka
I think setting parquet.use-column-index to false will allow you to work around this

set session mycat.parquet_use_column_index = false;

Worked.

I assume this has a performance hit, so I'll look at using the alternative library you suggest. The documentation for it is very poor, however. Thanks for your help.
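
As an aside (not from the thread): the same workaround can be made permanent for all queries on the catalog via the Hive connector's parquet.use-column-index configuration property, e.g. in mycat.properties:

# disable use of Parquet column indexes for this catalog
parquet.use-column-index=false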

raunaqmorarka (Member) commented:
I assume this has a performance hit

Parquet column indexes don't help on most queries; they help a little on needle-in-a-haystack style queries.
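
For illustration (not from the thread), a needle-in-a-haystack query is one with a highly selective predicate, where the page-level min/max statistics in the column index let the reader skip most pages; a hypothetical example against the table above:

-- contextId value assumed to be near-unique in the data
select * from daily
where year = 2025 and month = 1 and day = 28
  and contextId = 'ctx-000123'
limit 1;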

wendigo closed this as completed Feb 4, 2025
wendigo reopened this Feb 4, 2025
wendigo closed this as not planned Feb 4, 2025