
Querying partitioned S3 data source error #24840

Closed
chippyash opened this issue Jan 29, 2025 · 5 comments
Labels: hive (Hive connector)

Comments


chippyash commented Jan 29, 2025

I have some Parquet files in S3, partitioned by year=, month= and day=, e.g.
bucket1/year=2025/month=01/day=01/metrics.parquet
bucket1/year=2025/month=01/day=02/metrics.parquet

I can retrieve the whole data set with 'select * from <table_name>', but as soon as I try to add a where clause, either on the partition value or on some value in the data, I get errors.

Here's my setup:

  • Dockerised Trino + Hive
  • AWS RDS MySQL 8
  • S3 buckets for non relational data

NB: Trino operations against RDS work without any problem.

Hive has an IAM user that is able to operate on the target S3 bucket. The IAM user has the following policy (some details changed for security):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "s3:Put*",
                "s3:Get*",
                "s3:List*",
                "S3:Head*",
                "s3:DeleteObject*"
            ],
            "Resource": [
                "arn:aws:s3:::bucket1",
                "arn:aws:s3:::bucket1/*",
            ],
            "Effect": "Allow"
        }
    ]
}

The user that logs into Trino has a set of catalogs they can see, determined by a file-based ACL. The user also has an associated IAM profile, and their IAM keys etc. are stored as part of the catalog description. The IAM user has an associated role that allows the following:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutBucketTagging",
                "s3:GetObjectAttributes",
                "s3:GetObjectTagging",
                "s3:ListBucket",
                "s3:PutObjectTagging"
            ],
            "Resource": [
                "arn:aws:s3:::bucket1",
                "arn:aws:s3:::bucket1/*"
            ]
        }
    ]
}

My catalog is defined as follows (mycat.properties):

connector.name=hive
fs.native-s3.enabled=true
hive.metastore.uri=thrift\://hive-metastore\:9083
s3.aws-access-key=<key>
s3.aws-secret-key=<secret>
s3.endpoint=https\://s3.eu-west-2.amazonaws.com
s3.iam-role=<role_arn>
s3.path-style-access=true
s3.region=eu-west-2
hive.recursive-directories=true

My schema is defined as:

CREATE SCHEMA mycat.bucket1
WITH (
   location = 's3://bucket1/'
);

The table creation statement is:

create table mycat.bucket1.daily (
    namespace varchar,
    operation varchar,
    dimensions varchar,
    orgId integer,
    contextId varchar,
    datetime timestamp,
    artifact varchar,
    size double,
    year integer,
    month integer,
    day integer
) with (
    external_location = 's3://bucket1/daily/',
    format = 'PARQUET',
    partitioned_by = ARRAY['year','month','day']
);

Finally, I call:

use mycat.bucket1;
call system.sync_partition_metadata('bucket1', 'daily', 'ADD');
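
As a sanity check (not in the original thread), the Hive connector exposes a hidden $partitions table that lists the partitions Trino knows about; a minimal sketch using the catalog and schema names from above:

select * from mycat.bucket1."daily$partitions";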

If I run a simple select * from daily limit 10; I get a result.

If I run select * from daily where namespace = 'da-incident-etl.load' and day=28 limit 10; I get an error:

  • In IntelliJ, connected to Trino:
Query failed (#20250129_171206_01078_mkuzn): Error opening Hive split s3://bucket1/daily/year=2025/month=01/day=28/metrics.parquet (offset=0, length=88709): Index 10 out of bounds for length 10
[2025-01-29 17:12:07] java.lang.ArrayIndexOutOfBoundsException: Index 10 out of bounds for length 10
  • In DBeaver, connected to Trino:
SQL Error [16777217]: Query failed (#20250129_163346_01075_mkuzn): Failed to read Parquet file: s3://bucket1/daily/year=2025/month=01/day=28/metrics.parquet

I'd really appreciate some help here. It's taken months to get this set up, and this is the final stumbling block to full adoption of Trino. NB: I have tried this with the Parquet file both compressed and uncompressed.

Trino v449

Any help appreciated.
Ashley

ebyhr (Member) commented Jan 30, 2025

Could you share the full stack trace? You can get it with the CLI debug option or from the Web UI.
cc: @raunaqmorarka
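
For reference, a minimal sketch of capturing the stack trace via the CLI's --debug flag (the server URL and query are illustrative):

trino --server http://localhost:8080 --debug --catalog mycat --schema bucket1
trino> select * from daily where day = 28 limit 10;
-- on failure, the CLI now prints the full Java stack trace instead of just the message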

ebyhr added the hive (Hive connector) label Jan 30, 2025
chippyash (Author) commented Jan 30, 2025

@ebyhr as requested:

io.trino.spi.TrinoException: Error opening Hive split s3://bucket1/daily/year=2025/month=01/day=28/metrics.parquet (offset=0, length=88709): Index 10 out of bounds for length 10
	at io.trino.plugin.hive.parquet.ParquetPageSourceFactory.createPageSource(ParquetPageSourceFactory.java:306)
	at io.trino.plugin.hive.parquet.ParquetPageSourceFactory.createPageSource(ParquetPageSourceFactory.java:180)
	at io.trino.plugin.hive.HivePageSourceProvider.createHivePageSource(HivePageSourceProvider.java:200)
	at io.trino.plugin.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:136)
	at io.trino.plugin.base.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:48)
	at io.trino.split.PageSourceManager$PageSourceProviderInstance.createPageSource(PageSourceManager.java:79)
	at io.trino.operator.ScanFilterAndProjectOperator$SplitToPages.process(ScanFilterAndProjectOperator.java:260)
	at io.trino.operator.ScanFilterAndProjectOperator$SplitToPages.process(ScanFilterAndProjectOperator.java:191)
	at io.trino.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:359)
	at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:412)
	at io.trino.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:346)
	at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:412)
	at io.trino.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:346)
	at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:412)
	at io.trino.operator.WorkProcessorUtils.getNextState(WorkProcessorUtils.java:261)
	at io.trino.operator.WorkProcessorUtils.lambda$processStateMonitor$2(WorkProcessorUtils.java:240)
	at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:412)
	at io.trino.operator.WorkProcessorUtils.getNextState(WorkProcessorUtils.java:261)
	at io.trino.operator.WorkProcessorUtils.lambda$finishWhen$3(WorkProcessorUtils.java:255)
	at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:412)
	at io.trino.operator.WorkProcessorSourceOperatorAdapter.getOutput(WorkProcessorSourceOperatorAdapter.java:133)
	at io.trino.operator.Driver.processInternal(Driver.java:403)
	at io.trino.operator.Driver.lambda$process$8(Driver.java:306)
	at io.trino.operator.Driver.tryWithLock(Driver.java:709)
	at io.trino.operator.Driver.process(Driver.java:298)
	at io.trino.operator.Driver.processForDuration(Driver.java:269)
	at io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:890)
	at io.trino.execution.executor.dedicated.SplitProcessor.run(SplitProcessor.java:77)
	at io.trino.execution.executor.dedicated.TaskEntry$VersionEmbedderBridge.lambda$run$0(TaskEntry.java:191)
	at io.trino.$gen.Trino_449____20250125_111737_2.run(Unknown Source)
	at io.trino.execution.executor.dedicated.TaskEntry$VersionEmbedderBridge.run(TaskEntry.java:192)
	at io.trino.execution.executor.scheduler.FairScheduler.runTask(FairScheduler.java:168)
	at io.trino.execution.executor.scheduler.FairScheduler.lambda$submit$0(FairScheduler.java:155)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
	at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131)
	at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:76)
	at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1570)
Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 10 out of bounds for length 10
	at org.apache.parquet.internal.column.columnindex.OffsetIndexBuilder$OffsetIndexImpl.getFirstRowIndex(OffsetIndexBuilder.java:68)
	at org.apache.parquet.internal.filter2.columnindex.RowRanges.create(RowRanges.java:144)
	at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.applyPredicate(ColumnIndexFilter.java:189)
	at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:160)
	at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:56)
	at org.apache.parquet.filter2.predicate.Operators$UserDefined.accept(Operators.java:450)
	at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter$1.visit(ColumnIndexFilter.java:83)
	at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter$1.visit(ColumnIndexFilter.java:77)
	at org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:151)
	at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.calculateRowRanges(ColumnIndexFilter.java:77)
	at io.trino.parquet.reader.ParquetReader.calculateFilteredRowRanges(ParquetReader.java:588)
	at io.trino.parquet.reader.ParquetReader.<init>(ParquetReader.java:175)
	at io.trino.plugin.hive.parquet.ParquetPageSourceFactory.lambda$createPageSource$2(ParquetPageSourceFactory.java:286)
	at io.trino.plugin.hive.parquet.ParquetPageSourceFactory.createParquetPageSource(ParquetPageSourceFactory.java:469)
	at io.trino.plugin.hive.parquet.ParquetPageSourceFactory.createPageSource(ParquetPageSourceFactory.java:288)
	... 39 more

raunaqmorarka (Member) commented:
I think setting parquet.use-column-index to false will allow you to work around this.
You should check the output of parquet pages and parquet column-index and see whether the page count lines up with the number of entries in the offset and column indexes.
There is a known bug in the parquet-go library's writer, where it messes up writing page indexes after xitongsys/parquet-go@1027acc.
I suggest migrating to arrow-go and fixing your data.
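
For reference, the page/index check described above can be done with the Apache parquet-cli tool (a sketch; the launcher name varies by install, and the file path is illustrative):

# print per-page metadata for each column chunk
parquet pages metrics.parquet
# print the column and offset indexes; the number of index entries
# should line up with the page count reported above
parquet column-index metrics.parquet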

chippyash (Author) commented:
@raunaqmorarka
I think setting parquet.use-column-index to false will allow you to work around this

set session mycat.parquet_use_column_index = false;

Worked.

I assume this has a performance hit, so I'll look at using the alternative library you suggest. The documentation for it is very poor, however. Thanks for your help.
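
As an aside (not from the thread): the same workaround can be made permanent for all queries on the catalog via the Hive connector's parquet.use-column-index configuration property, e.g. in mycat.properties:

# disable use of Parquet column indexes for this catalog
parquet.use-column-index=false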

raunaqmorarka (Member) commented:
I assume this has a performance hit

Parquet column indexes don't help on most queries; they help a little on needle-in-a-haystack style queries.
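
For illustration (not from the thread), a needle-in-a-haystack query is one with a highly selective predicate, where the page-level min/max statistics in the column index let the reader skip most pages; a hypothetical example against the table above:

-- contextId value assumed to be near-unique in the data
select * from daily
where year = 2025 and month = 1 and day = 28
  and contextId = 'ctx-000123'
limit 1;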

wendigo closed this as completed Feb 4, 2025
wendigo reopened this Feb 4, 2025
wendigo closed this as not planned Feb 4, 2025