Failure when Hive partition value contains + and hive.recursive-directories is enabled #18149

Closed
JulianGoede opened this issue Jul 6, 2023 · 11 comments · Fixed by #18167
Labels: bug, RELEASE-BLOCKER

@JulianGoede

JulianGoede commented Jul 6, 2023

After upgrading from Trino 417 to 420, querying tables with URL-encoded partition values results in a java.lang.IllegalArgumentException in trino-hive/src/main/java/io/trino/plugin/hive/fs/HiveFileIterator.java:
path <eradicated>/foo/part=Erw. 40%252B/20230706_124818_00867_wqv8v_06d14113-abfa-4298-8746-816dc7818928 does not start with prefix <eradicated>/foo/part=Erw.+40%252B

Here is a minimal setup to reproduce this error (hive-connector with s3 storage):

CREATE TABLE foo ( 
  x varchar,
  part varchar
)                                       
WITH (                                  
  format = 'ORC',                      
  partitioned_by = ARRAY['part']       
);

INSERT INTO foo
SELECT  'x', url_encode('Erw. 40+');

SELECT * FROM foo;
@findinpath findinpath self-assigned this Jul 6, 2023
@findepi
Member

findepi commented Jul 6, 2023

I tried to reproduce this using io.trino.plugin.hive.s3.S3HiveQueryRunner#main on current master, but the SELECT returned correct results (didn't fail)

trino:s3> SELECT "$path", * FROM foo;
                                                $path                                                | x |    part
-----------------------------------------------------------------------------------------------------+---+------------
 s3://tpch/s3/foo/part=Erw.+40%252B/20230706_150455_00054_derg7_d0253eff-4df2-492c-a640-02604746ffb5 | x | Erw.+40%2B

I did not check with real S3.

@findinpath
Contributor

trino> create table hive.default.findinpathhiveorc18149_1 (x varchar, part varchar) with (format = 'ORC', partitioned_by = array['part'], external_location='s3://ub40/findinpathhiveorc18149_1');
CREATE TABLE

trino> insert into hive.default.findinpathhiveorc18149_1 select 'x', url_encode('Erw. 40+');
INSERT: 1 row

trino> select * from hive.default."findinpathhiveorc18149_1$partitions";
    part    
------------
 Erw.+40%2B 
(1 row)

trino> select * from hive.default.findinpathhiveorc18149_1;
 x |    part    
---+------------
 x | Erw.+40%2B 
(1 row)

trino> select "$path" from hive.default.findinpathhiveorc18149_1;
                                                                $path                                                                
-------------------------------------------------------------------------------------------------------------------------------------
 s3://ub40/findinpathhiveorc18149_1/part=Erw.+40%252B/20230706_152532_00004_76chu_359ad588-ad49-4289-918d-4ec3c30d55e5 
(1 row)

Tested with the latest master code from Trino and couldn't reproduce the issue.
However, I find it strange that the partition value is Erw.+40%2B and not Erw.%2040%2B

trino> select url_encode('Erw. 40+');
   _col0    
------------
 Erw.+40%2B 

@JulianGoede
Author

Okay thank you for checking.
I'll try to dig deeper into the problem tomorrow.

@findinpath
Contributor

@JulianGoede please add the full stack trace of the issue.
Please also do a tree listing of the s3 directory corresponding to the table.

@findepi
Member

findepi commented Jul 6, 2023

However, I find it strange that the partition value is Erw.+40%2B and not Erw.%2040%2B

I don't think the value is changed by the Hive connector (the examples above show that it was faithfully preserved), so it's whatever url_encode returned.
If this is of any concern, let's create a separate issue.
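
For illustration, here is a minimal Java sketch of why a form-style encoder produces Erw.+40%2B rather than Erw.%2040%2B. It uses the JDK's java.net.URLEncoder, which follows the application/x-www-form-urlencoded convention (space becomes +, a literal + becomes %2B). Whether Trino's url_encode is built on this class is an assumption; the sketch only shows that the output seen above matches that convention. The class and method names are hypothetical.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Trino.
final class FormEncodingSketch {
    // Form-style encoding: space -> '+', literal '+' -> "%2B".
    static String formEncode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    // Variant that writes spaces as "%20". After form encoding, any remaining
    // '+' can only have come from a space, so a plain replace is safe.
    static String percentEncode(String value) {
        return formEncode(value).replace("+", "%20");
    }

    public static void main(String[] args) {
        System.out.println(formEncode("Erw. 40+"));    // Erw.+40%2B
        System.out.println(percentEncode("Erw. 40+")); // Erw.%2040%2B
    }
}
```

Both forms decode back to the same string, so the + in the partition value is a legitimate encoding of the space rather than a sign of corruption.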

@findepi
Member

findepi commented Jul 6, 2023

Closing for now. @JulianGoede please reopen with new information.

@findepi findepi closed this as not planned Jul 6, 2023
@JulianGoede
Author

Hi again, I just retried the queries from @findinpath (now on Trino 421), but they still threw an exception.
I also found that the url_encode function is not even needed to produce the error; a plain + in the partition value suffices.
Note that I could not reproduce the error with other symbols such as ? or &.

Here again, the set of queries:

trino:temp>  CREATE TABLE localhive.temp.plus_error (
         ->     x varchar,
         ->     code varchar
         ->  )
         ->  WITH (
         ->     external_location = 's3a://{bucket}/tmp/trino_temp_schema_stage/plus_error',
         ->     format = 'ORC',
         ->     partitioned_by = ARRAY['code']
         ->  )
         -> ;
CREATE TABLE
trino:temp> insert into plus_error values ('foo', 'foo+bar');
INSERT: 1 row

Query 20230707_083158_00079_72z5g, FINISHED, 2 nodes
http://trino.atv-stage.svc.k8s.local/ui/query.html?20230707_083158_00079_72z5g
Splits: 38 total, 38 done (100.00%)
CPU Time: 0.0s total,     0 rows/s,     0B/s, 25% active
Per Node: 0.0 parallelism,     0 rows/s,     0B/s
Parallelism: 0.1
Peak Memory: 2.52KB
0.37 [0 rows, 0B] [0 rows/s, 0B/s]

trino:temp> select * from plus_error;
Query 20230707_083207_00080_72z5g failed: path s3a://{bucket}/tmp/trino_temp_schema_stage/plus_error/code=foo bar/20230707_083158_00079_72z5g_bfe2bfc8-c64e-49a9-9755-fbe0d2e06d39 does not start with prefix s3a://{bucket}/tmp/trino_temp_schema_stage/plus_error/code=foo+bar               
io.trino.spi.TrinoException: path s3a://{bucket}/tmp/trino_temp_schema_stage/plus_error/code=foo bar/20230707_083158_00079_72z5g_bfe2bfc8-c64e-49a9-9755-fbe0d2e06d39 does not start with prefix s3a://{bucket}/tmp/trino_temp_schema_stage/plus_error/code=foo+bar                            
        at io.trino.plugin.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:318)                                                                                                                                                                                                    
        at io.trino.plugin.hive.util.ResumableTasks$1.run(ResumableTasks.java:38)                                                                                                                                                                                                                                            
        at io.trino.$gen.Trino_421____20230707_080001_2.run(Unknown Source)                                                                                                                                                                                                                                                  
        at io.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:79)                                                                                                                                                                                                                                         
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)                                                                                                                                                                                                                         
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)                                                           
        at java.base/java.lang.Thread.run(Thread.java:833)                                                                                                                                                                                                                                                                   
Caused by: java.lang.IllegalArgumentException: path s3a://{bucket}/tmp/trino_temp_schema_stage/plus_error/code=foo bar/20230707_083158_00079_72z5g_bfe2bfc8-c64e-49a9-9755-fbe0d2e06d39 does not start with prefix s3a://{bucket}/tmp/trino_temp_schema_stage/plus_error/code=foo+bar                                                                          
        at com.google.common.base.Preconditions.checkArgument(Preconditions.java:445)                                                                                                         
        at io.trino.plugin.hive.fs.HiveFileIterator.isHiddenOrWithinHiddenParentDirectory(HiveFileIterator.java:127)
        at io.trino.plugin.hive.fs.HiveFileIterator.computeNext(HiveFileIterator.java:83)
        at io.trino.plugin.hive.fs.HiveFileIterator.computeNext(HiveFileIterator.java:39)                                                                                                     
        at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:145)                        
        at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:140)       
        at java.base/java.util.Spliterators$IteratorSpliterator.tryAdvance(Spliterators.java:1855)
        at java.base/java.util.stream.StreamSpliterators$WrappingSpliterator.lambda$initPartialTraversalState$0(StreamSpliterators.java:292)
        at java.base/java.util.stream.StreamSpliterators$AbstractWrappingSpliterator.fillBuffer(StreamSpliterators.java:206)
        at java.base/java.util.stream.StreamSpliterators$AbstractWrappingSpliterator.doAdvance(StreamSpliterators.java:161)
        at java.base/java.util.stream.StreamSpliterators$WrappingSpliterator.tryAdvance(StreamSpliterators.java:298)
        at java.base/java.util.Spliterators$1Adapter.hasNext(Spliterators.java:681)
        at io.trino.plugin.hive.BackgroundHiveSplitLoader.loadSplits(BackgroundHiveSplitLoader.java:405)
        at io.trino.plugin.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:311)
        ... 6 more        

When I look in S3, the object was written correctly under the code=foo+bar prefix, as mentioned in the error message:

aws s3 ls {bucket}/tmp/trino_temp_schema_stage/plus_error/         
                           PRE code=foo+bar/

@JulianGoede JulianGoede reopened this Jul 7, 2023
@findepi
Member

findepi commented Jul 7, 2023

@JulianGoede thanks for providing more info. The stack trace is especially useful, since I can now see that it's related to hive.recursive-directories. That's probably why we didn't reproduce it initially.

@findepi findepi changed the title from "[Bug] Select from url_encoded partition column fails" to "Failure when Hive partition value contains + and hive.recursive-directories is enabled" Jul 7, 2023
@findepi findepi added the bug Something isn't working label Jul 7, 2023
@findepi
Member

findepi commented Jul 7, 2023

@findinpath the exception occurs because there is a URLDecoder.decode call within isHiddenOrWithinHiddenParentDirectory:

String pathString = decode(path.toUri().toString());

I don't know why it's there, and there is no comment explaining it, so intuitively we should be good just removing it.
However, it was added quite recently (#17624, 419)
cc @guyco33 @electrum
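
To make the failure mode concrete, here is a minimal Java sketch of what decoding the listed path does to a + in a partition directory. It is a simplified stand-in, not the actual HiveFileIterator code: the class name, helper names, bucket, and file name are hypothetical, and it assumes (as the error message suggests) that the listed path is decoded while the prefix is compared as-is.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the failing check; not the actual Trino code.
final class PlusDecodeSketch {
    // Form-style decoding: '+' becomes a space and %XX escapes are resolved.
    static String decode(String value) {
        return URLDecoder.decode(value, StandardCharsets.UTF_8);
    }

    // Mimics the precondition: the listed file path is decoded,
    // the partition prefix is left untouched, then the two are compared.
    static boolean decodedPathStartsWithPrefix(String listedPath, String prefix) {
        return decode(listedPath).startsWith(prefix);
    }

    public static void main(String[] args) {
        String base = "s3a://bucket/plus_error/";
        for (String partition : new String[] {"code=foo+bar", "code=foo?bar", "code=foo&bar"}) {
            String prefix = base + partition;
            String listedPath = prefix + "/datafile";
            System.out.println(partition + " -> " + decode(listedPath)
                    + " | prefix matches: " + decodedPathStartsWithPrefix(listedPath, prefix));
        }
    }
}
```

Under these assumptions only the + case fails the prefix check, because URLDecoder rewrites + to a space and leaves ? and & alone. That is consistent with the report above that other symbols did not trigger the error.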

@findepi
Member

findepi commented Jul 7, 2023

Marking this as RELEASE-BLOCKER since it's a recent regression (419).

@findepi
Member

findepi commented Jul 7, 2023

#18167 might fix this.
