Expose erroneous handling for recursive listing in S3 #18173

findinpath
Contributor

@findinpath findinpath commented Jul 7, 2023

Description

AWS S3 allows keys that contain multiple consecutive slashes

e.g.: s3://bucket/schema/table//file

This test case exposes the fact that the recursive listing in
Hive mishandles keys containing multiple slashes.
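
To illustrate the kind of mismatch described above, here is a minimal Python sketch. It is not Trino or Hive code: `list_keys` and `collapse_slashes` are hypothetical helpers, and the in-memory key list only stands in for an S3 bucket. It assumes that the listing code, somewhere along the way, treats a key like a filesystem path and collapses repeated slashes.

```python
import re

def list_keys(keys, prefix):
    """Simulate S3's flat keyspace: return every key that starts with the prefix."""
    return sorted(k for k in keys if k.startswith(prefix))

def collapse_slashes(key):
    """Filesystem-style normalization (an assumption): collapse runs of '/' into one."""
    return re.sub(r"/+", "/", key)

# Hypothetical bucket contents; the first key contains an empty path segment.
keys = ["schema/table//file", "schema/table/other"]

for key in list_keys(keys, "schema/table/"):
    normalized = collapse_slashes(key)
    # A key survives the round trip only if normalization leaves it unchanged.
    print(key, "->", normalized, "| still a real key:", normalized in keys)
```

S3 treats `schema/table//file` and `schema/table/file` as two different keys, so any code that derives one from the other ends up pointing at an object that does not exist.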

Contains cherry-pick from #18167

Additional context and related issues

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

…values

Fix query failure when `hive.recursive-directories` is used and the
partition directory contains any URL-encoded characters, i.e. when the
partition value contains pretty much anything other than letters/digits.
@cla-bot cla-bot bot added the cla-signed label Jul 7, 2023
@findinpath findinpath force-pushed the findinpath/s3-too-slashed-recursive-listing branch from 347ffad to d01d43f Compare July 7, 2023 12:54
@findinpath findinpath added the bug Something isn't working label Jul 7, 2023
@findepi
Member

findepi commented Jul 7, 2023

I don’t think it’s a user problem, since we restored normalization — no double slashes should be there for hive connector.

In theory we could still have a problem if a Hive table has double slashes due to external setup (e.g. declared explicitly in Glue), but we haven't seen this in real life, and currently believe this is not a real problem (and it never worked correctly anyway).

We can still add a regression test, which would prevent us from removing table location normalization without fixing the problem you found. Is it possible to write such a test? Note that a unit test on the HiveFileIterator level doesn't have this property, because it's independent of whether Hive locations are normalized or not.
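
A rough Python sketch of the property such a test would need, under stated assumptions: `normalize_location` and `list_table_files` are hypothetical stand-ins, not Trino APIs, and the key list simulates a bucket. The point is only that the check goes through the table-level entry point where normalization happens, so it starts failing if that step is removed.

```python
import re

def normalize_location(location):
    """Hypothetical normalization: collapse repeated slashes after the scheme."""
    scheme, sep, rest = location.partition("://")
    return scheme + sep + re.sub(r"/+", "/", rest)

def list_table_files(keys, location, normalize=True):
    """Hypothetical table-level listing: optionally normalize, then list by prefix."""
    if normalize:
        location = normalize_location(location)
    prefix = location.rstrip("/") + "/"
    return sorted(k for k in keys if k.startswith(prefix))

# Simulated bucket contents and a table location declared with a double slash.
keys = ["s3://bucket/schema/table/file"]
declared_location = "s3://bucket/schema//table"

# With normalization in place the file is found.
print(list_table_files(keys, declared_location, normalize=True))
# Without it, the same call finds nothing, which is what a regression test would catch.
print(list_table_files(keys, declared_location, normalize=False))
```

A test that calls the listing iterator directly with an already-built prefix would behave the same in both cases, which is why it cannot guard the normalization step.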

@github-actions github-actions bot added tests:hive hive Hive connector labels Jul 7, 2023
@findinpath findinpath force-pushed the findinpath/s3-too-slashed-recursive-listing branch from d01d43f to 30482e8 Compare July 7, 2023 15:40
@findinpath findinpath force-pushed the findinpath/s3-too-slashed-recursive-listing branch from 30482e8 to ed45c4d Compare July 7, 2023 15:53
@findinpath findinpath closed this Sep 22, 2023