Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add special handling for multiple trailing slashes in the parent directory #18155

Closed

Conversation

findinpath
Copy link
Contributor

@findinpath findinpath commented Jul 6, 2023

Description

AWS S3 object storage provider allows paths containing multiple trailing slashes

s3://bucket/path/directory//

This leads to the situation where a file with the following location

s3://bucket/path/directory//file

would have the parent

s3://bucket/path/directory//

This PR adds special handling in retrieving the parent directory for a file in case it contains trailing slashes.

Fixes #17966

Additional context and related issues

Release notes

(x) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

@findepi
Copy link
Member

findepi commented Jul 6, 2023

i am not great fan of this special handling / inconsistency:

parent("dir/name") -> "dir"  -- slash is stripped
parent("dir//name") -> "dir//"  -- no slash is stripped

Clearly parent("dir//name") has no well-defined, obviously-correct good return value, so i would prefer to throw instead. dir// may be good in some cases, but may be problematic in some other cases.

Fixes #17966

if the only fix possible would be modifying the Location behavior, I would rather block tables with location// OR normalize the location not to have trailing slashes (like Iceberg does).

If we want to fix this, I would look what it takes to support this on connector level.
See #17966 (comment) for some ideas.

@findinpath
Copy link
Contributor Author

findinpath commented Jul 6, 2023

i am not great fan of this special handling / inconsistency:

parent("dir/name") -> "dir"  -- slash is stripped
parent("dir//name") -> "dir//"  -- no slash is stripped

What if we return consistently a '/' at the end of the parent directory actually?

parent("dir/name") -> "dir/"
parent("dir//name") -> "dir//"

I'm fine with abandoning at the moment the current PR and disallowing on the Delta Lake connector CREATE TABLE with a location containing multiple trailing slashes.
I'd like however to have a better understanding why we find the current handling for retrieving the parent directory acceptable when it is clearly faulty:

assertParentDirectory("scheme://userInfo@host/path//name", Location.of("scheme://userInfo@host/path/"));

cc @electrum

@findepi
Copy link
Member

findepi commented Jul 6, 2023

I'd like however to have a better understanding why we find the current handling for retrieving the parent directory acceptable when it is clearly faulty:

i agree it's faulty and propose to throw for parent("dir//name")

@github-actions github-actions bot added tests:hive hive Hive connector labels Jul 6, 2023
@electrum
Copy link
Member

electrum commented Jul 6, 2023

I'm confused about the problem being solved here. The stack trace in the linked issue is using the deprecated Locations.getParent() method, but this is changing the behavior of the Location.parentDirectory() method.

What if we return consistently a '/' at the end of the parent directory actually?

parent("dir/name") -> "dir/"
parent("dir//name") -> "dir//"

This seems like the best behavior, as it will work consistently in both scenarios. It will break the behavior of location.parentDirectory().parentDirectory(), but that is already documented to not work. It will also match the documented behavior of the method:

Creates a new location with all characters removed after the last slash in the path.

An issue with the current behavior of stripping the trailing slash is that appendPath() treats zero or one trailing slash the same, thus we have the following inconsistent behavior today:

parentDirectory("dir/name").appendPath("abc") -> "dir/abc"
parentDirectory("dir//name").appendPath("abc") -> "dir/abc"
parentDirectory("dir///name").appendPath("abc") -> "dir//abc"

Consistently returning a slash will result in the following, which seems better:

parentDirectory("dir/name").appendPath("abc") -> "dir/abc"
parentDirectory("dir//name").appendPath("abc") -> "dir//abc"
parentDirectory("dir///name").appendPath("abc") -> "dir///abc"

@findinpath
Copy link
Contributor Author

Thank you for the feedback.
I'll close the current PR.

I'll create a new PR based on the feedback from #17966 (comment)
I'll create also another follow-up PR to see where "correcting" .parentDirectory() behavior will take us. Will link them here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed hive Hive connector
Development

Successfully merging this pull request may close these issues.

Writes to Delta table fail when location ends with two slashes
3 participants