-
Notifications
You must be signed in to change notification settings - Fork 326
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor S3 path handling #9092
Conversation
! S3 Directory Handling | ||
|
||
Note that regular S3 buckets do not have a 'native' notion of | ||
directories, instead they are emulated using prefixes and a delimiter | ||
(in Enso, the delimiter is set to "/"). | ||
|
||
The trailing slash determines if the given path is treated as a | ||
directory or as a regular file. | ||
|
||
Because of that, `root / "foo" / "bar"` will fail, as the entry denoted | ||
by "foo" will be treated as a regular file which cannot have children. | ||
Instead, you'd have to ensure that "foo" remains a directory by | ||
remembering about the slash: `root / "foo/" / "bar"`. This can be | ||
written shorter using just one operation: `root / "foo/bar"`. | ||
|
||
See: https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-folders.html | ||
/ : Text -> S3_File | ||
/ self subpath = if self.is_directory.not then Error.throw (S3_Error.Error "Only folders can have children." self.uri) else |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought it will be good to add a user-facing doc explaining this, as not everyone may be familiar.
I'm not really sure if the described behaviour is best:
root / "foo" / "bar"
not working and requiring instead to remember about the trailing slash:
root / "foo/" / "bar"
is a bit annoying.
We could make it so that even though root / "foo"
is interpreted as a regular file foo
, if /
is used with it, we 'reinterpret' it as a directory and add the slash ourselves. It seems pragmatic and could be more useful for our users.
On the other hand, it will make code like root / "picture.png" / "sound.wav"
work OK and create a file at s3://bucket/picture.png/sound.wav
which is technically OK for S3, but seems not quite desired.
Still I think it would not be harmful, so I'm leaning on the side of being more lenient with this check here. @jdunkerley @AdRiley @Cassandra-Clark @GregoryTravis what do you think about that?
It is also tempting to have a 'heuristic'. If the name contains dots it's likely a filename, so it would not work, but otherwise it could. So
root / "foo" / "bar"
would be ok butroot / "a.png" / "b.bmp"
would not. But I don't think such a heuristic is OK as then it will also rejectroot / "2024.02.19" / "mylog.txt"
and require the user to writeroot / "2024.02.19/" / "mylog.txt"
. So such a heuristic would be too surprising IMO and if anything, I'd prefer just a more lenient check.
One more thing to consider is how then S3_File.list
should behave?
Nominally, root / "foo"
is a regular file, so one cannot list it.
But once we do root / "foo" / "bar"
and we say it's allowed, one would be tempted to also want to (root / "foo").list
. But currently that will not work as one cannot list a regular file. I feel like it may actually be dangerous to make S3 files such 'Schrödinger files/directories' which may behave like a file and directory depending on context. Because of that, after further consideration I think the current behaviour is better as it avoids surprises, even if the requirement to remember about the trailing /
may be slightly annoying.
I'm still curious what others think here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the lenient approach is more like what users will expect.
Maybe for the listing example you give, we would require the trailing slash:
(root / "foo/").list
This way, users will have to explicitly indicate that they mean to usee foo
as a directory. But within a path like root / "foo" / "bar"
, it would be implicit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although not ideal. I think if a user asks for root / "picture.png" / "sound.wav" then that is what they get. Windows also supports . in folder names so this isn't too unique to S3.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess the difference in S3 is picture.png can be a file and a folder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But yes agree with Greg that users will expect the more lenient approach
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback, in that case I'm implementing that :)
! S3 Directory Handling | ||
|
||
Note that regular S3 buckets do not have a 'native' notion of | ||
directories, instead they are emulated using prefixes and a delimiter | ||
(in Enso, the delimiter is set to "/"). | ||
|
||
The trailing slash determines if the given path is treated as a | ||
directory or as a regular file. | ||
|
||
Because of that, `root / "foo" / "bar"` will fail, as the entry denoted | ||
by "foo" will be treated as a regular file which cannot have children. | ||
Instead, you'd have to ensure that "foo" remains a directory by | ||
remembering about the slash: `root / "foo/" / "bar"`. This can be | ||
written shorter using just one operation: `root / "foo/bar"`. | ||
|
||
See: https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-folders.html | ||
/ : Text -> S3_File | ||
/ self subpath = if self.is_directory.not then Error.throw (S3_Error.Error "Only folders can have children." self.uri) else |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the lenient approach is more like what users will expect.
Maybe for the listing example you give, we would require the trailing slash:
(root / "foo/").list
This way, users will have to explicitly indicate that they mean to usee foo
as a directory. But within a path like root / "foo" / "bar"
, it would be implicit.
# Conflicts: # distribution/lib/Standard/AWS/0.0.0-dev/src/S3/S3_File.enso
I re-requested the reviews, as I have made significant changes to the path handling logic.
|
group_builder.specify "will normalize paths upon parsing" <| | ||
S3_File.new "s3://bucketA/a/b/c/../././d/../e" . path . should_equal "s3://bucketA/a/b/e" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added a test showing that we normalize the ..
s in the S3_File.new
constructor.
I added this, because we do such normalization in the /
operator anyway.
IMO it would be weird if S3_File.new "s3://a/b/../c"
denoted a file with key b/../c
but after S3_File.new "s3://a/b/../c" / "d"
that would now be transformed into key c/d
because of the normalization only happening after /
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, we have to keep in mind that, while un-intuitive, it is technically possible to have an S3 file with a key a/../b
. The ..
does not have any special meaning in S3 as far as I can tell. Moreover, the /
is not very special either - the delimiter to use for 'virtual' directories on S3 can be customized to be any other character.
I think in the future we may want to have something like raw_uri=True
optional argument
(or whatever other name, e.g. emulate_directories=False
) which will allow the user to skip the ..
normalization and access such weird files.
But I think we are fine for now. Worst case the user can use the lower-level S3
methods that do not do any parsing like this. It is a trade-off between S3_File
being more file-like and also handling such weird conventions.
This feels much more intuitive now |
Pull Request Description
Important Notes
Checklist
Please ensure that the following checklist has been satisfied before submitting the PR:
Scala,
Java,
and
Rust
style guides. In case you are using a language not listed above, follow the Rust style guide.
./run ide build
.