Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor S3 path handling #9092

Merged
merged 24 commits into from
Feb 22, 2024
Merged

Refactor S3 path handling #9092

merged 24 commits into from
Feb 22, 2024

Conversation

radeusgd
Copy link
Member

@radeusgd radeusgd commented Feb 19, 2024

Pull Request Description

Important Notes

Checklist

Please ensure that the following checklist has been satisfied before submitting the PR:

  • The documentation has been updated, if necessary.
  • Screenshots/screencasts have been attached, if there are any visual changes. For interactive or animated visual changes, a screencast is preferred.
  • All code follows the
    Scala,
    Java,
    and
    Rust
    style guides. In case you are using a language not listed above, follow the Rust style guide.
  • All code has been tested:
    • Unit tests have been written where possible.
    • If GUI codebase was changed, the GUI was tested when built using ./run ide build.

@radeusgd radeusgd added the CI: No changelog needed Do not require a changelog entry for this PR. label Feb 19, 2024
@radeusgd radeusgd self-assigned this Feb 19, 2024
Comment on lines 229 to 246
! S3 Directory Handling

Note that regular S3 buckets do not have a 'native' notion of
directories, instead they are emulated using prefixes and a delimiter
(in Enso, the delimiter is set to "/").

The trailing slash determines if the given path is treated as a
directory or as a regular file.

Because of that, `root / "foo" / "bar"` will fail, as the entry denoted
by "foo" will be treated as a regular file which cannot have children.
Instead, you'd have to ensure that "foo" remains a directory by
remembering about the slash: `root / "foo/" / "bar"`. This can be
written shorter using just one operation: `root / "foo/bar"`.

See: https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-folders.html
/ : Text -> S3_File
/ self subpath = if self.is_directory.not then Error.throw (S3_Error.Error "Only folders can have children." self.uri) else
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it will be good to add a user-facing doc explaining this, as not everyone may be familiar.

I'm not really sure if the described behaviour is best:

root / "foo" / "bar"

not working and requiring instead to remember about the trailing slash:

root / "foo/" / "bar"

is a bit annoying.

We could make it so that even though root / "foo" is interpreted as a regular file foo, if / is used with it, we 'reinterpret' it as a directory and add the slash ourselves. It seems pragmatic and could be more useful for our users.

On the other hand, it will make code like root / "picture.png" / "sound.wav" work OK and create a file at s3://bucket/picture.png/sound.wav which is technically OK for S3, but seems not quite desired.

Still I think it would not be harmful, so I'm leaning on the side of being more lenient with this check here. @jdunkerley @AdRiley @Cassandra-Clark @GregoryTravis what do you think about that?


It is also tempting to have a 'heuristic'. If the name contains dots it's likely a filename, so it would not work, but otherwise it could. So root / "foo" / "bar" would be ok but root / "a.png" / "b.bmp" would not. But I don't think such a heuristic is OK as then it will also reject root / "2024.02.19" / "mylog.txt" and require the user to write root / "2024.02.19/" / "mylog.txt". So such a heuristic would be too surprising IMO and if anything, I'd prefer just a more lenient check.


One more thing to consider is how then S3_File.list should behave?

Nominally, root / "foo" is a regular file, so one cannot list it.
But once we do root / "foo" / "bar" and we say it's allowed, one would be tempted to also want to (root / "foo").list. But currently that will not work as one cannot list a regular file. I feel like it may actually be dangerous to make S3 files such 'Schrödinger files/directories' which may behave like a file and directory depending on context. Because of that, after further consideration I think the current behaviour is better as it avoids surprises, even if the requirement to remember about the trailing / may be slightly annoying.

I'm still curious what others think here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the lenient approach is more like what users will expect.

Maybe for the listing example you give, we would require the trailing slash:

(root / "foo/").list

This way, users will have to explicitly indicate that they mean to usee foo as a directory. But within a path like root / "foo" / "bar", it would be implicit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although not ideal. I think if a user asks for root / "picture.png" / "sound.wav" then that is what they get. Windows also supports . in folder names so this isn't too unique to S3.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the difference in S3 is picture.png can be a file and a folder.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But yes agree with Greg that users will expect the more lenient approach

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback, in that case I'm implementing that :)

Comment on lines 229 to 246
! S3 Directory Handling

Note that regular S3 buckets do not have a 'native' notion of
directories, instead they are emulated using prefixes and a delimiter
(in Enso, the delimiter is set to "/").

The trailing slash determines if the given path is treated as a
directory or as a regular file.

Because of that, `root / "foo" / "bar"` will fail, as the entry denoted
by "foo" will be treated as a regular file which cannot have children.
Instead, you'd have to ensure that "foo" remains a directory by
remembering about the slash: `root / "foo/" / "bar"`. This can be
written shorter using just one operation: `root / "foo/bar"`.

See: https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-folders.html
/ : Text -> S3_File
/ self subpath = if self.is_directory.not then Error.throw (S3_Error.Error "Only folders can have children." self.uri) else
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the lenient approach is more like what users will expect.

Maybe for the listing example you give, we would require the trailing slash:

(root / "foo/").list

This way, users will have to explicitly indicate that they mean to usee foo as a directory. But within a path like root / "foo" / "bar", it would be implicit.

# Conflicts:
#	distribution/lib/Standard/AWS/0.0.0-dev/src/S3/S3_File.enso
@radeusgd
Copy link
Member Author

I re-requested the reviews, as I have made significant changes to the path handling logic.

  1. I have implemented the suggested approach - now root / "foo" / "bar" works OK.
  2. Added more test cases as the coverage wasn't very good for the many edge cases of path handling.
  3. I have now refactored the (1) splitting into logical parts, (2) joining paths, (3) normalizing a path containing .. and . and (4) re-constructing the S3 key into separate parts, I think the code is much cleaner now.

Comment on lines +84 to +85
group_builder.specify "will normalize paths upon parsing" <|
S3_File.new "s3://bucketA/a/b/c/../././d/../e" . path . should_equal "s3://bucketA/a/b/e"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added a test showing that we normalize the ..s in the S3_File.new constructor.

I added this, because we do such normalization in the / operator anyway.

IMO it would be weird if S3_File.new "s3://a/b/../c" denoted a file with key b/../c but after S3_File.new "s3://a/b/../c" / "d" that would now be transformed into key c/d because of the normalization only happening after /.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, we have to keep in mind that, while un-intuitive, it is technically possible to have an S3 file with a key a/../b. The .. does not have any special meaning in S3 as far as I can tell. Moreover, the / is not very special either - the delimiter to use for 'virtual' directories on S3 can be customized to be any other character.

I think in the future we may want to have something like raw_uri=True optional argument
(or whatever other name, e.g. emulate_directories=False) which will allow the user to skip the .. normalization and access such weird files.

But I think we are fine for now. Worst case the user can use the lower-level S3 methods that do not do any parsing like this. It is a trade-off between S3_File being more file-like and also handling such weird conventions.

@AdRiley
Copy link
Member

AdRiley commented Feb 21, 2024

This feels much more intuitive now

@radeusgd radeusgd added the CI: Ready to merge This PR is eligible for automatic merge label Feb 22, 2024
@mergify mergify bot merged commit d845e70 into develop Feb 22, 2024
29 checks passed
@mergify mergify bot deleted the wip/radeusgd/s3-refactor-paths branch February 22, 2024 09:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI: No changelog needed Do not require a changelog entry for this PR. CI: Ready to merge This PR is eligible for automatic merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Refactor: move S3 path handling logic to a separate file
3 participants