Allow plugins to add custom schema/authority URL handler rules #17898
Conversation
Pending tests, then no longer WIP.
Cool stuff :)

How's the plan to satisfy the `botocore` dependency when using the S3 url handler? You can add
Hah, neat trick with the custom fork.. been thinking about doing the same for tensorflow in hopes of reducing the 700MB payload for that dep alone. I saw the

I would hope there'd be something more automagical perhaps. What if we introduce "extras" on the
This mirrors

The alternative is requesting a PEX with

If/when we peel off "standard-but-separate" plugins, having it depend on
Huh, OK then. Still think it's unfortunate, but nothing new. I'll ponder on if we can't find a nicer way to handle this in the future.. :)
```rust
let mut headers = HeaderMap::new();
for (k, v) in auth_headers.iter() {
    headers.insert(
        HeaderName::from_bytes(k.as_bytes()).unwrap(),
```
Maybe avoid the `.unwrap` and return an error instead (using the `?` operator to just do it inline)? I.e., `.map_err(|err| ...)?`.
Same comment for the following line.
In that case it'd be a dev mistake, so I'd rather crash-and-burn to make it obvious.
```python
    register both.
    """

    match_authority: ClassVar[Optional[str]] = None
```
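As a rough sketch of how a plugin-side handler declaration might look based on the snippet above (the class name and the `match_scheme` counterpart are assumptions for illustration, not necessarily the final registration API):

```python
from typing import ClassVar, Optional


class URLDownloadHandler:
    """Illustrative base class for the union: a subclass claims URLs by
    scheme and/or authority; if it declares both, both must match."""

    match_scheme: ClassVar[Optional[str]] = None
    match_authority: ClassVar[Optional[str]] = None


class S3SchemeHandler(URLDownloadHandler):
    # Claims any URL of the form s3://...
    match_scheme = "s3"


class S3AuthorityHandler(URLDownloadHandler):
    # Claims https://<bucket>.s3.amazonaws.com/... style URLs.
    match_authority = "*.s3.amazonaws.com"
```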
One thing to consider is allowing wildcards. I suspect at <company> we'll want to use `https://*.s3.amazonaws.com`. I can upstream that as well.

I'll also admit I'm not sure how all this works with regions. If anyone wants to try with a non
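Wildcard authority matching like `*.s3.amazonaws.com` could be done with the standard library's `fnmatch` against the URL's netloc (a sketch of one possible approach, not necessarily what landed upstream):

```python
from fnmatch import fnmatch
from urllib.parse import urlparse


def authority_matches(url: str, pattern: str) -> bool:
    # The "authority" is the netloc portion of the URL (host[:port]).
    # fnmatch gives shell-style wildcards: * matches any run of chars.
    return fnmatch(urlparse(url).netloc, pattern)
```

For example, `authority_matches("https://mybucket.s3.amazonaws.com/key", "*.s3.amazonaws.com")` is true, while a non-S3 host does not match.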
lgtm
Just tried it on a private artifact and it went boom. I'll post the code soon, and fix the tests later.

oh and i want to test on a box with no creds, as some files are public
With #17901 landed, perhaps you want to go back to using `None` as your mock result sentinel value..? :)
```python
virtual_hosted_url = f"https://{request.bucket}.s3"
if request.region:
    virtual_hosted_url += f".{request.region}"
virtual_hosted_url += f".amazonaws.com/{request.key}"
```
Is `request.key` guaranteed to not start with a `/`?
Yeah. I should add an `assert`.
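A self-contained version of the URL construction above, with the suggested `assert` guarding against keys that start with `/` (the `S3Request` dataclass here is illustrative, mirroring the fields used in the snippet):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class S3Request:
    bucket: str
    key: str
    region: Optional[str] = None


def virtual_hosted_url(request: S3Request) -> str:
    # A leading "/" on the key would produce a double slash in the URL.
    assert not request.key.startswith("/"), f"key must be relative: {request.key!r}"
    url = f"https://{request.bucket}.s3"
    if request.region:
        url += f".{request.region}"
    url += f".amazonaws.com/{request.key}"
    return url
```

For example, `virtual_hosted_url(S3Request("b", "some/key", "us-west-2"))` yields `https://b.s3.us-west-2.amazonaws.com/some/key`, and omitting the region drops the `.us-west-2` segment.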
```python
@frozen_after_init
@dataclass(unsafe_hash=True)
class NativeDownloadFile:
    """Retrieve the contents of a file via an HTTP GET request or directly for local file:// URLs.

    This request is handled directly by the native engine without any additional coercion by plugins,
    and therefore should only be used in cases where the URL is known to be publicly accessible.
    Otherwise, callers should use `DownloadFile`.

    The auth_headers are part of this node's cache key for memoization (changing a header invalidates
    prior results) but are not part of the underlying cache key for the local/remote cache (changing
    a header won't re-download a file if the file was previously downloaded).
    """
```
So, this extra overloading is slightly odd, because as you mention, the "native" implementation already supports multiple URL schemes.

What was the reasoning behind doing this in Python via a union, rather than by adding another implementation on the rust side? A preference for `boto`? A binary size concern? A desire to use the `@union` to plug in other private implementations? Because it seems like it would be simpler to continue to expand the existing scheme-based switching on the Rust side (which would get you retry for the auth portion as well), rather than have two code paths parsing URLs to decide which implementation to use.
There's a closed PR in your inbox that has this on the rust side... but much less powerful.

Three big things stick out:
- Handling auth headers for any URL. As I said in the PR description, we might use Artifactory, so vanilla HTTP with an auth header is required. That's currently impossible to have rust-side with a flexible enough client-side.
- S3 URLs come in many flavors, as the S3 tests show. Trying to match them all is folly.
- Bloating the engine binary with AWS is just plain rude.
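On the "many flavors" point: S3 exposes at least `s3://`, path-style, and virtual-hosted-style URLs for the same object, which is what makes matching them all rust-side painful. A rough, non-exhaustive sketch of normalizing a few common forms (regional and legacy endpoint variants add still more):

```python
from urllib.parse import urlparse


def parse_s3_url(url: str) -> tuple:
    """Return (bucket, key) for a few common S3 URL flavors."""
    parts = urlparse(url)
    host, path = parts.netloc, parts.path.lstrip("/")
    if parts.scheme == "s3":
        # s3://bucket/key
        return parts.netloc, path
    if host == "s3.amazonaws.com" or host.startswith("s3."):
        # path-style: https://s3[.region].amazonaws.com/bucket/key
        bucket, _, key = path.partition("/")
        return bucket, key
    # virtual-hosted style: https://bucket.s3[.region].amazonaws.com/key
    return host.split(".s3")[0], path


# All three refer to the same object:
assert parse_s3_url("s3://mybucket/some/key") == ("mybucket", "some/key")
assert parse_s3_url("https://s3.amazonaws.com/mybucket/some/key") == ("mybucket", "some/key")
assert parse_s3_url("https://mybucket.s3.us-west-2.amazonaws.com/some/key") == ("mybucket", "some/key")
```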
Would this be clearer if we just made this type private?
> Would this be clearer if we just made this type private?

Yea, I would feel better about that. I know that we don't have any explicitly public / stable plugin APIs, but it'd be good not to commit to this being an extension point if we think that support for s3 / auth are things that we should support natively (I think that they are.)
To be clear, the native type would be private. The extension point would remain public.
How would you use the extension point if the native type was private?
Same way I am today, but with an underscore 😌
More honestly, we'd just keep usage internal to the repo I guess
Oh, maybe there's confusion. Plugin code is currently using `DownloadFile`, which will become the extension point. Then the extender(?) is the one using the "native" type.

I have an example usage in this PR for S3, check it out!
Oh and maybe you didn't see, the plugin implementation boils down to `NativeDownloadFile`, so we still get retry and fast-return-if-cached.

You’ve put a lot of effort into this, and I think that we could probably port it elsewhere and deprecate the existing API fairly easily if we find a better solution. Thanks!
To wrap up the
At <company> we're looking at either using a versioned S3 bucket or Artifactory for binary artifact storage. In either case, I'd love to use `http_source` on a `file`/`resource` to have Pants manage the download/caching. But in order to do so, Pants would need to make the request with the right auth, which it can't do today. Ideally, the solution to this problem is flexible enough to handle either case (and many more), so that usually means unions get involved.
This change does 3 things:
1. `DownloadFile` node (by renaming the existing class `NativeDownloadFile` for the intrinsic input). The behavior is first-one-wins and existing-behavior-by-default, and the implementing rule must return the URL's `Digest`.
2. `auth_headers` mapping to `NativeDownloadFile` to be used when making the request but NOT part of the underlying cache key. The expectation being that most authenticated HTTP requests just need the right auth header. The name `auth_headers` is purposefully a misnomer, as any header could be specified, but is named to nudge the user to only specifying headers required for auth (and is also a hint to the possible sensitivity of the data).
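As the description notes, most authenticated endpoints only need a single header. A generic standard-library illustration of the idea, independent of the Pants API (the URL and token are placeholders):

```python
import urllib.request

# e.g. an Artifactory setup may only need an Authorization header.
auth_headers = {"Authorization": "Bearer <token>"}

req = urllib.request.Request(
    "https://artifacts.example.com/path/to/file.tar.gz",
    headers=auth_headers,
)
# The header travels with the request; nothing else about the
# download path needs to change.
assert req.get_header("Authorization") == "Bearer <token>"
```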