Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow plugins to add custom schema/authority URL handler rules #17898

Merged
merged 7 commits into from
Jan 4, 2023

Conversation

thejcannon
Copy link
Member

@thejcannon thejcannon commented Dec 29, 2022

At <company> we're looking at either using a versioned S3 bucket or Artifactory for binary artifact storage. In either case, I'd love to use http_source on a file/resource to have Pants manage the download/caching. But in order to do so Pants would need to make the request with the right auth, which it can't to today.

Ideally, the solution to this problem is flexible enough to handle either case (and many more), so that usually means unions get involved.

This change does 3 things:

  1. adds the ability for plugins to add custom rules which handle file download matched by scheme or authority. This is automatically hooked into by the DownloadFile node (by renaming the existing class NativeDownloadFile for the intrinsic input). The behavior is first-one-wins and existing-behavior-by-default, and the implementing rule must return the URL's Digest.
  2. adds a auth_headers mapping to NativeDownloadFile to be used when making the request but NOT part of the underlying cache key. The expectation being that most authenticated HTTP requests just need the right auth header. The name auth_headers is purposefully a misnomer, as any header could be specified, but is named to nudge the user to only specifying headers required for auth (and is also a hint to the possible sensitivity of the data).
  3. Adds a url handler for S3. Both for exposition and because S3 is an extremely popular storage solution

@thejcannon
Copy link
Member Author

Pending tests, then no longer WIP.

@thejcannon thejcannon changed the title WIP: Allow plugins to add custom schema/authority handler rules WIP: Allow plugins to add custom schema/authority URL handler rules Dec 29, 2022
Copy link
Member

@kaos kaos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool stuff :)

How's the plan to satisfy the botocore dependency when using the S3 url handler?

@thejcannon
Copy link
Member Author

You can add botocore to plugins (the test will exercise this). Or if you're me you'll add botocore@git+https://github.com/thejcannon/botocore.git@8de7ace79d71c6f2bc3e02875075b834ada6e6ed which isn't 70MB lol

@kaos
Copy link
Member

kaos commented Dec 29, 2022

You can add botocore to plugins (the test will exercise this). Or if you're me you'll add botocore@git+https://github.com/thejcannon/botocore.git@8de7ace79d71c6f2bc3e02875075b834ada6e6ed which isn't 70MB lol

Hah, neat trick with the custom fork.. been thinking about doing the same for tensorflow in hopes of reducing the 700MB payload for that dep alone.

I saw the plugins option to solve it, but I'm hesitant if we really want to go down that path as it would mean there'll be a dependency to add to your plugins whenever you enable the S3 backend.

I would hope there'd be something more automagical perhaps. What if we introduce "extras" on the pantsbuild.pants dist, and the bootstrap uses the proper extras based on enabled backends (for those that have special requirements)?
Another option I see could be to package the backend itself as a dedicated dist (or grouped together with other backends) in order to have the package dependency resolution as usual (this is the option I've leaned towards for the explorer backend; not finalized yet i.e. not publishing it yet)

@thejcannon
Copy link
Member Author

thejcannon commented Dec 29, 2022

This mirrors --stats-log's dependency on hdrhistogram:

For histogram summaries to work, you must add `hdrhistogram` to `[GLOBAL].plugins`.

The alternative is requesting a PEX with botocore in it (or equivalent). I'd honestly love that but it would require that we generate the auth token in a PEX subprocess which seems waaaay less secure 😓

If/when we peel off "standard-but-separate" plugins, having it depend on botocore would be ideal.

@kaos
Copy link
Member

kaos commented Dec 29, 2022

Huh, OK then. Still think it's unfortunate, but nothing new. I'll ponder on if we can't find a nicer way to handle this in the future.. :)

@thejcannon thejcannon changed the title WIP: Allow plugins to add custom schema/authority URL handler rules Allow plugins to add custom schema/authority URL handler rules Dec 29, 2022
src/python/pants/engine/download_file.py Show resolved Hide resolved
src/rust/engine/src/downloads.rs Outdated Show resolved Hide resolved
let mut headers = HeaderMap::new();
for (k, v) in auth_headers.iter() {
headers.insert(
HeaderName::from_bytes(k.as_bytes()).unwrap(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe avoid the .unwrap and return an error instead (using ? operator to just do it inline)? I.e., .map_err(|err| ...)?.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment for the following line.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case it'd be a dev mistake, so I'd rather crash-and-burn to make it obvious.

register both.
"""

match_authority: ClassVar[Optional[str]] = None
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing to consider is allowing wildcards. I suspect at <company> we'll want to use https://*.s3.amazonaws.com. I can upstream that as well.

@thejcannon
Copy link
Member Author

I'll also admit I'm not sure how all this works with regions. If anyone wants to try with a non us-east-1 region and see if it crashes, please let me know!

Copy link
Member

@kaos kaos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@thejcannon
Copy link
Member Author

Just tried it on private artifact and it went boom. I'll post the code soon, and fix the tests later.

@thejcannon
Copy link
Member Author

oh and i want to test on a box with no creds, as some files are public

Copy link
Member

@kaos kaos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With #17901 landed, perhaps you want to go back to using None as your mock result sentinel value..? :)

virtual_hosted_url = f"https://{request.bucket}.s3"
if request.region:
virtual_hosted_url += f".{request.region}"
virtual_hosted_url += f".amazonaws.com/{request.key}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is request.key guaranteed to not start with a /?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. I should add an assert

Comment on lines +252 to +264
@frozen_after_init
@dataclass(unsafe_hash=True)
class NativeDownloadFile:
"""Retrieve the contents of a file via an HTTP GET request or directly for local file:// URLs.

This request is handled directly by the native engine without any additional coercion by plugins,
and therefore should only be used in cases where the URL is known to be publicly accessible.
Otherwise, callers should use `DownloadFile`.

The auth_headers are part of this nodes' cache key for memoization (changing a header invalidates
prior results) but are not part of the underlying cache key for the local/remote cache (changing
a header won't re-download a file if the file was previously downloaded).
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, this extra overloading is slightly odd, because as you mention, the "native" implementation already supports multiple URL schemes.

What was the reasoning behind doing this in Python via a union, rather than by adding another implementation on the rust side? A preference for boto? A binary size concern? A desire to use the @union to plug in other private implementations? Because it seems like it would be simpler to continue to expand the existing scheme based switching on the Rust side (which would get you retry for the auth portion as well), rather than have two code paths parsing URLs to decide which implementation to use.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a closed PR in your inbox that has this on the rust side... But much less powerful.

Three big things stick out:

  • handling auth headers for any URL. As I said in the PR description we might use artifactory, so vanilla http with an auth header is required. That's currently impossible to have rust-side with a flexible enough client-side
  • s3 URLs come in many flavors as the S3 tests show. Trying to match them all is folly.
  • bloating the engine binary with AWS is just plain rude

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this be clearer if we just made this type private?

Copy link
Member

@stuhood stuhood Jan 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this be clearer if we just made this type private?

Yea, I would feel better about that. I know that we don't have any explicitly public / stable plugin APIs, but it'd be good not to commit to this being an extension point if we think that support for s3 / auth are things that we should support natively (I think that they are.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, the native type would be private. The extension point would remain public.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would you use the extension point if the native type was private?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same way I am today, but with an underscore 😌

More honestly, we'd just keep usage internal to the repo I guess

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh maybe there's confusion. Plugin code is currently using DownloadFile, which will become the extension point. Then the extender(?) Is the one using the "native" type.

I have an example usage in this PR for S3, check it out!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh and maybe you didn't see, the plugin implementation boils down to NativeDownloadFile, so we still get retry and fast-return-if-cached

@stuhood
Copy link
Member

stuhood commented Jan 4, 2023

You’ve put a lot of effort into this, and I think that we could probably port it elsewhere and deprecate the existing API fairly easily if we find a better solution. Thanks!

@thejcannon thejcannon merged commit d098b58 into pantsbuild:main Jan 4, 2023
@thejcannon thejcannon deleted the user-defined-download branch January 4, 2023 18:03
@thejcannon
Copy link
Member Author

To wrap up the botocore issue above: https://pypi.org/project/botocore-a-la-carte/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants