feat: allow object store registration from datafusion-cli #3540
Conversation
datafusion-cli/src/main.rs
Outdated
#[clap(
    short = 'o',
    long = "object-store",
    help = "Register an object-store to Datafusion's execution context",
    requires = "bucket-name"
)]
object_store_option: Option<ObjectStoreRegistrationOptions>,
can this param be repetitive?
datafusion-cli/src/main.rs
Outdated
        );
    }
}
}
should abort with error if only one of two is supplied
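A minimal sketch of how that could be enforced with clap's derive API (assuming clap 3 as used by datafusion-cli, where derived argument ids are kebab-case; the field and flag names below just mirror the ones above): a reciprocal requires on the bucket argument makes clap abort with a usage error when only one of the two flags is supplied.

use clap::Parser;

#[derive(Debug, Parser)]
struct Args {
    // requires = "bucket-name" means --object-store cannot appear without --bucket-name
    #[clap(short = 'o', long = "object-store", requires = "bucket-name")]
    object_store: Option<String>,

    // the reciprocal requires rejects --bucket-name on its own
    #[clap(long = "bucket-name", requires = "object-store")]
    bucket_name: Option<String>,
}

fn main() {
    // clap exits with an error if only one of the two flags is given
    let args = Args::parse();
    println!("{args:?}");
}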
datafusion-cli/src/main.rs
Outdated
if let Ok(region) = env::var("AWS_REGION") {
    builder = builder.with_region(region);
}

if let Ok(access_key) = env::var("AWS_ACCESS_KEY_ID") {
    builder = builder.with_access_key_id(access_key);
}

if let Ok(key) = env::var("AWS_SECRET_ACCESS_KEY") {
    builder = builder.with_secret_access_key(key);
}

if let Ok(token) = env::var("AWS_SESSION_TOKEN") {
    builder = builder.with_token(token);
}
clap supports getting env var into param, see here
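For reference, a small sketch of clap's env support (this needs clap's "env" cargo feature; the field shown here is hypothetical): when the flag is not passed on the command line, clap falls back to the named environment variable.

use clap::Parser;

#[derive(Debug, Parser)]
struct Args {
    // reads AWS_REGION from the environment when --region is not given
    #[clap(long, env = "AWS_REGION")]
    region: Option<String>,
}

fn main() {
    let args = Args::parse();
    println!("{args:?}");
}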
This PR is super exciting! Thank you @turbo1912 and @jimexist
I think that rather than any options, the support for object_store should be invisible to the user. Just running the SQL from datafusion-cli should be enough to trigger the cli to register the s3 object store - for example, a CREATE EXTERNAL TABLE statement whose LOCATION points at an s3:// URL.
You could do that by creating the logical plan from the user's given SQL, then only registering an object store if they're creating an external table whose location is in the cloud. Something like:

let logical_plan = ctx.create_logical_plan(sql)?;
match &logical_plan {
    LogicalPlan::CreateExternalTable(external) => {
        if external.location.to_lowercase().starts_with("s3://") {
            // use https://docs.rs/object_store/latest/object_store/path/struct.Path.html
            // parse the path, find the bucket
            // register the object store
        }
    }
    _ => {}
}
// continue on and execute the logical plan
datafusion-cli/src/main.rs
Outdated
@@ -132,6 +179,48 @@ pub async fn main() -> Result<()> {
    Ok(())
}

fn build_s3_object_store_from_env(bucket_name: &str) -> Result<AmazonS3> {
oh haha didn't realize this existed. Thanks!
It only exists as of object_store v0.5.0 (cookbook still targets v0.4.0).
We should update the recipes to the latest version and leverage its features.
Yes @kmitchener! I thought of this as well, but wasn't sure whether every object-store type can be inferred from the URL, so I took the less risky approach that doesn't close any doors for the future. That being said, it looks like there is already some object-store code that makes assumptions about the URL. I am happy to go with this approach if there is interest! @alamb, @tustvold and @jimexist any thoughts?
I think it would be awesome to support this out of the box, and ObjectStoreRegistry should be able to handle this; after all, a non-trivial amount of effort went into the design of ObjectStoreUrl to make sure this works. In particular, I think it should just be a case of providing an ObjectStoreProvider that can create stores for the various different schemes. apache/arrow-rs#2304 may also be relevant, FYI @roeap
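A rough sketch of the shape such a provider could take (the actual ObjectStoreProvider trait and registration API vary between DataFusion versions, so the struct and method below are illustrative only): resolve the scheme and bucket from the table URL and build the store from the environment.

use std::sync::Arc;

use object_store::aws::AmazonS3Builder;
use object_store::ObjectStore;
use url::Url;

struct CliObjectStoreProvider;

impl CliObjectStoreProvider {
    // Given a table location such as s3://my-bucket/path/file.parquet,
    // build an object store for that bucket, taking credentials from the environment.
    fn get_by_url(&self, url: &Url) -> Option<Arc<dyn ObjectStore>> {
        match url.scheme() {
            "s3" => {
                let bucket = url.host_str()?;
                let s3 = AmazonS3Builder::from_env()
                    .with_bucket_name(bucket)
                    .build()
                    .ok()?;
                Some(Arc::new(s3) as Arc<dyn ObjectStore>)
            }
            // other schemes (gs://, az://, ...) could be handled the same way
            _ => None,
        }
    }
}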
In delta-rs I have been playing around with some of this, in the hope of eventually getting to a point where it can be moved upstream into object_store.

Also, I added parsing for some well-known URIs - especially Azure has many different variants :) (https://github.com/delta-io/delta-rs/blob/0fdaeed734b036d2bd7734eb3c4b3e56680dffff/rust/src/builder/mod.rs#L428-L537)

Last but not least, we had several users who wanted to manage connections to several stores, so we added some logic to process property bags with several commonly used aliases for some of the config keys, making sure that explicitly passed config always takes precedence over config taken from the environment. (https://github.com/delta-io/delta-rs/blob/main/rust/src/builder/azure.rs)

Not sure how much of this - if anything :) - might be useful here or is "worthy" of moving upstream :).
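The precedence rule described above (explicitly passed config beats the environment) boils down to something like the following hypothetical lookup:

use std::collections::HashMap;
use std::env;

// Prefer a value from the user-supplied property bag; otherwise fall back
// to the corresponding environment variable.
fn resolve_config(passed: &HashMap<String, String>, key: &str, env_key: &str) -> Option<String> {
    passed.get(key).cloned().or_else(|| env::var(env_key).ok())
}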
datafusion-cli/src/object_storage.rs
Outdated
match AmazonS3Builder::from_env().with_bucket_name(host).build() {
    Ok(s3) => Some(Arc::new(s3)),
    Err(_) => None,
}
ideally I would like to write this as:

AmazonS3Builder::from_env()
    .with_bucket_name(host)
    .build()
    .map(Arc::new)
    .ok();

but I am getting the following type error:

mismatched types
expected enum `Option<Arc<(dyn object_store::ObjectStore + 'static)>>`
   found enum `Option<Arc<AmazonS3>>`
You could write

Some(AmazonS3Builder::from_env().with_bucket_name(host).build().map(Arc::new).ok()?)

but it is a bit contrived, and I think what you have written is better. The problem is that whilst Arc<AmazonS3> can be coerced to Arc<dyn ObjectStore>, Option<Arc<AmazonS3>> can't be coerced to Option<Arc<dyn ObjectStore>>.
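One way to keep the combinator style is to make the unsize coercion happen on the Arc itself rather than on the surrounding Option, by casting to the trait object inside the map. A sketch (the function name is illustrative):

use std::sync::Arc;

use object_store::aws::AmazonS3Builder;
use object_store::ObjectStore;

fn s3_from_env(host: &str) -> Option<Arc<dyn ObjectStore>> {
    AmazonS3Builder::from_env()
        .with_bucket_name(host)
        .build()
        // cast the concrete Arc<AmazonS3> to Arc<dyn ObjectStore> here,
        // so no coercion of the surrounding Option is needed
        .map(|s3| Arc::new(s3) as Arc<dyn ObjectStore>)
        .ok()
}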
I don't think there would be any objections to changing
Love it, happy to see these things coming together. I've created #3584 to address the error handling. Other than that and some very minor nits, this looks good to go 👍
Just needs clippy to be pacified and then this can go in 👍
Thank you for this, this is really cool functionality to have out of the box
Benchmark runs are scheduled for baseline = 49b9c67 and contender = b625277. b625277 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
This is really neat -- thank you @turbo1912
+1. Thanks, really nice!
Which issue does this PR close?
Closes #3424.
Rationale for this change
This change provides an easy way for DataFusion CLI users to register object stores.
(inspiration from https://github.com/datafusion-contrib/datafusion-cookbook)
What changes are included in this PR?
Added environment variable support for registering S3 and GCP object stores. Currently users can only specify the bucket name from the CLI; the rest of the configuration is set up from environment variables.
Are there any user-facing changes?
DataFusion CLI users can now register an object store when starting datafusion-cli and then create external tables backed by object stores in the interactive shell:
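For example (the bucket and table names are hypothetical, and AWS credentials and region are assumed to be provided through the usual AWS_* environment variables before starting datafusion-cli):

-- inside the datafusion-cli interactive shell
CREATE EXTERNAL TABLE my_table
STORED AS PARQUET
LOCATION 's3://my-bucket/path/to/data/';

SELECT count(*) FROM my_table;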