
feat: allow object store registration from datafusion-cli #3540

Merged
merged 1 commit into apache:master from cli-object-store on Sep 23, 2022

Conversation

turbo1912
Contributor

Which issue does this PR close?

Closes #3424.

Rationale for this change

This change provides an easy way for datafusion-cli users to register object stores.

(inspiration from https://github.com/datafusion-contrib/datafusion-cookbook)

What changes are included in this PR?

Added environment variable support for registering S3 and GCP object stores. Currently users can only specify the bucket name from the CLI; the rest of the configuration is set up from environment variables.

Are there any user-facing changes?

datafusion-cli users can now register an object store when starting the CLI:

datafusion-cli --object-store s3 --bucket-name name

and then create external tables from object stores in the interactive shell:

create external table unicorns stored as parquet location 's3://my_bucket/lineitem/';

@turbo1912 turbo1912 force-pushed the cli-object-store branch 2 times, most recently from 873019b to 2df56c3 Compare September 20, 2022 01:09
Comment on lines 85 to 91
#[clap(
short = 'o',
long = "object-store",
help = "Register an object-store to Datafusion's execution context",
requires = "bucket-name"
)]
object_store_option: Option<ObjectStoreRegistrationOptions>,
Member

can this param be repeated?

Member

should abort with error if only one of two is supplied

Comment on lines 185 to 199
if let Ok(region) = env::var("AWS_REGION") {
builder = builder.with_region(region);
}

if let Ok(access_key) = env::var("AWS_ACCESS_KEY_ID") {
builder = builder.with_access_key_id(access_key);
}

if let Ok(key) = env::var("AWS_SECRET_ACCESS_KEY") {
builder = builder.with_secret_access_key(key);
}

if let Ok(token) = env::var("AWS_SESSION_TOKEN") {
builder = builder.with_token(token);
}
Member

clap supports populating a param from an env var

@alamb
Contributor

alamb commented Sep 20, 2022

This PR is super exciting! Thank you @turbo1912 and @jimexist

@kmitchener
Contributor

I think that, rather than adding any options, the support for object_store should be invisible to the user. Just running this SQL from datafusion-cli should be enough to trigger the CLI to register the s3 object store:

create external table unicorns stored as parquet location 's3://my_bucket/lineitem/';

You could do that by creating the logical plan from the user's given SQL, then only registering an object store if they're creating an external location that's in the cloud. Something like:

    let logical_plan = ctx.create_logical_plan(sql)?;
    match logical_plan {
        LogicalPlan::CreateExternalTable(external) => {
            if external.location.to_lowercase().starts_with("s3://") {
                // use https://docs.rs/object_store/latest/object_store/path/struct.Path.html
                // parse the path, find the bucket
                // register the object store 
            }
        }
        _ => {},
    }
    // continue on and execute the logical plan

@@ -132,6 +179,48 @@ pub async fn main() -> Result<()> {
Ok(())
}

fn build_s3_object_store_from_env(bucket_name: &str) -> Result<AmazonS3> {

Contributor Author

oh haha didn't realize this existed. Thanks!

Contributor

It only exists as of object_store v0.5.0 (cookbook still targets v0.4.0).

Should update the recipes to the latest version and leverage their features.

@turbo1912
Contributor Author

> I think that rather than any options, the support for object_store should be invisible to the user. So just running this SQL from datafusion-cli should be enough to trigger the cli to register the s3 objectstore:
>
> create external table unicorns stored as parquet location 's3://my_bucket/lineitem/';
>
> You could do that by creating the logical plan from the user's given SQL, then only registering an object store if they're creating an external location that's in the cloud. Something like:
>
>     let logical_plan = ctx.create_logical_plan(sql)?;
>     match logical_plan {
>         LogicalPlan::CreateExternalTable(external) => {
>             if external.location.to_lowercase().starts_with("s3://") {
>                 // use https://docs.rs/object_store/latest/object_store/path/struct.Path.html
>                 // parse the path, find the bucket
>                 // register the object store
>             }
>         }
>         _ => {},
>     }
>     // continue on and execute the logical plan

Yes @kmitchener! I thought of this as well, but I wasn't sure whether every object-store type can be inferred from the url, so I took the less risky approach, which doesn't close any doors for the future. For example, can file://localhost mean multiple things?

That being said, it looks like there is already some object-store code that makes some assumptions about the url.

I am happy to go with this approach if there is interest! @alamb, @tustvold and @jimexist any thoughts?

@tustvold
Contributor

tustvold commented Sep 20, 2022

I think it would be awesome to support this out of the box, and ObjectStoreRegistry should be able to handle it; after all, a non-trivial amount of effort went into the design of ObjectStoreUrl to make sure this works. In particular, I think it should just be a case of providing an ObjectStoreProvider that can create stores for the various different schemes.

apache/arrow-rs#2304 may also be relevant, FYI @roeap

@roeap
Contributor

roeap commented Sep 20, 2022

In delta-rs I have been playing around with some of this, in the hope of eventually moving it upstream into object store. Specifically, I took the ObjectStoreUrl and gave it some "awareness" of which service it is referencing (s3 etc.). (https://github.com/delta-io/delta-rs/blob/0fdaeed734b036d2bd7734eb3c4b3e56680dffff/rust/src/builder/mod.rs#L281-L407)

Also, I added parsing for some well known uris - especially azure has many different variants :) (https://github.com/delta-io/delta-rs/blob/0fdaeed734b036d2bd7734eb3c4b3e56680dffff/rust/src/builder/mod.rs#L428-L537)

Last but not least, we had several users who wanted to manage connections to several stores, so we added some logic to process property bags with several commonly used aliases for some of the config keys, making sure that passed config always takes precedence over config taken from the environment. (https://github.com/delta-io/delta-rs/blob/main/rust/src/builder/azure.rs)

Not sure how much of this - if anything :) - might be useful here or is "worthy" of moving upstream :).

@turbo1912 turbo1912 force-pushed the cli-object-store branch 2 times, most recently from dc19f1c to 42e1982 Compare September 22, 2022 06:24
Comment on lines 56 to 59
match AmazonS3Builder::from_env().with_bucket_name(host).build() {
Ok(s3) => Some(Arc::new(s3)),
Err(_) => None,
}
Contributor Author

ideally I would like to write this as:

AmazonS3Builder::from_env()
    .with_bucket_name(host).build()
    .map(Arc::new)
    .ok();

but getting the following type error:

mismatched types
expected enum `Option<Arc<(dyn object_store::ObjectStore + 'static)>>`
found enum `Option<Arc<AmazonS3>>`

Contributor

You could write

Some(AmazonS3Builder::from_env().with_bucket_name(host).build().map(Arc::new).ok()?)

But it is a bit contrived, and I think what you have written is better. The problem is that whilst Arc<AmazonS3> can be coerced to Arc<dyn ObjectStore>, Option<Arc<AmazonS3>> can't be coerced to Option<Arc<dyn ObjectStore>>.

@turbo1912
Contributor Author

ObjectStoreProvider works great, thanks for the suggestion @tustvold! I gave it a try. One thing that I am a little worried about is that there is no way for ObjectStoreProvider to communicate errors; for example, if the url for an s3 object store is malformed, the provider would just return None and the ObjectStoreRegistry would silently fail to find a suitable object store.

@tustvold
Contributor

I don't think there would be any objections to changing ObjectStoreProvider to return an error; after all, if it returns None it just gets converted to an error anyway by ObjectStoreRegistry::get_by_url. I'll get a quick PR up to do this.

Contributor

@tustvold tustvold left a comment

Love it, happy to see these things coming together. I've created #3584 to address the error handling. Other than that and some very minor nits, this looks good to go 👍


Contributor

@tustvold tustvold left a comment

Just needs clippy to be pacified and then this can go in 👍

@tustvold tustvold merged commit b625277 into apache:master Sep 23, 2022
@tustvold
Contributor

Thank you for this, this is really cool functionality to have out of the box

@ursabot

ursabot commented Sep 23, 2022

Benchmark runs are scheduled for baseline = 49b9c67 and contender = b625277. b625277 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Contributor

@alamb alamb left a comment

This is really neat -- thank you @turbo1912

@timvw
Contributor

timvw commented Sep 24, 2022

+1. Thanks, really nice!

Successfully merging this pull request may close these issues.

add datafusion-cli support of external table locations that object_store supports