Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

object_store: Instantiate object store from provided url with store options #4047

Closed
chitralverma opened this issue Apr 11, 2023 · 15 comments · Fixed by #4200
Closed

object_store: Instantiate object store from provided url with store options #4047

chitralverma opened this issue Apr 11, 2023 · 15 comments · Fixed by #4200
Assignees
Labels
enhancement Any new improvement worthy of a entry in the changelog object-store Object Store Interface

Comments

@chitralverma
Copy link

chitralverma commented Apr 11, 2023

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Currently, in the projects that are using object_store - datafusion/ delta-rs/ pola-rs etc, a dyn ObjectStore has to be created manually by parsing the provided URL, checking the scheme and providing the options.

It would be great to have this capability directly provided by the crate.

Describe the solution you'd like
My proposal is to standardize this implementation and bring it into this crate itself exposed by a simple function like the below as it would make things significantly simple for developers using the crate.

/// Taken a reference from implementation in delta-rs 
https://github.com/delta-io/delta-rs/blob/c8371b38fdf22802f0f91b4ddc2a47da6be97c68/rust/src/storage/mod.rs#L85-L100

#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct StorageOptions(pub HashMap<String, String>);

#[derive(Debug, Clone)]
pub struct AObjectStore {
    storage: Arc<dyn ObjectStore>,
    location: Url,
    options: StorageOptions,
}

/// Try creating a new instance of [`AObjectStore`]
pub fn get_object_store(location: Url, options: impl Into<StorageOptions> + Clone) -> Result<AObjectStore> {
    let prefix = Path::from(location.path());
   
    // parse URL to a kind (s3/ aws/ ... )
    let kind = ObjectStoreKind::parse_url(&location)?;

    // instantiate object store
    let store = kind.into_impl( .... );

    // return
    Ok(Self {
        store,
        location,
        options: options.into(),
    })
}

For any new storage backends that may come up in the future, they can be added to the ObjectStoreKind along with a small implementation in into_impl. Users of the crate will only have to bump up the crate version.

Describe alternatives you've considered
Without this, each lib using object_store has to implement its own parsing.

Examples:

  • See datafusion registry here.
  • See delta-rs implementation here

Also without this each time this crate adds a new backend, the users of this crate will have to bump up the version and add implementation for the backends by themselves.

Additional context
This idea is also implemented by,

@chitralverma chitralverma added the enhancement Any new improvement worthy of a entry in the changelog label Apr 11, 2023
@chitralverma
Copy link
Author

@houqp what are your thoughts on this ?

@tustvold
Copy link
Contributor

tustvold commented Apr 11, 2023

There was some prior discussion on #2304

FYI @roeap

I would not object to providing a

pub fn parse_url<I>(location: Url, options: I) -> Result<Arc<dyn ObjectStore>>
    where I: IntoIterator<Item = (impl AsRef<str>, impl Into<String>)> {
    match location.scheme() {
        "gs" => Ok(Arc::new(GoogleCloudStorageBuilder::from_env().with_url(url).try_with_options(options).build()?)),
        "s3" => ...
    }
}

I'm not sure what the advantage of providing AObjectStore would be?

@chitralverma
Copy link
Author

I'm not sure what the advantage of providing AObjectStore would be?

This is just an example of "a object store". As the psuedo code I added to OP was taken from delta where they return DeltaObjectStore. We can live without AObjectStore :)

@chitralverma
Copy link
Author

chitralverma commented Apr 11, 2023

@tustvold @roeap @houqp If there are no objections or other ideas for this I can raise a PR to close this and #2304.

@roeap
Copy link
Contributor

roeap commented Apr 11, 2023

From my perspective, centralising this logic makes a lot of sense. We actually have some preparatory work already done within object store. Specifically, extracting relevant information from the URLs is available on the builders via respective parse_url methods. I guess it makes sense to reuse that logic somehow.

One thing we discussed back then was the option to go as far as moving the ObjectStoreFactory (not sure what its actually called right now 😆 ) trait from datafusion into the object store crate. But not sure what my stance on that would be today. I guess a simple "global" parse_url function like @tustvold suggest would also be my preference today.

There is one more kind of related thing I recently looked into, but haven't converged on a good solution yet. Essentially using from_env may lead to a situation where a complete credential specified in the storage options will not be chosen, b/c there is sufficient information available in the environment to in principle authorise. On the other hand, creating an ObjectStore without initializing options from the environment will always succeed (at least for AWS and Azure) since there are config-free options available via the metadata endpoint.

My current thinking was to check storage options if they contain a full credential (this would have to be build) and if not, then initialise from the environment.

In azure the default credential allows for explicitly configuring to omit certain options as an additional guard against picking the wrong credential, which may be an option as well.

All in all though I think this is a gerat addition to object store.

Update:

One more consideration would be to not return an Arced object store, so that it can be wrapped in the other layers like ThrottleStore store etc. or maybe configure wrapping or pass in an optional closure that wraps the store...

@tustvold
Copy link
Contributor

tustvold commented Apr 11, 2023

in the other layers like ThrottleStore store etc

We could return Box<dyn ObjectStore> which implements ObjectStore. Alternatively we could derive ObjectStore for Arc<dyn ObjectStore>. Tbh the wrappers probably shouldn't be generics, it doesn't really yield much, so we could also just change that

Essentially using from_env may lead to a situation where a complete credential specified in the storage options will not be chosen,

Perhaps we don't call from_env and punt this to the user to decide, I don't have a good solution for this...

@chitralverma
Copy link
Author

doesn't 'from_env' imply that most creds, options will come from env and that env vars should be preferred?

I think an Arced dyn ObjectStore can't be returned anyways as the true size of the object cannot be inferred. The compiler will throw an error, but I may be I'm wrong

@roeap
Copy link
Contributor

roeap commented Apr 11, 2023

doesn't 'from_env' imply that most creds, options will come from env and that env vars should be preferred?

It might :) - but from what I see there is no way to tell the user intent inside the proposed parse_url function. I.e. we would be making the decision to call from_env, not the user. If I as a user pass in configuration for a specific credential, I would expect that to be used, and not overwritten by the environment.

I guess thats why it might just be best to just let the caller figure that out, to keep things simpler for now. If we see a solution emerging in several protects, we can still move that upstream?

@chitralverma
Copy link
Author

doesn't 'from_env' imply that most creds, options will come from env and that env vars should be preferred?

It might :) - but from what I see there is no way to tell the user intent inside the proposed parse_url function. I.e. we would be making the decision to call from_env, not the user. If I as a user pass in configuration for a specific credential, I would expect that to be used, and not overwritten by the environment.

I guess thats why it might just be best to just let the caller figure that out, to keep things simpler for now. If we see a solution emerging in several protects, we can still move that upstream?

or wait, my bad i didn't see the from_env in the snippet. for this PR, i would like to keep it very simple - Given a URL and storage options, a user should be able to instantiate an object store.

so this PR can be separated from the registry part and the env part - nothing is to be read from the env

for the registries - I think it should still be a part of the projects, because they all may want to handle this their own way.

@chitralverma
Copy link
Author

@roeap @tustvold I found some time today to work on this part.

A prototype of how it looks - master...chitralverma:arrow-rs:parse_url

Please let me know what are your initial thoughts, based on it I can open a PR.

@roeap
Copy link
Contributor

roeap commented Apr 12, 2023

From my end this looks like a good starting point. On first glance we may have toto put a bit more effort into handling "https" schemes, since at least for azure and aws we can also parse the respective http based urls.

Also just realized, since a recent fix to allow "." in bucket names, we may right now not properly recognize aws urls in the virtual-hosted style, as we split on a static number of "." to dissect the url.

@chitralverma
Copy link
Author

From my end this looks like a good starting point. On first glance we may have toto put a bit more effort into handling "https" schemes, since at least for azure and aws we can also parse the respective http based URLs.

Yes, this I have already done, not pushed yet.

Also just realized, since a recent fix to allow "." in bucket names, we may right now not properly recognize aws urls in the virtual-hosted style, as we split on a static number of "." to dissect the URL.

Can you please point me to some existing MR or issue for this, I will cross check the behaviours.

@roeap
Copy link
Contributor

roeap commented Apr 13, 2023

Can you please point me to some existing MR or issue for this, I will cross check the behaviours.

turns out i was wrong, while . is generally permitted in bucket names, the docs state they are not permitted for virtual-style hosted buckets. So the way we extract bucket names from urls should be fine.

@chitralverma
Copy link
Author

alright, so I'll raise a PR now and we can do formal review there

tustvold added a commit that referenced this issue May 12, 2023
* Add parse_url function (#4047)

* Clippy

* Fix copypasta

* Fix wasm32 build

* More wasm fixes

* Return remaining path

* Don't use from_env
@tustvold
Copy link
Contributor

label_issue.py automatically added labels {'object-store'} from #4077

@tustvold tustvold added the object-store Object Store Interface label May 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog object-store Object Store Interface
Projects
None yet
3 participants