[object store] Parsing of well-known uri formats #2304
Comments
We added something which might be similar to this in DataFusion, called cc @yahoNanJing and @mingmwang
Also related, whilst integrating object_store into DataFusion I added an abstraction to do something similar - apache/datafusion#2578. Perhaps we could lift some of that into this crate? 🤔 I certainly was not aware that Azure URLs were different, so thank you for bringing that to my attention 👍
Had a look at the The one thing missing would be to extract the information we can gather from the results that is specific to Azure / AWS etc. and somehow present that to the consumer. Maybe add conversions into something like:
pub enum StorageLocation {
    S3(S3Info { ... }),
    Azure(AzureInfo { ... })
    ...
}
In that case, would we just want to copy some logic or move the whole implementation over here? Some of the methods seem particularly useful when working with the actual
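To make the `StorageLocation` idea a bit more concrete, here is a minimal std-only sketch. The fields of `S3Info` and `AzureInfo`, the `parse_location` function, and the set of accepted schemes are all assumptions for illustration, not anything that exists in object_store or DataFusion. For Azure it only handles the `abfs[s]://<container>@<account>.dfs.core.windows.net/<path>` form.

```rust
/// Hypothetical S3 details; the field names are assumptions.
#[derive(Debug, PartialEq)]
pub struct S3Info {
    pub bucket: String,
    pub key: String,
}

/// Hypothetical Azure details; the field names are assumptions.
#[derive(Debug, PartialEq)]
pub struct AzureInfo {
    pub account: String,
    pub container: String,
    pub path: String,
}

#[derive(Debug, PartialEq)]
pub enum StorageLocation {
    S3(S3Info),
    Azure(AzureInfo),
}

/// Parse a storage URI into a service-specific location.
/// Returns `None` for unknown schemes or a missing authority.
pub fn parse_location(uri: &str) -> Option<StorageLocation> {
    let (scheme, rest) = uri.split_once("://")?;
    // Everything up to the first '/' is the authority, the remainder is the path.
    let (authority, path) = rest.split_once('/').unwrap_or((rest, ""));
    if authority.is_empty() {
        return None;
    }
    let scheme = scheme.to_ascii_lowercase();
    match scheme.as_str() {
        // For S3 the authority is the bucket name.
        "s3" | "s3a" => Some(StorageLocation::S3(S3Info {
            bucket: authority.to_string(),
            key: path.to_string(),
        })),
        // For abfs[s] the authority is `<container>@<account>.<endpoint suffix>`.
        "abfs" | "abfss" => {
            let (container, host) = authority.split_once('@')?;
            let account = host.split('.').next()?;
            Some(StorageLocation::Azure(AzureInfo {
                account: account.to_string(),
                container: container.to_string(),
                path: path.to_string(),
            }))
        }
        _ => None,
    }
}
```

A real implementation would presumably build on `Url` and return a proper error type instead of `Option`; the point here is only that the same URI shape carries different meaning per service.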
I don't have a strong opinion about this. Maybe other contributors do
It seems like the existing registry stores concrete instances, from what I can see in the tests (please correct me if I'm misreading):
let sut = ObjectStoreRegistry::default();
sut.register_store("hdfs", "localhost:8020", Arc::new(LocalFileSystem::new()));
let url = ListingTableUrl::parse("hdfs://localhost:8020/key").unwrap();
sut.get_by_url(&url).unwrap();
This seems problematic for stores like I'm inclined toward something inspired by the Filesystem API in Arrow C++: each implementation has a constructor
let sut = ObjectStoreRegistry::default();
sut.register_store("s3", object_store::aws::AmazonS3::FromUri);
By dispatching to the stores specific Does this align with what you are thinking @roeap?
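A rough sketch of what registering constructors instead of instances could look like. Everything here is hypothetical and std-only: the `ObjectStore` trait is a stand-in for the real one, and `S3Store`, `from_uri`, `StoreFactory` and `FactoryRegistry` are names made up for this example, not existing APIs.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Stand-in for `object_store::ObjectStore` so the sketch compiles on its own.
pub trait ObjectStore: Send + Sync {
    fn describe(&self) -> String;
}

/// Hypothetical store that is bound to one bucket.
pub struct S3Store {
    bucket: String,
}

impl ObjectStore for S3Store {
    fn describe(&self) -> String {
        format!("s3 bucket {}", self.bucket)
    }
}

impl S3Store {
    /// Hypothetical per-store constructor that reads the bucket from the URI.
    pub fn from_uri(uri: &str) -> Result<Arc<dyn ObjectStore>, String> {
        let rest = uri
            .strip_prefix("s3://")
            .ok_or_else(|| format!("not an s3 uri: {uri}"))?;
        let bucket = rest.split('/').next().unwrap_or("");
        if bucket.is_empty() {
            return Err(format!("missing bucket in {uri}"));
        }
        let store: Arc<dyn ObjectStore> = Arc::new(S3Store {
            bucket: bucket.to_string(),
        });
        Ok(store)
    }
}

/// A constructor that builds a store from a URI.
pub type StoreFactory = fn(&str) -> Result<Arc<dyn ObjectStore>, String>;

/// Registry keyed by scheme that holds constructors rather than instances.
#[derive(Default)]
pub struct FactoryRegistry {
    factories: HashMap<String, StoreFactory>,
}

impl FactoryRegistry {
    pub fn register_store(&mut self, scheme: &str, factory: StoreFactory) {
        self.factories.insert(scheme.to_string(), factory);
    }

    /// Look up the constructor for the URI's scheme and let it build the store.
    pub fn get_by_uri(&self, uri: &str) -> Result<Arc<dyn ObjectStore>, String> {
        let (scheme, _) = uri
            .split_once("://")
            .ok_or_else(|| format!("no scheme in {uri}"))?;
        let factory = self
            .factories
            .get(scheme)
            .ok_or_else(|| format!("no store registered for scheme {scheme}"))?;
        factory(uri)
    }
}
```

With one `register_store("s3", S3Store::from_uri)` call, URIs pointing at different buckets each get their own store, without registering every bucket up front.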
Not 100% sure if I understood the problem with the buckets. Wouldn't the store URI just be "s3://"? And as such we could register multiple buckets? I.e. scheme and host are used as the key under which the store is registered. Maybe the equivalent to
/// Object store provider that can detect an object store based on the URL
pub trait ObjectStoreProvider: Send + Sync + 'static {
    /// Detect a suitable object store based on its URL if possible
    /// Return the key and object store
    fn get_by_url(&self, url: &Url) -> Option<Arc<dyn ObjectStore>>;
}
My main intent was to have something that sanitises user input and handles all the variations in how one might refer to a local location on all platforms. This to me, The mentioned bonus API to unify creation of object stores, is likely best served using something like the provider (which could have the While a bit worried to overload the In short - @wjones127, yes it does match 😄.
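For illustration, a std-only sketch of how a registry keyed by scheme and host could fall back to a provider for URLs it has not seen yet. The trait below is a simplified analogue of the one quoted above: it takes the scheme and host as `&str` instead of a `&Url`, and `NamedStore`, `S3Provider` and `Registry` are hypothetical names invented for this example.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Stand-in for `object_store::ObjectStore`.
pub trait ObjectStore: Send + Sync {
    fn name(&self) -> &str;
}

/// Hypothetical store that only carries a label.
pub struct NamedStore(pub String);

impl ObjectStore for NamedStore {
    fn name(&self) -> &str {
        &self.0
    }
}

/// Simplified analogue of the provider trait (assumption: `&str` parts, not `&Url`).
pub trait ObjectStoreProvider: Send + Sync + 'static {
    fn get_by_url(&self, scheme: &str, host: &str) -> Option<Arc<dyn ObjectStore>>;
}

/// Hypothetical provider that creates one store per S3 bucket.
pub struct S3Provider;

impl ObjectStoreProvider for S3Provider {
    fn get_by_url(&self, scheme: &str, host: &str) -> Option<Arc<dyn ObjectStore>> {
        if scheme == "s3" && !host.is_empty() {
            let store: Arc<dyn ObjectStore> = Arc::new(NamedStore(format!("s3:{host}")));
            Some(store)
        } else {
            None
        }
    }
}

pub struct Registry {
    stores: HashMap<(String, String), Arc<dyn ObjectStore>>,
    provider: Option<Box<dyn ObjectStoreProvider>>,
}

impl Registry {
    pub fn new(provider: Option<Box<dyn ObjectStoreProvider>>) -> Self {
        Registry { stores: HashMap::new(), provider }
    }

    pub fn register_store(&mut self, scheme: &str, host: &str, store: Arc<dyn ObjectStore>) {
        self.stores.insert((scheme.to_string(), host.to_string()), store);
    }

    /// Return a registered store, or ask the provider and cache what it returns.
    pub fn get_by_url(&mut self, url: &str) -> Option<Arc<dyn ObjectStore>> {
        let (scheme, rest) = url.split_once("://")?;
        let host = rest.split('/').next().unwrap_or("");
        let key = (scheme.to_string(), host.to_string());
        if let Some(store) = self.stores.get(&key) {
            return Some(Arc::clone(store));
        }
        let store = self.provider.as_ref()?.get_by_url(scheme, host)?;
        self.stores.insert(key, Arc::clone(&store));
        Some(store)
    }
}
```

Since the bucket is the host part of the key, each bucket ends up with its own entry, and explicitly registered stores always win over the provider.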
Agreed
Moving
FWIW I plan to make an It might be too much to try to get this in by then, but I also think if this is important I could maybe help and postpone the release until we have implemented this ticket.
@alamb - Since most of the code already exists, I am fairly confident I can prepare a PR for this this weekend.
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
For many storage services there exist standardized, or de-facto standardized, URI formats (e.g. s3://<bucket>/path/to/blob for AWS S3), and these are often the entry point in applications for referencing storage locations. While it is often straightforward to work with these, in my experience there are always edge cases to consider, and I keep implementing similar logic in different projects.
Had a small discussion around this with @wjones127 in delta-io/delta-rs#721
cc @tustvold @alamb
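As one example of such an edge case (picked for illustration, it is not taken from the linked discussion): user input may be a URI or a plain local path, and a Windows path like `C:\data` can look like it starts with a one-letter scheme. A minimal std-only sketch, where `Location`, `classify` and the "single-letter scheme means drive letter" rule are all assumptions:

```rust
#[derive(Debug, PartialEq)]
pub enum Location {
    /// Input of the form `<scheme>://<rest>`; the scheme is lowercased.
    Url { scheme: String, rest: String },
    /// Anything else is treated as a local filesystem path.
    LocalPath(String),
}

/// RFC 3986 scheme syntax: a letter followed by letters, digits, '+', '-' or '.'.
fn is_valid_scheme(s: &str) -> bool {
    let mut chars = s.chars();
    match chars.next() {
        Some(c) if c.is_ascii_alphabetic() => {}
        _ => return false,
    }
    chars.all(|c| c.is_ascii_alphanumeric() || matches!(c, '+' | '-' | '.'))
}

/// Decide whether the input is a URI or a local path.
/// Assumption: single-letter "schemes" are Windows drive letters, not URIs.
pub fn classify(input: &str) -> Location {
    if let Some((scheme, rest)) = input.split_once("://") {
        if scheme.len() > 1 && is_valid_scheme(scheme) {
            return Location::Url {
                scheme: scheme.to_ascii_lowercase(),
                rest: rest.to_string(),
            };
        }
    }
    Location::LocalPath(input.to_string())
}
```

Relative paths, `file://` URIs and the like would need more care than this; the sketch only shows why a shared implementation could save everyone from rediscovering these cases.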
Describe the solution you'd like
Provide a dedicated implementation for parsing storage URIs within the object_store crate, and maybe offer a somewhat higher-level API that selects stores based on the results. As a plus, this would encourage unified handling among all adopters of the crate.
Describe alternatives you've considered
Letting consumers of object_store take care of this.
Additional context
If we decide to follow this, I'd be happy to come up with a proposal. Hoping that this would just be a fairly thin wrapper around Url.