Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[object store] Parsing of well-known uri formats #2304

Closed
roeap opened this issue Aug 3, 2022 · 9 comments · Fixed by #3327
Closed

[object store] Parsing of well-known uri formats #2304

roeap opened this issue Aug 3, 2022 · 9 comments · Fixed by #3327
Assignees
Labels
enhancement Any new improvement worthy of a entry in the changelog good first issue Good for newcomers help wanted

Comments

@roeap
Copy link
Contributor

roeap commented Aug 3, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

For many storage services there exists standardized, or de-facto standardized uri formats (e.g. s3://<bucket>/path/to/blob for AWS S3 services) and are often the entry point in applications for referencing storage locations. While it is often straight forward to work with these, I made the experience that there are always edge cases to consider and keep on implementing similar logic in different projects.

Had a small discussion around this with @wjones127 in delta-io/delta-rs#721

cc @tustvold @alamb

Describe the solution you'd like

Provide a dedicated implementation for parsing storage uris within the object_store crate and maybe offer a somewhat higher level API that selects stores based on the results. As a plus this would encourage unified handling among all adopters of the crate.

Describe alternatives you've considered

Letting consumers of object_store take care of this.

Additional context

If we decide to follow this, I'd be happy to come up with a proposal. Hoping that this would just be a fairly thin wrapper around Url.

@roeap roeap added the enhancement Any new improvement worthy of a entry in the changelog label Aug 3, 2022
@alamb
Copy link
Contributor

alamb commented Aug 3, 2022

We added something which might be similar to this in DataFusion, called ObjectStoreProvider in apache/datafusion#2906 though I think the usecase there was to support HDFS

cc @yahoNanJing and @mingmwang

@tustvold
Copy link
Contributor

tustvold commented Aug 3, 2022

Also related, whilst integrating object_store into DataFusion I added an abstraction to do something similar - apache/datafusion#2578. Perhaps we could lift some of that into this crate? 🤔

I certainly was not aware that Azure URLs were different, so thank you for bringing that to my attention 👍

@roeap
Copy link
Contributor Author

roeap commented Aug 4, 2022

Had a look at the ListingTableUri implementation, and it does a very large chunk of what i thought such an abstraction could do.

The one thing missing would be to extract the information we can gather from the results that are specific to azure / aws etc and somehow present that to the consumer. maybe add conversions into something like.

pub enum StorageLocation {
    S3(S3Info { ... }),
    Azure(AzureInfo { ... })
    ...
}

In that case, would we just want to copy some logic or move the whole implementation over here? Some of the methods seem particularly useful when working with the actual ListingTable, but also do not hurt here. We would also need to move the ObjectStoreUrl over, but that seems related anyhow.

@alamb
Copy link
Contributor

alamb commented Aug 4, 2022

I don't have a strong opinion about this. Maybe other contributors do

@wjones127
Copy link
Member

wjones127 commented Aug 4, 2022

It seems like the existing registry stores concrete instances, from what I can see in the tests (please correct me if I'm misreading):

let sut = ObjectStoreRegistry::default();
sut.register_store("hdfs", "localhost:8020", Arc::new(LocalFileSystem::new()));
let url = ListingTableUrl::parse("hdfs://localhost:8020/key").unwrap();
sut.get_by_url(&url).unwrap();

This seems problematic for stores like object_store::aws::AmazonS3 because a particular instance is bucket-specific. IIUC that means an ObjectStoreRegistry could only allow for one bucket at a time, which would rule out many use cases.

I'm inclined toward something inspired by the Filesystem API in Arrow C++: each implementation has a constructor FromUri(uri: Uri) -> (Path, Self) and there is a function ObjectStoreFromUri(uri: Uri) -> (Path, impl ObjectStore) that can dispatch to the correct store based on the scheme. Not sure how we might register, but I could imagine some API like this might be possible:

let sut = ObjectStoreRegistry::default();
sut.register_store("s3", object_store::aws::AmazonS3::FromUri);

By dispatching to the stores specific FromUri method we allow each store to handle mapping the URI parts and even query parameters into their own options.

Does this align with what you are thinking @roeap?

@roeap
Copy link
Contributor Author

roeap commented Aug 5, 2022

Not 100% sure if I understood the problem with the buckets. Wouldn't the Store uri just be "s3://"? and as such we could register multiple buckets? I.e. scheme and host are used to store the store.

Maybe the equivalent to ObjectStoreFromUri is more the ObjectStoreProvider:

/// Object store provider can detector an object store based on the url
pub trait ObjectStoreProvider: Send + Sync + 'static {
    /// Detector a suitable object store based on its url if possible
    /// Return the key and object store
    fn get_by_url(&self, url: &Url) -> Option<Arc<dyn ObjectStore>>;
}

My main intent was to have something that sanitises user input and handles all the variation one might refer to a local location on all platforms. This to me, ListingTableUrl is doing that very well. For me a further useful piece would to interpret the uri and map it to a specific service / implementation - i.e s3 / azure / ... - based on well-known uris for these technologies, but more or less just extracting all information that can be gathered from that. Consumer could then decide what to do with that information.

The mentioned bonus API to unify creation of object stores, is likely best served using something like the provider (which could have the ObjectStoreFromUri behaviour. I guess that could be part of the object store trait or a separate FromUri trait that does just that.

While a bit worried to overload the object_store_rs crate when moving / implementing these structures, I am slightly leaning towards also moving registry / provider, since managing multiple separate locations is something that at least for me comes up regularly.

In short - @wjones127, yes it does match 😄.

@tustvold
Copy link
Contributor

tustvold commented Aug 5, 2022

Maybe the equivalent to ObjectStoreFromUri is more the ObjectStoreProvider:

Agreed

also moving registry / provider,

Moving ObjectStoreProvider and ObjectStoreRegistry to object_store_rs makes sense to me. I would be more apprehensive of moving ListingTableUrl as aspects like the interpretation of glob paths is likely not very portable, but from my understanding of the issue it shouldn't be necessary to port this - as any type that implements AsRef<Url> will work

@alamb
Copy link
Contributor

alamb commented Aug 5, 2022

FWIW I plan to make an object_store 0.4 release early next week: #2261

It might be too much to try to get this in by then, but I also think if this is important I could maybe help and postpone the release until we have implemented this ticket.

@roeap
Copy link
Contributor Author

roeap commented Aug 5, 2022

@alamb - Since most of the code already exists, I am fairly confident I can prepare a PR for this this WE.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog good first issue Good for newcomers help wanted
Projects
None yet
4 participants