Are the Azure paths inconsistent with other remote store paths? #721
Comments
IIUC I'm not sure about fsspec, but in PyArrow there is a notion of a SubTreeFilesystem, so the object store is equivalent to one of those with a base_dir at the bucket/container level. Does |
You are absolutely right. I think the important point is "What is a store?". When we use the URIs we have the bucket / container in the URI. Your idea of having parsing utilities for "well-known" URI schemes like |
Just to clarify that we're all using the same words for things 😅 For the URL -
Is this a standard representation of Azure paths, or something unique to delta-rs? If it is a standard representation, that would definitely pose some interesting complexities w.r.t. handling it consistently... If it isn't a standard representation, I would definitely recommend moving to only encoding the |
I don't have much experience with Azure, but is there any reason why we are not adopting the URI syntax documented in the official doc at https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction-abfs-uri? Basically my recommendation on this is closer to what @tustvold proposed, i.e. we should go with the standard representation used by the Azure community for the best user experience and least surprise. Ideally, things just work if users provide the URI in the format that's used in other parts of their Azure stack. |
I remember discussing this with @thovoll - essentially the argument back then was that
wabs seems to more or less follow the same pattern. Personally, I look at this more like "can we read data from a location specified by that URI" rather than which specific driver we will use to do so. The behaviours are clearly defined by the As such I could implement parsing for these and some other known formats - and maybe contribute that upstream? (apache/arrow-rs#2304) |
Description
For remote stores, we currently support Azure, AWS, and GCP, which have the following URI schemes:
s3://<bucket>/path/to/table
gs://<bucket>/path/to/table
adls2://<account>/<container>/path/to/table
The main source of difference is that - to the best of my knowledge - the concept of an account does not exist for s3/gs. Essentially, buckets must be unique for a region, while containers must be unique per account. However, regions also exist in Azure. On the other hand, the root of an object store is the bucket / container, and from how URLs / paths are constructed, bucket and container are more or less the same. It seems others (see adlfs) felt that the container is the appropriate lowest level in the path / URI, while the account (much like the region in S3) is configuration of the store.
Thus I propose to "drop" the account from our Azure paths. While this is certainly a major breaking change, my hope is that users appreciate consistency with e.g. fsspec. Given that we aim to closely integrate with (py)arrow, it seems to me that this would be more consistent on that level as well.
From an implementation standpoint, we are already picking up the account from configuration, so the path segment is effectively unused.
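To illustrate the consistency argument, here is a minimal sketch (Python, standard library only) of how uniformly the URIs would parse once the account is dropped. `parse_store_uri` is a hypothetical helper, not an existing delta-rs API, and the account-free `adls2://<container>/...` form shown is the proposal, not current behaviour:

```python
from urllib.parse import urlparse

def parse_store_uri(uri: str) -> tuple[str, str, str]:
    """Split a remote-store URI into (scheme, root, path).

    With the proposed account-free Azure form, the netloc is always
    the bucket / container for every supported scheme, so a single
    generic parse works; the account (like the S3 region) would come
    from store configuration instead.
    """
    parsed = urlparse(uri)
    return parsed.scheme, parsed.netloc, parsed.path.lstrip("/")

# All three schemes now parse identically:
print(parse_store_uri("s3://my-bucket/path/to/table"))
# ('s3', 'my-bucket', 'path/to/table')
print(parse_store_uri("gs://my-bucket/path/to/table"))
print(parse_store_uri("adls2://my-container/path/to/table"))
```

Under the current scheme, by contrast, `adls2://<account>/<container>/...` would put the account in the netloc and push the container into the path, requiring Azure-specific handling.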
As a side note - this would also be consistent with how object_store treats paths ...cc @thovoll, @wjones127 @houqp
Use Case
Have a nicer user-facing API.
Related Issue(s)