
Are the Azure paths inconsistent with other remote store paths? #721

Closed
roeap opened this issue Aug 1, 2022 · 5 comments · Fixed by #761
Labels
enhancement New feature or request

Comments

roeap (Collaborator) commented Aug 1, 2022

Description

For remote stores, we currently support Azure, AWS, and GCP, which use the following URI schemes:

  • AWS: s3://<bucket>/path/to/table
  • GCP: gs://<bucket>/path/to/table
  • Azure: adls2://<account>/<container>/path/to/table
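To make the inconsistency concrete, here is a small sketch (standard library only; the bucket / account / container names are placeholders) of how a generic URL parser sees these schemes: for S3 and GCS the host component is the bucket, while for the current Azure scheme the host is the account and the container is pushed into the path.

```python
from urllib.parse import urlparse

uris = [
    "s3://my-bucket/path/to/table",                    # host = bucket
    "gs://my-bucket/path/to/table",                    # host = bucket
    "adls2://my-account/my-container/path/to/table",   # host = account, container in path
]

for uri in uris:
    parsed = urlparse(uri)
    print(parsed.scheme, parsed.netloc, parsed.path)

# s3 my-bucket /path/to/table
# gs my-bucket /path/to/table
# adls2 my-account /my-container/path/to/table
```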

The main source of the difference is that, to the best of my knowledge, the concept of an account does not exist for s3/gs. Essentially, buckets must be unique within a region, whereas containers must be unique per account; however, regions also exist in Azure. On the other hand, the root of an object store is the bucket / container, and from how URLs / paths are constructed, bucket and container are more or less equivalent. It seems others (see adlfs) felt that the container is the appropriate lowest level in the path / URI, while the account (much like the region in S3) is configuration of the store.

Thus I propose to "drop" the account from our Azure paths. While this is certainly a major breaking change, my hope is that users appreciate the consistency with e.g. fsspec. Given that we aim to closely integrate with (py)arrow, it seems to me that this would be more consistent on that level as well.

From an implementation standpoint, we are already picking up the account from configuration, so the path segment is effectively unused.
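A minimal sketch of what that could look like with the account dropped from the path (hypothetical: the az scheme, the helper name, and reading the account from the AZURE_STORAGE_ACCOUNT environment variable are illustrative assumptions, not the actual delta-rs implementation):

```python
import os
from urllib.parse import urlparse

def parse_azure_uri(uri: str) -> tuple[str, str, str]:
    """Split a (proposed) az://<container>/path/to/table URI into
    (account, container, path). Hypothetical helper for illustration."""
    parsed = urlparse(uri)
    # the account comes from configuration only, much like the region for S3
    account = os.environ["AZURE_STORAGE_ACCOUNT"]
    container = parsed.netloc
    path = parsed.path.lstrip("/")
    return account, container, path

# parse_azure_uri("az://my-container/path/to/table")
# -> ("<account from env>", "my-container", "path/to/table")
```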

As a side note, this would also be consistent with how object_store treats paths.

cc @thovoll, @wjones127, @houqp

Use Case

Have a nicer user-facing API.

Related Issue(s)

roeap added the enhancement label Aug 1, 2022
wjones127 (Collaborator) commented

IIUC object_store_rs actually has <bucket> and <container> as part of the store config and not as the path, whereas fsspec/pyarrow expect both of those to be within the path. Am I wrong on that?

I'm not sure about fsspec, but in PyArrow there is a notion of a SubTreeFileSystem, so the object store is equivalent to one of those with a base_dir at the bucket/container level.
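For illustration, a rough PyArrow sketch of that equivalence (bucket and region are placeholders, and credentials are assumed to be configured): a store rooted at a bucket behaves like a SubTreeFileSystem whose base path is the bucket.

```python
from pyarrow import fs

# paths passed to `store` are relative to the bucket, i.e. "path/to/table"
# resolves to "my-bucket/path/to/table" on the underlying S3 filesystem
s3 = fs.S3FileSystem(region="us-east-1")
store = fs.SubTreeFileSystem("my-bucket", s3)

# store.get_file_info("path/to/table") would describe s3://my-bucket/path/to/table
```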

Does object_store_rs have any utils for parsing URIs? That might be something we want to contribute upstream as part of our integration work.

roeap (Collaborator, Author) commented Aug 3, 2022

IIUC object_store_rs actually has <bucket> and <container> as part of the store config and not as the path, whereas fsspec/pyarrow expect both of those to be within the path. Am I wrong on that?

You are absolutely right. I think the important point is "What is a store?", and in object_store_rs the bucket/container level is the store root, so it's not in the paths anymore.

When we use URIs, we do have the bucket / container in the URI. Your idea of having parsing utilities for "well-known" URI schemes like s3{a}://, abfs:// etc. in object_store_rs is a great one, I think. To me it seems this might be in scope for that crate and could be used for initialization.
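A minimal sketch of such a utility, assuming a simple scheme table and a (store kind, root, path) result; both are illustrative assumptions and not the object_store_rs API:

```python
from urllib.parse import urlparse

# map well-known schemes to a store kind; the table is illustrative only
KNOWN_SCHEMES = {
    "s3": "aws", "s3a": "aws",
    "gs": "gcp",
    "az": "azure", "abfs": "azure", "abfss": "azure",
}

def parse_table_uri(uri: str) -> tuple[str, str, str]:
    """Return (store_kind, root, path) for a well-known table URI."""
    parsed = urlparse(uri)
    if parsed.scheme not in KNOWN_SCHEMES:
        raise ValueError(f"unknown scheme: {parsed.scheme!r}")
    return KNOWN_SCHEMES[parsed.scheme], parsed.netloc, parsed.path.lstrip("/")

# parse_table_uri("s3a://my-bucket/path/to/table")   -> ("aws", "my-bucket", "path/to/table")
# parse_table_uri("az://my-container/path/to/table") -> ("azure", "my-container", "path/to/table")
```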

tustvold commented Aug 3, 2022

IIUC object_store_rs actually has <bucket> and <container> as part of the store config and not as the path, whereas fsspec/pyarrow expect both of those to be within the path.

Just to clarify that we're all using the same words for things 😅

For the URL - s3://<bucket>/path/to/table

  • scheme - "s3"
  • host - "<bucket>"
  • path - "/path/to/table"

object_store_rs is only concerned with the path portion of this. apache/datafusion#2578 then added logic to DataFusion to interpret URIs in a manner consistent with this.

Azure adls2://<account>/<container>/path/to/table

Is this a standard representation of Azure paths, or something unique to delta-rs? If it is a standard representation, that would definitely pose some interesting complexities w.r.t. handling it consistently... If it isn't a standard representation, I would definitely recommend moving to only encoding the <container> in the URL, as suggested.

houqp (Member) commented Aug 4, 2022

I don't have much experience with Azure, but is there any reason why we are not adopting the URI syntax documented in the official doc at https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction-abfs-uri?

Basically, my recommendation is closer to what @tustvold proposed, i.e. we should go with the standard representation used by the Azure community, for the best user experience and the least surprise. Ideally, things just work if users provide the URI in the format that's used in other parts of their Azure stack.
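For reference, the syntax documented there is abfs[s]://<file_system>@<account_name>.dfs.core.windows.net/<path>, so both the container (file system) and the account can be recovered from the URI itself. A rough sketch of that split (an illustration, not an existing delta-rs helper):

```python
from urllib.parse import urlparse

def parse_abfss_uri(uri: str) -> tuple[str, str, str]:
    """Split abfs[s]://<container>@<account>.dfs.core.windows.net/<path>
    into (account, container, path). Illustrative sketch only."""
    parsed = urlparse(uri)
    if parsed.scheme not in ("abfs", "abfss"):
        raise ValueError(f"not an abfs[s] URI: {uri}")
    container, _, host = parsed.netloc.partition("@")
    account = host.split(".")[0]  # host is "<account>.dfs.core.windows.net"
    return account, container, parsed.path.lstrip("/")

# parse_abfss_uri("abfss://my-container@myaccount.dfs.core.windows.net/path/to/table")
# -> ("myaccount", "my-container", "path/to/table")
```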

roeap (Collaborator, Author) commented Aug 4, 2022

I remember discussing this with @thovoll; essentially the argument back then was that wasb etc. refer to specific drivers. Looking also at fsspec, which officially adopted az and abfs, the docs say:

Accepted protocol / uri formats include:
'PROTOCOL://container/path-part/file'
'PROTOCOL://container@account.dfs.core.windows.net/path-part/file'

wasb seems to more or less follow the same pattern. Personally, I look at this more as "can we read data from the location specified by that URI" rather than "we will use a specific driver to do so". The behaviours are clearly defined by the object_store_rs APIs, which deliberately hide the specifics of the backend.
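As a usage illustration of that reading (assuming adlfs is installed and credentials are otherwise configured; container, account, and file names are the placeholders from the quote above), both accepted forms point at the same location, with the account coming either from configuration or from the URI itself:

```python
import fsspec

# short form: account supplied as a storage option, container in the path
with fsspec.open("abfs://container/path-part/file", account_name="account") as f:
    data = f.read()

# long form: account encoded in the URI itself
with fsspec.open("abfs://container@account.dfs.core.windows.net/path-part/file") as f:
    data = f.read()
```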

As such, I could implement parsing for these and some other known formats, and maybe contribute that upstream? (apache/arrow-rs#2304)

roeap mentioned this issue Aug 22, 2022