Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

az:// URL scheme isn't really a URL because it is not unique #157

Open
analog-cbarber opened this issue Aug 28, 2021 · 3 comments
Open

az:// URL scheme isn't really a URL because it is not unique #157

analog-cbarber opened this issue Aug 28, 2021 · 3 comments
Labels
Azure help wanted Extra attention is needed

Comments

@analog-cbarber
Copy link

The existing 'az://' pseudo-URL scheme violates the principal of URLs in that it is not universal and unique.

The 's3://' bucket syntax is valid because S3 buckets are universally unique. You may need account credentials to access an S3 bucket but you don't need additional information to locate it.

With Azure blob storage you need the storage account as well. Without that you have no way to locate the blobs even if they are publically accessible.

A better scheme would be something like:

az://<storage-account>/<container>/...

which would turn into:

https://<storage-account>.blob.core.windows.net/<container>/...

You could allow the optional addition of .blob or .dfs, so one could write:

az://<storage-account>.dfs/<container>/...

@pjbull
Copy link
Member

pjbull commented Aug 28, 2021

This is a great point, thanks @analog-cbarber.

Do you know if there is a list of all the different kinds of storage URLs we might need to generate to work with Azure storage services somewhere? It would be ideal to make the design of this comprehensive.

Also, curious if as an ugly workaround passing a BlobServiceClient with the right storage account to AzureBlobClient then using that to generate paths works to at least be able to access the blobs properly.

@analog-cbarber
Copy link
Author

I can indeed construct an AzureBlobClient to use with a AzureBlobPath that works, but it would be a lot nicer, if you could just directly construct an AzureBlobPath object with a string and perhaps an optional SAS token:

AzureBlobPath('az://myaccount/mycontainer?token')

or

AzureBlobPath('az://myaccount/mycontainer', token=token)

I don't know of a list of cloud storage container URLs, nor do I think that URL schemes necessarily even exist. Microsoft has not defined an azure:// or az:// scheme. I think it is fine to make up you own version, as long as it can be used to locate the resource without any extra information (apart from authentication credentials).

Since Azure cloud uses a REST api with https urls, you probably should also support the raw http url as well:

AzureBlobPath('https://myaccount.blob.core.widows.net/mycontainer/')

@analog-cbarber
Copy link
Author

FYI, here is quick hack that extends the existing implementation to support the azure://<account>/<container>/... scheme:

@register_client_class('azure2')
class AzureClient(AzureBlobClient):
    pass

@register_path_class('azure2')
class AzurePath(AzureBlobPath):
    client: 'AzureClient'
    cloud_prefix: str = "azure://"

    def __init__(self, cloud_path: Union[str, CloudPath],
                 client: Optional[AzureBlobClient] = None,
                 token: Optional[str] = None
                 ):
        if isinstance(cloud_path, str):
            parsed = urlparse(cloud_path)
            m = re.match(r'(?P<account>[a-z0-9]+)(\.(?P<type>blob|dfs)(\.core\.windows\.net)?)?',
                         parsed.netloc,
                         flags=re.IGNORECASE)
            if m is None:
                raise ValueError(f"Bad azure path '{cloud_path}'")
            account = m.group('account')
            fstype = m.group('type') or 'blob'
            account_url = f'https://{account}.{fstype}.core.windows.net/'
            optional_type = '' if fstype == 'blob' else '.' + fstype
            cloud_path = f"azure://{account}{optional_type}/{parsed.path.lstrip('/')}"
            if client is None or parsed.query or token or client.service_client.account_name != account:
                if token is not None:
                    token = '?' + token.lstrip('?')
                elif parsed.query:
                    token = '?' + parsed.query
                client = AzureClient(account_url, token)
        super().__init__(cloud_path, client = client)

    @classmethod
    def is_valid_cloudpath(cls, path: Union[str, CloudPath], raise_on_error=False) -> bool:
        valid = bool(re.match(r'(azure://|https://[a-z0-9]+\.(blob|dfs)\.core\.windows\.net)', str(path).lower()))

        if raise_on_error and not valid:
            raise InvalidPrefixError(
                f"'{path}' is not a valid path since it does not start with '{cls.cloud_prefix}' "
                "or valid Azure https blob or dfs location."
            )

        return valid

    @property
    def container(self) -> str:
        return self._no_prefix.split('/',2)[1]

    @property
    def blob(self) -> str:
        return super().blob.split('/',1)[1]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Azure help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

3 participants