Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE REQ] azure-storage-blob-nio: set default configuration #23653

Closed
droazen opened this issue Aug 18, 2021 · 29 comments
Closed

[FEATURE REQ] azure-storage-blob-nio: set default configuration #23653

droazen opened this issue Aug 18, 2021 · 29 comments
Assignees
Labels
Client This issue points to a problem in the data-plane of the library. customer-reported Issues that are reported by GitHub users external to the Azure organization. feature-request This issue requires a new behavior in the product in order be resolved. needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team Storage Storage Service (Queues, Blobs, Files)

Comments

@droazen
Copy link

droazen commented Aug 18, 2021

The NIO plugin for Google Cloud Storage (https://github.com/googleapis/java-storage-nio) has a convenient feature that lets you authenticate transparently via environment variables, instead of passing in credentials explicitly. You just run gcloud auth application-default login, and then the NIO plugin gets the credentials automatically from the user's environment.

It would be great if the NIO plugin for Azure had an equivalent feature. This is helpful in the case with large projects that are using the NIO subsystem to support multiple clouds, where propagating authentication info for specific clouds down the stack would require cumbersome special-casing that is not always easy (or even possible).

@ghost ghost added needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. customer-reported Issues that are reported by GitHub users external to the Azure organization. question The issue doesn't require a change to the product in order to be resolved. Most issues start as that labels Aug 18, 2021
@rickle-msft rickle-msft added feature-request This issue requires a new behavior in the product in order be resolved. Storage Storage Service (Queues, Blobs, Files) labels Aug 18, 2021
@rickle-msft rickle-msft self-assigned this Aug 18, 2021
@ghost ghost removed the needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. label Aug 18, 2021
@rickle-msft rickle-msft removed the question The issue doesn't require a change to the product in order to be resolved. Most issues start as that label Aug 18, 2021
@joshfree joshfree added the Client This issue points to a problem in the data-plane of the library. label Aug 18, 2021
@ghost ghost added the needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team label Aug 18, 2021
@joshfree
Copy link
Member

Hi @droazen are you able to use these environment variables in your app: https://github.com/Azure/azure-sdk-for-java/tree/main/sdk/identity/azure-identity#environment-variables
/cc @g2vinay

@rickle-msft
Copy link
Contributor

Thanks @joshfree. That will be helpful to get a read on. I think I might also have to do some work to pick up on those or at least document their existence in our javadocs

@droazen
Copy link
Author

droazen commented Aug 18, 2021

@joshfree The request is for azure-storage-blob-nio to check for these (or similar) environment variables on its own, and when they are present authenticate automatically without the need for client intervention. Adding cloud-specific manual authentication steps in our particular codebase is difficult, in part because the NIO-facing code lives several layers down in our dependency tree in libraries that we don't directly control.

@rickle-msft
Copy link
Contributor

@droazen Stepping back a bit. My understanding of nio usage is that typically on startup, FileSystemProvider.createFileSystem(Map<String, String> config) is called to initialize the new FileSystem. It sounds like constructinga config map to pass here might not actually be feasible for you at all, in which case there's a bit of a fundamental barrier in place for you guys using the package as it is now. Is that the case? If so, I think I'll need to do some work for it to default read from the environment so that we can get all the necessary configuration from the environment. Can you give me more details on how your Google Cloud file systems are initialized and configured so we can target the same experience?

@droazen
Copy link
Author

droazen commented Aug 18, 2021

@rickle-msft We never instantiate the FileSystem objects directly -- that happens inside the NIO subsystem. For example, we might be dealing with one or more URIs that could have any of several schemes such as gs://, hdfs://, https://, etc., supported by different NIO plugins in our classpath, and when we call java.nio.file.Paths.get(URI), the correct FileSystem for the current scheme gets loaded and stored inside the returned java.nio.file.Path objects, which we can then perform I/O upon using methods like java.nio.file.Files.newInputStream(). But most of this, as I mentioned, happens deep inside libraries that we can't easily modify.

What we can do in some cases, however, is call static configuration methods (that don't involve any instantiation) to set defaults on startup. We do this with Google's NIO library in our main toolkit GATK to adjust the default timeouts and a few other settings, via the static method com.google.cloud.storage.contrib.nio.CloudStorageFileSystemProvider.setDefaultCloudStorageConfiguration(). But we'd prefer not to use such an approach for authentication, if possible, since we have a few projects that are NIO-enabled but aren't distributed with the NIO plugins actually in their classpaths due to their size. The Google plugin is able to work with such projects (when downstream users add it to their classpaths) mainly because it is able to automatically get the auth info from the environment.

Thanks @rickle-msft and @joshfree for your time, and for considering this request!

@rickle-msft
Copy link
Contributor

I see.

when we call java.nio.file.Paths.get(URI), the correct FileSystem for the current scheme gets loaded and stored inside the returned java.nio.file.Path objects.

I think this is the information I was looking for. I was trying to figure out what the entry point was into this system because a fileSystem has to be created at some point. This method eventually calls into the FileSystemProvider.getPath(). We can add some logic in there such that if there are no fileSystems already constructed, it can try to create a default one based on the environment automatically and use the environment for authentication.

Can you explain a little more about the static method Google is using? If it's already able to pull creds from the environment, why use this method on top of that? Why not just have it pull all the necessary configs from the environment while it's at it?

@droazen
Copy link
Author

droazen commented Aug 19, 2021

@rickle-msft The reason we pull the auth info from the environment, but not other settings like timeouts, etc., is because for things like timeouts we want to bake sensible defaults into our toolkit that are appropriate for genomic-scale workloads and will work well for most of our users. Additionally, the environment variables with the user's credentials are created automatically on Google Cloud instances, and so most of the time GCP authentication "just works" for our users without any extra steps. This is a big deal for us, since most of our users are scientists who are not always able to deal with technical issues when things don't work out-of-the-box.

For the Azure NIO plugin, I'm curious as to what additional configuration info typically needs to be specified apart from the credentials?

@rickle-msft
Copy link
Contributor

Sure. We can definitely make it work! I'm just trying to get a full understanding of how it all fits together so we can give a consistent experience.

Here is a list of all the config options that are typically expected to be passed in at config time, but we can pull this information from another source. Given our discussion on credentials from the environment, you can probably ignore what it says about setting that information in the config map. But I think the ones relevant for this part of the discussion are:
AzureStorageMaxTries:Integer
AzureStorageTryTimeout:Integer
AzureStorageRetryDelayInMs:Long
AzureStorageMaxRetryDelayInMs:Long
AzureStorageRetryPolicyType:RetryPolicyType
AzureStorageDownloadResumeRetries:Integer

Thoughts?

@droazen
Copy link
Author

droazen commented Aug 20, 2021

@rickle-msft Thanks, this is all very helpful in clarifying the request in this ticket! I guess my ultimate wishlist would look like this:

  1. The Azure NIO plugin checks automatically for the user's credentials in environment variables, and, if present, doesn't require them to be passed in manually. Ideally these environment variables would be standardized ones already recognized / set by other Azure SDK tools.

  2. Global defaults for other settings like retries and timeouts can be set via a static method on the FileSystemProvider, without needing to actually instantiate a FileSystem.

  3. When java.nio.file.Paths.get(URI) is called, a new FileSystem is created on-demand using the preconfigured defaults, if one hasn't already been created, and referenced inside the returned java.nio.file.Path objects

Even just # 1 on its own would be massively useful to us. With all 3 features I think we'd achieve parity with Google's NIO plugin and be able to use the Azure plugin equally seamlessly in our toolkit.

Many thanks for your help!

@rickle-msft
Copy link
Contributor

@droazen Just wanted to ack this. We've had some other work come up spontaneously that is fairly urgent, which is why I haven't kept up here, but I will circle back to work on adding these features as soon as possible.

@droazen
Copy link
Author

droazen commented Aug 30, 2021

@rickle-msft Thanks for the update! Please let me know if there's any further information I can provide on this ticket.

@rickle-msft
Copy link
Contributor

@droazen Wanted to give you an update that I am beginning work on this item. Apologies for the delay and thank you for your patience. I'll start working on the first part

@rickle-msft
Copy link
Contributor

rickle-msft commented Feb 10, 2022

@droazen What authentication method are you using? TokenAuth (RBAC), SasToken, or SharedKey? And as far as the interface for the static global configuration method, would passing a map with everything besides the credentials be reasonable?

@droazen
Copy link
Author

droazen commented Feb 16, 2022

@rickle-msft Thanks for the update, glad you are able to look at this ticket again! To answer your questions:

  • Passing a settings map into the static global configuration method sounds very reasonable. Actually, thinking about it a bit more, once we have this ability to statically set global defaults in the NIO library, it would be fine for us to pass the auth info in manually via that method as well, and check for environment variables, etc. ourselves. Our original problem was that we couldn't do this at the individual FileSystem level, but we could certainly do it globally on startup, provided that calling java.nio.file.Paths.get(uri) on an azb:// URI would result in a new Azure FileSystem getting created from the settings in the global settings map, if present.

  • We will certainly want to support the SAS token method, which (as I understand it) is analogous to a pre-signed URL where the auth info is embedded in the URL itself. However, we'd also want some way of doing account-level authentication similar to Google's application default credentials -- I'm not sure which of the available options is closest to that.

@rickle-msft
Copy link
Contributor

Oh wow. Yea that's a good update to have. There are some sdk wide changes coming to how the sdks interact with the environment, so I was gunna have to wait for that actually, but if you can pass everything into that method, then I can get working on it more immediately. And I can ensure that Paths.get() opens a file system with the defaults if none is present.

Since we'll be able to authenticate via this config map as well, both sas tokens and shared keys (account level access) are both already supported there

@rickle-msft
Copy link
Contributor

@droazen Question for you on this. Since you are calling into Paths.get(uri), presumably you are fine passing a full uri here? That is, including the scheme, path, and query component, e.g. azb://<filepath>?endpoint=<endpoint>. This will ensure that we know what account to use to back the resulting path object.

@droazen
Copy link
Author

droazen commented Feb 28, 2022

@rickle-msft We'll certainly always have a scheme and path in our URIs, but since our URIs are user input I'm not certain whether they will always contain a query string with the endpoint. We're mainly concerned here with not placing unnecessary / onerous requirements on our users -- is it easy for (potentially non-technical) Azure users to obtain full URIs to their files including the endpoint? If the endpoint were missing, could the default / statically-configured account info be used, or is this field something that would necessarily vary with each individual URI?

@rickle-msft rickle-msft changed the title [FEATURE REQ] azure-storage-blob-nio: authenticate using environment variables [FEATURE REQ] azure-storage-blob-nio: set default configuration Feb 28, 2022
@rickle-msft
Copy link
Contributor

I see. It's kind of a yes and no answer here. It is, by my estimation, very easy for a customer to get a full uri to their file or the endpoint to their account. Both are readily available in the Azure Portal. I would go so far as to say it would be harder for a non technical user to get a reference (uri or otherwise) to a given blob that doesn't contain the domain/full endpoint than it would be for them to get one with the information we're looking for. What I will say is that the uri the portal gives is a different format from the format nio tends to work with (and I don't think this is unique to us if I had to guess), so it might take a bit of manipulation on the part of your application to rearrange some things. Thoughts?

The endpoint will be constant per filesystem, though it is possible to have multiple open filesystems. I think the idea here was that Paths.get() accepts a uri and you're passing a map to set the global defaults, which are the required parameters for creating a filesystem that we already have code in place to handle. I propose that we continue with this track for now and that if you find it is unreasonably complicated for your customers to gather the requisite information, we can look into adding an extra endpoint option to the config map and elide the necessity of its presence on the uri. Thoughts?

@rickle-msft
Copy link
Contributor

We did recently add a utility method to AzurePath that allows for converting from the style of uri given by the portal into a Path object to help bridge these two formats. I am also willing to look into adding support for this behavior in that method (i.e. checking for default configs) if that would be useful, but my sense is that you're trying to avoid provider specific code and would rather not interact with AzurePath directly.

@droazen
Copy link
Author

droazen commented Mar 1, 2022

@rickle-msft If the endpoint is constant per filesystem, could we add a "default endpoint" attribute into the static config map that, if set, the code falls back upon when a URI is missing the endpoint? That would alleviate this issue for us completely. Otherwise I foresee a lot of unnecessary pain/confusion for our users :) Would this be reasonably simple to implement, or are there structural issues in the existing code that would make it difficult?

@rickle-msft
Copy link
Contributor

I think we can add that. It's less a matter of technical difficulty and more a matter of design principles. We try to not to add methods and options eagerly because it's easier to add than take away, hence my slight reluctance to add the option. This may seem a bit odd initially, but what if the method was setDefaultConfigs(Map<String, Object> configMap, String defaultEndpoint)? i.e. the endpoint is a separate parameter. The reason I suggest this is this way we can use the exact same map options and documentation both for this method and for creating a file system "normally" through the provider. If we add the option to the map, then it may become confusing as to how setting the endpoint in the map and also in the uri would interact. Hopefully this way we can maintain the current set of constants for the config map and also allow you to set the endpoint via the method rather than on the uri.

That said, are you intending to use the overload of Paths.get() that accepts strings? Or the one that accepts a uri? And if the one that accepts a uri, what is the format of the uri you intend to pass?

@droazen
Copy link
Author

droazen commented Mar 1, 2022

@rickle-msft A signature like setDefaultConfigs(Map<String, Object> configMap, String defaultEndpoint) with the default endpoint separated from the config map would be totally fine with us -- whatever you think would fit best with the existing API. I do think that if the endpoint is actually present in the URI query string it should override the one in the default config, if that's also set.

We are using the Paths.get() overload that takes a URI -- the overload that takes a String is hardcoded to use the default FileSystem, I believe, rather than searching for a FileSystem that matches the provided scheme. The URIs we pass in will always have a scheme + path, but may or may not have a query string depending on whether, eg., the URI is pre-signed / contains embedded auth info.

@rickle-msft
Copy link
Contributor

@droazen Apologies again for the delay. There's been some internal back and forth about the best way to support setting default configurations which has held up the progress here.

There are some concerns from some folks on the team that a static method to set global configurations opens the door for some concerns such as concurrent/multiple sets that we'd like to avoid. Because of that, we've explored two possibilities here.

The first, which we think is simpler and therefore preferable, is to have all these configurations (creds and options) all set in the environment. This mitigates both of the concerns I mentioned as they would be checked once upon startup and cached thereafter. There's also a standardized way of accessing the environment across the entire Azure SDK that we can leverage. If this is conceptually fine with you, I can share a brief sample to confirm before we proceed.

The other option, which is more advanced and maybe overkill but also ultimately more flexible, is to use SPI (Service Provider Interfaces) that we can use to load a provider to serve us these default values.

I know we've sort of haggled back and forth over this multiple times, but at this point does putting everything in the environment work for you technically?

@droazen
Copy link
Author

droazen commented Apr 13, 2022

@rickle-msft From a technical standpoint, putting everything in the environment would work for us, yes. My only concern with that approach has to do with changes over time to the set of environment variables that get checked. With a static configuration method, the compiler will tell us if a setting changed. With environment variables across multiple upgrades of the NIO library over several years, is there a convenient mechanism we can use to automatically check whether something changes (ie., an environment variable gets renamed / added / removed), so that we can implement a CI test that will fail for us in this case? Thanks for continuing to follow up on this request!

@rickle-msft
Copy link
Contributor

@droazen Sure thing. And thank you for continuing to work with us on getting the best experience out.

I hear your concerns. Once we GA, Azure SDK has a policy of no breaks, so we would not be removing or renaming any existing options. Even if we choose to deprecate one, it will continue to function. Additions would be non breaking, so they wouldn't interfere with existing options anyway.

Additionally, because we'll be loading these options on startup to cache for the lifetime of the application, if there's an error in your configuration, your test infrastructure should also see that as soon as it loads the provider.

How does that sound to you

@droazen
Copy link
Author

droazen commented Apr 14, 2022

@rickle-msft Sounds good to me! I was mainly worried about the case where we set a variable X in the environment, and then the library one day stops checking for that particular variable and silently continues along without error, but with the wrong value for, eg., the number of retries. But if there's a policy of no removals then this should not be a problem!

@droazen
Copy link
Author

droazen commented Jul 18, 2022

@rickle-msft Any updates on this ticket?

@rickle-msft
Copy link
Contributor

@droazen Unfortunately the last few months I had something cut in line for priority. That's been pretty all consuming but hopefully wrapping up in the next couple weeks. If all goes well, I should be able to pick this back up once the other item is complete. Apologies that there have been further delays on this.

Copy link

Hi @droazen, we deeply appreciate your input into this project. Regrettably, this issue has remained inactive for over 2 years, leading us to the decision to close it. We've implemented this policy to maintain the relevance of our issue queue and facilitate easier navigation for new contributors. If you still believe this topic requires attention, please feel free to create a new issue, referencing this one. Thank you for your understanding and ongoing support.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Mar 15, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Mar 15, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Client This issue points to a problem in the data-plane of the library. customer-reported Issues that are reported by GitHub users external to the Azure organization. feature-request This issue requires a new behavior in the product in order be resolved. needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team Storage Storage Service (Queues, Blobs, Files)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants