-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use custom ENV variables for Azure Storage Remote. #2256
Use custom ENV variables for Azure Storage Remote. #2256
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great stuff @kopytjuk !
Added 1 comment and few suggestions, mostly to match test formatting and how we fill strings in the rest of the project. Though I have one question, do you know, if by any chance, connection string can start with $
sign? If so, we need to adress situation where config contains connection string, and we try to read it as environmental variable.
According to Azure documentation a connection string can start with |
5bfcb80
to
ca58f55
Compare
if ( | ||
connection_string_from_config is not None | ||
) and connection_string_from_config.startswith("$"): | ||
# read custom environment variable discarding the '$' character |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we actually need $ though? Maybe I'm missing something, but I am not able to find $ used in connection strings at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Realised that that ^ didn't come up very clear 😅 I meant that I am not able to find where such trick with env vars is used elsewhere (e.g. in other projects that use azure). Or is it something that you've come up with yourself?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, maybe a little background. When working with Azure, it is handy to save the storage account credentials as connection strings in a environment variables and retrieve them from client libraries (like azure-storage-blob
). I explicitly used plural form to highlight the fact that there are multiple storage accounts. Now working with dvc remotes on different storages you need a way to select the right destination of data. The solution with ENV vars would keep the connection string private while showing the name of the ENV variable (which itself encodes the destination with e.g. the storage-account name). Team members can, after setting up their .bash_profile
, contribute to projects.
You are right, it is nothing Azure specific - maybe a helper function for other remote types would be handy...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For other remotes it might be dangerous. What if someone would want to do something like $PART1_$PART2 ? This env evaluation in config feels like it is too powerful and doesn't really belong in the current form in the config file itself. It is also weird that it will be supported like that for 1 specific config option for 1 specific remote and if we try to support it for other options later, it has a potential to run into conflicts, where we might no longer be able to tell easily if it is an env var that we need to evaluate. But just mixing config and env doesn't feel right in this scenario.
Feels like the proper way to work with this is to use .dvc/config.local, as you've described. You will be sharing your .bash_profile anyway, so why not just share .dvc/config.local privately instead? You can use placeholders in .dvc/config like so too if you want to:
dvc remote modify myremote connection_string MY_PLACEHOLDER
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For other remotes it might be dangerous. What if someone would want to do something like $PART1_$PART2 ? This env evaluation in config feels like it is too powerful and doesn't really belong in the current form in the config file itself. It is also weird that it will be supported like that for 1 specific config option for 1 specific remote and if we try to support it for other options later, it has a potential to run into conflicts, where we might no longer be able to tell easily if it is an env var that we need to evaluate. But just mixing config and env doesn't feel right in this scenario.
Feels like the proper way to work with this is to use .dvc/config.local, as you've described. You will be sharing your .bash_profile anyway, so why not just share .dvc/config.local privately instead? You can use placeholders in .dvc/config like so too if you want to:
dvc remote modify myremote connection_string MY_PLACEHOLDER
I get your first point about the power of such parsing functionality - agree on that completely.
On the other hand (your second point) there exists a specific solution with azure where we read a connection string from ENV with AZURE_STORAGE_CONNECTION_STRING
- working with one storage account will be fine. Having different storage accounts with different levels of security/back up strategy will be hard (since we distribute write/delete/list connection strings).
Why I do not want to share .dvc/config.local is because you have to do that for each project - plus you have to track who got which token for which project instead of having a list of team members and tokens once.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with your points, but I'm just not sure that this is the proper way to solve your problem. With things like s3, you would probably just use profiles, right? Does azure have those?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kopytjuk Or maybe file-based auth?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No Azure client libraries (at least the official one in Python) supports authentication via SAS tokens or primary key of the storage account.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kopytjuk maybe something like
client = get_client_from_auth_file(StorageManagementClient)
storage_keys = storage_client.storage_accounts.list_keys(resource_group_name, storage_account_name)
storage_keys = {v.key_name: v.value for v in storage_keys.keys}
storage_client = CloudStorageAccount(storage_account_name, storage_keys['key1'])
blob_service = storage_client.create_block_blob_service()
ca58f55
to
89059b6
Compare
Closing as stale. |
Have you followed the guidelines in our
Contributing document?
Does your PR affect documented changes or does it add new functionality
that should be documented? If yes, have you created a PR for
dvc.org documenting it or at
least opened an issue for it? If so, please add a link to it.
A solution for https://github.com/iterative/dvc/issues/2255
Appending docs: iterative/dvc.org#482