Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use custom ENV variables for Azure Storage Remote. #2256

Closed
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
24 changes: 23 additions & 1 deletion dvc/remote/azure.py
Original file line number Diff line number Diff line change
Expand Up @@ -18,11 +18,33 @@
from dvc.config import Config
from dvc.remote.base import RemoteBASE
from dvc.path_info import CloudURLInfo
from dvc.exceptions import DvcException


logger = logging.getLogger(__name__)


def _read_connection_string_from_config(config):
connection_string_from_config = config.get(
Config.SECTION_AZURE_CONNECTION_STRING
)
if (
connection_string_from_config is not None
) and connection_string_from_config.startswith("$"):
# read custom environment variable discarding the '$' character
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we actually need $ though? Maybe I'm missing something, but I am not able to find $ used in connection strings at all.

Copy link
Contributor

@efiop efiop Jul 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Realised that that ^ didn't come up very clear 😅 I meant that I am not able to find where such trick with env vars is used elsewhere (e.g. in other projects that use azure). Or is it something that you've come up with yourself?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, maybe a little background. When working with Azure, it is handy to save the storage account credentials as connection strings in a environment variables and retrieve them from client libraries (like azure-storage-blob). I explicitly used plural form to highlight the fact that there are multiple storage accounts. Now working with dvc remotes on different storages you need a way to select the right destination of data. The solution with ENV vars would keep the connection string private while showing the name of the ENV variable (which itself encodes the destination with e.g. the storage-account name). Team members can, after setting up their .bash_profile, contribute to projects.

You are right, it is nothing Azure specific - maybe a helper function for other remote types would be handy...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For other remotes it might be dangerous. What if someone would want to do something like $PART1_$PART2 ? This env evaluation in config feels like it is too powerful and doesn't really belong in the current form in the config file itself. It is also weird that it will be supported like that for 1 specific config option for 1 specific remote and if we try to support it for other options later, it has a potential to run into conflicts, where we might no longer be able to tell easily if it is an env var that we need to evaluate. But just mixing config and env doesn't feel right in this scenario.

Feels like the proper way to work with this is to use .dvc/config.local, as you've described. You will be sharing your .bash_profile anyway, so why not just share .dvc/config.local privately instead? You can use placeholders in .dvc/config like so too if you want to:

dvc remote modify myremote connection_string MY_PLACEHOLDER

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For other remotes it might be dangerous. What if someone would want to do something like $PART1_$PART2 ? This env evaluation in config feels like it is too powerful and doesn't really belong in the current form in the config file itself. It is also weird that it will be supported like that for 1 specific config option for 1 specific remote and if we try to support it for other options later, it has a potential to run into conflicts, where we might no longer be able to tell easily if it is an env var that we need to evaluate. But just mixing config and env doesn't feel right in this scenario.

Feels like the proper way to work with this is to use .dvc/config.local, as you've described. You will be sharing your .bash_profile anyway, so why not just share .dvc/config.local privately instead? You can use placeholders in .dvc/config like so too if you want to:

dvc remote modify myremote connection_string MY_PLACEHOLDER

I get your first point about the power of such parsing functionality - agree on that completely.

On the other hand (your second point) there exists a specific solution with azure where we read a connection string from ENV with AZURE_STORAGE_CONNECTION_STRING - working with one storage account will be fine. Having different storage accounts with different levels of security/back up strategy will be hard (since we distribute write/delete/list connection strings).

Why I do not want to share .dvc/config.local is because you have to do that for each project - plus you have to track who got which token for which project instead of having a list of team members and tokens once.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with your points, but I'm just not sure that this is the proper way to solve your problem. With things like s3, you would probably just use profiles, right? Does azure have those?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kopytjuk Or maybe file-based auth?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No Azure client libraries (at least the official one in Python) supports authentication via SAS tokens or primary key of the storage account.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kopytjuk maybe something like

client = get_client_from_auth_file(StorageManagementClient)
storage_keys = storage_client.storage_accounts.list_keys(resource_group_name, storage_account_name)
storage_keys = {v.key_name: v.value for v in storage_keys.keys}

storage_client = CloudStorageAccount(storage_account_name, storage_keys['key1'])
blob_service = storage_client.create_block_blob_service()

from Azure/azure-sdk-for-python#1310 (comment) ?

env_var_name = connection_string_from_config[1:]
try:
connection_string_from_config = os.environ[env_var_name]
except KeyError as exc:
raise DvcException(
(
"Variable '${}' is not set in your current environment."
).format(env_var_name),
exc,
)
return connection_string_from_config


class Callback(object):
def __init__(self, name):
self.name = name
Expand Down Expand Up @@ -58,7 +80,7 @@ def __init__(self, repo, config):
)

self.connection_string = (
config.get(Config.SECTION_AZURE_CONNECTION_STRING)
_read_connection_string_from_config(config)
or match.group("connection_string") # backward compatibility
or os.getenv("AZURE_STORAGE_CONNECTION_STRING")
)
Expand Down
28 changes: 28 additions & 0 deletions tests/unit/remote/test_azure.py
Original file line number Diff line number Diff line change
@@ -1,6 +1,8 @@
from unittest import TestCase
import os

from dvc.remote.azure import RemoteAZURE
from dvc.exceptions import DvcException


class TestRemoteAZURE(TestCase):
Expand All @@ -21,6 +23,7 @@ def test_init_compat(self):
)
config = {"url": url}
remote = RemoteAZURE(None, config)

self.assertEqual(remote.path_info, "azure://" + self.container_name)
self.assertEqual(remote.connection_string, self.connection_string)

Expand All @@ -29,5 +32,30 @@ def test_init(self):
url = "azure://{}/{}".format(self.container_name, prefix)
config = {"url": url, "connection_string": self.connection_string}
remote = RemoteAZURE(None, config)

self.assertEqual(remote.path_info, url)
self.assertEqual(remote.connection_string, self.connection_string)

def test_init_from_env(self):
prefix = "some/prefix"
url = "azure://{}/{}".format(self.container_name, prefix)
env_var_name = "AZURE_TEST_VAR"
os.environ[env_var_name] = self.connection_string
kopytjuk marked this conversation as resolved.
Show resolved Hide resolved

config = {"url": url, "connection_string": "${}".format(env_var_name)}
remote = RemoteAZURE(None, config)

self.assertEqual(remote.path_info, url)
kopytjuk marked this conversation as resolved.
Show resolved Hide resolved
self.assertEqual(remote.connection_string, self.connection_string)

# clean var from env again
del os.environ[env_var_name]

def test_init_from_env_not_set(self):
prefix = "some/prefix"
url = "azure://{}/{}".format(self.container_name, prefix)
env_var_name = "AZURE_TEST_VAR"
kopytjuk marked this conversation as resolved.
Show resolved Hide resolved
config = {"url": url, "connection_string": "${}".format(env_var_name)}

with self.assertRaises(DvcException):
kopytjuk marked this conversation as resolved.
Show resolved Hide resolved
RemoteAZURE(None, config)