Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(ingest): support for env variable in cli #3215

Merged
Merged
9 changes: 9 additions & 0 deletions docs/how/delete-metadata.md
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,15 @@ will allow you to customize the datahub instance you are communicating with.

_Note: Provide your GMS instance's host when the prompt asks you for the DataHub host._

Alternatively, you can set the following env variables if you don't want to use a config file
```
DATAHUB_SKIP_CONFIG=True
DATAHUB_GMS_HOST=http://localhost:8080
DATAHUB_GMS_TOKEN=
```

The env variables take precendence over what is in the config.

## Delete By Urn

To delete all the data related to a single entity, run
Expand Down
55 changes: 49 additions & 6 deletions metadata-ingestion/src/datahub/cli/cli_utils.py
Original file line number Diff line number Diff line change
@@ -1,18 +1,26 @@
import json
import logging
import os.path
import sys
import typing
from datetime import datetime
from typing import Optional
from typing import List, Optional, Tuple

import click
import requests
import yaml
from pydantic import BaseModel, ValidationError

log = logging.getLogger(__name__)

DEFAULT_GMS_HOST = "http://localhost:8080"
CONDENSED_DATAHUB_CONFIG_PATH = "~/.datahubenv"
DATAHUB_CONFIG_PATH = os.path.expanduser(CONDENSED_DATAHUB_CONFIG_PATH)

ENV_SKIP_CONFIG = "DATAHUB_SKIP_CONFIG"
ENV_METADATA_HOST = "DATAHUB_GMS_HOST"
ENV_METADATA_TOKEN = "DATAHUB_GMS_TOKEN"


class GmsConfig(BaseModel):
server: str
Expand All @@ -35,18 +43,23 @@ def write_datahub_config(host: str, token: Optional[str]) -> None:
return None


def get_session_and_host():
session = requests.Session()
def should_skip_config() -> bool:
try:
return os.environ[ENV_SKIP_CONFIG] == "True"
except KeyError:
return False

gms_host = "http://localhost:8080"
gms_token = None

def ensure_datahub_config() -> None:
if not os.path.isfile(DATAHUB_CONFIG_PATH):
click.secho(
f"No {CONDENSED_DATAHUB_CONFIG_PATH} file found, generating one for you...",
bold=True,
)
write_datahub_config(gms_host, gms_token)
write_datahub_config(DEFAULT_GMS_HOST, None)


def get_details_from_config():
with open(DATAHUB_CONFIG_PATH, "r") as stream:
try:
config_json = yaml.safe_load(stream)
Expand All @@ -63,8 +76,38 @@ def get_session_and_host():

gms_host = gms_config.server
gms_token = gms_config.token
return gms_host, gms_token
except yaml.YAMLError as exc:
click.secho(f"{DATAHUB_CONFIG_PATH} malformatted, error: {exc}", bold=True)
return None, None


def get_details_from_env() -> Tuple[Optional[str], Optional[str]]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like this would be simpler to model as a Dict[str, str] being returned with keys being GMS_HOST and GMS_TOKEN .. then you don't need the first-non-null checks later on.. you can just pick the value of these variables from the Dict directly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would still need to use the Optional[str] in Dict because it is quite possible one of the variables was set and not the other. We cannot assume user will either provide both of the variables or none. The first non-null check is for that purpose.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm generally not a fan of using Tuples .. because there is implied order here (first element is host, second element is token) and can lead to bugs later on if someone misunderstands the order ... with a dictionary like "host" -> value, and "token" -> value, it is harder to make those mistakes.

But I see that the rest of the code in this file has similar things. So I'll let this in and we can do a pass over these later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point on the implied order. Had not considered from that point.

return os.environ.get(ENV_METADATA_HOST), os.environ.get(ENV_METADATA_TOKEN)


def first_non_null(ls: List[Optional[str]]) -> Optional[str]:
return next((el for el in ls if el is not None and el.strip() != ""), None)


def get_session_and_host():
session = requests.Session()

gms_host_env, gms_token_env = get_details_from_env()
if should_skip_config():
gms_host = gms_host_env
gms_token = gms_token_env
else:
ensure_datahub_config()
gms_host_conf, gms_token_conf = get_details_from_config()
gms_host = first_non_null([gms_host_env, gms_host_conf])
gms_token = first_non_null([gms_token_env, gms_token_conf])

if gms_host is None or gms_host.strip() == "":
log.error(
f"GMS Host is not set. Use datahub init command or set {ENV_METADATA_HOST} env var"
)
return None, None

session.headers.update(
{
Expand Down
11 changes: 11 additions & 0 deletions metadata-ingestion/tests/unit/test_cli_utils.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
from datahub.cli import cli_utils


def test_first_non_null():
assert cli_utils.first_non_null([]) is None
assert cli_utils.first_non_null([None]) is None
assert cli_utils.first_non_null([None, "1"]) == "1"
assert cli_utils.first_non_null([None, "1", "2"]) == "1"
assert cli_utils.first_non_null(["3", "1", "2"]) == "3"
assert cli_utils.first_non_null(["", "1", "2"]) == "1"
assert cli_utils.first_non_null([" ", "1", "2"]) == "1"