
[BigQuery] Raw token authentication method #2802

Closed
davehughes opened this issue Sep 30, 2020 · 11 comments · Fixed by #2805

Labels
bigquery enhancement New feature or request

Comments

@davehughes

Describe the feature

The existing auth methods for BigQuery provide a great experience for interactive users, allowing them to transparently provision access tokens in a variety of ways. However, as an automation implementer on a team trying to programmatically wield dbt on behalf of customers, I'd like to be able to bypass these niceties and just inject an externally provisioned token for access.

In our typical non-dbt scenarios, we create service accounts for customers, have them grant specific limited permissions to those accounts for our service's operations, then use our master account to issue scoped tokens for those accounts/operations when they need to execute. When running dbt operations, I'd like to avoid writing our master account credentials file to disk (as required by the service-account/service-account-json methods), to protect against reflected file attacks and vulnerabilities in dbt itself, either of which could exfiltrate these creds. While I don't see these as particularly likely scenarios, I'd love the ability to directly control the blast radius via smaller scoped credentials.

I propose a new BigQuery auth method named 'service-token' and an additional field named service_token to provide its value, e.g.:

my-bigquery-db:
  target: dev
  outputs:
    dev:
      type: bigquery

      # something like this?
      method: service-token
      service_token: 'ya29.c.KqwC3gfRlXO...'

      project: [GCP project id]
      dataset: [the name of your dbt dataset]
      threads: [1 or more]
      timeout_seconds: 300
      priority: interactive
      retries: 1

Describe alternatives you've considered

With the impersonate_service_account field added in 0.18.0, I can connect as delegated services, but I still need to provide the full master service keyfile rather than a more limited credential.

Additional context

This is specific to BigQuery database connections and automation use-cases. I think it's fair to say that standard interactive users would almost never want to use this authentication mode directly.

Who will this benefit?

  • Security-minded SaaS automators who are targeting multiple warehouses and connecting to BigQuery on behalf of customers
  • IT/data engineering staff using automation to provision limited credentials as part of employee access control

Are you interested in contributing this feature?

Yes, happy to contribute to making this a reality. I have a simple (if slightly hacky) proof-of-concept that I can polish into a real PR if this seems interesting.

@davehughes davehughes added enhancement New feature or request triage labels Sep 30, 2020
@jtcohen6 jtcohen6 added bigquery and removed triage labels Sep 30, 2020
@jtcohen6
Contributor

@davehughes I'm broadly supportive of the need you're describing. To clarify what you mean by service token: it sounds like this is a temporary access token generated via oauth, as in generateAccessToken, and that you handle generating and refreshing tokens through a separate automated process. Is that right?

The reason I ask is that we're planning to work later this year on supporting BigQuery oauth as a connection mechanism (#2344), with the goal of hooking into GSuite SSO in dbt Cloud. I could be wrong, but it sounds like there's potential overlap between the core work we'd need there and an even more robust version of the auth mechanism you're interested in.

@davehughes
Author

Yes, we use generateAccessToken to generate the tokens that we'd want to pass in here, and your general understanding is correct.

In #2344, dbt still has creds to connect to the remote service and generate/fetch tokens, each of which is used for service auth until the token expires (at which point it can be refreshed with another call to the token_uri). Here, the token is completely opaque - not refreshable, may not have the right scopes, may already be expired, etc. - and it's the job of the external automated process to make sure scopes and expiry are properly set up. dbt would just use the token directly for service auth or die trying.
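
(For illustration, a minimal sketch of how such an external process might mint a scoped, short-lived token via the IAM Credentials generateAccessToken endpoint. The service account email, the chosen scope, and the use of application default credentials for the master account are all assumptions for the example, not part of the proposal:)

import google.auth
import google.auth.transport.requests
import requests

# Hypothetical target: the customer's limited-permission service account.
TARGET_SA = "customer-sa@customer-project.iam.gserviceaccount.com"

# Authenticate as the master account (here via application default
# credentials) to call the IAM Credentials API.
source_creds, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
source_creds.refresh(google.auth.transport.requests.Request())

# generateAccessToken mints a short-lived, scope-limited token for the
# target account; this opaque string is what would be handed to dbt.
resp = requests.post(
    "https://iamcredentials.googleapis.com/v1/projects/-/serviceAccounts/"
    f"{TARGET_SA}:generateAccessToken",
    headers={"Authorization": f"Bearer {source_creds.token}"},
    json={
        "scope": ["https://www.googleapis.com/auth/bigquery"],
        "lifetime": "3600s",  # extendable to 12h via the org policy discussed below
    },
)
resp.raise_for_status()
access_token = resp.json()["accessToken"]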

@drewbanin
Contributor

or die trying

😇

@davehughes I checked out the docs that you linked to, and it sure seems like this one is doable. I see:

The desired lifetime duration of the access token in seconds.

By default, the maximum allowed value is 1 hour. To set a lifetime of up to 12 hours, you can add the service account as an allowed value in an Organization Policy that enforces the constraints/iam.allowServiceAccountCredentialLifetimeExtension constraint. See detailed instructions at https://cloud.google.com/iam/help/credentials/lifetime

The thing to watch out for here would be the failure mode where a dbt run takes longer than the lifetime of the token. Sometimes access tokens have very short TTLs, but 1-12 hours seems super appropriate to me.

I have a simple (if slightly hacky) proof-of-concept that I can polish into a real PR if this seems interesting.

Can you just share a tiny bit about how you're creating a Credentials object in your proof-of-concept code? Are you creating a google.oauth2.credentials.Credentials object directly, and if so, are you just specifying a token to that constructor?

If so, then I think this change will indeed pair really well with #2344 (though there are two distinct code changes!). If that sounds about right to you, then we'd be really happy to accept a PR for this one :)
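
(For reference, a minimal sketch of the direct construction being discussed here, assuming the token string comes from an external provisioning process; the project name is a placeholder:)

from google.cloud import bigquery
from google.oauth2.credentials import Credentials

# token is the only required constructor argument; refresh_token, client_id,
# client_secret, and token_uri can all stay None for a raw external token.
creds = Credentials(token="ya29.c.KqwC3gfRlXO...")

# Any query made after the token expires will simply fail, since the
# credential has no way to refresh itself.
client = bigquery.Client(project="my-gcp-project", credentials=creds)
print(list(client.query("SELECT 1").result()))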

@davehughes
Author

Hmm...I hadn't seen the doc on the max access token lifetime, which might throw a bit of a wrench in here, though setting up the lifetime extension and choosing an appropriately long lifetime won't be a problem. We don't usually see dbt run invocations that take longer than an hour, but I can definitely imagine they exist (some even exceeding 12 hours).

As for creating a Credentials object, I resorted to just creating a small class that quacks like it should (this is the hacky bit!):

import datetime

from google.auth import credentials


class RawServiceCredentials(credentials.Credentials):
    def __init__(self, token, expiry=None):
        super().__init__()
        self.token = token
        # This keeps users of the credential happy that it's not expired
        # and doesn't need a refresh() call
        self.expiry = expiry or datetime.datetime.now() + datetime.timedelta(days=1)

    # This is here to fulfill the abstract interface check, and should
    # maybe raise NotImplementedError instead
    def refresh(self, request):
        pass

@drewbanin
Contributor

Yeah - I think that's fair. I do think that dbt runs which take > 1 hour are not uncommon, but I think you get to add constraints around things like this for security-minded features!

I bet we can swing this implementation using the google.oauth2.credentials.Credentials object. Hopefully you can just pass in a token and everything else will work from there! I'm going to PR a change for #2344 which might serve as a good template for a change like this one if that's helpful at all - happy to link to it here when it's up!

@davehughes
Author

I'd be interested to see that PR! And I'll try to find some time to experiment with google.oauth2.credentials.Credentials and see if it fits the bill.

@drewbanin
Contributor

Check out the PR here: #2805

@drewbanin
Contributor

@davehughes PR #2805 is going to add support for providing a raw access token in a profiles.yml file. Do you think that implementation helps resolve this issue too?

@davehughes
Author

Yep, just pulled that down and confirmed that it works for me with the following settings:

type: bigquery
method: oauth-secrets
token: <my generated token>
# refresh_token/client_id/client_secret/token_uri set to None

Passing a bad or expired token errors with the "Unable to generate access token" message, which is reasonable for folks using this particular approach.
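
(One pattern that may help other automation implementers: dbt's env_var() function works in profiles.yml, so the token can be referenced as token: "{{ env_var('BIGQUERY_OAUTH_TOKEN') }}" and injected per invocation without ever touching disk. The variable name and the mint_scoped_token() helper below are hypothetical:)

import os
import subprocess


def mint_scoped_token() -> str:
    # Hypothetical: return a freshly minted scoped access token, e.g. via
    # the generateAccessToken sketch earlier in this thread.
    ...


# profiles.yml references the value as {{ env_var('BIGQUERY_OAUTH_TOKEN') }},
# so the raw token travels through the environment, never written to disk.
env = dict(os.environ, BIGQUERY_OAUTH_TOKEN=mint_scoped_token())
subprocess.run(["dbt", "run"], env=env, check=True)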

Thanks @drewbanin @jtcohen6 for the nice solution. I'll look forward to this landing. 😎

@drewbanin
Contributor

Awesome, glad to hear it!

@uaroracca

Hello, the discussion above is super helpful. I just had another related issue, if someone could help with this.

Issue:
[screenshot of the error message; image not recoverable]

profiles.yml:
[screenshot of the profiles.yml config; image not recoverable]

The same config was previously working, but after a few months it's causing this issue. Any help is appreciated.

Thanks.
