
Use same auth for Dataproc + GCS #224

Merged (2 commits) — Jul 21, 2022

Conversation

jtcohen6 (Contributor)

resolves CT-861 (just opened)

Try using the same authentication methods for Dataproc + GCS that we're using for BigQuery, so we don't need to rely on the GOOGLE_APPLICATION_CREDENTIALS env var being set.

I still haven't been able to get this running on my local machine, unfortunately. job_client.submit_job_as_operation returns:

501 Received http2 header with status: 404
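The approach described above can be sketched roughly like this (a minimal illustration; `build_pyspark_job` and the commented client setup are illustrative shapes, not code from the diff):

```python
# Sketch: derive one credentials object and hand it to every client,
# instead of relying on the GOOGLE_APPLICATION_CREDENTIALS env var, e.g.:
#
#   creds = BigQueryConnectionManager.get_credentials(credential)
#   storage_client = storage.Client(project=..., credentials=creds)
#   job_client = dataproc_v1.JobControllerClient(credentials=creds)
#
# The job payload submitted to Dataproc then points at a PySpark file
# staged in GCS:

def build_pyspark_job(cluster_name, gcs_bucket, filename):
    """Build the Dataproc job payload for a PySpark file staged in GCS."""
    return {
        "placement": {"cluster_name": cluster_name},
        "pyspark_job": {
            "main_python_file_uri": "gs://{}/{}".format(gcs_bucket, filename)
        },
    }
```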

@jtcohen6 jtcohen6 requested a review from ChenyuLInx July 19, 2022 16:53
@cla-bot cla-bot bot added the cla:yes label Jul 19, 2022
ChenyuLInx (Contributor)

@jtcohen6 I updated the code a bit and tested locally; everything works well:

    )
    # Create the job config.
    job = {
        "placement": {"cluster_name": self.credential.dataproc_cluster_name},
        "pyspark_job": {
            "main_python_file_uri": "gs://{}/{}".format(self.credential.gcs_bucket, filename)
        },
    }
-   operation = job_client.submit_job_as_operation(
+   operation = self.job_client.submit_job_as_operation(
jtcohen6 (Contributor, Author)

TODO: pass in timeout + retry, if/as provided in profile configs (docs)

ChenyuLInx (Contributor)

I added a todo in the main ticket to keep track of it; I think we can close this one and do it separately.

jtcohen6 (Contributor, Author)

Alternatively, could we just use Google's own Retry() decorator for this? We'd still plumb through user-provided configurations where given, but fall back to sensible defaults otherwise. See: #231
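A hedged sketch of what that could look like (google.api_core.retry.Retry is the real class, but the profile setting names and default values below are assumptions, not settings dbt-bigquery actually exposes):

```python
# Illustrative defaults for google.api_core.retry.Retry; the names and
# values here are assumptions for the sketch, not dbt-bigquery config.
DEFAULT_RETRY_KWARGS = {
    "initial": 1.0,     # seconds before the first retry
    "maximum": 60.0,    # cap on the backoff interval
    "multiplier": 2.0,  # exponential backoff factor
    "deadline": 600.0,  # give up after ten minutes total
}

def retry_kwargs(profile_config=None):
    """Merge user-provided retry settings (if any) over the defaults."""
    merged = dict(DEFAULT_RETRY_KWARGS)
    merged.update(profile_config or {})
    return merged

# The merged settings would then build the retry object and be passed to
# the job submission call, e.g.:
#   from google.api_core.retry import Retry
#   operation = job_client.submit_job_as_operation(
#       request=request, retry=Retry(**retry_kwargs(cfg)))
```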

@@ -863,33 +863,41 @@ def __init__(self, credential):
        credential (_type_): _description_
    """
    self.credential = credential

    self.GoogleCredentials = BigQueryConnectionManager.get_credentials(credential)
    self.storage_client = storage.Client(project=self.credential.database, credentials=self.GoogleCredentials)
jtcohen6 (Contributor, Author)

Let's also raise a clear validation error if the user forgets to install the [dataproc] extras.
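That validation could be sketched with a small import guard (a hedged illustration; `require_extra` and the error message are hypothetical, not code from this PR):

```python
import importlib

def require_extra(module_name, extra_name):
    """Import a module, raising a clear error that names the missing extra."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            "{} is not installed. Install the optional dependency with: "
            "pip install 'dbt-bigquery[{}]'".format(module_name, extra_name)
        ) from exc

# e.g. dataproc_v1 = require_extra("google.cloud.dataproc_v1", "dataproc")
```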

ChenyuLInx (Contributor)

Do you feel it would actually be better to have an enable_python flag in dbt_project.yml to control all logic/imports related to Python?

jtcohen6 (Contributor, Author)

Discussed in Slack - let's hold off on doing this for now, but I'd be curious to hear if it comes up in beta feedback!

@ChenyuLInx ChenyuLInx mentioned this pull request Jul 21, 2022
2 tasks
@ChenyuLInx ChenyuLInx marked this pull request as ready for review July 21, 2022 22:09
@ChenyuLInx ChenyuLInx merged commit 9be7017 into feature/python-model-v1 Jul 21, 2022
@ChenyuLInx ChenyuLInx deleted the jerco/python-v1-experimentation branch July 21, 2022 22:10