Feature/python model v1 #209
Conversation
```python
models__simple_python_model_v2 = """
import pandas

def model(dbt):
```
What the model code looks like
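For readers following along: the diff view truncates this fixture. A minimal sketch of how the full model string might read, keeping the `def model(dbt)` signature shown in this version of the PR; the body is illustrative, not the PR's actual code:

```python
models__simple_python_model_v2 = """
import pandas

def model(dbt):
    dbt.config(materialized="table")
    # Illustrative body: return a small DataFrame for the test to materialize
    return pandas.DataFrame({"id": [1, 2, 3]})
"""
```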
```python
spark = SparkSession.builder.appName('smallTest').getOrCreate()

spark.conf.set("viewsEnabled", "true")
spark.conf.set("temporaryGcsBucket", "python-model-test")
```
TODO in the comment above
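For context on the two settings above: `viewsEnabled` lets the spark-bigquery connector read BigQuery views, and `temporaryGcsBucket` names the staging bucket the connector's indirect write path uses. A hedged sketch of the write that depends on them, continuing from the `spark` session above (the target table is hypothetical):

```python
# Continuing from the SparkSession above. With temporaryGcsBucket set, the
# spark-bigquery connector stages the DataFrame in GCS, then loads it into
# the target table.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
(
    df.write.format("bigquery")
    .option("table", "my_dataset.my_table")  # hypothetical target table
    .mode("overwrite")
    .save()
)
```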
dbt/adapters/bigquery/impl.py
Outdated
```python
from google.cloud import dataproc_v1
from google.cloud import storage
```
- Let's add these as an `extras_require` extra, e.g. `pip install dbt-bigquery[dataproc]`
- Let's raise a clearer error below if a user wants to use Python models and hasn't installed the extra (sketch of both below)
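A sketch of both pieces, assuming an extra named `dataproc` (all names here are illustrative, not the PR's final code). `setup.py` would gain something like `extras_require={"dataproc": ["google-cloud-dataproc", "google-cloud-storage"]}`, and the import in `impl.py` could then fail with a pointed message:

```python
# impl.py (sketch): tolerate the missing extra at import time, then raise
# a clear error only when Python models are actually used.
try:
    from google.cloud import dataproc_v1  # type: ignore
    from google.cloud import storage  # type: ignore
except ImportError:
    dataproc_v1 = None
    storage = None


def _require_dataproc_deps() -> None:
    """Raise a friendly error when the optional Dataproc deps are absent."""
    if dataproc_v1 is None or storage is None:
        raise RuntimeError(
            "Python models on BigQuery require the optional Dataproc "
            "dependencies. Install them with: pip install dbt-bigquery[dataproc]"
        )
```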
dbt/adapters/bigquery/impl.py
Outdated
```python
matches = re.match("gs://(.*?)/(.*)", response.driver_output_resource_uri)
output = (
    storage.Client()
    .get_bucket(matches.group(1))
    .blob(f"{matches.group(2)}.000000000")
    .download_as_string()
)
```
There's no way to pass a full GCS resource URL into `storage.Client()`? That's sorta surprising to me. This isn't the worst regex, so I'm not strictly opposed to taking this approach if it's necessary.
I am not sure! I just followed a tutorial Google provided for this; we can look into it more if needed.
ope ok makes sense! :)
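For what it's worth, `google-cloud-storage` does have a helper that parses `gs://` URIs directly, so the regex could likely be dropped. A minimal sketch, assuming the same `driver_output_resource_uri` and `.000000000` first-chunk suffix as in the diff:

```python
from google.cloud import storage


def download_driver_output(driver_output_uri: str) -> bytes:
    """Fetch the first chunk of Dataproc driver output from GCS.

    Blob.from_string understands gs://bucket/path URIs, replacing the
    manual regex split; ".000000000" names the first output chunk.
    """
    blob = storage.Blob.from_string(
        f"{driver_output_uri}.000000000", client=storage.Client()
    )
    return blob.download_as_string()
```

Usage would be `download_driver_output(response.driver_output_resource_uri)`, matching the call site above.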
dbt/adapters/bigquery/impl.py
Outdated
```python
# Dataproc job output is saved to the Cloud Storage bucket
# allocated to the job. Use regex to obtain the bucket and blob info.
matches = re.match("gs://(.*?)/(.*)", response.driver_output_resource_uri)
output = (
```
I take it that these are Spark logs being streamed back to GCS: https://cloud.google.com/dataproc/docs/guides/dataproc-job-output
If we felt motivated, we could try parsing these logs to infer metadata for the result. Not a priority right now
Yes, I looked into it briefly and didn't find much worth surfacing right now. We can take another look later on.
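If someone does pick this up later, a rough sketch of the shape it could take; the pattern below is purely hypothetical, since nothing in this thread pins down what the driver output actually contains:

```python
import re
from typing import Optional


def rows_from_driver_output(output: str) -> Optional[int]:
    """Scan Dataproc driver output for a row count, if one was logged.

    Hypothetical pattern: the real Spark / connector log lines would need
    to be inspected before committing to anything like this.
    """
    match = re.search(r"(\d+)\s+rows?", output)
    return int(match.group(1)) if match else None
```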
* Use same auth for Dataproc + GCS
* remove fetch result

Co-authored-by: Chenyu Li <[email protected]>

* fix test and add clear install instruction
* rename and fix format
Nothing stands out 👍
I don't see why the `dataproc_region` wasn't found. Once that is working, looks fine.
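In case it helps with debugging, a sketch of a functional-test profile target with the Dataproc fields spelled out; the field names are assumed from this PR's configuration, and all values are placeholders:

```python
import pytest


@pytest.fixture(scope="class")
def dbt_profile_target():
    # Placeholder values throughout; dataproc_region is the field
    # reported as not found above.
    return {
        "type": "bigquery",
        "method": "service-account",
        "keyfile": "/path/to/keyfile.json",
        "project": "my-gcp-project",
        "dataset": "my_test_dataset",
        "gcs_bucket": "python-model-test",
        "dataproc_region": "us-central1",
        "dataproc_cluster_name": "my-dataproc-cluster",
    }
```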
This change currently includes the table materialization.
We're also super happy to hear any feedback, and anything you think we missed.
TODO: