Check pandas before import #229
Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the dbt-snowflake contributing guide.
👍 code looks cleaner
Another approach we could take: adding some "if pandas" logic within our "materialization" code, to detect if it's installed before trying to call it. Two reasons:
- Requiring pandas for all Python models means that enabling Anaconda packages / accepting third-party terms is a prerequisite for using Snowpark Python at all (Snowpark setup third-party terms docs.getdbt.com#1848)
- Could loading pandas in every stored proc risk slowing down its runtime?
Basic idea: this code
dbt-snowflake/dbt/include/snowflake/macros/materializations/table.sql
Lines 44 to 48 in 340c3fa
    # we have to make sure pandas is imported
    import pandas
    if isinstance(df, pandas.core.frame.DataFrame):
        # session.write_pandas does not have overwrite function
        df = session.createDataFrame(df)
would become something like:
    # only convert when pandas is actually available in the stored proc
    try:
        import pandas
    except ImportError:
        pandas = None
    if pandas and isinstance(df, pandas.core.frame.DataFrame):
        # session.write_pandas does not have overwrite function
        df = session.createDataFrame(df)
Co-authored-by: Jeremy Cohen <[email protected]>
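For illustration, the optional-import guard suggested above can be exercised outside Snowflake. This is a minimal sketch, not the code from the materialization: `FakeSession` and `normalize_model_output` are hypothetical names, and the stub only mimics the `createDataFrame` call shown in the snippet.

```python
# Optional import: pandas may be unavailable if Anaconda packages are not enabled.
try:
    import pandas
except ImportError:
    pandas = None


class FakeSession:
    """Hypothetical stand-in for a Snowpark session; it only records conversions."""

    def __init__(self):
        self.converted = []

    def createDataFrame(self, df):
        # Pretend to build a Snowpark DataFrame from a pandas one.
        self.converted.append(df)
        return ("snowpark_df", df)


def normalize_model_output(df, session):
    """Convert pandas output to a session DataFrame; pass anything else through."""
    if pandas is not None and isinstance(df, pandas.DataFrame):
        return session.createDataFrame(df)
    return df


if __name__ == "__main__":
    session = FakeSession()
    not_pandas = {"rows": [1, 2, 3]}
    # A non-pandas object is returned unchanged, whether or not pandas is installed.
    print(normalize_model_output(not_pandas, session) is not_pandas)
```

With this shape, the materialization never requires pandas itself; conversion only happens when the model's own code has already made pandas available.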
@jtcohen6 As I looked into this more, I realized that the reason we don't run into this issue without specify Product call and we will make it happen.
@ChenyuLInx Sounds good! Let's do the other approach then: check to see whether
Only question: It's not totally clear to me if a user could still call
    def model(dbt, session):
        df = dbt.ref('some_model')
        return df.to_pandas()
And so the user would need to configure it as:
    def model(dbt, session):
        dbt.config(packages = ['pandas'])
        df = dbt.ref('some_model')
        return df.to_pandas()
@jtcohen6 Looking through the Slack chat again to see how this issue came about, here's my current theory.
So the current solution of checking whether pandas is available should work just fine.
It might be worth confirming with the Snowpark team, since both our testing and production accounts have accepted the terms and there's no way I can test it.
@@ -180,7 +180,13 @@ def submit_python_job(self, parsed_model: dict, compiled_code: str):
         database = getattr(parsed_model, "database", self.config.credentials.database)
         identifier = parsed_model["alias"]
         proc_name = f"{database}.{schema}.{identifier}__dbt_sp"
-        packages = ["snowflake-snowpark-python"] + parsed_model["config"].get("packages", [])
+        packages = parsed_model["config"].get("packages", [])
This makes sure that if the user also specifies snowflake-snowpark-python in the config, everything will still run correctly.
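The rest of the changed lines are not shown in this snippet, so the following is only a sketch of one way to get the behavior described: start from the user's configured packages and add the required one only if it is missing. The helper name `resolve_packages` is hypothetical and is not taken from the PR.

```python
REQUIRED_PACKAGE = "snowflake-snowpark-python"


def resolve_packages(config, required=REQUIRED_PACKAGE):
    """Return the user's packages plus the required one, without duplicates."""
    # Copy so the model config itself is never mutated.
    packages = list(config.get("packages", []))
    if required not in packages:
        packages.append(required)
    return packages


if __name__ == "__main__":
    # The user already listed the required package: no duplicate is added.
    print(resolve_packages({"packages": ["pandas", "snowflake-snowpark-python"]}))
    # The user listed nothing: only the required package is returned.
    print(resolve_packages({}))
```

Note that this sketch compares exact strings, so a version-pinned entry such as a hypothetical `snowflake-snowpark-python==X.Y` would not be recognized as the same package.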
resolves #228
Description
Add pandas as a default package for dbt-snowflake
Checklist
changie new to create a changelog entry