
restrict python submission #5822

Merged 11 commits from enhancement/block_submit_python into main on Sep 20, 2022
Conversation

@ChenyuLInx (Contributor) commented Sep 13, 2022:

resolves #5596

Description

Only allow submit_python_job to be used in the statement macro within materialization logic.
TODO

  • make adapter.submit_python_job in dbt-snowflake unavailable in Jinja


@ChenyuLInx ChenyuLInx requested a review from gshank September 13, 2022 02:35
@cla-bot cla-bot bot added the cla:yes label Sep 13, 2022
@github-actions (bot):
Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.

@@ -285,6 +292,13 @@ def __init__(
        self.node = node
        self.stack = stack

        if self.context and "adapter" in self.context:
            if "materialization" in self.macro.unique_id:
                self.context["special_functions"] = self.context["adapter"].submit_python_job
@ChenyuLInx (author):
@gshank This way the user can actually still call submit_python_job via {{ special_functions(model, python_code) }} in a macro. Is there a way to keep this in the context but not allow Jinja to access it?

I have also tried saving it somewhere else individually, but haven't found a good place: the MacroGenerator objects for the materialization macro and the statement macro are actually two different objects, with only the context being passed from one to the other (node and stack are both None for the materialization macro).

@ChenyuLInx (author):
Also tagging @jtcohen6 to see if you have any thoughts.

@ChenyuLInx ChenyuLInx self-assigned this Sep 13, 2022
@ChenyuLInx ChenyuLInx added the Skip Changelog Skips GHA to check for changelog file label Sep 13, 2022
@ChenyuLInx ChenyuLInx marked this pull request as ready for review September 13, 2022 22:43
@ChenyuLInx ChenyuLInx requested a review from a team as a code owner September 13, 2022 22:43
@@ -301,6 +319,15 @@ def exception_handler(self) -> Iterator[None]:
            e.stack.append(self.macro)
            raise e

    def _is_statement_under_materailization(self) -> bool:
        return bool(
            self.macro.unique_id == "macro.dbt.statement"
@ChenyuLInx (author):
@jtcohen6 with this, the user can't actually access submit_python_job in a custom statement macro. Any concern with that?

Contributor:
@ChenyuLInx I'm aligned. I'm not familiar with any reason or legitimate use case for an end user to override the statement macro — this should really be treated as internal to dbt-core.

If we want to loosen this restriction in the future, we can. (E.g. by changing this conditional to check for any macro named statement, rather than the unique_id containing project name dbt.)
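
For illustration, that looser check might look something like this (a hedged sketch, not code from this PR; the helper name is hypothetical):

    def is_statement_macro(unique_id: str) -> bool:
        # unique_id has the form "macro.<package_name>.<macro_name>";
        # match on the macro name alone instead of requiring package "dbt"
        return unique_id.split(".")[-1] == "statement"

    # any package's statement macro would qualify under the looser rule
    assert is_statement_macro("macro.my_project.statement")
    assert not is_statement_macro("macro.dbt.run_query")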

@ChenyuLInx (author):

I don't have a test since we can't run Python models in core right now, but I manually checked locally and made sure that materialization -> statement is the only place where you can properly access the Python submit-job function. You can also access it at the materialization root level with some strange and clearly bad-looking syntax, but not in any other place.

@ChenyuLInx (author):

Also skipping the changelog, since I don't think we want to tell everyone about this.

@@ -301,6 +319,15 @@ def exception_handler(self) -> Iterator[None]:
            e.stack.append(self.macro)
            raise e

    def _is_statement_under_materailization(self) -> bool:
@jtcohen6 commented Sep 14, 2022:
just a typo

Suggested change:
-    def _is_statement_under_materailization(self) -> bool:
+    def _is_statement_under_materialization(self) -> bool:

and where it's called below

@ChenyuLInx (author):
Fixed

@jtcohen6 left a comment:
Just to make sure I understand, the idea here is: submit_python_job can only be called from within a statement, within a materialization (macro containing the name materialization). Otherwise, the user should see an exception like:

submit_python_job is not intended to be called here.

I haven't been able to see the desired behavior when running locally. I tried checking out the branch, and then defining a model like:

-- models/my_model.sql
{% set compiled_code %}

def main(session):
    pass

{% endset %}

{% set result = adapter.submit_python_job(model, compiled_code) %}

-- actual view definition
select '{{ result }}' as result

This seems to actually work; it doesn't get blocked.


@@ -272,6 +275,13 @@ def pop(self, name):
            raise InternalException(f"popped {got}, expected {name}")


def raise_error_func(func_name: str) -> Callable:
    def raise_error(*args, **kwargs):
        raise InternalException(f"{func_name} is not intended to be called here.")
Contributor:
Will this include the name of the macro / node calling the unsupported function?

@ChenyuLInx (author):

Not really, but I can add it!

@ChenyuLInx (author):

Added it for usage in models. For hooks, operations, and tests we can't really tell (they use MacroGenerator), and we don't get info about which function actually calls it unless we do some really aggressive tracking, so I just added a generic message indicating it is in one of those places. Let me know if we want to be very specific.
I also included the full name adapter.submit_python_job in the message, so it shouldn't be hard to do a search.

Contributor:

Really solid UX improvement - thank you!

@ChenyuLInx ChenyuLInx requested review from a team and iknox-fa September 14, 2022 15:53
@ChenyuLInx ChenyuLInx requested a review from jtcohen6 September 14, 2022 21:01
@ChenyuLInx (author):

@jtcohen6 fixed the issue you mentioned by also removing that function at compile time.

@gshank left a comment:

I'm kind of concerned with this approach. I think doing this in the MacroGenerator is almost certainly the wrong place. That code would execute for every single macro that's built. The adapter that's linked in the contexts comes from the global adapter? Why doesn't the last created MacroGenerator determine which monkeypatched method is active?

I think the MacroNamespaceBuilder is a more likely place to put this -- you can modify the context before it's used to create the MacroGenerator object. But I would prefer finding some way to just pass along information on the 'submit_python_job' call that would allow us to determine the validity in the python code. What is it that we need to know? The call stack?

@ChenyuLInx (author):

@gshank the MacroNamespaceBuilder sounds promising. One question: does the materialization macro also go through MacroNamespaceBuilder somehow? I feel like we need to somehow pass in the call stack of the macros if we want to check it in that function; not sure what that's going to look like. Scheduled a meeting tomorrow to talk about this live.

@jtcohen6:

Really appreciate you taking a look here @gshank! If we're going to do this, it ought to be a surgical change in the right spot.

The high-level goal here is to restrict adapter.submit_python_job to only be callable within a statement of a materialization. We want to prevent users from firing up Python jobs willy-nilly (in post-hooks, in tests, wherever). We may want to support those as patterns in the future. So we're trying to put guardrails in place for now, with the possibility of removing them later, without adding a ton of cruft to the codebase in the process.

@gshank commented Sep 16, 2022:

I wonder what the call stack looks like in adapter.submit_python_job. Could we recognize that it comes from a materialization?

Or if that doesn't work, maybe we could make a special context with a 'submit_python_job' method in two flavors, the error-raising function and the 'call adapter.submit_python_job' version, and limit the context with the working 'submit_python_job' to materialization macros. Then we could check the call stack in adapter.submit_python_job and make sure it's only coming from that method.
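
As a rough sketch of that call-stack idea (the frame strings below mirror the unique_id convention seen earlier; the helper itself is hypothetical):

    def called_from_materialization_statement(call_stack: list) -> bool:
        # legal call site: the dbt-internal statement macro executing
        # underneath a materialization frame
        return "macro.dbt.statement" in call_stack and any(
            "materialization" in frame for frame in call_stack
        )

    # e.g. a stack recorded while a table materialization runs its main statement:
    assert called_from_materialization_statement(
        ["macro.dbt.materialization_table_default", "macro.dbt.statement"]
    )
    assert not called_from_materialization_statement(["macro.dbt.run_query"])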

@gshank commented Sep 16, 2022:

So maybe that second suggestion wouldn't work. It looks like the materialization macro is the wrapping thing that's called to execute models, so it's passed the model_context used by everything else. It wouldn't be very clean to have two contexts at that point...

@ChenyuLInx ChenyuLInx requested review from a team as code owners September 19, 2022 23:42
-            # only mark depth=0 as a dependency
-            if depth == 0:
+            # only mark depth=0 as a dependency, when creating this dependency we don't pass in stack
+            if depth == 0 and self.node:
@ChenyuLInx (author):

@gshank the depth is being kept as 0 for the reason given in the comment I updated.

@gshank left a comment:

This looks so much cleaner!

@ChenyuLInx ChenyuLInx merged commit 646a0c7 into main Sep 20, 2022
@ChenyuLInx ChenyuLInx deleted the enhancement/block_submit_python branch September 20, 2022 16:39
@jtcohen6 left a comment:

I agree - looks much cleaner!

            raise RuntimeException(
                f"submit_python_job is not intended to be called here, at model {parsed_model['alias']}, with macro call_stack {self.context_macro_stack.call_stack}."
            )
        return self.adapter.submit_python_job(parsed_model, compiled_code)
Contributor:

QQ: Will adapter.submit_python_job still be callable (as a classmethod) from within the Jinja context? Would we need to decorate it as "unavailable"?

@ChenyuLInx (author):

It is no longer available from within the Jinja context. We make functions available using the @available decorator, and I removed that decorator from the submit_python_job function. I also have a snowflake PR opened for this (given that snowflake actually submits things slightly differently). I'm also going to sync with the dbt-databricks maintainer to make sure they don't add that @available decorator anymore.
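
For context, a minimal sketch of the @available pattern being described (the decorator import path reflects dbt-core at the time; the class is a toy, not the real BaseAdapter interface):

    from dbt.adapters.base.meta import available

    class ToyAdapter:
        @available
        def quote(self, identifier: str) -> str:
            # decorated: exposed to Jinja as adapter.quote(...)
            return f'"{identifier}"'

        def submit_python_job(self, parsed_model, compiled_code: str):
            # no @available decorator: not reachable from user Jinja as
            # adapter.submit_python_job(...)
            raise NotImplementedError("submitted via materialization internals only")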

Successfully merging this pull request may close: [CT-972] [Feature] Block usage of submit_python_job outside of materialization logic