Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add "Optimizing" chapter to dynamic-dags section #25121

Closed

Conversation

potiuk
Copy link
Member

@potiuk potiuk commented Jul 18, 2022

There is a nice way to optimize dynamic DAG generation by our
users. Adding a short chapter linking to example on how this can
be done might guide them to do similar approach.

We handle several cases here:

  • starting task via Python interpreter
  • starting task via forking
  • running "airflow tasks test" command

The detection here is rather complex and in the follow up PR
we will add a more robust detection mechanims (but it will be
only available as of Airflow 2.4)


^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

Copy link
Member

@feluelle feluelle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is genius! 🚀

@potiuk
Copy link
Member Author

potiuk commented Jul 18, 2022

I will wait for others to improve the English wording probably :)

@pierrejeambrun
Copy link
Member

Very interesting, I think this will help a lot of people 💪

@potiuk potiuk force-pushed the add-optimization-of-dag-parsing-chapter branch 2 times, most recently from 76087c6 to 13a85f0 Compare July 18, 2022 14:20
@potiuk
Copy link
Member Author

potiuk commented Jul 18, 2022

Hey @BasPH, @eladkal I took your comments in and rephrased it a bit. I removed "worker" and "k8s" references and replaced it with more generic "task execution".

I also added a few improvements to make the description and tests more "robust":

  • I made sure to reflect some of the recent changes (it's either "scheduler" or "dag-processor")

  • I added an extra check to detect if second argument is "tasks". The problem is that in more general case both scheduler and dag processor can be executed with extra args that might make the "scheduler" or "dag-processor" command to have more than 3 args. Checking for > 3 and whether second arg is "tasks" is much more robust. Also it has the nice side effect that this optimisation will also work for "airflow tasks test" command - which I think users of such dynamic dags would badly, badly need.

@potiuk potiuk force-pushed the add-optimization-of-dag-parsing-chapter branch 2 times, most recently from ab8cccd to 9eae1bf Compare July 18, 2022 14:27
@potiuk potiuk marked this pull request as draft July 18, 2022 15:53
@potiuk
Copy link
Member Author

potiuk commented Jul 18, 2022

ocnverted to draft to iterate a bit more on it

@potiuk
Copy link
Member Author

potiuk commented Jul 18, 2022

Yep. This is what you get when you run "getproctitle()" in your task:

airflow task supervisor: ['airflow', 'tasks', 'run', 'example_bash_operator2', 'run_after_loop', 
'manual__2022-07-18T20:57:08.555358+00:00', '--local', '--subdir', 'DAGS_FOLDER/example_bash2.py']

Maybe not perfect but it might do for now until we add some more robust way.

@potiuk potiuk marked this pull request as ready for review July 18, 2022 21:28
@potiuk potiuk force-pushed the add-optimization-of-dag-parsing-chapter branch from 9eae1bf to 4305d5c Compare July 18, 2022 21:28
@potiuk
Copy link
Member Author

potiuk commented Jul 18, 2022

Right - @BasPH @eladkal this version reads dag id from either "args" or setproctitle. Worst case it will not find the dag id in which case it will simply parse all dags.

@potiuk
Copy link
Member Author

potiuk commented Jul 18, 2022

I think this will do for now, but in the futture we might add more robust way

@potiuk potiuk force-pushed the add-optimization-of-dag-parsing-chapter branch 3 times, most recently from 1e5251e to 55d766c Compare July 18, 2022 21:41
Copy link
Member

@jedcunningham jedcunningham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little torn on documenting this in the official docs. Almost feels like if we are going to officially suggest it, we should have a better DAG authoring experience for it.

docs/apache-airflow/howto/dynamic-dag-generation.rst Outdated Show resolved Hide resolved
docs/apache-airflow/howto/dynamic-dag-generation.rst Outdated Show resolved Hide resolved
docs/apache-airflow/howto/dynamic-dag-generation.rst Outdated Show resolved Hide resolved
Copy link
Contributor

@BasPH BasPH left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM, nice that the example works for both new and forked processes. Made some smaller comments.

docs/apache-airflow/howto/dynamic-dag-generation.rst Outdated Show resolved Hide resolved
docs/apache-airflow/howto/dynamic-dag-generation.rst Outdated Show resolved Hide resolved
docs/apache-airflow/howto/dynamic-dag-generation.rst Outdated Show resolved Hide resolved
docs/apache-airflow/howto/dynamic-dag-generation.rst Outdated Show resolved Hide resolved
docs/apache-airflow/howto/dynamic-dag-generation.rst Outdated Show resolved Hide resolved
docs/apache-airflow/howto/dynamic-dag-generation.rst Outdated Show resolved Hide resolved
docs/apache-airflow/howto/dynamic-dag-generation.rst Outdated Show resolved Hide resolved
docs/apache-airflow/howto/dynamic-dag-generation.rst Outdated Show resolved Hide resolved
docs/apache-airflow/howto/dynamic-dag-generation.rst Outdated Show resolved Hide resolved
@potiuk potiuk force-pushed the add-optimization-of-dag-parsing-chapter branch from 55d766c to 968b998 Compare July 19, 2022 08:10
@potiuk
Copy link
Member Author

potiuk commented Jul 19, 2022

I addressed all comments I think -> Would love one more pass @BasPH @jedcunningham

Copy link
Contributor

@BasPH BasPH left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few minor comments, otherwise LGTM!

docs/apache-airflow/howto/dynamic-dag-generation.rst Outdated Show resolved Hide resolved
docs/apache-airflow/howto/dynamic-dag-generation.rst Outdated Show resolved Hide resolved
@potiuk potiuk force-pushed the add-optimization-of-dag-parsing-chapter branch 2 times, most recently from ba3ef1b to 9a34ded Compare July 19, 2022 13:42
@potiuk potiuk force-pushed the add-optimization-of-dag-parsing-chapter branch from 9a34ded to 6e0b4da Compare July 19, 2022 13:46
@potiuk
Copy link
Member Author

potiuk commented Jul 19, 2022

Actually I found (while implementing proper context) that another variang of proctitle is possible :) 🤯

@potiuk potiuk force-pushed the add-optimization-of-dag-parsing-chapter branch from 6e0b4da to 060c33e Compare July 19, 2022 13:48
There is a nice way to optimize dynamic DAG generation by our
users. Adding a short chapter linking to example on how this can
be done might guide them to do similar approach.

We handle several cases here:

* starting task via Python interpreter
* starting task via forking
* running "airflow tasks test" command

The detection here is rather complex and in the follow up PR
we will add a more robust detection mechanims (but it will be
only available as of Airflow 2.4)
@potiuk potiuk force-pushed the add-optimization-of-dag-parsing-chapter branch from 240feeb to 4db24d8 Compare July 21, 2022 08:18
@potiuk
Copy link
Member Author

potiuk commented Jul 21, 2022

I know @jedcunningham you had reservation about putting that out now. I thought a bit about it and I added one more chapter about "experimental" status of the practice - with lots of reservations and disclaimers.

If you think this is fine now @jedcunningham @BasPH - I will look at the robust solution in #25161 (and then we can think if we want to add any kind of future approach.

I think there is very little harm in adding the documentation about it even now with the "complex" code - as this will help some users and they will at least know that there is a possibility of such optimization.

The problem is real and serious - the more dynamic dags people will use, the more similiar solution will be needed, so I prefer even to get some issues from people who tried and fail, so that we can possibly see if we can optimize it even further or maybe even add some other solutions in the future.

@jedcunningham
Copy link
Member

The new paragraph looks good, and it does give us something to point at to say "we told you it was experimental", I'm not sure it changes much.

I'm not exactly sure how to articulate my thoughts, but I'll try.

This feels to me like we are helping hide the fact people didn't architect their DAGs properly. If you've hit the point where this is necessary, then you've already strayed out of the norm. Putting out a complex chuck of code that'll be copy pasted around, getting out of date, what have you is just adding an extra burden to maintenance and support. I can totally imagine down the road an upgrade breaking an early version of this and us getting hard to diagnose bug reports because of it. Worse, if the users DAGs are big/complex enough to need this, they likely can't/won't share it as a reproduction so that adds an extra burden.

There are other ways to DRY up a DAG than to have them all come from a single loop 🤷‍♂️.

I have no problem with folks opting in to this, knowing (hopefully) exactly the risks that come with it. However once we toss it in the official docs (even with a disclaimer paragraph), it becomes "officially blessed" to an extent, and I'm not sure we should do that for this iteration.

tldr: very clever solution, but too esoteric/brittle to be officially documented for end users imo.

@potiuk
Copy link
Member Author

potiuk commented Jul 21, 2022

The new paragraph looks good, and it does give us something to point at to say "we told you it was experimental", I'm not sure it changes much.

I'm not exactly sure how to articulate my thoughts, but I'll try.

I think you managed very well :).

This feels to me like we are helping hide the fact people didn't architect their DAGs properly. If you've hit the point where this is necessary, then you've already strayed out of the norm. Putting out a complex chuck of code that'll be copy pasted around, getting out of date, what have you is just adding an extra burden to maintenance and support. I can totally imagine down the road an upgrade breaking an early version of this and us getting hard to diagnose bug reports because of it. Worse, if the users DAGs are big/complex enough to need this, they likely can't/won't share it as a reproduction so that adds an extra burden.

Yeah I know exactly what you mean. I have reservations myself, looking brittleness of the solution. I actually think I will do it differently. I will - myself - write a short blog post acompanying the one from @itayB where I add a more complete solution and I also publish it in Airflow Publication.

There are other ways to DRY up a DAG than to have them all come from a single loop man_shrugging.

Should we maybe then mention THIS instead? Maybe that is a better approach that we warn the users not to do it?

tldr: very clever solution, but too esoteric/brittle to be officially documented for end users imo.

Question: woudl you also object if just get the "official" approach from #25161 ? Or is it mostly the brittleness ?

@potiuk potiuk marked this pull request as draft July 21, 2022 15:59
@jedcunningham
Copy link
Member

There are other ways to DRY up a DAG than to have them all come from a single loop man_shrugging.

Should we maybe then mention THIS instead? Maybe that is a better approach that we warn the users not to do it?

I like this idea. We should do it, even if/when we do add a better short circuit for the loop approach.

Question: woudl you also object if just get the "official" approach from #25161 ? Or is it mostly the brittleness ?

get_parsing_context seems much better and I wouldn't object to documenting that. It's tested and can be a stable interface for DAG authors. (Though I haven't looked closely at that PR and likely won't be able to until later next week)

I will - myself - write a short blog post acompanying the one from @itayB where I add a more complete solution and I also publish it in Airflow Publication.

👍

@potiuk
Copy link
Member Author

potiuk commented Jul 22, 2022

Closin this one then - blog post is being reviewed :)

@potiuk potiuk closed this Jul 22, 2022
@potiuk potiuk deleted the add-optimization-of-dag-parsing-chapter branch July 29, 2022 20:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants