Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support DAGS folder being in different location on scheduler and runners #16860

Merged

Conversation

ashb
Copy link
Member

@ashb ashb commented Jul 7, 2021

There has been some vestigial support for this concept in Airflow for a
while (all the CLI command already turn the literal DAGS_FOLDER in to
the real value of the DAGS folder when loading dags), but sometime
around 1.10.1-1.10.3 it got fully broken and the scheduler only ever
passed full paths to DAG files.

This PR brings back this behaviour

Closes #8061


^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.

@boring-cyborg boring-cyborg bot added area:Scheduler including HA (high availability) scheduler area:serialization area:UI Related to UI/UX. For Frontend Developers. area:webserver Webserver related Issues labels Jul 7, 2021
@ashb ashb changed the title Support DAGS folder being in different location on scheduler and runners 🚧 Support DAGS folder being in different location on scheduler and runners Jul 7, 2021
@ashb ashb mentioned this pull request Jul 7, 2021
Comment on lines -505 to -508
mark_success=False,
ignore_all_deps=False,
ignore_depends_on_past=False,
ignore_task_deps=False,
ignore_ti_state=False,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are the defaults, we don't need to pass them again

@@ -2379,6 +2413,8 @@ class DagModel(Base):
Index('idx_next_dagrun_create_after', next_dagrun_create_after, unique=False),
)

parent_dag = relationship("DagModel", remote_side=[dag_id])
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This relationship was added for "Duck-type compatibility" between a DAG and a DagModel in the case of subdags -- both now have a "parent_dag" attribute.

airflow/models/dag.py Outdated Show resolved Hide resolved

@property
def relative_fileloc(self) -> str:
"""File location of the importable dag 'file' relative to the configured DAGs folder."""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How will this field behave for example DAGs?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case it would be /usr/lib/python3.7/.../airflow/example_dags/example_bash_operator.py -- so not relative at all.

I'll update/expand the docstring to say how it deals with DAGs outside of the dags folder.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we are also able to handle examples DAGs by adding a suffix? This will allow us to install workers on shared machines without any problems
For samples:
[exaample_dags_folder]/example_bash_operator.py => /usr/lib/python3.7/.../airflow/example_dags/example_bash_operator.py
[dags_folder]/example_bash_operator.py => ~/home/airflow/dags

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this change would be better done along with something like https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-20+DAG+manifest, so I've not done this for now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we shouldn’t call this relative fileloc, but just fileloc…? Not sure about this.

@ashb ashb added this to the Airflow 2.2 milestone Jul 27, 2021
@ashb ashb force-pushed the dags-folder-in-different-location-workers branch from a72eb83 to 6a7d7ab Compare July 27, 2021 11:08
@ashb ashb marked this pull request as ready for review July 27, 2021 11:09
@ashb ashb changed the title 🚧 Support DAGS folder being in different location on scheduler and runners Support DAGS folder being in different location on scheduler and runners Jul 27, 2021
@ashb ashb requested a review from uranusjr July 27, 2021 11:09
airflow/models/dag.py Outdated Show resolved Hide resolved
@github-actions github-actions bot added the full tests needed We need to run full set of tests for this PR to merge label Jul 27, 2021
@github-actions
Copy link

The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.

airflow/models/dag.py Outdated Show resolved Hide resolved
airflow/models/dagbag.py Outdated Show resolved Hide resolved
@ashb ashb requested a review from uranusjr July 28, 2021 14:46
'template_searchpath',
'last_loaded',
}

__serialized_fields: Optional[FrozenSet[str]] = None

fileloc: str
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reminds me I was wondering whether fileloc is guaranteed to be absolute or not and had to trace a lot of code. Maybe it’s worthwhile to add this to the docstring (and the one on DagModel).

Copy link
Member

@uranusjr uranusjr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks good to me in general

Copy link
Member

@potiuk potiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@potiuk
Copy link
Member

potiuk commented Aug 1, 2021

@ashb - can you rebase please ? I think most of the issues with CI stability are solved

@ashb ashb force-pushed the dags-folder-in-different-location-workers branch from be08c89 to af12ddd Compare August 3, 2021 11:08
There has been some vestigial support for this concept in Airflow for a
while (all the CLI command already turn the literal `DAGS_FOLDER` in to
the real value of the DAGS folder when loading dags), but sometime
around 1.10.1-1.10.3 it got fully broken and the scheduler only ever
passed full paths to DAG files.

This PR brings back this behaviour
@ashb ashb force-pushed the dags-folder-in-different-location-workers branch from ca640d3 to 0da2b8c Compare August 4, 2021 09:10
@ashb
Copy link
Member Author

ashb commented Aug 4, 2021

Looks like transient MSQQL errors :(

  ERROR: for airflow  Container "69eb70e8e951" is unhealthy.
  Encountered errors while bringing up the project.

@ashb ashb merged commit 9bed58e into apache:main Aug 4, 2021
@noelmcloughlin
Copy link
Contributor

Linking #16025 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:Scheduler including HA (high availability) scheduler area:serialization area:UI Related to UI/UX. For Frontend Developers. area:webserver Webserver related Issues full tests needed We need to run full set of tests for this PR to merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support DAGS folder being in different location on scheduler and worker.
6 participants