Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: get task dependencies without serializing task tree to string #41494

Merged
merged 1 commit into from
Aug 15, 2024

Conversation

mobuchowski
Copy link
Contributor

Currently _get_parsed_dag_tree uses get_tree_view which in degenerated case (like, in test test_get_dag_tree_large_dag can generate string tree representation taking multiple gigabytes of memory.

However, for what it's trying to accomplish, pretty much any temporary allocations are not necessary. This fix constructs task dependency tree without intermediate representation.

Previous memory consumption:

root@a24bae3584cb:/opt/airflow# pytest --memray tests/providers/openlineage/utils/test_utils.py::test_get_dag_tree_large_dag
=========================================================================================================================================================================== test session starts ============================================================================================================================================================================
platform linux -- Python 3.12.5, pytest-8.3.2, pluggy-1.5.0 -- /usr/local/bin/python
cachedir: .pytest_cache
rootdir: /opt/airflow
configfile: pyproject.toml
plugins: memray-1.7.0, timeouts-1.2.1, icdiff-0.9, mock-3.14.0, rerunfailures-14.0, requests-mock-1.12.1, xdist-3.6.1, asyncio-0.23.8, anyio-4.4.0, instafail-0.5.0, cov-5.0.0, time-machine-2.15.0, custom-exit-code-0.3.0
asyncio: mode=Mode.STRICT
setup timeout: 0.0s, execution timeout: 0.0s, teardown timeout: 0.0s
collected 1 item

tests/providers/openlineage/utils/test_utils.py::test_get_dag_tree_large_dag PASSED                                                                                                                                                                                                                                                                                  [100%]


============================================================================================================================================================================== MEMRAY REPORT ===============================================================================================================================================================================
Allocation results for tests/providers/openlineage/utils/test_utils.py::test_get_dag_tree_large_dag at the high watermark

	 📦 Total memory allocated: 5.4GiB
	 📏 Total allocations: 23
	 📊 Histogram of allocation sizes: |▁▁█  |
	 🥇 Biggest allocating functions:
		- _safe_get_dag_tree_view:/opt/airflow/airflow/providers/openlineage/utils/utils.py:446 -> 2.7GiB
		- get_tree_view:/opt/airflow/airflow/models/dag.py:2445 -> 2.7GiB
		- __setattr__:/opt/airflow/airflow/models/baseoperator.py:1191 -> 1.3MiB
		- __setattr__:/opt/airflow/airflow/models/baseoperator.py:1191 -> 1.3MiB
		- __setattr__:/opt/airflow/airflow/models/baseoperator.py:1191 -> 1.3MiB


=================================================================================================================================================================== Warning summary. Total: 3, Unique: 3 ===================================================================================================================================================================
airflow: total 1, unique 1
  collect: total 1, unique 1
other: total 2, unique 2
  collect: total 2, unique 2
Warnings saved into /opt/airflow/tests/warnings.txt file.
============================================================================================================================================================================ 1 passed in 8.60s =============================================================================================================================================================================

current memory consumption:

root@2788b43cb914:/opt/airflow# pytest --memray tests/providers/openlineage/utils/test_utils.py::test_get_dag_tree_large_dag
=========================================================================================================================================================================== test session starts ============================================================================================================================================================================
platform linux -- Python 3.12.5, pytest-8.3.2, pluggy-1.5.0 -- /usr/local/bin/python
cachedir: .pytest_cache
rootdir: /opt/airflow
configfile: pyproject.toml
plugins: memray-1.7.0, timeouts-1.2.1, icdiff-0.9, mock-3.14.0, rerunfailures-14.0, requests-mock-1.12.1, xdist-3.6.1, asyncio-0.23.8, anyio-4.4.0, instafail-0.5.0, cov-5.0.0, time-machine-2.15.0, custom-exit-code-0.3.0
asyncio: mode=Mode.STRICT
setup timeout: 0.0s, execution timeout: 0.0s, teardown timeout: 0.0s
collected 1 item

tests/providers/openlineage/utils/test_utils.py::test_get_dag_tree_large_dag PASSED                                                                                                                                                                                                                                                                                  [100%]


============================================================================================================================================================================== MEMRAY REPORT ===============================================================================================================================================================================
Allocation results for tests/providers/openlineage/utils/test_utils.py::test_get_dag_tree_large_dag at the high watermark

	 📦 Total memory allocated: 16.0MiB
	 📏 Total allocations: 30
	 📊 Histogram of allocation sizes: |▁█ ▃▇|
	 🥇 Biggest allocating functions:
		- __setattr__:/opt/airflow/airflow/models/baseoperator.py:1191 -> 1.3MiB
		- __setattr__:/opt/airflow/airflow/models/baseoperator.py:1191 -> 1.3MiB
		- __setattr__:/opt/airflow/airflow/models/baseoperator.py:1191 -> 1.3MiB
		- __setattr__:/opt/airflow/airflow/models/baseoperator.py:1191 -> 1.3MiB
		- __setattr__:/opt/airflow/airflow/models/baseoperator.py:1191 -> 1.3MiB


=================================================================================================================================================================== Warning summary. Total: 3, Unique: 3 ===================================================================================================================================================================
airflow: total 1, unique 1
  collect: total 1, unique 1
other: total 2, unique 2
  collect: total 2, unique 2
Warnings saved into /opt/airflow/tests/warnings.txt file.
============================================================================================================================================================================ 1 passed in 2.49s =============================================================================================================================================================================

Copy link
Contributor

@kacpermuda kacpermuda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks so much better than my previous implementation 🤦 Thanks @mobuchowski 🚀 !

@mobuchowski mobuchowski merged commit 0e7c757 into apache:main Aug 15, 2024
55 checks passed
@jlaneve
Copy link
Contributor

jlaneve commented Aug 15, 2024

Nice! Do you know if there's an issue open to track the underlying function? This fixed it for OL but not for other consumers of that function

@mobuchowski
Copy link
Contributor Author

@jlaneve just created it. #41505

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants