Implement AIP-60 Dataset URI formats #37005
Conversation
Force-pushed from 0b39388 to e70f323
Force-pushed from e70f323 to 3fb9eca
Finally got this to a point I’m somewhat satisfied with. The logic should mostly be there. I’m not exactly sure about the docs yet; please suggest where things should be added.
I read through the bare code twice, and I like it!
I am not an expert in structural changes, but since the provider interface is extended, I assume this does not break provider releases as long as core does not ship the extension before Airflow 2.9.0?
Kindly add a newsfragment entry to the release notes for this new feature, since URIs are now validated. It might hit a lot of users when upgrading, as existing datasets can become invalid after the upgrade. Users need to fix their DAGs before upgrading!
The providers will be fine even if we release this with 2.9: since URIs without a registered scheme are silently treated as plain strings, they will work as they did in 2.8 and earlier, aside from some new additional validations (e.g. no auth in the URI, but you shouldn’t do that in the first place anyway). Good point about the fragment, I’ll add one.
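For context, here is a rough sketch of the fallback behaviour described above. This is not the actual Airflow implementation; the `_normalizers` registry and the function body are illustrative only:

```
from urllib.parse import urlsplit

# Hypothetical registry; in the real design each provider contributes a
# normalizer for its own scheme (e.g. the postgres provider for postgres://).
_normalizers = {}

def sanitize_uri(uri):
    parsed = urlsplit(uri)
    # Embedded credentials are rejected regardless of scheme.
    if parsed.username or parsed.password:
        raise ValueError("A dataset URI must not contain auth info")
    normalizer = _normalizers.get(parsed.scheme)
    if normalizer is None:
        # Unregistered scheme: treat the URI as a plain string, matching
        # the behaviour of Airflow 2.8 and earlier.
        return uri
    return normalizer(parsed).geturl()
```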
I expanded the documentation a bit, and added a news fragment as suggested above.
We have not implemented it yet.
…rflow 2.9 (#39670)

Closes: #39486

# Context

Valid DAGs that worked in Airflow 2.8.x and had tasks with outlets with specific URIs, such as `Dataset("postgres://postgres:5432/postgres.dbt.stg_customers")`, stopped working in Airflow 2.9.0 and 2.9.1 after #37005 was merged. This was a breaking change in an Airflow minor version, which we should avoid. Airflow < 3.0 should raise a warning, and from Airflow 3.0 we can raise errors by default. We can have a feature flag so that users who want this behaviour in advance can enable errors in Airflow 2.x, but it should not be the default. DAGs should continue working on Airflow 2.x minor/micro releases without errors (unless the user opts in via configuration).

# How to reproduce

Run the following DAG with `apache-airflow==2.9.1` and `apache-airflow-providers-postgres==5.11.0`, as an example:

```
from datetime import datetime

from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.empty import EmptyOperator

with DAG(dag_id='empty_operator_example', start_date=datetime(2022, 1, 1), schedule_interval=None) as dag:
    task1 = EmptyOperator(
        task_id='empty_task1',
        dag=dag,
        outlets=[Dataset("postgres://postgres:5432/postgres.dbt.stg_customers")]
    )
    task2 = EmptyOperator(
        task_id='empty_task2',
        dag=dag
    )
    task1 >> task2
```

This causes the exception:

```
Broken DAG: [/usr/local/airflow/dags/example_issue.py] Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/airflow/datasets/__init__.py", line 81, in _sanitize_uri
    parsed = normalizer(parsed)
             ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/airflow/providers/postgres/datasets/postgres.py", line 34, in sanitize_uri
    raise ValueError("URI format postgres:// must contain database, schema, and table names")
ValueError: URI format postgres:// must contain database, schema, and table names
```

# About the changes introduced

This PR introduces the following:

1. A boolean configuration in `[core]` named `strict_dataset_uri_validation`, which is `False` by default.
2. When this configuration is `False`, Airflow raises a warning saying:

   ```
   From Airflow 3, Airflow will be more strict with Dataset URIs, and the URI xx will no longer be valid. Please, follow the expected standard as documented in XX.
   ```

3. When this configuration is `True`, Airflow raises the exception, as it does now in Airflow 2.9.0 and 2.9.1.
4. From Airflow 3.0, we change this configuration to be `True` by default.
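As a rough illustration of points 1-3 above, the warn-or-raise switch could look something like this. Only the `[core] strict_dataset_uri_validation` option name comes from the PR description; `_validate` and its surroundings are a hypothetical sketch, not the merged code:

```
import warnings

from airflow.configuration import conf

def _validate(uri, normalizer, parsed):
    try:
        return normalizer(parsed)
    except ValueError as err:
        if conf.getboolean("core", "strict_dataset_uri_validation", fallback=False):
            # Opt-in strict mode keeps the 2.9.0/2.9.1 behaviour and fails.
            raise
        # Default: downgrade to a warning so existing DAGs keep parsing
        # on Airflow 2.x.
        warnings.warn(
            f"The dataset URI {uri} is not AIP-60 compliant: {err}. "
            "From Airflow 3, this validation will raise an exception.",
            UserWarning,
            stacklevel=3,
        )
        return parsed
```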
I’m struggling with documentation, and have yet to implement the actual sanitisation logic. It also seems like a lot of the URIs (especially file-like ones) don’t actually need any validation at all? But this implements a basic structure.
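To make the “basic structure” concrete, here is a minimal sketch of a scheme-to-normalizer mapping, assuming file-like schemes get a no-op normalizer. All names here are hypothetical except the postgres error message, which is quoted from the traceback above:

```
from urllib.parse import SplitResult

def _noop(parsed: SplitResult) -> SplitResult:
    # File-like URIs (file://, s3://, ...) carry an opaque path, so no
    # structural validation is needed beyond the generic checks.
    return parsed

def sanitize_postgres(parsed: SplitResult) -> SplitResult:
    # Expects postgres://host:port/database/schema/table; splitting the
    # path yields a leading empty string plus the three name segments.
    if len(parsed.path.split("/")) != 4:
        raise ValueError(
            "URI format postgres:// must contain database, schema, and table names"
        )
    return parsed

# Hypothetical mapping; the eventual design has each provider register
# its own normalizer rather than core hard-coding a table like this.
_normalizers = {
    "file": _noop,
    "s3": _noop,
    "postgres": sanitize_postgres,
}
```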