[AIRFLOW-2524]Add SageMaker Batch Inference #3767
Add SageMaker Transform Operator & Sensor Co-authored-by: srrajeev-aws <[email protected]>
Codecov Report
@@ Coverage Diff @@
## master #3767 +/- ##
=======================================
Coverage 77.49% 77.49%
=======================================
Files 200 200
Lines 15889 15889
=======================================
Hits 12313 12313
Misses 3576 3576
@@ -86,6 +86,16 @@ def check_valid_tuning_input(self, tuning_config):
            self.check_for_url(channel['DataSource']
                               ['S3DataSource']['S3Uri'])

    def check_valid_transform_input(self, transform_config):
This function feels a bit silly to me, why not check_for_url directly?
@troychen728 fixed it.
@@ -219,6 +229,50 @@ def create_tuning_job(self, tuning_job_config):
        return self.conn.create_hyper_parameter_tuning_job(
            **tuning_job_config)

    def create_transform_job(self, transform_job_config, wait_for_completion=True):
        """
        Create a tuning job
This is not really in line with the name of the function, right? create_transform_job vs Create a tuning job
@troychen728 fixed it.
        :type transform_job_config: dict
        :param wait_for_completion:
            if the program should keep running until job finishes
        :param wait_for_completion: bool
Should be :type
@troychen728 has fixed it.
class SageMakerCreateTransformJobOperator(BaseOperator):

Trim the \n please :-)
@troychen728 has fixed it.
    Initiate a SageMaker transform

    This operator returns The ARN of the model created in Amazon SageMaker

Can we order the docstring in the same order as the arguments?
@troychen728 has fixed it.
👍
                 sagemaker_conn_id=None,
                 use_db_config=False,
                 wait_for_completion=False,
                 check_interval=2,
Can you add a time unit here? For example, _seconds
Do you suggest changing the name, or adding documentation?
I updated the docstring to say that the unit for check_interval is seconds. Please let me know if you have a strong preference that this should be in the variable name.
                 region_name=None,
                 sagemaker_conn_id=None,
                 use_db_config=False,
                 wait_for_completion=False,
Can you change this one to True by default? This is how the other operators behave as well.
@troychen728 has fixed it.

    :param job_name: job_name of the transform job instance to check the state of
    :type job_name: string
    """
Missing region_name here
@troychen728 has fixed it.
        self.region_name = region_name

    def non_terminal_states(self):
        return ['InProgress', 'Stopping', 'Stopped']
Can we consolidate these settings somewhere? I see them repeated quite a lot. Also please change it to a set.
I can write them into the base sensor because at least for now, all the terminal_state, failed_state are the same. However, I just think it is a little bit risky, because later if others are writing sagemaker sensors, a not implemented error will not be thrown, and the API return values are subject to change. Please let me know what you think, and I'll change accordingly.
Yes, but most likely if the API changes, you need to change it in two places; if you forget one of the two, you have a problem. This is of course a design question. Could you at least change them to a set, {'InProgress', 'Stopping', 'Stopped'}, and make them static?
I added two static variables under class SageMakerHook:
non_terminal_states = {'InProgress', 'Stopping', 'Stopped'}
failed_states = {'Failed'}
All code in the hook and sensors will just use these two static variables. If in the future we have an API change for all of training/tuning/inference, we only need to change this part. But if the API changes for only some of them, we would still need to go back to the previous design.
For now I would say the API should be stable in this part, so I used static variables.
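For readers following the thread, the design described above might look roughly like the sketch below. The class names `SageMakerHookSketch` and `TransformSensorSketch` and the `classify` helper are hypothetical stand-ins, not the code merged in this PR; only the two set literals come from the comment itself, and 'Completed' is used purely as an example of a status outside both sets.

```python
class SageMakerHookSketch:
    """Hypothetical stand-in for SageMakerHook; only the shared state sets are shown."""
    # Class-level attributes: defined once, shared by every hook and sensor.
    non_terminal_states = {'InProgress', 'Stopping', 'Stopped'}
    failed_states = {'Failed'}


class TransformSensorSketch:
    """Hypothetical sensor that reuses the hook's class attributes."""

    def non_terminal_states(self):
        # Delegate to the hook so the values live in a single place.
        return SageMakerHookSketch.non_terminal_states

    def failed_states(self):
        return SageMakerHookSketch.failed_states

    def classify(self, status):
        # Illustrative helper (not in the PR): map a status string to a coarse outcome.
        if status in self.failed_states():
            return 'failed'
        if status in self.non_terminal_states():
            return 'running'
        return 'done'


sensor = TransformSensorSketch()
print(sensor.classify('InProgress'))
print(sensor.classify('Failed'))
```

If a later API change affects only one job type, a sensor can still override `non_terminal_states` locally, which is the fallback mentioned in the comment.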
It looks like the tests are failing:
Hey folks! How about fixing and merging this amazing PR? In order to fix the failure:
just need to replace
@troychen728 can you shed some light on the error @schipiga is encountering?
Some minor comments.
        self.job_name = job_name
        self.region_name = region_name

    def non_terminal_states(self):
Can we make this one static?
I'm not sure why I should, or what difference it makes if I make this static. Can you shed a little light on this? Thanks. Also, I am sorry, but I might not be able to respond to the comments very quickly, because I am really busy with school stuff these few days.
Since the function does not use anything from the class itself, it is prettier to make it static since it is then easier to reuse in other classes/functions.
Made non_terminal_states and failed_states static under SageMakerHook.
Much better, thanks
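To illustrate the reviewer's point about static methods, here is a minimal sketch. `SensorSketch` and `is_still_running` are hypothetical names, and this shows the `@staticmethod` variant that was asked for rather than the class attributes that were finally used. Because the methods never touch `self`, other code can call them without creating a sensor.

```python
class SensorSketch:
    """Hypothetical sensor; the state helpers do not depend on instance data."""

    @staticmethod
    def non_terminal_states():
        return {'InProgress', 'Stopping', 'Stopped'}

    @staticmethod
    def failed_states():
        return {'Failed'}


def is_still_running(status):
    # Reuse from outside the class: no SensorSketch instance is needed.
    return status in SensorSketch.non_terminal_states()


print(is_still_running('Stopping'))
print(is_still_running('Failed'))
```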
    def non_terminal_states(self):
        return {'InProgress', 'Stopping', 'Stopped'}

    def failed_states(self):
Can we make this one static as well?
Reused static variables from SageMakerHook
        response = self.conn.create_transform_job(
            **transform_job_config)
        if wait_for_completion:
            self.check_status(['InProgress', 'Stopping', 'Stopped'],
def check_status(self, non_terminal_states,
                 failed_state, key,
                 describe_function, *args):
    """
    :param non_terminal_states: the set of non_terminal states
    :type non_terminal_states: dict
    :param failed_state: the set of failed states
    :type failed_state: dict
    :param key: the key of the response dict
        that points to the state
    :type key: string
    :param describe_function: the function used to retrieve the status
    :type describe_function: python callable
    :param args: the arguments for the function
    :return: None
    """
The non_terminal_states and failed_state should be dicts according to the docs. Can we make these sets as well? {'InProgress', 'Stopping', 'Stopped'}
Made them into set instead of dict
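As a rough illustration of how a set-based `check_status` could poll a job, here is a standalone sketch. It is not the hook's actual implementation: the `check_interval_seconds` parameter, the returned status, the `RuntimeError`, and the `make_fake_describe` helper are all assumptions added for the example, and the fake describe function stands in for a real SageMaker API call.

```python
import time


def check_status(non_terminal_states, failed_states, key,
                 describe_function, check_interval_seconds, *args):
    """Poll describe_function until the job leaves the non-terminal states.

    Sketch only: the interval parameter and return value are assumptions.
    """
    while True:
        response = describe_function(*args)
        status = response[key]
        if status in failed_states:
            raise RuntimeError('Job failed with status: %s' % status)
        if status not in non_terminal_states:
            return status
        time.sleep(check_interval_seconds)


def make_fake_describe(statuses):
    """Hypothetical test double that yields one canned status per call."""
    remaining = iter(statuses)

    def describe(job_name):
        return {'TransformJobStatus': next(remaining)}

    return describe


final = check_status({'InProgress', 'Stopping', 'Stopped'}, {'Failed'},
                     'TransformJobStatus',
                     make_fake_describe(['InProgress', 'Completed']),
                     0, 'my-job')
print(final)
```

Using sets keeps the membership checks cheap and makes it obvious that order and duplicates do not matter.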
Can you rebase onto master as well? :)
Sorry for the late updates. @troychen728 has been unavailable for development recently. I am aiming to update this PR sometime next week.
* Fix for comments
* Fix sensor test
* Update non_terminal_states and failed_states to static variables of SageMakerHook
Add SageMaker Transform Operator & Sensor
Co-authored-by: srrajeev-aws <[email protected]>
Thanks for your review
[AIRFLOW-XXX] Remove residual line in Changelog (#3814) [AIRFLOW-2993] s3_to_sftp and sftp_to_s3 operators (#3828) [AIRFLOW-2709] Improve error handling in Databricks hook (#3570) * Use float for default value * Use status code to determine whether an error is retryable * Fix wrong type in assertion * Fix style to prevent lines from exceeding 90 characters * Fix wrong way of checking exception type [AIRFLOW-2854] kubernetes_pod_operator add more configuration items (#3697) * kubernetes_pod_operator add more configuration items * fix test_kubernetes_pod_operator test_faulty_service_account failure case * fix review comment issues * pod_operator add hostnetwork config * add doc example [AIRFLOW-2994] Fix command status check in Qubole Check operator (#3790) [AIRFLOW-2928] Use uuid4 instead of uuid1 (#3779) for better randomness. [AIRFLOW-2949] Add syntax highlight for single quote strings (#3795) * AIRFLOW-2949: Add syntax highlight for single quote strings * AIRFLOW-2949: Also updated new UI main.css [AIRFLOW-2948] Arg check & better doc - SSHOperator & SFTPOperator (#3793) There may be different combinations of arguments, and some processings are being done 'silently', while users may not be fully aware of them. For example - User only needs to provide either `ssh_hook` or `ssh_conn_id`, while this is not clear in doc - if both provided, `ssh_conn_id` will be ignored. - if `remote_host` is provided, it will replace the `remote_host` which wasndefined in `ssh_hook` or predefined in the connection of `ssh_conn_id` These should be documented clearly to ensure it's transparent to the users. log.info() should also be used to remind users and provide clear logs. In addition, add instance check for ssh_hook to ensure it is of the correct type (SSHHook). Tests are updated for this PR. 
[AIRFLOW-XXX] Fix Broken Link in CONTRIBUTING.md [AIRFLOW-2980] ReadTheDocs - Fix Missing API Reference [AIRFLOW-2984] Convert operator dates to UTC (#3822) Tasks can have start_dates or end_dates separately from the DAG. These need to be converted to UTC otherwise we cannot use them for calculation the next execution date. [AIRFLOW-2779] Make GHE auth third party licensed (#3803) This reinstates the original license. [AIRFLOW-XXX] Add Format to list of companies (#3824) [AIRFLOW-2900] Show code for packaged DAGs (#3749) [AIRFLOW-2983] Add prev_ds_nodash and next_ds_nodash macro (#3821) [AIRFLOW-XXX] Fix Docstrings for Operators (#3820) [AIRFLOW-2951] Update dag_run table end_date when state change (#3798) The existing airflow only change dag_run table end_date value when a user teminate a dag in web UI. The end_date will not be updated if airflow detected a dag finished and updated its state. This commit add end_date update in DagRun's set_state function to make up tho problem mentioned above. [AIRFLOW-2145] fix deadlock on clearing running TI (#3657) a `shutdown` task is not considered be `unfinished`, so a dag run can deadlock when all `unfinished` downstreams are all waiting on a task that's in the `shutdown` state. fix this by considering `shutdown` to be `unfinished`, since it's not truly a terminal state [AIRFLOW-XXX] Fix typo in docstring of gcs_to_bq (#3833) [AIRFLOW-2476] Allow tabulate up to 0.8.2 (#3835) [AIRFLOW-XXX] Fix typos in faq.rst (#3837) [AIRFLOW-2979] Make celery_result_backend conf Backwards compatible (#3832) (#2806) Renamed `celery_result_backend` to `result_backend` and broke backwards compatibility. 
[AIRFLOW-2866] Fix missing CSRF token head when using RBAC UI (#3804) [AIRFLOW-491] Add feature to pass extra api configs to BQ Hook (#3733) [AIRFLOW-208] Add badge to show supported Python versions (#3839) [AIRFLOW-3007] Update backfill example in Scheduler docs The scheduler docs at https://airflow.apache.org/scheduler.html#backfill-and-catchup use deprecated way of passing `schedule_interval`. `schedule_interval` should be pass to DAG as a separate parameter and not as a default arg. [AIRFLOW-3005] Replace 'Airbnb Airflow' with 'Apache Airflow' (#3845) [AIRFLOW-3002] Fix variable & tests in GoogleCloudBucketHelper (#3843) [AIRFLOW-2991] Log path to driver output after Dataproc job (#3827) [AIRFLOW-XXX] Fix python3 and flake8 errors in dev/airflow-jira This is a script that checks if the Jira's marked as fixed in a release are actually merged in - getting this working is helpful to me in preparing 1.10.1 [AIRFLOW-3006] Add note on using None for schedule_interval [AIRFLOW-3003] Pull the krb5 image instead of building (#3844) Pull the image instead of building it, this will speed up the CI process since we don't have to build it every time. [AIRFLOW-2883] Add import and export for pool cli using JSON [AIRFLOW-2847] Remove legacy imports support for plugins (#3692) [AIRFLOW-1998] Implemented DatabricksRunNowOperator for jobs/run-now … (#3813) Add functionality to kick of a Databricks job right away. * Per feedback: fixed a documentation error, reintegrated the execute and on_kill onto the objects. * Fixed a documentation issue. 
[AIRFLOW-3021] Add Censys to who uses Airflow list > Censys > Find and analyze every reachable server and device on the Internet > https://censys.io/ closes AIRFLOW-3021 https://issues.apache.org/jira/browse/AIRFLOW-3021 [AIRFLOW-3018] Fix Minor issues in Documentation Add Branch to Company List [AIRFLOW-3023] Fix docstring datatypes [AIRFLOW-3008] Move Kubernetes example DAGs to contrib [AIRFLOW-2997] Support cluster fields in bigquery (#3838) This adds a cluster_fields argument to the bigquery hook, GCS to bigquery operator and bigquery query operators. This field requests that bigquery store the result of the query/load operation sorted according to the specified fields (the order of fields given is significant). [AIRFLOW-XXX] Redirect FAQ `airflow[crypto]` to How-to Guides. [AIRFLOW-XXX] Remove redundant space in Kerberos (#3866) [AIRFLOW-3028] Update Text & Images in Readme.md [AIRFLOW-1917] Trim extra newline and trailing whitespace from log (#3862) [AIRFLOW-2985] Operators for S3 object copying/deleting (#3823) 1. Copying: Under the hood, it's `boto3.client.copy_object()`. It can only handle the situation in which the S3 connection used can access both source and destination bucket/key. 2. Deleting: 2.1 Under the hood, it's `boto3.client.delete_objects()`. It supports either deleting one single object or multiple objects. 2.2 If users try to delete a non-existent object, the request will still succeed, but there will be an entry 'Errors' in the response. There may also be other reasons which may cause similar 'Errors' ( request itself would succeed without explicit exception). So an argument `silent_on_errors` is added to let users decide if this sort of 'Errors' should fail the operator. The corresponding methods are added into S3Hook, and these two operators are 'wrappers' of these methods. [AIRFLOW-3030] Fix CLI docs (#3872) [AIRFLOW-XXX] Update kubernetes.rst docs (#3875) Update kubernetes.rst with correct KubernetesPodOperator inputs for the volumes. 
[AIRFLOW-XXX] Add Enigma to list of companies [AIRFLOW-2965] CLI tool to show the next execution datetime Cover different cases - schedule_interval is "@once" or None, then following_schedule method would always return None - If dag is paused, print reminder - If latest_execution_date is not found, print warning saying not applicable. [AIRFLOW-XXX] Add Bombora Inc using Airflow [AIRFLOW-2156] Parallelize Celery Executor task state fetching (#3830) [AIRFLOW-XXX] Move Dag level access control out of 1.10 section (#3882) It isn't in 1.10 (and wasn't in this section when the PR was created). [AIRFLOW-3040] Enable ProBot to clean up stale Pull Requests (#3883) [AIRFLOW-3012] Fix Bug when passing emails for SLA [AIRFLOW-2797] Create Google Dataproc cluster with custom image (#3871) [AIRFLOW-XXX] Updated README to include CAVA Addressed comments in PR with appropriate refactoring of s3-sftp operators. Added s3-sftp operator links [AIRFLOW-2993] s3_to_sftp and sftp_to_s3 operators #3828 Rearranged input parameters for sftp_to_s3_operator. [AIRFLOW-2988] Run specifically python2 for dataflow (#3826) Apache beam does not yet support python3, so it's best to run dataflow jobs with python2 specifically until python3 support is complete (BEAM-1251), in case if the user's 'python' in PATH is python3. [AIRFLOW-3035] Allow custom 'job_error_states' in dataproc ops (#3884) Allow caller to pass in custom list of Dataproc job states into the DataProc*Operator classes that should result in the _DataProcJob.raise_error() method raising an Exception. 
[AIRFLOW-3034]: Readme updates : Add Slack & Twitter, remove Gitter [AIRFLOW-3056] Add happn to Airflow user list [AIRFLOW-3052] Add logo options to Airflow (#3892) [AIRFLOW-3060] DAG context manager fails to exit properly in certain circumstances [AIRFLOW-2524] Add SageMaker Batch Inference (#3767) * Fix for comments * Fix sensor test * Update non_terminal_states and failed_states to static variables of SageMakerHook Add SageMaker Transform Operator & Sensor Co-authored-by: srrajeev-aws <[email protected]> [AIRFLOW-2772] Fix Bug in BigQuery hook for Partitioned Table (#3901) [AIRFLOW-XXX] Added Jeitto as one of happy Airflow users! (#3902) [AIRFLOW-XXX] Add Jeitto as one happy Airflow user! [AIRFLOW-3044] Dataflow operators accept templated job_name param (#3887) * Default value of new job_name param is templated task_id, to match the existing behavior as much as possible. * Change expected value in test_mlengine_operator_utils.py to match default for new job_name param. [AIRFLOW-2707] Validate task_log_reader on upgrade from <=1.9 (#3881) We changed the default logging config and config from 1.9 to 1.10, but anyone who upgrades and has an existing airflow.cfg won't know they need to change this value - instead they will get nothing displayed in the UI (ajax request fails) and see "'NoneType' object has no attribute 'read'" in the error log. This validates that config section at start up, and seamlessly upgrades the old previous value. [AIRFLOW-3025] Enable specifying dns and dns_search options for DockerOperator (#3860) Enable specifying dns and dns_search options for DockerOperator [AIRFLOW-1298] Clear UPSTREAM_FAILED using the clean cli (#3886) * [AIRFLOW-1298] Fix 'clear only_failed' * [AIRFLOW-1298] Fix 'clear only_failed' [AIRFLOW-3059] Log how many rows are read from Postgres (#3905) To know how many data is being read from Postgres, it is nice to log this to the Airflow log. Previously when there was no data, it would still create a single file. 
This is not something that we want, and therefore we've changed this behaviour. Refactored the tests to make use of Postgres itself since we have it running. This makes the tests more realistic, instead of mocking everything. [AIRFLOW-XXX] Fix typo in docs/timezone.rst (#3904) [AIRFLOW-3070] Refine web UI authentication-related docs (#3863) [AIRFLOW-3068] Remove deprecated imports [AIRFLOW-3036] Add relevant ECS options to ECS operator. (#3908) The ECS operator currently supports only a subset of available options for running ECS tasks. This patch adds all ECS options that could be relevant to airflow; options that wouldn't make sense here, like `count`, were skipped. [AIRFLOW-1195] Add feature to clear tasks in Parent Dag (#3907) [AIRFLOW-3073] Add note-Profiling feature not supported in new webserver (#3909) Adhoc queries and Charts features are no longer supported in new FAB-based webserver and UI. But this is not mentioned at all in the doc "Data Profiling" (https://airflow.incubator.apache.org/profiling.html) This commit adds a note to remind users for this. [AIRFLOW-XXX] Fix SlackWebhookOperator docs (#3915) The docs refer to `conn_id` while the actual argument is `http_conn_id`. [AIRFLOW-1441] Fix inconsistent tutorial code (#2466) [AIRFLOW-XXX] Add 90 Seconds to companies [AIRFLOW-3096] Reduce DaysUntilStale for probot/stale [AIRFLOW-3096] Further reduce DaysUntilStale for probo/stale [AIRFLOW-3072] Assign permission get_logs_with_metadata to viewer role (#3913) [AIRFLOW-3090] Demote dag start/stop log messages to debug (#3920) [AIRFLOW-2407] Use feature detection for reload() (#3298) * [AIRFLOW-2407] Use feature detection for reload() [Use feature detection instead of version detection](https://docs.python.org/3/howto/pyporting.html#use-feature-detection-instead-of-version-detection) is a Python porting best practice that avoids a flake8 undefined name error... 
flake8 testing of https://github.com/apache/incubator-airflow on Python 3.6.3 [AIRFLOW-2747] Explicit re-schedule of sensors (#3596) * [AIRFLOW-2747] Explicit re-schedule of sensors Add `mode` property to sensors. If set to `reschedule` an AirflowRescheduleException is raised instead of sleeping which sets the task back to state `NONE`. Reschedules are recorded in new `task_schedule` table and visualized in the Gantt view. New TI dependency checks if a sensor task is ready to be re-scheduled. * Reformat sqlalchemy imports * Make `_handle_reschedule` private * Remove print * Add comment * Add comment * Don't record reschule request in test mode [AIRFLOW-XXX] Fix a wrong sample bash command, a display issue & a few typos (#3924) [AIRFLOW-3090] Make No tasks to consider for execution debug (#3923) During normal operation, it is not necessary to see the message. This can only be useful when debugging an issue. AIRFLOW-2952 Fix Kubernetes CI (#3922) The current dockerised CI pipeline doesn't run minikube and the Kubernetes integration tests. This starts a Kubernetes cluster using minikube and runs k8s integration tests using docker-compose. [AIRFLOW-2918] Fix Flake8 violations (#3931) [AIRFLOW-3076] Remove preloading of MySQL testdata (#3911) One of the things for tests is being self contained. This means that it should not depend on anything external, such as loading data. This PR will use the setUp and tearDown to load the data into MySQL and remove it afterwards. This removes the actual bash mysql commands and will make it easier to dockerize the whole testsuite in the future [AIRFLOW-2887] Added BigQueryCreateEmptyDatasetOperator and create_emty_dataset to bigquery_hook (#3876) [AIRFLOW-2918] Remove unused imports [AIRFLOW-3099] Stop Missing Section Errors for optional sections (#3934) [AIRFLOW-3090] Specify path of key file in log message (#3921) [AIRFLOW-3067] Display www_rbac Flask flash msg properly (#3903) The Flask flash messages are not displayed properly. 
When we don't give a category for a flash message, defautl value will be 'message'. In some cases, we specify 'error' category. Using Flask-AppBuilder, the flash message will be given a CSS class 'alert-[category]'. But We don't have 'alert-message' or 'alert-error' in the current 'bootstrap-theme.css' file. This makes the the flash messages in www_rbac UI come with no background color. This commit addresses this issue by adding 'alert-message' (using specs of existing CSS class 'alert-info') and 'alert-error' (using specs of existing CSS class 'alert-danger') into 'bootstrap-theme.css'. [AIRFLOW-3109] Bugfix to allow user/op roles to clear task intance via UI by default add show statements to hql filtering. [AIRFLOW-3051] Change CLI to make users ops similar to connections The ability to manipulate users from the command line is a bit clunky. Currently 'airflow create_user' and 'airflow delete_user' and 'airflow list_users'. It seems that these ought to be made more like connections, so that it becomes 'airflow users list ...', 'airflow users delete ...' and 'airflow users create ...' [AIRFLOW-3009] Import Hashable from collection.abc to fix Python 3.7 deprecation warning (#3849) [AIRFLOW-XXX] Add Tesla as an Apache Airflow user (#3947) [AIRFLOW-3111] Fix instructions in UPDATING.md and remove comment (#3944) artifacts in default_airflow.cfg - fixed incorrect instructions in UPDATING.md regarding core.log_filename_template and elasticsearch.elasticsearch_log_id_template - removed comments referencing "additional curly braces" from default_airflow.cfg since they're irrelevant to the rendered airflow.cfg [AIRFLOW-3117] Add instructions to allow GPL dependency (#3949) The installation instructions failed to mention how to proceed with the GPL dependency. For those who are not concerned by GPL, it is useful to know how to proceed with GPL dependency. 
[AIRFLOW-XXX] Add Square to the companies lists [AIRFLOW-XXX] Add Fathom Health to readme [AIRFLOW-XXX] Pin Click to 6.7 to Fix CI (#3962) [AIRFLOW-XXX] Fix SlackWebhookOperator execute method comment (#3963) [AIRFLOW-3100][AIRFLOW-3101] Improve docker compose local testing (#3933) [AIRFLOW-3127] Fix out-dated doc for Celery SSL (#3967) Now in `airflow.cfg`, for Celery-SSL, the item names are "ssl_active", "ssl_key", "ssl_cert", and "ssl_cacert". (since PR https://github.com/apache/incubator-airflow/pull/2806/files) But in the documentation https://airflow.incubator.apache.org/security.html?highlight=celery or https://github.com/apache/incubator-airflow/blob/master/docs/security.rst, it's "CELERY_SSL_ACTIVE", "CELERY_SSL_KEY", "CELERY_SSL_CERT", and "CELERY_SSL_CACERT", which is out-dated and may confuse readers. [AIRFLOW-XXX] Fix PythonVirtualenvOperator tests (#3968) The recent update to the CI image changed the default python from python2 to python3. The PythonVirtualenvOperator tests expected python2 as default and fail due to serialisation errors. [AIRFLOW-2952] Fix Kubernetes CI (#3957) - Update outdated cli command to create user - Remove `airflow/example_dags_kubernetes` as the dag already exists in `contrib/example_dags/` - Update the path to copy K8s dags [AIRFLOW-3104] Add .airflowignore info into doc (#3939) .airflowignore is a nice feature, but it was not mentioned at all in the documentation. [AIRFLOW-3130] Add CLI docs for users command [AIRFLOW-XXX] Add Delete for CLI Example in UPDATING.md [AIRFLOW-3123] Use a stack for DAG context management (#3956) [AIRFLOW-3125] Monitor Task Instances creation rates (#3966) Montor Task Instances creation rates by Operator type. These stats can provide some visibility on how much workload Airflow is getting. They can be used for resource allocation in the long run (i.e. to determine when we should scale up workers) and debugging in scenarios like the creation rate of certain type of Task Instances spikes. 
[AIRFLOW-3129] Backfill mysql hook unit tests. (#3970) [AIRFLOW-3124] Fix RBAC webserver debug mode (#3958) [AIRFLOW-XXX] Add Compass to companies list (#3972) We're using Airflow at Compass now. [AIRFLOW-XXX] Speed up DagBagTest cases (#3974) I noticed that many of the tests of DagBags operate on a specific DAG only, and don't need to load the example or test dags. By not loading the dags we don't need to this shaves about 10-20s of test time. [AIRFLOW-2912] Add Deploy and Delete operators for GCF (#3969) Both Deploy and Delete operators interact with Google Cloud Functions to manage functions. Both are idempotent and make use of GcfHook - hook that encapsulates communication with GCP over GCP API. [AIRFLOW-1390] Update Alembic to 0.9 (#3935) [AIRFLOW-2238] Update PR tool to remove outdated info (#3978) [AIRFLOW-XXX] Don't spam test logs with "bad cron expression" messages (#3973) We needed these test dags to check the behaviour of invalid cron expressions, but by default we were loading them every time we create a DagBag (which many, many tests to). Instead we ignore these known-bad dags by default, and the test checking those (tests/models.py:DagBagTest.test_process_file_cron_validity_check) is already explicitly processing those DAGs directly, so it remains tested. [AIRFLOW-XXX] Fix undocumented params in S3_hook Some function parameters were undocumented. Additional docstrings were added for clarity. [AIRFLOW-3079] Improve migration scripts to support MSSQL Server (#3964) There were two problems for MSSQL. First, 'timestamp' data type in MSSQL Server is essentially a row-id, and not a timezone enabled date/time stamp. Second, alembic creates invalid SQL when applying the 0/1 constraint to boolean values. MSSQL should enforce this constraint by simply asserting a boolean value. 
[AIRFLOW-XXX] Add DoorDash to README.md (#3980) DoorDash uses Airflow https://softwareengineeringdaily.com/2018/09/28/doordash/ [AIRFLOW-3062] Add Qubole in integration docs (#3946) [AIRFLOW-3129] Improve test coverage of airflow.models. (#3982) [AIRFLOW-2574] Cope with '%' in SQLA DSN when running migrations (#3787) Alembic uses a ConfigParser like Airflow does, and "%% is a special value in there, so we need to escape it. As per the Alembic docs: > Note that this value is passed to ConfigParser.set, which supports > variable interpolation using pyformat (e.g. `%(some_value)s`). A raw > percent sign not part of an interpolation symbol must therefore be > escaped, e.g. `%%` [AIRFLOW-3137] Make ProxyFix middleware optional. (#3983) The ProxyFix middleware should only be used when airflow is running behind a trusted proxy. This patch adds a `USE_PROXY_FIX` flag that defaults to `False`. [AIRFLOW-3004] Add config disabling scheduler cron (#3899) [AIRFLOW-3103][AIRFLOW-3147] Update flask-appbuilder (#3937) [AIRFLOW-2993] s3_to_sftp and sftp_to_s3 operators #3828 Added apply_default decorator. Added test for operators [AIRFLOW-XXX] Fixing the issue in Documentation (#3998) Fixing the operator name from DataFlowOperation to DataFlowJavaOperator in Documentation [AIRFLOW-3088] Include slack-compatible emoji image [AIRFLOW-3161] fix TaskInstance log link in RBAC UI [AIRFLOW-3148] Remove unnecessary arg "parameters" in RedshiftToS3Transfer (#3995) "Parameters" are used to help render the SQL command. But in this operator, only "schema" and "table" are needed. There is no SQL command to render. By checking the code,we can also find argument "parameters" is never really used. 
(Fix a minor issue in the docstring as well)

[AIRFLOW-3159] Update GCS logging docs for latest code (#3952)

Formatted code

[AIRFLOW-XXX] Fixing the issue in Documentation (#3998)
Fixing the operator name from DataFlowOperation to DataFlowJavaOperator in Documentation

[AIRFLOW-3088] Include slack-compatible emoji image

[AIRFLOW-3161] fix TaskInstance log link in RBAC UI

[AIRFLOW-3148] Remove unnecessary arg "parameters" in RedshiftToS3Transfer (#3995)
"Parameters" are used to help render the SQL command. But in this operator, only "schema" and "table" are needed. There is no SQL command to render. By checking the code, we can also see that the argument "parameters" is never really used.
(Fix a minor issue in the docstring as well)

[AIRFLOW-3159] Update GCS logging docs for latest code (#3952)

[AIRFLOW-2930] Fix celery executor scheduler crash (#3784)
Caused by an update in PR #3740. execute_command.apply_async(args=command, ...)
- command is a list of short unicode strings and the above code passes multiple arguments to a function defined as taking only one argument.
- command = ["airflow", "run", "dag323",...]
- args = command = ["airflow", "run", "dag323", ...]
- execute_command("airflow","run","dag323", ...) will raise an error and exit.

[AIRFLOW-2854] kubernetes_pod_operator add more configuration items (#3697)
* kubernetes_pod_operator add more configuration items
* fix test_kubernetes_pod_operator test_faulty_service_account failure case
* fix review comment issues
* pod_operator add hostnetwork config
* add doc example

[AIRFLOW-2994] Fix command status check in Qubole Check operator (#3790)

[AIRFLOW-2949] Add syntax highlight for single quote strings (#3795)
* AIRFLOW-2949: Add syntax highlight for single quote strings
* AIRFLOW-2949: Also updated new UI main.css

[AIRFLOW-2948] Arg check & better doc - SSHOperator & SFTPOperator (#3793)
There may be different combinations of arguments, and some processing is done 'silently', while users may not be fully aware of it.
For example:
- User only needs to provide either `ssh_hook` or `ssh_conn_id`, while this is not clear in doc
- if both are provided, `ssh_conn_id` will be ignored.
- if `remote_host` is provided, it will replace the `remote_host` which was defined in `ssh_hook` or predefined in the connection of `ssh_conn_id`
These should be documented clearly to ensure it's transparent to the users. log.info() should also be used to remind users and provide clear logs. In addition, add instance check for ssh_hook to ensure it is of the correct type (SSHHook). Tests are updated for this PR.

[AIRFLOW-XXX] Fix Broken Link in CONTRIBUTING.md

[AIRFLOW-2980] ReadTheDocs - Fix Missing API Reference

[AIRFLOW-2779] Make GHE auth third party licensed (#3803)
This reinstates the original license.

[AIRFLOW-XXX] Add Format to list of companies (#3824)

[AIRFLOW-2900] Show code for packaged DAGs (#3749)

[AIRFLOW-2983] Add prev_ds_nodash and next_ds_nodash macro (#3821)

[AIRFLOW-2951] Update dag_run table end_date when state change (#3798)
Previously, Airflow only changed the dag_run table's end_date value when a user terminated a dag in the web UI. The end_date was not updated when Airflow detected that a dag had finished and updated its state. This commit adds an end_date update to DagRun's set_state function to fix the problem mentioned above.

[AIRFLOW-2145] fix deadlock on clearing running TI (#3657)
a `shutdown` task is not considered to be `unfinished`, so a dag run can deadlock when all `unfinished` downstreams are waiting on a task that's in the `shutdown` state. fix this by considering `shutdown` to be `unfinished`, since it's not truly a terminal state

[AIRFLOW-XXX] Fix typo in docstring of gcs_to_bq (#3833)

[AIRFLOW-2476] Allow tabulate up to 0.8.2 (#3835)

[AIRFLOW-XXX] Fix typos in faq.rst (#3837)

[AIRFLOW-2979] Make celery_result_backend conf Backwards compatible (#3832)
(#2806) Renamed `celery_result_backend` to `result_backend` and broke backwards compatibility.
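
As a rough illustration of the backwards-compatibility idea in AIRFLOW-2979 above, the sketch below looks up the new option name first and falls back to the renamed one with a deprecation warning. It uses only the standard-library `configparser`; the `DEPRECATED_OPTIONS` mapping and the `get_option` helper are hypothetical names made up for this example, not Airflow's actual configuration code.

```python
import configparser
import warnings

# Hypothetical mapping for illustration: (section, new name) -> deprecated name.
DEPRECATED_OPTIONS = {("celery", "result_backend"): "celery_result_backend"}


def get_option(parser, section, key):
    """Return section/key, falling back to a deprecated option name if present."""
    if parser.has_option(section, key):
        return parser.get(section, key)
    old_key = DEPRECATED_OPTIONS.get((section, key))
    if old_key is not None and parser.has_option(section, old_key):
        warnings.warn(
            "The {} option in [{}] has been renamed to {}".format(old_key, section, key),
            DeprecationWarning,
        )
        return parser.get(section, old_key)
    raise KeyError("{}/{} not found in config".format(section, key))


# A config file that still uses the old option name.
old_style = configparser.ConfigParser()
old_style.read_string("[celery]\ncelery_result_backend = db+mysql://localhost/airflow\n")

# A config file that already uses the new option name.
new_style = configparser.ConfigParser()
new_style.read_string("[celery]\nresult_backend = db+postgresql://localhost/airflow\n")
```

With a fallback of this kind, an existing config file that only sets the old name keeps working, while the warning nudges users toward the new name.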
[AIRFLOW-2866] Fix missing CSRF token head when using RBAC UI (#3804)

[AIRFLOW-3007] Update backfill example in Scheduler docs
The scheduler docs at https://airflow.apache.org/scheduler.html#backfill-and-catchup use a deprecated way of passing `schedule_interval`. `schedule_interval` should be passed to DAG as a separate parameter and not as a default arg.

[AIRFLOW-3005] Replace 'Airbnb Airflow' with 'Apache Airflow' (#3845)

[AIRFLOW-3002] Fix variable & tests in GoogleCloudBucketHelper (#3843)

[AIRFLOW-2991] Log path to driver output after Dataproc job (#3827)

[AIRFLOW-XXX] Fix python3 and flake8 errors in dev/airflow-jira
This is a script that checks if the Jiras marked as fixed in a release are actually merged in - getting this working is helpful to me in preparing 1.10.1

[AIRFLOW-2883] Add import and export for pool cli using JSON

[AIRFLOW-3021] Add Censys to who uses Airflow list
> Censys
> Find and analyze every reachable server and device on the Internet
> https://censys.io/
closes AIRFLOW-3021 https://issues.apache.org/jira/browse/AIRFLOW-3021

Add Branch to Company List

[AIRFLOW-3008] Move Kubernetes example DAGs to contrib

[AIRFLOW-2997] Support cluster fields in bigquery (#3838)
This adds a cluster_fields argument to the bigquery hook, GCS to bigquery operator and bigquery query operators. This field requests that bigquery store the result of the query/load operation sorted according to the specified fields (the order of fields given is significant).

[AIRFLOW-XXX] Redirect FAQ `airflow[crypto]` to How-to Guides.

[AIRFLOW-XXX] Remove redundant space in Kerberos (#3866)

[AIRFLOW-3028] Update Text & Images in Readme.md

[AIRFLOW-1917] Trim extra newline and trailing whitespace from log (#3862)

[AIRFLOW-2985] Operators for S3 object copying/deleting (#3823)
1. Copying: Under the hood, it's `boto3.client.copy_object()`. It can only handle the situation in which the S3 connection used can access both source and destination bucket/key.
2.
Deleting:
2.1 Under the hood, it's `boto3.client.delete_objects()`. It supports deleting either a single object or multiple objects.
2.2 If users try to delete a non-existent object, the request will still succeed, but there will be an entry 'Errors' in the response. There may also be other reasons which may cause similar 'Errors' (the request itself would succeed without an explicit exception). So an argument `silent_on_errors` is added to let users decide if this sort of 'Errors' should fail the operator.
The corresponding methods are added into S3Hook, and these two operators are 'wrappers' of these methods.

[AIRFLOW-3030] Fix CLI docs (#3872)

[AIRFLOW-XXX] Update kubernetes.rst docs (#3875)
Update kubernetes.rst with correct KubernetesPodOperator inputs for the volumes.

[AIRFLOW-XXX] Add Enigma to list of companies

[AIRFLOW-2965] CLI tool to show the next execution datetime
Cover different cases
- schedule_interval is "@once" or None, then following_schedule method would always return None
- If dag is paused, print reminder
- If latest_execution_date is not found, print warning saying not applicable.

[AIRFLOW-XXX] Add Bombora Inc using Airflow

[AIRFLOW-XXX] Move Dag level access control out of 1.10 section (#3882)
It isn't in 1.10 (and wasn't in this section when the PR was created).

[AIRFLOW-3012] Fix Bug when passing emails for SLA

[AIRFLOW-2797] Create Google Dataproc cluster with custom image (#3871)

[AIRFLOW-XXX] Updated README to include CAVA

[AIRFLOW-3035] Allow custom 'job_error_states' in dataproc ops (#3884)
Allow the caller to pass a custom list of Dataproc job states into the DataProc*Operator classes that should result in the _DataProcJob.raise_error() method raising an Exception.
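
The `silent_on_errors` behaviour described for the S3 delete operator (AIRFLOW-2985) can be sketched without touching AWS. The helper below is a hypothetical illustration, not the operator's actual code: it inspects a response dictionary shaped like the one `delete_objects()` returns (with `Deleted` and `Errors` lists) and decides whether to fail. The sample response is hand-written for the example.

```python
def check_delete_response(response, silent_on_errors=False):
    """Return the deleted keys; raise if the response reports errors, unless silenced."""
    errors = response.get("Errors", [])
    if errors and not silent_on_errors:
        failed = [e.get("Key") for e in errors]
        raise RuntimeError("Errors when deleting: {}".format(failed))
    return [d["Key"] for d in response.get("Deleted", [])]


# Hand-written stand-in for a delete_objects response (not real output).
sample = {
    "Deleted": [{"Key": "a.csv"}],
    "Errors": [{"Key": "b.csv", "Code": "AccessDenied", "Message": "Access Denied"}],
}

# With silent_on_errors=True the partial failure is tolerated.
deleted = check_delete_response(sample, silent_on_errors=True)
```

The point of the flag is that the HTTP request itself succeeds either way, so the caller has to opt in or out of treating the `Errors` entry as a task failure.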
[AIRFLOW-3034]: Readme updates : Add Slack & Twitter, remove Gitter

[AIRFLOW-3056] Add happn to Airflow user list

[AIRFLOW-3052] Add logo options to Airflow (#3892)

[AIRFLOW-2524] Add SageMaker Batch Inference (#3767)
* Fix for comments
* Fix sensor test
* Update non_terminal_states and failed_states to static variables of SageMakerHook
Add SageMaker Transform Operator & Sensor
Co-authored-by: srrajeev-aws <[email protected]>

[AIRFLOW-XXX] Added Jeitto as one of happy Airflow users! (#3902)

[AIRFLOW-XXX] Add Jeitto as one happy Airflow user!

[AIRFLOW-3044] Dataflow operators accept templated job_name param (#3887)
* Default value of new job_name param is templated task_id, to match the existing behavior as much as possible.
* Change expected value in test_mlengine_operator_utils.py to match default for new job_name param.

[AIRFLOW-2707] Validate task_log_reader on upgrade from <=1.9 (#3881)
We changed the default logging config and config from 1.9 to 1.10, but anyone who upgrades and has an existing airflow.cfg won't know they need to change this value - instead they will get nothing displayed in the UI (ajax request fails) and see "'NoneType' object has no attribute 'read'" in the error log. This validates that config section at start up, and seamlessly upgrades the previous value.

[AIRFLOW-3025] Enable specifying dns and dns_search options for DockerOperator (#3860)
Enable specifying dns and dns_search options for DockerOperator

[AIRFLOW-1298] Clear UPSTREAM_FAILED using the clean cli (#3886)
* [AIRFLOW-1298] Fix 'clear only_failed'
* [AIRFLOW-1298] Fix 'clear only_failed'

[AIRFLOW-3059] Log how many rows are read from Postgres (#3905)
To know how much data is being read from Postgres, it is nice to log this to the Airflow log. Previously when there was no data, it would still create a single file. This is not something that we want, and therefore we've changed this behaviour. Refactored the tests to make use of Postgres itself since we have it running.
This makes the tests more realistic, instead of mocking everything.

[AIRFLOW-XXX] Fix typo in docs/timezone.rst (#3904)

[AIRFLOW-3068] Remove deprecated imports

[AIRFLOW-3036] Add relevant ECS options to ECS operator. (#3908)
The ECS operator currently supports only a subset of available options for running ECS tasks. This patch adds all ECS options that could be relevant to airflow; options that wouldn't make sense here, like `count`, were skipped.

[AIRFLOW-1195] Add feature to clear tasks in Parent Dag (#3907)

[AIRFLOW-3073] Add note-Profiling feature not supported in new webserver (#3909)
Adhoc queries and Charts features are no longer supported in the new FAB-based webserver and UI. But this is not mentioned at all in the doc "Data Profiling" (https://airflow.incubator.apache.org/profiling.html). This commit adds a note to remind users of this.

[AIRFLOW-XXX] Fix SlackWebhookOperator docs (#3915)
The docs refer to `conn_id` while the actual argument is `http_conn_id`.

[AIRFLOW-1441] Fix inconsistent tutorial code (#2466)

[AIRFLOW-XXX] Add 90 Seconds to companies

[AIRFLOW-3096] Further reduce DaysUntilStale for probo/stale

[AIRFLOW-3072] Assign permission get_logs_with_metadata to viewer role (#3913)

[AIRFLOW-3090] Demote dag start/stop log messages to debug (#3920)

[AIRFLOW-2407] Use feature detection for reload() (#3298)
* [AIRFLOW-2407] Use feature detection for reload()
[Use feature detection instead of version detection](https://docs.python.org/3/howto/pyporting.html#use-feature-detection-instead-of-version-detection) is a Python porting best practice that avoids a flake8 undefined name error...
flake8 testing of https://github.com/apache/incubator-airflow on Python 3.6.3

[AIRFLOW-XXX] Fix a wrong sample bash command, a display issue & a few typos (#3924)

[AIRFLOW-3090] Make No tasks to consider for execution debug (#3923)
During normal operation, it is not necessary to see the message. This can only be useful when debugging an issue.
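
The feature-detection practice referenced in AIRFLOW-2407 above can be shown in a few lines. This is a generic sketch of the pattern from the Python porting guide, not necessarily the exact code that was merged: instead of branching on `sys.version_info`, it probes for the `reload` builtin and imports it from `importlib` when the name is missing.

```python
# Feature detection: probe for the name rather than checking the Python version.
try:
    reload  # noqa: F821 (a builtin on Python 2 only)
except NameError:
    from importlib import reload  # Python 3.4+

import json

# reload() re-executes the module and returns the same module object.
reloaded = reload(json)
```

Because the name `reload` is always bound after the `try` block, linters no longer report it as undefined on Python 3.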
AIRFLOW-2952 Fix Kubernetes CI (#3922)
The current dockerised CI pipeline doesn't run minikube and the Kubernetes integration tests. This starts a Kubernetes cluster using minikube and runs k8s integration tests using docker-compose.

[AIRFLOW-2918] Fix Flake8 violations (#3931)

[AIRFLOW-3076] Remove preloading of MySQL testdata (#3911)
One of the important things for tests is being self-contained. This means that they should not depend on anything external, such as loading data. This PR will use setUp and tearDown to load the data into MySQL and remove it afterwards. This removes the actual bash mysql commands and will make it easier to dockerize the whole testsuite in the future

[AIRFLOW-2918] Remove unused imports

[AIRFLOW-3099] Stop Missing Section Errors for optional sections (#3934)

[AIRFLOW-3090] Specify path of key file in log message (#3921)

[AIRFLOW-3067] Display www_rbac Flask flash msg properly (#3903)
The Flask flash messages are not displayed properly. When we don't give a category for a flash message, the default value will be 'message'. In some cases, we specify the 'error' category. Using Flask-AppBuilder, the flash message will be given a CSS class 'alert-[category]'. But we don't have 'alert-message' or 'alert-error' in the current 'bootstrap-theme.css' file. This makes the flash messages in the www_rbac UI come with no background color. This commit addresses this issue by adding 'alert-message' (using specs of existing CSS class 'alert-info') and 'alert-error' (using specs of existing CSS class 'alert-danger') into 'bootstrap-theme.css'.

[AIRFLOW-3109] Bugfix to allow user/op roles to clear task instance via UI by default

add show statements to hql filtering.

[AIRFLOW-3051] Change CLI to make users ops similar to connections
The ability to manipulate users from the command line is a bit clunky. Currently there are 'airflow create_user', 'airflow delete_user' and 'airflow list_users'.
It seems that these ought to be made more like connections, so that it becomes 'airflow users list ...', 'airflow users delete ...' and 'airflow users create ...'

[AIRFLOW-3009] Import Hashable from collections.abc to fix Python 3.7 deprecation warning (#3849)

[AIRFLOW-XXX] Add Tesla as an Apache Airflow user (#3947)

[AIRFLOW-3111] Fix instructions in UPDATING.md and remove comment (#3944)
artifacts in default_airflow.cfg
- fixed incorrect instructions in UPDATING.md regarding core.log_filename_template and elasticsearch.elasticsearch_log_id_template
- removed comments referencing "additional curly braces" from default_airflow.cfg since they're irrelevant to the rendered airflow.cfg

[AIRFLOW-3117] Add instructions to allow GPL dependency (#3949)
The installation instructions failed to mention how to proceed with the GPL dependency. For those who are not concerned by GPL, it is useful to know how to proceed with the GPL dependency.

[AIRFLOW-XXX] Add Square to the companies lists

[AIRFLOW-XXX] Add Fathom Health to readme

[AIRFLOW-XXX] Pin Click to 6.7 to Fix CI (#3962)

[AIRFLOW-XXX] Fix SlackWebhookOperator execute method comment (#3963)

[AIRFLOW-3100][AIRFLOW-3101] Improve docker compose local testing (#3933)

[AIRFLOW-3127] Fix out-dated doc for Celery SSL (#3967)
Now in `airflow.cfg`, for Celery-SSL, the item names are "ssl_active", "ssl_key", "ssl_cert", and "ssl_cacert" (since PR https://github.com/apache/incubator-airflow/pull/2806/files). But in the documentation https://airflow.incubator.apache.org/security.html?highlight=celery or https://github.com/apache/incubator-airflow/blob/master/docs/security.rst, it's "CELERY_SSL_ACTIVE", "CELERY_SSL_KEY", "CELERY_SSL_CERT", and "CELERY_SSL_CACERT", which is outdated and may confuse readers.

[AIRFLOW-XXX] Fix PythonVirtualenvOperator tests (#3968)
The recent update to the CI image changed the default python from python2 to python3.
The PythonVirtualenvOperator tests expected python2 as default and fail due to serialisation errors.

[AIRFLOW-2952] Fix Kubernetes CI (#3957)
- Update outdated cli command to create user
- Remove `airflow/example_dags_kubernetes` as the dag already exists in `contrib/example_dags/`
- Update the path to copy K8s dags

[AIRFLOW-3104] Add .airflowignore info into doc (#3939)
.airflowignore is a nice feature, but it was not mentioned at all in the documentation.

[AIRFLOW-XXX] Add Delete for CLI Example in UPDATING.md

[AIRFLOW-3123] Use a stack for DAG context management (#3956)

[AIRFLOW-3125] Monitor Task Instances creation rates (#3966)
Monitor Task Instances creation rates by Operator type. These stats can provide some visibility into how much workload Airflow is getting. They can be used for resource allocation in the long run (i.e. to determine when we should scale up workers) and for debugging in scenarios like the creation rate of a certain type of Task Instances spiking.

[AIRFLOW-3129] Backfill mysql hook unit tests. (#3970)

[AIRFLOW-3124] Fix RBAC webserver debug mode (#3958)

[AIRFLOW-XXX] Add Compass to companies list (#3972)
We're using Airflow at Compass now.

[AIRFLOW-XXX] Speed up DagBagTest cases (#3974)
I noticed that many of the tests of DagBags operate on a specific DAG only, and don't need to load the example or test dags. Not loading them shaves about 10-20s off the test time.

[AIRFLOW-2912] Add Deploy and Delete operators for GCF (#3969)
Both Deploy and Delete operators interact with Google Cloud Functions to manage functions. Both are idempotent and make use of GcfHook, a hook that encapsulates communication with GCP over the GCP API.
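
The AIRFLOW-3123 entry above gives only a title, so the following is a guess at the general idea rather than the merged implementation: keeping the active context objects on a stack lets nested `with` blocks restore the outer context on exit. The `Dag` class here is a toy stand-in invented for the example and is not Airflow's `DAG`.

```python
class Dag(object):
    """Toy stand-in for a DAG used as a context manager; not Airflow's class."""

    _context_stack = []  # innermost active context is the last element

    def __init__(self, dag_id):
        self.dag_id = dag_id

    def __enter__(self):
        Dag._context_stack.append(self)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Popping restores whichever context was active before this block.
        Dag._context_stack.pop()

    @classmethod
    def current(cls):
        return cls._context_stack[-1] if cls._context_stack else None


seen = []
with Dag("outer"):
    seen.append(Dag.current().dag_id)
    with Dag("inner"):
        seen.append(Dag.current().dag_id)
    seen.append(Dag.current().dag_id)  # back to the outer context
seen.append(Dag.current())  # no active context left
```

A single "current context" variable would lose the outer object when the inner block exits; the stack avoids that.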
[AIRFLOW-1390] Update Alembic to 0.9 (#3935)

[AIRFLOW-2238] Update PR tool to remove outdated info (#3978)

[AIRFLOW-XXX] Don't spam test logs with "bad cron expression" messages (#3973)
We needed these test dags to check the behaviour of invalid cron expressions, but by default we were loading them every time we create a DagBag (which many, many tests do). Instead we ignore these known-bad dags by default, and the test checking those (tests/models.py:DagBagTest.test_process_file_cron_validity_check) is already explicitly processing those DAGs directly, so it remains tested.

[AIRFLOW-XXX] Fix undocumented params in S3_hook
Some function parameters were undocumented. Additional docstrings were added for clarity.

[AIRFLOW-3079] Improve migration scripts to support MSSQL Server (#3964)
There were two problems for MSSQL. First, the 'timestamp' data type in MSSQL Server is essentially a row-id, and not a timezone-enabled date/time stamp. Second, alembic creates invalid SQL when applying the 0/1 constraint to boolean values. MSSQL should enforce this constraint by simply asserting a boolean value.

[AIRFLOW-XXX] Add DoorDash to README.md (#3980)
DoorDash uses Airflow https://softwareengineeringdaily.com/2018/09/28/doordash/

[AIRFLOW-3062] Add Qubole in integration docs (#3946)

[AIRFLOW-3129] Improve test coverage of airflow.models. (#3982)

[AIRFLOW-2574] Cope with '%' in SQLA DSN when running migrations (#3787)
Alembic uses a ConfigParser like Airflow does, and "%" is a special value in there, so we need to escape it. As per the Alembic docs:
> Note that this value is passed to ConfigParser.set, which supports
> variable interpolation using pyformat (e.g. `%(some_value)s`). A raw
> percent sign not part of an interpolation symbol must therefore be
> escaped, e.g. `%%`

[AIRFLOW-3137] Make ProxyFix middleware optional. (#3983)
The ProxyFix middleware should only be used when airflow is running behind a trusted proxy.
This patch adds a `USE_PROXY_FIX` flag that defaults to `False`.

[AIRFLOW-3004] Add config disabling scheduler cron (#3899)

[AIRFLOW-3103][AIRFLOW-3147] Update flask-appbuilder (#3937)

[AIRFLOW-XXX] Fixing the issue in Documentation (#3998)
Fixing the operator name from DataFlowOperation to DataFlowJavaOperator in Documentation

[AIRFLOW-3088] Include slack-compatible emoji image

[AIRFLOW-3161] fix TaskInstance log link in RBAC UI

[AIRFLOW-3148] Remove unnecessary arg "parameters" in RedshiftToS3Transfer (#3995)
"Parameters" are used to help render the SQL command. But in this operator, only "schema" and "table" are needed. There is no SQL command to render. By checking the code, we can also see that the argument "parameters" is never really used.
(Fix a minor issue in the docstring as well)

[AIRFLOW-3159] Update GCS logging docs for latest code (#3952)

Reformatted to flaskdiff requirements.

[AIRFLOW-XXX] Remove residual line in Changelog (#3814)

[AIRFLOW-2930] Fix celery executor scheduler crash (#3784)
Caused by an update in PR #3740. execute_command.apply_async(args=command, ...)
- command is a list of short unicode strings and the above code passes multiple arguments to a function defined as taking only one argument.
- command = ["airflow", "run", "dag323",...]
- args = command = ["airflow", "run", "dag323", ...]
- execute_command("airflow","run","dag323", ...) will raise an error and exit.
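
The failure described in AIRFLOW-2930 above comes from `apply_async` treating `args` as the list of positional arguments. The sketch below reproduces that with a toy `apply_async` stand-in (an assumption for illustration, so that no Celery installation is needed) and shows one way to avoid it by wrapping the command in an outer list; the actual change in #3784 may differ.

```python
def execute_command(command):
    """Takes ONE argument: the whole command as a list of strings."""
    return " ".join(command)


def apply_async(func, args):
    """Toy stand-in for a task's apply_async: unpacks args as positional arguments."""
    return func(*args)


command = ["airflow", "run", "dag323"]

# Buggy call: equivalent to execute_command("airflow", "run", "dag323").
try:
    apply_async(execute_command, args=command)
    outcome = "ok"
except TypeError:
    outcome = "TypeError"

# Wrapping the list makes the whole command the single positional argument.
fixed = apply_async(execute_command, args=[command])
```

The buggy call fails only at execution time on the worker side, which is why it surfaced as a crash rather than an import-time error.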
[AIRFLOW-2952] Fix Kubernetes CI (#3957) - Update outdated cli command to create user - Remove `airflow/example_dags_kubernetes` as the dag already exists in `contrib/example_dags/` - Update the path to copy K8s dags [AIRFLOW-3104] Add .airflowignore info into doc (#3939) .airflowignore is a nice feature, but it was not mentioned at all in the documentation. [AIRFLOW-XXX] Add Delete for CLI Example in UPDATING.md [AIRFLOW-3123] Use a stack for DAG context management (#3956) [AIRFLOW-3125] Monitor Task Instances creation rates (#3966) Monitor Task Instances creation rates by Operator type. These stats can provide some visibility on how much workload Airflow is getting. They can be used for resource allocation in the long run (i.e. to determine when we should scale up workers) and debugging in scenarios like the creation rate of a certain type of Task Instance spiking. [AIRFLOW-3129] Backfill mysql hook unit tests. (#3970) [AIRFLOW-3124] Fix RBAC webserver debug mode (#3958) [AIRFLOW-XXX] Add Compass to companies list (#3972) We're using Airflow at Compass now. [AIRFLOW-XXX] Speed up DagBagTest cases (#3974) I noticed that many of the tests of DagBags operate on a specific DAG only, and don't need to load the example or test dags. Not loading the dags we don't need shaves about 10-20s off the test time. [AIRFLOW-2912] Add Deploy and Delete operators for GCF (#3969) Both Deploy and Delete operators interact with Google Cloud Functions to manage functions. Both are idempotent and make use of GcfHook - hook that encapsulates communication with GCP over GCP API. [AIRFLOW-1390] Update Alembic to 0.9 (#3935) [AIRFLOW-2238] Update PR tool to remove outdated info (#3978) [AIRFLOW-XXX] Don't spam test logs with "bad cron expression" messages (#3973) We needed these test dags to check the behaviour of invalid cron expressions, but by default we were loading them every time we create a DagBag (which many, many tests do).
Instead we ignore these known-bad dags by default, and the test checking those (tests/models.py:DagBagTest.test_process_file_cron_validity_check) is already explicitly processing those DAGs directly, so it remains tested. [AIRFLOW-XXX] Fix undocumented params in S3_hook Some function parameters were undocumented. Additional docstrings were added for clarity. [AIRFLOW-3079] Improve migration scripts to support MSSQL Server (#3964) There were two problems for MSSQL. First, the 'timestamp' data type in MSSQL Server is essentially a row-id, and not a timezone enabled date/time stamp. Second, alembic creates invalid SQL when applying the 0/1 constraint to boolean values. MSSQL should enforce this constraint by simply asserting a boolean value. [AIRFLOW-XXX] Add DoorDash to README.md (#3980) DoorDash uses Airflow https://softwareengineeringdaily.com/2018/09/28/doordash/ [AIRFLOW-3062] Add Qubole in integration docs (#3946) [AIRFLOW-3129] Improve test coverage of airflow.models. (#3982) [AIRFLOW-2574] Cope with '%' in SQLA DSN when running migrations (#3787) Alembic uses a ConfigParser like Airflow does, and "%" is a special value in there, so we need to escape it. As per the Alembic docs: > Note that this value is passed to ConfigParser.set, which supports > variable interpolation using pyformat (e.g. `%(some_value)s`). A raw > percent sign not part of an interpolation symbol must therefore be > escaped, e.g. `%%` [AIRFLOW-3137] Make ProxyFix middleware optional. (#3983) The ProxyFix middleware should only be used when airflow is running behind a trusted proxy. This patch adds a `USE_PROXY_FIX` flag that defaults to `False`.
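The '%' escaping described in the AIRFLOW-2574 entry above can be illustrated with the standard library alone. This is a minimal sketch, not the code from that PR: the `escape_percent` helper, the `alembic` section name and the example DSN are hypothetical, and only the documented interpolation behaviour of `configparser` is relied on.

```python
from configparser import ConfigParser


def escape_percent(value):
    """Hypothetical helper: double every '%' so ConfigParser treats it literally."""
    return value.replace("%", "%%")


# Example DSN with a URL-encoded '@' (%40) in the password; purely illustrative.
dsn = "mysql://user:p%40ss@localhost/airflow"

parser = ConfigParser()
parser.add_section("alembic")

# A raw '%' that is not part of '%(name)s' is rejected by ConfigParser.set.
try:
    parser.set("alembic", "sqlalchemy.url", dsn)
    raw_accepted = True
except ValueError:
    raw_accepted = False

# Escaped, the value is accepted and comes back unchanged on read.
parser.set("alembic", "sqlalchemy.url", escape_percent(dsn))
round_tripped = parser.get("alembic", "sqlalchemy.url")
```

Reading the option back goes through interpolation again, which turns `%%` into `%`, so the caller sees the original DSN.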
[AIRFLOW-3004] Add config disabling scheduler cron (#3899) [AIRFLOW-3103][AIRFLOW-3147] Update flask-appbuilder (#3937) [AIRFLOW-XXX] Fixing the issue in Documentation (#3998) Fixing the operator name from DataFlowOperation to DataFlowJavaOperator in Documentation [AIRFLOW-3088] Include slack-compatible emoji image [AIRFLOW-3161] fix TaskInstance log link in RBAC UI [AIRFLOW-3148] Remove unnecessary arg "parameters" in RedshiftToS3Transfer (#3995) "Parameters" are used to help render the SQL command. But in this operator, only "schema" and "table" are needed. There is no SQL command to render. By checking the code,we can also find argument "parameters" is never really used. (Fix a minor issue in the docstring as well) [AIRFLOW-3159] Update GCS logging docs for latest code (#3952) [AIRFLOW-XXX] Fix airflow.models.DAG docstring mistake Closes #4004 from Sambeth/sambeth Updated the tests written for s3/sftp operators Fixed the flask diff errors. Fixed aws connection test Fixed flask diff errors. Updated test_s3_to_sftp_operator with correct class name. Fixed test_s3_to_sftp_operator error reported in travis. Fixed test_s3_to_sftp_operator error reported in travis. Changed default values for s3_to_sftp_operator Updated test for checking for sftp file content. Fixed flask diff error. [AIRFLOW-XXX] Adding Home Depot as users of Apache airflow (#4013) * Adding Home Depot as users of Apache airflow [AIRFLOW-XXX] Added ThoughtWorks as user of Airflow in README (#4012) [AIRFLOW-XXX] Added DataCamp to list of companies in README (#4009) [AIRFLOW-3165] Document interpolation of '%' and warn (#4007) [AIRFLOW-3099] Complete list of optional airflow.cfg sections (#4002) [AIRFLOW-3162] Fix HttpHook URL parse error when port is specified (#4001) [AIRFLOW-3055] add get_dataset and get_datasets_list to bigquery_hook (#3894) * [AIRFLOW-3055] add get_dataset and get_datasets_list to bigquery_hook
Fixed variable for deleting resources. [AIRFLOW-XXX] Remove residual line in Changelog (apache#3814) [AIRFLOW-2930] Fix celery excecutor scheduler crash (apache#3784) Caused by an update in PR apache#3740. execute_command.apply_async(args=command, ...) -command is a list of short unicode strings and the above code passes multiple arguments to a function defined as taking only one argument. -command = ["airflow", "run", "dag323",...] -args = command = ["airflow", "run", "dag323", ...] -execute_command("airflow","run","dag323", ...) will raise an error and exit. [AIRFLOW-2854] kubernetes_pod_operator add more configuration items (apache#3697) * kubernetes_pod_operator add more configuration items * fix test_kubernetes_pod_operator test_faulty_service_account failure case * fix review comment issues * pod_operator add hostnetwork config * add doc example [AIRFLOW-2994] Fix command status check in Qubole Check operator (apache#3790) [AIRFLOW-2949] Add syntax highlight for single quote strings (apache#3795) * AIRFLOW-2949: Add syntax highlight for single quote strings * AIRFLOW-2949: Also updated new UI main.css [AIRFLOW-2948] Arg check & better doc - SSHOperator & SFTPOperator (apache#3793) There may be different combinations of arguments, and some processing is being done 'silently', while users may not be fully aware of it. For example - User only needs to provide either `ssh_hook` or `ssh_conn_id`, while this is not clear in doc - if both provided, `ssh_conn_id` will be ignored. - if `remote_host` is provided, it will replace the `remote_host` which was defined in `ssh_hook` or predefined in the connection of `ssh_conn_id` These should be documented clearly to ensure it's transparent to the users. log.info() should also be used to remind users and provide clear logs. In addition, add instance check for ssh_hook to ensure it is of the correct type (SSHHook). Tests are updated for this PR.
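The argument-unpacking problem described in the AIRFLOW-2930 entry above can be reproduced without Celery. The sketch below uses a hypothetical `apply_async` stand-in that unpacks `args` as positional arguments, which is the behaviour the commit message describes; wrapping the command in a one-element list is shown as one way to pass it as a single argument, and is an assumption about the fix rather than a quote from it.

```python
def execute_command(command):
    """Stand-in for the task: defined as taking ONE argument, the full command list."""
    return " ".join(command)


def apply_async(func, args):
    """Hypothetical stand-in for a task queue call: 'args' is unpacked positionally."""
    return func(*args)


command = ["airflow", "run", "dag323"]

# Passing the list directly unpacks it into three positional arguments,
# so a one-argument function raises TypeError.
try:
    apply_async(execute_command, args=command)
    unpacked_call_failed = False
except TypeError:
    unpacked_call_failed = True

# Wrapping the list keeps it together as the single expected argument.
result = apply_async(execute_command, args=[command])
```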
[AIRFLOW-XXX] Fix Broken Link in CONTRIBUTING.md [AIRFLOW-2980] ReadTheDocs - Fix Missing API Reference [AIRFLOW-2779] Make GHE auth third party licensed (apache#3803) This reinstates the original license. [AIRFLOW-XXX] Add Format to list of companies (apache#3824) [AIRFLOW-2900] Show code for packaged DAGs (apache#3749) [AIRFLOW-2983] Add prev_ds_nodash and next_ds_nodash macro (apache#3821) [AIRFLOW-2974] Extended Databricks hook with clusters operation (apache#3817) Add hooks for: - cluster start, - restart, - terminate. Add unit tests for the added hooks. Add hooks for cluster start, restart and terminate. Add unit tests for the added hooks. Add cluster_id variable for performing cluster operation tests. [AIRFLOW-2951] Update dag_run table end_date when state change (apache#3798) The existing airflow only changes the dag_run table end_date value when a user terminates a dag in the web UI. The end_date will not be updated if airflow detects that a dag finished and updates its state. This commit adds an end_date update in DagRun's set_state function to fix the problem mentioned above. [AIRFLOW-2145] fix deadlock on clearing running TI (apache#3657) a `shutdown` task is not considered to be `unfinished`, so a dag run can deadlock when all `unfinished` downstreams are waiting on a task that's in the `shutdown` state. fix this by considering `shutdown` to be `unfinished`, since it's not truly a terminal state [AIRFLOW-XXX] Fix typo in docstring of gcs_to_bq (apache#3833) [AIRFLOW-2476] Allow tabulate up to 0.8.2 (apache#3835) [AIRFLOW-XXX] Fix typos in faq.rst (apache#3837) [AIRFLOW-2979] Make celery_result_backend conf Backwards compatible (apache#3832) (apache#2806) Renamed `celery_result_backend` to `result_backend` and broke backwards compatibility.
[AIRFLOW-2866] Fix missing CSRF token head when using RBAC UI (apache#3804) [AIRFLOW-491] Add feature to pass extra api configs to BQ Hook (apache#3733) [AIRFLOW-3007] Update backfill example in Scheduler docs The scheduler docs at https://airflow.apache.org/scheduler.html#backfill-and-catchup use deprecated way of passing `schedule_interval`. `schedule_interval` should be pass to DAG as a separate parameter and not as a default arg. [AIRFLOW-3005] Replace 'Airbnb Airflow' with 'Apache Airflow' (apache#3845) [AIRFLOW-3002] Fix variable & tests in GoogleCloudBucketHelper (apache#3843) [AIRFLOW-2991] Log path to driver output after Dataproc job (apache#3827) [AIRFLOW-XXX] Fix python3 and flake8 errors in dev/airflow-jira This is a script that checks if the Jira's marked as fixed in a release are actually merged in - getting this working is helpful to me in preparing 1.10.1 [AIRFLOW-2883] Add import and export for pool cli using JSON [AIRFLOW-3021] Add Censys to who uses Airflow list > Censys > Find and analyze every reachable server and device on the Internet > https://censys.io/ closes AIRFLOW-3021 https://issues.apache.org/jira/browse/AIRFLOW-3021 Add Branch to Company List [AIRFLOW-3008] Move Kubernetes example DAGs to contrib [AIRFLOW-2997] Support cluster fields in bigquery (apache#3838) This adds a cluster_fields argument to the bigquery hook, GCS to bigquery operator and bigquery query operators. This field requests that bigquery store the result of the query/load operation sorted according to the specified fields (the order of fields given is significant). [AIRFLOW-XXX] Redirect FAQ `airflow[crypto]` to How-to Guides. [AIRFLOW-XXX] Remove redundant space in Kerberos (apache#3866) [AIRFLOW-3028] Update Text & Images in Readme.md [AIRFLOW-1917] Trim extra newline and trailing whitespace from log (apache#3862) [AIRFLOW-2985] Operators for S3 object copying/deleting (apache#3823) 1. Copying: Under the hood, it's `boto3.client.copy_object()`. 
It can only handle the situation in which the S3 connection used can access both source and destination bucket/key. 2. Deleting: 2.1 Under the hood, it's `boto3.client.delete_objects()`. It supports either deleting one single object or multiple objects. 2.2 If users try to delete a non-existent object, the request will still succeed, but there will be an entry 'Errors' in the response. There may also be other reasons which may cause similar 'Errors' ( request itself would succeed without explicit exception). So an argument `silent_on_errors` is added to let users decide if this sort of 'Errors' should fail the operator. The corresponding methods are added into S3Hook, and these two operators are 'wrappers' of these methods. [AIRFLOW-3030] Fix CLI docs (apache#3872) [AIRFLOW-XXX] Update kubernetes.rst docs (apache#3875) Update kubernetes.rst with correct KubernetesPodOperator inputs for the volumes. [AIRFLOW-XXX] Add Enigma to list of companies [AIRFLOW-2965] CLI tool to show the next execution datetime Cover different cases - schedule_interval is "@once" or None, then following_schedule method would always return None - If dag is paused, print reminder - If latest_execution_date is not found, print warning saying not applicable. [AIRFLOW-XXX] Add Bombora Inc using Airflow [AIRFLOW-XXX] Move Dag level access control out of 1.10 section (apache#3882) It isn't in 1.10 (and wasn't in this section when the PR was created). [AIRFLOW-3012] Fix Bug when passing emails for SLA [AIRFLOW-2797] Create Google Dataproc cluster with custom image (apache#3871) [AIRFLOW-XXX] Updated README to include CAVA [AIRFLOW-3035] Allow custom 'job_error_states' in dataproc ops (apache#3884) Allow caller to pass in custom list of Dataproc job states into the DataProc*Operator classes that should result in the _DataProcJob.raise_error() method raising an Exception. 
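The `silent_on_errors` behaviour described in the AIRFLOW-2985 entry above can be sketched against a hand-written response dictionary. This is an illustration under stated assumptions, not the operator's code: `check_delete_response` is a hypothetical helper, the response is a fake shaped like the 'Deleted' and 'Errors' entries the entry mentions, and raising `RuntimeError` is a placeholder for however the real operator fails.

```python
def check_delete_response(response, silent_on_errors):
    """Hypothetical helper: return deleted keys, optionally failing on 'Errors'."""
    errors = response.get("Errors", [])
    if errors and not silent_on_errors:
        failed_keys = [error["Key"] for error in errors]
        raise RuntimeError("Errors when deleting: {}".format(failed_keys))
    return [item["Key"] for item in response.get("Deleted", [])]


# Fake response: the request itself succeeded, but one key is reported under 'Errors'.
fake_response = {
    "Deleted": [{"Key": "a.csv"}],
    "Errors": [{"Key": "b.csv", "Code": "AccessDenied", "Message": "Access Denied"}],
}

# With silent_on_errors=True the 'Errors' entry is ignored.
deleted = check_delete_response(fake_response, silent_on_errors=True)

# With silent_on_errors=False the same response fails the caller.
try:
    check_delete_response(fake_response, silent_on_errors=False)
    raised = False
except RuntimeError:
    raised = True
```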
[AIRFLOW-3034]: Readme updates : Add Slack & Twitter, remove Gitter [AIRFLOW-3056] Add happn to Airflow user list [AIRFLOW-3052] Add logo options to Airflow (apache#3892) [AIRFLOW-2524] Add SageMaker Batch Inference (apache#3767) * Fix for comments * Fix sensor test * Update non_terminal_states and failed_states to static variables of SageMakerHook Add SageMaker Transform Operator & Sensor Co-authored-by: srrajeev-aws <[email protected]> [AIRFLOW-XXX] Added Jeitto as one of happy Airflow users! (apache#3902) [AIRFLOW-XXX] Add Jeitto as one happy Airflow user! [AIRFLOW-3044] Dataflow operators accept templated job_name param (apache#3887) * Default value of new job_name param is templated task_id, to match the existing behavior as much as possible. * Change expected value in test_mlengine_operator_utils.py to match default for new job_name param. [AIRFLOW-2707] Validate task_log_reader on upgrade from <=1.9 (apache#3881) We changed the default logging config and config from 1.9 to 1.10, but anyone who upgrades and has an existing airflow.cfg won't know they need to change this value - instead they will get nothing displayed in the UI (ajax request fails) and see "'NoneType' object has no attribute 'read'" in the error log. This validates that config section at start up, and seamlessly upgrades the old previous value. [AIRFLOW-3025] Enable specifying dns and dns_search options for DockerOperator (apache#3860) Enable specifying dns and dns_search options for DockerOperator [AIRFLOW-1298] Clear UPSTREAM_FAILED using the clean cli (apache#3886) * [AIRFLOW-1298] Fix 'clear only_failed' * [AIRFLOW-1298] Fix 'clear only_failed' [AIRFLOW-3059] Log how many rows are read from Postgres (apache#3905) To know how many data is being read from Postgres, it is nice to log this to the Airflow log. Previously when there was no data, it would still create a single file. This is not something that we want, and therefore we've changed this behaviour. 
Refactored the tests to make use of Postgres itself since we have it running. This makes the tests more realistic, instead of mocking everything. [AIRFLOW-XXX] Fix typo in docs/timezone.rst (apache#3904) [AIRFLOW-3068] Remove deprecated imports [AIRFLOW-3036] Add relevant ECS options to ECS operator. (apache#3908) The ECS operator currently supports only a subset of available options for running ECS tasks. This patch adds all ECS options that could be relevant to airflow; options that wouldn't make sense here, like `count`, were skipped. [AIRFLOW-1195] Add feature to clear tasks in Parent Dag (apache#3907) [AIRFLOW-3073] Add note-Profiling feature not supported in new webserver (apache#3909) Adhoc queries and Charts features are no longer supported in new FAB-based webserver and UI. But this is not mentioned at all in the doc "Data Profiling" (https://airflow.incubator.apache.org/profiling.html) This commit adds a note to remind users for this. [AIRFLOW-XXX] Fix SlackWebhookOperator docs (apache#3915) The docs refer to `conn_id` while the actual argument is `http_conn_id`. [AIRFLOW-1441] Fix inconsistent tutorial code (apache#2466) [AIRFLOW-XXX] Add 90 Seconds to companies [AIRFLOW-3096] Further reduce DaysUntilStale for probo/stale [AIRFLOW-3072] Assign permission get_logs_with_metadata to viewer role (apache#3913) [AIRFLOW-3090] Demote dag start/stop log messages to debug (apache#3920) [AIRFLOW-2407] Use feature detection for reload() (apache#3298) * [AIRFLOW-2407] Use feature detection for reload() [Use feature detection instead of version detection](https://docs.python.org/3/howto/pyporting.html#use-feature-detection-instead-of-version-detection) is a Python porting best practice that avoids a flake8 undefined name error... 
flake8 testing of https://github.com/apache/incubator-airflow on Python 3.6.3 [AIRFLOW-XXX] Fix a wrong sample bash command, a display issue & a few typos (apache#3924) [AIRFLOW-3090] Make No tasks to consider for execution debug (apache#3923) During normal operation, it is not necessary to see the message. This can only be useful when debugging an issue. AIRFLOW-2952 Fix Kubernetes CI (apache#3922) The current dockerised CI pipeline doesn't run minikube and the Kubernetes integration tests. This starts a Kubernetes cluster using minikube and runs k8s integration tests using docker-compose. [AIRFLOW-2918] Fix Flake8 violations (apache#3931) [AIRFLOW-3076] Remove preloading of MySQL testdata (apache#3911) One of the things for tests is being self contained. This means that it should not depend on anything external, such as loading data. This PR will use the setUp and tearDown to load the data into MySQL and remove it afterwards. This removes the actual bash mysql commands and will make it easier to dockerize the whole testsuite in the future [AIRFLOW-2918] Remove unused imports [AIRFLOW-3090] Specify path of key file in log message (apache#3921) [AIRFLOW-3067] Display www_rbac Flask flash msg properly (apache#3903) The Flask flash messages are not displayed properly. When we don't give a category for a flash message, the default value will be 'message'. In some cases, we specify the 'error' category. Using Flask-AppBuilder, the flash message will be given a CSS class 'alert-[category]'. But we don't have 'alert-message' or 'alert-error' in the current 'bootstrap-theme.css' file. This makes the flash messages in the www_rbac UI come with no background color. This commit addresses this issue by adding 'alert-message' (using specs of existing CSS class 'alert-info') and 'alert-error' (using specs of existing CSS class 'alert-danger') into 'bootstrap-theme.css'.
[AIRFLOW-3109] Bugfix to allow user/op roles to clear task intance via UI by default add show statements to hql filtering. [AIRFLOW-3051] Change CLI to make users ops similar to connections The ability to manipulate users from the command line is a bit clunky. Currently 'airflow create_user' and 'airflow delete_user' and 'airflow list_users'. It seems that these ought to be made more like connections, so that it becomes 'airflow users list ...', 'airflow users delete ...' and 'airflow users create ...' [AIRFLOW-3009] Import Hashable from collection.abc to fix Python 3.7 deprecation warning (apache#3849) [AIRFLOW-3111] Fix instructions in UPDATING.md and remove comment (apache#3944) artifacts in default_airflow.cfg - fixed incorrect instructions in UPDATING.md regarding core.log_filename_template and elasticsearch.elasticsearch_log_id_template - removed comments referencing "additional curly braces" from default_airflow.cfg since they're irrelevant to the rendered airflow.cfg [AIRFLOW-3117] Add instructions to allow GPL dependency (apache#3949) The installation instructions failed to mention how to proceed with the GPL dependency. For those who are not concerned by GPL, it is useful to know how to proceed with GPL dependency. [AIRFLOW-XXX] Add Square to the companies lists [AIRFLOW-XXX] Add Fathom Health to readme [AIRFLOW-XXX] Pin Click to 6.7 to Fix CI (apache#3962) [AIRFLOW-XXX] Fix SlackWebhookOperator execute method comment (apache#3963) [AIRFLOW-3100][AIRFLOW-3101] Improve docker compose local testing (apache#3933) [AIRFLOW-3127] Fix out-dated doc for Celery SSL (apache#3967) Now in `airflow.cfg`, for Celery-SSL, the item names are "ssl_active", "ssl_key", "ssl_cert", and "ssl_cacert". 
(since PR https://github.com/apache/incubator-airflow/pull/2806/files) But in the documentation https://airflow.incubator.apache.org/security.html?highlight=celery or https://github.com/apache/incubator-airflow/blob/master/docs/security.rst, it's "CELERY_SSL_ACTIVE", "CELERY_SSL_KEY", "CELERY_SSL_CERT", and "CELERY_SSL_CACERT", which is out-dated and may confuse readers. [AIRFLOW-XXX] Fix PythonVirtualenvOperator tests (apache#3968) The recent update to the CI image changed the default python from python2 to python3. The PythonVirtualenvOperator tests expected python2 as default and fail due to serialisation errors. [AIRFLOW-2952] Fix Kubernetes CI (apache#3957) - Update outdated cli command to create user - Remove `airflow/example_dags_kubernetes` as the dag already exists in `contrib/example_dags/` - Update the path to copy K8s dags [AIRFLOW-3104] Add .airflowignore info into doc (apache#3939) .airflowignore is a nice feature, but it was not mentioned at all in the documentation. [AIRFLOW-XXX] Add Delete for CLI Example in UPDATING.md [AIRFLOW-3123] Use a stack for DAG context management (apache#3956) [AIRFLOW-3125] Monitor Task Instances creation rates (apache#3966) Monitor Task Instances creation rates by Operator type. These stats can provide some visibility on how much workload Airflow is getting. They can be used for resource allocation in the long run (i.e. to determine when we should scale up workers) and debugging in scenarios like the creation rate of a certain type of Task Instance spiking. [AIRFLOW-3129] Backfill mysql hook unit tests. (apache#3970) [AIRFLOW-3124] Fix RBAC webserver debug mode (apache#3958) [AIRFLOW-XXX] Add Compass to companies list (apache#3972) We're using Airflow at Compass now. [AIRFLOW-XXX] Speed up DagBagTest cases (apache#3974) I noticed that many of the tests of DagBags operate on a specific DAG only, and don't need to load the example or test dags. Not loading the dags we don't need shaves about 10-20s off the test time.
[AIRFLOW-2912] Add Deploy and Delete operators for GCF (apache#3969) Both Deploy and Delete operators interact with Google Cloud Functions to manage functions. Both are idempotent and make use of GcfHook - hook that encapsulates communication with GCP over GCP API. [AIRFLOW-1390] Update Alembic to 0.9 (apache#3935) [AIRFLOW-2238] Update PR tool to remove outdated info (apache#3978) [AIRFLOW-XXX] Don't spam test logs with "bad cron expression" messages (apache#3973) We needed these test dags to check the behaviour of invalid cron expressions, but by default we were loading them every time we create a DagBag (which many, many tests do). Instead we ignore these known-bad dags by default, and the test checking those (tests/models.py:DagBagTest.test_process_file_cron_validity_check) is already explicitly processing those DAGs directly, so it remains tested. [AIRFLOW-XXX] Fix undocumented params in S3_hook Some function parameters were undocumented. Additional docstrings were added for clarity. [AIRFLOW-3079] Improve migration scripts to support MSSQL Server (apache#3964) There were two problems for MSSQL. First, the 'timestamp' data type in MSSQL Server is essentially a row-id, and not a timezone enabled date/time stamp. Second, alembic creates invalid SQL when applying the 0/1 constraint to boolean values. MSSQL should enforce this constraint by simply asserting a boolean value. [AIRFLOW-XXX] Add DoorDash to README.md (apache#3980) DoorDash uses Airflow https://softwareengineeringdaily.com/2018/09/28/doordash/ [AIRFLOW-3062] Add Qubole in integration docs (apache#3946) [AIRFLOW-3129] Improve test coverage of airflow.models. (apache#3982) [AIRFLOW-2574] Cope with '%' in SQLA DSN when running migrations (apache#3787) Alembic uses a ConfigParser like Airflow does, and "%" is a special value in there, so we need to escape it. As per the Alembic docs: > Note that this value is passed to ConfigParser.set, which supports > variable interpolation using pyformat (e.g.
`%(some_value)s`). A raw > percent sign not part of an interpolation symbol must therefore be > escaped, e.g. `%%` [AIRFLOW-3137] Make ProxyFix middleware optional. (apache#3983) The ProxyFix middleware should only be used when airflow is running behind a trusted proxy. This patch adds a `USE_PROXY_FIX` flag that defaults to `False`. [AIRFLOW-3004] Add config disabling scheduler cron (apache#3899) [AIRFLOW-3103][AIRFLOW-3147] Update flask-appbuilder (apache#3937) [AIRFLOW-XXX] Fixing the issue in Documentation (apache#3998) Fixing the operator name from DataFlowOperation to DataFlowJavaOperator in Documentation [AIRFLOW-3088] Include slack-compatible emoji image [AIRFLOW-3161] fix TaskInstance log link in RBAC UI [AIRFLOW-3148] Remove unnecessary arg "parameters" in RedshiftToS3Transfer (apache#3995) "Parameters" are used to help render the SQL command. But in this operator, only "schema" and "table" are needed. There is no SQL command to render. By checking the code,we can also find argument "parameters" is never really used. (Fix a minor issue in the docstring as well) [AIRFLOW-3159] Update GCS logging docs for latest code (apache#3952) [AIRFLOW-XXX] Fix airflow.models.DAG docstring mistake Closes apache#4004 from Sambeth/sambeth [AIRFLOW-XXX] Adding Home Depot as users of Apache airflow (apache#4013) * Adding Home Depot as users of Apache airflow [AIRFLOW-XXX] Added ThoughtWorks as user of Airflow in README (apache#4012) [AIRFLOW-XXX] Added DataCamp to list of companies in README (apache#4009) [AIRFLOW-3165] Document interpolation of '%' and warn (apache#4007) [AIRFLOW-3099] Complete list of optional airflow.cfg sections (apache#4002) [AIRFLOW-3162] Fix HttpHook URL parse error when port is specified (apache#4001) [AIRFLOW-3055] add get_dataset and get_datasets_list to bigquery_hook (apache#3894) * [AIRFLOW-3055] add get_dataset and get_datasets_list to bigquery_hook [AIRFLOW-3141] Add missing missing sensor tests. 
(apache#3991) Fixed string encoding error and updated with master. [AIRFLOW-XXX] Fix wrong {{ next_ds }} description (apache#4017) [AIRFLOW-XXX] Fix Typo in SFTPOperator docstring (apache#4016) [AIRFLOW-XXX] Remove residual line in Changelog (apache#3814) [AIRFLOW-2930] Fix celery excecutor scheduler crash (apache#3784) Caused by an update in PR apache#3740. execute_command.apply_async(args=command, ...) -command is a list of short unicode strings and the above code passes multiple arguments to a function defined as taking only one argument. -command = ["airflow", "run", "dag323",...] -args = command = ["airflow", "run", "dag323", ...] -execute_command("airflow","run","dag323", ...) will raise an error and exit. [AIRFLOW-2854] kubernetes_pod_operator add more configuration items (apache#3697) * kubernetes_pod_operator add more configuration items * fix test_kubernetes_pod_operator test_faulty_service_account failure case * fix review comment issues * pod_operator add hostnetwork config * add doc example [AIRFLOW-2994] Fix command status check in Qubole Check operator (apache#3790) [AIRFLOW-2949] Add syntax highlight for single quote strings (apache#3795) * AIRFLOW-2949: Add syntax highlight for single quote strings * AIRFLOW-2949: Also updated new UI main.css [AIRFLOW-2948] Arg check & better doc - SSHOperator & SFTPOperator (apache#3793) There may be different combinations of arguments, and some processing is being done 'silently', while users may not be fully aware of it. For example - User only needs to provide either `ssh_hook` or `ssh_conn_id`, while this is not clear in doc - if both provided, `ssh_conn_id` will be ignored. - if `remote_host` is provided, it will replace the `remote_host` which was defined in `ssh_hook` or predefined in the connection of `ssh_conn_id` These should be documented clearly to ensure it's transparent to the users. log.info() should also be used to remind users and provide clear logs.
In addition, add instance check for ssh_hook to ensure it is of the correct type (SSHHook). Tests are updated for this PR. [AIRFLOW-XXX] Fix Broken Link in CONTRIBUTING.md [AIRFLOW-2980] ReadTheDocs - Fix Missing API Reference [AIRFLOW-2779] Make GHE auth third party licensed (apache#3803) This reinstates the original license. [AIRFLOW-XXX] Add Format to list of companies (apache#3824) [AIRFLOW-2900] Show code for packaged DAGs (apache#3749) [AIRFLOW-2974] Extended Databricks hook with clusters operation (apache#3817) Add hooks for: - cluster start, - restart, - terminate. Add unit tests for the added hooks. Add hooks for cluster start, restart and terminate. Add unit tests for the added hooks. Add cluster_id variable for performing cluster operation tests. [AIRFLOW-2951] Update dag_run table end_date when state change (apache#3798) The existing airflow only changes the dag_run table end_date value when a user terminates a dag in the web UI. The end_date will not be updated if airflow detects that a dag finished and updates its state. This commit adds an end_date update in DagRun's set_state function to fix the problem mentioned above. [AIRFLOW-2145] fix deadlock on clearing running TI (apache#3657) a `shutdown` task is not considered to be `unfinished`, so a dag run can deadlock when all `unfinished` downstreams are waiting on a task that's in the `shutdown` state. fix this by considering `shutdown` to be `unfinished`, since it's not truly a terminal state [AIRFLOW-XXX] Fix typo in docstring of gcs_to_bq (apache#3833) [AIRFLOW-2476] Allow tabulate up to 0.8.2 (apache#3835) [AIRFLOW-XXX] Fix typos in faq.rst (apache#3837) [AIRFLOW-2979] Make celery_result_backend conf Backwards compatible (apache#3832) (apache#2806) Renamed `celery_result_backend` to `result_backend` and broke backwards compatibility.
[AIRFLOW-2866] Fix missing CSRF token head when using RBAC UI (apache#3804) [AIRFLOW-3007] Update backfill example in Scheduler docs The scheduler docs at https://airflow.apache.org/scheduler.html#backfill-and-catchup use deprecated way of passing `schedule_interval`. `schedule_interval` should be pass to DAG as a separate parameter and not as a default arg. [AIRFLOW-3005] Replace 'Airbnb Airflow' with 'Apache Airflow' (apache#3845) [AIRFLOW-3002] Fix variable & tests in GoogleCloudBucketHelper (apache#3843) [AIRFLOW-2991] Log path to driver output after Dataproc job (apache#3827) [AIRFLOW-XXX] Fix python3 and flake8 errors in dev/airflow-jira This is a script that checks if the Jira's marked as fixed in a release are actually merged in - getting this working is helpful to me in preparing 1.10.1 [AIRFLOW-2883] Add import and export for pool cli using JSON [AIRFLOW-3021] Add Censys to who uses Airflow list > Censys > Find and analyze every reachable server and device on the Internet > https://censys.io/ closes AIRFLOW-3021 https://issues.apache.org/jira/browse/AIRFLOW-3021 Add Branch to Company List [AIRFLOW-3008] Move Kubernetes example DAGs to contrib [AIRFLOW-2997] Support cluster fields in bigquery (apache#3838) This adds a cluster_fields argument to the bigquery hook, GCS to bigquery operator and bigquery query operators. This field requests that bigquery store the result of the query/load operation sorted according to the specified fields (the order of fields given is significant). [AIRFLOW-XXX] Redirect FAQ `airflow[crypto]` to How-to Guides. [AIRFLOW-XXX] Remove redundant space in Kerberos (apache#3866) [AIRFLOW-3028] Update Text & Images in Readme.md [AIRFLOW-1917] Trim extra newline and trailing whitespace from log (apache#3862) [AIRFLOW-2985] Operators for S3 object copying/deleting (apache#3823) 1. Copying: Under the hood, it's `boto3.client.copy_object()`. 
It can only handle the situation in which the S3 connection used can access both source and destination bucket/key. 2. Deleting: 2.1 Under the hood, it's `boto3.client.delete_objects()`. It supports either deleting one single object or multiple objects. 2.2 If users try to delete a non-existent object, the request will still succeed, but there will be an entry 'Errors' in the response. There may also be other reasons which may cause similar 'Errors' ( request itself would succeed without explicit exception). So an argument `silent_on_errors` is added to let users decide if this sort of 'Errors' should fail the operator. The corresponding methods are added into S3Hook, and these two operators are 'wrappers' of these methods. [AIRFLOW-3030] Fix CLI docs (apache#3872) [AIRFLOW-XXX] Update kubernetes.rst docs (apache#3875) Update kubernetes.rst with correct KubernetesPodOperator inputs for the volumes. [AIRFLOW-XXX] Add Enigma to list of companies [AIRFLOW-2965] CLI tool to show the next execution datetime Cover different cases - schedule_interval is "@once" or None, then following_schedule method would always return None - If dag is paused, print reminder - If latest_execution_date is not found, print warning saying not applicable. [AIRFLOW-XXX] Add Bombora Inc using Airflow [AIRFLOW-XXX] Move Dag level access control out of 1.10 section (apache#3882) It isn't in 1.10 (and wasn't in this section when the PR was created). [AIRFLOW-3012] Fix Bug when passing emails for SLA [AIRFLOW-2797] Create Google Dataproc cluster with custom image (apache#3871) [AIRFLOW-XXX] Updated README to include CAVA [AIRFLOW-3035] Allow custom 'job_error_states' in dataproc ops (apache#3884) Allow caller to pass in custom list of Dataproc job states into the DataProc*Operator classes that should result in the _DataProcJob.raise_error() method raising an Exception. 
[AIRFLOW-3034]: Readme updates : Add Slack & Twitter, remove Gitter [AIRFLOW-3056] Add happn to Airflow user list [AIRFLOW-3052] Add logo options to Airflow (apache#3892) [AIRFLOW-2524] Add SageMaker Batch Inference (apache#3767) * Fix for comments * Fix sensor test * Update non_terminal_states and failed_states to static variables of SageMakerHook Add SageMaker Transform Operator & Sensor Co-authored-by: srrajeev-aws <[email protected]> [AIRFLOW-XXX] Added Jeitto as one of happy Airflow users! (apache#3902) [AIRFLOW-XXX] Add Jeitto as one happy Airflow user! [AIRFLOW-3044] Dataflow operators accept templated job_name param (apache#3887) * Default value of new job_name param is templated task_id, to match the existing behavior as much as possible. * Change expected value in test_mlengine_operator_utils.py to match default for new job_name param. [AIRFLOW-2707] Validate task_log_reader on upgrade from <=1.9 (apache#3881) We changed the default logging config and config from 1.9 to 1.10, but anyone who upgrades and has an existing airflow.cfg won't know they need to change this value - instead they will get nothing displayed in the UI (ajax request fails) and see "'NoneType' object has no attribute 'read'" in the error log. This validates that config section at start up, and seamlessly upgrades the old previous value. [AIRFLOW-3025] Enable specifying dns and dns_search options for DockerOperator (apache#3860) Enable specifying dns and dns_search options for DockerOperator [AIRFLOW-1298] Clear UPSTREAM_FAILED using the clean cli (apache#3886) * [AIRFLOW-1298] Fix 'clear only_failed' * [AIRFLOW-1298] Fix 'clear only_failed' [AIRFLOW-3059] Log how many rows are read from Postgres (apache#3905) To know how many data is being read from Postgres, it is nice to log this to the Airflow log. Previously when there was no data, it would still create a single file. This is not something that we want, and therefore we've changed this behaviour. 
Refactored the tests to make use of Postgres itself since we have it running. This makes the tests more realistic, instead of mocking everything. [AIRFLOW-XXX] Fix typo in docs/timezone.rst (apache#3904) [AIRFLOW-3068] Remove deprecated imports [AIRFLOW-3036] Add relevant ECS options to ECS operator. (apache#3908) The ECS operator currently supports only a subset of available options for running ECS tasks. This patch adds all ECS options that could be relevant to airflow; options that wouldn't make sense here, like `count`, were skipped. [AIRFLOW-1195] Add feature to clear tasks in Parent Dag (apache#3907) [AIRFLOW-3073] Add note-Profiling feature not supported in new webserver (apache#3909) Adhoc queries and Charts features are no longer supported in new FAB-based webserver and UI. But this is not mentioned at all in the doc "Data Profiling" (https://airflow.incubator.apache.org/profiling.html) This commit adds a note to remind users for this. [AIRFLOW-XXX] Fix SlackWebhookOperator docs (apache#3915) The docs refer to `conn_id` while the actual argument is `http_conn_id`. [AIRFLOW-1441] Fix inconsistent tutorial code (apache#2466) [AIRFLOW-XXX] Add 90 Seconds to companies [AIRFLOW-3096] Further reduce DaysUntilStale for probo/stale [AIRFLOW-3072] Assign permission get_logs_with_metadata to viewer role (apache#3913) [AIRFLOW-3090] Demote dag start/stop log messages to debug (apache#3920) [AIRFLOW-2407] Use feature detection for reload() (apache#3298) * [AIRFLOW-2407] Use feature detection for reload() [Use feature detection instead of version detection](https://docs.python.org/3/howto/pyporting.html#use-feature-detection-instead-of-version-detection) is a Python porting best practice that avoids a flake8 undefined name error... 
flake8 testing of https://github.com/apache/incubator-airflow on Python 3.6.3 [AIRFLOW-XXX] Fix a wrong sample bash command, a display issue & a few typos (apache#3924) [AIRFLOW-3090] Make No tasks to consider for execution debug (apache#3923) During normal operation, it is not necessary to see the message. This can only be useful when debugging an issue. AIRFLOW-2952 Fix Kubernetes CI (apache#3922) The current dockerised CI pipeline doesn't run minikube and the Kubernetes integration tests. This starts a Kubernetes cluster using minikube and runs k8s integration tests using docker-compose. [AIRFLOW-2918] Fix Flake8 violations (apache#3931) [AIRFLOW-3076] Remove preloading of MySQL testdata (apache#3911) One of the things for tests is being self contained. This means that it should not depend on anything external, such as loading data. This PR will use the setUp and tearDown to load the data into MySQL and remove it afterwards. This removes the actual bash mysql commands and will make it easier to dockerize the whole testsuite in the future [AIRFLOW-2918] Remove unused imports [AIRFLOW-3090] Specify path of key file in log message (apache#3921) [AIRFLOW-3067] Display www_rbac Flask flash msg properly (apache#3903) The Flask flash messages are not displayed properly. When we don't give a category for a flash message, defautl value will be 'message'. In some cases, we specify 'error' category. Using Flask-AppBuilder, the flash message will be given a CSS class 'alert-[category]'. But We don't have 'alert-message' or 'alert-error' in the current 'bootstrap-theme.css' file. This makes the the flash messages in www_rbac UI come with no background color. This commit addresses this issue by adding 'alert-message' (using specs of existing CSS class 'alert-info') and 'alert-error' (using specs of existing CSS class 'alert-danger') into 'bootstrap-theme.css'. 
[AIRFLOW-3109] Bugfix to allow user/op roles to clear task intance via UI by default add show statements to hql filtering. [AIRFLOW-3051] Change CLI to make users ops similar to connections The ability to manipulate users from the command line is a bit clunky. Currently 'airflow create_user' and 'airflow delete_user' and 'airflow list_users'. It seems that these ought to be made more like connections, so that it becomes 'airflow users list ...', 'airflow users delete ...' and 'airflow users create ...' [AIRFLOW-3009] Import Hashable from collection.abc to fix Python 3.7 deprecation warning (apache#3849) [AIRFLOW-3111] Fix instructions in UPDATING.md and remove comment (apache#3944) artifacts in default_airflow.cfg - fixed incorrect instructions in UPDATING.md regarding core.log_filename_template and elasticsearch.elasticsearch_log_id_template - removed comments referencing "additional curly braces" from default_airflow.cfg since they're irrelevant to the rendered airflow.cfg [AIRFLOW-3117] Add instructions to allow GPL dependency (apache#3949) The installation instructions failed to mention how to proceed with the GPL dependency. For those who are not concerned by GPL, it is useful to know how to proceed with GPL dependency. [AIRFLOW-XXX] Add Square to the companies lists [AIRFLOW-XXX] Add Fathom Health to readme [AIRFLOW-XXX] Pin Click to 6.7 to Fix CI (apache#3962) [AIRFLOW-XXX] Fix SlackWebhookOperator execute method comment (apache#3963) [AIRFLOW-3100][AIRFLOW-3101] Improve docker compose local testing (apache#3933) [AIRFLOW-3127] Fix out-dated doc for Celery SSL (apache#3967) Now in `airflow.cfg`, for Celery-SSL, the item names are "ssl_active", "ssl_key", "ssl_cert", and "ssl_cacert". 
(since PR https://github.com/apache/incubator-airflow/pull/2806/files) But in the documentation https://airflow.incubator.apache.org/security.html?highlight=celery or https://github.com/apache/incubator-airflow/blob/master/docs/security.rst, it's "CELERY_SSL_ACTIVE", "CELERY_SSL_KEY", "CELERY_SSL_CERT", and "CELERY_SSL_CACERT", which is out-dated and may confuse readers. [AIRFLOW-XXX] Fix PythonVirtualenvOperator tests (apache#3968) The recent update to the CI image changed the default python from python2 to python3. The PythonVirtualenvOperator tests expected python2 as default and fail due to serialisation errors. [AIRFLOW-2952] Fix Kubernetes CI (apache#3957) - Update outdated cli command to create user - Remove `airflow/example_dags_kubernetes` as the dag already exists in `contrib/example_dags/` - Update the path to copy K8s dags [AIRFLOW-3104] Add .airflowignore info into doc (apache#3939) .airflowignore is a nice feature, but it was not mentioned at all in the documentation. [AIRFLOW-XXX] Add Delete for CLI Example in UPDATING.md [AIRFLOW-3123] Use a stack for DAG context management (apache#3956) [AIRFLOW-3125] Monitor Task Instances creation rates (apache#3966) Montor Task Instances creation rates by Operator type. These stats can provide some visibility on how much workload Airflow is getting. They can be used for resource allocation in the long run (i.e. to determine when we should scale up workers) and debugging in scenarios like the creation rate of certain type of Task Instances spikes. [AIRFLOW-3129] Backfill mysql hook unit tests. (apache#3970) [AIRFLOW-3124] Fix RBAC webserver debug mode (apache#3958) [AIRFLOW-XXX] Add Compass to companies list (apache#3972) We're using Airflow at Compass now. [AIRFLOW-XXX] Speed up DagBagTest cases (apache#3974) I noticed that many of the tests of DagBags operate on a specific DAG only, and don't need to load the example or test dags. By not loading the dags we don't need to this shaves about 10-20s of test time. 
[AIRFLOW-2912] Add Deploy and Delete operators for GCF (apache#3969) Both Deploy and Delete operators interact with Google Cloud Functions to manage functions. Both are idempotent and make use of GcfHook - hook that encapsulates communication with GCP over GCP API. [AIRFLOW-1390] Update Alembic to 0.9 (apache#3935) [AIRFLOW-2238] Update PR tool to remove outdated info (apache#3978) [AIRFLOW-XXX] Don't spam test logs with "bad cron expression" messages (apache#3973) We needed these test dags to check the behaviour of invalid cron expressions, but by default we were loading them every time we create a DagBag (which many, many tests to). Instead we ignore these known-bad dags by default, and the test checking those (tests/models.py:DagBagTest.test_process_file_cron_validity_check) is already explicitly processing those DAGs directly, so it remains tested. [AIRFLOW-XXX] Fix undocumented params in S3_hook Some function parameters were undocumented. Additional docstrings were added for clarity. [AIRFLOW-3079] Improve migration scripts to support MSSQL Server (apache#3964) There were two problems for MSSQL. First, 'timestamp' data type in MSSQL Server is essentially a row-id, and not a timezone enabled date/time stamp. Second, alembic creates invalid SQL when applying the 0/1 constraint to boolean values. MSSQL should enforce this constraint by simply asserting a boolean value. [AIRFLOW-XXX] Add DoorDash to README.md (apache#3980) DoorDash uses Airflow https://softwareengineeringdaily.com/2018/09/28/doordash/ [AIRFLOW-3062] Add Qubole in integration docs (apache#3946) [AIRFLOW-3129] Improve test coverage of airflow.models. (apache#3982) [AIRFLOW-2574] Cope with '%' in SQLA DSN when running migrations (apache#3787) Alembic uses a ConfigParser like Airflow does, and "%% is a special value in there, so we need to escape it. As per the Alembic docs: > Note that this value is passed to ConfigParser.set, which supports > variable interpolation using pyformat (e.g. 
`%(some_value)s`). A raw > percent sign not part of an interpolation symbol must therefore be > escaped, e.g. `%%` [AIRFLOW-3137] Make ProxyFix middleware optional. (apache#3983) The ProxyFix middleware should only be used when airflow is running behind a trusted proxy. This patch adds a `USE_PROXY_FIX` flag that defaults to `False`. [AIRFLOW-3004] Add config disabling scheduler cron (apache#3899) [AIRFLOW-3103][AIRFLOW-3147] Update flask-appbuilder (apache#3937) [AIRFLOW-XXX] Fixing the issue in Documentation (apache#3998) Fixing the operator name from DataFlowOperation to DataFlowJavaOperator in Documentation [AIRFLOW-3088] Include slack-compatible emoji image [AIRFLOW-3161] fix TaskInstance log link in RBAC UI [AIRFLOW-3148] Remove unnecessary arg "parameters" in RedshiftToS3Transfer (apache#3995) "Parameters" are used to help render the SQL command. But in this operator, only "schema" and "table" are needed. There is no SQL command to render. By checking the code,we can also find argument "parameters" is never really used. (Fix a minor issue in the docstring as well) [AIRFLOW-3159] Update GCS logging docs for latest code (apache#3952) [AIRFLOW-XXX] Fix airflow.models.DAG docstring mistake Closes apache#4004 from Sambeth/sambeth [AIRFLOW-XXX] Adding Home Depot as users of Apache airflow (apache#4013) * Adding Home Depot as users of Apache airflow [AIRFLOW-XXX] Added ThoughtWorks as user of Airflow in README (apache#4012) [AIRFLOW-XXX] Added DataCamp to list of companies in README (apache#4009) [AIRFLOW-3165] Document interpolation of '%' and warn (apache#4007) [AIRFLOW-3099] Complete list of optional airflow.cfg sections (apache#4002) [AIRFLOW-3162] Fix HttpHook URL parse error when port is specified (apache#4001) [AIRFLOW-3055] add get_dataset and get_datasets_list to bigquery_hook (apache#3894) * [AIRFLOW-3055] add get_dataset and get_datasets_list to bigquery_hook [AIRFLOW-3141] Add missing missing sensor tests. 
(apache#3991) [AIRFLOW-XXX] Fix wrong {{ next_ds }} description (apache#4017) [AIRFLOW-XXX] Fix Typo in SFTPOperator docstring (apache#4016) Addressed changes from comments made in the PR. [AIRFLOW-3139] include parameters into log.info in SQL operators, if any (apache#3986) For all SQL-operators based on DbApiHook, sql command itself is printed into log.info. But if parameters are used for the sql command, the parameters would not be included in the printing. This makes the log less useful. This commit ensures that the parameters are also printed into the log.info, if any. [AIRFLOW-XXX] Include Danamica in list of companies using Airflow (apache#4019) [AIRFLOW-XXX] Update manage-connections.rst (apache#4020) Explain how to connect with MySQL [AIRFLOW-XXX] Add CarLabs to companies list (apache#4021) [AIRFLOW-3175] Fix docstring format in airflow/jobs.py (apache#4025) These docstrings could not parsed properly in Sphinx syntax [AIRFLOW-3086] Add extras group for google auth to setup.py. (apache#3917) To clarify installation instructions for the google auth backend, add an install group to `setup.py` that installs dependencies google auth via `pip install apache-airflow[google_auth]`. [AIRFLOW-XXX] Include Pagar.me in list of users of Airflow (apache#4026) [AIRFLOW-3173] Add _cmd options for password config options (apache#4024) There were a few more "password" config options added over the last few months that didn't have _cmd options. Any config option that is a password should be able to be provided via a _cmd version. [AIRFLOW-3078] Basic operators for Google Compute Engine (apache#4022) Add GceInstanceStartOperator, GceInstanceStopOperator and GceSetMachineTypeOperator. 
Each operator includes: - core logic - input params validation - unit tests - presence in the example DAG - docstrings - How-to and Integration documentation Additionally, in GceHook error checking if response is 200 OK was added: Some types of errors are only visible in the response's "error" field and the overall HTTP response is 200 OK. That is why apart from checking if status is "done" we also check if "error" is empty, and if not an exception is raised with error message extracted from the "error" field of the response. In this commit we also separated out Body Field Validator to separate module in tools - this way it can be reused between various GCP operators, it has proven to be usable in at least two of them now. Co-authored-by: sprzedwojski <[email protected]> Co-authored-by: potiuk <[email protected]> [AIRFLOW-3168] More resillient database use in CI (apache#4014) Make sure mysql is available before calling it in CI [AIRFLOW-3177] Change scheduler_heartbeat from gauge to counter (apache#4027) This updates the scheduler_heartbeat metric from a gauge to a counter to better support the statsd_exporter for usage with Prometheus. A counter allows users to track the rate of the heartbeat, and integrates with the exporter better. A crashing or down scheduler will no longer emit the metric, but the statsd_exporter will continue to show a 1 for the metric value. This fixes that issue because a counter will continually change, and the lack of change indicates an issue with the scheduler. Add statsd change notice in UPDATING.md [AIRFLOW-2956] Add kubernetes tolerations (apache#3806) [AIRFLOW-3183] Fix bug in DagFileProcessorManager.max_runs_reached() (apache#4031) The condition is intended to ensure the function will return False if any file's run_count is still smaller than max_run. But the operator used here is "!=". Instead, it should be "<". This is because in DagFileProcessorManager, there is no statement helping limit the upper limit of run_count. 
It's possible that files' run_count will be bigger than max_run. In such case, max_runs_reached() method may fail its purpose. [AIRFLOW-3099] Don't ever warn about missing sections of config (apache#4028) Rather than looping through and setting each config variable individually, and having to know which sections are optional and which aren't, instead we can just call a single function on ConfigParser and it will read the config from the dict, and more importantly here, never error about missing sections - it will just create them as needed. [AIRFLOW-1837] Respect task start_date when different from dag's (apache#4010) Currently task instances get created and scheduled based on the DAG's start date rather than their own. This commit adds a check before creating a task instance to see that the start date is not after the execution date. [AIRFLOW-3089] Drop hard-coded url scheme in google auth redirect. (apache#3919) The google auth provider hard-codes the `_scheme` in the callback url to `https` so that airflow generates correct urls when run behind a proxy that terminates tls. But this means that google auth can't be used when running without https--for example, during local development. Also, hard-coding `_scheme` isn't the correct solution to the problem of running behind a proxy. Instead, the proxy should be configured to set the `X-Forwarded-Proto` header to `https`; Flask interprets this header and generates the appropriate callback url without hard-coding the scheme. [AIRFLOW-XXX] Add Grab to companies list (apache#4041) [AIRFLOW-3178] Handle percents signs in configs for airflow run (apache#4029) * [AIRFLOW-3178] Don't mask defaults() function from ConfigParser ConfigParser (the base class for AirflowConfigParser) expects defaults() to be a function - so when we re-assign it to be a property some of the methods from ConfigParser no longer work. 
* [AIRFLOW-3178] Correctly escape percent signs when creating temp config Otherwise we have a problem when we come to use those values. * [AIRFLOW-3178] Use os.chmod instead of shelling out There's no need to run another process for a built in Python function. This also removes a possible race condition that would make temporary config file be readable by more than the airflow or run-as user The exact behaviour would depend on the umask we run under, and the primary group of our user, likely this would mean the file was readably by members of the airflow group (which in most cases would be just the airflow user). To remove any such possibility we chmod the file before we write to it [AIRFLOW-2216] Use profile for AWS hook if S3 config file provided in aws_default connection extra parameters (apache#4011) Use profile for AWS hook if S3 config file provided in aws_default connection extra parameters Add test to validate profile set [AIRFLOW-3001] Add index 'ti_dag_date' to taskinstance (apache#3885) To optimize query performance [AIRFLOW-2794] Add WasbDeleteBlobOperator (apache#3961) Deleting Azure blob is now supported. Either single blobs can be deleted, or one can choose to supply a prefix, in which case one can match multiple blobs to be deleted. [AIRFLOW-3138] Use current data type for migrations (apache#3985) * Use timestamp instead of timestamp with timezone for migration. 
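The percent-sign handling described in the [AIRFLOW-3178] and [AIRFLOW-2574] entries above can be illustrated with the standard library alone. This is a minimal sketch, not the code from those commits: `escape_percent` is a hypothetical helper, and the section name, option name, and connection string are made up for the example. It assumes Python 3's `configparser` with its default interpolation.

```python
from configparser import ConfigParser


def escape_percent(value):
    # Hypothetical helper: double each '%' so interpolation treats it as literal.
    return value.replace("%", "%%")


parser = ConfigParser()
parser.add_section("core")

# Made-up connection string containing a URL-encoded '@' (%40).
raw_value = "mysql://user:p%40ss@localhost/airflow"

try:
    parser.set("core", "sql_alchemy_conn", raw_value)
    unescaped_accepted = True
except ValueError:
    # A bare '%' is rejected as invalid interpolation syntax.
    unescaped_accepted = False

# Escaping first lets the value be stored and read back unchanged.
parser.set("core", "sql_alchemy_conn", escape_percent(raw_value))
round_tripped = parser.get("core", "sql_alchemy_conn")
```

With default interpolation, the unescaped value is refused at `set()` time, while the escaped value reads back as the original string.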
[AIRFLOW-393] Add callback for FTP downloads (apache#2372) [AIRFLOW-3119] Enable debugging with Celery(apache#3950) This will enable --loglevel when launching a celery worker and inherit that LOGGING_LEVEL setting from airflow.cfg [AIRFLOW-3112] Make SFTP hook to inherit SSH hook (apache#3945) This is to aline the arguments of SFTP hook with SSH hook [AIRFLOW-3195] Log query and task_id in druid-hook (apache#4018) Log query and task_id in druid-hook [AIRFLOW-3187] Update airflow.gif file with a slower version (apache#4033) [AIRFLOW-2789] Create single node DataProc cluster (apache#4015) Create single node cluster - infer from num_workers
* Fix for comments * Fix sensor test * Update non_terminal_states and failed_states to static variables of SageMakerHook Add SageMaker Transform Operator & Sensor Co-authored-by: srrajeev-aws <[email protected]>
author Ash Berlin-Taylor <[email protected]> 1564493832 +0100 committer wayne.morris <[email protected]> 1564516048 -0400 parent 6ef0e37 [AIRFLOW-5052] Added the include_deleted param to salesforce_hook [AIRFLOW-1840] Support
back-compat on old celery config The new names are in-line with Celery 4, but if anyone upgrades Airflow without following the UPDATING.md instructions (which we probably assume most people won't, not until something stops working) their workers would suddenly just start failing. That's bad. This will issue a warning but carry on working as expected. We can remove the deprecation settings (but leave the code in config) after this release has been made. Closes apache#3549 from ashb/AIRFLOW-1840-back-compat (cherry picked from commit a4592f9) Signed-off-by: Bolke de Bruin <[email protected]> [AIRFLOW-2812] Fix error in Updating.md for upgrading to 1.10 Closes apache#3654 from nrhvyc/AIRFLOW-2812 [AIRFLOW-2816] Fix license text in docs/license.rst (cherry picked from commit af15f11) Signed-off-by: Bolke de Bruin <[email protected]> [AIRFLOW-2817] Force explicit choice on GPL dependency (apache#3660) By default one of Apache Airflow's dependencies pulls in a GPL library. Airflow should not install (and upgrade) without an explicit choice. This is part of the Apache requirements as we cannot depend on Category X software. (cherry picked from commit c37fc0b) Signed-off-by: Bolke de Bruin <[email protected]> (cherry picked from commit b39e453) Signed-off-by: Bolke de Bruin <[email protected]> [AIRFLOW-2869] Remove smart quote from default config Closes apache#3716 from wdhorton/remove-smart-quote- from-cfg (cherry picked from commit 67e2bb9) Signed-off-by: Bolke de Bruin <[email protected]> (cherry picked from commit 700f5f0) Signed-off-by: Bolke de Bruin <[email protected]> [AIRFLOW-2140] Don't require kubernetes for the SparkSubmit hook (apache#3700) This extra dep is a quasi-breaking change when upgrading - previously there were no deps outside of Airflow itself for this hook. Importing the k8s libs breaks installs that aren't also using Kubernetes. 
This makes the dep optional for anyone who doesn't explicitly use the functionality (cherry picked from commit 0be002e) Signed-off-by: Bolke de Bruin <[email protected]> (cherry picked from commit f58246d) Signed-off-by: Bolke de Bruin <[email protected]> [AIRFLOW-2859] Implement own UtcDateTime (apache#3708) The different UtcDateTime implementations all have issues. Either they replace tzinfo directly without converting or they do not convert to UTC at all. We also ensure all mysql connections are in UTC in order to keep sanity, as mysql will ignore the timezone of a field when inserting/updating. (cherry picked from commit 6fd4e60) Signed-off-by: Bolke de Bruin <[email protected]> (cherry picked from commit 8fc8c7a) Signed-off-by: Bolke de Bruin <[email protected]> [AIRFLOW-2895] Prevent scheduler from spamming heartbeats/logs Reverts most of AIRFLOW-2027 until the issues with it can be fixed. Closes apache#3747 from aoen/revert_min_file_parsing_time_commit [AIRFLOW-2979] Make celery_result_backend conf Backwards compatible (apache#3832) (apache#2806) Renamed `celery_result_backend` to `result_backend` and broke backwards compatibility. [AIRFLOW-2524] Add Amazon SageMaker Training (apache#3658) Add SageMaker Hook, Training Operator & Sensor Co-authored-by: srrajeev-aws <[email protected]> [AIRFLOW-2524] Add Amazon SageMaker Tuning (apache#3751) Add SageMaker tuning Operator and sensor Co-authored-by: srrajeev-aws <[email protected]> [AIRFLOW-2524] Add SageMaker Batch Inference (apache#3767) * Fix for comments * Fix sensor test * Update non_terminal_states and failed_states to static variables of SageMakerHook Add SageMaker Transform Operator & Sensor Co-authored-by: srrajeev-aws <[email protected]> [AIRFLOW-2763] Add check to validate worker connectivity to metadata Database [AIRFLOW-2786] Gracefully handle Variable import errors (apache#3648) Variables that are added through a file are not checked as explicity as creating a Variable in the web UI. 
This handles exceptions that could be caused by improper keys or values. [AIRFLOW-2860] DruidHook: time check is wrong (apache#3745) [AIRFLOW-2773] Validates Dataflow Job Name Closes apache#3623 from kaxil/AIRFLOW-2773 [AIRFLOW-2845] Asserts in contrib package code are changed on raise ValueError and TypeError (apache#3690) [AIRFLOW-1917] Trim extra newline and trailing whitespace from log (apache#3862) [AIRFLOW-XXX] Fix SlackWebhookOperator docs (apache#3915) The docs refer to `conn_id` while the actual argument is `http_conn_id`. [AIRFLOW-2912] Add Deploy and Delete operators for GCF (apache#3969) Both Deploy and Delete operators interact with Google Cloud Functions to manage functions. Both are idempotent and make use of GcfHook - hook that encapsulates communication with GCP over GCP API. [AIRFLOW-3078] Basic operators for Google Compute Engine (apache#4022) Add GceInstanceStartOperator, GceInstanceStopOperator and GceSetMachineTypeOperator. Each operator includes: - core logic - input params validation - unit tests - presence in the example DAG - docstrings - How-to and Integration documentation Additionally, in GceHook error checking if response is 200 OK was added: Some types of errors are only visible in the response's "error" field and the overall HTTP response is 200 OK. That is why apart from checking if status is "done" we also check if "error" is empty, and if not an exception is raised with error message extracted from the "error" field of the response. In this commit we also separated out Body Field Validator to separate module in tools - this way it can be reused between various GCP operators, it has proven to be usable in at least two of them now. Co-authored-by: sprzedwojski <[email protected]> Co-authored-by: potiuk <[email protected]> [AIRFLOW-3183] Fix bug in DagFileProcessorManager.max_runs_reached() (apache#4031) The condition is intended to ensure the function will return False if any file's run_count is still smaller than max_run. 
But the operator used here is "!=". Instead, it should be "<". This is because in DagFileProcessorManager, there is no statement helping limit the upper limit of run_count. It's possible that files' run_count will be bigger than max_run. In such case, max_runs_reached() method may fail its purpose. [AIRFLOW-3099] Don't ever warn about missing sections of config (apache#4028) Rather than looping through and setting each config variable individually, and having to know which sections are optional and which aren't, instead we can just call a single function on ConfigParser and it will read the config from the dict, and more importantly here, never error about missing sections - it will just create them as needed. [AIRFLOW-3089] Drop hard-coded url scheme in google auth redirect. (apache#3919) The google auth provider hard-codes the `_scheme` in the callback url to `https` so that airflow generates correct urls when run behind a proxy that terminates tls. But this means that google auth can't be used when running without https--for example, during local development. Also, hard-coding `_scheme` isn't the correct solution to the problem of running behind a proxy. Instead, the proxy should be configured to set the `X-Forwarded-Proto` header to `https`; Flask interprets this header and generates the appropriate callback url without hard-coding the scheme. [AIRFLOW-3178] Handle percents signs in configs for airflow run (apache#4029) * [AIRFLOW-3178] Don't mask defaults() function from ConfigParser ConfigParser (the base class for AirflowConfigParser) expects defaults() to be a function - so when we re-assign it to be a property some of the methods from ConfigParser no longer work. * [AIRFLOW-3178] Correctly escape percent signs when creating temp config Otherwise we have a problem when we come to use those values. * [AIRFLOW-3178] Use os.chmod instead of shelling out There's no need to run another process for a built in Python function. 
This also removes a possible race condition that would make the temporary config file readable by more than the airflow or run-as user. The exact behaviour would depend on the umask we run under and the primary group of our user; most likely the file would be readable by members of the airflow group (which in most cases would be just the airflow user). To remove any such possibility we chmod the file before we write to it.
[AIRFLOW-2216] Use profile for AWS hook if S3 config file provided in aws_default connection extra parameters (apache#4011)
Use profile for AWS hook if S3 config file provided in aws_default connection extra parameters
Add test to validate profile set
[AIRFLOW-3138] Use current data type for migrations (apache#3985)
* Use timestamp instead of timestamp with timezone for migration.
[AIRFLOW-3119] Enable debugging with Celery(apache#3950)
This will enable --loglevel when launching a celery worker and inherit that LOGGING_LEVEL setting from airflow.cfg
[AIRFLOW-3197] EMRHook is missing new parameters of the AWS API (apache#4044)
Allow passing any params to the CreateJobFlow API, so that we don't have to stay up to date with AWS api changes.
[AIRFLOW-3203] Fix DockerOperator & some operator test (apache#4049)
- For argument `image`, there is no need to explicitly add "latest" if the tag is omitted. "latest" will be used by default if no tag is provided. This is handled by the `docker` package itself.
- Intermediate variable `cpu_shares` is not needed.
- Fix wrong usage of `cpu_shares` and `cpu_shares`. Based on https://docker-py.readthedocs.io/en/stable/api.html#docker.api.container.ContainerApiMixin.create_host_config, they should be arguments of self.cli.create_host_config() rather than APIClient.create_container().
- Change the name of the corresponding test script, to ensure it can be discovered.
- Fix the test itself.
- Some other test scripts are not named properly, which results in failure of test discovery.
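The AIRFLOW-3178 note above about restricting the temporary config file before writing to it can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code from the commit; the helper name `write_private_file` and the `0o600` mode are assumptions chosen for the example.

```python
import os


def write_private_file(path, contents):
    """Write `contents` to `path`, restricting permissions before any data lands.

    Hypothetical helper for illustration only; it is not part of Airflow.
    """
    # Opening in "w" mode creates (or truncates) the file, so it is empty here.
    with open(path, "w") as handle:
        # Tighten permissions to owner read/write while the file is still empty,
        # so the umask or group membership never exposes the real contents.
        os.chmod(path, 0o600)
        handle.write(contents)
    return path
```

Calling `os.chmod` directly also avoids spawning a separate process just to change the file mode, which is the point made in the commit message.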
[AIRFLOW-3232] More readable GCF operator documentation (apache#4067)
[AIRFLOW-3231] Basic operators for Google Cloud SQL (apache#4097)
Add CloudSqlInstanceInsertOperator, CloudSqlInstancePatchOperator and CloudSqlInstanceDeleteOperator. Each operator includes:
- core logic
- input params validation
- unit tests
- presence in the example DAG
- docstrings
- How-to and Integration documentation
Additionally, small improvements to GcpBodyFieldValidator were made:
- add simple list validation capability (type="list")
- introduce parameter allow_empty, which can be set to False to test for non-emptiness of a string instead of specifying a regexp.
Co-authored-by: sprzedwojski <[email protected]>
Co-authored-by: potiuk <[email protected]>
[AIRFLOW-2524] Update SageMaker hook and operators (apache#4091)
This re-works the SageMaker functionality in Airflow to be more complete, and more useful for the kinds of operations that SageMaker supports. We removed some files and operators here, but these were only added after the last release so we don't need to worry about any sort of back-compat.
[AIRFLOW-3276] Cloud SQL: database create / patch / delete operators (apache#4124)
[AIRFLOW-2192] Allow non-latin1 usernames with MySQL backend by adding a SQL_ENGINE_ENCODING param and default to UTF-8 (apache#4087)
Comprised of:
- Since we have unicode_literals imported and the engine arguments must be strings in Python2, explicitly make 'utf-8' a string.
- Replace bare exception with conf.AirflowConfigException for missing value.
- It's just got for strings apparently.
- Add utf-8 to default_airflow.cfg - question do I still need the try/except block or can we depend on defaults (I note some have both).
- Get rid of try/except block and depend on default_airflow.cfg
- Use __str__ since calling str just gives us back a newstr as well.
- Test that a panda user can be saved.
[AIRFLOW-3295] Fix potential security issue in DaskExecutor (apache#4128)
When a user decides to use TLS/SSL encryption for DaskExecutor communications, a `Distributed.Security` object will be created. However, the argument `require_encryption` is not set to `True` (its default value is `False`). This may cause the TLS/SSL encryption setup to fail.
[AIRFLOW-XXX] Fix flake8 errors from apache#4144
[AIRFLOW-2574] Cope with '%' in SQLA DSN when running migrations (apache#3787)
Alembic uses a ConfigParser like Airflow does, and "%" is a special value in there, so we need to escape it. As per the Alembic docs:
> Note that this value is passed to ConfigParser.set, which supports
> variable interpolation using pyformat (e.g. `%(some_value)s`). A raw
> percent sign not part of an interpolation symbol must therefore be
> escaped, e.g. `%%`
[AIRFLOW-3090] Demote dag start/stop log messages to debug (apache#3920)
[AIRFLOW-3090] Specify path of key file in log message (apache#3921)
[AIRFLOW-3111] Fix instructions in UPDATING.md and remove comment (apache#3944) artifacts in default_airflow.cfg
- fixed incorrect instructions in UPDATING.md regarding core.log_filename_template and elasticsearch.elasticsearch_log_id_template
- removed comments referencing "additional curly braces" from default_airflow.cfg since they're irrelevant to the rendered airflow.cfg
[AIRFLOW-3127] Fix out-dated doc for Celery SSL (apache#3967)
Now in `airflow.cfg`, for Celery-SSL, the item names are "ssl_active", "ssl_key", "ssl_cert", and "ssl_cacert" (since PR https://github.com/apache/incubator-airflow/pull/2806/files). But in the documentation https://airflow.incubator.apache.org/security.html?highlight=celery or https://github.com/apache/incubator-airflow/blob/master/docs/security.rst, it's "CELERY_SSL_ACTIVE", "CELERY_SSL_KEY", "CELERY_SSL_CERT", and "CELERY_SSL_CACERT", which is out-dated and may confuse readers.
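The percent-sign escaping described in the AIRFLOW-2574 note above can be illustrated with Python's standard `configparser`. This is only a sketch of the general technique, not the code from the commit: the helper names `escape_percent` and `build_config`, the section name `alembic`, and the example URL are all assumptions made for illustration.

```python
from configparser import ConfigParser


def escape_percent(value):
    """Double every '%' so ConfigParser treats it as a literal character."""
    return value.replace("%", "%%")


def build_config(sqlalchemy_url):
    """Store a URL that may contain '%' in a ConfigParser (hypothetical helper)."""
    parser = ConfigParser()
    # Section and option names are illustrative, not taken from the commit.
    parser.add_section("alembic")
    # Without escaping, set() rejects a raw '%' as invalid interpolation syntax.
    parser.set("alembic", "sqlalchemy.url", escape_percent(sqlalchemy_url))
    return parser
```

Reading the option back with `parser.get("alembic", "sqlalchemy.url")` undoes the escaping, so the caller sees the original URL with its single percent signs.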
[AIRFLOW-3187] Update airflow.gif file with a slower version (apache#4033)
[AIRFLOW-3164] Verify server certificate when connecting to LDAP (apache#4006)
Misconfiguration and improper checking of exceptions disabled server certificate checking. We now only support TLS connections and no longer support insecure connections.
[AIRFLOW-2779] Add license headers to doc files (apache#4178)
This adds ASF license headers to all the .rst and .md files, with the exception of the Pull Request template (as that is included verbatim when opening a Pull Request on Github, which would be messy).
Added the include_deleted parameter to salesforce hook
Make sure you have checked all steps below.
JIRA
Description
Tests
Commits
Documentation
Code Quality
git diff upstream/master -u -- "*.py" | flake8 --diff