Remove non-public interface usage in EcsRunTaskOperator #29447

Taragolis · 2023-02-09T19:31:27Z

Right now, when "reattach" is required, EcsRunTaskOperator relies on hacks:

It saves info to XCom in a way that is not valid, because it does not use a public interface.
In addition, it can't work with Dynamic Task Mapping (every XCom name contains only the task_id and misses the other parts of the unique TI key).

With this PR, the current mechanism is replaced by the built-in ability of ECS.Client.run_task to set startedBy and filter on it later.

startedBy is limited to 36 characters, so it is not possible to use a string representation of dag_id + task_id + run_id + map_id. Instead, a UUID is generated from these values and used as a unique (per TI) value.
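To illustrate the idea above, here is a minimal sketch of deriving a deterministic, 36-character identifier from the parts of a task instance key. The function name `deterministic_uuid` and the sample key values are hypothetical, and this is not necessarily how the helper in `identifiers.py` is implemented; it only assumes that a name-based (version 5) UUID from the standard library is acceptable.

```python
import uuid


def deterministic_uuid(*parts: str, namespace: uuid.UUID = uuid.NAMESPACE_OID) -> str:
    """Fold all parts into a single name-based (version 5) UUID string."""
    if not parts:
        raise ValueError("expected at least one part")
    result = namespace
    for part in parts:
        # The previous UUID becomes the namespace for the next part, so both
        # the values and their order influence the final identifier.
        result = uuid.uuid5(result, part)
    return str(result)


# Hypothetical task instance key: dag_id, task_id, run_id, map_index.
started_by = deterministic_uuid("my_dag", "my_task", "manual__2023-02-09", "-1")
```

Because the result depends only on the inputs, a retry of the same task instance produces the same value, while a different map index produces a different one. The canonical string form of a UUID is always 36 characters long, which fits the startedBy limit.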

Unfortunately, EcsRunTaskOperator currently sets startedBy to the owner of the task, so the two become mutually exclusive:

If reattach is set to True, then startedBy is set to the unique TI value (UUID)
If reattach is set to False (default), then startedBy is set to task.owner
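The two cases above could be sketched roughly as follows. `choose_started_by` is a hypothetical helper, not the operator's actual code, and joining the key parts with "/" is a simplification made only for this example.

```python
from __future__ import annotations

import uuid


def choose_started_by(reattach: bool, owner: str, ti_key: tuple[str, ...]) -> str:
    """Pick the startedBy value according to the reattach flag."""
    if reattach:
        # Unique per task instance, so a restarted run can find its ECS task.
        return str(uuid.uuid5(uuid.NAMESPACE_OID, "/".join(ti_key)))
    # Previous behaviour: startedBy carries the task owner.
    return owner


# Hypothetical task instance key: dag_id, task_id, run_id, map_index.
key = ("my_dag", "my_task", "manual__2023-02-09", "-1")
with_reattach = choose_started_by(True, "airflow", key)
without_reattach = choose_started_by(False, "airflow", key)
```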

airflow/providers/amazon/aws/utils/identifiers.py

tests/providers/amazon/aws/operators/test_ecs.py

tests/providers/amazon/aws/utils/test_identifiers.py

Taragolis · 2023-02-09T23:39:24Z

Not sure whether it is these changes, breeze itself, or something else, but I see interesting behaviour in the logs while executing this operator

[2023-02-09, 23:33:53 UTC] {ecs.py:489} INFO - ECS task ID is: 6beb6333b8dc4cf7afbd36b524efd026*** Could not read served logs: Parent instance <TaskInstance at 0xffff5f2c2590> is not bound to a Session; lazy load operation of attribute 'trigger' cannot proceed (Background on this error at: https://sqlalche.me/e/14/bhk3)
*** Found local files:
***   * /root/airflow/logs/dag_id=test_ecs_operator_reatach/run_id=manual__2023-02-09T23:33:49.777801+00:00/task_id=ecs-task-regular-no-reattach/attempt=1.log
*** Could not read served logs: Parent instance <TaskInstance at 0xffff5f3e0050> is not bound to a Session; lazy load operation of attribute 'trigger' cannot proceed (Background on this error at: https://sqlalche.me/e/14/bhk3)
*** Found local files:
***   * /root/airflow/logs/dag_id=test_ecs_operator_reatach/run_id=manual__2023-02-09T23:33:49.777801+00:00/task_id=ecs-task-regular-no-reattach/attempt=1.log
*** Could not read served logs: Parent instance <TaskInstance at 0xffff5f480810> is not bound to a Session; lazy load operation of attribute 'trigger' cannot proceed (Background on this error at: https://sqlalche.me/e/14/bhk3)
*** Found local files:
***   * /root/airflow/logs/dag_id=test_ecs_operator_reatach/run_id=manual__2023-02-09T23:33:49.777801+00:00/task_id=ecs-task-regular-no-reattach/attempt=1.log
*** Could not read served logs: Parent instance <TaskInstance at 0xffff5f259310> is not bound to a Session; lazy load operation of attribute 'trigger' cannot proceed (Background on this error at: https://sqlalche.me/e/14/bhk3)
*** Found local files:

Update: Fixed by #29472

Probably we should just chop this view in favor of grid view logging which is the future. But this fixes rendering issues raised here apache#29447 (comment). What we do, is in log tailing context (which apparently isn't used in grid, and that's why I did not see this in developing trigger logging) we don't add the messages to the log content. So, whenever log_pos is in metadata we don't add messages. It means the messages could be a bit stale but that seems ok. Refreshing the page could fix that. Longer term, we could update the API so that log content is just content and the messages are themselves returned in the metadata dict. That's probably the "right" solution ultimately. But can be saved for another day. Also resolve the "cannot load lazy instance" issue when invoking the reader logic from this different context.

shubham22 · 2023-03-21T19:36:42Z

@o-nikolas - looks like this one needs review, otherwise it is ready to be merged cc: @Taragolis

ferruzzi

LGTM. I did raise a comment previously, but it was just a convention nitpick, don't block on my account. 👍

o-nikolas · 2023-03-21T21:05:41Z

airflow/providers/amazon/aws/operators/ecs.py

+            ti: TaskInstance = context["ti"]
+            self.started_by = generate_uuid(*map(str, ti.key.primary))
+            if self.do_xcom_push:
+                ti.xcom_push("started_by", self.started_by)


I'm confused why you're pushing this value to xcom. I don't see where it's read back? In _try_reattach_task it is gotten from self.started_by not xcom.

All XCom values associated with the task are cleared before it starts. The current implementation (in main) of EcsRunTaskOperator uses an undocumented feature (non-public interface usage): it saves a specific value to XCom for a non-existent task, which is not cleared when you reset the task run. Potentially it could also damage the user's Airflow database (e.g. constraint violations).

This value is only for reference, for example if the user wants to find this task in ECS (e.g. in AWS Console / AWS SDK) by providing startedBy.

If the goal is just to allow the user to reference the started_by UUID, then I think logging it and/or using an Operator extra link is much more user friendly than putting it in XCom.

I don't think it is possible to create an extra link; I'm not sure it is possible to create a GET request for this one.

Okay, fair enough. If you can just log it then that'd be helpful for users (in addition to the XCOM, no need to remove that), otherwise lgtm 👍

I decided to remove the ability to save started_by into XCom. started_by is deterministic and only required for reattach, so I'm not sure there is any benefit in keeping it in XCom.

airflow/providers/amazon/aws/operators/ecs.py

o-nikolas · 2023-03-21T21:08:16Z

airflow/providers/amazon/aws/operators/ecs.py

        list_tasks_resp = self.client.list_tasks(
-            cluster=self.cluster, desiredStatus="RUNNING", family=ecs_task_family
+            cluster=self.cluster, desiredStatus="RUNNING", startedBy=self.started_by


Don't you want to see if there is a value for started_by in xcom here?

We can't get the XCom from the previous task run, because Airflow removes it before spawning a new task run.
In addition, I've just made started_by a private attribute.
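For readers following the diff above, the reattach lookup could be sketched like this. `find_running_task_arn` and `FakeEcsClient` are hypothetical names used only for illustration; the sketch assumes a client exposing `list_tasks` with the `cluster`, `desiredStatus` and `startedBy` keyword arguments, as the boto3 ECS client does, and uses a stub so it runs without AWS access.

```python
from __future__ import annotations

from typing import Any


def find_running_task_arn(client: Any, cluster: str, started_by: str) -> str | None:
    """Return the ARN of a running task started with this startedBy, if any."""
    response = client.list_tasks(
        cluster=cluster, desiredStatus="RUNNING", startedBy=started_by
    )
    task_arns = response.get("taskArns", [])
    return task_arns[0] if task_arns else None


class FakeEcsClient:
    """Stand-in for an ECS client; maps startedBy values to running task ARNs."""

    def __init__(self, running: dict[str, list[str]]) -> None:
        self._running = running

    def list_tasks(self, *, cluster: str, desiredStatus: str, startedBy: str) -> dict:
        return {"taskArns": self._running.get(startedBy, [])}


client = FakeEcsClient({"uuid-1": ["arn:aws:ecs:example:task/abc"]})
found = find_running_task_arn(client, "my-cluster", "uuid-1")
missing = find_running_task_arn(client, "my-cluster", "uuid-2")
```

Since started_by is derived from the task instance key, the lookup does not need anything stored from the previous run.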

airflow/providers/amazon/aws/operators/ecs.py

github-actions · 2023-05-09T00:11:49Z

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

scottypate · 2023-06-06T19:03:40Z

Hey @Taragolis Curious if you are going to merge this change or if it is legit stale? We are having the same issue with reattachment.

SBurwash · 2023-08-20T18:58:56Z

+1 I'm experiencing this issue as well

Taragolis · 2023-08-20T19:23:57Z

Thanks for the reminder, I've just been trying to find this PR to continue working on it (simply, starting with a rebase)

Co-authored-by: D. Ferruzzi <[email protected]>

Taragolis · 2023-08-20T20:00:03Z

@ferruzzi @o-nikolas @vincbeck when you have time, could you have a look?

need review against current codebase

SBurwash · 2023-08-24T12:04:30Z

Awesome feature, thank you so much for shipping it out!

How would we be able to access these changes? Will it be available directly with the ECS run task operator, or will it be part of a larger release of airflow?

potiuk · 2023-08-24T12:53:18Z

Look at the docs. We release providers separately from core, and if you want you can upgrade them independently. https://airflow.apache.org/docs/apache-airflow/stable/installation/installing-from-pypi.html#installation-and-upgrade-scenarios

potiuk · 2023-08-24T12:54:59Z

We are gearing up to release both a provider wave (release candidates likely today) and the new Airflow 2.7.1 (in the next weeks). Once providers are out, the latest Airflow release by default points to the latest released providers (but you can freely downgrade/upgrade the providers independently as you wish)

vandonr-amz · 2023-08-24T17:43:53Z

airflow/providers/amazon/aws/operators/ecs.py

-        if self.reattach:
-            # Save the task ARN in XCom to be able to reattach it if needed
-            self.xcom_push(context, key=self.REATTACH_XCOM_KEY, value=self.arn)


this is a somewhat-breaking change, as the example code

airflow/tests/system/providers/amazon/aws/example_ecs_fargate.py

Lines 125 to 126 in e0f21f4

# You must set `reattach=True` in order to get ecs_task_arn if you plan to use a Sensor.

reattach=True,

was recommending setting reattach to true to get the ARN.

I think this sucked, but removing the arn entirely from the xcom values is not good either.
What we could do is set it all the time now that we don't rely on this anymore to know if we need to reattach.

I guess this part never worked as expected

well, pushing the ARN to xcom at least was working.
I opened a PR to restore that specific thing.
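A rough sketch of what "set it all the time" could look like, independent of the reattach flag. `push_task_arn` and `FakeTaskInstance` are hypothetical; the key name `ecs_task_arn` is taken from the example comment quoted above, and the follow-up PR itself is not shown here, so this is only an assumption about its shape.

```python
class FakeTaskInstance:
    """Stand-in for a task instance; records what would be pushed to XCom."""

    def __init__(self) -> None:
        self.pushed: dict = {}

    def xcom_push(self, key: str, value: object) -> None:
        self.pushed[key] = value


def push_task_arn(context: dict, arn: str) -> None:
    """Always expose the ECS task ARN, whether or not reattach is enabled."""
    context["ti"].xcom_push(key="ecs_task_arn", value=arn)


ti = FakeTaskInstance()
push_task_arn({"ti": ti}, "arn:aws:ecs:example:task/abc")
```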

Taragolis requested a review from ferruzzi February 9, 2023 19:31

boring-cyborg bot added area:providers provider:amazon AWS/Amazon - related issues labels Feb 9, 2023