Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support ArrayNode subNode timeouts #6054

Merged

Conversation

hamersaw
Copy link
Contributor

@hamersaw hamersaw commented Nov 27, 2024

Tracking issue

N/A

Why are the changes needed?

Currently, timeouts are not supported on ArryaNode subNodes. If there is a timeout set in the task decorator, that is applied only to the overall ArrayNode and not individual subNodes. This change ensures subNodes are executed with timeouts as a standalone execution would be.

What changes were proposed in this pull request?

We store a deltaTimestamp to track startedAt for subNode executions. This delta is computed from the overall ArrayNode start time. This is stored using a bitarray (just like other ArrayNode subNode metadata), with a preset 18 bits of precision. This means for 5k subNodes there will be approximately 90kb added to the FlyteWorkflow CR - reducing scalability by a small amount. The 18 bits gives us around 3 days of delta time, meaning subNodes start times can be tracked up to 3 days after the overall ArrayNode starts. If this proves to be too imprecise, we can increase this value and incur further reductions to scale.

It is important to note that we do not have a task phase for TIMED_OUT. So in the UI this will display as FAILED. This is a restriction of the current eventing scheme for map tasks.

How was this patch tested?

This was tested locally under a variety of failure scenarios (ex. retryable failures).

Setup process

from datetime import timedelta
from flytekit import map_task, task, workflow
from typing import List

import time
from flytekit.exceptions.user import FlyteRecoverableException


@task(retries=3, timeout=timedelta(minutes=1))
def say_hello(name: str) -> str:
    #time.sleep(120)

    time.sleep(10)
    raise FlyteRecoverableException(f"testing failure for {name}")

    return f"hello {name}"


@workflow
def say_hello_wf(name: List[str]) -> List[str]:
    return map_task(say_hello)(name=name)

Screenshots

N/A

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

N/A

Docs link

N/A

Copy link

codecov bot commented Nov 27, 2024

Codecov Report

Attention: Patch coverage is 27.53623% with 50 lines in your changes missing coverage. Please review.

Project coverage is 37.09%. Comparing base (ab04192) to head (36601df).
Report is 10 commits behind head on master.

Files with missing lines Patch % Lines
...eworkflow/v1alpha1/mocks/MutableArrayNodeStatus.go 0.00% 20 Missing ⚠️
...rkflow/v1alpha1/mocks/ExecutableArrayNodeStatus.go 0.00% 18 Missing ⚠️
...ler/pkg/apis/flyteworkflow/v1alpha1/node_status.go 0.00% 7 Missing ⚠️
...opeller/pkg/controller/nodes/node_state_manager.go 0.00% 4 Missing ⚠️
...lytepropeller/pkg/controller/nodes/transformers.go 0.00% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##           master    #6054    +/-   ##
========================================
  Coverage   37.08%   37.09%            
========================================
  Files        1318     1318            
  Lines      132284   132399   +115     
========================================
+ Hits        49062    49117    +55     
- Misses      78950    79010    +60     
  Partials     4272     4272            
Flag Coverage Δ
unittests-datacatalog 51.58% <ø> (ø)
unittests-flyteadmin 54.10% <ø> (ø)
unittests-flytecopilot 30.99% <ø> (ø)
unittests-flytectl 62.29% <ø> (-0.05%) ⬇️
unittests-flyteidl 7.23% <ø> (-0.01%) ⬇️
unittests-flyteplugins 53.82% <ø> (+0.09%) ⬆️
unittests-flytepropeller 42.59% <27.53%> (-0.04%) ⬇️
unittests-flytestdlib 57.57% <ø> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Signed-off-by: Daniel Rammer <[email protected]>
Signed-off-by: Daniel Rammer <[email protected]>
Signed-off-by: Daniel Rammer <[email protected]>
@hamersaw hamersaw marked this pull request as ready for review December 3, 2024 17:25
@hamersaw hamersaw requested a review from pvditt December 3, 2024 17:25
pvditt
pvditt previously approved these changes Dec 4, 2024
Copy link
Contributor

@pvditt pvditt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changes look great. This reminded me of this issue https://linear.app/unionai/issue/COR-2212/[bug]-setting-task-metadata-for-the-parent-array-node-causes-incorrect where the timeout gets applied to the parent node. I'll get a small PR up to fix this on the flytekit end.

@hamersaw
Copy link
Contributor Author

hamersaw commented Dec 4, 2024

changes look great. This reminded me of this issue https://linear.app/unionai/issue/COR-2212/[bug]-setting-task-metadata-for-the-parent-array-node-causes-incorrect where the timeout gets applied to the parent node. I'll get a small PR up to fix this on the flytekit end.

This example workflow in ^^ works here. I think the timeout gets applied to the parent and subNodes. I know I looked through flytekit and convinced myself of it before implementing this, but can't seem to find the line now.

Copy link
Contributor

@pvditt pvditt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

confirmed this works as expected w/ flyteorg/flytekit#2979

@hamersaw hamersaw merged commit 76c7f76 into master Dec 5, 2024
51 of 52 checks passed
@hamersaw hamersaw deleted the daniel/cor-2213-support-arraynode-subnode-timeouts-in-oss branch December 5, 2024 19:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants