-
Notifications
You must be signed in to change notification settings - Fork 665
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support ArrayNode subNode timeouts #6054
Support ArrayNode subNode timeouts #6054
Conversation
Signed-off-by: Daniel Rammer <[email protected]>
Signed-off-by: Daniel Rammer <[email protected]>
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #6054 +/- ##
========================================
Coverage 37.08% 37.09%
========================================
Files 1318 1318
Lines 132284 132399 +115
========================================
+ Hits 49062 49117 +55
- Misses 78950 79010 +60
Partials 4272 4272
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
Signed-off-by: Daniel Rammer <[email protected]>
Signed-off-by: Daniel Rammer <[email protected]>
Signed-off-by: Daniel Rammer <[email protected]>
Signed-off-by: Daniel Rammer <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changes look great. This reminded me of this issue https://linear.app/unionai/issue/COR-2212/[bug]-setting-task-metadata-for-the-parent-array-node-causes-incorrect where the timeout gets applied to the parent node. I'll get a small PR up to fix this on the flytekit end.
This example workflow in ^^ works here. I think the timeout gets applied to the parent and subNodes. I know I looked through flytekit and convinced myself of it before implementing this, but can't seem to find the line now. |
Signed-off-by: Daniel Rammer <[email protected]>
Signed-off-by: Daniel Rammer <[email protected]>
Signed-off-by: Daniel Rammer <[email protected]>
Signed-off-by: Daniel Rammer <[email protected]>
Signed-off-by: Daniel Rammer <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
confirmed this works as expected w/ flyteorg/flytekit#2979
Tracking issue
N/A
Why are the changes needed?
Currently, timeouts are not supported on ArryaNode subNodes. If there is a timeout set in the task decorator, that is applied only to the overall ArrayNode and not individual subNodes. This change ensures subNodes are executed with timeouts as a standalone execution would be.
What changes were proposed in this pull request?
We store a
deltaTimestamp
to trackstartedAt
for subNode executions. This delta is computed from the overall ArrayNode start time. This is stored using abitarray
(just like other ArrayNode subNode metadata), with a preset 18 bits of precision. This means for 5k subNodes there will be approximately 90kb added to the FlyteWorkflow CR - reducing scalability by a small amount. The 18 bits gives us around 3 days of delta time, meaning subNodes start times can be tracked up to 3 days after the overall ArrayNode starts. If this proves to be too imprecise, we can increase this value and incur further reductions to scale.It is important to note that we do not have a task phase for
TIMED_OUT
. So in the UI this will display asFAILED
. This is a restriction of the current eventing scheme for map tasks.How was this patch tested?
This was tested locally under a variety of failure scenarios (ex. retryable failures).
Setup process
Screenshots
N/A
Check all the applicable boxes
Related PRs
N/A
Docs link
N/A