[BUG] AbstractRetryableWorkflowStep doesn't handle edge case where initial task is already complete. #493

Closed
dbwiddis opened this issue Feb 5, 2024 · 0 comments · Fixed by #494
Labels
bug Something isn't working

dbwiddis commented Feb 5, 2024

What is the bug?

When a model deploys quickly enough that the initial task response is already COMPLETED, the abstract retryable step never executes the retry task.

As a result, the step's future is never completed and the step eventually reports a timeout, even though the model deployed successfully.

How can one reproduce the bug?

Deploy a model on a cluster where deployment finishes fast enough that the initial response is already COMPLETED.

Probably the source of this test failure:
#465 (comment)

What is the expected behavior?

Successful model deployment should check the deploy task at least once.

What is your host/environment?

FGAC-enabled domain.

Do you have any screenshots?

[2024-02-04T23:47:44,028][INFO ][o.o.f.w.ProcessNode      ] [fb9bced73eac9fc85c53cd87ca0e8665] Starting deploy_model_3.
[2024-02-04T23:47:44,029][INFO ][o.o.m.a.d.TransportDeployModelAction] [fb9bced73eac9fc85c53cd87ca0e8665] Will deploy model on these nodes: xgPngcVmTBailSli5PbIww
[2024-02-04T23:47:44,052][INFO ][o.o.f.w.DeployModelStep  ] [fb9bced73eac9fc85c53cd87ca0e8665] Model deployment state COMPLETED

Do you have any additional context?

The while (!future.isDone()) { } loop should be a do { } while () loop, so that the deploy task is checked at least once.
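To illustrate the difference, here is a minimal, self-contained sketch. It is not the actual AbstractRetryableWorkflowStep code, which this issue does not show. The class `RetryLoopSketch`, the `TaskState` enum, the `getTask` supplier, and the `maxAttempts` bound are all hypothetical names chosen for the example. It assumes the pre-test loop's condition is already false when the initial response is COMPLETED, so the body that completes the future never runs.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

class RetryLoopSketch {
    // Hypothetical stand-in for the state reported by the deploy task.
    enum TaskState { RUNNING, COMPLETED }

    // Pre-test loop: if the initial state is already COMPLETED, the body is
    // skipped entirely and the returned future is never completed.
    static CompletableFuture<String> pollWithWhile(
            TaskState initial, Supplier<TaskState> getTask, int maxAttempts) {
        CompletableFuture<String> future = new CompletableFuture<>();
        TaskState state = initial;
        int attempts = 0;
        while (state != TaskState.COMPLETED && attempts < maxAttempts) {
            state = getTask.get();
            attempts++;
            if (state == TaskState.COMPLETED) {
                future.complete("deployed");
            }
        }
        return future;
    }

    // Post-test loop: the task is checked at least once, so a task that is
    // already COMPLETED still completes the future.
    static CompletableFuture<String> pollWithDoWhile(
            Supplier<TaskState> getTask, int maxAttempts) {
        CompletableFuture<String> future = new CompletableFuture<>();
        int attempts = 0;
        do {
            TaskState state = getTask.get();
            attempts++;
            if (state == TaskState.COMPLETED) {
                future.complete("deployed");
            }
        } while (!future.isDone() && attempts < maxAttempts);
        return future;
    }

    public static void main(String[] args) {
        // Simulated task that is already complete on the first check.
        Supplier<TaskState> alreadyDone = () -> TaskState.COMPLETED;
        System.out.println("while: "
                + pollWithWhile(TaskState.COMPLETED, alreadyDone, 3).isDone());
        System.out.println("do-while: "
                + pollWithDoWhile(alreadyDone, 3).isDone());
    }
}
```

With a task that is already complete, the while variant returns a future that is not done, which matches the timeout described above. The do-while variant checks once and completes the future.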
