Fix for PipelineRuns getting stuck in the running state in the cluster #6095
Conversation
/kind bug
Thanks for this, I think it looks good and the tests are really nice.
I have a couple questions / comments on the logic.
	return err
}

return nil
Rather than returning nil, I think we could return a "PermanentError" instead, which means the key won't be re-queued to the controller, but we'll still track this as an error situation and can provide context in the error about what is going on - i.e. that the PipelineRun condition is still not updated after 2*timeout, so we apply different logic here.
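For illustration, a minimal sketch of that suggestion in Go, assuming the knative.dev/pkg/controller helper that Tekton's reconcilers already import; the function name and error message below are invented, not taken from the PR:

```go
package pipelinerun // illustrative placement, not the PR's actual file

import (
	"fmt"

	"knative.dev/pkg/controller"
)

// giveUpOnStuckTimeout is a hypothetical helper: instead of returning nil,
// wrap the error with controller.NewPermanentError so the key is not
// re-queued, while still recording why the reconciler stopped retrying.
func giveUpOnStuckTimeout(namespace, name string) error {
	return controller.NewPermanentError(fmt.Errorf(
		"pipelinerun %s/%s: timeout condition still not stored after 2*timeout; not re-queuing",
		namespace, name))
}
```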
👍 I'll change it, but should we consider returning a controller.NewRequeueImmediately()? Since we are testing for !isPipelineRunTimeoutConditionSet(pr), it would not be in an infinite loop.
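A similarly hedged sketch of the requeue alternative raised here; only controller.NewRequeueImmediately is a real knative.dev/pkg helper, while the helper name and the condition check standing in for isPipelineRunTimeoutConditionSet are assumptions:

```go
package pipelinerun // illustrative placement

import (
	"knative.dev/pkg/apis"
	"knative.dev/pkg/controller"

	v1beta1 "github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
)

// requeueUntilTimeoutConditionStored is a made-up name for the alternative
// discussed above: keep re-queuing the key until the timeout condition has
// actually been persisted, then stop, so there is no infinite loop.
func requeueUntilTimeoutConditionStored(pr *v1beta1.PipelineRun) error {
	// Assumption: a recorded timeout shows up as a False "Succeeded" condition.
	succeeded := pr.Status.GetCondition(apis.ConditionSucceeded)
	if !succeeded.IsFalse() {
		// NewRequeueImmediately puts the key straight back on the work queue.
		return controller.NewRequeueImmediately()
	}
	return nil
}
```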
Thanks, @RafaeLeal for this PR! Could you provide a bit more context about the issue in the commit message / PR description, and fill in the PR checklist from the template? Also, tide does not automatically squash commits before merging, so I would ask you to please squash them at least once the PR has been approved. Thank you!
Sure, I was not done yet actually. I added another test on the HasTimedOutForALongTime func. I had some ideas on how to avoid arbitrary thresholds, which I shared in the issue and which I quote here.
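For readers without the issue open, here is a rough sketch of the 2*timeout threshold that a helper like HasTimedOutForALongTime could implement; the signature and body are assumptions based on this thread, not the PR's exact code:

```go
package pipelinerun // illustrative placement

import (
	"time"

	"k8s.io/utils/clock"

	v1beta1 "github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
)

// hasTimedOutForALongTime sketches the "long past its timeout" check: the run
// is considered stuck once the elapsed time exceeds twice the configured
// timeout (the arbitrary threshold discussed in this PR).
func hasTimedOutForALongTime(pr *v1beta1.PipelineRun, c clock.PassiveClock, timeout time.Duration) bool {
	if pr.Status.StartTime == nil || timeout == 0 {
		// Not started yet, or timeouts disabled: never treat the run as stuck.
		return false
	}
	return c.Since(pr.Status.StartTime.Time) > 2*timeout
}
```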
/retest
// this limit is just enough to set the timeout condition, but not enough for extra metadata.
etcdRequestSizeLimit := 650
prt.TestAssets.Clients.Pipeline.PrependReactor("update", "pipelineruns", withEtcdRequestSizeLimit(t, etcdRequestSizeLimit))
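The reactor referenced above could look roughly like the following sketch; the real withEtcdRequestSizeLimit helper lives in the PR's test code, so details such as the error text and JSON marshaling are assumptions:

```go
package pipelinerun_test // illustrative placement

import (
	"encoding/json"
	"fmt"
	"testing"

	"k8s.io/apimachinery/pkg/runtime"
	ktesting "k8s.io/client-go/testing"
)

// withEtcdRequestSizeLimit returns a fake-clientset reactor that rejects
// update requests whose serialized object is larger than limit bytes,
// simulating etcd's "request is too large" rejection.
func withEtcdRequestSizeLimit(t *testing.T, limit int) ktesting.ReactionFunc {
	return func(action ktesting.Action) (bool, runtime.Object, error) {
		obj := action.(ktesting.UpdateAction).GetObject()
		data, err := json.Marshal(obj)
		if err != nil {
			t.Fatalf("failed to marshal object: %v", err)
		}
		if len(data) > limit {
			// handled=true plus an error makes the fake client return the failure.
			return true, nil, fmt.Errorf("etcdserver: request is too large (%d bytes > %d)", len(data), limit)
		}
		// handled=false lets the default reactor apply the update normally.
		return false, nil, nil
	}
}
```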
nice!
@RafaeLeal do you think there is any work remaining to address #6076 once this is merged? (If not I'll link the issue directly to the PR)
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: lbernick
I don't think so, you can link the issue.
/retest
Thanks for this fix @RafaeLeal, left some suggestions to make the code cleaner
/lgtm
Thanks @RafaeLeal!
Changes
Closes: #6076
This PR fixes scenarios where, in the same reconciliation cycle, the PipelineRun status grows too large to be updated and is also being changed to timed out. The UpdateStatus call then fails, so the PipelineRun can never leave the running state, even after the timeout.
The PR introduces an arbitrary threshold (2*timeout): if reconciliation notices that threshold has been reached, it skips any other update and updates only the status, making the request as small as possible and avoiding etcd request size errors.
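In rough terms, the reconcile-time escape hatch described above might look like the sketch below; apart from the 2*timeout idea and the status-only update, the helper name, wiring, and exact calls are assumptions rather than an excerpt from the PR:

```go
package pipelinerun // illustrative placement

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	v1beta1 "github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
	clientset "github.com/tektoncd/pipeline/pkg/client/clientset/versioned"
)

// finishStuckRun sketches the "status-only" escape hatch: once the run is long
// past its timeout, mark it failed and send the smallest possible update
// (status subresource only, no label/annotation/metadata changes) so the
// request stays under etcd's size limit. Names here are illustrative.
func finishStuckRun(ctx context.Context, client clientset.Interface, pr *v1beta1.PipelineRun) error {
	pr.Status.MarkFailed(v1beta1.PipelineRunReasonTimedOut.String(),
		"PipelineRun %q failed to finish within its timeout", pr.Name)
	if _, err := client.TektonV1beta1().PipelineRuns(pr.Namespace).UpdateStatus(ctx, pr, metav1.UpdateOptions{}); err != nil {
		return fmt.Errorf("failed to update status for stuck PipelineRun %s/%s: %w", pr.Namespace, pr.Name, err)
	}
	return nil
}
```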
Submitter Checklist
As the author of this PR, please check off the items in this checklist:
Meets the Tekton contributor standards (including functionality, content, code)
Has a kind label. You can add one by adding a comment on this PR that contains /kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
Release Notes