In campaigns we need the ability to retry jobs multiple times. (See https://github.com/sourcegraph/sourcegraph/issues/12700#issuecomment-671798531 for additional context.) This is what I think is the simplest and easiest-to-understand solution. I did have another solution that involved a PreDequeue hook (which returned the custom conditions you see here now) and a boolean in the StoreOptions to switch between AND'ing and OR'ing the custom conditions into the selectCandidateQuery. That felt a bit hacky. It was less code, but also easier to miss and misunderstand. What do you think of this?
Codecov Report
@@ Coverage Diff @@
## mrn/worker-retry-after #13478 +/- ##
==========================================================
- Coverage 51.51% 51.50% -0.01%
==========================================================
Files 1496 1496
Lines 83315 83316 +1
Branches 6798 6798
==========================================================
- Hits 42918 42911 -7
- Misses 36805 36811 +6
- Partials 3592 3594 +2
    OrderByExpression: sqlf.Sprintf("reconciler_state = 'errored', changesets.updated_at DESC"),

    StalledMaxAge: 60 * time.Second,
    MaxNumResets:  60,
Will a specific error be logged that says no further retries will happen? If not, should we Wrap the last error in a crashloop error?
You mean logged to the failure_message column, I assume? No. But I think that's possible, and I'll take a look at it when I fix the updated_at.
Cool, thanks!
This is based on and requires #13457.
In combination with #13457 this PR fixes the first and most important part of https://github.com/sourcegraph/sourcegraph/issues/12700 by adding automatic retrying of failed changesets to the reconciler.
It retries failed changesets every 5 seconds, if there are no newer, non-errored changesets to be dequeued.
The new order clause is strictly speaking not necessary yet, but I think it's a bug that we don't update a changeset's UpdatedAt when saving it back to the database. I want to tackle that in another PR.