Allow retrying of errored changesets #12700
Comments
Discovered this while investigating #12700.
@sourcegraph/campaigns after looking into this, I think https://github.com/sourcegraph/sourcegraph/pull/12905 is as much as we can do for now. My goal for this ticket was to make sure that users are never stuck with failed changesets, and #12905 solves that. So:

The current state of retrying failed changesets

Users can retry the publishing/updating of changesets by applying a new

Ideas on the future of retrying failed changesets

Idea 1: Retry on re-apply

If we end up introducing caching in into something like this:

    if campaign.CampaignSpecID == campaignSpec.ID {
        err := s.store.EnqueueChangesets(ctx, EnqueueChangesetOpts{
            CampaignSpecID:  campaignSpec.ID,
            ReconcilerState: campaigns.ReconcilerStateErrored,
        })
        return campaign, err
    }

(That method,

Idea 2: Automatic retry by reconciler

The reconciler could simply pick up changesets where

There are a few twists that we can make here:
All of this would require changes to the underlying

I think we should approach this by going from the easiest to the most sophisticated approach:

The last one is hard, because it's hard to distinguish between ephemeral and non-ephemeral errors, especially across service boundaries (e.g. we know that a call to GitHub with a 5xx response code is retryable, but what if

I think both of these ideas should be tackled in the next iteration, especially since the first one goes hand in hand with https://github.com/sourcegraph/sourcegraph/issues/12827 and the easiest version of the second one should be ™️ relatively easy to implement.
Dear all, This is your release captain speaking. 🚂🚂🚂 Branch cut for the 3.19 release is scheduled for tomorrow. Is this issue / PR going to make it in time? Please change the milestone accordingly. Thank you
Just an addition to my previous comment: we also need to make sure that retrying the import of a changeset works. Right now, if it fails, you can apply a new campaign spec but the import doesn't get retried, since imported changesets don't have a changeset spec.
In campaigns we need the ability to retry jobs multiple times. (See https://github.com/sourcegraph/sourcegraph/issues/12700#issuecomment-671798531 for additional context.) This is what I think is the easiest-to-understand and simplest solution. I did have another solution that involved a PreDequeue hook (that returned the custom conditions you see here now) and a boolean in the StoreOptions to switch between AND'ing or OR'ing the custom conditions to the selectCandidateQuery. This felt a bit hacky. It was less code, but also easier to miss and misunderstand. What do you think of this?
* Add RetryAfter to dbworker.StoreOptions
* Add tests for RetryAfter in dbworker.Store
* Use RetryAfter in campaigns workers
* Order changesets by reconciler_state, then updated_at
Closing this because #13457 and #13478 add automatic interval-based retrying to the reconciler, and that solves 90% of the problems that this issue aims to address.

What we can/will do later:
What doesn't make sense:
This is an add-on to #12700 and makes sure that when we apply a new campaign spec to an existing campaign, all the failed changesets are retried and properly reset (including NumResets and FailureMessage).
Right after writing the previous comment I realised that I missed something. Here it is: https://github.com/sourcegraph/sourcegraph/pull/13591/files
* Re-enqueue and reset failed changesets in ApplyCampaign
* Introduce ResetQueued method on Changeset
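As a rough illustration of what resetting a failed changeset involves, the sketch below clears the retry bookkeeping and puts the changeset back in the queue. The `changeset` struct, its field types, and the `"queued"` state string are simplified assumptions for this example. Only the method name `ResetQueued` and the fields `NumResets` and `FailureMessage` come from the messages above, and the real implementation may differ.

```go
package main

import "fmt"

// changeset is a hypothetical, trimmed-down stand-in for the real Changeset
// type. Only the fields relevant to retrying are included.
type changeset struct {
	ReconcilerState string
	NumResets       int64
	FailureMessage  *string
}

// ResetQueued puts the changeset back into the queue and clears the state
// left over from earlier failed attempts, so that it is retried from scratch.
func (c *changeset) ResetQueued() {
	c.ReconcilerState = "queued"
	c.NumResets = 0
	c.FailureMessage = nil
}

func main() {
	msg := "gitserver unavailable"
	c := &changeset{ReconcilerState: "errored", NumResets: 3, FailureMessage: &msg}

	c.ResetQueued()

	fmt.Println(c.ReconcilerState, c.NumResets, c.FailureMessage == nil)
}
```

Keeping the reset in one method means every path that re-enqueues a changeset clears the same fields, which is presumably why the change introduces it alongside the re-enqueue in ApplyCampaign.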
I'm not sure whether we handle retrying correctly. We need to check that retrying publishing works if gitserver fails, GitHub is down, etc. And we need to check that the update works.
We should also probably allow retrying by simply re-applying the same campaign spec.