Skip to content
This repository has been archived by the owner on Sep 30, 2024. It is now read-only.

Allow retrying of errored changesets #12700

Closed
mrnugget opened this issue Aug 4, 2020 · 5 comments
Closed

Allow retrying of errored changesets #12700

mrnugget opened this issue Aug 4, 2020 · 5 comments
Assignees
Labels
batch-changes Issues related to Batch Changes estimate/4d planned/3.19
Milestone

Comments

@mrnugget
Copy link
Contributor

mrnugget commented Aug 4, 2020

I'm not sure whether we handle retrying correctly. We need to check that retrying publishing works if gitserver fails, GitHub is down, etc. And we need to check that the update works.

We should also probably allow retrying by simply re-applying the same campaign spec.

@mrnugget mrnugget added campaigns batch-changes Issues related to Batch Changes labels Aug 4, 2020
@mrnugget mrnugget added this to the 3.19 milestone Aug 4, 2020
@mrnugget mrnugget self-assigned this Aug 4, 2020
@mrnugget mrnugget changed the title Check that retrying of changeset create/update works correctly in reconciler Allow retrying of errored changesets Aug 6, 2020
mrnugget added a commit that referenced this issue Aug 11, 2020
@mrnugget
Copy link
Contributor Author

@sourcegraph/campaigns after looking into this, I think https://github.com/sourcegraph/sourcegraph/pull/12905 is as much as we can do for now.

My goal for this ticket was to make sure that users are never stuck with a failed changesets and #12905 solves that.

So:

The current state of retrying failed changesets

Users can retry the publishing/updating of changesets by applying a new CampaignSpec with new ChangesetSpecs. That means the src-cli could simply create new changesetSpecs/campaignSpecs and apply them every time the user runs src campaign apply. Users can then just re-run the command to retry.

Ideas on the future of retrying failed changesets

Idea 1: Retry on re-apply

If we end up introducing caching in src-cli (which we should at some point, because creating new changesetSpecs/campaignSpecs with the same diff is wasteful) we could say that the applyCampaign mutation receives a new side-effect: if you re-apply the same campaign spec to the same campaign, we re-reenqueue all failed changesets. That's easy enough to do by changing this

https://github.com/sourcegraph/sourcegraph/blob/c6ecbcc6a7c6435ee8c30ea0e5f2f67d2e143849/enterprise/internal/campaigns/service.go#L277-L279

into something like this:

	if campaign.CampaignSpecID == campaignSpec.ID {
		err := s.store.EnqueueChangesets(ctx, EnqueueChangesetOpts{
			CampaignSpecID: campaignSpec.ID,
			ReconcilerState: campaigns.ReconcilerStateErrored,
		})
		return campaign, err
	}

(That method, EnqueueChangesets, doesn't exist yet, but would go well with the two-column-enqueueing-approach outlined in https://github.com/sourcegraph/sourcegraph/issues/12827)

Idea 2: Automatic retry by reconciler

The reconciler could simply pick up changesets where reconciler_state = 'errored' every 60s and retry them (a) a fixed number of times (b) forever until they're completed.

There's a few twists that we can make here:

  1. Retrying uses exponential backoff (it doesn't make sense to keep on hammering the code host if we're rate limited)
  2. Retrying is only done for non-ephemeral errors (this is sketched out here: https://github.com/sourcegraph/sourcegraph/issues/12518).

All of this would require changes to the underlying workerutil package.

I think we should approach this by going from the easiest to the most sophisticated approach:

  1. Retry every X seconds
  2. Retry with exponential backoff
  3. Retry ephemeral errors

The last one is hard, because it's hard to distinguish between ephemeral and non-ephemeral errors, especiall across service boundaries (i.e. we know that a call to GitHub with a 5xx response code is retryable, but what if gitserver responsd with an error? Is gitserver running into it because GitHub is down, or is it because the patch has an invalid format?).


I think both of these ideas should be tackled in the next iteration, especially since the first one goes hand in hand with https://github.com/sourcegraph/sourcegraph/issues/12827 and the easiest version of the second one should be ™️ relatively easy to implement.

mrnugget added a commit that referenced this issue Aug 11, 2020
@marekweb
Copy link
Contributor

Dear all,

This is your release captain speaking. 🚂🚂🚂

Branch cut for the 3.19 release is scheduled for tomorrow.

Is this issue / PR going to make it in time? Please change the milestone accordingly.
When in doubt, reach out!

Thank you

@mrnugget
Copy link
Contributor Author

Just an addition to my previous comment : we also need to make sure that the retrying of importing a changeset works. Right now, if it fails, you can apply a new campaign spec but it doesn't get re-tried, since imported changesets don't have a changeset spec.

mrnugget referenced this issue Aug 28, 2020
In campaigns we need the ability to retry jobs multiple times. (See
https://github.com/sourcegraph/sourcegraph/issues/12700#issuecomment-671798531
for additional context.)

This is what I think is the easiest-to-understand and simplest solution.

I did have another solution that involved a PreDequeue hook (that
returned the custom conditions you see here now) and boolean in the
StoreOptions to switch between AND'ing or OR'ing the custom conditions
to the selectCandidateQuery.

This felt a bit hacky. It was less code, but also easier to miss and
misudnerstand.

What do you think of this?
mrnugget referenced this issue Sep 1, 2020
* Add RetryAfter to dbworker.StoreOptions

In campaigns we need the ability to retry jobs multiple times. (See
https://github.com/sourcegraph/sourcegraph/issues/12700#issuecomment-671798531
for additional context.)

This is what I think is the easiest-to-understand and simplest solution.

I did have another solution that involved a PreDequeue hook (that
returned the custom conditions you see here now) and boolean in the
StoreOptions to switch between AND'ing or OR'ing the custom conditions
to the selectCandidateQuery.

This felt a bit hacky. It was less code, but also easier to miss and
misudnerstand.

What do you think of this?

* Add tests for RetryAfter in dbworker.Store
mrnugget referenced this issue Sep 1, 2020
* Add RetryAfter to dbworker.StoreOptions

In campaigns we need the ability to retry jobs multiple times. (See
https://github.com/sourcegraph/sourcegraph/issues/12700#issuecomment-671798531
for additional context.)

This is what I think is the easiest-to-understand and simplest solution.

I did have another solution that involved a PreDequeue hook (that
returned the custom conditions you see here now) and boolean in the
StoreOptions to switch between AND'ing or OR'ing the custom conditions
to the selectCandidateQuery.

This felt a bit hacky. It was less code, but also easier to miss and
misudnerstand.

What do you think of this?

* Add tests for RetryAfter in dbworker.Store

* Use RetryAfter in campaigns workers

* Order changesets by reconciler_state, then updated_at
@mrnugget
Copy link
Contributor Author

mrnugget commented Sep 2, 2020

Closing this because #13457 and #13478 add automatic interval-based retrying to the reconciler and that solves 90% of the problems that this issue aim to address.

What we can/will do later:

What doesn't make sense:

  • retrying when the same campaign spec has been applied. That doesn't work with the current src-cli and I'm not sure whether it's something that should work.

@mrnugget mrnugget closed this as completed Sep 2, 2020
mrnugget added a commit that referenced this issue Sep 2, 2020
This is an add-on to #12700 and makes sure that when we apply a new
campaign spec to an existing campaign all the failed changesets are
retried and properly resets (including NumResets and FailureMessage).
@mrnugget
Copy link
Contributor Author

mrnugget commented Sep 2, 2020

Right after writing the previous comment I realised that I missed something. Here it is: https://github.com/sourcegraph/sourcegraph/pull/13591/files

mrnugget added a commit that referenced this issue Sep 3, 2020
* Re-enqueue and reset failed changesets in ApplyCampaign

This is an add-on to #12700 and makes sure that when we apply a new
campaign spec to an existing campaign all the failed changesets are
retried and properly resets (including NumResets and FailureMessage).

* Introduce ResetQueued method on Changeset
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
batch-changes Issues related to Batch Changes estimate/4d planned/3.19
Projects
None yet
Development

No branches or pull requests

3 participants