Allow retrying of errored changesets #12700

mrnugget · 2020-08-04T09:52:48Z

I'm not sure whether we handle retrying correctly. We need to check that retrying publishing works if gitserver fails, GitHub is down, etc. And we need to check that the update works.

We should also probably allow retrying by simply re-applying the same campaign spec.

Discovered this while investigating #12700.

mrnugget · 2020-08-11T08:11:19Z

@sourcegraph/campaigns after looking into this, I think https://github.com/sourcegraph/sourcegraph/pull/12905 is as much as we can do for now.

My goal for this ticket was to make sure that users are never stuck with a failed changesets and #12905 solves that.

So:

The current state of retrying failed changesets

Users can retry the publishing/updating of changesets by applying a new CampaignSpec with new ChangesetSpecs. That means the src-cli could simply create new changesetSpecs/campaignSpecs and apply them every time the user runs src campaign apply. Users can then just re-run the command to retry.

Ideas on the future of retrying failed changesets

Idea 1: Retry on re-apply

If we end up introducing caching in src-cli (which we should at some point, because creating new changesetSpecs/campaignSpecs with the same diff is wasteful) we could say that the applyCampaign mutation receives a new side-effect: if you re-apply the same campaign spec to the same campaign, we re-reenqueue all failed changesets. That's easy enough to do by changing this

https://github.com/sourcegraph/sourcegraph/blob/c6ecbcc6a7c6435ee8c30ea0e5f2f67d2e143849/enterprise/internal/campaigns/service.go#L277-L279

into something like this:

	if campaign.CampaignSpecID == campaignSpec.ID {
		err := s.store.EnqueueChangesets(ctx, EnqueueChangesetOpts{
			CampaignSpecID: campaignSpec.ID,
			ReconcilerState: campaigns.ReconcilerStateErrored,
		})
		return campaign, err
	}

(That method, EnqueueChangesets, doesn't exist yet, but would go well with the two-column-enqueueing-approach outlined in https://github.com/sourcegraph/sourcegraph/issues/12827)

Idea 2: Automatic retry by reconciler

The reconciler could simply pick up changesets where reconciler_state = 'errored' every 60s and retry them (a) a fixed number of times (b) forever until they're completed.

There's a few twists that we can make here:

Retrying uses exponential backoff (it doesn't make sense to keep on hammering the code host if we're rate limited)
Retrying is only done for non-ephemeral errors (this is sketched out here: https://github.com/sourcegraph/sourcegraph/issues/12518).

All of this would require changes to the underlying workerutil package.

I think we should approach this by going from the easiest to the most sophisticated approach:

Retry every X seconds
Retry with exponential backoff
Retry ephemeral errors

The last one is hard, because it's hard to distinguish between ephemeral and non-ephemeral errors, especiall across service boundaries (i.e. we know that a call to GitHub with a 5xx response code is retryable, but what if gitserver responsd with an error? Is gitserver running into it because GitHub is down, or is it because the patch has an invalid format?).

I think both of these ideas should be tackled in the next iteration, especially since the first one goes hand in hand with https://github.com/sourcegraph/sourcegraph/issues/12827 and the easiest version of the second one should be ™️ relatively easy to implement.

) Discovered this while investigating #12700.

marekweb · 2020-08-13T13:36:03Z

Dear all,

This is your release captain speaking. 🚂🚂🚂

Branch cut for the 3.19 release is scheduled for tomorrow.

Is this issue / PR going to make it in time? Please change the milestone accordingly.
When in doubt, reach out!

Thank you

mrnugget · 2020-08-26T11:46:45Z

Just an addition to my previous comment : we also need to make sure that the retrying of importing a changeset works. Right now, if it fails, you can apply a new campaign spec but it doesn't get re-tried, since imported changesets don't have a changeset spec.

In campaigns we need the ability to retry jobs multiple times. (See https://github.com/sourcegraph/sourcegraph/issues/12700#issuecomment-671798531 for additional context.) This is what I think is the easiest-to-understand and simplest solution. I did have another solution that involved a PreDequeue hook (that returned the custom conditions you see here now) and boolean in the StoreOptions to switch between AND'ing or OR'ing the custom conditions to the selectCandidateQuery. This felt a bit hacky. It was less code, but also easier to miss and misudnerstand. What do you think of this?

* Add RetryAfter to dbworker.StoreOptions In campaigns we need the ability to retry jobs multiple times. (See https://github.com/sourcegraph/sourcegraph/issues/12700#issuecomment-671798531 for additional context.) This is what I think is the easiest-to-understand and simplest solution. I did have another solution that involved a PreDequeue hook (that returned the custom conditions you see here now) and boolean in the StoreOptions to switch between AND'ing or OR'ing the custom conditions to the selectCandidateQuery. This felt a bit hacky. It was less code, but also easier to miss and misudnerstand. What do you think of this? * Add tests for RetryAfter in dbworker.Store

* Add RetryAfter to dbworker.StoreOptions In campaigns we need the ability to retry jobs multiple times. (See https://github.com/sourcegraph/sourcegraph/issues/12700#issuecomment-671798531 for additional context.) This is what I think is the easiest-to-understand and simplest solution. I did have another solution that involved a PreDequeue hook (that returned the custom conditions you see here now) and boolean in the StoreOptions to switch between AND'ing or OR'ing the custom conditions to the selectCandidateQuery. This felt a bit hacky. It was less code, but also easier to miss and misudnerstand. What do you think of this? * Add tests for RetryAfter in dbworker.Store * Use RetryAfter in campaigns workers * Order changesets by reconciler_state, then updated_at

mrnugget · 2020-09-02T07:47:59Z

Closing this because #13457 and #13478 add automatic interval-based retrying to the reconciler and that solves 90% of the problems that this issue aim to address.

What we can/will do later:

exponential backoff of retrying (see dbworker: Add exponential backoff to retry mechanism #13548)
add a message when retries have been exhausted (see dbworker: Log when all retries/resets have been exhausted #13549)
retry only on ephemeral errors (that takes a bit more work and I'm not sure it's worth it right now)

What doesn't make sense:

retrying when the same campaign spec has been applied. That doesn't work with the current src-cli and I'm not sure whether it's something that should work.

This is an add-on to #12700 and makes sure that when we apply a new campaign spec to an existing campaign all the failed changesets are retried and properly resets (including NumResets and FailureMessage).

mrnugget · 2020-09-02T08:17:42Z

Right after writing the previous comment I realised that I missed something. Here it is: https://github.com/sourcegraph/sourcegraph/pull/13591/files

* Re-enqueue and reset failed changesets in ApplyCampaign This is an add-on to #12700 and makes sure that when we apply a new campaign spec to an existing campaign all the failed changesets are retried and properly resets (including NumResets and FailureMessage). * Introduce ResetQueued method on Changeset

mrnugget added campaigns batch-changes Issues related to Batch Changes labels Aug 4, 2020

mrnugget added this to the 3.19 milestone Aug 4, 2020

mrnugget self-assigned this Aug 4, 2020

This was referenced Aug 4, 2020

Campaigns 3.19 Tracking issue #12328

Closed

Backend: New campaign flow #10985

Closed

mrnugget changed the title ~~Check that retrying of changeset create/update works correctly in reconciler~~ Allow retrying of errored changesets Aug 6, 2020

mrnugget added a commit that referenced this issue Aug 11, 2020

Ensure that retrying published changesets resets failure message

c6ecbcc

Discovered this while investigating #12700.

mrnugget mentioned this issue Aug 11, 2020

Ensure that retrying published changesets resets failure message #12905

Merged

mrnugget added a commit that referenced this issue Aug 11, 2020

Ensure that retrying published changesets resets failure message (#12905

64f4d38

) Discovered this while investigating #12700.

chrispine modified the milestones: 3.19, 3.20 Aug 13, 2020

chrispine added the planned/3.19 label Aug 13, 2020

chrispine mentioned this issue Aug 17, 2020

Campaigns 3.20 Tracking issue #13068

Closed

52 tasks

mrnugget added the estimate/4d label Aug 19, 2020

chrispine removed the campaigns label Aug 27, 2020

This was referenced Aug 28, 2020

RFC: Add RetryAfter to dbworker.StoreOptions #13457

Merged

Use RetryAfter in campaigns workers #13478

Merged

mrnugget closed this as completed Sep 2, 2020

mrnugget mentioned this issue Sep 2, 2020

Re-enqueue and reset failed changesets in ApplyCampaign #13591

Merged

mrnugget mentioned this issue Sep 14, 2020

Validate reconciler works correctly #13648

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow retrying of errored changesets #12700

Allow retrying of errored changesets #12700

mrnugget commented Aug 4, 2020

mrnugget commented Aug 11, 2020

marekweb commented Aug 13, 2020

mrnugget commented Aug 26, 2020

mrnugget commented Sep 2, 2020

mrnugget commented Sep 2, 2020

Allow retrying of errored changesets #12700

Allow retrying of errored changesets #12700

Comments

mrnugget commented Aug 4, 2020

mrnugget commented Aug 11, 2020

The current state of retrying failed changesets

Ideas on the future of retrying failed changesets

Idea 1: Retry on re-apply

Idea 2: Automatic retry by reconciler

marekweb commented Aug 13, 2020

mrnugget commented Aug 26, 2020

mrnugget commented Sep 2, 2020

mrnugget commented Sep 2, 2020