
backupccl: avoid hanging forever on intents #51624

Merged: 1 commit merged into cockroachdb:master on Aug 9, 2020

Conversation

@dt dt (Member) commented Jul 20, 2020

BACKUP, like any read, cannot export a range if it contains an unresolved intent.
If we hit an intent during iteration, we typically return a WriteIntentError like
any scan and let the store resolve it and retry the request before replying to the
sender.

However, blocking on intent resolution means a single txn left open can cause backups
to hang, sometimes for hours or even days in some observed cases. While operators
typically want bulk, background jobs like BACKUP to have as little impact on their
foreground traffic as possible, even more importantly they want their BACKUPs to
actually complete to maintain their RPO SLA. So while you typically would not want
a BACKUP to immediately run at high priority and abort any transactions from OLTP
foreground traffic as soon as it starts, you do, eventually, want it to do whatever
it needs to do to finish.

Backdating BACKUPs with AS OF SYSTEM TIME '-30s' is a good best practice to reduce
the number of still-running transactions they may encounter, but an abandoned or very
long-running transaction can still hang such a backup. This change extends BACKUP to
initially ask individual ranges to back up but bail out if they encounter an
intent, rather than attempting to resolve it and holding up the whole BACKUP while
doing so. If a range replies that it bailed out, its span is put at the back of the queue
to come back to later, and BACKUP continues to work on other ranges, in the hope that
by the time it is done with them, the intent may already have been resolved.

When it comes back to a range that it has already attempted to back up the limit
number of times (default 3), it instead asks the range to back up without bailing out
on intents, falling back to the usual behavior of trying to resolve them. It additionally
sets the request priority to high, so that during that resolution other transactions
are aborted and the backup can finish.

Release note (enterprise change): BACKUP takes priority over other transactions if its initial attempts to export a range fail.
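
To make the flow above concrete, here is a minimal sketch of the described retry loop. The names (spanAttempt, exportFn, exportAllSpans, maxPlainAttempts) are illustrative only; this is not the PR's actual code.

```
import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// spanAttempt tracks how many times we have tried to export a span.
type spanAttempt struct {
	span     roachpb.Span
	attempts int
}

// maxPlainAttempts mirrors the "default 3" limit described above.
const maxPlainAttempts = 3

// exportFn stands in for sending an ExportRequest for one span; it reports
// whether the range bailed out on an intent instead of exporting.
type exportFn func(ctx context.Context, sp roachpb.Span, failOnIntent, highPriority bool) (hitIntent bool, err error)

func exportAllSpans(ctx context.Context, export exportFn, spans []roachpb.Span) error {
	queue := make([]spanAttempt, 0, len(spans))
	for _, sp := range spans {
		queue = append(queue, spanAttempt{span: sp})
	}
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		// Until the attempt limit is reached, ask the range to bail out on
		// intents rather than resolving them; after that, run at high
		// priority and let the usual resolution path push conflicting txns.
		failOnIntent := cur.attempts < maxPlainAttempts
		hitIntent, err := export(ctx, cur.span, failOnIntent, !failOnIntent)
		if err != nil {
			return err
		}
		if hitIntent {
			cur.attempts++
			queue = append(queue, cur) // come back to this span later
		}
	}
	return nil
}
```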

@dt dt requested review from pbardea, ajwerner and a team July 20, 2020 21:54
@cockroach-teamcity (Member)

This change is Reviewable

@dt dt (Member, Author) commented Jul 20, 2020

I'm still working on the test for this, but wanted to get it out for an early look in the meantime.

@miretskiy miretskiy (Contributor) left a comment

I like it very much. :lgtm_strong:

I can take another look when you get tests going (though, I assume, it shouldn't be that bad)

Reviewed 5 of 7 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @dt, and @pbardea)


pkg/ccl/backupccl/backup_processor.go, line 180 at r1 (raw file):

					header.UserPriority = roachpb.MaxUserPriority
				}
				log.Infof(ctx, "sending ExportRequest for span %s (attempt %d, priority %d)", span.span, span.attempts+1, header.UserPriority.String())

wrap very long line(s)?


pkg/ccl/backupccl/backup_processor.go, line 183 at r1 (raw file):

				if pErr != nil {
					if err := pErr.Detail.GetTransactionRetry(); err != nil && err.Reason == roachpb.RETRY_REASON_UNKNOWN {

Should this check be moved & encapsulated in some sort of utility function?
Perhaps move it to export.go -- where it is used?

@dt dt (Member, Author) left a comment

I was making the test much harder than it needed to be with multiple goroutines, but realized all we need is 4 things: begin, update, backup (with timeout), commit. Added a nice simple test.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @miretskiy, and @pbardea)


pkg/ccl/backupccl/backup_processor.go, line 180 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

wrap very long line(s)?

Done.


pkg/ccl/backupccl/backup_processor.go, line 183 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…
				if pErr != nil {
					if err := pErr.Detail.GetTransactionRetry(); err != nil && err.Reason == roachpb.RETRY_REASON_UNKNOWN {

Should this check be moved & encapsulated in some sort of utility function?
Perhaps move it to export.go -- where it is used?

eh, I realized it could just be simplified in place.
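
For reference, the kind of standalone helper being suggested might have looked roughly like this, built around the check quoted above. The name isBailOnIntentErr is hypothetical, and as noted the PR kept the check inline instead.

```
import "github.com/cockroachdb/cockroach/pkg/roachpb"

// isBailOnIntentErr reports whether pErr is the synthetic retry error an
// ExportRequest returns when it bails out on an intent instead of resolving
// it (the check quoted above, from an early revision of this PR).
func isBailOnIntentErr(pErr *roachpb.Error) bool {
	if pErr == nil {
		return false
	}
	retryErr := pErr.Detail.GetTransactionRetry()
	return retryErr != nil && retryErr.Reason == roachpb.RETRY_REASON_UNKNOWN
}
```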

@ajwerner ajwerner (Contributor) left a comment

:lgtm: with some nits. I'd like somebody from KV to chime in on the Reason.

Reviewed 1 of 7 files at r1, 3 of 4 files at r2.
Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @ajwerner, @dt, @miretskiy, and @pbardea)


pkg/ccl/backupccl/backup_processor.go, line 184 at r2 (raw file):

				rawRes, pErr := kv.SendWrappedWith(ctx, flowCtx.Cfg.DB.NonTransactionalSender(), header, req)
				if pErr != nil {
					if err := pErr.Detail.GetTransactionRetry(); err != nil {

I'd love to see a brief comment here pointing at the FailOnIntents field of the request and how it led to this retry error.


pkg/ccl/backupccl/backup_test.go, line 4367 at r2 (raw file):

	// observe that the backup aborted our txn.
	require.Error(t, tx.Commit())

nit: pick this error apart and validate that it has the right code inside.
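
A hedged sketch of that validation, assuming the abort surfaces to the SQL client as a lib/pq error with the serialization_failure SQLSTATE (40001); the exact error shape and the helper name requireTxnAborted are assumptions, not taken from the PR.

```
import (
	"errors"
	"testing"

	"github.com/lib/pq"
	"github.com/stretchr/testify/require"
)

// requireTxnAborted asserts that committing the conflicting transaction fails
// with SQLSTATE 40001, i.e. it was pushed aside by the high-priority backup.
func requireTxnAborted(t *testing.T, commitErr error) {
	t.Helper()
	require.Error(t, commitErr)
	var pqErr *pq.Error
	require.True(t, errors.As(commitErr, &pqErr))
	require.Equal(t, pq.ErrorCode("40001"), pqErr.Code)
}
```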


pkg/ccl/storageccl/export.go, line 160 at r2 (raw file):

		if err != nil {
			if err, ok := err.(*roachpb.WriteIntentError); ok && args.FailOnIntents {
				wrapped := roachpb.NewTransactionRetryError(roachpb.RETRY_REASON_UNKNOWN, err.Error())

I'd support creating a new retry reason though I'd like to have somebody on the KV team aware and supportive of that too and so I'm nominating @andreimatei to help decide if that's a good idea.

@miretskiy miretskiy (Contributor) left a comment

Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @dt and @pbardea)


pkg/ccl/backupccl/backup_test.go, line 4359 at r2 (raw file):

	// limit how long we'll try to backup so if it hangs we find out.
	timeoutCtx, fn := context.WithTimeout(ctx, time.Second*10)

perhaps a bit higher? particularly w/ stressrace it might take longer?
Or, just disable under stress/race?

@nvanbenschoten nvanbenschoten (Member) left a comment

I'd like somebody from KV to chime in on the Reason

I'd encourage us to avoid a TransactionRetryError here. We're not running in a transaction so it's entirely unexpected for a TransactionRetryError to be returned. I'm actually surprised this doesn't blow up on the client when passing through a non-transactional sender. TransactionRetryError has a specific meaning, it's not just a generic RetryError, so if we're going to go with this approach, introducing a new error type would be preferable.

That said, I still don't think I fully understand the approach here and what it's optimizing for. We discussed on slack the alternative of simply running all BACKUPS as high-priority, which avoids all of this complexity and is much easier to think about. Of course, this could be disruptive if users don't run BACKUP with AS OF SYSTEM TIME '-30s', which David mentioned was "a good best practice" but I assume is not something we actually expect most users to be doing (right?). But I don't actually think the approach we have here is guaranteed to be any less disruptive. We retry 3 times at normal priority before upgrading to high priority, but there's no lower bound on how long those retries will take or how long we'll wait between retries. So it's still possible given a small enough backup that we end up pushing transactions that are arbitrarily up-to-date even with all of this added complexity. Don't we want a hard guarantee that we don't disrupt transactions below some threshold of staleness?

I wonder if there's any way to salvage an "always backup with high priority" approach. If so, I think that's worth exploring. For instance, what if we just ... waited until the backup timestamp was more than 10 seconds stale before sending the first Export request? For users using the best practice of running BACKUP with AS OF SYSTEM TIME, this would be free. For users not doing this, this would place a 10-second penalty on the latency of a backup. Unless we're optimizing for backup speed on very small tables, which I don't think we are, this penalty is likely completely negligible. And even then, follow the best practices as a user and you'll be fine.
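
A minimal sketch of the "wait until the backup timestamp is stale enough, then always read at high priority" alternative being floated here; readAsOf, minStaleness, and waitUntilStale are illustrative names, not from the PR.

```
import (
	"context"
	"time"
)

// waitUntilStale blocks until the backup's read timestamp is at least
// minStaleness in the past, so high-priority reads only ever push
// transactions older than that threshold.
func waitUntilStale(ctx context.Context, readAsOf time.Time, minStaleness time.Duration) error {
	if wait := minStaleness - time.Since(readAsOf); wait > 0 {
		select {
		case <-time.After(wait):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return nil
}
```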

Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @dt and @pbardea)

@dt dt (Member, Author) commented Jul 21, 2020

I don't particularly like slapping a blanket latency penalty on every non-backdated backup: it unnecessarily penalizes backups on tables that weren't seeing any traffic, and it isn't even clear what the right amount would be -- 1s might be appropriate in some scenarios but 30s in others, depending on the workload we're trying to accommodate. I'd much rather only penalize a backup that actually hits an intent. So to do that, we'd still need a way to signal "I hit an intent" back to the sender.

This change attempts to use the fact that we usually have other work we can do in the meantime, instead of just sitting on our hands and doing nothing while we wait. As you point out, for very small backups we may get back to the range quickly and thus run at high priority quite soon, aborting foreground traffic -- I should have included a TODO documenting our plan there: we're thinking that on subsequent retries we'd wait until the back-date is something like priorAttempts * 10s -- so if we're backdated by 10s, the initial attempt goes out immediately, as does the first retry, but we'd then impose that 10s and then 20s latency penalty on retries 2 and then 3.

Another thing that motivated this approach, and I also should have left a TODO about it, is that we're trying to address a constant frustration with many bulk operations: "why is this taking so long?" is often hard to answer. Right now, if we're stuck waiting on an intent, there's no indication of that -- we just see the request and the overall BACKUP taking longer / not making progress. Informing the job that we hit an intent lets us send that information back to the coordinator, so we can then surface that status, e.g. something like "waiting on uncommitted transactions in 6 ranges". That in turn gives the operator a hint that maybe they need to back-date more, or maybe they need to look for some lost/stuck txns.

I don't particularly like just slapping priority high on all backups without an imposed backdate either, since impact on foreground traffic is one of the top concerns we hear every time we talk to production operators.

@pbardea pbardea (Contributor) left a comment

Reviewed 1 of 4 files at r2.
Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @dt)


pkg/ccl/backupccl/backup_processor.go, line 42 at r2 (raw file):

	"kv.bulk_io_read.backup_priority_attempt_limit",
	"number of reads BACKUP should attempt before using high priority",
	3,

Alternatively (or maybe in addition) to having a backoff for every retry, I wonder if it would be useful to have a knob where the user could control how much time should pass before we start using high-priority transactions.

This would work similarly to defaulting to AOST -$cluster_setting seconds, but would be enforced by default as a cluster setting. Also, instead of reading in the past it would read now, but wait the period of the cluster setting if it gets blocked. This may help control the tradeoff between "I don't want my backups affecting my foreground traffic" and "But if my backup is taking really long I can sacrifice some retries on already long transactions".

@dt dt force-pushed the backup-hang branch 3 times, most recently from b576652 to dc427c4 on July 23, 2020 03:26
@dt dt (Member, Author) commented Jul 23, 2020

Okay, switched TransactionRetryError to the existing UnhandledRetryableError which is what we already return to callers when wrapping other retryable errors. I went ahead and added the time-based priority escalation instead of the simple count, and the min delay between tries.
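
Roughly, the time-based escalation and per-attempt delay described here can be pictured like this. A simplified sketch with illustrative names; the actual thresholds come from the bulkio.backup.read_with_priority_after and bulkio.backup.read_retry_delay settings discussed below.

```
import "time"

// planAttempt decides, for the next try at a span, whether to read at high
// priority and how far past the read-as-of time to delay before sending.
// Both thresholds are measured from the backup's read-as-of time.
func planAttempt(
	readAsOf time.Time, priorAttempts int, priorityAfter, retryDelay time.Duration,
) (highPriority bool, notBefore time.Time) {
	// Escalate to high priority once the read timestamp is old enough that
	// transactions we might push have had a fair chance to finish.
	highPriority = time.Since(readAsOf) > priorityAfter
	// Each prior attempt pushes the next try further past the read-as-of time.
	notBefore = readAsOf.Add(time.Duration(priorAttempts) * retryDelay)
	return highPriority, notBefore
}
```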

@miretskiy miretskiy requested review from ajwerner and pbardea July 23, 2020 12:19
@miretskiy miretskiy (Contributor) left a comment

:lgtm:

Reviewed 7 of 13 files at r3, 4 of 8 files at r4.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner, @dt, @miretskiy, and @pbardea)


pkg/ccl/backupccl/backup_processor.go, line 187 at r4 (raw file):

				// check to see if it is time to switch to priority.
				if !priority && span.attempts > 0 {
					// Check if this is starting a new pass and we should delay first.

want to expand the comment to explain why it's okay to wait (everything else is lower priority anyway)?


pkg/roachpb/api.proto, line 1433 at r4 (raw file):

  int64 target_file_size = 10;

  // FailOnIntents causes the request to return a TransactionRetryError if it

s/TransactionRetryError/UnhandledRetryeError/?

@miretskiy miretskiy self-requested a review July 23, 2020 12:19
@dt dt requested a review from nvanbenschoten July 27, 2020 21:42
@dt dt force-pushed the backup-hang branch 2 times, most recently from 32cd26f to 19ead9f on July 27, 2020 21:54
@nvanbenschoten nvanbenschoten (Member) left a comment

The mechanics in the second commit now :lgtm:. Thanks for adding in the read_with_priority_after window.

The remaining question I have is whether we want to repurpose this error type.

Reviewed 2 of 13 files at r3, 8 of 8 files at r5, 8 of 8 files at r6.
Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (and 1 stale) (waiting on @dt)


pkg/ccl/backupccl/backup_processor.go, line 42 at r6 (raw file):

	)
	priorityAfter = settings.RegisterNonNegativeDurationSetting(
		"bulkio.backup.read_with_priority_after",

s/priority/high_priority/ here and in the sentence below.


pkg/ccl/backupccl/backup_processor.go, line 49 at r6 (raw file):

		"bulkio.backup.read_retry_delay",
		"amount of time since the read-as-of time, per-prior attempt, to wait before making another attempt",
		time.Second*5,

nit: 5*time.Second


pkg/ccl/backupccl/backup_processor.go, line 223 at r6 (raw file):

				if pErr != nil {
					if err := pErr.Detail.GetUnhandledRetry(); err != nil {
						span.lastTried = timeutil.Now()

Do we want to use the HLC clock here and talk in terms of HLC timestamps?


pkg/ccl/storageccl/export.go, line 160 at r6 (raw file):

		if err != nil {
			if err, ok := err.(*roachpb.WriteIntentError); ok && args.FailOnIntents {
				wrapped := &roachpb.UnhandledRetryableError{}

The fact that we're not wrapping an error is an indication that we don't want to repurpose this error type, but I'll still defer to @andreimatei on that.


pkg/roachpb/errors.proto, line 518 at r5 (raw file):

    LeaseRejectedError lease_rejected = 13;
    NodeUnavailableError node_unavailable = 14;
    UnhandledRetryableError unhandled_retry =  40;

@andreimatei do you mind taking a look at this? My intuition is that we don't want to repurpose this error type and certainly don't want to add it to this union because its purpose is to wrap these errors in this union when converted into a Go error (see Error.GoError). Adding it to the union forces us to ask the question of what an UnhandledRetryableError{Err: UnhandledRetryableError{...}} means. I think you have a clearer picture of what the vision of this error type was though.

@dt dt (Member, Author) left a comment

Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei and @dt)


pkg/ccl/storageccl/export.go, line 160 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

The fact that we're not wrapping an error is an indication that we don't want to repurpose this error type, but I'll still defer to @andreimatei on that.

we are wrapping an error, aren't we? I set PErr on the line below to the actual WriteIntentError.


pkg/roachpb/errors.proto, line 518 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

@andreimatei do you mind taking a look at this? My intuition is that we don't want to repurpose this error type and certainly don't want to add it to this union because its purpose is to wrap these errors in this union when converted into a Go error (see Error.GoError). Adding it to the union forces us to ask the question of what an UnhandledRetryableError{Err: UnhandledRetryableError{...}} means. I think you have a clearer picture of what the vision of this error type was though.

I'm also eager to learn more about the history and motivation for this error type, but my brief spelunking made me think it was appropriate for this use -- as I understood it from the comments, it is a type of error that callers should expect to see when they're returned a retryable error, wrapping some underlying error that explains why they're being asked to retry.

By my read, that is exactly what I want here -- my caller to see a retryable error, wrapping the underlying (WriteIntent) error. What I believe is different about my usage, and why I wanted to put it in the union, is that I wanted to give the server the choice to wrap it before even replying.

I don't expect distsender to wrap an error that is already UnhandledRetryableError in another layer of UnhandledRetryableError, so I don't think we're any more forced to consider what UnhandledRetryableError(UnhandledRetryableError(x)) means than we were before.

@dt dt requested a review from nvanbenschoten July 30, 2020 03:52
@dt dt requested a review from andreimatei July 30, 2020 03:52
@andreimatei andreimatei (Contributor) left a comment

Is it just intents that exports need to worry about, or is it also unreplicated locks? Unless there's something special about ExportRequests, those should block you just as much. If so, I think you want to expand the handling of the new flag and consider it when evaluation looks through the lock table (not sure where exactly, but I can dig or we can ask Nathan). And even intents, I think, can be discovered through the lock table, not only through scans, depending on whether the Export is the first guy to run into a specific intent, or whether other requests ran into it recently. I'm sorry if I'm off.

Another thought I have is that I think it's pretty brutal how the patch currently hides the WriteIntentError. Part of handling that error is putting the intent into the lock table. I think it'd be nice to structure the code such that that part still happens, even if the request then doesn't actually wait for the intent to be resolved. It's not a big deal if Nathan didn't feel the same, but I'm surprised he didn't. This dove-tails with my other comment about how maybe we don't want to wrap the WriteIntentError in anything.

Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @dt, and @nvanbenschoten)


pkg/ccl/backupccl/backup_processor.go, line 218 at r6 (raw file):

					req.FailOnIntents = true
				}
				log.Infof(ctx, "sending ExportRequest for span %s (attempt %d, priority %s)",

did you want to leave this at Info level?


pkg/roachpb/api.proto, line 1433 at r6 (raw file):

  int64 target_file_size = 10;

  // FailOnIntents causes the request to return a UnhandledRetryeError if it

UnhandledRetry*e*Error


pkg/roachpb/errors.proto, line 518 at r5 (raw file):

As I understood it from the comments, it is a type of error that callers should expect to see when they're returned a retryable error, wrapping some underlying error that explains why they're being asked to retry.

Not really, depending on what "callers" we're talking about. Generally code under the TxnCoordSender deals with specific retriable errors (e.g. TransactionRetryError), and code above the TxnCoordSender deals with TransactionRetryWithProtoRefreshError (which corresponds to a "handled" retriable error - the name cracks me up every time).

I don't think UnhandledRetryableError belongs here. UnhandledRetryableError is used by DistSQL leaf transactions to communicate with the DistSQL root transaction (it is produced on kvclient side); it's not meant for communication between kvserver and kvclient. There's also this assertion here that verifies that generally the user of a transaction doesn't get this error.

log.Fatalf(ctx, "unexpected UnhandledRetryableError at the txn.exec() level: %s", err)

I also don't think that UnhandledRetryableError should ever wrap a WriteIntentError; a write intent is not retriable in the same sense as the errors that implement transactionRestartError - those are the retriable ones.

If ExportRequest needs to return some new error type, then I'd much rather we introduce a brand new error type in this enum. But I think a better idea might be to just return WriteIntentError to the client. In fact, have you considered making fail_on_intents a field on the RequestHeader? It could apply in the same way to any request - it would inhibit the call to handleWriteIntentError() in replica.executeBatchWithConcurrencyRetries(). FWIW, if we were to do that, we'd change the character of WriteIntentError a bit - I think currently this error is never returned from the server to the client, and is only defined as a proto because we had the bad idea to standardize this Sender interface that returns pErr and use it both on the client side and on the server side. I'm suggesting we start returning WriteIntentError to the client - but that makes sense to me.

@nvanbenschoten nvanbenschoten (Member) commented

If so, I think you want to expand the handling of the new flag and consider it when evaluation looks through the lock table (not sure where exactly, but I can dig or we can ask Nathan). And even intents, I think, can be discovered through the lock table, not only through scans, depending whether the Export is the first guy to run into a specific intent, or if other requests ran into it recently. I'm sorry if I'm off.

Great point, I completely missed this, but @andreimatei's correct. The lock table needs to be made aware of this flag as well. As is, this is only a partial solution.

This is a pretty strong indication that we should generalize this concept. Specifically, I think it's time to consider adding a WaitPolicy option to the BatchRequest header that will start off with two options, Block and Error. Block would be the default and requests would behave as they do today. Error would return a new form of error when it encounters any locks that it would need to block on. The lock table would make this easy to implement, we'd need just a single check here.

WaitPolicy.Error would be used for this PR and it would also be used to implement the SELECT ... FOR UPDATE NOWAIT option that is starting to come up in #40476. We already support the syntax for this, but don't support it in the KV layer:

// LockingWaitPolicy represents the possible policies for dealing with rows
// being locked by FOR UPDATE/SHARE clauses (i.e., it represents the NOWAIT
// and SKIP LOCKED options).
type LockingWaitPolicy byte

I'm a little less eager to support SELECT ... FOR UPDATE SKIP LOCKED because it's a complete violation of serializability, but that's a discussion we can have at a later point.

Anyway, @dt this would require a bit of KV work but then we could remove all of the pkg/kv changes from this PR. I'd be happy to work on this, but won't be able to get to it until later next week. Does that work for your timeline?

@dt dt (Member, Author) commented Jul 30, 2020

Anyway, @dt this would require a bit of KV work but then we could remove all of the pkg/kv changes from this PR. I'd be happy to work on this, but won't be able to get to it until later next week. Does that work for your timeline?

@nvanbenschoten More than happy to wait (har har) to rebase this if you think it's feasible that it'll be in the KV API in time to get this in for 20.2. A generalized WaitPolicy sounds like exactly what we want, and is on the path, I believe, to some of our other desires as well (e.g. limit how long we wait trying to contact unavailable ranges or other places we currently see hangs). Thanks!

@dt dt (Member, Author) commented Jul 30, 2020

Oh, I meant to say: I might suggest WaitPolicy be a message type with this enum nested in a field within it, so we can add other fields alongside it later (like my desired dial timeout)?

@nvanbenschoten nvanbenschoten (Member) commented

@dt I just pushed #52388, which should provide you with a better interface to build this PR on top of. You should be able to set a WaitPolicy: lock.WaitPolicy_Error on the ExportRequest's Batch.Header and catch WriteIntentErrors without any of the associated KV changes in this PR.

Oh, I meant to say: I might suggest WaitPolicy be a message type with this enum nested in a field within it, so we can add other fields alongside it later (like my desired dial timeout)?

I didn't end up going down this route. The nested enum approach is tricky to keep a) free in the common case and b) backwards compatible when expressed as protobufs. I'm also not convinced that it's needed. The idea of a dial timeout is well out of scope for this work and I don't think it would be a good idea to conflate it with what we're talking about here. The lock.WaitPolicy is very specifically talking about behavior under contention and when encountering conflicting locks. It's analogous to the "locking clause options" in Postgres: https://www.postgresql.org/docs/current/sql-select.html.
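
A hedged sketch of what the backup-side usage might look like on top of #52388, assuming the batch header field is named WaitPolicy and the enum lives in the concurrency lock package; the exportSpan helper is illustrative, not the PR's final code.

```
import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/kv"
	"github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency/lock"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// exportSpan sends one ExportRequest. At normal priority it asks KV to error
// out on conflicting locks instead of blocking; a WriteIntentError then means
// "requeue this span and come back later".
func exportSpan(
	ctx context.Context, db *kv.DB, req *roachpb.ExportRequest, highPriority bool,
) (retryLater bool, _ error) {
	var header roachpb.Header
	if highPriority {
		header.UserPriority = roachpb.MaxUserPriority
	} else {
		header.WaitPolicy = lock.WaitPolicy_Error
	}
	_, pErr := kv.SendWrappedWith(ctx, db.NonTransactionalSender(), header, req)
	if pErr != nil {
		if _, ok := pErr.GetDetail().(*roachpb.WriteIntentError); ok && !highPriority {
			return true, nil // conflicting lock: try this span again later
		}
		return false, pErr.GoError()
	}
	return false, nil
}
```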

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Aug 5, 2020
Informs cockroachdb#40476.
Informs cockroachdb#51624.

This commit introduces the concept of request "wait policies". A
WaitPolicy specifies the behavior of a request when it encounters
conflicting locks held by other active transactions. The default
behavior is to block until the conflicting lock is released, but other
policies can make sense in special situations.

Within this new formalization, the commit creates two initial wait
policy variants:
```
// Block indicates that if a request encounters a conflicting lock held by
// another active transaction, it should wait for the conflicting lock to be
// released before proceeding.
Block = 0;

// Error indicates that if a request encounters a conflicting lock held by
// another active transaction, it should raise an error instead of blocking.
Error = 1;
```

`Block` is equivalent to the current behavior, so there's no further
action needed for that variant.

However, the `Error` policy is new, and this commit teaches the
`lockTableWaiter` about it. It ensures that when a request with the
`Error` wait policy encounters a conflicting lock, it uses a PUSH_TOUCH
PushRequest to determine whether the lock is abandoned or whether its
holder is still active. If the holder is abandoned, the request cleans
up the lock and proceeds. If the holder is still active, a
WriteIntentError is returned to the client.

This will unblock both of the referenced issues.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Aug 5, 2020
@dt dt (Member, Author) commented Aug 5, 2020

Thanks for jumping on this @nvanbenschoten !

I've rebased this PR on yours and am happy to see it working exactly as expected. I'll rebase again after yours merges but otherwise I think this is all set for a final review from bulk-folks just looking at the second commit.

@nvanbenschoten just to clarify my API suggestion: I certainly agree anything beyond error-vs-block is out of scope for now -- I was just thinking that if we add more details later, like a dial timeout or an alternative block-up-to-limit versus just block-or-error, then it might be nice to have a message type wrapper to add them to. But I'm also completely happy with this as-is -- it was just a suggestion.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Aug 6, 2020
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Aug 7, 2020
craig bot pushed a commit that referenced this pull request Aug 7, 2020
52388: kv: add WaitPolicy option to BatchRequest r=nvanbenschoten a=nvanbenschoten

Co-authored-by: Nathan VanBenschoten <[email protected]>
@nvanbenschoten nvanbenschoten (Member) commented

I'll rebase again after yours merges but otherwise I think this is all set for a final review from bulk-folks just looking at the second commit.

Just merged, so you should be all set!

BACKUP, like any read, cannot export a range if it contains an unresolved intent.
If we hit an intent during iteration, we typically return a WriteIntentError like
any scan and let the store resolve it and retry the request before replying to the
sender.

However, blocking on intent resolution means a single txn left open can cause backups
to hang, sometimes for hours or even days in some observed cases. While operators
typically want bulk, background jobs like BACKUP to have as little impact on their
foreground traffic as possible, even more importantly they want their BACKUPs to
actually complete to maintain their RPO SLA. So while you typically would not want
a BACKUP to immediately run at high priority and abort any transactions from OLTP
foreground traffic as soon as it starts, you do, eventually, want it to do whatever
it needs to do to finish.

Backdating BACKUPs with AS OF SYSTEM TIME '-30s' is a good best practice to reduce
the number of still-running transactions they may encounter, but an abandoned or very
long-running transaction can still hang such a backup. This change extends BACKUP to
initially ask individual ranges to back up but bail out if they encounter another
transaction's intent, rather than attempting to resolve it and holding up the whole
BACKUP while doing so. If a range replies that it bailed out, its span is put at the
back of the queue to come back to later, and BACKUP continues to work on other ranges
in the hope that by the time it is done with them, the intent may already have been resolved.

When it comes back to a range that it has already attempted to back up, and the time
since the backup's read time is greater than a configurable limit, it instead sets the
priority to high, meaning any transaction encountered will be pushed so the backup can complete.

Release note (enterprise change): BACKUP takes priority over other transactions if its initial attempts to export a range fail.
@dt dt (Member, Author) commented Aug 9, 2020

Thanks @nvanbenschoten, working like a charm!

And I realized the bulk-reviewers already LGTM'ed so I guess this is now all good to go.

bors r+

@craig craig bot (Contributor) commented Aug 9, 2020

Build succeeded:

@craig craig bot merged commit fdf54eb into cockroachdb:master Aug 9, 2020
@dt dt deleted the backup-hang branch August 11, 2020 02:02
vxio pushed a commit to vxio/cockroach that referenced this pull request Aug 11, 2020